Strong NP-Hardness for Sparse Optimization with Concave Penalty Functions


Authors: Yichen Chen, Dongdong Ge, Mengdi Wang, Zizhuo Wang, Yinyu Ye, Hao Yin

Yichen Chen^1, Dongdong Ge^2, Mengdi Wang^1, Zizhuo Wang^3, Yinyu Ye^4, Hao Yin^4

^1 Princeton University, NJ, USA; ^2 Shanghai University of Finance and Economics, Shanghai, China; ^3 University of Minnesota, MN, USA; ^4 Stanford University, CA, USA. Correspondence to: Mengdi Wang <mengdiw@princeton.edu>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

Consider the regularized sparse minimization problem, which involves empirical sums of loss functions for $n$ data points (each of dimension $d$) and a nonconvex sparsity penalty. We prove that finding an $\mathcal{O}(n^{c_1} d^{c_2})$-optimal solution to the regularized sparse optimization problem is strongly NP-hard for any $c_1, c_2 \in [0, 1)$ such that $c_1 + c_2 < 1$. The result applies to a broad class of loss functions and sparse penalty functions. It suggests that one cannot even approximately solve the sparse optimization problem in polynomial time, unless P = NP.

Keywords: Nonconvex optimization · Computational complexity · NP-hardness · Concave penalty · Sparsity

1 Introduction

We study the sparse minimization problem, where the objective is the sum of empirical losses over input data and a sparse penalty function. Such problems commonly arise from empirical risk minimization and variable selection. The role of the penalty function is to induce sparsity in the optimal solution, i.e., to minimize the empirical loss using as few nonzero coefficients as possible.

Problem 1. Given the loss function $\ell: \mathbb{R} \times \mathbb{R} \mapsto \mathbb{R}_+$, penalty function $p: \mathbb{R} \mapsto \mathbb{R}_+$, and regularization parameter $\lambda > 0$, consider the problem
$$\min_{x \in \mathbb{R}^d} \sum_{i=1}^{n} \ell\left(a_i^T x, b_i\right) + \lambda \sum_{j=1}^{d} p(|x_j|),$$
where $A = (a_1, \dots, a_n)^T \in \mathbb{R}^{n \times d}$ and $b = (b_1, \dots, b_n)^T \in \mathbb{R}^n$ are input data.
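A minimal numerical sketch of the objective in Problem 1 may help fix notation. The particular choices below are illustrative assumptions, not the paper's canonical instance: the squared loss $\ell(y, b) = (y - b)^2$ and the clipped $L_1$ penalty $p_\gamma(t) = \gamma \min(t, \gamma)$, both of which appear among the paper's examples, together with synthetic data.

```python
import numpy as np

def clipped_l1(t, gamma=1.0):
    # Clipped L1 penalty p_gamma(t) = gamma * min(t, gamma): concave,
    # non-decreasing, and not linear near 0 (Assumption 1 in the paper).
    return gamma * np.minimum(t, gamma)

def objective(x, A, b, lam, gamma=1.0):
    # Problem 1: sum_i ell(a_i^T x, b_i) + lam * sum_j p(|x_j|),
    # instantiated here with the squared loss ell(y, b) = (y - b)^2.
    empirical_loss = np.sum((A @ x - b) ** 2)
    penalty = np.sum(clipped_l1(np.abs(x), gamma))
    return empirical_loss + lam * penalty

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))               # n = 20 data points, d = 5 variables
x_true = np.array([1.5, 0.0, -2.0, 0.0, 0.0])  # a 2-sparse coefficient vector
b = A @ x_true                                 # noiseless responses
print(objective(x_true, A, b, lam=0.1))        # zero loss + penalty on the 2 nonzeros -> 0.2
```

At the noiseless truth the empirical loss vanishes, so only the penalty term contributes; the hardness results concern minimizing this objective over all of $\mathbb{R}^d$.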
We are interested in the computational complexity of Problem 1 under general conditions on the loss function $\ell$ and the sparse penalty $p$. In particular, we focus on the case where $\ell$ is a convex loss function and $p$ is a concave penalty with a unique minimizer at 0. Optimization problems with convex $\ell$ and concave $p$ are common in sparse regression, compressive sensing, and sparse approximation. A list of applicable examples of $\ell$ and $p$ is given in Section 3.

For certain special cases of Problem 1, it has been shown that finding an exact solution is strongly NP-hard (Huo & Chen, 2010; Chen et al., 2014). However, these results have not excluded the possibility of polynomial-time algorithms with small approximation error. Chen & Wang (2016) established the hardness of approximately solving Problem 1 when $p$ is the $L_0$ norm.

In this paper, we prove that it is strongly NP-hard to approximately solve Problem 1 within a certain optimality error. More precisely, we show that there exists a lower bound on the suboptimality error of any polynomial-time deterministic algorithm. Our results apply to a variety of optimization problems in estimation and machine learning; examples include sparse classification, sparse logistic regression, and many more. Strong NP-hardness of approximation is one of the strongest forms of complexity result for continuous optimization. To the best of our knowledge, this paper gives the first and strongest set of hardness results for Problem 1 under very general assumptions regarding the loss and penalty functions.

Our main contributions are threefold.

1. We prove strong NP-hardness for Problem 1 with general loss functions.
This is the first result that applies to a broad class of problems including, but not limited to: least squares regression, linear models with Laplacian noise, robust regression, Poisson regression, logistic regression, inverse Gaussian models, etc.

2. We present a general condition on the sparse penalty function $p$ such that Problem 1 is strongly NP-hard. The condition is a slightly weaker version of strict concavity. It is satisfied by typical penalty functions such as the $L_q$ norm ($q \in [0, 1)$), the clipped $L_1$ norm, SCAD, etc. To the best of our knowledge, this is the most general condition on the penalty function in the literature.

3. We prove that finding an $\mathcal{O}(\lambda n^{c_1} d^{c_2})$-optimal solution to Problem 1 is strongly NP-hard, for any $c_1, c_2 \in [0, 1)$ such that $c_1 + c_2 < 1$. Here the $\mathcal{O}(\cdot)$ hides parameters that depend on the penalty function $p$, to be specified later. This illustrates a gap between the optimization error achieved by any tractable algorithm and the desired statistical precision.

Our proof provides a first unified analysis that deals with a broad class of problems taking the form of Problem 1.

Section 2 summarizes related literature from optimization, machine learning, and statistics. Section 3 presents the key assumptions and illustrates examples of loss and penalty functions that satisfy the assumptions. Section 4 gives the main results. Section 5 discusses the implications of our hardness results. Section 6 provides a proof of the main results in a simplified setting. The full proofs are deferred to the appendix.

2 Background and Related Works

Sparse optimization is a powerful machine learning tool for extracting useful information from massive data.
In Problem 1, the sparse penalty serves to select the most relevant variables from a large number of variables, in order to avoid overfitting. In recent years, nonconvex choices of $p$ have received much attention; see (Frank & Friedman, 1993; Fan & Li, 2001; Chartrand, 2007; Candes et al., 2008; Fan & Lv, 2010; Xue et al., 2012; Loh & Wainwright, 2013; Wang et al., 2014; Fan et al., 2015).

Within the optimization and mathematical programming community, the complexity of Problem 1 has been considered in a number of special cases. Huo & Chen (2010) first proved a hardness result for a relaxed family of penalty functions with the $L_2$ loss. They showed that for the $L_0$, hard-thresholded (Antoniadis & Fan, 2001), and SCAD (Fan & Li, 2001) penalties, the above optimization problem is NP-hard. Chen et al. (2014) showed that $L_2$-$L_p$ minimization is strongly NP-hard when $p \in (0, 1)$. At the same time, Bian & Chen (2014) proved strong NP-hardness for another class of penalty functions. These existing analyses mainly focused on finding an exact global optimum to Problem 1. For this purpose, they implicitly assumed that all the input and parameters involved in the reduction are rational numbers with a finite numerical representation; otherwise, finding a global optimum to a continuous problem would always be intractable. A recent technical report (Chen & Wang, 2016) proves the hardness of obtaining an $\epsilon$-optimal solution when $p$ is the $L_0$ norm.

Within the theoretical computer science community, there have been several early works on the complexity of sparse recovery, beginning with (Arora et al., 1993). Amaldi & Kann (1998) proved that the problem $\min\{\|x\|_0 \mid Ax = b\}$ is not approximable within a factor $2^{\log^{1-\epsilon} d}$ for any $\epsilon > 0$.
Natarajan (1995) showed that, given $\epsilon > 0$, $A$, and $b$, the problem $\min\{\|x\|_0 \mid \|Ax - b\|_2 \le \epsilon\}$ is NP-hard. Davis et al. (1997) proved a similar result: for some given $\epsilon > 0$ and $M > 0$, it is NP-complete to find a solution $x$ such that $\|x\|_0 \le M$ and $\|Ax - b\| \le \epsilon$. More recently, Foster et al. (2015) studied sparse recovery and sparse linear regression with subgaussian noise. Assuming that the true solution is $K$-sparse, they showed that no polynomial-time (randomized) algorithm can find a $K \cdot 2^{\log^{1-\delta} d}$-sparse solution $x$ with $\|Ax - b\|_2^2 \le d^{C_1} n^{1-C_2}$ with high probability, where $\delta, C_1, C_2$ are arbitrary positive scalars. Another work (Zhang et al., 2014) showed that under the Gaussian linear model, there exists a gap between the mean squared loss that can be achieved by polynomial-time algorithms and the statistically optimal mean squared error. These two works focus on estimation of linear models and impose distributional assumptions on the input data; such results on estimation are different in nature from our results on optimization.

In contrast, we focus on the optimization problem itself. Our results apply to a variety of loss functions and penalty functions, not limited to linear regression. Moreover, we do not make any distributional assumption regarding the input data.

There remain several open questions. First, existing results mainly considered least squares problems or $L_q$ minimization problems. Second, existing results focused mainly on the $L_0$ penalty function. The complexity of Problem 1 with a general loss function and penalty function is yet to be established. Things get complicated when $p$ is a continuous function instead of the discrete $L_0$ norm function. The complexity of finding an $\epsilon$-optimal solution with general $\ell$ and $p$ is not fully understood. We address these questions in this paper.
3 Assumptions

In this section, we state the two critical assumptions that lead to the strong NP-hardness results: one for the penalty function $p$, the other for the loss function $\ell$. We argue that these assumptions are essential and very general. They apply to a broad class of loss functions and penalty functions that are commonly used.

3.1 Assumption About Sparse Penalty

Throughout this paper, we make the following assumption regarding the sparse penalty function $p(\cdot)$.

Assumption 1. The penalty function $p(\cdot)$ satisfies the following conditions:

(i) (Monotonicity) $p(\cdot)$ is non-decreasing on $[0, +\infty)$.

(ii) (Concavity) There exists $\tau > 0$ such that $p(\cdot)$ is concave but not linear on $[0, \tau]$.

In words, condition (ii) means that the concave penalty $p(\cdot)$ is nonlinear. Assumption 1 is the most general condition on penalty functions in the existing literature on sparse optimization. Below we present a few such examples.

1. In variable selection problems, the $L_0$ penalization $p(t) = I_{\{t \ne 0\}}$ arises naturally as a penalty for the number of factors selected.

2. A natural generalization of the $L_0$ penalization is the $L_p$ penalization $p(t) = t^p$ with $0 < p < 1$. The corresponding minimization problem is called the bridge regression problem (Frank & Friedman, 1993).

3. To obtain a hard-thresholding estimator, Antoniadis & Fan (2001) use the penalty functions $p_\gamma(t) = \gamma^2 - ((\gamma - t)_+)^2$ with $\gamma > 0$, where $(x)_+ := \max\{x, 0\}$ denotes the positive part of $x$.

4. Any penalty function that belongs to the folded concave penalty family (Fan et al., 2014) satisfies the conditions in Theorem 1.
Examples include the SCAD (Fan & Li, 2001) and the MCP (Zhang, 2010a), whose derivatives on $(0, +\infty)$ are
$$p'_\gamma(t) = \gamma I_{\{t \le \gamma\}} + \frac{(a\gamma - t)_+}{a - 1} I_{\{t > \gamma\}} \quad \text{and} \quad p'_\gamma(t) = \left(\gamma - \frac{t}{b}\right)_+,$$
respectively, where $\gamma > 0$, $a > 2$, and $b > 1$.

5. The conditions in Theorem 1 are also satisfied by the clipped $L_1$ penalty function (Antoniadis & Fan, 2001; Zhang, 2010b) $p_\gamma(t) = \gamma \cdot \min(t, \gamma)$ with $\gamma > 0$. This is a special case of the piecewise linear penalty function
$$p(t) = \begin{cases} k_1 t & \text{if } 0 \le t \le a, \\ k_2 t + (k_1 - k_2) a & \text{if } t > a, \end{cases}$$
where $0 \le k_2 < k_1$ and $a > 0$.

6. Another family of penalty functions, which bridges the $L_0$ and $L_1$ penalties, is the fraction penalty functions $p_\gamma(t) = \frac{(\gamma + 1) t}{\gamma + t}$ with $\gamma > 0$ (Lv & Fan, 2009).

7. The family of log-penalty functions $p_\gamma(t) = \frac{1}{\log(1 + \gamma)} \log(1 + \gamma t)$ with $\gamma > 0$ also bridges the $L_0$ and $L_1$ penalties (Candes et al., 2008).

3.2 Assumption About Loss Function

We state our assumption about the loss function $\ell$.

Assumption 2. Let $M$ be an arbitrary constant. For any interval $[\tau_1, \tau_2]$ where $0 < \tau_1 < \tau_2 < M$, there exist $k \in \mathbb{Z}_+$ and $b \in \mathbb{Q}^k$ such that $h(y) = \sum_{i=1}^{k} \ell(y, b_i)$ has the following properties:

(i) $h(y)$ is convex and Lipschitz continuous on $[\tau_1, \tau_2]$.

(ii) $h(y)$ has a unique minimizer $y^*$ in $(\tau_1, \tau_2)$.

(iii) There exist $N \in \mathbb{Z}_+$, $\bar\delta \in \mathbb{Q}_+$, and $C \in \mathbb{Q}_+$ such that when $\delta \in (0, \bar\delta)$, we have
$$\frac{h(y^* \pm \delta) - h(y^*)}{\delta^N} \ge C.$$

(iv) $h(y^*)$ and $\{b_i\}_{i=1}^{k}$ can be represented in $\mathcal{O}\!\left(\log \frac{1}{\tau_2 - \tau_1}\right)$ bits.

Assumption 2 is a critical, but very general, assumption regarding the loss function $\ell(y, b)$. Condition (i) requires convexity and Lipschitz continuity within a neighborhood. Conditions (ii), (iii) essentially require that, given an interval $[\tau_1, \tau_2]$, one can artificially pick $b_1, \dots, b_k$ to construct a function $h(y) = \sum_{i=1}^{k} \ell(y, b_i)$ such that $h$ has its unique minimizer in $[\tau_1, \tau_2]$ and has enough curvature near the minimizer. This property ensures that a bound on the minimal value of $h(y)$ can be translated into a meaningful bound on the minimizer $y^*$. Conditions (i), (ii), (iii) are typical properties that a loss function usually satisfies. Condition (iv) is a technical condition used to avoid dealing with infinitely long irrational numbers. It can easily be verified for almost all common loss functions.

We will show that Assumption 2 is satisfied by a variety of loss functions. An (incomplete) list is given below.

1. In least squares regression, the loss function has the form $\sum_{i=1}^{n} (a_i^T x - b_i)^2$. Using our notation, the corresponding loss function is $\ell(y, b) = (y - b)^2$. For all $\tau_1, \tau_2$, we choose an arbitrary $b' \in [\tau_1, \tau_2]$. We can verify that $h(y) = \ell(y, b')$ satisfies all the conditions in Assumption 2.

2. In the linear model with Laplacian noise, the negative log-likelihood function is $\sum_{i=1}^{n} |a_i^T x - b_i|$, so the loss function is $\ell(y, b) = |y - b|$. As in the case of least squares regression, this loss function satisfies Assumption 2. A similar argument also holds when we consider the $L_q$ loss $|\cdot|^q$ with $q \ge 1$.

3. In robust regression, we consider the Huber loss (Huber, 1964), which is a mixture of the $L_1$ and $L_2$ norms. The loss function takes the form
$$L_\delta(y, b) = \begin{cases} \frac{1}{2} |y - b|^2 & \text{for } |y - b| \le \delta, \\ \delta \left(|y - b| - \frac{1}{2}\delta\right) & \text{otherwise}, \end{cases}$$
for some $\delta > 0$, where $y = a^T x$. We then verify that Assumption 2 is satisfied: for any interval $[\tau_1, \tau_2]$, we pick an arbitrary $b \in [\tau_1, \tau_2]$ and let $h(y) = \ell(y, b)$. We can see that $h(y)$ satisfies all the conditions in Assumption 2.

4. In Poisson regression (Cameron & Trivedi, 2013), the negative log-likelihood minimization is
$$\min_{x \in \mathbb{R}^d} -\log L(x; A, b) = \min_{x \in \mathbb{R}^d} \sum_{i=1}^{n} \left(\exp(a_i^T x) - b_i \cdot a_i^T x\right).$$
We now show that $\ell(y, b) = e^y - b \cdot y$ satisfies Assumption 2. For any interval $[\tau_1, \tau_2]$, we choose $q$ and $r$ such that $q/r \in [e^{\tau_1}, e^{\tau_2}]$. Note that $e^{\tau_2} - e^{\tau_1} = e^{\tau_1 + (\tau_2 - \tau_1)} - e^{\tau_1} \ge \tau_2 - \tau_1$. Also, $e^{\tau_2}$ is bounded by $e^M$. Thus, $q, r$ can be chosen to be polynomial in $\lceil 1/(\tau_2 - \tau_1) \rceil$ by letting $r = \lceil 1/(\tau_2 - \tau_1) \rceil$ and $q$ be some number less than $r \cdot e^M$. Then we choose $k = r$ and $b \in \mathbb{Z}^k$ such that $h(y) = \sum_{i=1}^{k} \ell(y, b_i) = r \cdot e^y - q \cdot y$. Let us verify Assumption 2. Conditions (i), (iv) are straightforward by our construction. For (ii), note that $h(y)$ takes its minimum at $\ln(q/r)$, which is inside $[\tau_1, \tau_2]$ by our construction. To verify (iii), consider the second-order Taylor expansion of $h(y)$ at $\ln(q/r)$:
$$h(y + \delta) - h(y) = \frac{r \cdot e^y}{2} \delta^2 + o(\delta^2) \ge \frac{\delta^2}{2} + o(\delta^2).$$
We can see that (iii) is satisfied. Therefore, Assumption 2 is satisfied.

5. In logistic regression, the negative log-likelihood minimization is
$$\min_{x \in \mathbb{R}^d} \sum_{i=1}^{n} \log(1 + \exp(a_i^T x)) - \sum_{i=1}^{n} b_i \cdot a_i^T x.$$
We claim that the loss function $\ell(y, b) = \log(1 + \exp(y)) - b \cdot y$ satisfies Assumption 2. By an argument similar to the one for Poisson regression, we can verify that $h(y) = \sum_{i=1}^{r} \ell(y, b_i) = r \log(1 + \exp(y)) - q y$, where $q/r \in \left[\frac{e^{\tau_1}}{1 + e^{\tau_1}}, \frac{e^{\tau_2}}{1 + e^{\tau_2}}\right]$ and $q, r$ are polynomial in $\lceil 1/(\tau_2 - \tau_1) \rceil$, satisfies all the conditions in Assumption 2. For (ii), observe that $h(y)$ takes its minimum at $y = \ln \frac{q/r}{1 - q/r}$. To verify (iii), we consider the second-order Taylor expansion at $y = \ln \frac{q/r}{1 - q/r}$, which is
$$h(y + \delta) - h(y) = \frac{q}{2(1 + e^y)} \delta^2 + o(\delta^2),$$
where $y \in [\tau_1, \tau_2]$.
Note that $e^y$ is bounded by $e^M$, which can be computed beforehand. As a result, (iii) holds as well.

6. In the mean estimation of inverse Gaussian models (McCullagh, 1984), the negative log-likelihood minimization is
$$\min_{x \in \mathbb{R}^d} \sum_{i=1}^{n} \frac{\left(b_i \cdot \sqrt{a_i^T x} - 1\right)^2}{b_i}.$$
Now we show that the loss function $\ell(y, b) = \frac{(b \cdot \sqrt{y} - 1)^2}{b}$ satisfies Assumption 2. By setting the derivative with respect to $y$ to zero, we can see that $\ell(\cdot, b)$ takes its minimum at $y = 1/b^2$. Thus, for any $[\tau_1, \tau_2]$, we choose $b' = q/r \in [1/\sqrt{\tau_2}, 1/\sqrt{\tau_1}]$. We can see that $h(y) = \ell(y, b')$ satisfies all the conditions in Assumption 2.

7. In the estimation of a generalized linear model under the exponential distribution (McCullagh, 1984), the negative log-likelihood minimization is
$$\min_{x \in \mathbb{R}^d} -\log L(x; A, b) = \min_{x \in \mathbb{R}^d} \sum_{i=1}^{n} \left( \frac{b_i}{a_i^T x} + \log(a_i^T x) \right).$$
By setting the derivative with respect to $y$ to zero, we can see that $\ell(y, b) = \frac{b}{y} + \log y$ has a unique minimizer at $y = b$. Thus, by choosing $b' \in [\tau_1, \tau_2]$ appropriately, we can readily show that $h(y) = \ell(y, b')$ satisfies all the conditions in Assumption 2.

To sum up, the combination of any penalty function given in Section 3.1 and any loss function given in Section 3.2 results in a strongly NP-hard optimization problem.

4 Main Results

In this section, we state our main results on the strong NP-hardness of Problem 1. We warm up with a preliminary result for a special case of Problem 1.

Theorem 1 (A Preliminary Result). Let Assumption 1 hold, and let $p(\cdot)$ be twice continuously differentiable on $(0, \infty)$. Then the minimization problem
$$\min_{x \in \mathbb{R}^d} \|Ax - b\|_q^q + \lambda \sum_{j=1}^{d} p(|x_j|) \qquad (1)$$
is strongly NP-hard.

The result shows that many of the penalized least squares problems, e.g.
, (Fan & Lv, 2010), while enjoying small estimation errors, are hard to compute. It suggests that there does not exist a fully polynomial-time approximation scheme for Problem 1. It does not, however, answer the question of whether one can approximately solve Problem 1 within a certain constant error.

Now we show that it is not even possible to efficiently approximate the global optimal solution of Problem 1, unless P = NP. Given an optimization problem $\min_{x \in X} f(x)$, we say that a solution $\bar{x}$ is $\epsilon$-optimal if $\bar{x} \in X$ and $f(\bar{x}) \le \inf_{x \in X} f(x) + \epsilon$.

Theorem 2 (Strong NP-Hardness of Problem 1). Let Assumptions 1 and 2 hold, and let $c_1, c_2 \in [0, 1)$ be arbitrary such that $c_1 + c_2 < 1$. Then it is strongly NP-hard to find a $\lambda \cdot \kappa \cdot n^{c_1} d^{c_2}$-optimal solution of Problem 1, where $d$ is the dimension of the variable space and
$$\kappa = \min_{t \in [\tau/2, \tau]} \left\{ \frac{2 p(t/2) - p(t)}{t} \right\}.$$

The non-approximable error in Theorem 2 involves the constant $\kappa$, which is determined by the sparse penalty function $p$. In the case where $p$ is the $L_0$ norm function, we can take $\kappa = 1$. In the case of the piecewise linear $L_1$ penalty, we have $\kappa = (k_1 - k_2)/4$. In the case of the SCAD penalty, we have $\kappa = \Theta(\gamma^2)$.

According to Theorem 2, the non-approximable error $\lambda \cdot \kappa \cdot n^{c_1} d^{c_2}$ is determined by three factors: (i) properties of the regularization penalty, $\lambda \cdot \kappa$; (ii) the data size $n$; and (iii) the dimension, or number of variables, $d$. This result illustrates a fundamental gap that cannot be closed by any polynomial-time deterministic algorithm. The gap scales up when either the data size or the number of variables increases. In Section 5.1, we will see that this gap is substantially larger than the desired estimation precision in a special case of sparse linear regression.
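For a given penalty, the constant $\kappa$ in Theorem 2 can be evaluated numerically. The sketch below does so for the clipped $L_1$ penalty; the choice $\tau = 4\gamma$ is our illustrative assumption (any $\tau$ on which $p$ is concave but not linear is admissible), and the grid minimization is a rough substitute for the exact minimum.

```python
import numpy as np

def clipped_l1(t, gamma=1.0):
    # Clipped L1 penalty p_gamma(t) = gamma * min(t, gamma).
    return gamma * np.minimum(t, gamma)

def kappa(p, tau, grid=100001):
    # kappa = min over t in [tau/2, tau] of (2 p(t/2) - p(t)) / t,
    # as defined in Theorem 2, approximated on a dense grid.
    t = np.linspace(tau / 2, tau, grid)
    return np.min((2 * p(t / 2) - p(t)) / t)

gamma = 1.0
tau = 4 * gamma  # assumed: p is concave but not linear on [0, tau] since tau > gamma
print(kappa(lambda t: clipped_l1(t, gamma), tau))  # -> 0.25
```

For $t \in [2\gamma, 4\gamma]$ one has $p(t/2) = p(t) = \gamma^2$, so the ratio $(2p(t/2) - p(t))/t = \gamma^2/t$ is minimized at $t = 4\gamma$, giving $\gamma^2/4$; with $\gamma = 1$ this matches the $(k_1 - k_2)/4$ value quoted in the text (here $k_1 = \gamma$, $k_2 = 0$).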
Theorems 1 and 2 validate the long-standing belief that optimization involving a nonconvex penalty is hard. More importantly, Theorem 2 provides lower bounds on the optimization error achievable by any polynomial-time algorithm. This is one of the strongest forms of hardness result for continuous optimization.

5 An Application and Remarks

In this section, we analyze the strong NP-hardness results in the special case of linear regression with the SCAD penalty (Problem 1). We give a few remarks on the implications of our hardness results.

5.1 Hardness of Regression with SCAD Penalty

Let us try to understand how significant the non-approximable error of Problem 1 is. We consider the special case of linear models with the SCAD penalty. Let the input data $(A, b)$ be generated by the linear model $A\bar{x} + \varepsilon = b$, where $\bar{x}$ is the vector of unknown true sparse coefficients and $\varepsilon$ is a zero-mean multivariate subgaussian noise. Given the data size $n$ and variable dimension $d$, we follow (Fan & Li, 2001) and obtain a special case of Problem 1, given by
$$\min_x \frac{1}{2} \|Ax - b\|_2^2 + n \sum_{j=1}^{d} p_\gamma(|x_j|), \qquad (2)$$
where $\gamma = \sqrt{\log d / n}$. Fan & Li (2001) showed that the optimal solution $x^*$ of problem (2) has a small statistical error, i.e.,
$$\|\bar{x} - x^*\|_2^2 = \mathcal{O}\left(n^{-1/2} + a_n\right),$$
where $a_n = \max\{p'_\lambda(|x^*_j|) : x^*_j \ne 0\}$. Fan et al. (2015) further showed that we only need to find a $\sqrt{n \log d}$-optimal solution to (2) to achieve such a small estimation error.

However, Theorem 2 tells us that it is not possible to compute an $\epsilon_{d,n}$-optimal solution for problem (2) in polynomial time, where $\epsilon_{d,n} = \lambda \kappa n^{1/2} d^{1/3}$ (by letting $c_1 = 1/2$, $c_2 = 1/3$). In the special case of problem (2), we can verify that $\lambda = n$ and $\kappa = \Omega(\gamma^2) = \Omega(\log d / n)$. As a result, we see that
$$\epsilon_{d,n} = \Omega\left(n^{1/2} d^{1/3}\right) \gg \sqrt{n \log d}$$
for high values of the dimension $d$.
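To get a feel for the scale of this gap, the following back-of-the-envelope computation compares $\epsilon_{d,n} \sim n^{1/2} d^{1/3} \log d$ (substituting $\lambda = n$ and $\kappa \sim \log d / n$) with the desired precision $\sqrt{n \log d}$. All constants are dropped, so the numbers are order-of-magnitude illustrations only, not quantities from the paper.

```python
import math

def gap_ratio(n, d):
    # Ratio of the non-approximable error eps_{d,n} ~ sqrt(n) * d^(1/3) * log d
    # (c1 = 1/2, c2 = 1/3, lambda = n, kappa ~ log(d)/n; constants dropped)
    # to the statistical precision sqrt(n * log d).
    eps = math.sqrt(n) * d ** (1.0 / 3.0) * math.log(d)
    precision = math.sqrt(n * math.log(d))
    return eps / precision

for n, d in [(10**3, 10**4), (10**4, 10**6), (10**5, 10**9)]:
    print(n, d, gap_ratio(n, d))
```

The ratio simplifies to $d^{1/3} \sqrt{\log d}$: it is independent of $n$ and grows polynomially with the dimension $d$, which is exactly the "gap scales up with the number of variables" phenomenon described above.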
According to Theorem 2, it is strongly NP-hard to approximately solve problem (2) within the required statistical precision $\sqrt{n \log d}$. This result illustrates a sharp contrast between the statistical properties of sparse estimation and the worst-case computational complexity.

5.2 Remarks on the NP-Hardness Results

As illustrated by the preceding analysis, the non-approximability of Problem 1 suggests that computing the sparse estimator is hard. The results suggest a fundamental conflict between computational efficiency and estimation accuracy in sparse data analysis. Although the results seem negative, they should not discourage researchers from studying computational perspectives of sparse optimization. We make the following remarks:

1. Theorems 1 and 2 are worst-case complexity results. They suggest that one cannot find a tractable solution to the sparse optimization problems without making additional assumptions to rule out the worst-case instances.

2. Our results do not exclude the possibility that, under more stringent modeling and distributional assumptions, the problem would be tractable with high probability or on average.

In short, the sparse optimization Problem 1 is fundamentally hard from a purely computational perspective. This paper, together with the prior related works, provides a complete answer to the computational complexity of sparse optimization.

6 Proof of Theorem 1

In this section, we prove Theorem 1. The proof of Theorem 2 is deferred to the appendix; it is based on the ideas of the proof in this section. We construct a polynomial-time reduction from the 3-partition problem (Garey & Johnson, 1978) to the sparse optimization problem.
Given a set $S$ of $3m$ integers $s_1, \dots, s_{3m}$, the 3-partition problem is to determine whether $S$ can be partitioned into $m$ triplets such that the sum of the numbers in each subset is equal. This problem is known to be strongly NP-hard (Garey & Johnson, 1978). The main proof idea bears a similar spirit to the works by Huo & Chen (2010), Chen et al. (2014), and Chen & Wang (2016). The proofs of all the lemmas can be found in the appendix.

We first illustrate several properties of the penalty function when it satisfies the conditions in Theorem 1.

Lemma 3. If $p(t)$ satisfies the conditions in Theorem 1, then for any $l \ge 2$ and any $t_1, t_2, \dots, t_l \in \mathbb{R}$, we have
$$p(|t_1|) + \dots + p(|t_l|) \ge \min\{p(|t_1 + \dots + t_l|),\, p(\tau)\}.$$

Lemma 4. If $p(t)$ satisfies the conditions in Theorem 1, then there exists $\tau_0 \in (0, \tau)$ such that $p(\cdot)$ is concave but not linear on $[0, \tau_0]$ and is twice continuously differentiable on $[\tau_0, \tau]$. Furthermore, for any $\tilde{t} \in (\tau_0, \tau)$, let $\bar\delta = \min\{\tau_0/3,\, \tilde{t} - \tau_0,\, \tau - \tilde{t}\}$. Then for any $\delta \in (0, \bar\delta)$, $l \ge 2$, and any $t_1, t_2, \dots, t_l$ such that $t_1 + \dots + t_l = \tilde{t}$, we have
$$p(|t_1|) + \dots + p(|t_l|) < p(\tilde{t}) + C_1 \delta$$
only if $|t_i - \tilde{t}| < \delta$ for some $i$ while $|t_j| < \delta$ for all $j \ne i$, where $C_1 = \frac{p(\tau_0/3) + p(2\tau_0/3) - p(\tau_0)}{\tau_0/3} > 0$.

In our proof of Theorem 1, we consider the following function:
$$g_{\theta,\mu}(t) := p(|t|) + \theta \cdot |t|^q + \mu \cdot |t - \hat\tau|^q$$
with $\theta, \mu > 0$, where $\hat\tau$ is an arbitrary fixed rational number in $(\tau_0, \tau)$. We have the following lemmas about $g_{\theta,\mu}(t)$.

Lemma 5. If $p(t)$ satisfies the conditions in Theorem 1, $q > 1$, and $\tau_0$ satisfies the properties in Lemma 4, then there exist $\underline\theta > 0$ and $\underline\mu > 0$ such that for any $\theta \ge \underline\theta$ and $\mu \ge \underline\mu \cdot \theta$, the following properties are satisfied:

1. $g''_{\theta,\mu}(t) \ge 1$ for any $t \in [\tau_0, \tau]$;

2. $g_{\theta,\mu}(t)$ has a unique global minimizer $t^*(\theta, \mu) \in (\tau_0, \tau)$;

3. Let $\bar\delta = \min\{t^*(\theta, \mu) - \tau_0,\, \tau - t^*(\theta, \mu),\, 1\}$. Then for any $\delta \in (0, \bar\delta)$, we have $g_{\theta,\mu}(t) < h(\theta, \mu) + \delta^2$ only if $|t - t^*(\theta, \mu)| < \delta$, where $h(\theta, \mu)$ is the minimal value of $g_{\theta,\mu}(t)$.

Lemma 6. If $p(t)$ satisfies the conditions in Theorem 1, $q = 1$, and $\tau_0$ satisfies the properties in Lemma 4, then there exists $\hat\mu > 0$ such that for any $\mu \ge \hat\mu$, the following properties are satisfied:

1. $g'_{0,\mu}(t) < -1$ for any $t \in [\tau_0, \hat\tau)$ and $g'_{0,\mu}(t) > 1$ for any $t \in (\hat\tau, \tau]$;

2. $g_{0,\mu}(t)$ has a unique global minimizer $t^*(0, \mu) = \hat\tau \in (\tau_0, \tau)$;

3. Let $\bar\delta = \min\{\hat\tau - \tau_0,\, \tau - \hat\tau,\, 1\}$. Then for any $\delta \in (0, \bar\delta)$, we have $g_{0,\mu}(t) < h(0, \mu) + \delta^2$ only if $|t - \hat\tau| < \delta$.

By combining the above results, we have the following lemma, which is useful in our proof of Theorem 1.

Lemma 7. Suppose $p(t)$ satisfies the conditions in Theorem 1 and $\tau_0$ satisfies the properties in Lemma 4. Let $h(\theta, \mu)$ and $t^*(\theta, \mu)$ be as defined in Lemma 5 and Lemma 6, respectively, for the cases $q > 1$ and $q = 1$. Then we can find $\theta$ and $\mu$ such that for any $l \ge 2$ and $t_1, \dots, t_l \in \mathbb{R}$,
$$\sum_{j=1}^{l} p(|t_j|) + \theta \cdot \Bigg| \sum_{j=1}^{l} t_j \Bigg|^q + \mu \cdot \Bigg| \sum_{j=1}^{l} t_j - \hat\tau \Bigg|^q \ge h(\theta, \mu).$$
Moreover, let $\bar\delta = \min\left\{ \frac{\tau_0}{3},\, \frac{t^*(\theta,\mu) - \tau_0}{2},\, \frac{\tau - t^*(\theta,\mu)}{2},\, 1,\, C_1 \right\}$, where $C_1$ is defined in Lemma 4. Then for any $\delta \in (0, \bar\delta)$, the inequality
$$\sum_{j=1}^{l} p(|t_j|) + \theta \cdot \Bigg| \sum_{j=1}^{l} t_j \Bigg|^q + \mu \cdot \Bigg| \sum_{j=1}^{l} t_j - \hat\tau \Bigg|^q < h(\theta, \mu) + \delta^2 \qquad (3)$$
holds only if $|t_i - t^*(\theta, \mu)| < 2\delta$ for some $i$ while $|t_j| \le \delta$ for all $j \ne i$.

Proof of Theorem 1. We present a polynomial-time reduction to problem (1) from the 3-partition problem.
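Since the reduction is from 3-partition, it may help to see the source problem concretely. The brute-force decider below (a hypothetical helper, not part of the paper's construction) checks whether a multiset of $3m$ integers splits into $m$ triplets of equal sum; it runs in exponential time, which is consistent with the problem being strongly NP-hard.

```python
from itertools import combinations

def three_partition(nums):
    # Decide the 3-partition problem by exhaustive search: can `nums`
    # (3m integers) be split into m triplets, each summing to sum(nums)/m?
    # Exponential time -- for tiny illustrative instances only.
    m = len(nums) // 3
    total = sum(nums)
    if m == 0 or len(nums) % 3 != 0 or total % m != 0:
        return False
    target = total // m

    def solve(remaining):
        if not remaining:
            return True
        first, rest = remaining[0], remaining[1:]
        # The first remaining element must join some triplet; try all partners.
        for i, j in combinations(range(len(rest)), 2):
            if first + rest[i] + rest[j] == target:
                if solve([x for k, x in enumerate(rest) if k not in (i, j)]):
                    return True
        return False

    return solve(sorted(nums))

print(three_partition([20, 23, 25, 30, 49, 45, 27, 30, 30, 40, 22, 19]))  # True (4 triplets of 90)
print(three_partition([1, 1, 1, 1, 1, 4]))                                # False
```

Strong NP-hardness means the problem stays hard even when the integers are polynomially bounded in the input length, which is what allows the reduction to produce an optimization instance of polynomial size.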
For any given instance of the 3-partition problem with $b = (b_1, \dots, b_{3m})$, we consider the minimization problem $\min_x f(x)$ in the form of (1), with $x = \{x_{ij}\}$, $1 \le i \le 3m$, $1 \le j \le m$, where
$$f(x) := \sum_{j=2}^{m} \Bigg| \sum_{i=1}^{3m} b_i x_{ij} - \sum_{i=1}^{3m} b_i x_{i1} \Bigg|^q + \sum_{i=1}^{3m} \Bigg| (\lambda\theta)^{\frac{1}{q}} \sum_{j=1}^{m} x_{ij} \Bigg|^q + \sum_{i=1}^{3m} \Bigg| (\lambda\mu)^{\frac{1}{q}} \Bigg( \sum_{j=1}^{m} x_{ij} - \hat\tau \Bigg) \Bigg|^q + \lambda \sum_{i=1}^{3m} \sum_{j=1}^{m} p(|x_{ij}|).$$
Since the lower bounds $\underline\theta$, $\underline\mu$, and $\hat\mu$ only depend on the penalty function $p(\cdot)$, we can choose $\theta \ge \underline\theta$ and $\mu \ge \underline\mu \theta$ if $q > 1$, or $\theta = 0$ and $\mu \ge \hat\mu$ if $q = 1$, such that $(\lambda\theta)^{1/q}$ and $(\lambda\mu)^{1/q}$ are both rational numbers. Since $\hat\tau$ is also rational, all the coefficients of $f(x)$ are of finite size and independent of the input size of the given 3-partition instance. Therefore, the minimization problem $\min_x f(x)$ has polynomial size with respect to the given 3-partition instance.

For any $x$, by Lemma 7,
$$f(x) \ge 0 + \lambda \cdot \sum_{i=1}^{3m} \left\{ \sum_{j=1}^{m} p(|x_{ij}|) + \theta \cdot \Bigg| \sum_{j=1}^{m} x_{ij} \Bigg|^q + \mu \cdot \Bigg| \sum_{j=1}^{m} x_{ij} - \hat\tau \Bigg|^q \right\} \ge 3m\lambda \cdot h(\theta, \mu). \qquad (4)$$

Now we claim that there exists an equitable partition for the 3-partition problem if and only if the optimal value of $f(x)$ is smaller than $3m\lambda \cdot h(\theta, \mu) + \epsilon$, where $\epsilon$ is specified later.

On one hand, if $S$ can be equally partitioned into $m$ subsets, then we define
$$x_{ij} = \begin{cases} t^*(\theta, \mu) & \text{if } b_i \text{ belongs to the } j\text{th subset}, \\ 0 & \text{otherwise}. \end{cases}$$
It can easily be verified that these $x_{ij}$'s satisfy $f(x) = 3m\lambda \cdot h(\theta, \mu)$. Then, due to (4), we know that these $x_{ij}$'s provide an optimal solution to $f(x)$ with optimal value $3m\lambda \cdot h(\theta, \mu)$.
On the other hand, suppose the optimal value of $f(x)$ is $3m\lambda \cdot h(\theta, \mu)$, and there is a polynomial-time algorithm that solves (1). Then, for
$$\delta = \min\left\{ \frac{\tau_0}{8 \sum_{i=1}^{3m} b_i},\, \bar\delta \right\} \quad \text{and} \quad \epsilon = \min\{\lambda\delta^2,\, (\tau_0/2)^q\},$$
where
$$\bar\delta = \min\left\{ \frac{\tau_0}{3},\, \frac{t^*(\theta, \mu) - \tau_0}{2},\, \frac{\tau - t^*(\theta, \mu)}{2},\, \frac{p(\tau_0/3) + p(2\tau_0/3) - p(\tau_0)}{\tau_0/3},\, 1 \right\},$$
we are able to find a near-optimal solution $x$ such that $f(x) < 3m\lambda \cdot h(\theta, \mu) + \epsilon$ within time polynomial in $\log(1/\epsilon)$ and the size of $f(x)$, which is polynomial with respect to the size of the given 3-partition instance. Now we show that we can find an equitable partition based on this near-optimal solution. By the definition of $\epsilon$, $f(x) < 3m\lambda \cdot h(\theta, \mu) + \epsilon$ implies
$$\sum_{j=1}^{m} p(|x_{ij}|) + \theta \cdot \Bigg| \sum_{j=1}^{m} x_{ij} \Bigg|^q + \mu \cdot \Bigg| \sum_{j=1}^{m} x_{ij} - \hat\tau \Bigg|^q < h(\theta, \mu) + \delta^2.$$
