$L_0$ regularized estimation for nonlinear models that have sparse underlying linear structures

Zhiyi Chi
Department of Statistics, University of Connecticut
215 Glenbrook Road, U-4120, Storrs, CT 06269, USA
Email: zchi@stat.uconn.edu

August 14, 2018

Abstract

We study the estimation of $\beta$ for the nonlinear model $y = f(X^\top\beta) + \epsilon$ when $f$ is a known nonlinear transformation, $\beta$ has sparse nonzero coordinates, and the number of observations can be much smaller than that of parameters ($n \ll p$). We show that in order to bound the $L_2$ error of the $L_0$ regularized estimator $\hat\beta$, i.e., $\|\hat\beta - \beta\|_2$, it is sufficient to establish two conditions. Based on this, we obtain bounds on the $L_2$ error for (1) $L_0$ regularized maximum likelihood estimation (MLE) for exponential linear models and (2) $L_0$ regularized least squares (LS) regression for the more general case where $f$ is analytic. For the analytic case, we rely on power series expansion of $f$, which requires taking into account the singularities of $f$.

Keywords and phrases. Regularization, sparsity, MLE, regression, variable selection, parameter estimation, nonlinearity, power series expansion, analytic, exponential.

AMS 2000 subject classification. Primary 62G05; secondary 62J02.

Acknowledgement. Research partially supported by NSF Grant DMS-07-06048 and NIH Grant MH-68028.

1 Introduction

Regularized estimation for sparse models that have a large number of parameters compared to the number of observations has become an important topic in statistics, machine learning, and a few other areas (Bunea et al. 2007, Candès & Tao 2007, Donoho et al. 2006, Efron et al. 2004, Field 1994, Natarajan 1995, Zhao & Yu 2006).
The research in these areas has been focused on regularized least squares (LS) regression for sparse linear models $y = X\beta + \epsilon$, where $y \in \mathbb{R}^n$ is the response vector, $X \in \mathbb{R}^{n\times p}$ the design matrix, $\beta \in \mathbb{R}^p$ the vector of parameters, and $\epsilon \in \mathbb{R}^n$ the random error vector that has mean 0 given $X$. By sparse we mean that the number of nonzero coordinates of $\beta$ is much smaller than $p$ (Wasserman & Roeder 2009). On the other hand, nonlinear models, such as logistic models, that have underlying linear structures are widely used. The general form of such models is

$$y = f(X^\top\beta) + \epsilon, \qquad (1.1)$$

where $f: \mathbb{R}\to\mathbb{R}$ is a nonlinear function that may or may not be known. Here and henceforth, for $x = (x_1, \ldots, x_n)^\top \in \mathbb{R}^n$, we denote $f(x) = (f(x_1), \ldots, f(x_n))^\top$.

The need for nonlinear models with sparse underlying linear structure is clearly laid out in several recent works in neuroscience (Sharpee et al. 2008, 2004), and some algorithms based on information criteria have been proposed to estimate not only $\beta$ but also $f$. However, at this point it seems very hard to evaluate the estimation precision of those algorithms. In this article we are content to establish the $L_2$ precision of the $L_0$ regularized estimator of $\beta$ for sparse models, when the design matrix $X$ is fixed and $f$ is known. We shall allow $n \ll p$.

Despite its limitations from a computational point of view, $L_0$ regularization is an important and conceptually simple instrument for parameter estimation and model selection (Akaike 1974, Huang et al. 2008, Schwarz 1978). Besides, since many improvements over $L_0$ regularization are achieved by taking advantage of properties of linear models that nonlinear models may fail to have (Zhao & Yu 2006), it is reasonable to take $L_0$ regularization as a prototype for further study of nonlinear models.
With this in mind, our concern is whether good estimation precision can be achieved, rather than how fast to achieve it.

In Section 2, we establish a basic result. We show that provided two conditions are satisfied, the $L_2$ error of the $L_0$ regularized estimator satisfies a quadratic inequality which yields the estimation precision. Consequently, establishing the estimation precision is reduced to establishing the two conditions. As a minor benefit of the result, independence of the coordinates of $\epsilon$ in general need not be assumed. We will also set up notation and collect other preliminary results in Section 2.

After that, we shall establish the alluded conditions for exponential linear models and for analytic models, i.e., models with analytic $f$. Although a special case of analytic models, exponential linear models are much simpler to handle due to the explicit expression of the conditional density of $y$ given $X$. For these models, we consider the maximum likelihood estimator (MLE). The discussion is in Section 3. For analytic models, we will consider LS regression. Sections 4 and 5 establish the two conditions, respectively. In Section 5, the approach is to use infinite power series expansion of $f$. The main complexity of the approach arises when $f$ has singularities on $\mathbb{C}$. To illustrate, we will use as working examples the logistic regression model in Section 3 and a noise-corrupted version of it in Section 5. Most of the proofs are collected in Section 6.

2 Preliminaries

2.1 Notation

Denote by $X_1^\top, \ldots, X_n^\top$ the row vectors of $X$, with $X_i \in \mathbb{R}^p$. Denote by $V_1, \ldots, V_p$ the column vectors of $X$. We shall always assume that $X$ is fixed and impose the condition that $V_j \ne 0$. In fact, if a column vector of $X$ is 0, then it has no effect on $y$ and should be removed.
In the subsequent discussion, the column vectors of $X$ should be understood as unnormalized. It is therefore helpful to think of $X$ as a collection of covariate vectors registered exactly as they are observed.

For $S = \{i_1, \ldots, i_k\}$ with $1 \le i_1 < \cdots < i_k \le p$, denote $X_S = (V_{i_1}, \ldots, V_{i_k})$, and for $u \in \mathbb{R}^p$, denote $u_S = (u_{i_1}, \ldots, u_{i_k})^\top$. The support of $u$ is $\mathrm{spt}(u) = \{i: u_i \ne 0\}$. Denote by $\|u\|_p$ the $L_p$ norm of $u$. If $A$ is a set, denote by $|A|$ its cardinality. The $L_0$ norm of $u$ refers to $|\mathrm{spt}(u)|$ and is often denoted by $\|u\|_0$. We choose the notation $|\mathrm{spt}(u)|$ since it seems more intuitive. For $\varphi = (\varphi_1, \ldots, \varphi_n)$ and $x \in \mathbb{R}^n$, where each $\varphi_i: \mathbb{R}\to\mathbb{R}$, denote $\varphi(x) = (\varphi_1(x_1), \ldots, \varphi_n(x_n))^\top$.

2.2 General form of estimator and line of argument

The general form of an $L_0$ regularized estimator is

$$\hat\beta = \arg\min_{u\in D}\,[\ell(y, Xu) + c_r|\mathrm{spt}(u)|], \qquad (2.1)$$

where $D$ is a pre-selected search domain in $\mathbb{R}^p$, $\ell(y, Xu)$ is a certain loss function, and $c_r > 0$ is a tuning parameter. For the MLE, $\ell(y, Xu)$ is the minus log likelihood, while for LS regression, it is $\|y - Xu\|_2^2$. For linear regression, $D$ is typically set equal to $\mathbb{R}^p$. However, for nonlinear regression, our position is that some constraint on $D$ is needed in order to control the potentially large variation of the functional properties of $f$ at different possible values of $X\beta$.

For both the MLE and LS regression, the argument to establish the precision of $\hat\beta$ proceeds as follows. First, it is easy to show that $\hat\beta$ satisfies an inequality of the following form,

$$G(\psi(X\hat\beta) - \psi(X\beta)) \le 2|\langle\epsilon, \varphi(X\hat\beta) - \varphi(X\beta)\rangle| - c_r(|\mathrm{spt}(\hat\beta)| - |\mathrm{spt}(\beta)|), \qquad (2.2)$$

where $G$ is a function $\mathbb{R}^n\to\mathbb{R}$, $\psi = (\psi_1, \ldots, \psi_n)$ and $\varphi = (\varphi_1, \ldots, \varphi_n)$, with $\psi_i$ and $\varphi_i$ being functions $\mathbb{R}\to\mathbb{R}$.
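For intuition, the minimization in (2.1) can be carried out literally by enumerating supports when $p$ is tiny. Below is a minimal sketch for the linear LS case ($f(x) = x$, squared loss); the function name and interface are ours, not from the paper:

```python
from itertools import combinations
import numpy as np

def l0_ls_estimate(y, X, c_r, max_support):
    """Exhaustive-search sketch of the L0 regularized estimator (2.1)
    with squared loss and linear f: minimize
        ||y - X u||_2^2 + c_r * |spt(u)|
    over all u supported on at most `max_support` coordinates."""
    n, p = X.shape
    best_u = np.zeros(p)
    best_obj = float(np.sum(y ** 2))          # objective at u = 0
    for k in range(1, max_support + 1):
        for S in combinations(range(p), k):   # candidate supports of size k
            coef, *_ = np.linalg.lstsq(X[:, list(S)], y, rcond=None)
            u = np.zeros(p)
            u[list(S)] = coef
            obj = float(np.sum((y - X @ u) ** 2)) + c_r * k
            if obj < best_obj:
                best_obj, best_u = obj, u
    return best_u
```

The cost grows combinatorially in `max_support`, which is exactly the computational limitation of $L_0$ regularization acknowledged in Section 1.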
Then the following two conditions will be established.

Condition H1 Given $q \in (0,1)$, there is $c_1 = c_1(X, \beta, \varphi, q) > 0$, such that

$$\Pr\{|\langle\epsilon, \varphi(Xu) - \varphi(X\beta)\rangle| \le c_1\sqrt{n}\,\|u - \beta\|_1, \text{ all } u \in D\} \ge 1 - 2q.$$

The coefficient 2 in $1-2q$ is nonessential. It is for ease of notation in the statements of the main results.

Condition H2 There is $c_2 = c_2(X, \beta, \psi) > 0$, such that for all $u \in D$,

$$G(\psi(Xu) - \psi(X\beta)) \ge c_2 n\|u - \beta\|_2^2.$$

The constants $c_1$ and $c_2$ will be explicitly constructed. In general, both depend on $X$. Since we only consider fixed design, they are nonrandom. We will check the conditions respectively for the MLE and LS regression. Once this is done, using the next result, we then obtain a bound on $\|\hat\beta - \beta\|_2$. Note that the result is stated in a slightly more general form, as it does not require that $\hat\beta$ be the one defined by (2.1).

Proposition 2.1 Suppose Conditions H1 and H2 are satisfied. If $\hat\beta \in D$ is a random variable that always satisfies the inequality (2.2) with $c_r = 3c_1^2/c_2$, then, letting $\kappa_r = 3c_1/c_2$,

$$\Pr\left\{\|\hat\beta - \beta\|_2 \le \frac{\kappa_r\sqrt{|\mathrm{spt}(\beta)|}}{\sqrt{n}}\right\} \ge 1 - 2q.$$

In order for the bounds to be meaningful, we need to make sure $\kappa_r$ is not too large, at least compared to $\sqrt{n}$. This will be the main consideration when we try to establish Conditions H1 and H2.

Because Proposition 2.1 plays a fundamental role in our study, we give its proof below. This is the only result whose proof appears in the main text.

Proof of Proposition 2.1. Denote $T = \mathrm{spt}(\beta)$ and $S = \mathrm{spt}(\hat\beta)$. Under Conditions H1 and H2, with probability at least $1-2q$,

$$c_2 n\|\hat\beta - \beta\|_2^2 \le 2c_1\sqrt{n}\,\|\hat\beta - \beta\|_1 - c_r(|S| - |T|) \le 2c_1\sqrt{n}\sqrt{|S\cup T|}\,\|\hat\beta - \beta\|_2 - c_r(|S| - |T|),$$

where the second inequality is due to $\mathrm{spt}(\beta - \hat\beta) \subset S\cup T$ and the Cauchy-Schwarz inequality. Let $t = \|\hat\beta - \beta\|_2$ and $b = c_1/c_2$.
Then

$$t^2 - \frac{2b\sqrt{|S\cup T|}\,t}{\sqrt{n}} + \frac{3b^2(|S| - |T|)}{n} \le 0.$$

The left hand side is a quadratic function in $t$. In order for the inequality to hold, there must be $|S\cup T| \ge 3(|S| - |T|)$ and

$$0 \le t \le \frac{b}{\sqrt{n}}\left[\sqrt{|S\cup T|} + \sqrt{|S\cup T| + 3(|T| - |S|)}\right].$$

Let $T_1 = T\setminus S$ and $S_1 = S\setminus T$. By $|S\cup T| = |S_1| + |T|$ and $|T| - |S| = |T_1| - |S_1|$,

$$0 \le t \le \frac{b}{\sqrt{n}}\left[\sqrt{|T| + |S_1|} + \sqrt{|T| + 3|T_1| - 2|S_1|}\right].$$

It is easy to see that, due to $|T_1| \le |T|$, the right hand side is a decreasing function in $|S_1|$ on $[0, (|T| + 3|T_1|)/2]$, and hence is no greater than its value at 0, which is $(b/\sqrt{n})(\sqrt{|T|} + \sqrt{|T| + 3|T_1|}) \le 3b\sqrt{|T|}/\sqrt{n}$.

To establish Conditions H1 and H2, certain assumptions are needed. We next discuss the major assumptions used by both the MLE and LS regression.

2.3 Tail assumption on errors

To establish Condition H1, we will need the following assumption on $\epsilon$.

Tail assumption. There is $\sigma > 0$, such that for any $t, a_1, \ldots, a_n \in \mathbb{R}$,

$$\Pr\left\{\left(\sum_{i=1}^n a_i\epsilon_i\right)^2 > t^2\sum_{i=1}^n a_i^2\right\} \le 2\exp\left(-\frac{t^2}{2\sigma^2}\right). \qquad (2.3)$$

The tail assumption (2.3) is rather mild. If $\epsilon \sim N(0, \sigma^2\Sigma)$ and the spectral radius of $\Sigma$ is no greater than 1, then (2.3) holds. In this case, $\epsilon_1, \ldots, \epsilon_n$ need not be independent. Moreover, if the $\epsilon_i$ are independent, such that $E(\epsilon_i) = 0$ and $|\epsilon_i| \le \sigma$ for all $i$, then by Hoeffding's inequality (Pollard 1984), (2.3) holds.

2.4 Coherence and restricted domains

In order to identify $\beta$, some conditions on the correlations between the column vectors of $X$ are needed. The maximum correlation between columns of $X$ is

$$\mu(X) = \sup_{1\le i<j\le p}\frac{|V_i^\top V_j|}{\|V_i\|_2\|V_j\|_2}.$$

Given $\nu \in (0,1)$, let

$$n(\nu) = (1-\nu)\left[1 + \frac{1}{\mu(X)}\right]. \qquad (2.4)$$

Under mild conditions, $\mu(X) = O(\sqrt{\ln p/n})$ with high probability (cf. Section 3.3), so that for $\ln p = o(n)$, $n(\nu) \asymp \sqrt{n/\ln p}$.

Proposition 2.2 (1) Fix $\nu \in (0,1)$. If $|\mathrm{spt}(u)| \le n(\nu)$, then

$$\|Xu\|_2^2 \ge \nu[1+\mu(X)]\sum_{j\in\mathrm{spt}(u)}|u_j|^2\|V_j\|_2^2.$$

(2) If $|\mathrm{spt}(u)| \le n(\nu)/2$ and $|\mathrm{spt}(v)| \le n(\nu)/2$, then the inequality in (1) holds with $u$ replaced by $u - v$.

Given a set $I \subset \mathbb{R}$, denote $D(I) = \{u \in \mathbb{R}^p: X_i^\top u \in I,\ i = 1, \ldots, n\}$, and for $h \ge 1$, denote $D(I, h) = \{u \in D(I): |\mathrm{spt}(u)| \le h\}$. We shall need the following properties of $D(I, h)$.

Proposition 2.3 (1) If $I$ is closed, then $D(I, 1) \subset D(I, 2) \subset \cdots$ are closed, and (2) if $I$ is compact and $h < n(0) = 1 + \mu(X)^{-1}$, then $D(I, h)$ is compact.
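The quantities $\mu(X)$ and $n(\nu)$ of this subsection are directly computable for a fixed design. A small sketch (the function names are ours):

```python
import numpy as np

def coherence(X):
    """Mutual coherence mu(X): the largest absolute correlation between
    distinct (unnormalized) columns of X, as in Section 2.4."""
    V = X / np.linalg.norm(X, axis=0)   # normalize each column
    G = np.abs(V.T @ V)                 # absolute column correlations
    np.fill_diagonal(G, 0.0)            # exclude the diagonal (i = j)
    return G.max()

def n_nu(X, nu):
    """The support-size threshold n(nu) = (1 - nu) * (1 + 1/mu(X)) of (2.4)."""
    return (1.0 - nu) * (1.0 + 1.0 / coherence(X))
```

For an orthogonal design `coherence` returns 0 and $n(\nu)$ is unbounded; the more correlated the columns, the smaller the supports for which Proposition 2.2 gives a useful lower bound.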
3 Exponential linear models

3.1 Setup and main result

Let $\mu$ be a Borel measure on $\mathbb{R}$ with $\mu(\mathbb{R}) > 0$. Suppose $I \subset \mathbb{R}$ is a nonempty open interval and $\{P_t: t \in I\}$ is a family of probability distributions on $\mathbb{R}$, such that with respect to $\mu$ each $P_t$ has a density

$$p_t(y) = \exp\{ty - \Lambda(t)\}, \quad \text{with } \Lambda(t) = \ln\int e^{ty}\,\mu(dy). \qquad (3.1)$$

As is well known, $\Lambda \in C^\infty(I)$ and for $t \in I$,

$$E(\xi) = \Lambda'(t), \quad \mathrm{Var}(\xi) = \Lambda''(t) > 0, \quad \text{if } \xi \sim P_t. \qquad (3.2)$$

For example, if $\mu = N(0, \sigma^2)$, then $\Lambda(t) = \sigma^2t^2/2$ and $P_t = N(\sigma^2t, \sigma^2)$. If $\mu$ is the counting measure on $\{0,1\}$, then $\Lambda(t) = \ln(1 + e^t)$ and $P_t$ is the Bernoulli distribution with parameter $e^t/(1+e^t)$. We notice that given $y$, $g(t) := p_t(y)$ can be analytically extended to the domain $\{z \in \mathbb{C}: \mathrm{Re}(z) \in I\}$. This fact is not needed in the rest of the section.

Assume that given $X$, $y_1, \ldots, y_n$ are independent, such that each $y_i \sim P_{t_i}$ with $t_i = X_i^\top\beta$. The joint likelihood of $y_1, \ldots, y_n$ is then

$$\prod_{i=1}^n\exp\{y_iX_i^\top\beta - \Lambda(X_i^\top\beta)\} = \exp\left\{y^\top X\beta - \sum_{i=1}^n\Lambda(X_i^\top\beta)\right\}.$$

From the expression, the $L_0$ regularized MLE for $\beta$ is

$$\hat\beta = \arg\max_{u\in D}\left[y^\top Xu - \sum_{i=1}^n\Lambda(X_i^\top u) - c_r|\mathrm{spt}(u)|\right]. \qquad (3.3)$$

If $\beta \in D$, then

$$y^\top X\beta - \sum_{i=1}^n\Lambda(X_i^\top\beta) - c_r|\mathrm{spt}(\beta)| \le y^\top X\hat\beta - \sum_{i=1}^n\Lambda(X_i^\top\hat\beta) - c_r|\mathrm{spt}(\hat\beta)|,$$

and hence

$$\sum_{i=1}^n\left[\Lambda(X_i^\top\hat\beta) - \Lambda(X_i^\top\beta) - \Lambda'(X_i^\top\beta)X_i^\top(\hat\beta - \beta)\right] \le \langle\epsilon, X\hat\beta - X\beta\rangle - c_r(|\mathrm{spt}(\hat\beta)| - |\mathrm{spt}(\beta)|),$$

where $\epsilon_i = y_i - E(y_i) = y_i - \Lambda'(X_i^\top\beta)$ has mean 0 for each $i$. It is seen that the inequality gives rise to (2.2) once we define

$$G(x) = \sum_{i=1}^n x_i, \quad \psi_i(z) = \Lambda(z) - \Lambda'(X_i^\top\beta)z, \quad \varphi_i(z) = z/2, \qquad (3.4)$$

for $x \in \mathbb{R}^n$, $z \in \mathbb{R}$, and $1 \le i \le n$.

Theorem 3.1 Suppose $\epsilon_1, \ldots, \epsilon_n$ satisfy (2.3) for some $\sigma > 0$. Fix $\nu \in (0,1)$.
Let $D = D(I, n(\nu)/2)$ in (3.3), where $n(\nu)$ is defined in (2.4). Suppose

$$\delta := \inf_{t\in I}\Lambda''(t) > 0. \qquad (3.5)$$

Fix $q \in (0, 1/2)$. Let

$$c_r = \frac{3\sigma^2\ln(p/q)}{\nu\delta[1+\mu(X)]}\cdot\frac{\max_j\|V_j\|_2^2}{\min_j\|V_j\|_2^2}$$

in (3.3). Then, provided $\beta \in D$,

$$\Pr\left\{\|\hat\beta - \beta\|_2 \le \frac{\kappa_r\sqrt{|\mathrm{spt}(\beta)|}}{\sqrt{n}}\right\} \ge 1 - 2q, \qquad (3.6)$$

where

$$\kappa_r = \frac{3\sigma\sqrt{2\ln(p/q)}}{\nu\delta[1+\mu(X)]}\cdot\frac{\sqrt{n}\max_j\|V_j\|_2}{\min_j\|V_j\|_2^2}.$$

3.2 Comments

Some comments on Theorem 3.1 are in order; many of them also apply to the results we shall establish later.

First, on the constraint $\hat\beta \in D(I, n(\nu)/2)$. As noted in Section 2.4, under mild conditions, for $p$ with $\ln p = o(n)$, $n(\nu) \asymp \sqrt{n/\ln p}$. In many cases, since it is reasonable to assume that $|\mathrm{spt}(\beta)| = O(1)$ (Wasserman & Roeder 2009), the constraint then is very mild.

Second, on $\|\hat\beta - \beta\|_2$, which is determined by $\kappa_r\sqrt{|\mathrm{spt}(\beta)|}/\sqrt{n}$ in (3.6). By (3.6), $\kappa_r = O(R\sqrt{\ln p})$, where

$$R = \frac{\sqrt{n}\max_j\|V_j\|_2}{\min_j\|V_j\|_2^2} = \frac{\max_j\|V_j\|_2/\sqrt{n}}{\min_j\|V_j\|_2^2/n}.$$

Under mild conditions, $R$ grows very slowly with $n$. For example, $R = 1$ if $X$ is such that $\|V_j\|_2 = \sqrt{n}$ (recall all $V_j \in \mathbb{R}^n$). We shall see such an example related to the logistic regression. As another example, suppose all the $np$ entries of $X$ are i.i.d. $\sim Z$. If $Z$ is bounded, then clearly $\max_j\|V_j\|_2/\sqrt{n} = O(1)$. If $Z \sim N(0,1)$, then for any $0 < \eta < 1/2$,

$$\Pr\left\{\max_{1\le j\le p}\|V_j\|_\infty \le \sqrt{2\ln(np/\eta)}\right\} \ge 1 - 2\eta.$$

Since $\max_j\|V_j\|_2 \le \sqrt{n}\max_j\|V_j\|_\infty$, with high probability, $\max_j\|V_j\|_2/\sqrt{n} = O(\sqrt{\ln(np)})$. At the same time, given $0 < c < E(Z^2)$,

$$\Pr\left\{\frac1n\min_{1\le j\le p}\|V_j\|_2^2 \le c\right\} \le p\Pr\{Z_1^2 + \cdots + Z_n^2 \le nc\} \le p\,\psi(c)^n,$$

where $\psi(c) = \inf_{t>0}E[e^{tc - tZ^2}] < 1$.
Therefore, for large $n$ and $p$, with high probability, we have $\max_j\|V_j\|_2/\sqrt{n} = O(\sqrt{\ln(np)})$ or even $O(1)$ on the one hand, and $\min_j\|V_j\|_2^2/n \ge c$ on the other, provided $\ln p = o(n)$. In particular, suppose $p = O(n^a)$ for some $a > 0$. Then it is seen that $R = O(\sqrt{\ln n})$ or even $O(1)$, and hence, by (3.6), with high probability, $\|\hat\beta - \beta\|_2 = O(\ln n/\sqrt{n})$ or $O(\sqrt{\ln p}/\sqrt{n})$.

Finally, the precision also depends on $\delta = \inf_{t\in I}\Lambda''(t)$. To see why $\delta$ matters, consider the case where $\Lambda''(t)$ is uniformly small in an interval $I$ that contains all of the $X_i^\top\beta$. This implies that $\Lambda'(t)$ changes little on $I$, so by (3.2), $E(y_1), \ldots, E(y_n)$ are close to each other, and at the same time each $y_i$ has little variation. This gives rise to a nearly "flat" plot of $y_i$ vs $X_i^\top\beta$, which makes the identification of $\beta$ difficult. That is to say, the precision of the estimate cannot be high. Certainly, if $\Lambda''(t)$ has a wide range on $I$, then using $\inf_{t\in I}\Lambda''(t)$ to set $c_r$ can be quite conservative. However, as the $X_i^\top\beta$ are unknown, it is the only way to account for all the possible values of $X_i^\top\beta$, including the least ideal one.

3.3 Logistic regression

Suppose $y_1, \ldots, y_n$ are independent Bernoulli random variables, such that

$$\Pr\{y_i = 1\} = \frac{e^{X_i^\top\beta}}{1 + e^{X_i^\top\beta}}, \quad i = 1, \ldots, n.$$

The corresponding parametric family of densities is $p_t(y) = \exp\{ty - \Lambda(t)\}$ with respect to the counting measure on $\{0,1\}$, with $\Lambda(t) = \ln(1 + e^t)$. For $i = 1, \ldots, n$, $\epsilon_i = y_i - \Pr\{y_i = 1\} \in (-1, 1)$. Therefore, by Hoeffding's inequality (Pollard 1984), (2.3) holds with $\sigma = 1$. Given $I \subset \mathbb{R}$, by direct calculation,

$$\inf_{t\in I}\Lambda''(t) = \left[2\cosh\frac{M_I}{2}\right]^{-2}, \quad \text{with } M_I = \sup_{t\in I}|t|.$$
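Both (3.2) and the identity $\inf_{t\in I}\Lambda''(t) = [2\cosh(M_I/2)]^{-2}$ for the Bernoulli family are easy to check numerically by differentiating $\Lambda(t) = \ln(1+e^t)$ with central differences. A sketch (the function names are ours):

```python
import math

def Lambda(t):
    """Log partition function of the Bernoulli family: Lambda(t) = ln(1 + e^t)."""
    return math.log1p(math.exp(t))

def dLambda2(t, h=1e-5):
    """Numerical second derivative of Lambda by central differences.
    By (3.2) this is the variance of P_t, i.e. p(1 - p) with
    p = e^t / (1 + e^t); in closed form, [2*cosh(t/2)]**-2."""
    return (Lambda(t + h) - 2.0 * Lambda(t) + Lambda(t - h)) / h ** 2
```

The closed form shows why a bounded interval $I$ matters: $\Lambda''$ decays like $e^{-|t|}$, so $\delta = \inf_{t\in I}\Lambda''(t)$ degrades quickly as $M_I$ grows.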
Given $q \in (0,1)$, let

$$c_r = \frac{12\ln(p/q)}{\nu[1+\mu(X)]}\cdot\frac{\max_j\|V_j\|_2^2}{\min_j\|V_j\|_2^2}\cdot\cosh^2\frac{M_I}{2}$$

and

$$\kappa_r = \frac{12\sqrt{2\ln(p/q)}}{\nu[1+\mu(X)]}\cdot\frac{\sqrt{n}\max_j\|V_j\|_2}{\min_j\|V_j\|_2^2}\cdot\cosh^2\frac{M_I}{2}.$$

By Theorem 3.1, if $\beta \in D(I, n(\nu)/2)$, then, with probability at least $1-2q$, (3.6) holds for the estimator

$$\hat\beta = \arg\max\left\{y^\top Xu - \sum_{i=1}^n\ln(1 + e^{X_i^\top u}) - c_r|\mathrm{spt}(u)|: u \in D(I, n(\nu)/2)\right\}.$$

If $X$ is binary, i.e., $X_{ij} = 0$ or 1, the result can be somewhat simplified. Let $\tilde X \in \mathbb{R}^{n\times(p+1)}$ be such that $\tilde X_{ij} = 2X_{ij} - 1$ for $j \le p$ and $\tilde X_{i,p+1} = 1$. Also let $\tilde\beta \in \mathbb{R}^{p+1}$ be such that $\tilde\beta_j = \beta_j/2$ for $j \le p$ and $\tilde\beta_{p+1} = \sum_{j=1}^p\beta_j/2$. Then $X_i^\top\beta = \tilde X_i^\top\tilde\beta$. Let $\tilde V_1, \ldots, \tilde V_{p+1}$ be the column vectors of $\tilde X$. Then $\|\tilde V_j\|_2 = \sqrt{n}$. If we regress $y$ on $\tilde X$ to estimate $\tilde\beta$, then

$$c_r = \frac{12\ln[(p+1)/q]}{\nu[1+\mu(\tilde X)]}\cdot\cosh^2\frac{M_I}{2}, \quad \kappa_r = \frac{12\sqrt{2\ln(p/q)}}{\nu[1+\mu(\tilde X)]}\cdot\cosh^2\frac{M_I}{2}.$$

In this example, $\mu(\tilde X)$ can be very small. If the $X_{ij}$ are i.i.d. with $\Pr\{X_{ij} = 0\} = \Pr\{X_{ij} = 1\} = 1/2$, then for any $1 \le j < k \le p+1$, $\tilde V_j^\top\tilde V_k \sim \sum_{i=1}^n\eta_i$, where the $\eta_i$ are i.i.d. with $\Pr\{\eta_i = 1\} = \Pr\{\eta_i = -1\} = 1/2$. By Hoeffding's inequality, given $t > 0$,

$$\Pr\left\{\frac{|\tilde V_j^\top\tilde V_k|}{\|\tilde V_j\|_2\|\tilde V_k\|_2} \ge \frac{t}{\sqrt{n}}\right\} = \Pr\left\{\left|\sum_{i=1}^n\eta_i\right| \ge t\sqrt{n}\right\} \le 2e^{-t^2/2}.$$

It follows that given $\delta \in (0,1)$,

$$\Pr\left\{\mu(\tilde X) \ge \sqrt{\frac2n\ln\frac{(p+1)^2}{\delta}}\right\} \le \frac{p(p+1)}{2}\Pr\left\{\frac{|\tilde V_1^\top\tilde V_2|}{\|\tilde V_1\|_2\|\tilde V_2\|_2} \ge \sqrt{\frac2n\ln\frac{(p+1)^2}{\delta}}\right\} \le \delta.$$

Therefore, with high probability, $\mu(\tilde X) = O(\sqrt{\ln p/n})$, which is very small for reasonably large $p$ and $n$.
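The recoding $\tilde X, \tilde\beta$ above, and the fact that it preserves the linear predictor while giving all columns norm $\sqrt{n}$, is easy to verify numerically. A sketch (the function name is ours):

```python
import numpy as np

def recode(X, beta):
    """Binary-design recoding of Section 3.3: X~ has entries 2*X_ij - 1
    plus an appended all-ones column, and beta~ is chosen so that the
    linear predictor is preserved: X beta = X~ beta~ row by row."""
    n, p = X.shape
    Xt = np.hstack([2.0 * X - 1.0, np.ones((n, 1))])
    bt = np.append(beta / 2.0, beta.sum() / 2.0)
    return Xt, bt
```

Since every entry of $\tilde X$ is $\pm1$, each column has exactly norm $\sqrt{n}$, which is what makes the ratio $R$ in Section 3.2 equal to 1 for this design.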
4 Least squares regression: preliminaries

4.1 Reformulation and Condition H2

Suppose that, with $X$ fixed, $y_i = f(X_i^\top\beta) + \epsilon_i$, $1 \le i \le n$, where the $\epsilon_i$ are independent with mean 0. The $L_0$ regularized LS estimator for $\beta$ is

$$\hat\beta = \arg\min_{u\in D}\left[\|y - f(Xu)\|_2^2 + c_r|\mathrm{spt}(u)|\right], \qquad (4.1)$$

where, as in (3.3), $D$ is a suitable search domain in $\mathbb{R}^p$ and $c_r$ is a regularization parameter. If $\beta \in D$, then

$$\|y - f(X\hat\beta)\|_2^2 + c_r|\mathrm{spt}(\hat\beta)| \le \|y - f(X\beta)\|_2^2 + c_r|\mathrm{spt}(\beta)|,$$

and hence

$$\|f(X\hat\beta) - f(X\beta)\|_2^2 \le 2\langle\epsilon, f(X\hat\beta) - f(X\beta)\rangle - c_r(|\mathrm{spt}(\hat\beta)| - |\mathrm{spt}(\beta)|),$$

which implies (2.2) once we define

$$G(x) = \|x\|_2^2, \quad \psi_i(z) = \varphi_i(z) = f(z), \qquad (4.2)$$

for $x \in \mathbb{R}^n$, $z \in \mathbb{R}$, and $1 \le i \le n$. By Proposition 2.1, all we need to do then is to find suitable constants $c_1$ and $c_2$ so that Conditions H1 and H2 are satisfied.

For $I \subset \mathbb{R}$ that contains at least two points, denote

$$d(f, I) = \inf\left\{\frac{|f(x) - f(y)|}{|x - y|}: x \in I,\ y \in I,\ x \ne y\right\}.$$

We start with the easier task of establishing Condition H2.

Proposition 4.1 Let $I \subset \mathbb{R}$ be an interval with positive length. Suppose $f$ is defined on $I$ with $d(f, I) > 0$. Fix $\nu \in (0,1)$. Let $D$ in (4.1) be a subset of $D(I, n(\nu)/2)$. If $\beta \in D$, then for $G$ and $\psi$ defined as in (4.2), Condition H2 is satisfied with

$$c_2 = \frac{d(f, I)^2\,\nu[1+\mu(X)]}{n}\min_{1\le j\le p}\|V_j\|_2^2.$$

As noted in Section 3.2, under mild conditions, for large $n$ and reasonably large $p$, $c_2 \asymp 1$. Therefore, by Proposition 2.1, in order for the estimator $\hat\beta$ to have some reasonable precision, the coefficient $c_1$ in Condition H1 has to be of order $o(\sqrt{n})$. To this end, depending on how well the nonlinear function $f$ behaves, some extra constraints need to be imposed on the domain $D$. Section 5 is devoted to establishing Condition H1 for the LS regression. Below we outline the steps to be taken.

4.2 Observations that point to Condition H1

Recall that Condition H1 stipulates an upper bound on $|\langle\epsilon, f(Xu) - f(X\beta)\rangle|$ that has to hold simultaneously for all $u$.
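The quantity $d(f, I)$ in Proposition 4.1 can be estimated on a grid: among grid points, the smallest difference quotient is always attained by an adjacent pair, since any longer quotient is a weighted average of adjacent ones. A sketch using the logistic $f$ of Section 3.3 as a test case (the function name and grid size are ours):

```python
import math

def d_f_I(f, a, b, m=2001):
    """Grid estimate of d(f, I) on I = [a, b]: the smallest difference
    quotient |f(x) - f(y)| / |x - y| over adjacent grid points, which
    equals the minimum over all grid pairs."""
    xs = [a + (b - a) * i / (m - 1) for i in range(m)]
    best = float("inf")
    for i in range(m - 1):
        x, y = xs[i], xs[i + 1]
        best = min(best, abs(f(x) - f(y)) / (y - x))
    return best
```

For a $C^1$ monotone $f$ this approaches $\inf_{t\in I}|f'(t)|$; e.g., for the logistic function on $[-r, r]$ that infimum is $[2\cosh(r/2)]^{-2}$, attained at the endpoints.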
If $f(x) = x$, such a bound is easy to find, due to the conjugate relation $\langle\epsilon, f(Xu) - f(X\beta)\rangle = \langle X^\top\epsilon, u - \beta\rangle$, as it then suffices to find a bound for $\|X^\top\epsilon\|_\infty$, which can be derived from the tail assumption on $\epsilon$ (Candès & Plan 2009, Zhang 2009). For nonlinear $f$, in general, there are no similar applicable relations. However, like $e^x/(1+e^x)$, in many cases $f$ is analytic, and so we may exploit its power series expansions around different points. By working with, say, $f(x) = x^2$, one could imagine a kind of power series expansion

$$f(Xu) = \sum_\alpha M_\alpha h_\alpha(u),$$

such that each $M_\alpha$ is some type of (row-wise) monomial transformation of $X$, and $h_\alpha(u)$ a vector resulting from a similar transformation of $u$. This makes it possible to rewrite $\langle\epsilon, f(Xu) - f(X\beta)\rangle$ as an infinite sum of $\langle M_\alpha^\top\epsilon, h_\alpha(u) - h_\alpha(\beta)\rangle$, which could lead to a desirable bound.

The method works if $f$ is analytic on the entire $\mathbb{C}$, or, more generally, when all the coordinates of $Xu$ and $X\beta$ fall into the disc of convergence of the power series expansion of $f$ at 0. On the other hand, when $f$ has poles, as $e^x/(1+e^x)$ does, the coordinates of $Xu$ and $X\beta$ may fall into different discs of convergence of power series expansions. Roughly, to deal with this problem, our approach is to cover the line segment connecting $Xu$ and $X\beta$ with different discs of convergence of power series, apply the result obtained for the case of a single analytic disc, and patch together the resulting bounds. This turns out to account for most of the complexity in our treatment of the analytic case.

One question is whether we can just use a finite Taylor expansion to derive bounds for $\langle\epsilon, f(Xu) - f(X\beta)\rangle$, thus dispensing with the assumption of analyticity. The answer seems to be no in general.
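For $f(x) = x^2$ the expansion alluded to above is concrete: $\alpha$ ranges over pairs $(j, k)$, $M_\alpha$ is the entrywise product of columns $j$ and $k$ of $X$, and $h_\alpha(u) = u_ju_k$. A sketch verifying $f(Xu) = \sum_\alpha M_\alpha h_\alpha(u)$ for this case (the function name is ours):

```python
import numpy as np

def square_expansion(X, u):
    """For f(x) = x^2, assemble f(Xu) as sum_alpha M_alpha * h_alpha(u):
    alpha = (j, k), M_alpha = X[:, j] * X[:, k] (a row-wise monomial
    transform of X), and h_alpha(u) = u[j] * u[k]."""
    n, p = X.shape
    out = np.zeros(n)
    for j in range(p):
        for k in range(p):
            M_alpha = X[:, j] * X[:, k]   # row-wise monomial of X
            out += M_alpha * (u[j] * u[k])
    return out
```

The point of the rewriting is that each $M_\alpha^\top\epsilon$ involves only $X$ and $\epsilon$, so the tail assumption (2.3) controls it uniformly in $u$.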
Unless $f$ is a polynomial, a finite Taylor expansion of $f(Xu) - f(X\beta)$ has a remainder term of the form $R_\alpha(u)[h_\alpha(u) - h_\alpha(\beta)]$, where $R_\alpha(u)$ is a matrix that in general depends on $u$. As a result, although for each individual $u$ we can get a bound for $\langle\epsilon^\top R_\alpha(u), h_\alpha(u) - h_\alpha(\beta)\rangle$ that holds with high probability, there is no guarantee that, with high probability, the bounds hold simultaneously for all $u$, which is needed for establishing the precision of $\hat\beta$.

5 Least squares regression: continued

5.1 Setup

Let $I \subset \mathbb{R}$ be a closed interval with positive length. In this section, we assume that $f: I\to\mathbb{R}$ is analytic in a neighborhood of $I$, i.e., $f$ has a (unique) analytic extension onto an open set in $\mathbb{C}$ containing $I$. This is equivalent to saying that $f \in C^\infty(I)$ and for each $t \in I$, there is $r > 0$, such that

$$\sum_{k=0}^\infty|a_k|r^k < \infty, \quad \text{where } a_k = \frac{f^{(k)}(t)}{k!} \in \mathbb{R},$$

and

$$f(z + t) = \sum_{k=0}^\infty a_kz^k, \quad \text{for all } z \in (-r, r) \text{ with } z + t \in I. \qquad (5.1)$$

The radius of convergence of the power series (5.1), henceforth denoted by $\rho(f, t)$, can be determined by (Rudin 1987)

$$\rho(f, t) = \left[\limsup_{k\to\infty}|a_k|^{1/k}\right]^{-1}.$$

If $|z| < \rho(f, t)$, then we say $f(z + t)$ has a convergent power series expansion at $t$.

We will regularly use the following weighted $L_1$ norm:

$$\|u\|_{1,s} = \sum_{j=1}^p|u_j|\|V_j\|_s, \quad u \in \mathbb{R}^p,\ s \ge 1. \qquad (5.2)$$

Recall that it is assumed from the beginning that $V_j \ne 0$ for all $j$. Therefore, $\|u\|_{1,s}$ is indeed a norm. Finally, if $(\mathcal{E}, \|\cdot\|)$ is a normed linear space, then denote by

$$B(u, a; \|\cdot\|) = \{v \in \mathcal{E}: \|v - u\| < a\}$$

the ball centered at $u \in \mathcal{E}$ with radius $a > 0$ under the norm $\|\cdot\|$, and by

$$\delta(E; \|\cdot\|) = \inf\{a: E \subset B(u, a; \|\cdot\|) \text{ for some } u\}$$

the infimum of the radii of balls under the norm $\|\cdot\|$ that contain $E \subset \mathcal{E}$.

5.2 Single analytic disc

We first consider the case where all of $f(X_1^\top u), \ldots$
, $f(X_n^\top u)$ have convergent power series expansions at 0. The main result of this section is as follows.

Theorem 5.1 Suppose $0 \in I$ and $d(f, I) > 0$. Fix $\nu \in (0,1)$ and $\theta \in (0,1)$. Suppose

$$D = D(I, n(\nu)/2)\cap\{u \in \mathbb{R}^p: \|u\|_{1,\infty} \le \theta\rho(f, 0)/2\}$$

in (4.1) and $\epsilon$ satisfies (2.3) for $\sigma > 0$. Given $q \in (0,1)$, let $\lambda_p = \ln[p(1 + q^{-1})]$. If $\beta \in D$, then the conclusion of Proposition 2.1 holds with

$$c_1 = \sigma\sqrt{2\lambda_p}\sum_{k=1}^\infty\left[\frac{\sqrt{k}\,|f^{(k)}(0)|}{(k-1)!}[\theta\rho(f, 0)]^{k-1}\times n^{-\frac{1}{2k}}\max_{1\le j\le p}\|V_j\|_{2k}\right],$$

and $c_2$ as in Proposition 4.1.

If $f$ is linear, then the expression of $c_1$ simplifies to

$$c_1 = \sigma\sqrt{2\lambda_p}\,|f'(0)|\max_{1\le j\le p}\|V_j\|_2/\sqrt{n}.$$

In the general case, as $n^{-\frac{1}{2k}}\max_j\|V_j\|_{2k} \le \max_j\|V_j\|_\infty$,

$$c_1 \le \sigma\sqrt{2\lambda_p}\,K\max_{1\le j\le p}\|V_j\|_\infty, \quad \text{with } K = \sum_{k=1}^\infty\frac{\sqrt{k}\,|f^{(k)}(0)|}{(k-1)!}[\theta\rho(f, 0)]^{k-1}.$$

Since $\rho(f, 0) = (\lim_k|f^{(k)}(0)/k!|^{1/k})^{-1}$ and $\theta < 1$, the series defining $K$ converges, so it is easy to see that $c_1 < \infty$. As noted in Section 3.2, under mild conditions, $\max_j\|V_j\|_\infty = O(\sqrt{\ln(np)})$. Since $\lambda_p = O(\ln p)$ and $K$ is a constant, $c_1 = O(\sqrt{\ln(np)\ln p})$. Therefore, for reasonably large $p$, such as $p = n^a$, $c_1 = O(\ln n)$. Moreover, as seen previously, under mild conditions, it is possible that $c_1 = O(\sqrt{\ln n})$. Combining the comment after Proposition 4.1, it is seen that the regression estimator (4.1) can have good precision.

5.3 Multiple analytic discs

We first need some preparation. Let $N \subset \mathbb{C}$ be an open set containing $I$ such that $f$ has an analytic extension on $N$. Let $J = N\cap\mathbb{R}$. For $u \in D(J)$, $i = 1, \ldots, n$, and $k \in \mathbb{N}$, define the functions

$$a_{ik}(u) = \frac{f^{(k)}(X_i^\top u)}{k!}, \quad A_k(u) = \max_{1\le i\le n}|a_{ik}(u)|, \quad r(u) = \min_{1\le i\le n}\rho(f, X_i^\top u). \qquad (5.3)$$

It is easy to see that $r(u) > 0$.
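The radius $\rho(f, t)$ can be approximated from finitely many Taylor coefficients via the root test, using the largest tail value of $|a_k|^{1/k}$ as a stand-in for the limit superior. A sketch (the function name is ours, and the stand-in is only sensible when $|a_k|^{1/k}$ settles down):

```python
def radius_of_convergence(coeffs):
    """Cauchy-Hadamard root test: estimate rho = 1 / limsup |a_k|^{1/k}
    from the tail (second half) of a finite list of Taylor coefficients,
    coeffs[k] = a_k = f^{(k)}(t) / k!."""
    tail = [abs(a) ** (1.0 / k)
            for k, a in enumerate(coeffs)
            if k >= len(coeffs) // 2 and a != 0]
    return 1.0 / max(tail)
```

For instance, $f(z) = 1/(1 - z/2)$ has $a_k = 2^{-k}$ and a pole at $z = 2$, so the estimate is exactly 2; a pole at distance $\rho$ from the expansion point always caps the radius this way, which is the phenomenon Section 5.3 must handle.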
Given any function $b(u)$ on $D(J)$ satisfying

$$0 < b(u) < r(u) \qquad (5.4)$$

and given any set $E \subset D(J)$, denote

$$b(E) = \inf_{u\in E}b(u), \quad r(E) = \inf_{u\in E}r(u), \quad A_k(E) = \sup_{u\in E}A_k(u). \qquad (5.5)$$

If $E$ is finite, then it is easy to see that $r(E) > b(E)$, and, by $\lim_k|a_{ik}(u)|^{1/k} = 1/\rho(f, X_i^\top u)$ for $u \in D(J)$ and $i = 1, \ldots, n$,

$$\lim_{k\to\infty}A_k(E)^{1/k} = \max_{u\in E,\,1\le i\le n}\lim_{k\to\infty}|a_{ik}(u)|^{1/k} = \frac{1}{r(E)}. \qquad (5.6)$$

Let $G$ be a subset of $D(J)$. If

$$E \subset \bigcup_{u\in G}O_u, \quad \text{with } O_u = B(u, b(u)/2; \|\cdot\|_{1,\infty}), \qquad (5.7)$$

then $G$ will be referred to as a "$b/2$-covering grid", or simply "covering grid", for $E$. By this definition, for each point $u$ in a covering grid and $i = 1, \ldots, n$, $f$ is analytic at $X_i^\top u$ with $\rho(f, X_i^\top u) > b(u)$. Note that a covering grid of $E$ need not be a subset of $E$. If $E$ is compact, it always has a finite covering grid.

Finally, for $E \subset \mathbb{R}^p$, denote

$$C(E) = \{(1-s)u + sv: s \in [0,1],\ u, v \in E\},$$

i.e., the union of all the line segments connecting pairs of points in $E$. If $E$ is bounded (resp. compact), then $C(E)$ is bounded (resp. compact). If $|\mathrm{spt}(u)| \le a$ for every $u \in E$, then $|\mathrm{spt}(v)| \le 2a$ for every $v \in C(E)$. However, $C(E)$ may not be convex, and for unbounded closed $E$, $C(E)$ may not even be closed.

After all this preparation, the main result can be stated as follows.

Theorem 5.2 Suppose $I$ is compact and $d(f, I) > 0$. Fix $\nu \in (0,1)$. In the regression (4.1), let $D$ be a closed subset of $D(I, n(\nu)/2)$. Fix $b(u)$ satisfying (5.4). Let $G$ be a finite $b/2$-covering grid of $C(D)$. Given $q \in (0,1)$, let $\lambda_p = \ln[p(1 + q^{-1})]$. If $\beta \in D$, then the conclusion of Proposition 2.1 holds with

$$c_1 = \sqrt{2}\,\sigma\sum_{k=1}^\infty\left[k\sqrt{\ln|G| + k\lambda_p}\,A_k(G)\,b(G)^{k-1}\times n^{-\frac{1}{2k}}\max_{1\le j\le p}\|V_j\|_{2k}\right] \qquad (5.8)$$

and $c_2$ as in Proposition 4.1.
To get $c_1$, it is enough to assume $D$ is a compact subset of $D(J)$. The stronger assumption that $D \subset D(I, n(\nu)/2)$ is needed in order to get both $c_1$ and $c_2$. By Proposition 2.3, $D(I, n(\nu)/2)$ is compact. Therefore, if $D \subset D(I, n(\nu)/2)$ is closed, it is compact as well.

Unlike in Theorem 5.1, here $c_1$ depends on $|G|$. In order for the regression estimator (4.1) to have good precision, $|G|$ has to be controlled. The smaller $|G|$ is, the higher the precision we can claim for $\hat\beta$. To see what might be an acceptable level of $|G|$, observe that

$$c_1 \le \sqrt{2}\,\sigma K\sqrt{\ln|G| + \lambda_p}\max_{1\le j\le p}\|V_j\|_\infty = O\left(\sqrt{\ln(p|G|)}\max_j\|V_j\|_\infty\right),$$

where $K = \sum_k k^{3/2}A_k(G)b(G)^{k-1}$ is finite by (5.6). From the comment after Proposition 4.1, it is seen that $\hat\beta$ has good precision if $\sqrt{\ln(p|G|)}\max_j\|V_j\|_\infty = o(\sqrt{n})$. Provided $\max_j\|V_j\|_\infty = O(\sqrt{\ln(np)})$ and $p = n^a$, this implies there should be $\ln|G| = o(n/\ln n)$. Certainly, $|G|$ depends on the choice of the search domain $D$ in (4.1) and the properties of $f$. We next get some upper bounds on $|G|$.

5.4 Upper bounds on the cardinality of covering grids

We follow the notation in Section 5.3. Recall that $f$ is analytic on some open domain $N \subset \mathbb{C}$ containing $I = [a, b]$, and $J = N\cap\mathbb{R}$. The next result says that $|G|$ can be as small as 1 in Theorem 5.2. It follows directly from the definition of covering grid.

Proposition 5.3 Let $D \subset B(w, d/2; \|\cdot\|_{1,\infty})$ for some $w \in D(J)$ and $0 < d < r(w)$. Then for any $b$ satisfying (5.4) and $d < b(w)$, $\{w\}$ is a $b/2$-covering grid for $C(D)$.

As an example, if $f$ is analytic in a neighborhood of 0 and $\|u\|_{1,\infty} \le \theta\rho(f, 0)/2$ for all $u \in D$, where $0 < \theta < 1/2$, then, since $r(0) = \rho(f, 0)$, $\{0\}$ is a $b/2$-covering grid of $C(D)$ for any $b$ satisfying (5.4) with $b(0) > \theta\rho(f, 0)$. We next consider more general cases.
For ease of notation, for $E \subset \mathbb{R}^p$ and $S \subset \{1, \ldots, p\}$, denote $\delta(E) = \delta(E; \|\cdot\|_{1,\infty})$ and $E_S = \{u \in E: \mathrm{spt}(u) \subset S\}$.

Proposition 5.4 Fix $b(u)$ satisfying (5.4) and $h \in \mathbb{N}$. Let $D \subset D(I, h/2)$ be compact and $K = C(D)$.

(1) If $J = \mathbb{R}$ and $\bar d_b := \inf_{u\in D(J)}b(u) > 0$, then $K$ has a $b/2$-covering grid with cardinality no greater than

$$\sum_{|S|=h:\,K_S\ne\emptyset}\left[2\delta(K_S)/\bar d_b + 1\right]^h \le \binom{p}{h}\left[2\delta(D)/\bar d_b + 1\right]^h.$$

(2) In general, if $d_b := \inf_{u\in D(I,h)}b(u) > 0$, then $K$ has a $b/2$-covering grid with cardinality no greater than

$$\sum_{|S|=h:\,K_S\ne\emptyset}\left[4\delta(K_S)/d_b + 1\right]^h \le \binom{p}{h}\left[4\delta(D)/d_b + 1\right]^h.$$

Note that, since $I$ is compact, $\inf_{u\in D(I,h)}r(u) \ge \inf_{x\in I}\rho(f, x) > 0$, so there always exist functions $b(u)$ satisfying (5.4) with $d_b > 0$; for example, $b(u) = r(u)/2$.

Finally, in Theorem 5.2, $c_1$ depends on the choice of $G$, so it may not be easy to use. Using the above bounds on $|G|$, we have some more convenient choices for $c_1$, although they are larger than the one in (5.8).

Proposition 5.5 Let $D$ be a compact subset of $D(I, h/2)$ in the regression (4.1).

(1) Let $\bar d_k = \sup_{x\in J}|f^{(k)}(x)|/k!$ and $\bar\rho_0 = \inf_{x\in J}\rho(f, x)$. Suppose $J = \mathbb{R}$, $\bar\rho_0 > 0$, and for any $\bar\rho_1 \in (0, \bar\rho_0)$, $\sup_{|\mathrm{Im}(z)|\le\bar\rho_1}|f'(z)| < \infty$. Then the radius of convergence of $\sum_{k\ge1}\bar d_kz^k$ is $\bar\rho_0$, and given $\bar\rho_1 \in (0, \bar\rho_0)$, $c_1$ in (5.8) can be set equal to

$$c_1 = \sqrt{2}\,\sigma\sum_{k=1}^\infty\left[k\sqrt{h\ln(p\bar Q) + k\lambda_p}\,\bar d_k\,\bar\rho_1^{\,k-1}\times n^{-\frac{1}{2k}}\max_{1\le j\le p}\|V_j\|_{2k}\right], \qquad (5.9)$$

where $\bar Q = 2\delta(D)/\bar\rho_1 + 1$.

(2) Let $d_k = \sup_{x\in I}|f^{(k)}(x)|/k!$ and $\rho_0 = \inf_{x\in I}\rho(f, x)$.
Then $\rho_0 > 0$ is equal to the radius of convergence of $\sum_{k \ge 1} d_k z^k$, and given $\rho_1 \in (0, \rho_0)$, $c_1$ in (5.8) can be set equal to
$$
c_1 = \sqrt{2}\,\sigma \sum_{k=1}^\infty k \sqrt{h \ln(p Q) + k \lambda_p}\; d_k\, \rho_1^{\,k-1} \times n^{-\frac{1}{2k}} \max_{1 \le j \le p} \|V_j\|_{2k}, \qquad (5.10)
$$
where $Q = 4\delta(D)/\rho_1 + 1$.

In (5.9), because the radius of convergence of $\sum_{k \ge 1} \bar d_k z^k$ is $\bar\rho_0 > \bar\rho_1$, $c_1 < \infty$. As $\lambda_p = \ln[p(1 + q^{-1})]$, $c_1 = O(\sqrt{h \ln p}\, \max_j \|V_j\|_\infty)$. Therefore, under mild conditions, for large $n$, as long as $h$ is not too large, the regression (4.1) still has good precision.

5.5 Logistic regression with binary noise

Let $y_1, \dots, y_n$ be the same random variables as in Section 3.3. However, we only observe their randomly "flipped" versions $z_1, \dots, z_n \in \{0, 1\}$, such that
$$
\Pr\{z_1, \dots, z_n \mid y_1, \dots, y_n\} = \prod_{i=1}^n p_{y_i z_i},
$$
where $p_{ab} \ge 0$ and $p_{a0} + p_{a1} = 1$ for $a = 0, 1$. Suppose all $p_{ab}$ are known. The regression model now is $E(z_i) = f(X_i^\top \beta)$ with
$$
f(t) = \frac{p_{01} + p_{11} e^t}{1 + e^t}.
$$
If $p_{01} = p_{11}$, then $z_i$ is independent of $y_i$ with $\Pr\{z_i = 1\} = p_{11}$, making inference impossible. Therefore, we will assume $\Delta_p = |p_{11} - p_{01}| > 0$.

Since $f$ is analytic on $\mathbb{C} \setminus \{t_k,\ k \in \mathbb{Z}\}$, where $t_k = (2k+1)\pi i$, we shall apply Proposition 5.5(1). First, since $\epsilon_i = z_i - E(z_i)$ are independent and $|\epsilon_i| \le 1$, they satisfy the tail assumption (2.3) with $\sigma = 1$. Since $\rho(f, x)$ is the distance from $x$ to the closest pole, for any $x \in \mathbb{R}$, $\rho(f, x) = \pi$, and hence $\bar\rho_0 = \pi$. Simple calculation gives $f'(t) = (p_{11} - p_{01})[2\cosh(t/2)]^{-2}$. By $2|\cosh(a + bi)| \ge e^{|a|} - e^{-|a|}$ for $a, b \in \mathbb{R}$, it is easy to see that for $y \in (0, \pi)$,
$$
M(y) := \sup_{|\mathrm{Im}\, z| \le y} |2\cosh(z/2)|^{-2} < \infty.
$$
Fix $\bar\rho_1 \in (0, \pi)$, $r > 0$, and $\nu \in (0, 1)$. Let $I = [-r, r]$ and $D = \{u \in \mathbb{R}^p : \|u\|_{1,\infty} \le r,\ |\mathrm{spt}(u)| \le n(\nu)/2\}$.
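To make the flipped-observation model concrete, here is a minimal simulation sketch (NumPy assumed; the flip probabilities, design, and $\beta$ are toy values, not from the paper) confirming empirically that $E(z_i) = f(X_i^\top \beta)$ with $f(t) = (p_{01} + p_{11} e^t)/(1 + e^t)$:

```python
import numpy as np

rng = np.random.default_rng(0)
p01, p11 = 0.1, 0.8                 # flip probabilities Pr{z=1|y=0}, Pr{z=1|y=1}
beta = np.array([1.0, -0.5, 0.0])   # toy coefficient vector
n, trials = 4, 200_000

X = rng.choice([-1.0, 1.0], size=(n, 3))  # +/-1 design, as in Section 3.3
t = X @ beta
py1 = 1.0 / (1.0 + np.exp(-t))            # Pr{y_i = 1} under the logistic model

# Simulate y, then flip each response: Pr{z=1 | y=a} = p_{a1}
y = rng.random((trials, n)) < py1
z = np.where(y, rng.random((trials, n)) < p11, rng.random((trials, n)) < p01)

# Induced regression function f(t) = (p01 + p11 e^t) / (1 + e^t)
f = (p01 + p11 * np.exp(t)) / (1.0 + np.exp(t))
assert np.allclose(z.mean(axis=0), f, atol=0.01)
```

The Monte Carlo mean of each $z_i$ over repeated trials estimates $E(z_i)$, which matches $f(X_i^\top \beta)$ up to sampling error; setting `p01 = p11` makes `f` constant, illustrating why $\Delta_p > 0$ is required.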
Apparently, $D \subset D(I, n(\nu)/2)$ and $\delta(D) \le r$, where, as in Proposition 5.5, $\delta(D) = \delta(D; \|\cdot\|_{1,\infty})$. Let $\theta \in (\bar\rho_1/\pi, 1)$. For any $x \in \mathbb{R}$ and $k \ge 1$, by Cauchy's contour integral,
$$
\frac{|f^{(k)}(x)|}{k!} \le \frac{1}{2k\pi} \oint_{|z - x| = \bar\rho_1/\theta} \frac{|f'(z)|\,|dz|}{(\bar\rho_1/\theta)^k} \le \frac{\Delta_p M(\bar\rho_1/\theta)}{k (\bar\rho_1/\theta)^{k-1}},
$$
giving $\bar d_k \le \Delta_p M(\bar\rho_1/\theta)/[k(\bar\rho_1/\theta)^{k-1}]$. Therefore, by Proposition 5.5(1),
$$
c_1 \le \sqrt{2}\,\Delta_p M(\bar\rho_1/\theta) \sum_{k=1}^\infty \sqrt{R + k \lambda_p}\; \theta^{k-1} \times n^{-\frac{1}{2k}} \max_{1 \le j \le p} \|V_j\|_{2k},
$$
where $R = n(\nu)/2 \times \ln(2rp/\bar\rho_1 + p)$. On the other hand, given $r > 0$,
$$
m(r) := \inf_{x \in [-r, r]} |2\cosh(x/2)|^{-2} > 0.
$$
Therefore, by Proposition 4.1,
$$
c_2 \ge \frac{\Delta_p^2\, m(r)^2\, \nu [1 + \mu(X)]}{n} \min_{1 \le j \le p} \|V_j\|_2^2.
$$
As in Section 3.3, if all the entries of $X$ are $\pm 1$, then the results can be simplified so that $D = \{u \in \mathbb{R}^p : \sum_i |u_i| \le r,\ |\mathrm{spt}(u)| \le n(\nu)/2\}$, and
$$
c_1 \le \sqrt{2}\,\Delta_p M(\bar\rho_1/\theta) \sum_{k=1}^\infty \sqrt{R + k \lambda_p}\; \theta^{k-1}, \qquad c_2 \ge \Delta_p^2\, m(r)^2\, \nu [1 + \mu(X)].
$$

6 Technical details

6.1 Preliminary results

Proof of Proposition 2.2. (1) Let $S = \mathrm{spt}(u)$. If $|S| = 0$, then $u = 0$ and the inequality trivially holds. Suppose $|S| \ge 1$. Since $Xu = \sum_{j \in S} u_j V_j$,
$$
\|Xu\|_2^2 = \sum_{j \in S} |u_j|^2 \|V_j\|_2^2 + \sum_{i, j \in S,\, i \ne j} u_i u_j V_i^\top V_j
\ge \sum_{j \in S} |u_j|^2 \|V_j\|_2^2 - \mu(X) \sum_{i, j \in S,\, i \ne j} |u_i| |u_j| \|V_i\|_2 \|V_j\|_2
= [1 + \mu(X)] \sum_{j \in S} |u_j|^2 \|V_j\|_2^2 - \mu(X) \Big( \sum_{j \in S} |u_j| \|V_j\|_2 \Big)^2.
$$
By the Cauchy–Schwarz inequality,
$$
\|Xu\|_2^2 \ge \big(1 + \mu(X) - \mu(X)|S|\big) \sum_{j \in S} |u_j|^2 \|V_j\|_2^2.
$$
Since $|S| \le n(\nu) = (1 - \nu)[1 + 1/\mu(X)]$, then $1 + \mu(X) - \mu(X)|S| \ge \nu[1 + \mu(X)]$, which implies the desired inequality.

(2) By $\mathrm{spt}(u - v) \subset \mathrm{spt}(u) \cup \mathrm{spt}(v)$ and the assumption, $|\mathrm{spt}(u - v)| \le n(\nu)$. The inequality then follows from (1).

Proof of Proposition 2.3.
(1) Because $I$ is closed and the mapping $T : u \mapsto Xu$ is continuous, $D(I) = T^{-1}(I^n)$ is closed. Also, $V_h := \{u \in \mathbb{R}^p : |\mathrm{spt}(u)| \le h\}$ is closed. Thus $D(I, h) = D(I) \cap V_h$ is closed. It is easy to see that $D(I, h) \subset D(I, h')$ when $h < h'$.

(2) Because of (1), to show that $D(I, h)$ is compact for $h < n(0)$, it suffices to show the set is bounded. Since $h < n(0)$, there is $\nu \in (0, 1)$ such that $h \le n(\nu)$. Let $u \in D(I, h)$. Then $|\mathrm{spt}(u)| \le n(\nu)$, so by Proposition 2.2,
$$
\|u\|_2^2 \le \frac{\|Xu\|_2^2}{\nu(1 + \mu(X)) \min_{1 \le j \le p} \|V_j\|_2^2}.
$$
Since $X_i^\top u \in I$ for each $i$, then $\|Xu\|_2^2 \le n \max_i |X_i^\top u|^2 \le n \sup_{x \in I} |x|^2$. Because $I$ is bounded, it is seen that $\|u\|_2^2$ is bounded for $u \in D(I, h)$.

6.2 Exponential linear models

In this section, we prove the next two lemmas.

Lemma 6.1 Condition H1 is satisfied by $\varphi = (\varphi_1, \dots, \varphi_n)$ with
$$
c_1 = \sigma \sqrt{\frac{\ln(p/q)}{2n}} \max_{1 \le j \le p} \|V_j\|_2. \qquad (6.1)
$$

Lemma 6.2 Condition H2 is satisfied by $G$ and $\psi = (\psi_1, \dots, \psi_n)$ with
$$
c_2 = \frac{\nu \delta [1 + \mu(X)]}{2n} \min_{1 \le j \le p} \|V_j\|_2^2. \qquad (6.2)
$$

By Proposition 2.1, if $c_r = 3c_1^2/c_2$ in (3.3), then (3.6) holds with $\kappa_r = 3c_1/c_2$. Therefore, once the lemmas are proved, we get the expressions of $c_r$ and $\kappa_r$ as in Theorem 3.1. As in (3.4), let $G(x) = x_1 + \cdots + x_n$ for $x \in \mathbb{R}^n$, and $\varphi_i(z) = z/2$, $\psi_i(z) = \Lambda(z) - \Lambda'(X_i^\top \beta) z$ for $1 \le i \le n$ and $z \in \mathbb{R}$.

Proof of Lemma 6.1. By (2.3) and $\epsilon^\top V_j = \sum_{i=1}^n X_{ij} \epsilon_i$,
$$
\Pr\Big\{ |\epsilon^\top V_j| \le \sqrt{2 \ln(p/q)}\, \sigma \|V_j\|_2,\ \text{all } j = 1, \dots, p \Big\}
\ge 1 - \sum_{j=1}^p \Pr\Big\{ |\epsilon^\top V_j|^2 > 2 \ln(p/q)\, \sigma^2 \|V_j\|_2^2 \Big\} \ge 1 - 2q.
$$
Consequently, with probability at least $1 - 2q$,
$$
\|X^\top \epsilon\|_\infty = \max_{1 \le j \le p} |\epsilon^\top V_j| \le \sqrt{2 \ln(p/q)}\, \sigma \max_{1 \le j \le p} \|V_j\|_2 = 2 c_1 \sqrt{n},
$$
which implies Condition H1 due to the fact that for all $u \in \mathbb{R}^p$,
$$
|\langle \epsilon, \varphi(Xu) - \varphi(X\beta) \rangle| = \tfrac{1}{2} |\langle \epsilon, Xu - X\beta \rangle| = \tfrac{1}{2} |(X^\top \epsilon)^\top (u - \beta)| \le \tfrac{1}{2} \|X^\top \epsilon\|_\infty \|u - \beta\|_1.
$$

Proof of Lemma 6.2. Given $u \in D(I, n(\nu)/2)$, for $t \in [0, 1]$, let
$$
h(t) = \sum_{i=1}^n \psi_i\big((1 - t) X_i^\top \beta + t X_i^\top u\big),
$$
which is well-defined as $(1 - t) X_i^\top \beta + t X_i^\top u \in I$. Let $\Delta = G(\psi(Xu) - \psi(X\beta))$. Then
$$
\Delta = \sum_{i=1}^n \big[\psi_i(X_i^\top u) - \psi_i(X_i^\top \beta)\big] = h(1) - h(0).
$$
Observe that $\psi_i'(X_i^\top \beta) = 0$. Then $h'(0) = \sum_i X_i^\top (u - \beta)\, \psi_i'(X_i^\top \beta) = 0$, so by Taylor expansion, $\Delta = h''(\tau)/2$ for some $\tau \in (0, 1)$. By $\psi_i''(z) = \Lambda''(z)$ and $\inf_{t \in I} \Lambda''(t) = \delta > 0$,
$$
\Delta = \frac{1}{2} \sum_{i=1}^n [X_i^\top (u - \beta)]^2\, \psi_i''\big((1 - \tau) X_i^\top \beta + \tau X_i^\top u\big) \ge \frac{\delta}{2} \sum_{i=1}^n [X_i^\top (u - \beta)]^2 = \frac{\delta \|X(u - \beta)\|_2^2}{2}.
$$
By $|\mathrm{spt}(u - \beta)| \le |\mathrm{spt}(u) \cup \mathrm{spt}(\beta)| \le n(\nu)$ and Proposition 2.2,
$$
\Delta \ge \frac{\delta \nu [1 + \mu(X)]}{2} \sum_{j=1}^p |u_j - \beta_j|^2 \|V_j\|_2^2 \ge \frac{\delta \nu [1 + \mu(X)]}{2} \min_{1 \le j \le p} \|V_j\|_2^2 \times \|u - \beta\|_2^2,
$$
and so Condition H2 is satisfied with $c_2$ set as in (6.2).

6.3 Proofs for LS regression: the case of a single analytic disc

First, we establish Condition H2.

Proof of Proposition 4.1. For $i = 1, \dots, n$ and $u \in D$, since $X_i^\top \beta \in I$ and $X_i^\top u \in I$,
$$
\|f(Xu) - f(X\beta)\|_2^2 = \sum_{i=1}^n |f(X_i^\top u) - f(X_i^\top \beta)|^2 \ge \sum_{i=1}^n d(f, I)^2 |X_i^\top u - X_i^\top \beta|^2 = d(f, I)^2 \|X(u - \beta)\|_2^2.
$$
Since $|\mathrm{spt}(u - \beta)| \le |\mathrm{spt}(u)| + |\mathrm{spt}(\beta)| \le n(\nu)$, then by Proposition 2.2,
$$
\|f(Xu) - f(X\beta)\|_2^2 \ge d(f, I)^2\, \nu [1 + \mu(X)] \min_{1 \le j \le p} \|V_j\|_2^2 \times \|u - \beta\|_2^2.
$$
Because the right-hand side is $c_2 n \|u - \beta\|_2^2$, the proof is complete.
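Both proofs above lean on the coherence inequality of Proposition 2.2. As a sanity check, the following sketch (NumPy assumed; toy dimensions, not from the paper) draws a random $\pm 1$ design, computes the mutual coherence $\mu(X)$ and $n(\nu)$, and verifies $\|Xu\|_2^2 \ge \nu[1+\mu(X)] \sum_j |u_j|^2 \|V_j\|_2^2$ for random $u$ with $|\mathrm{spt}(u)| \le n(\nu)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.choice([-1.0, 1.0], size=(n, p))  # columns are V_1, ..., V_p
norms = np.linalg.norm(X, axis=0)

# Mutual coherence mu(X): largest normalized off-diagonal inner product
Gram = (X.T @ X) / np.outer(norms, norms)
mu = np.max(np.abs(Gram - np.eye(p)))

nu = 0.5
n_nu = (1 - nu) * (1 + 1 / mu)            # n(nu) as defined in the paper

for _ in range(100):
    k = rng.integers(1, int(n_nu) + 1)    # support size at most n(nu)
    S = rng.choice(p, size=k, replace=False)
    u = np.zeros(p)
    u[S] = rng.normal(size=k)
    lhs = np.linalg.norm(X @ u) ** 2
    rhs = nu * (1 + mu) * np.sum(u[S] ** 2 * norms[S] ** 2)
    assert lhs >= rhs - 1e-9              # inequality of Proposition 2.2
```

The inequality is deterministic given the support constraint, so the assertion holds for every draw; note how small $n(\nu)$ is even for moderately coherent designs, which is the price of a bound based only on $\mu(X)$.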
The main result in this section is Proposition 6.5, which together with Proposition 4.1 immediately leads to Theorem 5.1. For brevity, in the rest of this section, we shall denote $\Pi = \{1, \dots, p\}$.

6.3.1 Power series expansion and tail assumption

To facilitate subsequent discussions, we first consider
$$
\varphi(x) = (\varphi_1(x_1), \dots, \varphi_n(x_n))^\top, \qquad x = (x_1, \dots, x_n)^\top \in \mathbb{R}^n,
$$
where $\varphi_1, \dots, \varphi_n$ are real-valued functions that may be different from each other. Suppose each $\varphi_i$ can be analytically extended to a neighborhood of 0 in $\mathbb{C}$. Let
$$
a_{ik} = \frac{\varphi_i^{(k)}(0)}{k!} \in \mathbb{R}. \qquad (6.3)
$$
Then $\rho(\varphi_i, 0) = (\limsup_k |a_{ik}|^{1/k})^{-1}$. Since we are interested in $\varphi(Xu) - \varphi(Xv)$ instead of $\varphi(Xu)$ itself, without loss of generality, let $\varphi_i(0) = 0$.

For a vector $v = (v_1, \dots, v_p)^\top$ and a $k$-tuple $\alpha = (\alpha_1, \dots, \alpha_k) \in \Pi^k$, denote by $v_\alpha$ the product of $v_{\alpha_1}, \dots, v_{\alpha_k}$. For example, if $p = 3$ and $k = 4$, then $v_{(1,3,1,2)} = v_1 v_3 v_1 v_2 = v_1^2 v_2 v_3$. With this notation, for $i = 1, \dots, n$, $X_{i\alpha} = X_{i\alpha_1} \cdots X_{i\alpha_k}$. For each $j = 1, \dots, p$, let $n_j(\alpha) = |\{i : \alpha_i = j\}|$. Clearly, $n_1(\alpha) + \cdots + n_p(\alpha) = k$. By (6.3), for $i = 1, \dots, n$, provided $|X_i^\top u| < \rho(\varphi_i, 0)$,
$$
\varphi_i(X_i^\top u) = \sum_{k=1}^\infty a_{ik} (X_i^\top u)^k = \sum_{k=1}^\infty a_{ik} \sum_{\alpha \in \Pi^k} X_{i\alpha} u_\alpha.
$$
Therefore, if $|X_i^\top u| < \rho(\varphi_i, 0)$ for all $i$, then
$$
\langle \epsilon, \varphi(Xu) \rangle = \sum_{i=1}^n \epsilon_i \varphi_i(X_i^\top u) = \sum_{k=1}^\infty \sum_{\alpha \in \Pi^k} u_\alpha \Big( \sum_{i=1}^n \epsilon_i a_{ik} X_{i\alpha} \Big). \qquad (6.4)
$$

Lemma 6.3 Suppose $\epsilon$ satisfies (2.3). Let $q_1, q_2, \ldots \ge 0$ with $q := \sum_k q_k < 1/2$. Given real numbers $\theta_{ik}$, $1 \le i \le n$, $k \ge 1$, consider the condition
$$
\Big| \sum_{i=1}^n \epsilon_i \theta_{ik} X_{i\alpha} \Big| \le \sigma \sqrt{2 \ln(p^k/q_k)} \sqrt{\sum_{i=1}^n \theta_{ik}^2 X_{i\alpha}^2}, \qquad (6.5)
$$
where $\sigma$ is the constant in (2.3) and $\ln 0$ is defined to be $-\infty$. Then
$$
\Pr\big\{ (6.5) \text{ holds for all } k \ge 1 \text{ and } \alpha \in \Pi^k \big\} \ge 1 - 2q. \qquad (6.6)
$$

Proof.
The left-hand side of (6.6) is at least
$$
1 - \sum_{k=1}^\infty \sum_{\alpha \in \Pi^k} \Pr\{(6.5) \text{ does not hold for } k \text{ and } \alpha\}.
$$
Since $|\Pi^k| = p^k$, it suffices to show that for each $k$ and $\alpha = (\alpha_1, \dots, \alpha_k)$,
$$
\Pr\Big\{ \Big| \sum_{i=1}^n \epsilon_i \theta_{ik} X_{i\alpha} \Big|^2 > 2\sigma^2 \ln(p^k/q_k) \sum_{i=1}^n \theta_{ik}^2 X_{i\alpha}^2 \Big\} \le 2 p^{-k} q_k, \qquad (6.7)
$$
which directly follows from (2.3).

6.3.2 Establishing Condition H1

Recall the following multinomial formula: for any $j = 1, \dots, p$,
$$
\sum_{\alpha \in \Pi^k} n_j(\alpha)\, x_j^{n_j(\alpha) - 1} \prod_{s \ne j} x_s^{n_s(\alpha)} = k (x_1 + \cdots + x_p)^{k-1}, \qquad (6.8)
$$
as the left-hand side is equal to
$$
\sum_{k_1 + \cdots + k_p = k} \binom{k}{k_1, \dots, k_p} k_j\, x_j^{k_j - 1} \prod_{s \ne j} x_s^{k_s} = \frac{\partial}{\partial x_j} \sum_{k_1 + \cdots + k_p = k} \binom{k}{k_1, \dots, k_p} x_j^{k_j} \prod_{s \ne j} x_s^{k_s} = \frac{\partial (\sum_i x_i)^k}{\partial x_j}.
$$
For each $j = 1, \dots, p$, let
$$
\omega_{jk} = a_{1k}^2 X_{1j}^{2k} + \cdots + a_{nk}^2 X_{nj}^{2k}. \qquad (6.9)
$$

Lemma 6.4 Suppose that, with $\theta_{ik} = a_{ik}$, (6.5) holds for all $k \ge 1$ and $\alpha \in \Pi^k$. Given $u$ and $v$, let $d_j = |u_j - v_j|$ and $m_j = |u_j| \vee |v_j|$ for $j = 1, \dots, p$. If
$$
\sum_{j=1}^p m_j \max_{1 \le i \le n} \frac{|X_{ij}|}{\rho(\varphi_i, 0)} < 1, \qquad (6.10)
$$
then, letting $\xi = \langle \epsilon, \varphi(Xu) - \varphi(Xv) \rangle$,
$$
|\xi| \le \sigma \sqrt{2} \sum_{k=1}^\infty k \sqrt{\ln(p^k/q_k)} \Big( \sum_{j=1}^p m_j \omega_{jk}^{\frac{1}{2k}} \Big)^{k-1} \sum_{j=1}^p d_j \omega_{jk}^{\frac{1}{2k}}. \qquad (6.11)
$$

Proof. By (6.10), for any $i$,
$$
|X_i^\top u| \le \sum_{j=1}^p |u_j X_{ij}| \le \rho(\varphi_i, 0) \sum_{j=1}^p \frac{m_j |X_{ij}|}{\rho(\varphi_i, 0)} < \rho(\varphi_i, 0),
$$
and likewise $|X_i^\top v| < \rho(\varphi_i, 0)$. Therefore, by (6.4),
$$
\xi = \sum_{k=1}^\infty \sum_{\alpha \in \Pi^k} \Big[ (u_\alpha - v_\alpha) \sum_{i=1}^n \epsilon_i a_{ik} X_{i\alpha} \Big].
$$
By the assumption, (6.5) holds with $\theta_{ik} = a_{ik}$ for all $k \ge 1$ and $\alpha \in \Pi^k$. Thus
$$
|\xi| \le \sum_{k=1}^\infty \sum_{\alpha \in \Pi^k} |u_\alpha - v_\alpha| \Big| \sum_{i=1}^n \epsilon_i a_{ik} X_{i\alpha} \Big| \le \sum_{k=1}^\infty \sigma \sqrt{2 \ln(p^k/q_k)} \sum_{\alpha \in \Pi^k} |u_\alpha - v_\alpha| \sqrt{M_\alpha}, \qquad (6.12)
$$
where $M_\alpha = \sum_{i=1}^n a_{ik}^2 X_{i\alpha}^2$. Given $k \ge 1$, for each $\alpha \in \Pi^k$, by $n_1(\alpha) + \cdots + n_p(\alpha) = k$ and the Cauchy–Schwarz inequality,
$$
M_\alpha = \sum_{i=1}^n a_{ik}^2 \prod_{j=1}^p X_{ij}^{2 n_j(\alpha)} \le \prod_{j=1}^p \Big( \sum_{i=1}^n a_{ik}^2 X_{ij}^{2k} \Big)^{n_j(\alpha)/k} \le \prod_{j=1}^p \omega_{jk}^{n_j(\alpha)/k},
$$
where the last inequality is due to the notation in (6.9). On the other hand,
$$
|u_\alpha - v_\alpha| = \Big| \prod_{j=1}^p u_j^{n_j(\alpha)} - \prod_{j=1}^p v_j^{n_j(\alpha)} \Big| \le \sum_{j=1}^p |u_j^{n_j(\alpha)} - v_j^{n_j(\alpha)}| \prod_{s=1}^{j-1} |v_s|^{n_s(\alpha)} \prod_{s=j+1}^p |u_s|^{n_s(\alpha)} \le \sum_{j=1}^p n_j(\alpha)\, d_j\, m_j^{n_j(\alpha) - 1} \prod_{s \ne j} m_s^{n_s(\alpha)}.
$$
Therefore,
$$
\sum_{\alpha \in \Pi^k} |u_\alpha - v_\alpha| \sqrt{M_\alpha} \le \sum_{\alpha \in \Pi^k} \sum_{j=1}^p n_j(\alpha)\, d_j\, m_j^{n_j(\alpha) - 1} \prod_{s \ne j} m_s^{n_s(\alpha)} \prod_{j=1}^p \omega_{jk}^{n_j(\alpha)/(2k)} = \sum_{j=1}^p d_j \omega_{jk}^{\frac{1}{2k}} \sum_{\alpha \in \Pi^k} n_j(\alpha) \Big( m_j \omega_{jk}^{\frac{1}{2k}} \Big)^{n_j(\alpha) - 1} \prod_{s \ne j} \Big( m_s \omega_{sk}^{\frac{1}{2k}} \Big)^{n_s(\alpha)} = k \Big( \sum_{j=1}^p m_j \omega_{jk}^{\frac{1}{2k}} \Big)^{k-1} \sum_{j=1}^p d_j \omega_{jk}^{\frac{1}{2k}},
$$
where the last equality is due to the multinomial formula (6.8). Now by (6.12), the inequality in (6.11) is proved.

Proposition 6.5 Fix $\theta \in (0, 1)$. Let $D = \{u \in \mathbb{R}^p : \|u\|_{1,\infty} \le \theta \rho(f, 0)/2\}$ in Condition H1 and let $\epsilon$ satisfy (2.3) for $\sigma > 0$. If $\beta \in D$, then Condition H1 is satisfied by setting $c_1$ as in Theorem 5.1.

Proof. We have $\varphi_i = f$ and $\rho(\varphi_i, 0) = \rho(f, 0)$. For $u \in D$, let $d = u - \beta$ and $m = (m_1, \dots, m_p)^\top$, with $m_j = |u_j| \vee |\beta_j|$. Then
$$
\|m\|_{1,\infty} \le \|u\|_{1,\infty} + \|\beta\|_{1,\infty} \le \theta \rho(f, 0). \qquad (6.13)
$$
As a result,
$$
\sum_{j=1}^p m_j \max_{1 \le i \le n} \frac{|X_{ij}|}{\rho(f, 0)} = \frac{\|m\|_{1,\infty}}{\rho(f, 0)} \le \theta,
$$
and (6.10) is satisfied. Let $q_k = (\frac{q}{1+q})^k$. Then $\sum_k q_k = q$, so by Lemmas 6.3 and 6.4, with probability at least $1 - 2q$, (6.11) holds. For each $k \ge 1$, by the notation in (6.9), $\omega_{jk} = (|f^{(k)}(0)|/k!)^2 \|V_j\|_{2k}^{2k}$. Recall that in Theorem 5.1, $\lambda_p$ is defined to be $\ln[p(1 + q^{-1})]$. Since $\sqrt{\ln(p^k/q_k)} = \sqrt{k \lambda_p}$,
$$
k \sqrt{\ln(p^k/q_k)} \Big( \sum_{j=1}^p m_j \omega_{jk}^{\frac{1}{2k}} \Big)^{k-1} \sum_{j=1}^p d_j \omega_{jk}^{\frac{1}{2k}} = \sqrt{k \lambda_p}\, \frac{|f^{(k)}(0)|}{(k-1)!} \times \|m\|_{1,2k}^{k-1} \|d\|_{1,2k}, \qquad (6.14)
$$
where the weighted $L_1$ norm $\|\cdot\|_{1,s}$ is defined in (5.2) and satisfies $\|u\|_{1,s} \le n^{1/s} \|u\|_{1,\infty}$ and $\|u\|_{1,s} \le \max_{1 \le j \le p} \|V_j\|_s \times \|u\|_1$. Then by (6.13),
$$
\|m\|_{1,2k}^{k-1} \|d\|_{1,2k} \le \big( n^{\frac{1}{2k}} \|m\|_{1,\infty} \big)^{k-1} \times \max_{1 \le j \le p} \|V_j\|_{2k} \times \|d\|_1 \le \sqrt{n}\, [\theta \rho(f, 0)]^{k-1} \times n^{-\frac{1}{2k}} \max_{1 \le j \le p} \|V_j\|_{2k} \times \|d\|_1.
$$
Together with (6.11) and (6.14), this yields the proof.

6.4 LS regression: the multiple analytic disc case

6.4.1 Proof of Theorem 5.2

We first restate Lemma 6.3 as follows.

Lemma 6.6 Let $\epsilon$ satisfy (2.3). Let $E \subset D(J)$ be finite, and for $k \ge 1$ and $u \in E$, let $q_{k,u} \ge 0$ be such that $q := \sum_k \sum_{u \in E} q_{k,u} < 1/2$. Consider the condition
$$
\Big| \sum_{i=1}^n \epsilon_i a_{ik}(u) X_{i\alpha} \Big| \le \sigma \sqrt{2 \ln(p^k/q_{k,u})} \sqrt{\sum_{i=1}^n a_{ik}(u)^2 X_{i\alpha}^2}, \qquad (6.15)
$$
where $\sigma > 0$ is the constant in (2.3). Then
$$
\Pr\big\{ (6.15) \text{ holds for all } k \ge 1,\ \alpha \in \Pi^k, \text{ and } u \in E \big\} \ge 1 - 2q.
$$

The next result provides a bound on $|\langle \epsilon, f(Xu) - f(Xv) \rangle|$ for suitable $u$ and $v$. The method of its proof is described at the end of Section 4.

Lemma 6.7 Given $b(u)$ satisfying (5.4), let $G$ be a finite $b/2$-covering grid of a set $K \subset D(J)$. Fix $q_k \ge 0$ such that $q := \sum_k q_k < 1/2$ and $\ln q_k = O(k)$ over $\mathcal{J} = \{k \in \mathbb{N} : A_k(G) > 0\}$. Suppose that, with $E = G$ and $q_{k,u} = q_k/|G|$, (6.15) holds for all $k \ge 1$, $\alpha \in \Pi^k$, and $u \in G$. If $u, v \in K$ and the entire line segment connecting them is in $K$, then, letting $\xi = \langle \epsilon, f(Xu) - f(Xv) \rangle$ and $d = v - u$,
$$
|\xi| \le \sigma \sqrt{2n}\, H(b(G), d), \qquad (6.16)
$$
where $H(b(G), d) < \infty$, with
$$
H(z, d) = \sum_{k=1}^\infty k \sqrt{\ln|G| + \ln(p^k/q_k)}\; A_k(G) \times n^{-\frac{1}{2k}} \|d\|_{1,2k} \times z^{k-1}.
$$

Proof. Since $G$ is finite, $b(G) < r(G)$. Given $\eta \in (0, r(G)/b(G) - 1)$, let
$$
T = \Big\lceil \frac{2 \|d\|_{1,\infty}}{\eta\, b(G)} \Big\rceil.
$$
By the assumption, $u + \theta d \in K$ for $\theta \in [0, 1]$. For $t = 0, \dots, T$, let $u^{(t)} = u + t d/T$.
Then $u^{(0)} = u$, $u^{(T)} = v$, and $u^{(t)} \in K$. Fix $t = 1, \dots, T$. Then
$$
\|u^{(t)} - u^{(t-1)}\|_{1,\infty} = \|d\|_{1,\infty}/T \le \eta\, b(G)/2.
$$
By the definition of $G$, we can find some $w \in G$ such that $\|u^{(t)} - w\|_{1,\infty} \le b(G)/2$. Then $\|u^{(t-1)} - w\|_{1,\infty} \le (1 + \eta) b(G)/2$. Let $\varphi(x) = (\varphi_1(x_1), \dots, \varphi_n(x_n))^\top$, with
$$
\varphi_i(z) = f(z + X_i^\top w) - f(X_i^\top w), \qquad 1 \le i \le n.
$$
Let $\tilde u = u^{(t)} - w$ and $\tilde v = u^{(t-1)} - w$. Then
$$
\varphi(X \tilde u) = f(X u^{(t)}) - f(Xw), \qquad \varphi(X \tilde v) = f(X u^{(t-1)}) - f(Xw),
$$
and, as shown just now, $\|\tilde u\|_{1,\infty} \le (1 + \eta) b(G)/2$ and $\|\tilde v\|_{1,\infty} \le (1 + \eta) b(G)/2$. Let $m = (m_1, \dots, m_p)^\top$ with $m_j = |\tilde u_j| \vee |\tilde v_j|$. From the above inequalities we get
$$
\|m\|_{1,\infty} \le \|\tilde u\|_{1,\infty} + \|\tilde v\|_{1,\infty} \le (1 + \eta) b(G), \qquad (6.17)
$$
and hence, by $\rho(\varphi_i, 0) = \rho(f, X_i^\top w) \ge r(w)$,
$$
\sum_{j=1}^p m_j \max_{1 \le i \le n} \frac{|X_{ij}|}{\rho(\varphi_i, 0)} \le \frac{\|m\|_{1,\infty}}{r(w)} \le \frac{(1 + \eta) b(G)}{r(G)} < 1.
$$
Now Lemma 6.4 can be applied to $\varphi$, with $u$, $v$, and $q_k$ therein replaced with $\tilde u$, $\tilde v$, and $q_k/|G|$, respectively. Then
$$
\big| \langle \epsilon, f(X u^{(t)}) - f(X u^{(t-1)}) \rangle \big| \le \sigma \sqrt{2} \sum_{k=1}^\infty k \sqrt{\ln(|G| p^k / q_k)}\; M_k^{(t)},
$$
where
$$
M_k^{(t)} = \Big( \sum_{j=1}^p m_j \omega_{jk}^{\frac{1}{2k}} \Big)^{k-1} \sum_{j=1}^p |u_j^{(t)} - u_j^{(t-1)}|\, \omega_{jk}^{\frac{1}{2k}}, \quad \text{with} \quad \omega_{jk} = \sum_{i=1}^n a_{ik}(w)^2 |X_{ij}|^{2k} \le A_k(G)^2 \|V_j\|_{2k}^{2k} \le n\, A_k(G)^2 \|V_j\|_\infty^{2k}.
$$
Since $u^{(t)} - u^{(t-1)} = d/T$, it follows that
$$
M_k^{(t)} \le \Big( \sum_{j=1}^p m_j\, n^{\frac{1}{2k}} A_k(G)^{\frac{1}{k}} \|V_j\|_\infty \Big)^{k-1} \times \frac{A_k(G)^{\frac{1}{k}}}{T} \sum_{j=1}^p d_j \|V_j\|_{2k} = \frac{\sqrt{n}\, A_k(G)}{T} \|m\|_{1,\infty}^{k-1} \times n^{-\frac{1}{2k}} \|d\|_{1,2k} \le \frac{\sqrt{n}\, A_k(G)}{T} [(1 + \eta) b(G)]^{k-1} \times n^{-\frac{1}{2k}} \|d\|_{1,2k},
$$
where the last inequality is due to (6.17). Consequently,
$$
|\xi| \le \sum_{t=1}^T \big| \langle \epsilon, f(X u^{(t)}) - f(X u^{(t-1)}) \rangle \big| \le \sigma \sqrt{2n}\, H\big((1 + \eta) b(G), d\big).
$$
By (5.6) and $\ln q_k = O(k)$ over $\mathcal{J}$, the radius of convergence of the power series defining $g(z) = H(z, d)$ is $r(G) > b(G)$. As $(1 + \eta) b(G) < r(G)$, we can let $\eta \to 0$ and apply dominated convergence. The proof is then complete.

Proposition 6.8 In Condition H1, let $D$ be a compact subset of $D(J)$. Suppose $\epsilon$ satisfies (2.3) for some $\sigma > 0$. Let $G$ be a finite $b/2$-covering grid of $C(D)$. If $\beta \in D$, then Condition H1 is satisfied by setting $c_1$ as in Theorem 5.2.

Proof. Since $C(D)$ is compact, it indeed has a finite $b/2$-covering grid, justifying the assumption on $G$. As in the proof of Proposition 6.5, let $q_k = (\frac{q}{1+q})^k$. Then by Lemmas 6.6 and 6.7, with probability at least $1 - 2q$, (6.16) holds. The rest of the proof follows that for Proposition 6.5 and hence is omitted for brevity.

Proof of Theorem 5.2. First, by $D \subset D(I, n(\nu)/2)$ and $d(f, I) > 0$, Proposition 4.1 can be applied to yield $c_2$. Second, $C(D)$ is compact, and since $I$ is an interval, $C(D) \subset D(I)$. Then $C(D) \subset D(J)$. Proposition 6.8 can be applied to $K = C(D)$ to get $c_1$.

6.4.2 Other technical results

Proof of Proposition 5.4. Because $D \subset D(I, h/2)$ and is compact, $K = C(D) \subset D(I, h)$ and is compact. First, fix $S$ with $|S| = h$ and $K_S \ne \emptyset$. Let $\psi_S : \mathbb{R}^p \to \mathbb{R}^S$ be the natural projection and $\imath_S : \mathbb{R}^S \to \mathbb{R}^p$ the immersion, such that $\imath_S(y) = z \in \mathbb{R}^p$, with $z_j = y_j$ for $j \in S$ and $z_j = 0$ for $j \notin S$. Define the weighted $L_1$ norm $\|\cdot\|_S$ on $\mathbb{R}^S$ such that $\|u\|_S = \sum_{j \in S} |u_j| \|V_j\|_\infty$. For ease of notation, denote $B_S(w, a) = B(w, a; \|\cdot\|_S)$ and $\delta_S(E) = \delta(E; \|\cdot\|_S)$. Likewise, denote $B(w, a) = B(w, a; \|\cdot\|_{1,\infty})$ and $\delta(E) = \delta(E; \|\cdot\|_{1,\infty})$. Fix $d > 0$; later we will set $d$ to specific values. Let $E = \psi_S(K_S)$. It is easy to verify that $\delta_S(E) = \delta(K_S)$.
By a simple geometric argument, it is seen that $E$ can be covered by no more than $[\delta(K_S)/d + 1]^h$ balls $B_S(\tilde u_k, d)$, with each one intersecting $E$. Let $u_k = \imath_S(\tilde u_k)$.

In case (1), let $d = \bar d_b/2$. By $J = \mathbb{R}$, $f$ is analytic at every $X_i^\top u_k$. Then, by
$$
K_S = \imath_S(E) \subset \bigcup_k \imath_S(B_S(\tilde u_k, d)) \subset \bigcup_k B(u_k, d) \subset \bigcup_k B(u_k, b(u_k)/2),
$$
$u_1, \dots, u_m$ is a $b/2$-covering grid of $K_S$.

In case (2), let $d = d_b/4$. Since $f$ may not be analytic at every $X_i^\top u_k$, we cannot directly take $u_1, \dots, u_m$ as a covering grid. For each $k = 1, \dots, m$, choose an arbitrary $\tilde w_k \in B_S(\tilde u_k, d) \cap E$ and let $w_k = \imath_S(\tilde w_k)$. As $w_k \in K_S$, $f$ is analytic at every $X_i^\top w_k$. It is easy to check that $B_S(\tilde w_k, 2d)$ contains $B_S(\tilde u_k, d)$. Therefore,
$$
K_S = \imath_S(E) \subset \bigcup_k \imath_S(B_S(\tilde w_k, 2d)) \subset \bigcup_k B(w_k, 2d) \subset \bigcup_k B(w_k, b(w_k)/2),
$$
so $w_1, \dots, w_m$ is a $b/2$-covering grid of $K_S$.

Denote by $G_S$ the covering grid obtained above in either case. As $K = \bigcup_{|S| = h} K_S$, $G = \bigcup_{|S| = h:\, K_S \ne \emptyset} G_S$ is a $b/2$-covering grid of $K$ and
$$
|G| \le \sum_{|S| = h:\, K_S \ne \emptyset} |G_S|.
$$
We already know $|G_S| \le [\delta(K_S)/d + 1]^h$. By $\delta(K_S) \le \delta(K) = \delta(D)$, $|G_S| \le [\delta(D)/d + 1]^h$. Finally, there are at most $p^h$ subsets $S$ with $|S| = h$ and $K_S \ne \emptyset$. The proof for the bounds on $|G|$ is thus complete.

Proof of Proposition 5.5. (1) If $c > \bar\rho_0$, then there is $t \in J$ such that $\rho(f, t) < c$. Since $\limsup_k |f^{(k)}(t)/k!|^{1/k} = \rho(f, t)^{-1}$,
$$
\limsup_{k \to \infty} \bar d_k c^k \ge \limsup_{k \to \infty} \frac{|f^{(k)}(t)|\, c^k}{k!} = \infty.
$$
Therefore, the radius of convergence of $\sum_{k \ge 1} \bar d_k z^k$ is at most $\bar\rho_0$. To show that the radius of convergence is exactly $\bar\rho_0$, it suffices to show that $\bar d_k c^k$ is bounded for any $c \in (0, \bar\rho_0)$. By assumption, $M := \sup_{|\mathrm{Im}(z)| \le c} |f'(z)| < \infty$. Fix $x \in \mathbb{R}$. For any $z$ with $|z - x| = c$, $|\mathrm{Im}(z)| \le c$.
Therefore, by Cauchy's contour integral,
$$
\frac{|f^{(k)}(x)|}{k!} = \left| \frac{1}{2k\pi i} \oint_{|z - x| = c} \frac{f'(z)\, dz}{(z - x)^k} \right| \le \frac{1}{2k\pi} \oint_{|z - x| = c} \frac{|f'(z)|\, |dz|}{|z - x|^k} \le \frac{M}{k\, c^{k-1}}.
$$
Taking the supremum over $x \in \mathbb{R}$, we get $\bar d_k c^k \le M c/k < \infty$ for all $k \ge 1$.

From the definitions in (5.3), it is clear that $A_k(u) \le \bar d_k$ and $r(u) \ge \bar\rho_0$ for $u \in D$. Given any $\bar\rho_1 \in (0, \bar\rho_0)$, let $b(u) \equiv \bar\rho_1$. By Proposition 5.4(1), there is a $b/2$-covering grid $G$ for $C(D)$ with $|G| \le p^h (2\delta(D)/\bar\rho_1 + 1)^h$. Therefore, $c_1$ can be set as in (5.9).

(2) For each $x \in I$, $\rho(f, x) > 0$. Since $I$ is compact, it is covered by a finite number of intervals $(x_i - \rho(f, x_i)/2,\ x_i + \rho(f, x_i)/2)$. Let $c = \min_i \rho(f, x_i)/2$. Then $c > 0$. For any $x \in I$, there is $x_i$ such that $|x - x_i| < \rho(f, x_i)/2$. Then for any $z \in \mathbb{C}$ with $|z - x| < c$, $|z - x_i| < \rho(f, x_i)$, and hence $f$ is analytic at $z$. As a result, $f$ is analytic in the disc centered at $x$ with radius $c$, and so $\rho(f, x) \ge c$. This leads to $\rho_0 = \inf_{x \in I} \rho(f, x) \ge c > 0$. For $c \in (0, \rho_0)$, since $I_c = \{z \in \mathbb{C} : |z - x| \le c \text{ for some } x \in I\}$ is compact, $M := \sup_{z \in I_c} |f'(z)| < \infty$. Using Cauchy's contour integral as in (1), it can be shown that $\rho_0$ is the radius of convergence of $\sum_{k \ge 1} d_k z^k$. The rest of (2) can be proved following the argument for (1).

References

Akaike, H. (1974), 'A new look at the statistical model identification', IEEE Trans. Automatic Control AC-19, 716–723. System identification and time-series analysis.

Bunea, F., Tsybakov, A. & Wegkamp, M. (2007), 'Sparsity oracle inequalities for the Lasso', Electron. J. Stat. 1, 169–194 (electronic).

Candès, E. J. & Plan, Y. (2009), 'Near-ideal model selection by ℓ1 minimization', Ann. Statist. 37(5A), 2145–2177.

Candès, E. J. & Tao, T. (2007), 'The Dantzig selector: statistical estimation when p is much larger than n', Ann. Statist. 35(6), 2313–2351.

Donoho, D. L., Elad, M. & Temlyakov, V. N. (2006), 'Stable recovery of sparse overcomplete representations in the presence of noise', IEEE Trans. Inform. Theory 52(1), 6–18.

Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. (2004), 'Least angle regression', Ann. Statist. 32(2), 407–499. With discussion, and a rejoinder by the authors.

Field, D. J. (1994), 'What is the goal of sensory coding?', Neural Comput. 6(4), 559–601.

Huang, C., Cheang, G. H. L. & Barron, A. R. (2008), Risk of penalized least squares, greedy selection and ℓ1 penalization for flexible function libraries, Technical Report 06-10, Yale University, Department of Statistics.

Natarajan, B. K. (1995), 'Sparse approximate solutions to linear systems', SIAM J. Comput. 24(2), 227–234.

Pollard, D. (1984), Convergence of Stochastic Processes, Springer Series in Statistics, Springer-Verlag, New York.

Rudin, W. (1987), Real and Complex Analysis, third edn, McGraw-Hill Book Co., New York.

Schwarz, G. (1978), 'Estimating the dimension of a model', Ann. Statist. 6(2), 461–464.

Sharpee, T. O., Miller, K. D. & Stryker, M. P. (2008), 'On the importance of static nonlinearity in estimating spatiotemporal neural filters with natural stimuli', J. Neurophysiol. 99(1), 2496–2509.

Sharpee, T. O., Rust, N. C. & Bialek, W. (2004), 'Analyzing neural responses to natural signals: maximally informative dimensions', Neural Comput. 16, 223–250.

Wasserman, L. & Roeder, K. (2009), 'High-dimensional variable selection', Ann. Statist. 37(5A), 2178–2201.

Zhang, T. (2009), 'Some sharp performance bounds for least squares regression with l1 regularization', Ann. Statist. 37(5A), 2109–2144.

Zhao, P. & Yu, B. (2006), 'On model selection consistency of Lasso', J. Mach. Learn. Res. 7, 2541–2563.