Consistent selection via the Lasso for high dimensional approximating regression models
IMS Collections: Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, Vol. 3 (2008) 122–137
© Institute of Mathematical Statistics, 2008
DOI: 10.1214/074921708000000101

Florentina Bunea*
Department of Statistics, Florida State University, Tallahassee, Florida, USA; e-mail: flori@stat.fsu.edu
* Supported in part by NSF Grants DMS-04-06049 and DMS-07-06829.

Abstract: In this article we investigate consistency of selection in regression models via the popular Lasso method. Here we depart from the traditional linear regression assumption and consider approximations of the regression function f with elements of a given dictionary of M functions. The target for consistency is the index set of those functions from this dictionary that realize the most parsimonious approximation to f among all linear combinations belonging to an L2 ball centered at f and of radius r²_{n,M}. In this framework we show that a consistent estimate of this index set can be derived via ℓ1 penalized least squares, with a data-dependent penalty and with tuning sequence r_{n,M} > √(log(Mn)/n), where n is the sample size. Our results hold for any 1 ≤ M ≤ n^γ, for any γ > 0.

AMS 2000 subject classifications: primary 62G08; secondary 62C20, 62G05, 62G20.
Keywords and phrases: consistency, high dimension, Lasso, ℓ1 regularization, regression, penalty, selection.

Contents
1. Introduction
   1.1. Beyond linear regression
2. Consistent selection via ℓ1 penalized least squares
   2.1. Main result: consistent subset selection
   2.2. Proof of Theorem 2.1
Appendix
References

1. Introduction

In this paper we show that the popular Lasso technique can be used for consistent feature selection in high dimensional approximating regression models. We consider the following framework. Given a random pair (X, Y), we let f(x) = E(Y | X = x) be the conditional mean function, henceforth called the regression function. We aim to reconstruct consistently a sparse approximation of f via linear combinations of elements of a given dictionary of functions F = {f_1, ..., f_M}. This reconstruction will be based on (X_1, Y_1), ..., (X_n, Y_n), a sample of independent random pairs distributed as (X, Y) ∈ (𝒳, ℜ), where 𝒳 is a Borel subset of ℜ^d; all functions f_j are defined on 𝒳. Our aim expresses the belief that, in many instances, even if M is large, only a subset of F may be needed to approximate f well. If that is the case, it may be of interest to determine whether this set can be estimated consistently via a computationally efficient method. The focus of this work is on consistent selection via the Lasso when the size of F grows polynomially with the sample size n, that is, M = n^γ, for any γ > 0.

We begin by giving a number of examples of dictionaries F and associated consistency issues.

1. If d = M and f_j(X) = X_j for all j, one may be interested in identifying the subset of variables with linear combinations close to f.
A familiar particular case is linear regression, where one assumes that f(X) = λ′X, with λ ∈ ℜ^M having non-zero components in positions corresponding to a set J* ⊆ {1, ..., M}. Here we depart from this traditional equality assumption and consider the more realistic case where f is not equal to, but can be well approximated by, a linear combination of the given variables. We discuss this in detail in the next section.

2. Another problem of interest is that of finding consistently a sparse linear approximation of f realized with elements from a large list of M possibly competing estimators. These estimates may correspond to M different methods of estimation, may be computed from M different samples with the same mean function, or may correspond to M different values of a tuning parameter of the same method. Instances of the latter arise in kernel based methods that require the choice of a grid of values for the bandwidth parameter, or in Bayesian methods, where the specification of a grid of values for hyper-parameters is needed. A consistent identification of a subset of the estimates in these examples would validate the use of a particular restriction on an initially large grid. In such situations, when the elements of F are estimators, we will assume that they have been computed on samples independent of the one used for subset selection and treat them here as fixed functions.

3. A last example is the nonparametric estimation of f from a collection of M given basis functions, where only a subset may realize a good approximation of f, as described in the following subsection.

There exist a number of model selection methods that yield consistent subset selection in regression models. In discussing them a number of distinctions are needed. The first one pertains to the evolution of the literature on model selection techniques in regression.
One important cut-off point in this evolution seems to be the computational complexity of a particular method and, within this, the size of M relative to n plays a crucial role. If M ≤ n, procedures based on various information criteria occupy an important place. They are referred to now as the BIC/AIC-type methods; we mention here the seminal works of [1] and [15], the unifying theory of [2], and various generalizations of these methods ([4], [7]). Such procedures can be easily implemented for small to moderate M. For larger values of M, multiple testing procedures, in particular of the FDR type (e.g., [3], [9]), or cross-validation with all its variants (holdout validation, K-fold) [21], are popular, but become more computationally complex as M increases. If M > n these techniques may become computationally intractable, unless they are used as part of a multiple-stage scheme. For a further overview of computational aspects in model selection, from a Bayesian perspective, see [11].

Whereas the above mentioned methods can still be used in very particular regression models when M > n, for instance for sequence-space models, where model selection via BIC is equivalent to hard thresholding, they typically fail, computationally, when M is large. A standard solution in this case is to seek estimates that solve a certain class of convex optimization problems. Among the most popular estimates of this type in regression is the penalized least squares estimate with an ℓ1-type penalty (Lasso), which we describe in detail in the next section. In a Bayesian framework it can be derived from a Gaussian likelihood with a Laplace prior.
Two important aspects set the ℓ1 regularized (Lasso) type estimators apart: they are easy and fast to compute (see [8], [13], [14], [18], among others, for efficient algorithms) and, if M > n, some components of the estimate will be set to zero in finite samples (see, e.g., [13]). Therefore, via a one-step, easily implementable procedure, one obtains subset selection even if M > n. To date, this method (or its variants) is the most widely used in regression problems of very high dimension, especially when dimension reduction is of interest.

The second distinction in discussing consistency of selection in regression is related to the target for consistency. Consistency of selection has been studied for all the aforementioned techniques only in the following context, which we term parametric: the target for selection is typically an index set J* corresponding to the non-zero true regression coefficients, whereas the remaining coefficients are assumed to be exactly zero. An estimation method that uses the data and all M elements f_j to yield a subset Î of indices such that P(Î = J*) → 1 for large n is called a consistent method of selection.

In light of these two distinctions we give below a summary of the existing results on consistency of selection. They have all been established for the traditional parametric target J*. If M ≤ n and under appropriate assumptions, all the above methods, or close variants, yield consistent subset selection for the parametric target J*. References include those for AIC/BIC-type methods ([4], [10], and [22], among others), multiple testing procedures [5], cross-validation procedures [16], and Lasso-type procedures [24]. If M > n, consistency of selection has only been studied for Lasso-type estimators. Again, in the existing literature, the target is the standard target J*.
The results are limited. Meinshausen and Bühlmann [12] showed that P(Î = J*) → 1 in Gaussian graphical models, under assumptions that are tailored to models for which, in our notation, (Y, X_1, ..., X_M) ∼ N(0, Σ). Consistency of selection has been established when M > n, for fixed design linear regression models and a target set J* that corresponds to coefficients λ*_j that are assumed to be lower bounded by a sequence of order O(n^{−δ/2}), for 0 < δ < 1 [23]. Similar results, under slightly different assumptions, have also been obtained for a three stage procedure [20]: in the first stage Lasso estimates are computed for a number of values of the tuning parameter, in the second step cross-validation is performed to select one Lasso estimate, and in the third one the model is refitted on the variables present in the selected Lasso estimate. We also refer to a related notion of consistency, in fixed design regression with Gaussian errors [19]. If M > n, consistent subset selection via the Lasso has not been investigated, to the best of our knowledge, in the general framework we describe in detail below. Within this framework, we extend the existing results to more general regression models on a random design and a more general target index set.

1.1. Beyond linear regression

Despite its practical appeal, the study of selection procedures that are consistent for target sets other than the classical one has received very little attention. Our target set will be defined relative to linear approximations of f with elements of F with respect to the L2(ν) norm ‖·‖, where we denote the probability measure of X by ν.
Formally, define

(1.1)  Λ = { λ ∈ ℜ^M : ‖ Σ_{j=1}^M λ_j f_j − f ‖² ≤ C_f r²_{n,M} },

where C_f > 0 is a constant depending only on f and r_{n,M} is a positive sequence that converges to zero and which will be specified in the next section. In what follows we assume that Λ is not void. For any λ ∈ ℜ^M we let J(λ) denote the index set corresponding to the non-zero components of λ, and denote by M(λ) its cardinality. Let k* = min{ M(λ) : λ ∈ Λ }. We define our target vector

(1.2)  λ* = argmin{ ‖ Σ_{j=1}^M λ_j f_j − f ‖² : λ ∈ ℜ^M, M(λ) = k* }.

Let I* = J(λ*) denote the index set corresponding to the non-zero elements of λ*, and note that I* has cardinality k*. Thus f* = Σ_{j∈I*} λ*_j f_j provides the sparsest approximation to f that can be realized with λ ∈ Λ and, in particular,

(1.3)  ‖ f* − f ‖² ≤ C_f r²_{n,M}.

This motivates our treating I* as the target index set. We note that if one assumes, as in standard linear regression models, that f(x) = Σ_{j=1}^M λ_j x_j = Σ_{j∈I*} λ*_j x_j = f*(x), where the λ*_j denote the non-zero components of λ, then (1.3) is trivially satisfied for any positive sequence r_{n,M}. Therefore, the classical target J* is a particular case of ours.

In order to ensure that λ* captures the essential features of f in a parsimonious way, we require that its components not be unnecessarily small; otherwise we can place their indices outside I*. Formally, we will require that the following condition holds.

Condition (C). There exists B > 0, independent of n or M, such that min_{j∈I*} |λ*_j| > B r_{n,M}.

We show below that ℓ1 penalized least squares can be used to estimate consistently the new target I*, even if M is larger than n, in particular if it grows as n^γ, for any γ > 0, under minimal assumptions on the dictionary F and appropriate choices for r_{n,M}. In Section 2 below we introduce the estimate and discuss these choices.
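The definitions above can be made concrete on a toy problem: for a small dictionary, the target pair (k*, I*) can be found by brute force, fitting least squares on every candidate support and keeping the smallest support whose error enters the L2 ball around f. The sketch below is purely illustrative (the dictionary, the grid approximation of the L2(ν) norm, and the radius are assumptions made here, not quantities from the paper):

```python
import itertools
import numpy as np

# Dictionary of M = 4 functions on [0, 1]; the L2(nu) norm (nu = uniform) is
# approximated by an empirical norm over a fine grid, purely for illustration.
x = np.linspace(0.0, 1.0, 2001)
F = np.column_stack([np.ones_like(x), x, x**2, np.sin(2 * np.pi * x)])
M = F.shape[1]

# f is "almost" a 2-sparse combination: a tiny x^2 component is also present.
f = 2.0 * F[:, 0] - 1.0 * F[:, 1] + 0.01 * F[:, 2]

def sq_norm(g):
    """Empirical stand-in for the squared L2(nu) norm ||g||^2."""
    return np.mean(g**2)

radius_sq = 1e-3   # plays the role of C_f * r_{n,M}^2 (illustrative value)

# Brute force over supports: k* is the smallest support size whose best
# linear combination lands inside the ball of squared radius radius_sq.
best = None
for k in range(M + 1):
    for S in itertools.combinations(range(M), k):
        if k == 0:
            err = sq_norm(f)
        else:
            cols = list(S)
            coef, *_ = np.linalg.lstsq(F[:, cols], f, rcond=None)
            err = sq_norm(f - F[:, cols] @ coef)
        if err <= radius_sq:
            best = (k, set(S), err)
            break
    if best:
        break

k_star, I_star, err = best
print(k_star, sorted(I_star))
```

Here f has a tiny third component, so no 2-sparse combination reproduces it exactly, yet the sparsest approximation inside the ball uses only the first two dictionary elements: k* = 2 and I* corresponds to indices 0 and 1 in the code.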
Section 2.1 contains our main result, Theorem 2.1, together with a discussion of the assumptions under which it holds. The proof of the main result is given in Section 2.2, and intermediate results are proved in the Appendix.

2. Consistent selection via ℓ1 penalized least squares

We estimate the set I* of the previous section via ℓ1 penalized least squares. We first compute

(2.4)  λ̂ = argmin_{λ∈ℜ^M} [ (1/n) Σ_{i=1}^n { Y_i − Σ_{j=1}^M λ_j f_j(X_i) }² + pen(λ) ],

where

(2.5)  pen(λ) = 2 Σ_{j=1}^M ω_{n,j} |λ_j|, with ω_{n,j} = r_{n,M} ‖f_j‖_n,

for a sequence r_{n,M} given below, and where we write ‖g‖²_n = n^{−1} Σ_{i=1}^n g²(X_i) for any function g : 𝒳 → ℜ. We note that each λ_j in the penalty term has a different, data-dependent, weight.

The estimate λ̂ thus obtained is in one-to-one correspondence with the following estimate. For each 1 ≤ j ≤ M, define θ_j = 2 ω_{n,j} λ_j and let A be the M × M diagonal matrix with diagonal entries 2 ω_{n,j}. Next observe that Fλ = F_1 θ, where F is the n × M matrix with entries f_j(X_i), F_1 = F A^{−1} and θ = Aλ. Thus, denoting by Y the n dimensional vector with entries Y_i, the problem reduces to calculating

θ̂ = argmin_{θ∈ℜ^M} [ (1/n) (Y − F_1 θ)′(Y − F_1 θ) + Σ_{j=1}^M |θ_j| ],

for which the aforementioned fast algorithms can be used. Then we compute our sought solution λ̂ = A^{−1} θ̂. We let Î denote the index set corresponding to the non-zero components of λ̂. We show in the next subsection that P(Î = I*) → 1 when n → ∞.

We begin by noticing that we always have

P(Î = I*) ≥ 1 − P(I* ⊈ Î) − P(Î ⊈ I*).

Therefore, proving that Î is consistent reduces to showing that each of the probabilities on the right-hand side of the inequality above converges to zero. In what follows we motivate choices for the sequence r_{n,M} that stem from sufficient conditions under which this convergence is achieved.
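The estimator (2.4)–(2.5) can be sketched numerically. The snippet below solves the weighted ℓ1 problem directly by coordinate descent rather than through the rescaling F_1 = FA^{−1} described above; the synthetic data, the constant A = 1 in the tuning sequence, and the choice of solver are all illustrative assumptions, not prescriptions from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: M = 10 dictionary elements f_j(X) = X_j, true support {0, 1}.
n, M = 400, 10
X = rng.normal(size=(n, M))
Y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * rng.normal(size=n)

# Tuning sequence r_{n,M} = A * sqrt(log(M n) / n), with the arbitrary
# illustrative choice A = 1.
r = np.sqrt(np.log(M * n) / n)
col_norms = np.sqrt(np.mean(X**2, axis=0))   # ||f_j||_n
w = r * col_norms                            # data-dependent weights omega_{n,j}

def weighted_lasso(F, Y, w, n_sweeps=200):
    """Coordinate descent for (1/n)||Y - F lam||^2 + 2 sum_j w_j |lam_j|."""
    n, M = F.shape
    a = np.mean(F**2, axis=0)                # ||f_j||_n^2
    lam = np.zeros(M)
    resid = Y.copy()
    for _ in range(n_sweeps):
        for j in range(M):
            resid += lam[j] * F[:, j]        # remove coordinate j from the fit
            z = np.mean(F[:, j] * resid)     # (1/n) <f_j, partial residual>
            lam[j] = np.sign(z) * max(abs(z) - w[j], 0.0) / a[j]   # soft-threshold
            resid -= lam[j] * F[:, j]
    return lam

lam_hat = weighted_lasso(X, Y, w)
support = set(np.flatnonzero(lam_hat))
print(support)
```

With the soft-thresholding update, every coordinate whose empirical correlation with the partial residual falls below its weight ω_{n,j} is set exactly to zero, which is the finite-sample sparsity property referred to in Section 1.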
The proofs are presented in the next section. We begin by noticing that if λ̂ → λ* with probability converging to one, then I* ⊈ Î with probability converging to zero. To see this, further note that if component-wise consistency of λ̂ holds, we will estimate all non-zero elements of λ* by non-zero sequences, but we may also estimate some of its zero components by some small, but non-zero, sequences. In light of this fact, a first set of restrictions on r_{n,M} will be such that λ̂ is close to λ*, in the sense below. It follows immediately (by [6], Theorem 2.3; see the Appendix below for a full formulation) that, with high probability,

r_{n,M} |λ̂ − λ*|_1 ≤ D { ‖f − f*‖² + k* r²_{n,M} },

for some positive constant D, where |a|_1 = Σ_{j=1}^M |a_j| denotes the ℓ1 norm of any vector in ℜ^M.

Next, notice that the optimal parametric rate of convergence for a component λ̂_j of λ̂ is of order 1/√n, and it could be achieved if we knew I*, of cardinality k* < M, in advance. However, this is not known, so the best we can do is mimic this behavior in our context. We can do this by choosing r_{n,M} of order 1/√n, where we recall that we have assumed that ‖f − f*‖² ≤ r²_{n,M}. Notice further that this choice is optimal for the rate of convergence of λ̂, which is not the focus here. Indeed, more modest rates of convergence of λ̂ can be considered when consistency of selection is of main importance. We discuss in detail two concrete choices, and defer a complete analysis to future work. One can consider r_{n,M} = A √(log(Mn)/n), for an appropriately large constant A > 0. Notice that this choice differs from the one that yields the optimal rate only by logarithmic factors, which are needed to accommodate dictionaries with M > n.
With this choice, the target set I* corresponds to linear combinations of the elements of F that belong, up to logarithmic factors, to a 1/√n neighborhood of f, with respect to the L2(ν) norm. This provides only a slight departure from the standard linear model assumption and standard target index set J*. It is therefore not surprising that, in this case, our tuning sequence r_{n,M} is also comparable to the one considered in parametric models ([12], [23]), where a sequence of the order of 1/n^{1/2−θ}, θ ∈ (0, 1/2), is employed. We note that this choice is slightly conservative, and can be relaxed to O(√(log(Mn)/n)) in our framework, and therefore, as a particular case, in theirs.

In order to accommodate consistent selection in a purely nonparametric framework we need to increase the size of r_{n,M}. For instance, if all f_j are estimates of f, and r_{n,M} is as before, the set Λ defined in (1.1) may be empty, as nonparametric estimates of f typically have rates slower than 1/√n. We therefore consider target sets I* corresponding to L2(ν) neighborhoods around f of radius r²_{n,M}, now with r_{n,M} = O((log(Mn)/n)^{1/4}). In this case, the set Λ given in (1.1) above is not empty if at least one of the estimators f_j has, up to logarithmic factors, a rate of the order n^{−1/4}, which is a modest rate to require. Of course, if f_j(X) = X_j, as in linear regression, this choice means that we may be content with a coarser approximation than before. However, note that this approximation has the benefit of being realized with a smaller number of variables, and that this may increase the interpretability of that particular model and be a desirable property in practical situations.
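The two tuning radii can be compared numerically. The arithmetic below, with the illustrative constant A = 1, shows that the nonparametric choice (log(Mn)/n)^{1/4} always yields the larger, coarser radius whenever log(Mn)/n < 1:

```python
import numpy as np

# Compare the two tuning radii from Section 2 for a few (n, M = n^gamma)
# pairs, taking the leading constant A = 1 purely for illustration.
for n in (100, 10_000):
    for gamma in (1, 2):
        M = n**gamma
        base = np.log(M * n) / n
        r_param = np.sqrt(base)     # parametric-type choice, sqrt(log(Mn)/n)
        r_nonpar = base**0.25       # nonparametric choice, (log(Mn)/n)^(1/4)
        print(f"n={n:>6} M={M:>9} r_param={r_param:.4f} r_nonpar={r_nonpar:.4f}")
        assert r_nonpar > r_param   # log(Mn)/n < 1 here, so the 1/4 power is larger
```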
The results presented below hold for either of these choices, in particular for any r_{n,M} ≥ A √(log(Mn)/n), and we will therefore not distinguish between them.

2.1. Main result: consistent subset selection

We begin by listing and commenting on the assumptions under which our result holds. The first assumption refers to the error terms W_i = Y_i − f(X_i). We recall that f(X) = E(Y | X).

Assumption (A1). The random variables X_1, ..., X_n are independent, identically distributed random variables with probability measure ν. The random variables W_i are independently distributed with E{W_i | X_1, ..., X_n} = 0 and E{exp(|W_i|) | X_1, ..., X_n} ≤ b for some finite b > 0 and i = 1, ..., n.

We also impose mild conditions on f and on the functions f_j. Let ‖g‖_∞ = sup_{x∈𝒳} |g(x)| for any function g on 𝒳.

Assumption (A2).
(a) There exists 0 < L < ∞ such that ‖f_j‖_∞ ≤ L for all 1 ≤ j ≤ M.
(b) There exists c_0 > 0 such that ‖f_j‖ ≥ c_0 for all 1 ≤ j ≤ M.
(c) There exists L_0 < ∞ such that E[f_i²(X) f_j²(X)] ≤ L_0 for all 1 ≤ i, j ≤ M.
(d) There exists L_1 < ∞ such that ‖f‖_∞ ≤ L_1 < ∞.
(e) There exists L* < ∞ such that ‖f − f*‖_∞ ≤ L*.

Remark 2.1. We note that (a) trivially implies (c). However, as the implied bound may be too large, we opted for stating (c) separately. Note also that (a) and (d) imply the following: for any fixed λ ∈ ℜ^M, there exists a positive constant L(λ), depending on λ, such that ‖f − Σ_{j=1}^M λ_j f_j‖_∞ = L(λ). Inspection of the proof of Theorem 2.1 below shows that we can allow L* to grow very slowly with n. However, for the sake of clarity in presentation, we opted for treating it as fixed.

Assumption (A3). Let ρ_M(i, j) = ⟨f_i, f_j⟩ / (‖f_i‖ ‖f_j‖), where ⟨f_i, f_j⟩ = E f_i(X) f_j(X) and ‖f_i‖ = E^{1/2} f_i²(X).
Assume that max_{i∈I*} max_{j≠i} |ρ_M(i, j)| ≤ C/k*, for some constant C > 0.

Remark 2.2. Following [6], C = 1/45 is an allowable choice. Other choices are possible, but improvement of constants is beyond the scope of this paper.

Remark 2.3. Assumption (A3) reflects the belief that the correlations between functions f_j with j ∈ I* and functions f_j with j ∉ I* should be small. However, we allow the correlations outside I* to be arbitrary. We note that this assumption replaces the standard orthonormality assumption on the design matrix: it is given in terms of theoretical quantities and it can hold even if M > n. It can be checked in practice by replacing the theoretical correlations by sample correlations.

We denote by G the event that the n × M matrix F with entries f_j(X_i) has full rank. To avoid additional technicalities, the results of this paper can be regarded as conditional on G. Otherwise, all the results can be re-derived by intersecting all the relevant events with G and G^c, under the additional assumption that P(G^c) is appropriately small.

We can now state our main result, which we prove in the next subsection.

Theorem 2.1. If assumptions (A1)–(A3) and Condition (C) hold, and k* r_{n,M} → 0, then P(Î = I*) → 1.

Remark 2.4. The convergence above holds either if M is fixed and n → ∞, or if both M, n → ∞, provided r_{n,M} ≥ A √(log(Mn)/n) for an appropriately large constant A. Therefore we obtain consistency for both choices of r_{n,M} discussed above. In our derivations we require that M does not grow faster than a power of n.

Remark 2.5. The condition r_{n,M} k* → 0 imposes restrictions on the size of k*. If r_{n,M} = O(√(log(Mn)/n)), the theorem above shows that we can recover consistently subsets of size k* = O(√n / log n), up to other logarithmic factors.
The choice r_{n,M} = O((log(Mn)/n)^{1/4}) corresponds to a coarser approximation of f than before, and the restriction on the number of approximating functions is now k* = O(n^{1/4} / log n).

2.2. Proof of Theorem 2.1

Recall that

P(Î = I*) ≥ 1 − P(I* ⊈ Î) − P(Î ⊈ I*).

Therefore, proving that Î is consistent reduces to showing that each of the probabilities on the right-hand side of the inequality above converges to zero. We present this in the following two propositions. We defer the proofs of the intermediate results to the Appendix.

Proposition 2.2. If assumptions (A1)–(A3) and Condition (C) hold, and r_{n,M} k* → 0, then P(I* ⊈ Î) → 0 as n → ∞, for any r_{n,M} ≥ A √(log(Mn)/n), with A > 0 large enough.

Proof. We follow the same reasoning as [4]. Let c_n = min_{k∈I*} |λ*_k| and recall that c_n > B r_{n,M}, by Condition (C). Therefore

P(I* ⊈ Î) ≤ P(j ∉ Î for some j ∈ I*) ≤ P(|λ̂_j − λ*_j| = |λ*_j| for some j ∈ I*) ≤ P(|λ̂_j − λ*_j| ≥ c_n for some j ∈ I*) → 0, as n → ∞,

where, in the second inequality, we used that λ̂_j = 0 for j ∉ Î, by the definition of Î. The convergence to zero follows from Corollary 1, presented in the Appendix below.

Proposition 2.3. If assumptions (A1)–(A3) hold and r_{n,M} k* → 0, then P(Î ⊈ I*) → 0 as n → ∞, for any r_{n,M} ≥ A √(log(Mn)/n), with A > 0 large enough.

Proof. Let

h(µ) = (1/n) Σ_{i=1}^n { Y_i − Σ_{j∈I*} µ_j f_j(X_i) }² + 2 r_{n,M} Σ_{j∈I*} ‖f_j‖_n |µ_j|,

and define

(2.6)  µ̃ = argmin_{µ∈ℜ^{k*}} h(µ).

Let

B = ∩_{k∉I*} { | (2/n) Σ_{i=1}^n [Y_i − Σ_{j∈I*} µ̃_j f_j(X_i)] f_k(X_i) | < 2 r_{n,M} ‖f_k‖_n }.

Let λ̃ ∈ ℜ^M be the vector that has the components of µ̃ in positions corresponding to the index set I*, and components equal to zero otherwise. Thus, by abuse of notation, λ̃ = (µ̃, 0).
From Lemma 3.4 in the Appendix it follows that, on the set B, λ̃ is a solution of (2.4). Recall that λ̂ is a solution of (2.4) by construction. Then, by arguments similar to those used in ([13], Theorems 3.1 and 3.2) regarding the closeness of two solutions, it follows that, on the set B, λ̂_k = 0 for k ∈ I*^c. Therefore Î ⊆ I* on the set B. Hence

P(Î ⊈ I*) ≤ P(B^c)
= P( ∪_{k∈{1,...,M}∖I*} { | (2/n) Σ_{i=1}^n [Y_i − Σ_{j∈I*} µ̃_j f_j(X_i)] f_k(X_i) | ≥ 2 r_{n,M} ‖f_k‖_n } )
≤ Σ_{k∈{1,...,M}∖I*} P( | (2/n) Σ_{i=1}^n [Y_i − Σ_{j∈I*} µ̃_j f_j(X_i)] f_k(X_i) | ≥ 2 r_{n,M} ‖f_k‖_n ).

Let k ∈ {1, ..., M} ∖ I* be fixed. Define the sets

E_1(k) = { (1/n) | Σ_{i=1}^n W_i f_k(X_i) | < r_{n,M} ‖f_k‖_n / 2 },
E_2(k) = { ‖f_k‖²_n ≥ ‖f_k‖² / 4 },
E_3(k) = { | (1/n) Σ_{i=1}^n f_j(X_i) f_k(X_i) | ≤ 2 |⟨f_j, f_k⟩| + δ_{n,M} for all j ∈ I* },

where δ_{n,M} = 2 C L² r_{n,M}. The choice of δ_{n,M} is purely technical and does not affect the overall results. Let f̃ = Σ_{j∈I*} µ̃_j f_j. Recall that λ* ∈ ℜ^M given by (1.2) has zero components in positions corresponding to indices in I*^c, by definition. Let µ* be the vector in ℜ^{k*} obtained from λ* by deleting these zeros. Therefore f* = Σ_{j=1}^M λ*_j f_j = Σ_{j∈I*} µ*_j f_j. By successive applications of the triangle inequality, and since ‖f_k‖_n ≤ L for all k ∈ I*^c by assumption (A2)(a), we obtain:

(2.7)
P( (1/n) | Σ_{i=1}^n [Y_i − Σ_{j∈I*} µ̃_j f_j(X_i)] f_k(X_i) | ≥ r_{n,M} ‖f_k‖_n )
≤ P( (1/n) | Σ_{i=1}^n W_i f_k(X_i) | ≥ r_{n,M} ‖f_k‖_n / 2 )
  + P( (1/n) | Σ_{i=1}^n (f(X_i) − f̃(X_i)) f_k(X_i) | ≥ r_{n,M} ‖f_k‖_n / 2 )
≤ P(E_1^c(k)) + P( (1/n) | Σ_{i=1}^n (f*(X_i) − f̃(X_i)) f_k(X_i) | ≥ r_{n,M} ‖f_k‖_n / 4 )
  + P( (1/n) Σ_{i=1}^n |f(X_i) − f*(X_i)| ≥ r_{n,M} ‖f_k‖_n / 4L )
≤ P(E_1^c(k)) + P( | Σ_{j∈I*} (µ̃_j − µ*_j) (1/n) Σ_{i=1}^n f_j(X_i) f_k(X_i) | ≥ r_{n,M} ‖f_k‖_n / 4 )
  + P( (1/n) Σ_{i=1}^n |f(X_i) − f*(X_i)| ≥ r_{n,M} ‖f_k‖_n / 4L ).

To bound the second term in the last inequality above, we first notice that on the set E_3(k), and under assumptions (A2)(a) and (A3), we have

| Σ_{j∈I*} (µ̃_j − µ*_j) (1/n) Σ_{i=1}^n f_j(X_i) f_k(X_i) |
≤ 2 Σ_{j∈I*} |µ̃_j − µ*_j| |⟨f_j, f_k⟩| + δ_{n,M} Σ_{j∈I*} |µ̃_j − µ*_j|
≤ (2 C L² / k*) |µ̃ − µ*|_1 + δ_{n,M} |µ̃ − µ*|_1.

Therefore, on E_2(k) ∩ E_3(k), and under assumptions (A2)(a), (A2)(b) and (A3), we have

(2.8)
P( | Σ_{j∈I*} (µ̃_j − µ*_j) (1/n) Σ_{i=1}^n f_j(X_i) f_k(X_i) | ≥ r_{n,M} ‖f_k‖_n / 4 )
≤ P( |µ̃ − µ*|_1 ≥ (c_0 / 32CL²) k* r_{n,M} ) + P( |µ̃ − µ*|_1 ≥ (c_0 / 16) r_{n,M} δ_{n,M}^{−1} )
≤ 2 P( |µ̃ − µ*|_1 ≥ (c_0 / 32CL²) k* r_{n,M} ),

for n large enough, since the assumption k* r_{n,M} → 0 implies that k* r_{n,M} ≤ 1 for large n, and we recall that we defined δ_{n,M} = 2 C L² r_{n,M}. Lastly, notice that on the set E_2(k), and under assumptions (A2)(b) and (A2)(e), the third term of the last inequality in display (2.7) can be bounded by

(2.9)  P( (1/n) Σ_{i=1}^n |f(X_i) − f*(X_i)| ≥ (c_0 / 8L) r_{n,M} ).

To complete the proof we need to show that P(E_1^c(k)), P(E_2^c(k)) and P(E_3^c(k)), and the probabilities in (2.8) and (2.9), when summed over k ∈ {1, ..., M} ∖ I*, converge to zero as n → ∞. We show this in Lemma 3.5, Corollary 2 and Lemma 3.6, respectively, in the Appendix below. This completes the proof of this result.

Appendix

In order to show Proposition 2.2 and to bound (2.8) above we will use twice ([6], Theorem 2.3, page 177), and we begin by stating it here, for completeness.
For any λ ∈ ℜ^M we let J(λ) denote the index set corresponding to the non-zero components of λ and denote by M(λ) its cardinality. Let

ρ(λ) = max_{i∈J(λ)} max_{j≠i} |ρ_M(i, j)|.

With Λ given by (1.1) in Section 1.1, let Λ_1 = { λ ∈ Λ : ρ(λ) ≤ C / M(λ) }.

Theorem 2.3 ([6]). Assume that (A1) and (A2) hold. Then the ℓ1 penalized least squares estimator λ̂ given by (2.4) satisfies, for any λ ∈ Λ_1,

(3.10)  P{ |λ̂ − λ|_1 ≤ B_1 r_{n,M} M(λ) } ≥ 1 − π_{n,M}(λ),

where

π_{n,M}(λ) ≤ 14 M² exp( −c_1 n min{ r²_{n,M}/L_0, r_{n,M}/L², 1/(L_0 M²(λ)), 1/(L² M(λ)) } ) + exp( −(c_2 / (M(λ) L²(λ))) n r²_{n,M} ),

for some positive constants c_1, c_2 depending on c_0, C_f and b only, and a constant B_1 depending on c_0 and C_f.

Notice now that by (1.3) and under assumption (A3), λ* ∈ Λ_1. We therefore have the following corollary.

Corollary 1. Assume that (A1)–(A3) hold. Then P{ |λ̂_j − λ*_j| > B_1 r_{n,M} } ≤ π* for all 1 ≤ j ≤ M, where π* = π_{n,M}(λ*).

Proof. From ([6], Theorem 2.3) we obtain

1 − π* ≤ P{ |λ̂ − λ*|_1 ≤ B_1 k* r_{n,M} } ≤ P( min_{1≤j≤M} |λ̂_j − λ*_j| ≤ B_1 r_{n,M} ).

This immediately implies the result.

Remark 3.1. Notice that π* → 0 as n → ∞ for any r_{n,M} ≥ A √(log(Mn)/n), and for B = B_1, as needed in Proposition 2.2 in Section 2.2 above.

In order to control the probability (2.8) we first define U and U_1, the analogues of the sets Λ and Λ_1 defined above:

U = { µ ∈ ℜ^{k*} : ‖ f − Σ_{j∈I*} µ_j f_j ‖² ≤ C_f r²_{n,M} },  U_1 = { µ ∈ U : ρ(µ) M(µ) ≤ C }.

Recall that µ* is the vector in ℜ^{k*} obtained from λ* by deleting the zero entries. Then, since assumption (A3) implies max_{i∈I*} max_{j∈I*, j≠i} |ρ_M(i, j)| ≤ C/k*, and ‖f − Σ_{j=1}^M λ*_j f_j‖ = ‖f − Σ_{j∈I*} µ*_j f_j‖, we deduce that µ* ∈ U_1.
Therefore, using again ([6], Theorem 2.3), applied now to the dictionary {f_j}_{j∈I*} and the quantity µ̃ defined in (2.6) above, we obtain the following corollary.

Corollary 2. Assume that (A1)–(A3) hold. Then

(3.11)  P{ |µ̃ − µ*|_1 ≤ B_2 k* r_{n,M} } ≥ 1 − p*,

where

p* ≤ 14 k*² exp( −c_1 n min{ r²_{n,M}/L_0, r_{n,M}/L², 1/(L_0 k*²), 1/(k* L²) } ) + exp( −(c_2 / (k* L²(λ*))) n r²_{n,M} ),

for some positive constants c_1, c_2 as above and a constant B_2 > 0 that only depends on C_f and c_0.

Remark 3.2. If r_{n,M} ≥ A √(log(Mn)/n), then M p* → 0 as n → ∞, for A > 0 large enough. Hence the probability given by (2.8), summed over k, converges to zero for both choices of r_{n,M} introduced in Section 2, adjusting the value of B_2 if needed.

The following lemma is needed in the beginning of the proof of Proposition 2.3.

Lemma 3.4. λ̃ = (µ̃, 0) is a solution of (2.4) on the set

B = ∩_{k∉I*} { | (2/n) Σ_{i=1}^n [Y_i − Σ_{j∈I*} µ̃_j f_j(X_i)] f_k(X_i) | < 2 r_{n,M} ‖f_k‖_n }.

Proof. We recall that for any convex function g : ℜ^M → ℜ, the subdifferential of g at a point λ is the set

D_λ = { w ∈ ℜ^M : g(u) − g(λ) ≥ ⟨w, u − λ⟩ for all u ∈ ℜ^M }.

Let g(λ) = (1/n) Σ_{i=1}^n { Y_i − Σ_{j=1}^M λ_j f_j(X_i) }² + pen(λ), where we recall that our penalty term is pen(λ) = 2 r_{n,M} Σ_{j=1}^M ‖f_j‖_n |λ_j|. Then (e.g., [13]) we have

D_λ = { w ∈ ℜ^M : w = −(2/n) F′(Y − Fλ) + 2 r_{n,M} v },

where v ∈ ℜ^M is such that

v_k = ‖f_k‖_n, if λ_k > 0;  v_k = −‖f_k‖_n, if λ_k < 0;  v_k ∈ [−‖f_k‖_n, ‖f_k‖_n], if λ_k = 0,

and where we recall that Y = (Y_1, ..., Y_n) and F is the n × M matrix with elements f_j(X_i). By standard results in convex analysis, λ̄ ∈ ℜ^M is a point of local minimum for a convex function g if and only if 0 ∈ D_λ̄, where 0 ∈ ℜ^M.
Therefore, $\bar\lambda$ minimizes our $g(\lambda)$ if and only if $0 \in D_{\bar\lambda}$, if and only if

\[
\frac{2}{n} \big( F'(Y - F\bar\lambda) \big)_k = 2 r_{n,M} v_k \quad \text{for all } k \in \{1, \ldots, M\},
\]

where $(\cdot)_k$ above denotes the $k$-th component of the vector in parentheses. Equivalently, $\bar\lambda$ minimizes $g(\lambda)$ if and only if, for all $1 \leq k \leq M$,

\[
(3.12)\qquad
\left| \frac{2}{n} \sum_{i=1}^n \Big[ Y_i - \sum_{j=1}^M \bar\lambda_j f_j(X_i) \Big] f_k(X_i) \right| = 2 r_{n,M} \|f_k\|_n, \quad \text{if } \bar\lambda_k \neq 0,
\]
\[
\left| \frac{2}{n} \sum_{i=1}^n \Big[ Y_i - \sum_{j=1}^M \bar\lambda_j f_j(X_i) \Big] f_k(X_i) \right| \leq 2 r_{n,M} \|f_k\|_n, \quad \text{if } \bar\lambda_k = 0.
\]

In what follows we find conditions under which $\tilde\lambda = (\tilde\mu, 0)$, with $\tilde\mu$ given in (2.6) above, satisfies (3.12). First notice that, by definition, $\sum_{i=1}^n [Y_i - \sum_{j=1}^M \tilde\lambda_j f_j(X_i)] = \sum_{i=1}^n [Y_i - \sum_{j \in I^*} \tilde\mu_j f_j(X_i)]$. Since $\tilde\mu$ is a solution of (2.6) then, by the above standard results in convex analysis, applied now to the function $h(\lambda)$ defined in the proof of Proposition 2.3, the following hold:

\[
\left| \frac{2}{n} \sum_{i=1}^n \Big[ Y_i - \sum_{j \in I^*} \tilde\mu_j f_j(X_i) \Big] f_k(X_i) \right| = 2 r_{n,M} \|f_k\|_n, \quad \text{if } \tilde\lambda_k = \tilde\mu_k \neq 0, \; k \in I^*,
\]
\[
\left| \frac{2}{n} \sum_{i=1}^n \Big[ Y_i - \sum_{j \in I^*} \tilde\mu_j f_j(X_i) \Big] f_k(X_i) \right| \leq 2 r_{n,M} \|f_k\|_n, \quad \text{if } \tilde\lambda_k = \tilde\mu_k = 0, \; k \in I^*.
\]

Notice now that on the set $\mathcal{B}$ we also have

\[
\left| \frac{2}{n} \sum_{i=1}^n \Big[ Y_i - \sum_{j \in I^*} \tilde\mu_j f_j(X_i) \Big] f_k(X_i) \right| \leq 2 r_{n,M} \|f_k\|_n, \quad \text{if } k \notin I^* \ \text{(for which } \tilde\lambda_k = 0\text{)}.
\]

The above displays show that $\tilde\lambda$ satisfies condition (3.12) and is therefore a solution of (2.4) on $\mathcal{B}$.

Remark 3.3. The observation that constitutes the statement of the above lemma has also been made elsewhere [12] for a slightly different penalty term. We have included here a full derivation of it for completeness and clarity.

To complete the proof of Proposition 2.3 we will make repeated use of Bernstein's inequality, which we state here for completeness.

Bernstein's inequality. Let $\zeta_1, \ldots, \zeta_n$ be independent random variables such that

\[
\frac{1}{n} \sum_{i=1}^n E|\zeta_i|^m \leq \frac{m!}{2} w^2 d^{m-2}
\]

for some positive constants $w$ and $d$ and for all integers $m \geq 2$. Then, for any $\varepsilon > 0$ we have

\[
(3.13)\qquad P\left\{ \sum_{i=1}^n (\zeta_i - E\zeta_i) \geq n\varepsilon \right\} \leq \exp\left( -\frac{n\varepsilon^2}{2(w^2 + d\varepsilon)} \right).
\]

Lemma 3.5. Let assumptions (A1) and (A2) hold. Then

\[
\sum_{k \in \{1,\ldots,M\} \setminus I^*} P(E_1^c(k)) \to 0, \qquad \sum_{k \in \{1,\ldots,M\} \setminus I^*} P(E_2^c(k)) \to 0, \qquad \text{and} \qquad \sum_{k \in \{1,\ldots,M\} \setminus I^*} P(E_3^c(k)) \to 0,
\]

as $n \to \infty$.

Proof. To show $\sum_{k \in \{1,\ldots,M\} \setminus I^*} P(E_1^c(k)) \to 0$ it is enough to show that $(I) = \sum_{k \in \{1,\ldots,M\} \setminus I^*} P(E_1^c(k) \cap E_2(k)) \to 0$ and that $(II) = \sum_{k \in \{1,\ldots,M\} \setminus I^*} P(E_2^c(k)) \to 0$. The proofs follow immediately from Bernstein's inequality and the union bound. They are the same as ([6], proofs of Lemmas 4 and 5, page 186). We include here the derived probability bounds, for completeness:

\[
(I) \leq 2M^2 \exp\left( -\frac{n r_{n,M}^2}{16 b} \right) + 2M^2 \exp\left( -\frac{n r_{n,M} c_0}{8\sqrt{2}\, L} \right) + 2M^2 \exp\left( -\frac{n c_0^2}{12 L^2} \right),
\]

and

\[
(II) \leq M^2 \exp\left( -\frac{n c_0^2}{12 L^2} \right).
\]

To bound the last quantity in the statement of the lemma notice first that

\[
P(E_3^c(k)) \leq 2 \sum_{j \in I^*} P\left( \left| \frac{1}{n} \sum_{i=1}^n f_j(X_i) f_k(X_i) \right| > 2|\langle f_j, f_k \rangle| + \delta_{n,M} \right)
\]
\[
\leq 2 \sum_{j \in I^*} \exp\left( -\frac{n}{4 L_0} \big( |\langle f_j, f_k \rangle| + \delta_{n,M} \big)^2 \right) + 2 \sum_{j \in I^*} \exp\left( -\frac{n}{4 L} \big( |\langle f_j, f_k \rangle| + \delta_{n,M} \big) \right)
\]
\[
\leq 2M \exp\left( -\frac{n \delta_{n,M}^2}{4 L_0} \right) + 2M \exp\left( -\frac{n \delta_{n,M}}{4 L} \right).
\]

The second inequality of the display above follows from Bernstein's inequality with $\zeta_i = f_j(X_i) f_k(X_i)$, for every fixed $j$ and $k$, and with $w^2 = L_0$, $d = L$, for $\varepsilon = |\langle f_j, f_k \rangle| + \delta_{n,M}$, used together with the inequality $e^{-x/(a+b)} \leq e^{-x/(2a)} + e^{-x/(2b)}$, valid for all $x, a, b > 0$. Therefore, for $\delta_{n,M} = 2CL^2 r_{n,M}$ we obtain

\[
(III) = \sum_{k \in \{1,\ldots,M\} \setminus I^*} P(E_3^c(k)) \leq 2M^2 \exp\left( -\frac{C^2 L^4 n r_{n,M}^2}{L_0} \right) + 2M^2 \exp\left( -\frac{C L n r_{n,M}}{2} \right).
\]
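As an aside, Bernstein's inequality as stated above can be sanity-checked by simulation. In the sketch below the $\zeta_i$ are uniform on $[-1, 1]$, so $|\zeta_i| \leq 1$ and $E|\zeta_i|^m \leq E\zeta_i^2 \cdot 1^{m-2}$, and the moment condition holds with $w^2 = E\zeta_i^2 = 1/3$ and $d = 1$; all numerical choices here are illustrative.

```python
import math
import random

random.seed(2)

# zeta_i uniform on [-1, 1]: moment condition holds with w^2 = 1/3, d = 1.
n, reps = 200, 20000
w2, d = 1.0 / 3.0, 1.0
eps = 0.1

# Right-hand side of (3.13): exp(-n eps^2 / (2 (w^2 + d eps))).
bernstein_bound = math.exp(-n * eps ** 2 / (2.0 * (w2 + d * eps)))

# Monte Carlo estimate of the left-hand side P{ sum (zeta_i - E zeta_i) >= n eps }.
hits = 0
for _ in range(reps):
    s = sum(random.uniform(-1.0, 1.0) for _ in range(n))  # E zeta_i = 0
    if s >= n * eps:
        hits += 1
empirical = hits / reps

print(f"empirical tail {empirical:.4f} <= Bernstein bound {bernstein_bound:.4f}")
```

The empirical tail probability falls well below the bound, as expected: Bernstein's inequality is conservative but of the right exponential order in $n\varepsilon^2$, which is what drives the rates in Lemmas 3.5 and 3.6.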
Thus, the quantities $(I)$, $(II)$ and $(III)$ converge to zero for any $r_{n,M} \geq A\sqrt{\log(Mn)/n}$.

Lemma 3.6. Let assumptions (A1) and (A2) hold. Then

\[
(IV) = \sum_{k \in \{1,\ldots,M\} \setminus I^*} P\left( \frac{1}{n} \sum_{i=1}^n |f(X_i) - f^*(X_i)| \geq \frac{c_0}{8L} r_{n,M} \right) \to 0.
\]

Proof. By the Cauchy–Schwarz inequality we have

\[
(3.14)\qquad P\left( \frac{1}{n} \sum_{i=1}^n |f(X_i) - f^*(X_i)| \geq \frac{c_0}{8L} r_{n,M} \right) \leq P\left( \frac{1}{n} \sum_{i=1}^n (f(X_i) - f^*(X_i))^2 \geq \frac{c_0^2}{64 L^2} r_{n,M}^2 \right)
\]
\[
(3.15)\qquad \leq P\left( \sum_{i=1}^n \big\{ (f(X_i) - f^*(X_i))^2 - \|f - f^*\|^2 \big\} \geq n\Big( \frac{c_0^2}{64 L^2} r_{n,M}^2 - \|f - f^*\|^2 \Big) \right)
\]
\[
\leq P\left( \sum_{i=1}^n \big\{ (f(X_i) - f^*(X_i))^2 - \|f - f^*\|^2 \big\} \geq C_1 n r_{n,M}^2 \right),
\]

where we recall that $\|f - f^*\|^2 \leq C_f r_{n,M}^2$, by definition, and $C_1 = c_0^2/(64 L^2) - C_f$, where we assume that we have already adjusted $C_f$ to have $C_1 > 0$, by taking an appropriate constant $A$ in the definition of $r_{n,M}$, if needed. The proof follows immediately from Bernstein's inequality applied to $\zeta_i = (f(X_i) - f^*(X_i))^2$, with $w = \sqrt{C_f}\, r_{n,M}$ and $d = L^*$, and for $\varepsilon = C_1 r_{n,M}^2$. Therefore

\[
(IV) \leq M \exp\left( -\frac{C_1^2}{4 C_f}\, n r_{n,M}^2 \right) + M \exp\left( -\frac{C_1}{4 L^*}\, n r_{n,M}^2 \right),
\]

and both terms converge to zero for either choice of $r_{n,M}$.

References

[1] Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Automat. Control 19 716–723. MR0423716
[2] Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 301–413. MR1679028
[3] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple hypothesis testing. J. Roy. Statist. Soc. Ser. B 57 289–300. MR1325392
[4] Bunea, F. (2004). Consistent covariate selection and post model selection inference in semiparametric regression. Ann. Statist. 32 898–927. MR2065193
[5] Bunea, F., Wegkamp, M. H. and Auguste, A. (2006).
Consistent variable selection in high dimensional regression via multiple testing. J. Statist. Plann. Inference 136 4349–4364. MR2323420
[6] Bunea, F., Tsybakov, A. B. and Wegkamp, M. H. (2007). Sparsity oracle inequalities for the Lasso. Electronic J. Statist. 1 169–194. MR2312149
[7] Chakrabarti, A. and Ghosh, J. K. (2006). A generalization of BIC for the general exponential families. J. Statist. Plann. Inference 136 2847–2872. MR2281234
[8] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32 407–451. MR2060166
[9] Genovese, C. and Wasserman, L. (2004). A stochastic process approach to false discovery rates. Ann. Statist. 32 1035–1061. MR2065197
[10] Guyon, X. and Yao, J. (1999). On the underfitting and overfitting sets of models chosen by order selection criteria. J. Multivariate Anal. 70 221–315. MR1711522
[11] Lahiri, P., ed. (2001). Model Selection. Institute of Mathematical Statistics Lecture Notes – Monograph Series 38. IMS, Beachwood, OH. MR2000750
[12] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. Ann. Statist. 34 1436–1462. MR2278363
[13] Osborne, M. R., Presnell, B. and Turlach, B. A. (2000a). On the lasso and its dual. J. Comput. Graph. Statist. 9 319–337. MR1822089
[14] Osborne, M. R., Presnell, B. and Turlach, B. A. (2000b). A new approach to variable selection in least squares problems. IMA J. Numer. Anal. 20 389–404.
[15] Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464. MR0468014
[16] Shao, J. (1993). Linear model selection by cross validation. J. Amer. Statist. Assoc. 88 486–494. MR1224373
[17] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B 58 267–288. MR1379242
[18] Turlach, B. A. (2005).
On algorithms for solving least squares problems under an L1 penalty or an L1 constraint. 2004 Proceedings of the American Statistical Association, Statistical Computing Section [CD-ROM] 2572–2577. American Statistical Association, Alexandria, VA.
[19] Wainwright, M. J. (2007). Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. Technical report, Dept. Statistics, UC Berkeley.
[20] Wasserman, L. and Roeder, K. (2007). High dimensional variable selection. Technical report, Dept. Statistics, Carnegie Mellon Univ.
[21] Wegkamp, M. H. (2003). Model selection in nonparametric regression. Ann. Statist. 31 252–273. MR1962506
[22] Woodroofe, M. (1982). On model selection and the arcsine laws. Ann. Statist. 10 1182–1194. MR0673653
[23] Zhao, P. and Yu, B. (2007). On model selection consistency of Lasso. J. Machine Learning Research 7 2541–2567. MR2274449
[24] Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429. MR2279469