Consistent selection via the Lasso for high dimensional approximating regression models
IMS Collections: Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, Vol. 3 (2008) 122–137
© Institute of Mathematical Statistics, 2008
DOI: 10.1214/074921708000000101

Florentina Bunea*
Department of Statistics, Florida State University, Tallahassee, Florida, USA; e-mail: flori@stat.fsu.edu
* Supported in part by NSF Grants DMS-04-06049 and DMS-07-06829.

Abstract: In this article we investigate consistency of selection in regression models via the popular Lasso method. Here we depart from the traditional linear regression assumption and consider approximations of the regression function f with elements of a given dictionary of M functions. The target for consistency is the index set of those functions from this dictionary that realize the most parsimonious approximation to f among all linear combinations belonging to an L2 ball centered at f and of radius r²_{n,M}. In this framework we show that a consistent estimate of this index set can be derived via ℓ1 penalized least squares, with a data-dependent penalty and with tuning sequence r_{n,M} > √(log(Mn)/n), where n is the sample size. Our results hold for any 1 ≤ M ≤ n^γ, for any γ > 0.

AMS 2000 subject classifications: primary 62G08; secondary 62C20, 62G05, 62G20.
Keywords and phrases: consistency, high dimension, Lasso, ℓ1 regularization, regression, penalty, selection.

Contents
1. Introduction
   1.1. Beyond linear regression
2. Consistent selection via ℓ1 penalized least squares
   2.1. Main result: consistent subset selection
   2.2. Proof of Theorem 2.1
Appendix
References

1. Introduction

In this paper we show that the popular Lasso technique can be used for consistent feature selection in high dimensional approximating regression models. We consider the following framework. Given a random pair (X, Y), we let f(x) = E(Y | X = x) be the conditional mean function, henceforth called the regression function. We aim to reconstruct consistently a sparse approximation of f via linear combinations of elements of a given dictionary of functions F = {f_1, ..., f_M}. This reconstruction will be based on (X_1, Y_1), ..., (X_n, Y_n), a sample of independent random pairs distributed as (X, Y) ∈ (𝒳, ℜ), where 𝒳 is a Borel subset of ℜ^d; all functions f_j are defined on 𝒳. Our aim expresses the belief that, in many instances, even if M is large, only a subset of F may be needed to approximate f well. If that is the case, it may be of interest to determine whether this set can be estimated consistently via a computationally efficient method. The focus of this work is on consistent selection via the Lasso when the size of F grows polynomially with the sample size n, that is, M = n^γ, for any γ > 0.

We begin by giving a number of examples of dictionaries F and associated consistency issues.

1. If d = M and f_j(X) = X_j for all j, one may be interested in identifying the subset of variables with linear combinations close to f.
A familiar particular case is linear regression, where one assumes that f(X) = λ′X, with λ ∈ ℜ^M having non-zero components in positions corresponding to a set J* ⊆ {1, ..., M}. Here we depart from this traditional equality assumption and consider the more realistic case where f is not equal to, but can be well approximated by, a linear combination of the given variables. We discuss this in detail in the next section.

2. Another problem of interest is that of finding consistently a sparse linear approximation of f realized with elements from a large list of M possibly competing estimators. These estimates may correspond to M different methods of estimation, may be computed from M different samples with the same mean function, or may correspond to M different values of a tuning parameter of the same method. Instances of the latter arise in kernel based methods that require the choice of a grid of values for the bandwidth parameter, or in Bayesian methods, where the specification of a grid of values for hyper-parameters is needed. A consistent identification of a subset of the estimates in these examples would validate the use of a particular restriction on an initially large grid. In such situations, when the elements of F are estimators, we will assume that they have been computed on samples independent of the one used for subset selection and treat them here as fixed functions.

3. A last example is the nonparametric estimation of f from a collection of M given basis functions, where only a subset may realize a good approximation of f, as described in the following subsection.

There exist a number of model selection methods that yield consistent subset selection in regression models. In discussing them a number of distinctions are needed. The first one pertains to the evolution of the literature on model selection techniques in regression.
One important cut-off point in this evolution seems to be the computational complexity of a particular method and, within this, the size of M relative to n plays a crucial role. If M ≤ n, procedures based on various information criteria occupy an important place. They are referred to now as the BIC/AIC-type methods; we mention here the seminal works of [1] and [15], the unifying theory of [2], and various generalizations of these methods ([4], [7]). Such procedures can be easily implemented for small to moderate M. For larger values of M, multiple testing procedures, in particular of the FDR type (e.g., [3], [9]), or cross-validation with all its variants (holdout validation, K-fold) [21], are popular, but become more computationally complex as M increases. If M > n these techniques may become computationally intractable, unless they are used as part of a multiple-stage scheme. For a further overview of computational aspects in model selection, from a Bayesian perspective, see [11].

Whereas the above mentioned methods can still be used in very particular regression models when M > n, for instance for sequence-space models, where model selection via BIC is equivalent to hard thresholding, they typically fail, computationally, when M is large. A standard solution in this case is to seek estimates that solve a certain class of convex optimization problems. Among the most popular estimates of this type in regression is the penalized least squares estimate with an ℓ1-type penalty (Lasso), which we describe in detail in the next section. In a Bayesian framework it can be derived from a Gaussian likelihood with a Laplace prior.
Two important aspects set the ℓ1 regularized (Lasso) type estimators apart: they are easy and fast to compute (see [8], [13], [14], [18], among others, for efficient algorithms) and, if M > n, some components of the estimate will be set to zero in finite samples (see, e.g., [13]). Therefore, via a one-step, easily implementable procedure, one obtains subset selection even if M > n. To date, this method (or its variants) is the most widely used in regression problems of very high dimension, especially when dimension reduction is of interest.

The second distinction in discussing consistency of selection in regression is related to the target for consistency. Consistency of selection has been studied for all the aforementioned techniques only in the following context, which we term parametric: the target for selection is typically an index set J* corresponding to the non-zero true regression coefficients, whereas the remaining coefficients are assumed to be exactly zero. An estimation method that uses the data and all M elements f_j to yield a subset Î of indices such that P(Î = J*) → 1 for large n is called a consistent method of selection.

In light of these two distinctions we give below a summary of the existing results on consistency of selection. They have all been established for the traditional parametric target J*. If M ≤ n and under appropriate assumptions, all the above methods, or close variants, yield consistent subset selection for the parametric target J*. References include those for AIC/BIC-type methods ([4], [10], and [22], among others), multiple testing procedures [5], cross-validation procedures [16], and Lasso-type procedures [24]. If M > n, consistency of selection has only been studied for Lasso-type estimators. Again, in the existing literature, the target is the standard target J*.
The results are limited. Meinshausen and Bühlmann [12] showed that P(Î = J*) → 1 in Gaussian graphical models, under assumptions that are tailored to models for which, in our notation, (Y, X_1, ..., X_M) ∼ N(0, Σ). Consistency of selection has been established when M > n, for fixed design linear regression models and a target set J* that corresponds to coefficients λ*_j that are assumed to be lower bounded by a sequence of order O(n^{−δ/2}), for 0 < δ < 1 [23]. Similar results, under slightly different assumptions, have also been obtained for a three stage procedure [20]: in the first stage Lasso estimates are computed for a number of values of the tuning parameter, in the second step cross-validation is performed to select one Lasso estimate, and in the third one the model is refitted on the variables present in the selected Lasso estimate. We also refer to a related notion of consistency, in fixed design regression with Gaussian errors [19]. If M > n, consistent subset selection via the Lasso has not been investigated, to the best of our knowledge, in the general framework we describe in detail below. Within this framework, we extend the existing results to more general regression models on a random design and a more general target index set.

1.1. Beyond linear regression

Despite its practical appeal, the study of selection procedures that are consistent for target sets other than the classical one has received very little attention. Our target set will be defined relative to linear approximations of f with elements of F with respect to the L2(ν) norm ‖·‖, where we denote the probability measure of X by ν.
Formally, define

(1.1)  Λ = { λ ∈ ℜ^M : ‖ Σ_{j=1}^M λ_j f_j − f ‖² ≤ C_f r²_{n,M} },

where C_f > 0 is a constant depending only on f and r_{n,M} is a positive sequence that converges to zero and which will be specified in the next section. In what follows we assume that Λ is not void. For any λ ∈ ℜ^M we let J(λ) denote the index set corresponding to the non-zero components of λ, and denote by M(λ) its cardinality. Let k* = min{ M(λ) : λ ∈ Λ }. We define our target vector

(1.2)  λ* = argmin{ ‖ Σ_{j=1}^M λ_j f_j − f ‖² : λ ∈ ℜ^M, M(λ) = k* }.

Let I* = J(λ*) denote the index set corresponding to the non-zero elements of λ*, and note that I* has cardinality k*. Thus f* = Σ_{j∈I*} λ*_j f_j provides the sparsest approximation to f that can be realized with λ ∈ Λ and, in particular,

(1.3)  ‖ f* − f ‖² ≤ C_f r²_{n,M}.

This motivates our treating I* as the target index set. We note that if one assumes, as in standard linear regression models, that f(x) = Σ_{j=1}^M λ_j x_j = Σ_{j∈I*} λ*_j x_j = f*(x), where the λ*_j denote the non-zero components of λ, then (1.3) is trivially satisfied for any positive sequence r_{n,M}. Therefore, the classical target J* is a particular case of ours.

In order to ensure that λ* captures the essential features of f in a parsimonious way, we require that its components not be unnecessarily small; otherwise we can place their indices outside I*. Formally, we will require that the following condition holds.

Condition (C). There exists B > 0, independent of n or M, such that min_{j∈I*} |λ*_j| > B r_{n,M}.

We show below that ℓ1 penalized least squares can be used to estimate consistently the new target I*, even if M is larger than n, in particular if it grows as n^γ, for any γ > 0, under minimal assumptions on the dictionary F and appropriate choices for r_{n,M}. In Section 2 below we introduce the estimate and discuss these choices.
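The definitions above can be made concrete on a toy problem: for a small dictionary, the target pair (k*, I*) can be found by brute force, fitting least squares on every candidate support and keeping the smallest support whose error enters the L2 ball around f. The sketch below is purely illustrative (the dictionary, the grid approximation of the L2(ν) norm, and the radius are assumptions made here, not quantities from the paper):

```python
import itertools
import numpy as np

# Dictionary of M = 4 functions on [0, 1]; the L2(nu) norm (nu = uniform) is
# approximated by an empirical norm over a fine grid, purely for illustration.
x = np.linspace(0.0, 1.0, 2001)
F = np.column_stack([np.ones_like(x), x, x**2, np.sin(2 * np.pi * x)])
M = F.shape[1]

# f is "almost" a 2-sparse combination: a tiny x^2 component is also present.
f = 2.0 * F[:, 0] - 1.0 * F[:, 1] + 0.01 * F[:, 2]

def sq_norm(g):
    """Empirical stand-in for the squared L2(nu) norm ||g||^2."""
    return np.mean(g**2)

radius_sq = 1e-3   # plays the role of C_f * r_{n,M}^2 (illustrative value)

# Brute force over supports: k* is the smallest support size whose best
# linear combination lands inside the ball of squared radius radius_sq.
best = None
for k in range(M + 1):
    for S in itertools.combinations(range(M), k):
        if k == 0:
            err = sq_norm(f)
        else:
            cols = list(S)
            coef, *_ = np.linalg.lstsq(F[:, cols], f, rcond=None)
            err = sq_norm(f - F[:, cols] @ coef)
        if err <= radius_sq:
            best = (k, set(S), err)
            break
    if best:
        break

k_star, I_star, err = best
print(k_star, sorted(I_star))
```

Here f has a tiny third component, so no 2-sparse combination reproduces it exactly, yet the sparsest approximation inside the ball uses only the first two dictionary elements: k* = 2 and I* corresponds to indices 0 and 1 in the code.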
Section 2.1 contains our main result, Theorem 2.1, together with a discussion of the assumptions under which it holds. The proof of the main result is given in Section 2.2, and intermediate results are proved in the Appendix.

2. Consistent selection via ℓ1 penalized least squares

We estimate the set I* of the previous section via ℓ1 penalized least squares. We first compute

(2.4)  λ̂ = argmin_{λ∈ℜ^M} [ (1/n) Σ_{i=1}^n { Y_i − Σ_{j=1}^M λ_j f_j(X_i) }² + pen(λ) ],

where

(2.5)  pen(λ) = 2 Σ_{j=1}^M ω_{n,j} |λ_j|, with ω_{n,j} = r_{n,M} ‖f_j‖_n,

for a sequence r_{n,M} given below, and where we write ‖g‖²_n = n^{−1} Σ_{i=1}^n g²(X_i) for any function g : 𝒳 → ℜ. We note that each λ_j in the penalty term has a different, data-dependent, weight.

The estimate λ̂ thus obtained is in one-to-one correspondence with the following estimate. For each 1 ≤ j ≤ M, define θ_j = 2 ω_{n,j} λ_j and let A be the M × M diagonal matrix with diagonal entries 2 ω_{n,j}. Next observe that Fλ = F_1 θ, where F is the n × M matrix with entries f_j(X_i), F_1 = F A^{−1} and θ = Aλ. Thus, denoting by Y the n dimensional vector with entries Y_i, the problem reduces to calculating

θ̂ = argmin_{θ∈ℜ^M} [ (1/n) (Y − F_1 θ)′(Y − F_1 θ) + Σ_{j=1}^M |θ_j| ],

for which the aforementioned fast algorithms can be used. Then we compute our sought solution λ̂ = A^{−1} θ̂. We let Î denote the index set corresponding to the non-zero components of λ̂. We show in the next subsection that P(Î = I*) → 1 when n → ∞.

We begin by noticing that we always have

P(Î = I*) ≥ 1 − P(I* ⊈ Î) − P(Î ⊈ I*).

Therefore, proving that Î is consistent reduces to showing that each of the probabilities on the right-hand side of the inequality above converges to zero. In what follows we motivate choices for the sequence r_{n,M} that stem from sufficient conditions under which this convergence is achieved.
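The estimator (2.4)–(2.5) can be sketched numerically. The snippet below solves the weighted ℓ1 problem directly by coordinate descent rather than through the rescaling F_1 = FA^{−1} described above; the synthetic data, the constant A = 1 in the tuning sequence, and the choice of solver are all illustrative assumptions, not prescriptions from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: M = 10 dictionary elements f_j(X) = X_j, true support {0, 1}.
n, M = 400, 10
X = rng.normal(size=(n, M))
Y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * rng.normal(size=n)

# Tuning sequence r_{n,M} = A * sqrt(log(M n) / n), with the arbitrary
# illustrative choice A = 1.
r = np.sqrt(np.log(M * n) / n)
col_norms = np.sqrt(np.mean(X**2, axis=0))   # ||f_j||_n
w = r * col_norms                            # data-dependent weights omega_{n,j}

def weighted_lasso(F, Y, w, n_sweeps=200):
    """Coordinate descent for (1/n)||Y - F lam||^2 + 2 sum_j w_j |lam_j|."""
    n, M = F.shape
    a = np.mean(F**2, axis=0)                # ||f_j||_n^2
    lam = np.zeros(M)
    resid = Y.copy()
    for _ in range(n_sweeps):
        for j in range(M):
            resid += lam[j] * F[:, j]        # remove coordinate j from the fit
            z = np.mean(F[:, j] * resid)     # (1/n) <f_j, partial residual>
            lam[j] = np.sign(z) * max(abs(z) - w[j], 0.0) / a[j]   # soft-threshold
            resid -= lam[j] * F[:, j]
    return lam

lam_hat = weighted_lasso(X, Y, w)
support = set(np.flatnonzero(lam_hat))
print(support)
```

With the soft-thresholding update, every coordinate whose empirical correlation with the partial residual falls below its weight ω_{n,j} is set exactly to zero, which is the finite-sample sparsity property referred to in Section 1.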
The proofs are presented in the next section. We begin by noticing that if λ̂ → λ* with probability converging to one, then I* ⊈ Î with probability converging to zero. To see this, further note that if component-wise consistency of λ̂ holds, we will estimate all non-zero elements of λ* by non-zero sequences, but we may also estimate some of its zero components by some small, but non-zero, sequences. In light of this fact, a first set of restrictions on r_{n,M} will be such that λ̂ is close to λ*, in the sense below. It follows immediately (by [6], Theorem 2.3; see the Appendix below for a full formulation) that, with high probability,

r_{n,M} |λ̂ − λ*|_1 ≤ D { ‖f − f*‖² + k* r²_{n,M} },

for some positive constant D, where |a|_1 = Σ_{j=1}^M |a_j| denotes the ℓ1 norm of any vector in ℜ^M.

Next, notice that the optimal parametric rate of convergence for a component λ̂_j of λ̂ is of order 1/√n, and it could be achieved if we knew I*, of cardinality k* < M, in advance. However, this is not known, so the best we can do is mimic this behavior in our context. We can do this by choosing r_{n,M} of order 1/√n, where we recall that we have assumed that ‖f − f*‖² ≤ r²_{n,M}. Notice further that this choice is optimal for the rate of convergence of λ̂, which is not the focus here. Indeed, more modest rates of convergence of λ̂ can be considered when consistency of selection is of main importance. We discuss in detail two concrete choices, and defer a complete analysis to future work. One can consider r_{n,M} = A √(log(Mn)/n), for an appropriately large constant A > 0. Notice that this choice differs from the one that yields the optimal rate only by logarithmic factors, which are needed to accommodate dictionaries with M > n.
With this choice, the target set I* corresponds to linear combinations of the elements of F that belong, up to logarithmic factors, to a 1/√n neighborhood of f, with respect to the L2(ν) norm. This provides only a slight departure from the standard linear model assumption and standard target index set J*. It is therefore not surprising that, in this case, our tuning sequence r_{n,M} is also comparable to the one considered in parametric models ([12], [23]), where a sequence of the order of 1/n^{1/2−θ}, θ ∈ (0, 1/2), is employed. We note that this choice is slightly conservative, and can be relaxed to O(√(log(Mn)/n)) in our framework, and therefore, as a particular case, in theirs.

In order to accommodate consistent selection in a purely nonparametric framework we need to increase the size of r_{n,M}. For instance, if all f_j are estimates of f, and r_{n,M} is as before, the set Λ defined in (1.1) may be empty, as nonparametric estimates of f typically have rates slower than 1/√n. We therefore consider target sets I* corresponding to L2(ν) neighborhoods around f of radius r²_{n,M}, now with r_{n,M} = O((log(Mn)/n)^{1/4}). In this case, the set Λ given in (1.1) above is not empty if at least one of the estimators f_j has, up to logarithmic factors, a rate of the order n^{−1/4}, which is a modest rate to require. Of course, if f_j(X) = X_j, as in linear regression, this choice means that we may be content with a coarser approximation than before. However, note that this approximation has the benefit of being realized with a smaller number of variables, and that this may increase the interpretability of that particular model and be a desirable property in practical situations.
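The two tuning radii can be compared numerically. The arithmetic below, with the illustrative constant A = 1, shows that the nonparametric choice (log(Mn)/n)^{1/4} always yields the larger, coarser radius whenever log(Mn)/n < 1:

```python
import numpy as np

# Compare the two tuning radii from Section 2 for a few (n, M = n^gamma)
# pairs, taking the leading constant A = 1 purely for illustration.
for n in (100, 10_000):
    for gamma in (1, 2):
        M = n**gamma
        base = np.log(M * n) / n
        r_param = np.sqrt(base)     # parametric-type choice, sqrt(log(Mn)/n)
        r_nonpar = base**0.25       # nonparametric choice, (log(Mn)/n)^(1/4)
        print(f"n={n:>6} M={M:>9} r_param={r_param:.4f} r_nonpar={r_nonpar:.4f}")
        assert r_nonpar > r_param   # log(Mn)/n < 1 here, so the 1/4 power is larger
```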
The results presented below hold for either of these choices, in particular for any r_{n,M} ≥ A √(log(Mn)/n), and we will therefore not distinguish between them.

2.1. Main result: consistent subset selection

We begin by listing and commenting on the assumptions under which our result holds. The first assumption refers to the error terms W_i = Y_i − f(X_i). We recall that f(X) = E(Y | X).

Assumption (A1). The random variables X_1, ..., X_n are independent, identically distributed random variables with probability measure ν. The random variables W_i are independently distributed with E{W_i | X_1, ..., X_n} = 0 and E{exp(|W_i|) | X_1, ..., X_n} ≤ b for some finite b > 0 and i = 1, ..., n.

We also impose mild conditions on f and on the functions f_j. Let ‖g‖_∞ = sup_{x∈𝒳} |g(x)| for any function g on 𝒳.

Assumption (A2).
(a) There exists 0 < L < ∞ such that ‖f_j‖_∞ ≤ L for all 1 ≤ j ≤ M.
(b) There exists c_0 > 0 such that ‖f_j‖ ≥ c_0 for all 1 ≤ j ≤ M.
(c) There exists L_0 < ∞ such that E[f_i²(X) f_j²(X)] ≤ L_0 for all 1 ≤ i, j ≤ M.
(d) There exists L_1 < ∞ such that ‖f‖_∞ ≤ L_1 < ∞.
(e) There exists L* < ∞ such that ‖f − f*‖_∞ ≤ L*.

Remark 2.1. We note that (a) trivially implies (c). However, as the implied bound may be too large, we opted for stating (c) separately. Note also that (a) and (d) imply the following: for any fixed λ ∈ ℜ^M, there exists a positive constant L(λ), depending on λ, such that ‖f − Σ_{j=1}^M λ_j f_j‖_∞ = L(λ). Inspection of the proof of Theorem 2.1 below shows that we can allow L* to grow very slowly with n. However, for the sake of clarity in presentation, we opted for treating it as fixed.

Assumption (A3). Let ρ_M(i, j) = ⟨f_i, f_j⟩ / (‖f_i‖ ‖f_j‖), where ⟨f_i, f_j⟩ = E f_i(X) f_j(X) and ‖f_i‖ = E^{1/2} f_i²(X).
Assume that max_{i∈I*} max_{j≠i} |ρ_M(i, j)| ≤ C/k*, for some constant C > 0.

Remark 2.2. Following [6], C = 1/45 is an allowable choice. Other choices are possible, but improvement of constants is beyond the scope of this paper.

Remark 2.3. Assumption (A3) reflects the belief that the correlations between functions f_j with j ∈ I* and functions f_j with j ∉ I* should be small. However, we allow the correlations outside I* to be arbitrary. We note that this assumption replaces the standard orthonormality assumption on the design matrix: it is given in terms of theoretical quantities and it can hold even if M > n. It can be checked in practice by replacing the theoretical correlations by sample correlations.

We denote by G the event that the n × M matrix F with entries f_j(X_i) has full rank. To avoid additional technicalities, the results of this paper can be regarded as conditional on G. Otherwise, all the results can be re-derived by intersecting all the relevant events with G and G^c, under the additional assumption that P(G^c) is appropriately small.

We can now state our main result, which we prove in the next subsection.

Theorem 2.1. If assumptions (A1)–(A3) and Condition (C) hold, and k* r_{n,M} → 0, then P(Î = I*) → 1.

Remark 2.4. The convergence above holds either if M is fixed and n → ∞, or if both M, n → ∞, provided r_{n,M} ≥ A √(log(Mn)/n) for an appropriately large constant A. Therefore we obtain consistency for both choices of r_{n,M} discussed above. In our derivations we require that M does not grow faster than a power of n.

Remark 2.5. The condition r_{n,M} k* → 0 imposes restrictions on the size of k*. If r_{n,M} = O(√(log(Mn)/n)), the theorem above shows that we can recover consistently subsets of size k* = O(√n / log n), up to other logarithmic factors.
The choice r_{n,M} = O((log(Mn)/n)^{1/4}) corresponds to a coarser approximation of f than before, and the restriction on the number of approximating functions is now k* = O(n^{1/4} / log n).

2.2. Proof of Theorem 2.1

Recall that

P(Î = I*) ≥ 1 − P(I* ⊈ Î) − P(Î ⊈ I*).

Therefore, proving that Î is consistent reduces to showing that each of the probabilities on the right-hand side of the inequality above converges to zero. We present this in the following two propositions. We defer the proofs of the intermediate results to the Appendix.

Proposition 2.2. If assumptions (A1)–(A3) and Condition (C) hold, and r_{n,M} k* → 0, then P(I* ⊈ Î) → 0 as n → ∞, for any r_{n,M} ≥ A √(log(Mn)/n), with A > 0 large enough.

Proof. We follow the same reasoning as [4]. Let c_n = min_{k∈I*} |λ*_k| and recall that c_n > B r_{n,M}, by Condition (C). Therefore

P(I* ⊈ Î) ≤ P(j ∉ Î for some j ∈ I*) ≤ P(|λ̂_j − λ*_j| = |λ*_j| for some j ∈ I*) ≤ P(|λ̂_j − λ*_j| ≥ c_n for some j ∈ I*) → 0, as n → ∞,

where, in the second inequality, we used that λ̂_j = 0 for j ∉ Î, by the definition of Î. The convergence to zero follows from Corollary 1, presented in the Appendix below.

Proposition 2.3. If assumptions (A1)–(A3) hold and r_{n,M} k* → 0, then P(Î ⊈ I*) → 0 as n → ∞, for any r_{n,M} ≥ A √(log(Mn)/n), with A > 0 large enough.

Proof. Let

h(µ) = (1/n) Σ_{i=1}^n { Y_i − Σ_{j∈I*} µ_j f_j(X_i) }² + 2 r_{n,M} Σ_{j∈I*} ‖f_j‖_n |µ_j|,

and define

(2.6)  µ̃ = argmin_{µ∈ℜ^{k*}} h(µ).

Let

B = ∩_{k∉I*} { | (2/n) Σ_{i=1}^n [Y_i − Σ_{j∈I*} µ̃_j f_j(X_i)] f_k(X_i) | < 2 r_{n,M} ‖f_k‖_n }.

Let λ̃ ∈ ℜ^M be the vector that has the components of µ̃ in positions corresponding to the index set I*, and components equal to zero otherwise. Thus, by abuse of notation, λ̃ = (µ̃, 0).
From Lemma 3.4 in the Appendix it follows that, on the set B, λ̃ is a solution of (2.4). Recall that λ̂ is a solution of (2.4) by construction. Then, by arguments similar to those used in ([13], Theorems 3.1 and 3.2) regarding the closeness of two solutions, it follows that, on the set B, λ̂_k = 0 for k ∈ I*^c. Therefore Î ⊆ I* on the set B. Hence

P(Î ⊈ I*) ≤ P(B^c)
= P( ∪_{k∈{1,...,M}∖I*} { | (2/n) Σ_{i=1}^n [Y_i − Σ_{j∈I*} µ̃_j f_j(X_i)] f_k(X_i) | ≥ 2 r_{n,M} ‖f_k‖_n } )
≤ Σ_{k∈{1,...,M}∖I*} P( | (2/n) Σ_{i=1}^n [Y_i − Σ_{j∈I*} µ̃_j f_j(X_i)] f_k(X_i) | ≥ 2 r_{n,M} ‖f_k‖_n ).

Let k ∈ {1, ..., M} ∖ I* be fixed. Define the sets

E_1(k) = { (1/n) | Σ_{i=1}^n W_i f_k(X_i) | < r_{n,M} ‖f_k‖_n / 2 },
E_2(k) = { ‖f_k‖²_n ≥ ‖f_k‖² / 4 },
E_3(k) = { | (1/n) Σ_{i=1}^n f_j(X_i) f_k(X_i) | ≤ 2 |⟨f_j, f_k⟩| + δ_{n,M} for all j ∈ I* },

where δ_{n,M} = 2 C L² r_{n,M}. The choice of δ_{n,M} is purely technical and does not affect the overall results. Let f̃ = Σ_{j∈I*} µ̃_j f_j. Recall that λ* ∈ ℜ^M given by (1.2) has zero components in positions corresponding to indices in I*^c, by definition. Let µ* be the vector in ℜ^{k*} obtained from λ* by deleting these zeros. Therefore f* = Σ_{j=1}^M λ*_j f_j = Σ_{j∈I*} µ*_j f_j. By successive applications of the triangle inequality, and since ‖f_k‖_n ≤ L for all k ∈ I*^c by assumption (A2)(a), we obtain:

(2.7)
P( (1/n) | Σ_{i=1}^n [Y_i − Σ_{j∈I*} µ̃_j f_j(X_i)] f_k(X_i) | ≥ r_{n,M} ‖f_k‖_n )
≤ P( (1/n) | Σ_{i=1}^n W_i f_k(X_i) | ≥ r_{n,M} ‖f_k‖_n / 2 )
  + P( (1/n) | Σ_{i=1}^n (f(X_i) − f̃(X_i)) f_k(X_i) | ≥ r_{n,M} ‖f_k‖_n / 2 )
≤ P(E_1^c(k)) + P( (1/n) | Σ_{i=1}^n (f*(X_i) − f̃(X_i)) f_k(X_i) | ≥ r_{n,M} ‖f_k‖_n / 4 )
  + P( (1/n) Σ_{i=1}^n |f(X_i) − f*(X_i)| ≥ r_{n,M} ‖f_k‖_n / 4L )
≤ P(E_1^c(k)) + P( | Σ_{j∈I*} (µ̃_j − µ*_j) (1/n) Σ_{i=1}^n f_j(X_i) f_k(X_i) | ≥ r_{n,M} ‖f_k‖_n / 4 )
  + P( (1/n) Σ_{i=1}^n |f(X_i) − f*(X_i)| ≥ r_{n,M} ‖f_k‖_n / 4L ).

To bound the second term in the last inequality above, we first notice that on the set E_3(k), and under assumptions (A2)(a) and (A3), we have

| Σ_{j∈I*} (µ̃_j − µ*_j) (1/n) Σ_{i=1}^n f_j(X_i) f_k(X_i) |
≤ 2 Σ_{j∈I*} |µ̃_j − µ*_j| |⟨f_j, f_k⟩| + δ_{n,M} Σ_{j∈I*} |µ̃_j − µ*_j|
≤ (2 C L² / k*) |µ̃ − µ*|_1 + δ_{n,M} |µ̃ − µ*|_1.

Therefore, on E_2(k) ∩ E_3(k), and under assumptions (A2)(a), (A2)(b) and (A3), we have

(2.8)
P( | Σ_{j∈I*} (µ̃_j − µ*_j) (1/n) Σ_{i=1}^n f_j(X_i) f_k(X_i) | ≥ r_{n,M} ‖f_k‖_n / 4 )
≤ P( |µ̃ − µ*|_1 ≥ (c_0 / 32CL²) k* r_{n,M} ) + P( |µ̃ − µ*|_1 ≥ (c_0 / 16) r_{n,M} δ_{n,M}^{−1} )
≤ 2 P( |µ̃ − µ*|_1 ≥ (c_0 / 32CL²) k* r_{n,M} ),

for n large enough, since the assumption k* r_{n,M} → 0 implies that k* r_{n,M} ≤ 1 for large n, and we recall that we defined δ_{n,M} = 2 C L² r_{n,M}. Lastly, notice that on the set E_2(k), and under assumptions (A2)(b) and (A2)(e), the third term of the last inequality in display (2.7) can be bounded by

(2.9)  P( (1/n) Σ_{i=1}^n |f(X_i) − f*(X_i)| ≥ (c_0 / 8L) r_{n,M} ).

To complete the proof we need to show that P(E_1^c(k)), P(E_2^c(k)) and P(E_3^c(k)), and the probabilities in (2.8) and (2.9), when summed over k ∈ {1, ..., M} ∖ I*, converge to zero as n → ∞. We show this in Lemma 3.5, Corollary 2 and Lemma 3.6, respectively, in the Appendix below. This completes the proof of this result.

Appendix

In order to show Proposition 2.2 and to bound (2.8) above we will use twice ([6], Theorem 2.3, page 177), and we begin by stating it here, for completeness.
For any λ ∈ ℜ^M we let J(λ) denote the index set corresponding to the non-zero components of λ and denote by M(λ) its cardinality. Let

ρ(λ) = max_{i∈J(λ)} max_{j≠i} |ρ_M(i, j)|.

With Λ given by (1.1) in Section 1.1, let Λ_1 = { λ ∈ Λ : ρ(λ) ≤ C / M(λ) }.

Theorem 2.3 ([6]). Assume that (A1) and (A2) hold. Then the ℓ1 penalized least squares estimator λ̂ given by (2.4) satisfies, for any λ ∈ Λ_1,

(3.10)  P{ |λ̂ − λ|_1 ≤ B_1 r_{n,M} M(λ) } ≥ 1 − π_{n,M}(λ),

where

π_{n,M}(λ) ≤ 14 M² exp( −c_1 n min{ r²_{n,M}/L_0, r_{n,M}/L², 1/(L_0 M²(λ)), 1/(L² M(λ)) } ) + exp( −(c_2 / (M(λ) L²(λ))) n r²_{n,M} ),

for some positive constants c_1, c_2 depending on c_0, C_f and b only, and a constant B_1 depending on c_0 and C_f.

Notice now that by (1.3) and under assumption (A3), λ* ∈ Λ_1. We therefore have the following corollary.

Corollary 1. Assume that (A1)–(A3) hold. Then P{ |λ̂_j − λ*_j| > B_1 r_{n,M} } ≤ π* for all 1 ≤ j ≤ M, where π* = π_{n,M}(λ*).

Proof. From ([6], Theorem 2.3) we obtain

1 − π* ≤ P{ |λ̂ − λ*|_1 ≤ B_1 k* r_{n,M} } ≤ P( min_{1≤j≤M} |λ̂_j − λ*_j| ≤ B_1 r_{n,M} ).

This immediately implies the result.

Remark 3.1. Notice that π* → 0 as n → ∞ for any r_{n,M} ≥ A √(log(Mn)/n), and for B = B_1, as needed in Proposition 2.2 in Section 2.2 above.

In order to control the probability (2.8) we first define U and U_1, the analogues of the sets Λ and Λ_1 defined above:

U = { µ ∈ ℜ^{k*} : ‖ f − Σ_{j∈I*} µ_j f_j ‖² ≤ C_f r²_{n,M} },  U_1 = { µ ∈ U : ρ(µ) M(µ) ≤ C }.

Recall that µ* is the vector in ℜ^{k*} obtained from λ* by deleting the zero entries. Then, since assumption (A3) implies max_{i∈I*} max_{j∈I*, j≠i} |ρ_M(i, j)| ≤ C/k*, and ‖f − Σ_{j=1}^M λ*_j f_j‖ = ‖f − Σ_{j∈I*} µ*_j f_j‖, we deduce that µ* ∈ U_1.
Therefore, using again ([6], Theorem 2.3), applied now to the dictionary {f_j}_{j∈I*} and the quantity µ̃ defined in (2.6) above, we obtain the following corollary.

Corollary 2. Assume that (A1)–(A3) hold. Then

(3.11)  P{ |µ̃ − µ*|_1 ≤ B_2 k* r_{n,M} } ≥ 1 − p*,

where

p* ≤ 14 k*² exp( −c_1 n min{ r²_{n,M}/L_0, r_{n,M}/L², 1/(L_0 k*²), 1/(k* L²) } ) + exp( −(c_2 / (k* L²(λ*))) n r²_{n,M} ),

for some positive constants c_1, c_2 as above and a constant B_2 > 0 that only depends on C_f and c_0.

Remark 3.2. If r_{n,M} ≥ A √(log(Mn)/n), then M p* → 0 as n → ∞, for A > 0 large enough. Hence the probability given by (2.8), summed over k, converges to zero for both choices of r_{n,M} introduced in Section 2, adjusting the value of B_2 if needed.

The following lemma is needed in the beginning of the proof of Proposition 2.3.

Lemma 3.4. λ̃ = (µ̃, 0) is a solution of (2.4) on the set

B = ∩_{k∉I*} { | (2/n) Σ_{i=1}^n [Y_i − Σ_{j∈I*} µ̃_j f_j(X_i)] f_k(X_i) | < 2 r_{n,M} ‖f_k‖_n }.

Proof. We recall that for any convex function g : ℜ^M → ℜ, the subdifferential of g at a point λ is the set

D_λ = { w ∈ ℜ^M : g(u) − g(λ) ≥ ⟨w, u − λ⟩ for all u ∈ ℜ^M }.

Let g(λ) = (1/n) Σ_{i=1}^n { Y_i − Σ_{j=1}^M λ_j f_j(X_i) }² + pen(λ), where we recall that our penalty term is pen(λ) = 2 r_{n,M} Σ_{j=1}^M ‖f_j‖_n |λ_j|. Then (e.g., [13]) we have

D_λ = { w ∈ ℜ^M : w = −(2/n) F′(Y − Fλ) + 2 r_{n,M} v },

where v ∈ ℜ^M is such that

v_k = ‖f_k‖_n, if λ_k > 0;  v_k = −‖f_k‖_n, if λ_k < 0;  v_k ∈ [−‖f_k‖_n, ‖f_k‖_n], if λ_k = 0,

and where we recall that Y = (Y_1, ..., Y_n) and F is the n × M matrix with elements f_j(X_i). By standard results in convex analysis, λ̄ ∈ ℜ^M is a point of local minimum for a convex function g if and only if 0 ∈ D_λ̄, where 0 ∈ ℜ^M.
Therefore, $\bar\lambda$ minimizes our $g(\lambda)$ if and only if $0 \in D_{\bar\lambda}$, if and only if

\[
\frac{2}{n} \big( F'(Y - F\bar\lambda) \big)_k = 2 r_{n,M} v_k \quad \text{for all } k \in \{1, \ldots, M\},
\]

where $(\cdot)_k$ above denotes the $k$-th component of the vector in parentheses. Equivalently, $\bar\lambda$ minimizes $g(\lambda)$ if and only if, for all $1 \leq k \leq M$,

\[
(3.12)\qquad
\left| \frac{2}{n} \sum_{i=1}^n \Big[ Y_i - \sum_{j=1}^M \bar\lambda_j f_j(X_i) \Big] f_k(X_i) \right| = 2 r_{n,M} \|f_k\|_n, \quad \text{if } \bar\lambda_k \neq 0,
\]
\[
\left| \frac{2}{n} \sum_{i=1}^n \Big[ Y_i - \sum_{j=1}^M \bar\lambda_j f_j(X_i) \Big] f_k(X_i) \right| \leq 2 r_{n,M} \|f_k\|_n, \quad \text{if } \bar\lambda_k = 0.
\]

In what follows we find conditions under which $\tilde\lambda = (\tilde\mu, 0)$, with $\tilde\mu$ given in (2.6) above, satisfies (3.12). First notice that, by definition, $\sum_{i=1}^n [Y_i - \sum_{j=1}^M \tilde\lambda_j f_j(X_i)] = \sum_{i=1}^n [Y_i - \sum_{j \in I^*} \tilde\mu_j f_j(X_i)]$. Since $\tilde\mu$ is a solution of (2.6) then, by the above standard results in convex analysis, applied now to the function $h(\lambda)$ defined in the proof of Proposition 2.3, the following hold:

\[
\left| \frac{2}{n} \sum_{i=1}^n \Big[ Y_i - \sum_{j \in I^*} \tilde\mu_j f_j(X_i) \Big] f_k(X_i) \right| = 2 r_{n,M} \|f_k\|_n, \quad \text{if } \tilde\lambda_k = \tilde\mu_k \neq 0, \; k \in I^*,
\]
\[
\left| \frac{2}{n} \sum_{i=1}^n \Big[ Y_i - \sum_{j \in I^*} \tilde\mu_j f_j(X_i) \Big] f_k(X_i) \right| \leq 2 r_{n,M} \|f_k\|_n, \quad \text{if } \tilde\lambda_k = \tilde\mu_k = 0, \; k \in I^*.
\]

Notice now that on the set $\mathcal{B}$ we also have

\[
\left| \frac{2}{n} \sum_{i=1}^n \Big[ Y_i - \sum_{j \in I^*} \tilde\mu_j f_j(X_i) \Big] f_k(X_i) \right| \leq 2 r_{n,M} \|f_k\|_n, \quad \text{if } k \notin I^* \ \text{(for which } \tilde\lambda_k = 0\text{)}.
\]

The above displays show that $\tilde\lambda$ satisfies condition (3.12) and is therefore a solution of (2.4) on $\mathcal{B}$.

Remark 3.3. The observation that constitutes the statement of the above lemma has also been made elsewhere [12] for a slightly different penalty term. We have included here a full derivation of it for completeness and clarity.

To complete the proof of Proposition 2.3 we will make repeated use of Bernstein's inequality, which we state here for completeness.

Bernstein's inequality. Let $\zeta_1, \ldots, \zeta_n$ be independent random variables such that

\[
\frac{1}{n} \sum_{i=1}^n E|\zeta_i|^m \leq \frac{m!}{2} w^2 d^{m-2}
\]

for some positive constants $w$ and $d$ and for all integers $m \geq 2$. Then, for any $\varepsilon > 0$ we have

\[
(3.13)\qquad P\left\{ \sum_{i=1}^n (\zeta_i - E\zeta_i) \geq n\varepsilon \right\} \leq \exp\left( -\frac{n\varepsilon^2}{2(w^2 + d\varepsilon)} \right).
\]

Lemma 3.5. Let assumptions (A1) and (A2) hold. Then

\[
\sum_{k \in \{1,\ldots,M\} \setminus I^*} P(E_1^c(k)) \to 0, \qquad \sum_{k \in \{1,\ldots,M\} \setminus I^*} P(E_2^c(k)) \to 0, \qquad \text{and} \qquad \sum_{k \in \{1,\ldots,M\} \setminus I^*} P(E_3^c(k)) \to 0,
\]

as $n \to \infty$.

Proof. To show $\sum_{k \in \{1,\ldots,M\} \setminus I^*} P(E_1^c(k)) \to 0$ it is enough to show that $(I) = \sum_{k \in \{1,\ldots,M\} \setminus I^*} P(E_1^c(k) \cap E_2(k)) \to 0$ and that $(II) = \sum_{k \in \{1,\ldots,M\} \setminus I^*} P(E_2^c(k)) \to 0$. The proofs follow immediately from Bernstein's inequality and the union bound. They are the same as ([6], proofs of Lemmas 4 and 5, page 186). We include here the derived probability bounds, for completeness:

\[
(I) \leq 2M^2 \exp\left( -\frac{n r_{n,M}^2}{16 b} \right) + 2M^2 \exp\left( -\frac{n r_{n,M} c_0}{8\sqrt{2}\, L} \right) + 2M^2 \exp\left( -\frac{n c_0^2}{12 L^2} \right),
\]

and

\[
(II) \leq M^2 \exp\left( -\frac{n c_0^2}{12 L^2} \right).
\]

To bound the last quantity in the statement of the lemma notice first that

\[
P(E_3^c(k)) \leq 2 \sum_{j \in I^*} P\left( \left| \frac{1}{n} \sum_{i=1}^n f_j(X_i) f_k(X_i) \right| > 2|\langle f_j, f_k \rangle| + \delta_{n,M} \right)
\]
\[
\leq 2 \sum_{j \in I^*} \exp\left( -\frac{n}{4 L_0} \big( |\langle f_j, f_k \rangle| + \delta_{n,M} \big)^2 \right) + 2 \sum_{j \in I^*} \exp\left( -\frac{n}{4 L} \big( |\langle f_j, f_k \rangle| + \delta_{n,M} \big) \right)
\]
\[
\leq 2M \exp\left( -\frac{n \delta_{n,M}^2}{4 L_0} \right) + 2M \exp\left( -\frac{n \delta_{n,M}}{4 L} \right).
\]

The second inequality of the display above follows from Bernstein's inequality with $\zeta_i = f_j(X_i) f_k(X_i)$, for every fixed $j$ and $k$, and with $w^2 = L_0$, $d = L$, for $\varepsilon = |\langle f_j, f_k \rangle| + \delta_{n,M}$, used together with the inequality $e^{-x/(a+b)} \leq e^{-x/(2a)} + e^{-x/(2b)}$, valid for all $x, a, b > 0$. Therefore, for $\delta_{n,M} = 2CL^2 r_{n,M}$ we obtain

\[
(III) = \sum_{k \in \{1,\ldots,M\} \setminus I^*} P(E_3^c(k)) \leq 2M^2 \exp\left( -\frac{C^2 L^4 n r_{n,M}^2}{L_0} \right) + 2M^2 \exp\left( -\frac{C L n r_{n,M}}{2} \right).
\]
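As an aside, Bernstein's inequality as stated above can be sanity-checked by simulation. In the sketch below the $\zeta_i$ are uniform on $[-1, 1]$, so $|\zeta_i| \leq 1$ and $E|\zeta_i|^m \leq E\zeta_i^2 \cdot 1^{m-2}$, and the moment condition holds with $w^2 = E\zeta_i^2 = 1/3$ and $d = 1$; all numerical choices here are illustrative.

```python
import math
import random

random.seed(2)

# zeta_i uniform on [-1, 1]: moment condition holds with w^2 = 1/3, d = 1.
n, reps = 200, 20000
w2, d = 1.0 / 3.0, 1.0
eps = 0.1

# Right-hand side of (3.13): exp(-n eps^2 / (2 (w^2 + d eps))).
bernstein_bound = math.exp(-n * eps ** 2 / (2.0 * (w2 + d * eps)))

# Monte Carlo estimate of the left-hand side P{ sum (zeta_i - E zeta_i) >= n eps }.
hits = 0
for _ in range(reps):
    s = sum(random.uniform(-1.0, 1.0) for _ in range(n))  # E zeta_i = 0
    if s >= n * eps:
        hits += 1
empirical = hits / reps

print(f"empirical tail {empirical:.4f} <= Bernstein bound {bernstein_bound:.4f}")
```

The empirical tail probability falls well below the bound, as expected: Bernstein's inequality is conservative but of the right exponential order in $n\varepsilon^2$, which is what drives the rates in Lemmas 3.5 and 3.6.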
Thus, the quantities $(I)$, $(II)$ and $(III)$ converge to zero for any $r_{n,M} \geq A\sqrt{\log(Mn)/n}$.

Lemma 3.6. Let assumptions (A1) and (A2) hold. Then

\[
(IV) = \sum_{k \in \{1,\ldots,M\} \setminus I^*} P\left( \frac{1}{n} \sum_{i=1}^n |f(X_i) - f^*(X_i)| \geq \frac{c_0}{8L} r_{n,M} \right) \to 0.
\]

Proof. By the Cauchy–Schwarz inequality we have

\[
(3.14)\qquad P\left( \frac{1}{n} \sum_{i=1}^n |f(X_i) - f^*(X_i)| \geq \frac{c_0}{8L} r_{n,M} \right) \leq P\left( \frac{1}{n} \sum_{i=1}^n (f(X_i) - f^*(X_i))^2 \geq \frac{c_0^2}{64 L^2} r_{n,M}^2 \right)
\]
\[
(3.15)\qquad \leq P\left( \sum_{i=1}^n \big\{ (f(X_i) - f^*(X_i))^2 - \|f - f^*\|^2 \big\} \geq n\Big( \frac{c_0^2}{64 L^2} r_{n,M}^2 - \|f - f^*\|^2 \Big) \right)
\]
\[
\leq P\left( \sum_{i=1}^n \big\{ (f(X_i) - f^*(X_i))^2 - \|f - f^*\|^2 \big\} \geq C_1 n r_{n,M}^2 \right),
\]

where we recall that $\|f - f^*\|^2 \leq C_f r_{n,M}^2$, by definition, and $C_1 = c_0^2/(64 L^2) - C_f$, where we assume that we have already adjusted $C_f$ to have $C_1 > 0$, by taking an appropriate constant $A$ in the definition of $r_{n,M}$, if needed. The proof follows immediately from Bernstein's inequality applied to $\zeta_i = (f(X_i) - f^*(X_i))^2$, with $w = \sqrt{C_f}\, r_{n,M}$ and $d = L^*$, and for $\varepsilon = C_1 r_{n,M}^2$. Therefore

\[
(IV) \leq M \exp\left( -\frac{C_1^2}{4 C_f}\, n r_{n,M}^2 \right) + M \exp\left( -\frac{C_1}{4 L^*}\, n r_{n,M}^2 \right),
\]

and both terms converge to zero for either choice of $r_{n,M}$.

References

[1] Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Automat. Control 19 716–723. MR0423716
[2] Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 301–413. MR1679028
[3] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple hypothesis testing. J. Roy. Statist. Soc. Ser. B 57 289–300. MR1325392
[4] Bunea, F. (2004). Consistent covariate selection and post model selection inference in semiparametric regression. Ann. Statist. 32 898–927. MR2065193
[5] Bunea, F., Wegkamp, M. H. and Auguste, A. (2006).
Consistent variable selection in high dimensional regression via multiple testing. J. Statist. Plann. Inference 136 4349–4364. MR2323420
[6] Bunea, F., Tsybakov, A. B. and Wegkamp, M. H. (2007). Sparsity oracle inequalities for the Lasso. Electronic J. Statist. 1 169–194. MR2312149
[7] Chakrabarti, A. and Ghosh, J. K. (2006). A generalization of BIC for the general exponential families. J. Statist. Plann. Inference 136 2847–2872. MR2281234
[8] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32 407–451. MR2060166
[9] Genovese, C. and Wasserman, L. (2004). A stochastic process approach to false discovery rates. Ann. Statist. 32 1035–1061. MR2065197
[10] Guyon, X. and Yao, J. (1999). On the underfitting and overfitting sets of models chosen by order selection criteria. J. Multivariate Anal. 70 221–315. MR1711522
[11] Lahiri, P., ed. (2001). Model Selection. Institute of Mathematical Statistics Lecture Notes – Monograph Series 38. IMS, Beachwood, OH. MR2000750
[12] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. Ann. Statist. 34 1436–1462. MR2278363
[13] Osborne, M. R., Presnell, B. and Turlach, B. A. (2000a). On the lasso and its dual. J. Comput. Graph. Statist. 9 319–337. MR1822089
[14] Osborne, M. R., Presnell, B. and Turlach, B. A. (2000b). A new approach to variable selection in least squares problems. IMA J. Numer. Anal. 20 389–404.
[15] Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464. MR0468014
[16] Shao, J. (1993). Linear model selection by cross validation. J. Amer. Statist. Assoc. 88 486–494. MR1224373
[17] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B 58 267–288. MR1379242
[18] Turlach, B. A. (2005).
On algorithms for solving least squares problems under an L1 penalty or an L1 constraint. 2004 Proceedings of the American Statistical Association, Statistical Computing Section [CD-ROM] 2572–2577. American Statistical Association, Alexandria, VA.
[19] Wainwright, M. J. (2007). Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. Technical report, Dept. Statistics, UC Berkeley.
[20] Wasserman, L. and Roeder, K. (2007). High dimensional variable selection. Technical report, Dept. Statistics, Carnegie Mellon Univ.
[21] Wegkamp, M. H. (2003). Model selection in nonparametric regression. Ann. Statist. 31 252–273. MR1962506
[22] Woodroofe, M. (1982). On model selection and the arcsine laws. Ann. Statist. 10 1182–1194. MR0673653
[23] Zhao, P. and Yu, B. (2007). On model selection consistency of Lasso. J. Machine Learning Research 7 2541–2567. MR2274449
[24] Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429. MR2279469