The Loss Rank Criterion for Variable Selection in Linear Regression Analysis


Author: Minh-Ngoc Tran

Department of Statistics and Applied Probability
National University of Singapore, Singapore 117546
ngoctm@nus.edu.sg

May 27, 2022

Abstract

Lasso and other regularization procedures are attractive methods for variable selection, subject to a proper choice of shrinkage parameter. Given a set of potential subsets produced by a regularization algorithm, a consistent model selection criterion is proposed to select the best one among this preselected set. The approach leads to a fast and efficient procedure for variable selection, especially in high-dimensional settings. Model selection consistency of the suggested criterion is proven when the number of covariates $d$ is fixed. Simulation studies suggest that the criterion still enjoys model selection consistency when $d$ is much larger than the sample size. The simulations also show that our approach to variable selection works surprisingly well in comparison with existing competitors. The method is also applied to a real data set.

Keywords: model selection, lasso, loss rank principle, shrinkage parameter, variable selection.

Acknowledgments: The author would like to thank the editor, an associate editor and two anonymous reviewers for careful reading and constructive comments which helped improve the paper greatly. The author is also grateful to David J. Nott for helpful suggestions which led to a better presentation of the paper.

1 Introduction

Variable selection is probably the most fundamental and important topic in linear regression analysis. We consider the case where a large number (even larger than the sample size) of candidate covariates are introduced at the initial stage of modeling. One then has to select a smaller subset of the covariates to fit and interpret the data. If the number of potential covariates is not so large (as small as 30), one may use subset selection to select significant variables (Miller, 1990). However, with a large number of covariates, searching over the model space is computationally infeasible. Lasso (Tibshirani, 1996) and other regularization procedures (e.g., the adaptive lasso of Zou (2006), SCAD of Fan and Li (2001)) are successful methods for overcoming this problem. A Lasso-type procedure estimates the regression coefficient vector $\beta$ by minimizing the sum of the squared error and a regularization term

$$ \|y - X\beta\|^2 + \lambda T(\beta), \qquad (1) $$

where $X$ is an $(n \times d)$ non-random design matrix, $y$ is an $n$-vector of responses, and $\lambda \ge 0$ is a shrinkage parameter that controls the amount of regularization. The regularization function $T(\beta)$ can take different forms according to different regularization procedures. The original and most popular one, used in Lasso, is the $l_1$ norm $T(\beta) = \sum_{j=1}^d |\beta_j|$. As $\lambda$ increases, the coefficients are continuously shrunk towards 0. When $\lambda$ is sufficiently large, some coefficients are shrunk to exactly 0, thus leading to sparse solutions. This feature makes Lasso-type procedures very attractive for variable selection.
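To make the shrinkage path concrete, the following minimal sketch traces a Lasso path with scikit-learn's `lars_path` (an implementation of the LARS algorithm of Efron et al. (2004), which is also used in Section 4); the simulated data, seed and dimensions are illustrative assumptions, not anything from the paper.

```python
# Sketch: how Lasso coefficients are shrunk to exact zero as lambda grows,
# traced with scikit-learn's LARS implementation of the Lasso path.
# The simulated data and all names here are illustrative assumptions.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, d = 100, 8
X = rng.standard_normal((n, d))
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.standard_normal(n)

# alphas: penalty values at which the active set changes;
# coefs[:, k] is the Lasso solution at penalty alphas[k].
alphas, _, coefs = lars_path(X, y, method="lasso")
for lam, b in zip(alphas, coefs.T):
    print(f"lambda = {lam:7.3f}, nonzero indices = {np.flatnonzero(b)}")
```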
Indeed, the model selection consistency of such procedures has been shown (Zhao and Yu, 2006; Meinshausen and Buhlmann, 2006; Fan and Li, 2001): under some conditions, there exists a "proper" sequence of shrinkage parameters $\{\lambda_n\}$ under which

$$ \{j : \hat\beta^{\lambda_n}_j \neq 0\} = S_T \quad \text{w.p.1 when the sample size } n \text{ is large enough}, \qquad (2) $$

where $\hat\beta^{\lambda_n} = (\hat\beta^{\lambda_n}_1, \ldots, \hat\beta^{\lambda_n}_d)^\top$ is the regularized estimator of $\beta$ with shrinkage parameter $\lambda_n$, and $S_T$ is the true model, i.e., $S_T$ is the index set of the true covariates. Therefore, it is convenient to use Lasso-type procedures for variable selection purposes. The remaining problem in practice is how to choose such a proper $\lambda_n$.

A widely used criterion is the generalized cross-validation criterion (GCV) (Craven and Wahba, 1979; Tibshirani, 1996). However, theoretical properties of GCV for choosing $\lambda$ for the purpose of variable selection have not been investigated yet. Furthermore, for choosing the shrinkage parameter of the SCAD method (Fan and Li, 2001), a regularization method closely related to Lasso, GCV seems likely to choose shrinkage parameters that produce overfitted models (Wang et al., 2007). Zou et al. (2007) showed that the number of nonzero coefficients is an unbiased estimate of the degrees of freedom of the lasso. As a result, popular model selection criteria such as AIC, BIC and $C_p$ can be used for selecting $\lambda$. However, theoretical properties of the selected model remain unknown. The main contribution of this paper is to propose a criterion for selecting shrinkage parameters so that regularization procedures produce the true model. (Throughout this paper, the true model is assumed to exist. We note, however, that whether or not the true model exists is still a controversial issue in the model selection literature; see Burnham and Anderson (2002).)

Although regularization procedures can be used for simultaneous variable selection and estimation, it seems to be impossible to tune the shrinkage parameter to achieve both model selection consistency and optimal estimation at the same time. For an orthogonal design, Leng et al. (2006) showed that the Lasso estimator that is optimal in terms of estimation does not give consistent model selection. This fact was also shown by Poetscher and Leeb (2009) for other regularized estimators. In this paper we are primarily concerned with the problem of variable selection, i.e., we use a Lasso-type procedure to produce a set of potential subsets and then select the best one among this preselected set using a model selection criterion. It was brought to our attention by a reviewer that the idea of using Lasso as a "selector" was also briefly mentioned in Friedman (2008) and Efron et al. (2004). They discussed an approach for reducing estimation bias in the non-zero estimated coefficients in which the Lasso (with some method for choosing the shrinkage parameter) is used as a subset selector and then a different unpenalized procedure is used to estimate the coefficients w.r.t. the selected covariates. Our approach is to select the best subset of covariates, using a model selection criterion as the stopping rule, among a preselected set of potential subsets produced by a Lasso-type procedure. The preselected set consists of at most $d$ subsets, rather than the $2^d$ possible subsets examined by all-subsets selection.
After selecting the best subset, we can of course use an unpenalized procedure to estimate the coefficients in order to reduce estimation bias.

The model selection criterion we use is derived from the loss rank principle (LoRP), a general-purpose principle for model selection introduced recently by Hutter (2007) and Hutter and Tran (2010). LoRP selects the model that has the smallest loss rank. The loss rank of a model is defined as the number of other "fictitious" data sets that fit the model better than the training data (see Section 2 for a formal introduction). It was shown by Hutter and Tran (2010) that minimizing the loss rank is a suitable criterion for model selection, since it trades off between the quality of fit and the model complexity. LoRP seems to be a promising principle with enormous potential, leading to a rich field. Tran (2009) demonstrated the use of LoRP for selecting the ridge parameter in ridge regression. Tran and Hutter (2010) adapted the idea of LoRP for model selection in a classification context and showed its close connection with excellent model selection techniques based on Rademacher complexities (Koltchinskii, 2001; Bartlett et al., 2002). In this paper, we shall show that LoRP also applies successfully to selecting the shrinkage parameter for the purpose of variable selection.

The main contribution of this paper is to propose a criterion, called the loss rank (LR) criterion, for selecting shrinkage parameters for variable selection purposes. As long as the regularization procedure in use has the consistency property (2), the shrinkage parameter selected by the LR criterion will produce the true model asymptotically with probability 1. This model selection consistency of the proposed criterion will be proven theoretically in the case where the number of covariates $d$ is fixed and smaller than $n$. For cases with $d \gg n$, our simulation study suggests that this property still holds. The simulation also shows that our method for variable selection works surprisingly well. Benefiting from fast $l_1$-regularization algorithms, our method is able to correctly identify significant variables from thousands of candidates in several CPU seconds.

The paper is organized as follows. The main idea of LoRP is briefly reviewed in Section 2. The LR criterion is derived and its model selection consistency is proven in Section 3. Simulation studies and a real-data application are presented in Section 4. Section 5 contains the conclusions and outlook. The proofs are relegated to the appendix.

2 The loss rank principle

In this section, we give a brief review of the loss rank principle (LoRP). The reader is referred to Hutter (2007) and Hutter and Tran (2010) for the details.

Let us consider a training data set $D = (x, y) = \{(x_1, y_1), \ldots, (x_n, y_n)\} \in (\mathcal{X} \times \mathcal{Y})^n$ from a regression model $y_i = f(x_i) + \epsilon_i$. We first consider discrete $\mathcal{Y}$. Suppose that we use a model $M$ to fit the data $D$; e.g., $M$ is a linear regression model with $d$ covariates, or $M$ is a $k$-nearest neighbors regression model. Imagine that in experimental situations we can conduct the experiment many times with fixed design points $x$. We would then get many other (fictitious) outputs $y'$.
Observe that if the model $M$ is complex/flexible (large $d$, small $k$), then $M$ fits the training data $(x, y)$ well, and it also fits $(x, y')$ well (with respect to some loss function). Here, for simplicity, we only consider the squared loss $\mathrm{Loss}_M(y|x) = \|y - \hat y\|^2 = \sum_{i=1}^n (y_i - \hat y_i)^2$, where $\hat y$ is the fitted vector under model $M$. Therefore the loss rank of $M$, defined by

$$ \mathrm{Rank}_M(D) := \#\{y' \in \mathcal{Y}^n : \mathrm{Loss}_M(y'|x) \le \mathrm{Loss}_M(y|x)\}, $$

will be large for complex $M$. Conversely, as argued by Hutter and Tran (2010), if $M$ is small/rigid, then both $\mathrm{Loss}_M(y|x)$ and $\mathrm{Loss}_M(y'|x)$ are large, which also leads to a large loss rank. Thus, it is natural to choose a model with the smallest loss rank as a good model. By doing this, we trade off between the fit and the model complexity. In the case of continuous $\mathcal{Y}$, say $\mathcal{Y} = \mathbb{R}$, it is natural to use the concept of volume instead of the counting measure in the definition of the loss rank, i.e.,

$$ \mathrm{Rank}_M(D) := \mathrm{Vol}\{y' \in \mathbb{R}^n : \mathrm{Loss}_M(y'|x) \le \mathrm{Loss}_M(y|x)\}. $$

Consider the case of linear models, where the fitted vector $\hat y$ is linear in $y$, i.e., $\hat y = M(x)y$, where the regression matrix $M = M(x)$ depends only on $x$ (using the same symbol $M$ for both the model and the regression matrix will not cause any confusion in the following). Then $\mathrm{Loss}_M(y|x) = \|y - My\|^2 = y^\top A y$ with $A = (I_n - M)^\top (I_n - M)$, and the loss rank is

$$ \mathrm{Rank}_M(D) = \mathrm{Vol}\{y' \in \mathbb{R}^n : y'^\top A y' \le L\}, $$

where $L := y^\top A y$. Suppose for the moment that $\det(A) \neq 0$. The set $\{y' \in \mathbb{R}^n : y'^\top A y' \le L\}$ is an ellipsoid in $\mathbb{R}^n$, so its volume is

$$ \mathrm{Rank}_M(D) = \frac{v_n L^{n/2}}{\sqrt{\det A}}, $$

where $v_n = \pi^{n/2}/\Gamma(\frac{n}{2}+1)$ is the volume of the unit sphere in $\mathbb{R}^n$. Because the logarithm is monotone increasing and $v_n$ depends only on $n$, it is equivalent to consider

$$ \mathrm{LR}_M(D) = \frac{n}{2}\log(y^\top A y) - \frac{1}{2}\log\det A. \qquad (3) $$

Principle 1. Given a class of linear models $\mathcal{M} = \{M\}$, the best model among $\mathcal{M}$ is the one with the smallest loss rank,

$$ M_{\mathrm{best}} = \arg\min_{M \in \mathcal{M}} \Big\{ \frac{n}{2}\log(y^\top A y) - \frac{1}{2}\log\det A \Big\}, \qquad (4) $$

where $A = (I_n - M)^\top (I_n - M)$, provided that $\det A > 0$.

Now we consider the case where $\det A = 0$ (e.g., projective regression) or $A$ is nearly singular (e.g., ridge regression when the ridge parameter is very close to 0). In such cases, $\mathrm{Rank}_M(D)$ is infinite or extremely large. Following the principle of ridge regression, in order to prevent the loss rank from being infinite or extremely large, we add a small penalty $\alpha\|y\|^2$ to the loss:

$$ \mathrm{Loss}^\alpha_M(y|x) := \|\hat y - y\|^2 + \alpha\|y\|^2 = y^\top S_\alpha y, \qquad S_\alpha = A + \alpha I_n, $$

where $\alpha > 0$ is a small number to be determined later. Now, $S_\alpha$ being non-singular yields

$$ \mathrm{Rank}^\alpha_M(D) = \mathrm{Vol}\{y' \in \mathbb{R}^n : y'^\top S_\alpha y' \le L\} = \frac{v_n L^{n/2}}{\sqrt{\det S_\alpha}}, $$

where $L := y^\top S_\alpha y$. Taking the logarithm and neglecting a constant independent of $M$, we define the loss rank of model $M$ (dependent on $\alpha$) as

$$ \mathrm{LR}^\alpha_M(D) = \frac{n}{2}\log(y^\top S_\alpha y) - \frac{1}{2}\log\det S_\alpha. \qquad (5) $$

How do we deal with the extra parameter $\alpha$? We are seeking a model of smallest loss rank, so it is natural to minimize $\mathrm{LR}^\alpha_M(D)$ over $\alpha$ (see Hutter and Tran (2010) for a more detailed interpretation). Therefore, we finally define the loss rank of model $M$ as

$$ \mathrm{LR}_M(D) = \inf_{\alpha>0} \mathrm{LR}^\alpha_M(D) = \inf_{\alpha>0} \Big\{ \frac{n}{2}\log(y^\top S_\alpha y) - \frac{1}{2}\log\det S_\alpha \Big\}. \qquad (6) $$

Principle 2 (Hutter and Tran, 2010). Given a class of linear models $\mathcal{M} = \{M\}$, the best model among $\mathcal{M}$ is the one with the smallest loss rank,

$$ M_{\mathrm{best}} = \arg\min_{M \in \mathcal{M}} \mathrm{LR}_M(D) = \arg\min_{M \in \mathcal{M}} \inf_{\alpha>0} \Big\{ \frac{n}{2}\log(y^\top S_\alpha y) - \frac{1}{2}\log\det S_\alpha \Big\}, \qquad (7) $$

where $S_\alpha = A + \alpha I_n = (I_n - M)^\top (I_n - M) + \alpha I_n$.
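To make Principle 2 concrete, here is a minimal sketch that evaluates the loss rank (6)-(7) for a toy family of linear smoothers, namely polynomial regressions of increasing degree; the simulated data, the optimization of $\alpha$ on the log scale and the search bounds are our own illustrative choices, not part of the paper.

```python
# Sketch of Principle 2: LR_M(D) = inf_{a>0} { n/2 log(y'S_a y)
# - 1/2 log det S_a } for linear smoothers y_hat = M y, here polynomial
# fits of increasing degree. Data and optimization bounds are assumptions.
import numpy as np
from scipy.optimize import minimize_scalar

def loss_rank(M, y):
    n = len(y)
    I = np.eye(n)
    A = (I - M).T @ (I - M)
    def lr(log_a):                      # optimize over log(alpha) for stability
        S = A + np.exp(log_a) * I
        _, logdet = np.linalg.slogdet(S)
        return 0.5 * n * np.log(y @ S @ y) - 0.5 * logdet
    return minimize_scalar(lr, bounds=(-25.0, 5.0), method="bounded").fun

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(50)
for degree in range(1, 8):
    Xd = np.vander(x, degree + 1)               # polynomial design matrix
    M = Xd @ np.linalg.solve(Xd.T @ Xd, Xd.T)   # hat (projection) matrix
    print(degree, round(loss_rank(M, y), 2))    # pick the smallest loss rank
```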
Many attractive properties of LoRP have been pointed out: LoRP reduces to Bayesian model selection in some special cases and has an interpretation in terms of the MDL principle (Hutter and Tran, 2010); in the classification context, LoRP has a close connection with (and in some cases works better than) model selection techniques based on Rademacher complexities (Tran and Hutter, 2010). By virtue of LoRP, the loss rank criterion for selecting the Lasso parameter will be derived in the next section.

3 The LR criterion

Let us go back to the variable selection problem in linear regression analysis. We build on the notation of Hutter and Tran (2010). Given a (large) set of $d$ potential covariates $X_1, \ldots, X_d$ and a response variable $Y$, we consider the problem of choosing the important covariates for explaining $Y$ as a linear function of $X_1, \ldots, X_d$. It is assumed as usual that the covariates are linearly independent and that $E(Y|X_1,\ldots,X_d)$ is a linear combination of $X_1, \ldots, X_d$ with some of the coefficients being zero. We are primarily interested in identifying the non-zero coefficients.

Suppose that the response vector $y$ and the design matrix $X$ have been centered, so that the intercept is omitted from the models. Denote by $S_T = \{j^*_1, \ldots, j^*_{d^*}\}$ and $S = \{j_1, \ldots, j_{d_0}\}$ the true model and a candidate model, respectively. Under model $S$, we can write $y = X\beta_S + \sigma\epsilon$, where $E(\epsilon) = 0$, $\mathrm{cov}(\epsilon) = I_n$, $\sigma > 0$. We shall consider $\beta_S = (\beta_{S1}, \ldots, \beta_{Sd})^\top$ as a point in $\mathbb{R}^d$ with $\beta_{Sj} = 0$ if $j \notin S$. Denote by $\Theta(S) := \{\theta_S = (\beta_S, \sigma^2) \in \mathbb{R}^d \times \mathbb{R}^+\}$ the parameter space of model $S$, and by $X_S$ the $(n \times d_0)$ design matrix obtained from $X$ by removing the $j$th column for all $j \notin S$.

3.1 The LR criterion

Let $\hat\beta^\lambda = (\hat\beta^\lambda_1, \ldots, \hat\beta^\lambda_d)^\top$ be the regularized estimator of $\beta$ w.r.t. a certain shrinkage parameter $\lambda$, i.e., $\hat\beta^\lambda$ is the solution of (1). Denote by $S_\lambda = \{j : \hat\beta^\lambda_j \neq 0\}$ the index set corresponding to the non-zero coefficients, by $\mathrm{df}_\lambda = |S_\lambda|$ the number of non-zero coefficients, and by $X_{S_\lambda}$ the design matrix corresponding to the selected covariates. We assume for the moment that $\mathrm{df}_\lambda \le n$ and further assume that the matrices $X_{S_\lambda}$ are of full rank. The case where $\mathrm{df}_\lambda > n$ will be dealt with later on.

Fitting model $S_\lambda$ by least squares, we denote the OLS estimator and the variance estimator by

$$ \hat\beta_{S_\lambda} = (X^\top_{S_\lambda} X_{S_\lambda})^{-1} X^\top_{S_\lambda} y; \qquad \hat\sigma^2_{S_\lambda} = \frac{1}{n}\|y - X_{S_\lambda}\hat\beta_{S_\lambda}\|^2, $$

respectively. The fitted vector under model $S_\lambda$,

$$ \hat y_{S_\lambda} = X_{S_\lambda}\hat\beta_{S_\lambda} = M_{S_\lambda} y \quad \text{with} \quad M_{S_\lambda} := X_{S_\lambda}(X^\top_{S_\lambda} X_{S_\lambda})^{-1} X^\top_{S_\lambda}, $$

is, conditionally on $S_\lambda$, linear in $y$. (Strictly speaking, $\hat y_{S_\lambda}$ is not linear in $y$ because $S_\lambda$ depends on $y$. However, we can consider the preselected subsets $S_\lambda$ as fixed models; if instead we first derive the LR criterion for a general fixed model $S$ and then apply it to $S_\lambda$, we get the same results.) Then from (5), the loss rank of model $S_\lambda$ with parameter $\alpha$ is

$$ \mathrm{LR}^\alpha_\lambda \equiv \mathrm{LR}^\alpha_{S_\lambda} = \frac{n}{2}\log(y^\top S^\lambda_\alpha y) - \frac{1}{2}\log\det(S^\lambda_\alpha), $$

where $S^\lambda_\alpha = (I - M_{S_\lambda})^\top(I - M_{S_\lambda}) + \alpha I = (1+\alpha)I - M_{S_\lambda}$. Because the projection matrix $M_{S_\lambda}$ has $\mathrm{df}_\lambda$ eigenvalues 1 and $n - \mathrm{df}_\lambda$ eigenvalues 0, $S^\lambda_\alpha$ has $\mathrm{df}_\lambda$ eigenvalues $\alpha$ and $n - \mathrm{df}_\lambda$ eigenvalues $1+\alpha$. Thus, $\det S^\lambda_\alpha = \alpha^{\mathrm{df}_\lambda}(1+\alpha)^{n-\mathrm{df}_\lambda}$.
Letting $\rho_\lambda := \|y - \hat y_{S_\lambda}\|^2 / \|y\|^2$, we have

$$ \mathrm{LR}^\alpha_\lambda = \frac{n}{2}\log(y^\top y) + \frac{n}{2}\log(\rho_\lambda + \alpha) - \frac{\mathrm{df}_\lambda}{2}\log\alpha - \frac{n - \mathrm{df}_\lambda}{2}\log(1+\alpha). $$

Taking the derivative w.r.t. $\alpha$, it is easy to see that $\mathrm{LR}^\alpha_\lambda$ is minimized at

$$ \alpha_m = \frac{\rho_\lambda\,\mathrm{df}_\lambda}{(1-\rho_\lambda)n - \mathrm{df}_\lambda}, $$

provided that $1 - \rho_\lambda > \mathrm{df}_\lambda/n$. This condition is ensured by Assumption (A3) below. Finally, after some algebra, the loss rank of model $S_\lambda$ as defined in (6) can be explicitly expressed as

$$ \mathrm{LR}_\lambda = \mathrm{LR}^{\alpha_m}_\lambda = \frac{n}{2}\log\|y\|^2 - \frac{n}{2}\,\mathrm{KL}\Big(\frac{\mathrm{df}_\lambda}{n} \,\Big\|\, 1-\rho_\lambda\Big), \qquad (8) $$

where $\mathrm{KL}(p\|q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$ is the Kullback-Leibler divergence between the Bernoulli distributions with parameters $p, q \in (0,1)$. The optimal shrinkage parameter(s) $\lambda$ (for variable selection purposes) chosen by the LR criterion will be

$$ \hat\lambda_{\mathrm{LR}} \in \arg\min_{\lambda \ge 0} \mathrm{LR}_\lambda = \arg\max_{\lambda \ge 0} \mathrm{KL}\Big(\frac{\mathrm{df}_\lambda}{n} \,\Big\|\, 1-\rho_\lambda\Big). \qquad (9) $$

Often, $\mathrm{LR}_\lambda$ reaches its minimum on an interval $(\hat\lambda_l, \hat\lambda_u)$ (see Figure 1). Any $\lambda$ in this interval produces the same model. This can be explained as follows. When $\lambda$ increases from 0 to infinity, the number of non-zero coefficients of $\hat\beta^\lambda$ is a non-increasing step function of $\lambda$ (Efron et al., 2004); in other words, the covariates are removed from the models one by one. As a result, by its definition, $\mathrm{LR}_\lambda$ is also a step function. Note that our emphasis is on variable selection rather than on coefficient estimation.
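A minimal sketch of the criterion (8)-(9) along a preselected Lasso path follows; it assumes `X`, `y` and `coefs` from the earlier path sketch, refits each candidate support by OLS (the criterion is built on $\hat\beta_{S_\lambda}$, not on the Lasso estimate, as emphasized in Section 4), and keeps the support that maximizes the KL term.

```python
# Sketch of the LR criterion (8)-(9): each distinct support S_lambda on
# the path is refitted by OLS and the KL term is maximized.
# Assumes X, y and coefs from the earlier path sketch.
import numpy as np

def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

def lr_score(X, y, support):
    n = len(y)
    Xs = X[:, support]
    beta = np.linalg.lstsq(Xs, y, rcond=None)[0]       # OLS refit
    rho = np.sum((y - Xs @ beta) ** 2) / np.sum(y ** 2)
    return kl_bernoulli(len(support) / n, 1.0 - rho)   # maximize this

supports = [np.flatnonzero(b) for b in coefs.T if np.any(b)]
best = max(supports, key=lambda s: lr_score(X, y, s))
print("model selected by the LR criterion:", best)
```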
3.2 Model selection consistency for fixed d

In order to prove the model selection consistency of the LR criterion, we assume in this section that $d$ is fixed and $d \le n$. We need the following assumptions:

(A1) There exists a deterministic sequence of reference shrinkage parameters $\{\lambda_n\}$ such that $S_{\lambda_n} \to S_T$ w.p.1.

(A2) $\epsilon$ is Gaussian $N(0, I_n)$.

(A3) For each candidate $\lambda$, $\rho_\lambda$ is bounded away from 0 and 1, i.e., there are constants $c_1, c_2$ such that $0 < c_1 \le \rho_\lambda \le c_2 < 1$ w.p.1.

Comments. $\rho_\lambda = \|y - \hat y_{S_\lambda}\|^2/\|y\|^2$ is a measure of fit. In extreme cases where the resulting model $S_\lambda$ is too big or too small, $\rho_\lambda$ will be close to 0 and 1, respectively. Therefore, it is reasonable to consider only $\lambda$ for which $\rho_\lambda$ is bounded away from 0 and 1. Note that for every $S_\lambda$ we have

$$ \rho_\lambda = \frac{\|y - \hat y_{S_\lambda}\|^2}{\|y - \hat y_{S_\lambda}\|^2 + \|\hat y_{S_\lambda}\|^2} = \frac{\hat\sigma^2_{S_\lambda}}{\hat\sigma^2_{S_\lambda} + \|\hat y_{S_\lambda}\|^2/n}. $$

For $\lambda$ such that $S_\lambda$ is the true model $S_T$, (A3) follows from the mild sufficient condition

$$ 0 < \liminf_{n\to\infty}\Big(\frac{1}{n}\|\hat y_{S_T}\|^2\Big) \le \limsup_{n\to\infty}\Big(\frac{1}{n}\|\hat y_{S_T}\|^2\Big) < \infty \quad \text{and} \quad \hat\sigma^2_{S_T} \to \sigma^2 > 0 \ \text{w.p.1}, $$

where $\hat y_{S_T}$ is the fitted vector under the true model. Moreover, if the intercept is included in the models, we have $n\bar y^2 \le \|\hat y_{S_\lambda}\|^2 \le \|y\|^2$. (A3) then follows from the very mild condition

$$ 0 < \liminf_{n\to\infty} \bar y^2 \le \limsup_{n\to\infty}\Big(\frac{1}{n}\|y\|^2\Big) < \infty \quad \text{and} \quad \hat\sigma^2_S \to \text{constant} > 0 \ \forall S \ \text{w.p.1}. $$

Assumption (A1) is satisfied by some regularization procedures, for example Lasso (Zhao and Yu, 2006) and SCAD (Fan and Li, 2001). The normality assumption (A2) is not a necessary condition for consistency; it can be relaxed, but then a more complicated proof technique is needed. We have the following lemma.

Lemma 3. The loss rank of model $S_\lambda$ can be rewritten as

$$ \mathrm{LR}_\lambda = \frac{n}{2}\log(n\hat\sigma^2_{S_\lambda}) + \frac{n}{2}H\Big(\frac{\mathrm{df}_\lambda}{n}\Big) + \frac{\mathrm{df}_\lambda}{2}\log\frac{1-\rho_\lambda}{\rho_\lambda}, \qquad (10) $$

where $H(p) := -p\log p - (1-p)\log(1-p)$ is the entropy of $p$. Under Assumption (A3), the loss rank $\mathrm{LR}_\lambda$ has the form

$$ \mathrm{LR}_\lambda = \frac{n}{2}\log\hat\sigma^2_{S_\lambda} + \frac{\mathrm{df}_\lambda}{2}\log n + \frac{n}{2}\log n + O_P(1), \qquad (11) $$

where $O_P(1)$ denotes a random variable that is bounded w.p.1.

Proof. With $\rho_\lambda = n\hat\sigma^2_{S_\lambda}/\|y\|^2$, rearranging terms in (8) we get (10). The fact that $H(p)/p + \log p \to 1$ as $p \to 0$ implies (note that $\mathrm{df}_\lambda \le d$ and $d$ is fixed)

$$ \frac{n}{2}H\Big(\frac{\mathrm{df}_\lambda}{n}\Big) = \frac{\mathrm{df}_\lambda}{2}\log n + \frac{\mathrm{df}_\lambda}{2}(1 - \log \mathrm{df}_\lambda) + o(1). $$

Under Assumption (A3), the last term of (10) is bounded. This completes the proof.

The above lemma is used to prove the model selection consistency of the LR criterion.

Theorem 4 (Model selection consistency of the LR criterion). Assume that $d$ is fixed. Under Assumptions (A1)-(A3), the shrinkage parameter selected by the LR criterion will produce the true model w.p.1 when $n$ is large enough, i.e., $P(S_{\hat\lambda_{\mathrm{LR}}} = S_T) \to 1$, where $\hat\lambda_{\mathrm{LR}}$ is determined in (9).

The idea of the proof is to bound the probabilities of picking under- and overfitted models. A model $S$ is said to be underfitted if $S$ misses at least one true covariate (i.e., $S \not\supseteq S_T$), and overfitted if $S$ contains all the true covariates and at least one untrue one (i.e., $S \supsetneq S_T$). There is a finite number of such $S$, so it is sufficient to prove that $P(S_{\hat\lambda_{\mathrm{LR}}} = S) \to 0$ for each of them. The detailed proof is relegated to the appendix.

We can of course use other model selection criteria besides LoRP for choosing the best subset among the preselected set produced by the regularization procedure. The most widely used selection criteria in statistics are probably AIC (Akaike, 1973) and BIC (Schwarz, 1978). AIC is asymptotically optimal in terms of loss efficiency but is likely to select overfitted models, while BIC is asymptotically optimal in terms of model selection consistency; see Shao (1997) and Yang (2005). Therefore one may use BIC as another stopping rule besides LoRP. The shrinkage parameter chosen by BIC will be $\hat\lambda_{\mathrm{BIC}} \in \arg\min_{\lambda\ge 0} \mathrm{BIC}_\lambda$, where

$$ \mathrm{BIC}_\lambda := \frac{n}{2}\log\hat\sigma^2_{S_\lambda} + \frac{\mathrm{df}_\lambda}{2}\log n. \qquad (12) $$

We see from Lemma 3 that, up to a constant, the LR criterion is asymptotically equivalent to BIC. It follows from the proof of Theorem 4 that using BIC also leads to the same model selection consistency, i.e., $P(S_{\hat\lambda_{\mathrm{BIC}}} = S_T) \to 1$ as $n \to \infty$. However, the finite-sample simulation studies in the next section show that the LR criterion works better than BIC, especially when $d \gg n$.
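For comparison, the BIC stopping rule (12) over the same preselected path admits an analogous sketch, again assuming `X`, `y` and `coefs` from the earlier sketches.

```python
# Sketch of the BIC stopping rule (12) over the same preselected path.
# Assumes X, y and coefs from the earlier sketches.
import numpy as np

def bic_score(X, y, support):
    n = len(y)
    Xs = X[:, support]
    beta = np.linalg.lstsq(Xs, y, rcond=None)[0]       # OLS refit
    sigma2 = np.sum((y - Xs @ beta) ** 2) / n          # sigma_hat^2_{S_lambda}
    return 0.5 * n * np.log(sigma2) + 0.5 * len(support) * np.log(n)

supports = [np.flatnonzero(b) for b in coefs.T if np.any(b)]
best_bic = min(supports, key=lambda s: bic_score(X, y, s))
print("model selected by BIC:", best_bic)
```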
3.3 Large d, small n

High-dimensional variable selection problems in which $d \gg n$ are currently of great interest to scientists. In order for such a problem to be solvable, an essential assumption is that it is $d^*$-sparse (Candes and Tao, 2007), i.e., the number of true covariates $d^*$ must be smaller than $n$. Under this solvability assumption, it is clear that we can safely ignore the irrelevant cases in which the number of covariates $\mathrm{df}_\lambda$ under consideration is larger than $n$. The LR criterion (8) is then still valid. In practice, therefore, we propose to ignore those $\lambda$ under which $\mathrm{df}_\lambda > n$ and to apply the LR criterion as usual. A theoretically rigorous treatment is beyond the scope of the present paper; we intend to pursue it in future work. However, a systematic simulation study in the next section suggests that the LR criterion still works surprisingly well and enjoys model selection consistency.

4 Numerical examples

In this section, we present simulation studies for the LR criterion, compare the LR criterion to other methods, and also apply it to a real data set. The regularization procedure we use is Lasso. The Lasso solution paths are computed by the LARS algorithm of Efron et al. (2004). A widely used method for choosing the Lasso parameter is GCV (Craven and Wahba, 1979; Tibshirani, 1996),

$$ \mathrm{GCV}_\lambda = \frac{\frac{1}{n}\|y - X\hat\beta^\lambda\|^2}{\big(1 - \frac{1}{n}\mathrm{DF}_\lambda\big)^2}, \quad \text{where } \mathrm{DF}_\lambda := \mathrm{tr}\big[X(X^\top X + \lambda W^-)^{-1}X^\top\big], $$

$W = \mathrm{diag}(|\hat\beta^\lambda_j|)$ and $W^-$ is a generalized inverse of $W$. Another is the BIC-type criterion of Wang et al. (2007) (although its variable selection consistency requires the oracle property, a property not enjoyed by Lasso), denoted gBIC below:

$$ \mathrm{gBIC}_\lambda = \log\frac{\|y - X\hat\beta^\lambda\|^2}{n} + \mathrm{DF}_\lambda\,\frac{\log n}{n}. $$

Note that $\hat\beta^\lambda \neq \hat\beta_{S_\lambda}$: the former is the Lasso estimator, whereas the latter is the OLS estimator resulting from fitting model $S_\lambda$ by least squares. Our proposed criteria (8) and (12) are constructed based on $\hat\beta_{S_\lambda}$, not $\hat\beta^\lambda$. This is the essential difference between our approach and the others.

Example 1: small d. We consider the following example, which is taken from Tibshirani (1996): $y = x^\top\beta + \sigma\epsilon$, where $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^\top$, the $x_i$ are marginally $N(0,1)$ with the correlation between $x_i$ and $x_j$ equal to $0.5^{|i-j|}$, and $\epsilon \sim N(0,1)$. We compare the performance of the LR and BIC criteria to that of GCV and gBIC. The performance is measured by the frequency of underfitting, overfitting and correct fitting, and by the average number of zero coefficients, over 100 replications. Table 1 summarizes the simulation results for the various factors $n$ and $\sigma$. Although gBIC works slightly better than GCV, it still produces overfitted models most of the time. BIC does a good job, and LR outperforms the others.

Table 1: The small-d case

σ  n    Method  Underfitted(%)  Correctly fitted(%)  Overfitted(%)  Ave. no. of zeros
1  100  GCV     0               0                    100            1.57
1  100  gBIC    0               3                    97             2.32
1  100  BIC     0               89                   11             4.88
1  100  LR      0               97                   3              4.97
1  200  GCV     0               0                    100            1.64
1  200  gBIC    0               0                    100            1.81
1  200  BIC     0               94                   6              4.93
1  200  LR      0               100                  0              5
3  100  GCV     0               0                    100            1.34
3  100  gBIC    0               0                    100            1.53
3  100  BIC     1               70                   29             4.22
3  100  LR      1               77                   22             4.37
3  200  GCV     0               0                    100            1.69
3  200  gBIC    0               0                    100            2.09
3  200  BIC     0               91                   9              4.89
3  200  LR      0               91                   9              4.90

Example 2: large d. We consider cases of large $d$ in this example, with $d = 300$ and $n = 100, 200, 500$. We set up a sparse recovery problem in which most of the coefficients are zero, except $\beta_{30} = \beta_{60} = \cdots = \beta_{300} = 10$. The design matrix is simulated as in Example 1. Table 2 summarizes the simulation results for the various factors $n = 100, 200, 500$ and $\sigma = 1, 3$. The LR criterion works surprisingly well in comparison with BIC and the others.

Table 2: The large-d case

σ  n    Method  Underfitted(%)  Correctly fitted(%)  Overfitted(%)  Ave. no. of zeros
1  100  GCV     0               0                    100            90.20
1  100  gBIC    0               0                    100            95.8
1  100  BIC     0               0                    100            202.01
1  100  LR      0               30                   70             288.24
1  200  GCV     0               0                    100            87.51
1  200  gBIC    0               0                    100            89.45
1  200  BIC     0               0                    100            102.02
1  200  LR      0               86                   14             289.83
1  500  GCV     0               0                    100            97.51
1  500  gBIC    0               0                    100            104.45
1  500  BIC     0               40                   60             287.30
1  500  LR      0               100                  0              290
3  100  GCV     0               0                    100            78.35
3  100  gBIC    0               0                    100            87.40
3  100  BIC     0               0                    100            202.04
3  100  LR      0               18                   82             287.51
3  200  GCV     0               0                    100            92.02
3  200  gBIC    0               0                    100            96.51
3  200  BIC     0               0                    100            102.01
3  200  LR      0               58                   42             289.29
3  500  GCV     0               0                    100            93.31
3  500  gBIC    0               0                    100            96.52
3  500  BIC     0               35                   65             288.35
3  500  LR      0               80                   20             289.75
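For readers wishing to reproduce experiments of this flavor, here is a sketch of the simulation design shared by Examples 1 and 2: covariates marginally $N(0,1)$ with $\mathrm{corr}(x_i, x_j) = 0.5^{|i-j|}$, drawn through a Cholesky factor. The helper name and the seed are our own choices.

```python
# Sketch of the design in Examples 1-2: rows of X are N(0, C) with
# C_ij = 0.5^{|i-j|}, generated through a Cholesky factor. Shown with
# the Example 1 dimensions and coefficients; the seed is arbitrary.
import numpy as np

def simulate(n, d, beta, sigma, seed=0):
    rng = np.random.default_rng(seed)
    idx = np.arange(d)
    C = 0.5 ** np.abs(np.subtract.outer(idx, idx))      # AR(1)-type correlation
    X = rng.standard_normal((n, d)) @ np.linalg.cholesky(C).T
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y

beta = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])   # Example 1
X1, y1 = simulate(n=100, d=8, beta=beta, sigma=1.0)
```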
Let us take a closer look at the simulation results in Tables 1-2. Although the LR and BIC criteria are asymptotically equivalent to each other, the finite-sample simulation study shows that the LR criterion works better than BIC. A similar situation was also observed in Hutter and Tran (2010) for subset selection. This is probably because, contrary to the BIC criterion, the penalty term of the LR criterion is data-adaptive. Some results in the model selection literature show that selection criteria with data-adaptive penalties are more encouraging than those with deterministic penalties; see Yang (2005) and the references therein. We see that BIC seems to break down in the cases $d > n$, as it always produces overfitted models, but starts working well when $n > d$. The $O_P(1)$ term in (11) plays an important role here: it serves as a "corrector" to BIC. Note that BIC is just an approximation to the logarithm of the posterior model probability (Schwarz, 1978); the approximation might be inaccurate if $n$ is not large enough relative to $d$.

Example 3: Prostate cancer data. We consider a real data set in this example. Stamey et al. (1989) studied the correlation between the level of prostate-specific antigen (lpsa) and a number of clinical measures in men: log cancer volume (lcavol), log prostate weight (lweight), age, log of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percentage of Gleason scores 4 or 5 (pgg45). Following Tibshirani (1996), we assume a linear regression model between the response lpsa and the 8 covariates. We want to select a parsimonious model for the sake of scientific insight into the response-covariate relationship. The data set of size 97 is standardized so that the intercept $\beta_0$ is excluded.

Figure 1 presents the curves $\mathrm{GCV}_\lambda$, $\mathrm{gBIC}_\lambda$ and $\mathrm{LR}_\lambda$ (1000 values of $\lambda$, ranging from 0.01 to 10 in increments of 0.01, were used to search for the optimal $\lambda$).

[Figure 1: Prostate cancer data: $\mathrm{LR}_\lambda$, $\mathrm{gBIC}_\lambda$ and $\mathrm{GCV}_\lambda$ as functions of the tuning parameter $\lambda$.]

The values of $\lambda$ selected by GCV and gBIC are 0.5 and 1.1, and the corresponding models are {1, 2, 3, 4, 5, 7, 8} and {1, 2, 3, 4, 5, 8}, respectively. The LR criterion is minimized on the interval (3.1, 5.9); any value in this interval produces the same model $S_{\mathrm{LR}} = \{1, 2, 5\}$. The BIC values of these three models are −19.20, −21.38 and −25.19, respectively. That means BIC also supports the choice of the LR criterion. (Note, however, that this does not mean that BIC is an optimal criterion.)

5 Conclusions and outlook

Regularization procedures are efficient methods for variable selection, subject to a proper choice of shrinkage parameter. By virtue of LoRP, a general-purpose principle for model selection, the LR criterion for variable selection in linear regression analysis was proposed. Variable selection consistency of the suggested criterion was established theoretically and experimentally. Both the theoretical and the experimental results show that the proposed criterion is a very encouraging procedure for the variable selection problem, especially in high-dimensional settings.
Regularization procedures have now been extended to generalized linear models and beyond; we intend to extend our approach to such frameworks in future work.

M. N. Tran, Department of Statistics and Applied Probability, National University of Singapore, Singapore 117546. E-mail: ngoctm@nus.edu.sg

Appendix

Proof of Theorem 4. The main idea of the proof is taken from Chambaz (2006). Let us denote by $z_i = (x_{i1}, \ldots, x_{id}, y_i)$ the $i$th observation and by $\gamma(\cdot, m, \sigma^2)$ the density of the Gaussian distribution with mean $m$ and variance $\sigma^2$. Under model $S$, the density of $z_i$ is $p_{\theta_S}(z_i) = \gamma(y_i, \sum_{j\in S}\beta_j x_{ij}, \sigma^2)$. The log-likelihood is

$$ l_n(\theta_S) = \sum_{i=1}^n \log p_{\theta_S}(z_i) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n\Big(y_i - \sum_{j\in S}\beta_j x_{ij}\Big)^2. $$

It is easy to see that

$$ \sup_{\theta\in\Theta(S)} l_n(\theta) = -\frac{n}{2}\log\hat\sigma^2_S - \frac{n}{2}(1 + \log(2\pi)). $$

By (11), the loss rank of model $S_\lambda$ can now be written as

$$ \mathrm{LR}_\lambda = -\sup_{\theta\in\Theta(S_\lambda)} l_n(\theta) + \frac{\mathrm{df}_\lambda}{2}\log n + C(n) + O_P(1), $$

where the constant term $C(n) = \frac{n}{2}\log n - \frac{n}{2}(1+\log(2\pi))$ is independent of $S_\lambda$.

No underestimation. It is sufficient to prove that $P(S_{\hat\lambda_{\mathrm{LR}}} = S) \to 0$ for each $S \not\supseteq S_T$, as there is only a finite number of such $S$. We have

$$ P(S_{\hat\lambda_{\mathrm{LR}}} = S) = P(S_{\hat\lambda_{\mathrm{LR}}} = S, \ \mathrm{LR}_{\hat\lambda_{\mathrm{LR}}} \le \mathrm{LR}_{\lambda_n}) $$
$$ = P\Big(\frac{1}{n}\sup_{\theta\in\Theta(S_{\hat\lambda_{\mathrm{LR}}})} l_n(\theta) - \frac{1}{n}\sup_{\theta\in\Theta(S_{\lambda_n})} l_n(\theta) \ge \frac{\log n}{2n}(\mathrm{df}_{\hat\lambda_{\mathrm{LR}}} - \mathrm{df}_{\lambda_n}) + o_P(1), \ S_{\hat\lambda_{\mathrm{LR}}} = S\Big) $$
$$ \le P\Big(\frac{1}{n}\sup_{\theta\in\Theta(S)} l_n(\theta) - \frac{1}{n}\sup_{\theta\in\Theta(S_{\lambda_n})} l_n(\theta) \ge \frac{\log n}{2n}(|S| - \mathrm{df}_{\lambda_n}) + o_P(1)\Big) $$
$$ \le P\Big(\frac{1}{n}\sup_{\theta\in\Theta(S)} l_n(\theta) - \frac{1}{n}\sup_{\theta\in\Theta(S_T)} l_n(\theta) \ge \frac{\log n}{2n}(|S| - d^*) + o_P(1)\Big) + P(S_{\lambda_n} \neq S_T) $$
$$ \le P\Big(\frac{1}{n}\sup_{\theta\in\Theta(S)} l_n(\theta) - \frac{1}{n}l_n(\theta^*) \ge \frac{\log n}{2n}(|S| - d^*) + o_P(1)\Big) + P(S_{\lambda_n} \neq S_T), \qquad (13) $$

where $\theta^* \in \Theta(S_T)$ denotes the true parameter. By the law of large numbers for the supremum of likelihood ratios (see, e.g., Lemma B1 of Chambaz (2006)),

$$ \frac{1}{n}\sup_{\theta\in\Theta(S)} l_n(\theta) - \frac{1}{n}l_n(\theta^*) \to -\inf_{\theta\in\Theta(S)}\mathrm{KL}(p_{\theta^*}\|p_\theta) \quad \text{w.p.1}. $$

Because $S \not\supseteq S_T$, $\inf_{\theta\in\Theta(S)}\mathrm{KL}(p_{\theta^*}\|p_\theta) > 0$. This, together with the fact that $\frac{\log n}{2n}(|S| - d^*) \to 0$ and Assumption (A1), shows that the left-hand side of (13), $P(S_{\hat\lambda_{\mathrm{LR}}} = S)$, goes to 0 as $n \to \infty$.
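As a sanity check, the profiled Gaussian log-likelihood identity used above can be verified numerically; the following sketch does so on simulated data, with all names and dimensions being illustrative assumptions.

```python
# Numeric sanity check (sketch) of the identity
#   sup_theta l_n(theta) = -n/2 log(sigma_hat^2) - n/2 (1 + log(2 pi))
# for a Gaussian linear model fitted by OLS. Data are simulated.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 200
X = rng.standard_normal((n, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / n

loglik = norm.logpdf(y, loc=X @ beta_hat, scale=np.sqrt(sigma2_hat)).sum()
closed_form = -0.5 * n * np.log(sigma2_hat) - 0.5 * n * (1.0 + np.log(2.0 * np.pi))
print(np.isclose(loglik, closed_form))   # True
```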
No overestimation. Fix an overfitted model $S \supsetneq S_T$ and denote

$$ H(\theta) := \mathrm{KL}(p_{\theta^*}\|p_\theta) = E\Big[\frac{1}{n}\big(l_n(\theta^*) - l_n(\theta)\big)\Big] \ge 0 \quad \forall\theta\in\Theta(S) $$

($H(\theta)$ is non-negative but not necessarily strictly positive) and

$$ h_n(\theta) := \frac{l_n(\theta) - l_n(\theta^*)}{H(\theta)^{1/2}}, $$

with the convention $0/0 = 0$. For every $\theta \in \Theta(S)$,

$$ l_n(\theta) - l_n(\theta^*) + nH(\theta) = l_n(\theta) - l_n(\theta^*) - E[l_n(\theta) - l_n(\theta^*)] = H(\theta)^{1/2}\big(h_n(\theta) - Eh_n(\theta)\big) \le H(\theta)^{1/2}\sup_{\nu\in\Theta(S)}\big(h_n(\nu) - Eh_n(\nu)\big). \qquad (14) $$

By $\Theta(S_T) \subset \Theta(S)$ and the property of the supremum, for every $\epsilon > 0$ there exists $\theta_0 \in \Theta(S)$ such that

$$ \sup_{\theta\in\Theta(S)}\big(l_n(\theta) - l_n(\theta^*)\big) \le l_n(\theta_0) - l_n(\theta^*) + \epsilon \qquad (15) $$

and also

$$ l_n(\theta_0) - l_n(\theta^*) \ge 0. \qquad (16) $$

From (15) and (14),

$$ \sup_{\theta\in\Theta(S)}\big(l_n(\theta) - l_n(\theta^*)\big) \le H(\theta_0)^{1/2}\sup_{\theta\in\Theta(S)}\big(h_n(\theta) - Eh_n(\theta)\big) + \epsilon. \qquad (17) $$

From (16) and (14),

$$ nH(\theta_0) \le l_n(\theta_0) - l_n(\theta^*) + nH(\theta_0) \le H(\theta_0)^{1/2}\sup_{\theta\in\Theta(S)}\big(h_n(\theta) - Eh_n(\theta)\big), $$

or

$$ nH(\theta_0)^{1/2} \le \sup_{\theta\in\Theta(S)}\big(h_n(\theta) - Eh_n(\theta)\big). \qquad (18) $$

Now, since $\epsilon > 0$ was chosen arbitrarily, (17) and (18) yield

$$ \sup_{\Theta(S)} l_n(\theta) - \sup_{\Theta(S_T)} l_n(\theta) \le \sup_{\Theta(S)}\{l_n(\theta) - l_n(\theta^*)\} \le \frac{1}{n}\Big(\sup_{\theta\in\Theta(S)}\big(h_n(\theta) - Eh_n(\theta)\big)\Big)^2. \qquad (19) $$

We need the following bounded law of the iterated logarithm, which is a consequence of Theorem 4.1 of Dudley and Philipp (1983), or of Lemma B2 of Chambaz (2006).

Lemma 5. There is a finite constant $C$ such that

$$ \limsup_n \sup_{\theta\in\Theta(S)} \frac{|h_n(\theta) - Eh_n(\theta)|}{\sqrt{n\log\log n}} \le C \quad \text{w.p.1}. $$

Now, for every overfitted model $S \supsetneq S_T$, it is sufficient to prove that $P(S_{\hat\lambda_{\mathrm{LR}}} = S) \to 0$. In fact,

$$ P(S_{\hat\lambda_{\mathrm{LR}}} = S) = P(S_{\hat\lambda_{\mathrm{LR}}} = S, \ \mathrm{LR}_{\hat\lambda_{\mathrm{LR}}} \le \mathrm{LR}_{\lambda_n}) $$
$$ \le P\Big(\sup_{\Theta(S)} l_n(\theta) - \sup_{\Theta(S_{\lambda_n})} l_n(\theta) \ge \frac{\log n}{2}(|S| - \mathrm{df}_{\lambda_n}) + O_P(1)\Big) $$
$$ \le P\Big(\sup_{\Theta(S)} l_n(\theta) - \sup_{\Theta(S_T)} l_n(\theta) \ge \frac{\log n}{2}(|S| - d^*) + O_P(1)\Big) + P(S_{\lambda_n} \neq S_T) $$
$$ = P\Big(\Big[\frac{\log\log n}{\frac{d^*}{2}\log n}\Big]\Big[\frac{\sup_{\Theta(S)} l_n(\theta) - \sup_{\Theta(S_T)} l_n(\theta)}{\log\log n}\Big] \ge \frac{|S|}{d^*} - 1 + o_P(1)\Big) + P(S_{\lambda_n} \neq S_T) $$
$$ \le P\Big(\Big[\frac{\log\log n}{\frac{d^*}{2}\log n}\Big]\Big[\sup_{\Theta(S)}\frac{|h_n(\theta) - Eh_n(\theta)|}{\sqrt{n\log\log n}}\Big]^2 \ge \frac{|S|}{d^*} - 1 + o_P(1)\Big) + P(S_{\lambda_n} \neq S_T), \qquad (20) $$

where the last inequality follows from (19). Observe that $|S| > d^*$ since $S \supsetneq S_T$. This, together with Lemma 5 and the fact that $\log\log n / (\frac{d^*}{2}\log n) \to 0$, implies that the first probability in (20) goes to zero. The second probability in (20) also goes to zero because of Assumption (A1). This completes the proof.

References

Akaike H. (1973). Information theory and an extension of the maximum likelihood principle. In Proc. 2nd International Symposium on Information Theory, pages 267-281, Budapest, Hungary, Akademiai Kiado.

Bartlett P., Boucheron S. and Lugosi G. (2002). Model selection and error estimation. Machine Learning, 48, 85-113.

Burnham K. P. and Anderson D. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. New York, Springer.

Candes E. and Tao T. (2007). The Dantzig selector: statistical estimation when p is much larger than n (with discussion). The Annals of Statistics, 35, 2313-2351.

Chambaz A. (2006). Testing the order of a model. The Annals of Statistics, 34, 1166-1203.

Craven P. and Wahba G. (1979). Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31, 377-403.

Dudley R. M. and Philipp W. (1983). Invariance principles for sums of Banach space valued random elements and empirical processes. Z. Wahrsch. Verw. Gebiete, 62, 509-552.

Efron B., Hastie T., Johnstone I. and Tibshirani R. (2004). Least angle regression. The Annals of Statistics, 32, 407-499.

Fan J. and Li R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348-1360.

Friedman J. H. (2008). Fast sparse regression and classification. URL http://www-stat.stanford.edu/~jhf/ftp/GPSpaper.pdf.

Hutter M. (2007). The loss rank principle for model selection. In Proc. 20th Annual Conf. on Learning Theory (COLT'07), volume 4539 of LNAI, pages 589-603, San Diego, Springer, Berlin.

Hutter M. and Tran M. N. (2010). Model selection with the loss rank principle. Computational Statistics and Data Analysis, 54, 1288-1306.

Koltchinskii V. (2001). Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47, 1902-1914.
Leng C., Lin Y. and Wahba G. (2006). A note on the lasso and related procedures in model selection. Statistica Sinica, 16, 1273-1284.

Meinshausen N. and Buhlmann P. (2006). Consistent neighbourhood selection for high-dimensional graphs with the lasso. The Annals of Statistics, 34, 1436-1462.

Miller A. (1990). Subset Selection in Regression. Chapman & Hall/CRC.

Poetscher B. M. and Leeb H. (2009). On the distribution of penalized maximum likelihood estimators: The LASSO, SCAD, and thresholding. Journal of Multivariate Analysis, 100, 2065-2082.

Schwarz G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461-464.

Shao J. (1997). An asymptotic theory for linear model selection. Statistica Sinica, 7, 221-264.

Stamey T., Kabalin J., McNeal J., Johnstone I., Freiha F., Redwine E. and Yang N. (1989). Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate II: radical prostatectomy treated patients. Journal of Urology, 16, 1076-1083.

Tibshirani R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267-288.

Tran M. N. (2009). Penalized maximum likelihood principle for choosing ridge parameter. Communications in Statistics: Simulation and Computation, 38, 1610-1624.

Tran M. N. and Hutter M. (2010). Model selection by loss rank for discrete data. Submitted.

Wang H., Li R. and Tsai C. L. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94, 553-568.

Yang Y. (2005). Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika, 92, 937-950.

Zhao P. and Yu B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research, 7, 2541-2563.

Zou H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418-1429.

Zou H., Hastie T. and Tibshirani R. (2007). On the degrees of freedom of the lasso. The Annals of Statistics, 35, 2173-2192.
