Model Selection Consistency for Cointegrating Regressions


Eduardo F. Mendes
Dep. of Statistics, Northwestern University
November 1, 2018

Abstract

We study the asymptotic properties of the adaptive Lasso in cointegration regressions in the case where all covariates are weakly exogenous. We assume the number of candidate I(1) variables is sub-linear with respect to the sample size (but possibly larger) and the number of candidate I(0) variables is polynomial with respect to the sample size. We show that, under classical conditions used in cointegration analysis, this estimator asymptotically chooses the correct subset of variables in the model, and that its asymptotic distribution is the same as the distribution of the OLS estimate given that the variables in the model were known beforehand (oracle property). We also derive an algorithm based on the local quadratic approximation and present a numerical study showing the adequacy of the method in finite samples.

1 Introduction

With the increasing access to large datasets, model selection has become a central issue in econometric modeling, as in many other areas. The problem is traditionally attacked from one of three perspectives: sequential tests, information-theoretic criteria, and model shrinkage. The first two are not well suited to variable selection in higher-dimensional settings, and the last has not been well adapted to the problems we face with economic time series.

The sequential testing method works in a "general-to-specific" fashion: one starts with a large model and sequentially eliminates unnecessary variables. When the number of regressors is large, the performance of this method is severely compromised, and multicollinearity and spurious correlation become serious issues. The information criteria approach works by assigning weights to the candidate models and then minimizing some risk function among them. In a variable selection context, one wants to choose the best subset of variables, which requires estimating approximately 10^{p/3} distinct models (since 2^p ≈ 10^{0.3p}) and choosing the best one according to some risk function. Clearly this quickly becomes infeasible, and alternative methods, such as greedy model selection, are used instead. Greedy (or sequential) model selection, however, is not consistent and frequently selects a local minimum among all models.

Another problem faced by model selection in high dimensions is that when the number of candidate variables is greater than the number of observations, estimating the full model is not feasible because the parameters are not identifiable. Model shrinkage, which has been used successfully in several areas, including computer science and genomics, addresses this. The idea is to shrink to zero the coefficients that do not matter in the regression, leaving only the "relevant" ones to be estimated. One consequence is that only a subset of variables is actually estimated, and therefore we are able to handle more variables than observations. Among shrinkage methods, the Lasso, introduced by Tibshirani (1996), has received much attention, and several extensions have been developed, e.g. Zou and Hastie (2005), Zou (2006) and Yuan and Lin (2006), among many others.
The Lasso estimator is given by

    \hat{\theta} = \arg\min_{\theta}\; \|Y - X\theta\|_2^2 + \lambda \|\theta\|_1,    (1)

where θ is a p × 1 parameter vector, Y is the dependent variable and X is the data matrix. Its entire regularization path can be computed efficiently (Efron et al., 2004), it can handle more covariates than observations, and under some conditions it chooses the correct subset of relevant variables (Zhao and Yu, 2006; Wainwright, 2006; Meinshausen and Bühlmann, 2006; Meinshausen and Yu, 2009). However, it is not consistent in general and provides biased estimates of the non-zero parameters (Fan and Li, 2001; Knight and Fu, 2000; Zou, 2006). Zou (2006) proposed a modification that has the "oracle" property, meaning that the estimators of the non-zero parameters have the same distribution as if we knew them beforehand. This modification led to the adaptive Lasso, given by

    \hat{\theta} = \arg\min_{\theta}\; \|Y - X\theta\|_2^2 + \lambda \sum_{j=1}^{p} \lambda_j |\theta_j|,    (2)

where the weights are λ_j = |θ̂*_j|^{-ρ}, 0 < ρ ≤ 1, with θ̂*_j a consistent estimate of the true parameter θ_{0j}. (A minimal computational sketch of (2) follows.)
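For concreteness, here is one way (2) can be computed: rescaling each column of X by its weight reduces the adaptive Lasso to a plain Lasso. The ridge pilot estimate, the function name and the use of scikit-learn are illustrative assumptions on our part, not part of the paper.

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    def adaptive_lasso(X, y, lam, rho=0.9, ridge_lam=1.0):
        """Adaptive Lasso (2) via column rescaling (sketch)."""
        T = X.shape[0]
        # Pilot estimate of theta*; a ridge fit stands in for any consistent estimator.
        pilot = Ridge(alpha=ridge_lam, fit_intercept=False).fit(X, y).coef_
        w = (np.abs(pilot) + 1e-12) ** (-rho)   # weights lambda_j = |theta*_j|^{-rho}
        # With c_j = w_j * theta_j, (2) becomes a plain Lasso on the rescaled design X_j / w_j.
        # scikit-learn minimizes (1/2T)||y - Xc||^2 + alpha*||c||_1, hence alpha = lam / (2T).
        fit = Lasso(alpha=lam / (2 * T), fit_intercept=False, max_iter=50_000).fit(X / w, y)
        return fit.coef_ / w                    # undo the change of variables

The rescaling identity follows by substituting c_j = w_j θ_j in (2); the returned coefficients undo that substitution.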
Extensions of shrinkage estimators to the case where the number of candidate variables n is possibly much larger than the sample size often require the "partial orthogonality condition", which states that the variables that do not enter the model are only weakly correlated with the variables that do enter (Huang et al., 2008, 2009), or the "Irrepresentable Condition", which states that the coefficients of the linear regression of the variables that enter the model onto the variables that do not enter are bounded by 1 (Zou, 2006; Zhao and Yu, 2006; Meinshausen and Bühlmann, 2006).

Despite all this effort in understanding and adapting the Lasso to distinct settings, most advances are only valid for the classical i.i.d. regression framework, most often with fixed design. Little effort has been devoted to the time series, or weakly dependent, case, which is the prevalent one for economic series. Wang et al. (2007) use a Lasso-based method to choose the autoregressive order of a regression; Hsu et al. (2008) apply the Lasso to choose the variables in vector autoregressive models; Caner (2009) applies the Lasso to choose variables in a weakly dependent GMM framework; Caner and Knight (2008) use a bridge estimator to find the integration order of a vector; and Liao and Phillips (2010) select variables and the order of integration in error correction models. All of those papers suffer from the same drawback: the number of candidate variables (or, in the vector case, the total number of parameters) has to be smaller than the sample size. Song and Bickel (2011) provide new results allowing the number of variables to increase with the sample size and possibly exceed it. Such techniques have also been used in applied research in more general frameworks. For instance, Bai and Ng (2008) use Lasso-related techniques for factor forecasting, but since prediction is their ultimate goal (as opposed to variable selection), what matters is how the ordered predictors affect the forecasts rather than how the variables are chosen.

In this paper we discuss an extension of the adaptive Lasso to a (possibly) cointegrated regression with additional stationary explanatory variables, and show model selection consistency and the oracle property for the method. We allow the method to select both the stationary and the non-stationary variables in the regression. One problem in extending the Lasso to cointegrated regressions is that the I(1) and I(0) parameters converge at distinct rates. We overcome this problem by setting the regularization parameter for the I(1) variables proportional to the square of the λ for the stationary variables. We also relax the need for a "zero-consistent estimator" in Huang et al. (2008), imposing a weaker form of the "Irrepresentable Condition". Throughout the paper we assume the orders of integration of the dependent and independent variables are already known.

We consider the case where the actual number of I(1) variables in the model, q1, is fixed, but the number of I(0) variables in the model, q2, can increase with T. Moreover, the total number of candidate I(1) variables is sub-linear with respect to the sample size T, meaning that the number of candidate variables n1 is o(T), but possibly larger than T. This last condition can be relaxed if more structure is imposed on the error term of the regression, in which case we can achieve a rate for n as large as o(e^{T^δ}), for some 0 ≤ δ ≤ 1 (Huang et al., 2008). Similarly, the number of candidate I(0) variables, n2, is o(T^d), for some d ≥ 1. The results in this paper can also be extended to the (finite) vector case and to (independent) panel data models.

One of the most straightforward applications of this result is to understanding shifts in the prices of financial assets (financial portfolio construction): prices are known to be I(1), and the number of assets that might be of interest is large and includes both I(1) and I(0) variables. Another interesting framework is the evolution of macroeconomic time series, as in Stock and Watson (2002), where the number of predictors can be very large and an efficient method for choosing the relevant ones is necessary. A further application is choosing the number of lags in an Autoregressive Distributed Lag (ADL) model.

In section 2 we present the proposed model selection method. Section 3 presents the main results of the paper. Section 4 gives the algorithm for estimating the parameters and a Monte Carlo study evaluating the performance of the method in finite samples. We close the paper with some final remarks in section 5. The proofs of the main results are deferred to the appendix.

2 Penalized Cointegration

Let {y_t}∞₁ denote a scalar time series generated by

    y_t = \alpha_0 + \beta_0' x_t + \gamma_0' z_t + u_t,    (3)

where α_0 is a scalar, β_0 is n1 × 1, and γ_0 is n2 × 1, with the index ·_0 meaning "true". The process {x_t}∞₁ satisfies

    x_t = x_{t-1} + v_t,    (4)

the process {z_t}∞₁ has mean zero and is weakly stationary, and {u_t}∞₁ and {v_t}∞₁ are weakly stationary error processes. The following assumption holds for the vector w_t = (u_t, v_t', z_t')'.
Assumption 1 (DGP). The vector process {w_t}∞₁ satisfies the following:

1. E w_t = 0 for t = 1, 2, ...;
2. {w_t}∞₁ is weakly stationary;
3. for some d > 1, E|w_t|^{2d} < ∞ for t = 1, 2, ..., and the process {w_t}∞₁ is either φ-mixing with rate 1 − 1/(2d), or α-mixing with rate 1 − 1/d;
4. the process {u_t} is uncorrelated with {v_t} and {z_t}, for t = 1, 2, ...;
5. defining S_T = Σ_{t=1}^T w_t, we have lim_{T→∞} T^{-1} E S_T S_T' = E w_1 w_1' + Σ_{t=1}^∞ E[w_1 w_t' + w_t w_1'] = Σ + Λ + Λ' = Σ*;
6. max_{j=1,...,n2} ( E[ |T^{-1/2} Σ_{t=1}^T z_{jt} u_t|^{2d} ] )^{1/d} ≤ c_d < ∞;
7. if q2 → ∞, max_{1≤i≤j≤q2} E( T^{-1/2} Σ_{t=1}^T [z_{it} z_{jt} − E(z_{it} z_{jt})] )² ≤ c_s < ∞;
8. the eigenvalues of the matrix Σ*_{Z(1)²} (the block of Σ* corresponding to the variables z that enter the model) are bounded between τ_* and τ*.

The set of assumptions (1)–(5) is common in cointegration regression; assumptions (6) and (7) are required to control the number of I(0) variables in the model. In particular, Phillips and Durlauf (1986) make the same set of assumptions (1)–(5) to derive asymptotic properties of multiple regressions with integrated processes. These assumptions are required to ensure that the invariance principle holds. A weaker set of assumptions, using mixingales, could be used instead (de Jong and Davidson, 2000), but we keep the classical set for the sake of simplicity (and clarity, since these are the most commonly used). The number of finite moments d is directly related to the admissible rate of increase of the number of candidate variables in the model.

In this work we assume that n ≡ n_T = n1 + n2 is possibly greater than T, but only a fraction of the coefficients are in fact nonzero. Without loss of generality we assume each coefficient vector can be partitioned into zero and non-zero coefficients, i.e. β_0 = (β_0(1)', β_0(2)')' and γ_0 = (γ_0(1)', γ_0(2)')', with all non-zero coefficients stacked first, where β_0(1) is q1 × 1 and γ_0(1) is q2 × 1. We assume q1 is fixed (does not depend on T) and q2 may depend on T; we also set q = q1 + q2. For convenience, denote m1 = n1 − q1 and m2 = n2 − q2. Denote by upper case letters the data matrices, split in the same way as the coefficients, e.g. Z = (z_1, ..., z_T)' = (Z(1), Z(2)) and X = (x_1, ..., x_T)' = (X(1), X(2)).

The adaptive Lasso estimate in our case is given by

    (\hat{\beta}, \hat{\gamma}) = \arg\min_{\beta,\gamma}\; \|Y - X\beta - Z\gamma\|_2^2 + \lambda_1 \sum_{j=1}^{n_1} \lambda_{1j}|\beta_j| + \lambda_2 \sum_{j=1}^{n_2} \lambda_{2j}|\gamma_j|,    (5)

where {λ1, λ11, ..., λ_{1n1}, λ2, λ21, ..., λ_{2n2}} are regularization parameters satisfying a set of conditions defined later, and ‖·‖₂² denotes the squared L2 vector norm. Following Zou (2006), we take λ_{1j} = |β̂*_j|^{-ρ} and λ_{2j} = |γ̂*_j|^{-ρ}, where β̂*_j and γ̂*_j are estimators of β_{0j} and γ_{0j}, and 0 < ρ ≤ 1. (A sketch of (5), in the spirit of the code above, follows.)
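Under the same illustrative assumptions as the earlier sketch (scikit-learn, hypothetical function names), (5) differs from (2) only in carrying separate penalty levels for the I(1) and I(0) blocks:

    import numpy as np
    from sklearn.linear_model import Lasso

    def cointegration_adalasso(X, Z, y, lam1, lam2, pilot_beta, pilot_gamma, rho=0.9):
        """Adaptive Lasso (5): separate penalties for the I(1) and I(0) blocks (sketch)."""
        T = X.shape[0]
        W = np.hstack([X, Z])
        # Per-coefficient penalties lam_i * lam_ij, with lam_ij = |pilot_j|^{-rho}.
        pen = np.concatenate([lam1 * (np.abs(pilot_beta) + 1e-12) ** (-rho),
                              lam2 * (np.abs(pilot_gamma) + 1e-12) ** (-rho)])
        # Same change of variables as before: a unit-penalty Lasso on W_j / pen_j.
        fit = Lasso(alpha=1.0 / (2 * T), fit_intercept=False, max_iter=50_000).fit(W / pen, y)
        theta = fit.coef_ / pen
        return theta[:X.shape[1]], theta[X.shape[1]:]   # (beta_hat, gamma_hat)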
We assume, without loss of generality, that the true intercept α_0 = 0 is known. This assumption does not change our results, since we are interested in the behavior of the selection procedure. We make the following regularity assumptions about the parameter space Θ_n and the true vector of parameters θ_0 = (β_0', γ_0')'.

Assumption 2. (i) The true parameter vector θ_0 is an element of an open subset Θ_n ⊂ R^n that contains the element 0. (ii) min β_0(1) ≥ β_* and min γ_0(1) ≥ γ_*.

The minimization problem in (5) is equivalent to a constrained concave minimization problem, and necessary and (almost) sufficient conditions (Zhao and Yu, 2006) for the existence of a solution can be derived from the Karush-Kuhn-Tucker (KKT) conditions. This approach has been applied in several papers, including Wainwright (2006), Zhao and Yu (2006), Zou (2006) and Huang et al. (2008), and leads to a necessary condition frequently called the Irrepresentable Condition (IC) in the literature. This condition is known to be easily violated in the presence of highly correlated variables (Zhao and Yu, 2006; Meinshausen and Yu, 2009); Meinshausen and Yu (2009) examine the performance of the Lasso estimate when it fails. A more comprehensive discussion of the IC and comparisons with other conditions can be found in Zhao and Yu (2006) and Meinshausen and Yu (2009, section 1.5).

In contrast to Zou (2006) and Huang et al. (2008), who assume consistent zero-estimators of the parameters θ_0(2) are available, we do not assume such estimators exist; instead, we assume a weaker form of the Irrepresentable Condition, denoted the Weak Irrepresentable Condition (WIC). This condition reduces to the IC if P( min_{q1+1≤j≤n1} λ_{1j} = |β*|^{-1} ) → 1 and P( min_{q2+1≤j≤n2} λ_{2j} = |γ*|^{-1} ) → 1, and is equivalent to zero-consistency if λ_{1j} and λ_{2j} diverge as T increases. One should expect to be in between most of the time, rendering this condition less restrictive than both the IC and zero-consistency. The WIC also implies that we no longer need consistent estimators of θ_0(2) to construct λ_{ij}, i = 1, 2 and j = q_i + 1, ..., n_i; rather, we can use biased estimators such as ridge estimators.

Lemma 1 (KKT Conditions). The solution β̂ = (β̂(1)', β̂(2)')' and γ̂ = (γ̂(1)', γ̂(2)')' to the minimization problem (5) exists if:

    \frac{\partial \|Y - X\beta - Z\gamma\|_2^2}{\partial \beta_j(1)}\Big|_{\beta_j(1)=\hat{\beta}_j(1)} = \mathrm{sgn}(\hat{\beta}_j(1))\,\lambda_1 \lambda_{1j},    (6a)

    \frac{\partial \|Y - X\beta - Z\gamma\|_2^2}{\partial \gamma_j(1)}\Big|_{\gamma_j(1)=\hat{\gamma}_j(1)} = \mathrm{sgn}(\hat{\gamma}_j(1))\,\lambda_2 \lambda_{2j},    (6b)

and

    \Big|\frac{\partial \|Y - X\beta - Z\gamma\|_2^2}{\partial \beta_j(2)}\Big|_{\beta_j(2)=\hat{\beta}_j(2)}\Big| \le \lambda_1 \lambda_{1j},    (7a)

    \Big|\frac{\partial \|Y - X\beta - Z\gamma\|_2^2}{\partial \gamma_j(2)}\Big|_{\gamma_j(2)=\hat{\gamma}_j(2)}\Big| \le \lambda_2 \lambda_{2j}.    (7b)

Proof. The proof of this lemma is simply the statement of the KKT conditions adapted to our problem. (A numerical check of (6)–(7) is sketched below.)
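The conditions in Lemma 1 are easy to verify numerically at a candidate solution. The following sketch assumes the sign convention in which the score of the squared-error loss is 2W'(Y − Wθ); the helper name and tolerances are ours:

    import numpy as np

    def kkt_check(W, y, theta, pen, zero_tol=1e-8, slack=1e-6):
        """Check the KKT conditions (6)-(7) at a candidate solution of (5).

        pen[j] is the aggregate penalty lambda_i * lambda_ij on coefficient j.
        """
        r = 2.0 * W.T @ (y - W @ theta)          # score of the squared-error loss
        active = np.abs(theta) > zero_tol
        # (6): on the active set the score must equal the penalty times the sign.
        eq_ok = np.allclose(r[active], np.sign(theta[active]) * pen[active], atol=slack)
        # (7): on the zero set the score must stay within the penalty bound.
        ineq_ok = np.all(np.abs(r[~active]) <= pen[~active] + slack)
        return bool(eq_ok and ineq_ok)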
Following Zhao and Yu (2006), model selection consistency is equivalent to sign consistency. We say that θ̂ equals θ in sign if sgn(θ̂) = sgn(θ), and we denote this equality of signs by θ̂ =_s θ.

Definition 1 (Sign Consistency). We say that an estimate θ̂ is sign consistent for θ if Pr(θ̂ =_s θ) → 1 as T → ∞.

Zhao and Yu (2006) refer to this kind of consistency as strong sign consistency, meaning that one can use a pre-selected regularization parameter to achieve sign consistency, as opposed to general sign consistency, which states that for a random realization there exists an amount of regularization that selects the true model.

Before stating the IC for our problem, we introduce some more notation. Let W(1) = (X(1), Z(1)), W(2) = (X(2), Z(2)) and W = (W(1), W(2)); then Ω = Γ^{-1/2}W'WΓ^{-1/2} can be divided into four blocks, Ω11 = Γ1^{-1/2}W(1)'W(1)Γ1^{-1/2}, Ω21 = Γ2^{-1/2}W(2)'W(1)Γ1^{-1/2}, Ω12 and Ω22. The normalization matrix Γ is partitioned correspondingly, with Γ1^{1/2} = diag(T 1'_{q1}, √T 1'_{q2}) and Γ2^{1/2} = diag(T 1'_{n1−q1}, √T 1'_{n2−q2}).

Assumption 3 (Weak Irrepresentable Condition). The matrix Ω11 is invertible, and for some 0 < η < 1,

    P\Big( \bigcap_{1 \le j \le m_1} \Big\{ \big|\big[\Omega_{21}\Omega_{11}^{-1}\,\mathrm{sgn}(\theta_0(1))\big]_j\big| \le \beta_* \lambda_{1j} - \eta \Big\} \Big) \to 1,

and

    P\Big( \bigcap_{m_1+1 \le j \le m_1+m_2} \Big\{ \big|\big[\Omega_{21}\Omega_{11}^{-1}\,\mathrm{sgn}(\theta_0(1))\big]_j\big| \le \gamma_* \lambda_{2j} - \eta \Big\} \Big) \to 1,

where [·]_j denotes the j-th element of the vector inside brackets.

The next proposition (similar to Proposition 1 in Huang et al. (2008)) provides a lower bound on the probability of the adaptive Lasso choosing the correct model.

Proposition 1. Let λ = diag(λ1 1_{h1}, λ2 1_{h2}), where the dimensions h1 and h2 are adapted to each case where it appears, L(1) = diag(λ11, ..., λ_{1q1}, λ21, ..., λ_{2q2}) and L(2) = diag(λ_{1,q1+1}, ..., λ_{1,n1}, λ_{2,q2+1}, ..., λ_{2,n2}). Then Pr(θ̂ =_s θ_0) ≥ Pr(A_T ∩ B_T), where

    A_T = \Big\{ \Gamma^{-1/2}\big|\Omega_{11}^{-1}W(1)'U\big| < \Gamma^{1/2}|\theta_0(1)| - \tfrac{1}{2}\Gamma^{-1/2}\lambda \big|\Omega_{11}^{-1}L(1)\,\mathrm{sgn}(\theta_0(1))\big| \Big\},    (8a)

    B_T = \Big\{ 2\big|\Gamma^{-1/2}W(2)'M(1)U\big| < \Gamma^{-1/2}\lambda L(2)\mathbf{1}_{n-q} - \Gamma^{-1/2}\lambda \big|\Omega_{21}\Omega_{11}^{-1}L(1)\,\mathrm{sgn}(\theta_0(1))\big| \Big\},    (8b)

where M(1) = I_T − W(1)(W(1)'W(1))^{-1}W(1)' and the inequalities hold element-wise.

3 Model Selection Consistency and Oracle Property

In this section we derive the main results of the paper. We show that, under some conditions on the numbers of variables and on the λ's, the adaptive Lasso selects the correct subset of variables (sign consistency) and has the oracle property in the sense of Fan and Li (2001), meaning that our estimator has the same asymptotic distribution as the OLS estimator computed as if we knew beforehand which variables are in the model, and at the optimal rate. A straightforward consequence is that we can carry out hypothesis tests about the parameters in the traditional way, i.e. as if we had the true model.

In our case, the number of variables q = q1 + q2 that actually enter the model can grow polynomially with T; more precisely, the number of I(1) variables q1 in the model is finite, while the number of I(0) variables in the model can increase polynomially. The number of candidate variables n = n1 + n2 increases with T (both n1 and n2 increase with T at distinct rates) and is possibly larger than the sample size. The next assumption gives sufficient conditions for model selection consistency.

Assumption 4. The following hold jointly for some fixed 0 < ρ ≤ 1:

1. λ1 → ∞ and λ1/T^{1+ρ} → 0;
2. λ2 → ∞ and λ2/T^{(1+ρ)/2} → 0;
3. q1 = O(1) and q2 = o(T^{d/(2d+1)});
4. m1 = o(λ1²/T²) and m2 = o(λ2^{2d}/T^d).

This assumption says that the number of variables is sub-linear with respect to the sample size T; it can be relaxed at the cost of imposing more structure on the tails of the error term.

Assumption 5. The following hold jointly for some fixed 0 < ρ ≤ 1:

1. There exist constants β_* and γ_* such that: (i) Pr(max_{1≤j≤q1} λ_{1j} ≥ β_*^{-1}) → 0; (ii) Pr(max_{1≤j≤q2} λ_{2j} ≥ γ_*^{-1}) → 0.
2. There exist stationary processes V_{1j}, j = 1, ..., q1, and V_{2j}, j = 1, ..., q2, such that: (i) T^ρ λ_{1j} ⇒ V_{1j}; (ii) T^{ρ/2} λ_{2j} ⇒ V_{2j}.
The first assumption requires the weights λ_{1j} and λ_{2j} on the relevant variables to be bounded (by β_*^{-1} and γ_*^{-1}) with probability tending to one. The second is required for the oracle property: the data-dependent weights must converge at a given rate for the adaptive Lasso to be oracle.

Theorem 1 (Model Selection Consistency). Under Assumptions 1–5, P(θ̂ =_s θ_0) → 1.

Theorem 2 (Oracle Property). Suppose Assumptions 1–5 are satisfied, and also that (λ2 q2)/T^{(1+ρ)/2} → 0. Then

    \begin{pmatrix} T(\hat{\beta}(1)-\beta_0(1)) \\ \sqrt{T}(\hat{\gamma}(1)-\gamma_0(1)) \end{pmatrix} \Rightarrow \begin{pmatrix} \int_0^1 B_{X(1)}B_{X(1)}'\,dr & 0 \\ 0' & \Sigma_{Z(1)^2} \end{pmatrix}^{-1} \begin{pmatrix} \int_0^1 B_{X(1)}\,dB_u \\ N\big(0,\ \sigma_u^{*2}\,\Sigma^*_{Z(1)^2}\big) \end{pmatrix}.    (9)

4 Numerical Results

4.1 Algorithm

Since we are dealing with both I(1) and I(0) series, we cannot apply the plain-vanilla LARS algorithm (Efron et al., 2004) to our problem. Instead, we follow Fan and Li (2001) and Hunter and Li (2005) and apply a local quadratic approximation (LQA) to the penalty function, more precisely the perturbed version in section 3.2 of Hunter and Li (2005). This approach also allows us to derive a closed-form formula for the standard errors of the parameter estimates. For a nonzero β_j, the perturbed LQA of the adaptive Lasso penalty is given by

    \lambda_{1j}|\beta_j| \approx \lambda_{1j}|\beta_{0j}| + \frac{\lambda_{1j}}{2(|\beta_{0j}| + \varepsilon)}\big(\beta_j^2 - \beta_{0j}^2\big),    (10)

for some small ε > 0, and similarly for the γ_j's. Denote this approximation by ψ_j(β_j); instead of minimizing (5), we minimize

    \|Y - X\beta - Z\gamma\|_2^2 + \lambda_1\sum_{j=1}^{n_1}\psi_j(\beta_j) + \lambda_2\sum_{j=1}^{n_2}\psi_j(\gamma_j)    (11)

iteratively until the estimates converge. Define the diagonal matrix

    E_k = \mathrm{diag}\Big( \frac{\lambda_1\lambda_{11}}{|\beta_1^{(k)}|+\varepsilon}, \ldots, \frac{\lambda_1\lambda_{1n_1}}{|\beta_{n_1}^{(k)}|+\varepsilon}, \frac{\lambda_2\lambda_{21}}{|\gamma_1^{(k)}|+\varepsilon}, \ldots, \frac{\lambda_2\lambda_{2n_2}}{|\gamma_{n_2}^{(k)}|+\varepsilon} \Big).

The estimate θ^{(k+1)} is then given by

    \theta^{(k+1)} = \big(W'W + E_k\big)^{-1}W'Y.    (12)

One issue with the adaptive Lasso is finding the weights λ_{1j} and λ_{2j}. We propose an iterated adaptive Lasso, which consists of recalculating the weights λ_{1j} and λ_{2j} at each step. More precisely,

    E_k = \mathrm{diag}\Big( \frac{\lambda_1\lambda_{11}^{(k)}}{|\beta_1^{(k)}|+\varepsilon}, \ldots, \frac{\lambda_1\lambda_{1n_1}^{(k)}}{|\beta_{n_1}^{(k)}|+\varepsilon}, \frac{\lambda_2\lambda_{21}^{(k)}}{|\gamma_1^{(k)}|+\varepsilon}, \ldots, \frac{\lambda_2\lambda_{2n_2}^{(k)}}{|\gamma_{n_2}^{(k)}|+\varepsilon} \Big),    (13)

with

    \lambda_{1j}^{(k)} = |\beta_j^{(k-1)}|^{-\rho} \quad\text{and}\quad \lambda_{2j}^{(k)} = |\gamma_j^{(k-1)}|^{-\rho},    (14)

and the initial weights computed from a ridge regression with regularization parameter λ^{(ridge)}, i.e.

    \theta^{(0)} = \big(W'W + \lambda^{(ridge)} I_n\big)^{-1}W'Y,    (15)

for the best choice of λ^{(ridge)}. This algorithm has proven stable in a number of simulations, requiring only a small modification to keep the numbers within the margins of machine precision. (A sketch of the full iteration follows.)
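The following sketch implements the iteration (12)–(15); the ε floor inside the weight computation and the final thresholding stand in for the machine-precision safeguard mentioned above, and the function name and defaults are illustrative:

    import numpy as np

    def lqa_adaptive_lasso(W, y, lam1, lam2, n1, rho=0.9, ridge_lam=1.0,
                           eps=1e-6, max_iter=200, tol=1e-8):
        """Iterated adaptive Lasso via the perturbed LQA, equations (12)-(15) (sketch)."""
        n = W.shape[1]
        gram, Wy = W.T @ W, W.T @ y
        theta = np.linalg.solve(gram + ridge_lam * np.eye(n), Wy)   # ridge start, eq. (15)
        lam = np.concatenate([np.full(n1, lam1), np.full(n - n1, lam2)])
        for _ in range(max_iter):
            a = np.abs(theta) + eps                 # eps keeps the weights finite
            E_k = np.diag(lam * a ** (-rho) / a)    # weights (14) folded into E_k, eq. (13)
            theta_new = np.linalg.solve(gram + E_k, Wy)   # LQA update, eq. (12)
            if np.max(np.abs(theta_new - theta)) < tol:
                theta = theta_new
                break
            theta = theta_new
        theta[np.abs(theta) < 1e-4] = 0.0           # zero out numerically dead coefficients
        return theta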
4.2 Standard Error Formula

Hunter and Li (2005) provide a sandwich formula for computing the covariance matrix of the penalized estimates of the nonzero components, which has been proven consistent (Fan and Peng, 2004). Zou (2006) adapted this formula to the adaptive Lasso case; it is given by

    \widehat{\mathrm{cov}}(\hat{\theta}(1)) = \sigma^*_{uu}\big(W(1)'W(1) + E_k(1)\big)^{-1} W(1)'W(1) \big(W(1)'W(1) + E_k(1)\big)^{-1}.    (16)

If the parameter σ*_{uu} is unknown, one can replace it by its estimate from the full model. For the zero-valued variables, the standard errors are zero (Fan and Li, 2001). Although the consistency result derived by Fan and Peng (2004) cannot be directly applied to our case, the same conclusion can be reached by adapting their proof to the integrated case.

4.3 Choosing the regularization parameters

To implement the algorithm described above, we need to estimate λ1, λ2 and λ^{(ridge)}. We use generalized cross-validation (GCV). Define the projection matrix of the ridge estimator (15) as

    P_r(\theta(\lambda^*)) = W\big(W'W + \lambda^* I_n\big)^{-1}W'.    (17)

Hence the number of effective parameters is e(λ*) = trace(P_r(θ(λ*))), and the GCV statistic for this problem is

    GCV_r(\lambda^*) = \frac{T^{-1}\|Y - W\theta(\lambda^*)\|_2^2}{\big(1 - e(\lambda^*)/T\big)^2},    (18)

where θ(λ*) = (W'W + λ*I_n)^{-1}W'Y. We set λ^{(ridge)} = argmin_{λ*} GCV_r(λ*).

For the adaptive Lasso, define λ* = (λ*_1, λ*_2) and

    E_{\lambda^*} = \mathrm{diag}\Big( \frac{\lambda_1^*\lambda_{11}}{|\beta_1^{(0)}|+\varepsilon}, \ldots, \frac{\lambda_1^*\lambda_{1n_1}}{|\beta_{n_1}^{(0)}|+\varepsilon}, \frac{\lambda_2^*\lambda_{21}}{|\gamma_1^{(0)}|+\varepsilon}, \ldots, \frac{\lambda_2^*\lambda_{2n_2}}{|\gamma_{n_2}^{(0)}|+\varepsilon} \Big),    (19)

with

    \lambda_{1j} = |\beta_j^{(0)}|^{-\rho} \quad\text{and}\quad \lambda_{2j} = |\gamma_j^{(0)}|^{-\rho},    (20)

where β^{(0)} and γ^{(0)} are estimated using (15). Define the projection matrix

    P_l(\theta(\lambda^*)) = W\big(W'W + E_{\lambda^*}\big)^{-1}W'.    (21)

The number of effective parameters e(λ*) is given by trace(P_l(θ(λ*))), and the GCV statistic is

    GCV_l(\lambda^*) = \frac{T^{-1}\|Y - W\theta(\lambda^*)\|_2^2}{\big(1 - e(\lambda^*)/T\big)^2},    (22)

where θ(λ*) = (W'W + E_{λ*})^{-1}W'Y. We set λ = argmin_{λ*} GCV_l(λ*). We perform both minimizations by a grid search before starting the adaptive Lasso estimation procedure. We could also include ρ in the minimization of (22), but we found little difference between choosing ρ dynamically and fixing it at 0.9; smaller values of ρ did hurt the performance of the estimates. (A sketch of the GCV computation follows.)
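A minimal sketch of the GCV search (17)–(18) for λ^{(ridge)}; the adaptive Lasso criterion (21)–(22) is identical except that λ*I_n is replaced by E_{λ*}. The grid is an arbitrary illustrative choice:

    import numpy as np

    def gcv_ridge(W, y, grid=np.logspace(-3, 3, 25)):
        """Pick lambda_ridge by generalized cross-validation, eqs. (17)-(18) (sketch)."""
        T, n = W.shape
        gram = W.T @ W
        best = (np.inf, None)
        for lam in grid:
            A = np.linalg.solve(gram + lam * np.eye(n), W.T)   # (W'W + lam I)^{-1} W'
            theta = A @ y
            e = np.trace(W @ A)                                # effective parameters, (17)
            gcv = (np.sum((y - W @ theta) ** 2) / T) / (1 - e / T) ** 2   # (18)
            if gcv < best[0]:
                best = (gcv, lam)
        return best[1]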
4.4 Simulation Studies

In this section we report the results of the simulation studies. We want to evaluate (i) model selection accuracy, (ii) estimation accuracy, and (iii) forecasting accuracy. We consider six distinct model specifications. The covariates are generated from a multivariate normal distribution with variance 1 and covariance structure defined in each model. We simulate each model 500 times for three sample sizes, T = 50, 100, 200, and an extra 50 observations are used for evaluating prediction performance.

Model 1: u_t ~ N(0, 1.5²), n1 = n2 = 15. Set w_t = (v_t, z_t). The pairwise covariance between the i-th and j-th elements of w_t is cov(w_it, w_jt) = r^{|i−j|}, r = 0.5, and var(w_j) = 1. The parameters are γ = β = (2.5, 2.5, 1.5, 1.5, 0.5, 0.5, 0, ..., 0)', i.e. two large effects, two moderate effects and two weak effects for each of X and Z.
Model 2: Similar to model 1, except that r = 0.9.
Model 3: Similar to model 1, but the error term is u_t = 0.6u_{t−1} + e_t, with e_t ~ N(0, 1.5²).
Model 4: Similar to model 3, but n1 = n2 = 50.
Model 5: Similar to model 1, but n1 = n2 = 50; the first 15 variables in v_t and z_t have the same dependence structure as in model 1, and the remaining 2 × 35 variables are independent.
Model 6: Similar to model 3, but e_t ~ t_4.

In all examples we consider small, moderate and large effects for both the I(1) and I(0) covariates. In model 1 we study a simple framework with a moderate number of candidate variables and weak to moderate correlation among them. In model 2 we consider the case in which the variables are highly correlated. Model 3 considers the case in which the errors have an AR(1) structure. Models 4 and 5 consider the case in which we have many variables with distinct correlations, and in model 6 we consider AR(1) errors with fat tails. (A data-generating sketch for Model 1 follows.)
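As an illustration of the design, the following sketch generates one sample from Model 1 (the function name, seed handling and use of NumPy are ours):

    import numpy as np

    def simulate_model1(T=200, n1=15, n2=15, r=0.5, seed=0):
        """One draw from the Model 1 design of Section 4.4 (sketch)."""
        rng = np.random.default_rng(seed)
        idx = np.arange(n1 + n2)
        Sigma = r ** np.abs(idx[:, None] - idx[None, :])    # cov(w_it, w_jt) = r^{|i-j|}
        w = rng.multivariate_normal(np.zeros(n1 + n2), Sigma, size=T)
        v, z = w[:, :n1], w[:, n1:]
        x = np.cumsum(v, axis=0)                            # x_t = x_{t-1} + v_t, so x is I(1)
        coef = np.concatenate([[2.5, 2.5, 1.5, 1.5, 0.5, 0.5], np.zeros(n1 - 6)])
        u = rng.normal(0.0, 1.5, size=T)                    # u_t ~ N(0, 1.5^2)
        y = x @ coef + z @ coef + u                         # model (3), beta = gamma, alpha_0 = 0
        return y, x, z, coef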
4.4.1 Model Selection Accuracy

We evaluate model selection by computing the number of correctly selected "non-zero" coefficients and the number of correctly selected "zero" coefficients, using resampling to estimate the mean and standard deviation of these counts. In models 1, 2, 3 and 6 the number of "zero" coefficients is 18; in models 4 and 5 it is 88. For all models the number of "non-zero" coefficients is 12.

Table 1: Variable Selection Performance (standard deviations in parentheses)

            T = 50                          T = 100                         T = 200
Model   #nz             #z              #nz             #z              #nz             #z
1       10.573 (0.824)  16.308 (1.367)  11.644 (0.528)  16.860 (1.177)  11.946 (0.225)  17.262 (0.837)
2        8.630 (1.014)  16.605 (1.453)  10.013 (0.802)  17.008 (1.034)  11.038 (0.567)  17.320 (0.859)
3       10.561 (0.850)  15.749 (1.485)  11.420 (0.673)  15.661 (1.392)  11.917 (0.277)  15.611 (1.449)
4       10.225 (0.921)  79.029 (3.270)  11.220 (0.727)  77.689 (5.567)  11.840 (0.388)  79.076 (3.536)
5        9.607 (1.112)  79.557 (3.794)  11.251 (0.857)  78.925 (8.175)  11.996 (0.060)  85.454 (1.888)
6       10.662 (0.854)  15.809 (1.498)  11.461 (0.643)  15.820 (1.401)  11.948 (0.222)  15.889 (1.421)

We can see from Table 1 that the adaptive Lasso frequently selects the correct set of "non-zero" coefficients, with small changes due to correlation, distinct error specifications and the number of candidate variables; these effects are more pronounced in small samples. The method performs well even in small to moderate samples. However, the sensitivity of the selection method for the "zero" coefficients is affected by the number of candidate variables and the error structure: the proportion of "zero" parameters correctly selected is smaller when there are many parameters and, particularly, when there is an AR(1) structure in the error term. Comparing models 4 and 5, we see that the combination of correlated errors and correlated variables has a large effect on the number of correctly selected "zero" coefficients in larger samples.

4.4.2 Estimation Accuracy

We evaluate the estimation accuracy of the "non-zero" parameters and of the standard deviations of their estimates. For the parameters, we compare the mean squared error (MSE) of the estimates with that of the "oracle-OLS" estimates; for the standard deviations, we compare the estimate calculated using (16) with the standard error calculated using resampling. We present results for (β1, β3, β5, γ1, γ3, γ5) for all six models.

Table 2: MSE, Model 1

            T = 50                 T = 100                T = 200
Parameter   AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS
β1          0.117     0.045        0.018     0.010        0.004     0.002
β3          0.129     0.062        0.022     0.012        0.004     0.003
β5          0.131     0.051        0.022     0.013        0.003     0.003
γ1          0.148     0.087        0.052     0.033        0.024     0.017
γ3          0.158     0.102        0.055     0.045        0.023     0.019
γ5          0.154     0.113        0.065     0.042        0.025     0.021

Table 3: MSE, Model 2

            T = 50                 T = 100                T = 200
Parameter   AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS
β1          0.889     0.171        0.111     0.035        0.022     0.009
β3          0.834     0.323        0.138     0.065        0.025     0.017
β5          0.329     0.309        0.138     0.074        0.028     0.015
γ1          1.152     0.392        0.365     0.146        0.122     0.061
γ3          1.002     0.549        0.454     0.257        0.167     0.125
γ5          0.373     0.660        0.232     0.249        0.154     0.114

Table 4: MSE, Model 3

            T = 50                 T = 100                T = 200
Parameter   AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS
β1          0.192     0.105        0.058     0.036        0.020     0.011
β3          0.196     0.129        0.067     0.044        0.021     0.014
β5          0.174     0.128        0.077     0.050        0.020     0.013
γ1          0.142     0.100        0.059     0.044        0.028     0.023
γ3          0.155     0.117        0.061     0.054        0.029     0.028
γ5          0.138     0.115        0.079     0.056        0.035     0.029

Table 5: MSE, Model 4

            T = 50                 T = 100                T = 200
Parameter   AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS
β1          0.334     0.098        0.087     0.033        0.029     0.012
β3          0.282     0.116        0.100     0.046        0.028     0.013
β5          0.190     0.117        0.096     0.045        0.032     0.012
γ1          0.247     0.104        0.070     0.039        0.026     0.022
γ3          0.216     0.131        0.080     0.059        0.029     0.028
γ5          0.175     0.116        0.105     0.053        0.029     0.030

Table 6: MSE, Model 5

            T = 50                 T = 100                T = 200
Parameter   AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS
β1          0.404     0.043        0.070     0.010        0.005     0.002
β3          0.341     0.049        0.082     0.012        0.004     0.003
β5          0.208     0.057        0.103     0.013        0.004     0.003
γ1          0.392     0.060        0.055     0.029        0.013     0.012
γ3          0.385     0.064        0.059     0.031        0.012     0.012
γ5          0.191     0.061        0.075     0.026        0.015     0.012

Table 7: MSE, Model 6

            T = 50                 T = 100                T = 200
Parameter   AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS   AdaLasso  Oracle-OLS
β1          0.173     0.099        0.054     0.032        0.021     0.010
β3          0.169     0.114        0.057     0.041        0.018     0.012
β5          0.165     0.117        0.066     0.040        0.015     0.012
γ1          0.133     0.089        0.046     0.040        0.021     0.019
γ3          0.147     0.119        0.056     0.053        0.024     0.027
γ5          0.140     0.115        0.072     0.049        0.031     0.021

Tables 2–7 show the MSE of the parameter estimates. As expected, the number of candidate variables, the covariance structure and the error structure affect the estimates. In small samples the errors of the estimates are much larger than the oracle's, but the MSE quickly converges to the oracle MSE, as expected from Theorem 2. The worst performance was model 4, which showed an MSE of the β estimates almost three times as large as the oracle's in moderate-to-large samples (200 observations); however, the decrease in the MSE is very steep, indicating that this difference vanishes in larger samples. In fact, the error is very small in larger samples, being negligible at 1000 observations.
Tables 8–13 compare the estimated standard deviation (SD) of the parameters with the actual standard deviation calculated using resampling. We estimate σ_{uu} and σ*_{uu} assuming knowledge of the data generating process of the error term, which is a reasonable assumption since we are only interested in verifying the behavior of the proposed formula in finite samples; if the data generating process were unknown, we could estimate the autoregressive order using the same method proposed here. For all model specifications, the difference between the standard deviations estimated by resampling and by equation (16) shrinks as the sample size increases, for both β and γ. The worst performance was model 2, where the variables are highly correlated. In larger samples the estimated standard deviation is reasonably close to the "true" one estimated by resampling.

Table 8: Model 1, Standard Deviation (σ) and Estimated Standard Deviation (σ̂)

            T = 50          T = 100         T = 200
Parameter   σ       σ̂      σ       σ̂      σ       σ̂
β1          0.287   0.165   0.128   0.080   0.053   0.040
β3          0.333   0.181   0.132   0.090   0.063   0.046
β5          0.374   0.109   0.157   0.083   0.064   0.045
γ1          0.356   0.276   0.194   0.172   0.127   0.121
γ3          0.406   0.296   0.222   0.190   0.147   0.135
γ5          0.404   0.169   0.273   0.143   0.168   0.121

Table 9: Model 2, Standard Deviation (σ) and Estimated Standard Deviation (σ̂)

            T = 50          T = 100         T = 200
Parameter   σ       σ̂      σ       σ̂      σ       σ̂
β1          0.576   0.345   0.270   0.179   0.111   0.084
β3          0.919   0.358   0.372   0.222   0.152   0.110
β5          0.637   0.163   0.434   0.111   0.207   0.089
γ1          0.682   0.623   0.404   0.396   0.253   0.254
γ3          1.048   0.563   0.650   0.450   0.368   0.324
γ5          0.739   0.210   0.586   0.168   0.451   0.130

Table 10: Model 3, Standard Deviation (σ) and Estimated Standard Deviation (σ̂)

            T = 50          T = 100         T = 200
Parameter   σ       σ̂      σ       σ̂      σ       σ̂
β1          0.399   0.389   0.226   0.206   0.124   0.107
β3          0.450   0.417   0.251   0.228   0.136   0.114
β5          0.436   0.253   0.281   0.189   0.145   0.115
γ1          0.380   0.341   0.231   0.240   0.153   0.166
γ3          0.370   0.371   0.232   0.263   0.172   0.184
γ5          0.406   0.213   0.291   0.191   0.186   0.166

Table 11: Model 4, Standard Deviation (σ) and Estimated Standard Deviation (σ̂)

            T = 50          T = 100         T = 200
Parameter   σ       σ̂      σ       σ̂      σ       σ̂
β1          0.479   0.615   0.301   0.367   0.155   0.152
β3          0.512   0.673   0.332   0.405   0.171   0.170
β5          0.470   0.494   0.329   0.338   0.185   0.166
γ1          0.380   0.555   0.270   0.428   0.148   0.246
γ3          0.429   0.606   0.284   0.477   0.166   0.274
γ5          0.427   0.328   0.315   0.348   0.181   0.245

Table 12: Model 5, Standard Deviation (σ) and Estimated Standard Deviation (σ̂)

            T = 50          T = 100         T = 200
Parameter   σ       σ̂      σ       σ̂      σ       σ̂
β1          0.543   0.301   0.226   0.125   0.064   0.040
β3          0.562   0.329   0.261   0.140   0.070   0.046
β5          0.491   0.218   0.307   0.113   0.082   0.046
γ1          0.452   0.445   0.233   0.239   0.113   0.106
γ3          0.511   0.395   0.236   0.238   0.121   0.106
γ5          0.359   0.163   0.260   0.190   0.131   0.100

Table 13: Model 6, Standard Deviation (σ) and Estimated Standard Deviation (σ̂)

            T = 50          T = 100         T = 200
Parameter   σ       σ̂      σ       σ̂      σ       σ̂
β1          0.356   0.339   0.217   0.196   0.117   0.100
β3          0.425   0.386   0.243   0.215   0.128   0.110
β5          0.420   0.248   0.277   0.175   0.140   0.111
γ1          0.334   0.307   0.206   0.225   0.148   0.159
γ3          0.370   0.332   0.230   0.247   0.163   0.176
γ5          0.406   0.185   0.280   0.181   0.170   0.160

4.4.3 Prediction Accuracy

We evaluate prediction accuracy by calculating the prediction mean squared error, PMSE = K^{-1} Σ_{t=T+1}^{T+K} (y_t − ŷ_t)², where ŷ_t is the predicted value of y_t using the estimated parameters, for each model, and dividing it by the "oracle-OLS" PMSE, i.e. the PMSE of the OLS estimator conditional on knowing the variables that enter the model. This measure tells us how close we are to the traditional OLS predictor; a number close to 1 means that the prediction accuracy is very close to the oracle's. (A small sketch of this metric follows.)
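For reference, a minimal sketch of the relative PMSE, assuming forecasts from both the adaptive Lasso and the oracle-OLS fit are available (the names are ours):

    import numpy as np

    def relative_pmse(y_new, X_new, Z_new, beta_hat, gamma_hat, beta_ols, gamma_ols):
        """Relative PMSE: adaptive-Lasso forecasts against oracle-OLS forecasts (sketch)."""
        pmse = np.mean((y_new - X_new @ beta_hat - Z_new @ gamma_hat) ** 2)
        pmse_oracle = np.mean((y_new - X_new @ beta_ols - Z_new @ gamma_ols) ** 2)
        return pmse / pmse_oracle     # values near 1 mean near-oracle prediction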
To avoid the effect of large values, we use the average median of the PMSEs, estimated using resampling. Table 14 summarizes the results.

Table 14: Prediction Mean Squared Error (relative to oracle-OLS)

Model   T = 50   T = 100   T = 200
1       1.640    1.101     1.022
2       1.516    1.174     1.075
3       1.559    1.418     1.329
4       4.524    4.887     4.362
5       7.297    4.188     1.120
6       1.442    1.721     1.343

The PMSE approaches the oracle PMSE as the sample size increases. The rate at which the prediction error decreases depends on the number of candidate variables and the error structure: in models 4 and 5 the PMSE can be as much as 7 times larger than the oracle's in small samples, but the error converges rapidly to the oracle's when the errors are i.i.d. and the candidate variables are uncorrelated with the variables in the model. In model 4 the relative PMSE is very large and decreases slowly. This behavior can be explained by the performance of the method in choosing the "zero" parameters in this model: although the correct set of "non-zero" parameters is selected, a number of "zero" parameters is also selected and, since we are dealing with "explosive" regressors, the model prediction variance also increases. However, as the sample size increases, the relative error decreases as expected; for sample sizes 500 and 1000, the relative PMSE is 3.837 and 3.013, respectively.

5 Conclusion

In this paper we provide an extension of the adaptive Lasso variable selection method to cointegrated regressions. We show that, under regularity conditions frequently assumed in the model selection and cointegration literatures, the method selects the correct subset of variables and converges to the "oracle" estimate, i.e. the estimator computed as if we knew which variables enter the model. The result allows for a sub-linear number of candidate I(1) variables and a polynomial number of candidate I(0) variables, and we allow the number of I(0) variables that enter the model to increase with the sample size T. This condition allows for dynamic OLS estimation if we consider the integrated variables to be endogenous. Another interesting extension is the multivariate case: all results hold for the vector case if the dimension of y_t is fixed, i.e. for a fixed number of regressions, as can be shown by adapting the proofs of the theorems and the conditions to the vector case. All the previous results hold if all parameters β = 0 or γ = 0, meaning that we do not need both I(1) and I(0) variables for the results to hold. Also, the inclusion of an intercept does not change our results.

References

J. Bai and S. Ng. Forecasting economic time series using targeted predictors. Journal of Econometrics, 146:304–317, 2008.
M. Caner. Lasso-type GMM estimator. Econometric Theory, 25(01):270–290, 2009.
M. Caner and K. Knight. No country for old unit root tests: bridge estimators differentiate between nonstationary versus stationary models and select optimal lag. Working paper, University of Toronto, 2008.
R. de Jong and J. Davidson. The functional central limit theorem and weak convergence to stochastic integrals I: weakly dependent processes. Econometric Theory, 16:621–642, 2000.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348–1360, 2001.
J. Fan and H. Peng. Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 32(3):928–961, 2004.
N. Hsu, H. Hung, and Y. Chang. Subset selection for vector autoregressive processes using lasso. Computational Statistics & Data Analysis, 52(7):3645–3657, 2008.
J. Huang, S. Ma, and C.-H. Zhang. Adaptive lasso for sparse high-dimensional regression models. Statistica Sinica, 18:1603–1618, 2008.
J. Huang, J. Horowitz, and S. Ma. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics, 36(2):587–613, 2009.
D. Hunter and R. Li. Variable selection using MM algorithms. The Annals of Statistics, 33(4):1617–1642, 2005.
K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28(5):1356–1378, 2000.
Z. Liao and P. Phillips. Automated estimation of vector error correction models. Working paper, 2010.
N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34:1436–1462, 2006.
N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high dimensional data. The Annals of Statistics, 37:246–270, 2009.
P. C. Phillips and S. N. Durlauf. Multiple time series regression with integrated processes. Review of Economic Studies, 53:473–495, 1986.
S. Song and P. J. Bickel. Large vector auto regressions. ArXiv e-prints, 2011.
J. Stock and M. Watson. Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association, 97:1167–1179, 2002.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
M. Wainwright. Sharp thresholds for high-dimensional and noisy recovery of sparsity. arXiv preprint math/0605740, 2006.
H. Wang, G. Li, and C. Tsai. Regression coefficient and autoregressive order shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 69(1):63–78, 2007.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67, 2006.
P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429, 2006.
H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.

A Proofs of Theorems 1 and 2

Before presenting the proofs of Theorems 1 and 2 we introduce a useful lemma.
Lemma 2. Let

    \Omega_\infty = \begin{pmatrix} \Omega_{X,\infty} & 0 \\ 0' & \Omega_{Z,\infty} \end{pmatrix},    (23)

where Ω_{X,∞} = ∫₀¹ B_{X(1)}(r)B_{X(1)}'(r) dr and Ω_{Z,∞} = Σ_{Z(1)²}, and for any 0 ≤ r ≤ 1, B_{X(1)}(r) = lim_{T→∞} T^{-1/2} Σ_{t=1}^{⌈rT⌉} v_t(1). Similarly, split the matrix Ω11 into

    \Omega_{11} = \begin{pmatrix} \Omega_{X(1)^2} & \Omega_{Z(1)X(1)}' \\ \Omega_{Z(1)X(1)} & \Omega_{Z(1)^2} \end{pmatrix} = \begin{pmatrix} T^{-2}X(1)'X(1) & T^{-3/2}X(1)'Z(1) \\ T^{-3/2}Z(1)'X(1) & T^{-1}Z(1)'Z(1) \end{pmatrix}.    (24)

Let δ = (δ1', δ2')' and ξ = (ξ1', ξ2')' denote a pair of (q1 + q2) × 1 vectors satisfying δ_i'δ_i ≤ q_i and ξ_i'ξ_i ≤ q_i for i = 1, 2. Then, under Assumption 1, and if q1 = O(1) and q2 = o(T^{1/2}), we have

(a) δ'(Ω11 − Ω∞)ξ = o_p(1);
(b) δ1'(Ω_{X(1)²} − Ω_{X,∞})ξ1 = o_p(1);
(c) δ2'(Ω_{Z(1)²} − Ω_{Z,∞})ξ2 = o_p(1); and
(d) δ2'Ω_{Z(1)X(1)}ξ1 = o_p(1).

Proof. First consider the off-diagonal block Ω_{X(1)Z(1)} = T^{-3/2}X(1)'Z(1). Taking suprema over ‖δ1‖² ≤ q1 and ‖ξ2‖² ≤ q2,

    sup δ1'(T^{-3/2}X(1)'Z(1))ξ2 = T^{-1/2} sup Σ_{i=1}^{q1} Σ_{j=1}^{q2} δ_{1i}ξ_{2j} (T^{-1} Σ_{t=1}^T x_{it}z_{jt}) ≤ (q1 q2/T^{1/2}) O_p(1) = (q2/T^{1/2}) O(1) O_p(1) = o_p(1),

because q2/T^{1/2} = o(1). It follows from classical results in cointegration theory that |Ω_{X(1)²} − Ω_{X,∞}| = o_p(1), since q1 = O(1). Finally, we have to show that δ2'(Ω_{Z(1)²} − Ω_{Z,∞})ξ2 = o_p(1). Note that G_T = √T δ2'(Ω_{Z(1)²} − Ω_{Z,∞})ξ2 is a centered empirical process, and for any ε > 0,

    Pr( δ2'(Ω_{Z(1)²} − Ω_{Z,∞})ξ2 > ε ) = Pr( G_T ≥ √T ε ) ≤ E(G_T²)/(Tε²) ≤ q2² max_{1≤i≤j≤q2} E( T^{-1/2} Σ_t [z_{it}z_{jt} − σ_{ij}] )² / (ε²T) = (q2²/(ε²T)) O(1) → 0.

Combining these three results gives δ'(Ω11 − Ω∞)ξ = o_p(1), proving the lemma.

Proof of Theorem 1. We know from Proposition 1 that showing sign consistency amounts to showing Pr(A_T ∩ B_T) → 1, and it suffices to show that 1 − Pr(A_T^c) − Pr(B_T^c) → 1, the superscript "c" denoting the complement. The proof is divided into two parts: first we show Pr(A_T^c) → 0, then Pr(B_T^c) → 0.

The event A_T is given by (8a), with the inequality holding element-wise, so the complement is a union and can be split as A_T^c(X) ∪ A_T^c(Z), with

    A_T^c(X) = { T^{-1}|[Ω_{X(1)²}^{-1} + o_p(1)]X(1)'U| > T|β_0(1)| − ½T^{-1}λ1|[Ω_{X(1)²}^{-1} + o_p(1)]L_X(1)sgn(β_0(1))| }

and

    A_T^c(Z) = { T^{-1/2}|[Ω_{Z(1)²}^{-1} + o_p(1)]Z(1)'U| > T^{1/2}|γ_0(1)| − ½T^{-1/2}λ2|[Ω_{Z(1)²}^{-1} + o_p(1)]L_Z(1)sgn(γ_0(1))| }.

We first deal with A_T^c(X). By Assumptions 1, 4 and 5, and by Lemma 2,

    T^{-1}λ1λ_{1j}[|(Ω_{X(1)²}^{-1} + o_p(1))sgn(β_0(1))|]_j = (λ1/T^{1+ρ}) V_{1j} [|Ω_{X,∞}^{-1}sgn(β_0(1))|]_j + o_p(1) = o_p(1),

where the first equality follows from Lemma 2 and Assumption 5 (T^ρλ_{1j} = V_{1j} + o_p(1)), and the second from λ1 = o(T^{1+ρ}) together with [|Ω_{X,∞}^{-1}sgn(β_0(1))|]_j = q1[O_p(1) + o_p(1)] = O_p(1). Hence,
    Pr(A_T^c(X)) = Pr( [|T^{-1}Ω_{X,∞}^{-1}X(1)'U|]_j > T|β_{0j}| for some j = 1, ..., q1 ) + o_p(1)
                 ≤ Σ_{j=1}^{q1} Pr( [|T^{-1}Ω_{X,∞}^{-1}X(1)'U|]_j > T|β_{0j}| ) + o_p(1)
                 ≤ (q1/(T²β_*²)) max_{1≤j≤q1} E( [|T^{-1}Ω_{X,∞}^{-1}X(1)'U|]_j² ) + o_p(1) → 0,

where the second line follows from the union bound, the third from Chebyshev's inequality, and the convergence from Assumption 1 and the fact that q1 is constant.

We now focus on Pr(A_T^c(Z)). First denote by D_T the event {‖δ‖² = q2 : δ'|(T^{-1}Z(1)'Z(1))^{-1} − Ω_{Z,∞}^{-1}|δ > ετ_*^{-1}}, for ε + 1 < c_ε|γ_*| and some positive constant c_ε; we have already shown that Pr(D_T) → 0 as T → ∞. Consider the spectral decomposition Ω_{Z,∞} = EDE', with E the matrix of q2 eigenvectors and D the diagonal matrix of eigenvalues. By assumption the elements of D are greater than τ_*; then, inside D_T^c and for all j = 1, ..., q2,

    T^{-1/2}λ2λ_{2j}[|(Ω_{Z,∞}^{-1} + ε/τ_*)sgn(γ_0(1))|]_j ≤ T^{-1/2}q2λ2λ_{2j}/τ_* + T^{-1/2}λ2λ_{2j}q2ε/τ_*
        = (1 + ε)(λ2q2/(τ_*T^{(1+ρ)/2}))V_{2j}(1 + o_p(1))
        ≤ c_εγ_*(λ2q2/(τ_*T^{(1+ρ)/2}))V_{2j}(1 + o_p(1)),

where the first inequality uses

    [|Ω_{Z,∞}^{-1}sgn(γ_0(1))|]_j² ≤ sup_{‖δ‖=1} |δ'Ω_{Z,∞}^{-1}sgn(γ_0(1))|² ≤ sgn(γ_0(1))'ED^{-2}E'sgn(γ_0(1)) ≤ ‖sgn(γ_0(1))‖²‖E‖²τ_*^{-2} ≤ q2τ_*^{-2},

and the equality follows from the assumption that T^{ρ/2}λ_{2j} converges to a stationary process. Then,

    Pr(A_T^c(Z) ∩ D_T^c) ≤ Pr( max_{1≤j≤q2} [|T^{-1/2}Ω_{Z,∞}^{-1}Z(1)'U|]_j > T^{1/2}|γ_*| − c_εγ_*q2λ2T^{-(1/2+ρ)}V_2τ_*^{-1} )
        ≤ E[ max_{1≤j≤q2} [|T^{-1/2}Ω_{Z,∞}^{-1}Z(1)'U|]_j² ] / ( γ_*²T [1 − c_ελ2q2V_2/(τ_*T^{1+ρ/2})]² )
        ≤ q2^{2+1/d}τ_*^{-2} max_j ( E|T^{-1/2}Σ_{t=1}^T z_{jt}u_t|^{2d} )^{1/d} / ( γ_*²T [1 − (c_ε/τ_*)(λ2/T^{(1+ρ)/2})(q2/T^{1/2})V_2]² ) → 0,

where the second line follows from Chebyshev's inequality, and the third from the bound (max_j [T^{-1/2}Ω_{Z,∞}^{-1}Z(1)'U]_j)² = (max_j [T^{-1/2}ED^{-1}E'Z(1)'U]_j)² ≤ τ_*^{-2}q2²(max_j [T^{-1/2}Z(1)'U]_j)² together with Jensen's inequality, E(max_j |T^{-1/2}Σ_t z_{jt}u_t|²) ≤ q2^{1/d} max_j (E|T^{-1/2}Σ_t z_{jt}u_t|^{2d})^{1/d}. The conclusion follows from Assumptions 1, 4 and 5.

Moving to B_T^c, it follows from Lemma 2 that M(1) = M_∞(1) + o_p(1), with M_∞(1) = diag(M_X(1), M_Z(1)), where M_X(1) = I_T − X(1)(X(1)'X(1))^{-1}X(1)' and M_Z(1) = I_T − Z(1)(Z(1)'Z(1))^{-1}Z(1)'. The events B_T^c(X) and B_T^c(Z) can be written as

    B_T^c(X) = { max_{q1+1≤j≤n1} 2|T^{-1}x_j'M_X(1)U| ≥ T^{-1}λ1λ_{1j} − T^{-1}λ1|x_j'X(1)[Ω_{X,∞}^{-1} + o_p(1)]L_X(1)sgn(β_0(1))| }

and

    B_T^c(Z) = { max_{q2+1≤j≤n2} 2|T^{-1/2}z_j'M_Z(1)U| ≥ T^{-1/2}λ2λ_{2j} − T^{-1/2}λ2|z_j'Z(1)[Ω_{Z,∞}^{-1} + o_p(1)]L_Z(1)sgn(γ_0(1))| }.

We further consider the events C_T(X) = {max_{1≤j≤q1} λ_{1j} < β_*^{-1}} and C_T(Z) = {max_{1≤j≤q2} λ_{2j} < γ_*^{-1}}; then

    Pr(B_T^c(X)) ≤ Pr(B_T^c(X) ∩ C_T(X)) + Pr(C_T^c(X)),    (25a)
    Pr(B_T^c(Z)) ≤ Pr(B_T^c(Z) ∩ C_T(Z)) + Pr(C_T^c(Z)).    (25b)
By the Weak Irrepresentable Condition, inside C_T(X) one has

    T^{-1}λ1|x_j'X(1)[Ω_{X,∞}^{-1} + o_p(1)]L_X(1)sgn(β_0(1))| ≤ λ1(β_*λ_{1j} − η)/(Tβ_*) + o_p(1),

and hence

    T^{-1}λ1λ_{1j} − T^{-1}λ1|x_j'X(1)[Ω_{X,∞}^{-1} + o_p(1)]L_X(1)sgn(β_0(1))| ≥ λ1η/(Tβ_*) + o_p(1).

Therefore,

    Pr(B_T^c(X) ∩ C_T(X)) ≤ Pr( max_{q1+1≤j≤n1} |2T^{-1}x_j'M_X(1)U| > λ1η/(Tβ_*) ) + o_p(1)
        ≤ (4β_*²/η²) E[ max_j |T^{-1}x_j'M_X(1)U|² ] T²/λ1² + o_p(1)
        ≤ (4β_*²/η²) m1 max_j E|T^{-1}x_j'U|² T²/λ1² + o_p(1) → 0,

where the second line follows from Chebyshev's inequality, the third from the fact that for any projection matrix M, E|x_j'MU|² = E|x_j'U|² − E|x_j'(I − M)U|² ≤ E|x_j'U|², and the convergence from Assumption 4 and q1 = O(1). Applying the same reasoning to B_T^c(Z) ∩ C_T(Z), the WIC gives

    Pr(B_T^c(Z) ∩ C_T(Z)) ≤ Pr( max_{q2+1≤j≤n2} |2T^{-1/2}z_j'M_Z(1)U| > λ2η/(T^{1/2}γ_*) ) + o_p(1)
        ≤ (4γ_*²/η²) E[ max_j |T^{-1/2}z_j'M_Z(1)U|² ] T/λ2² + o_p(1)
        ≤ (4γ_*²c_d/η²) m2^{1/d} T/λ2² + o_p(1) → 0,

where the second line follows from Chebyshev's inequality, and the third by noticing that M_Z(1) is a projection matrix, which implies

    E max_j |T^{-1/2}z_j'M_Z(1)U|² ≤ E max_j |T^{-1/2}z_j'U|² ≤ m2^{1/d} max_j ( E|T^{-1/2}z_j'U|^{2d} )^{1/d} ≤ m2^{1/d} c_d.

Finally, both Pr(A_T^c) and Pr(B_T^c) converge to 0, hence Pr(A_T ∩ B_T) → 1, proving the theorem.

A.1 Proof of Theorem 2

Proof. Theorem 1 tells us that the adaptive Lasso estimator (5) asymptotically chooses the correct set of non-zero parameters. It remains to show that the distribution of the estimator of the non-zero parameters is the same as that of the OLS estimator conditional on knowing the correct set of parameters. The derivative of the criterion function in (5) is given by

    Q_T(\theta) = -2(Y - W(1)\theta(1))'W(1) + 2(W(2)\theta(2))'W(1) + \lambda(1)L(1)\,\mathrm{sgn}(\theta(1)),    (26)

where L(1) and λ(1) are as in Proposition 1. Setting Q_T(θ̂) = 0 and U = Y − W(1)θ_0(1), we find

    \Gamma^{1/2}(\hat{\theta}(1) - \theta_0(1)) = \Omega_{11}^{-1}\Gamma^{-1/2}W(1)'U - \Omega_{11}^{-1}\Big[\Gamma^{-1/2}W(1)'W(2)\hat{\theta}(2) + \tfrac{1}{2}\Gamma^{-1/2}\lambda(1)L(1)\,\mathrm{sgn}(\hat{\theta}(1))\Big],    (27)

which tells us the adaptive Lasso estimator has the form of a biased OLS estimator, with the bias between square brackets. Hence, (9) is equivalent to showing that Ω11 converges to the optimal covariance matrix, that T^{-1}U'X(1) has a mixed normal distribution, that T^{-1/2}U'Z(1) has a normal distribution, and that the terms in square brackets converge to zero.

We have already seen in the proof of Theorem 1 that

    \Omega_{11} \Rightarrow \begin{pmatrix} \int B_{X(1)}B_{X(1)}'\,dr & 0 \\ 0' & \Sigma_{Z(1)^2} \end{pmatrix}.

Since q1 = O(1), it follows from Assumption 1 (DGP) that T^{-1}U'X(1) ⇒ ∫ B_{X(1)} dB_u. Using the Cramér–Wold device, one can show that for any q2 × 1 vector α satisfying α'α ≤ 1,

    E[T^{-1/2}α'Z(1)'U] = 0 and E(T^{-1/2}α'Z(1)'U)² = T^{-1}α'E[Z(1)'UU'Z(1)]α → σ_u^{*2}α'Σ*_{Z(1)²}α,

where the limit follows from Assumption 1 (DGP). Combining the Cramér–Wold device with the central limit theorem for dependent processes, one can show that for any constant c, T^{-1/2}cα'Z(1)'U ⇒ N(0, c²σ_u^{*2}α'Σ*_{Z(1)²}α), and therefore T^{-1/2}Z(1)'U ⇒ N(0, σ_u^{*2}Σ*_{Z(1)²}). The first term of the bias vanishes because θ̂(2) = o_p(1).
The second term of the bias is also treated in the proof of Theorem 1 and is shown to be o(λ2q2/T^{(1+ρ)/2} + λ1/T^{1+ρ}). By assumption, λ1/T^{1+ρ} → 0 and λ2q2/T^{(1+ρ)/2} → 0; therefore the bias term converges to zero as T increases.
