From Cross-Validation to SURE: Asymptotic Risk of Tuned Regularized Estimators

Karun Adusumilli*, Maximilian Kasy†, Ashia Wilson‡

March 24, 2026

Abstract

We derive the asymptotic risk function of regularized empirical risk minimization (ERM) estimators tuned by n-fold cross-validation (CV). The out-of-sample prediction loss of such estimators converges in distribution to the squared-error loss (risk function) of shrinkage estimators in the normal means model, tuned by Stein's unbiased risk estimate (SURE). This risk function provides a more fine-grained picture of predictive performance than uniform bounds on worst-case regret, which are common in learning theory: it quantifies how risk varies with the true parameter. As key intermediate steps, we show that (i) n-fold CV converges uniformly to SURE, and (ii) while SURE typically has multiple local minima, its global minimum is generically well separated. Well-separation ensures that uniform convergence of CV to SURE translates into convergence of the tuning parameter chosen by CV to that chosen by SURE.

* Department of Economics, University of Pennsylvania. akarun@sas.upenn.edu.
† Department of Economics, University of Oxford. maximilian.kasy@economics.ox.ac.uk. Maximilian Kasy was supported by the Alfred P. Sloan Foundation, under the grant "Social foundations for statistics and machine learning."
‡ Department of Electrical Engineering and Computer Science, MIT.

1 Introduction

Background  The goal of supervised learning is to produce good predictions for new observations.¹ An important class of estimators for supervised learning can be described as regularized empirical risk minimization (ERM) estimators that are tuned using cross-validation (CV). ERM estimators minimize in-sample average prediction loss (empirical risk), among a given class of predictors. Examples include ordinary least squares and maximum likelihood. ERM estimators are prone to overfitting if the class of predictors is large. Such estimators achieve low in-sample loss, but perform poorly for new observations.

To counter overfitting, regularization is used. Regularization adds a penalty term to the ERM objective; common penalties include the L2 norm of parameters (in Ridge regression) and the L1 norm (in Lasso regression). Adding a penalty avoids overfitting, by reducing the estimator variance, at the cost of introducing some bias, which might result in underfitting.

To achieve good performance, avoiding both overfitting and underfitting, the amount of penalization needs to be carefully tuned. This can be done by choosing the weight on the penalty term as the minimizer of a cross-validation estimate of predictive loss. We focus on n-fold CV, where predictions are evaluated for one hold-out observation at a time, and predictive loss is estimated by averaging evaluations over each of the n observations.

¹ We would like to dedicate this paper to Gary Chamberlain, whose conversations provided the original inspiration for this project.

Risk functions  The present paper characterizes the behavior of estimators of this form by deriving an asymptotic approximation to their risk function. A large literature in learning theory characterizes such estimators by proving bounds on their worst-case regret, that is, the supremum over data-generating processes (DGPs) of the difference between an estimator's risk and the expected loss of the best predictor in the given class. Such bounds provide strong robustness guarantees, but they might not be informative about the behavior of predictive algorithms for realistic DGPs. By focusing on the risk function, we obtain a more fine-grained characterization: the risk function tells us how expected predictive performance depends on the DGP, while worst-case regret bounds only characterize performance for the least favorable DGP.

Figure 1: Risk function for JS-shrinkage, dimension 10. [Plot of mean squared error (MSE) against ∥θ∥, for the MLE and for JS-shrinkage.]

Our asymptotic characterization relates the risk function of regularized ERM estimators tuned using CV to the risk function of the James-Stein (JS) shrinkage estimator (James and Stein, 1961), and generalizations thereof. JS shrinkage famously dominates maximum likelihood estimation (MLE) in the normal means setting: the risk function (mean squared error) of JS shrinkage, as characterized in Stein (1981), is lower than the risk of MLE, for any possible DGP. Figure 1 illustrates. Our result suggests that this same risk improvement carries over, asymptotically, to CV-tuned penalized estimation in general parametric models. Like CV, SURE provides an unbiased estimate of mean squared error; the JS estimator approximately minimizes SURE over the shrinkage intensity.
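The comparison in Figure 1 is easy to reproduce numerically. The following Python snippet is a minimal Monte Carlo sketch of our own (not taken from the paper; the direction of θ and the normalization of the MSE by the dimension are our choices): it estimates the risk of the MLE and of JS-shrinkage in the normal means model with Σ = I and dimension 10.

    import numpy as np

    rng = np.random.default_rng(0)
    d, reps = 10, 50000

    for norm in [0.0, 1.0, 2.0, 4.0, 6.0]:
        theta = np.zeros(d)
        theta[0] = norm                               # risk depends on theta only through ||theta||
        x = theta + rng.standard_normal((reps, d))    # MLE draws: x ~ N(theta, I)
        js = (1 - (d - 2) / np.sum(x**2, axis=1))[:, None] * x   # James-Stein shrinkage
        mse = lambda est: np.mean(np.sum((est - theta)**2, axis=1)) / d
        print(f"||theta|| = {norm:.0f}:  MSE(MLE) = {mse(x):.3f}, MSE(JS) = {mse(js):.3f}")

The normalized MSE of the MLE is 1 for every θ, while the JS risk starts near 2/d = 0.2 at θ = 0 and rises toward 1 as ∥θ∥ grows, matching the shape of Figure 1.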
Main result  Our main theorem states that the distribution of the out-of-sample prediction loss of CV-tuned regularized ERM estimators converges to the distribution of the squared-error loss of the corresponding SURE-tuned shrinkage estimator in the Gaussian limit experiment. In particular, the risk function (expected out-of-sample prediction loss, as a function of the true parameter) converges to the mean squared error of SURE-tuned penalized estimation in the normal means model. Two of our intermediate results are of independent interest: the uniform approximation of n-fold CV by SURE, and the generic well-separation of the global minimum of SURE.

Key steps  Let us briefly outline the three main parts of our proof. First, we show that in large samples ERM estimators are approximately normally distributed, and that out-of-sample predictive loss is approximately equal to squared-error loss. These are standard results, and we follow van der Vaart (2000) in proving this step. We furthermore need to show that this approximation carries over to penalized estimators, for fixed tuning parameters. Our asymptotic approximations are based on local-to-0 asymptotics: as the sample size n increases, the parameter vector drifts to 0 (which is the minimizer of the penalty term) at a rate of 1/√n. This rate is such that both bias and variance remain non-negligible in the limit. The drifting-parameter framework is natural for studying regularized estimators, because it is the regime in which the penalty has a first-order effect: if the true parameter were fixed away from 0, the penalty would become asymptotically irrelevant, while if it were exactly 0, no bias-variance tradeoff would arise.

Second, we need to show that n-fold CV, as a random function that maps tuning parameter values to estimates of predictive loss, converges uniformly to SURE. In this step of the proof, we build on the prior work of Wilson et al. (2020).
This step involves an influence function approximation for leave-one-out estimators, and a second-order approximation to predictive risk. Uniformity of convergence in the tuning parameter is key to this step, and we need to carefully specify regularity conditions such that uniformity is guaranteed.

Third, we need to show that convergence of the CV criterion function for tuning is sufficient for convergence of its minimizer, the tuned parameter. This is non-trivial, because both CV and SURE typically have multiple local minima, and might have multiple global minima. We need to show that generically the global minimum is well-separated. We do this using separate arguments for L1 and L2 penalties, characterizing the shape and behavior of SURE in either case.

We should emphasize some limitations of our analysis: we do not consider penalties beyond L1 and L2, or k-fold CV with k < n; extending our results in these directions is left for future work. Our asymptotic approximations are furthermore not appropriate in the over-parametrized regime, where the number of parameters is of similar or larger magnitude than the number of observations.

Literature  The analysis in this paper connects several lines of work in statistics, econometrics, and machine learning. Leave-one-out cross-validation was formalized by Stone (1974); its asymptotic optimality for model selection was established by Li (1987). A closely related family of risk estimators includes Mallows' Cp (Mallows, 1973), generalized cross-validation (Golub et al., 1979), and Stein's unbiased risk estimate (Stein, 1981). Efron (2004) provides a unifying perspective, showing that these criteria all take the form of in-sample error plus a covariance penalty. Arlot and Celisse (2010) provide a comprehensive survey of cross-validation procedures and their theoretical properties. A noteworthy contrast with our results is provided by Shao (1993), who shows that leave-one-out CV is asymptotically inconsistent for model selection (it selects overfitted models with positive probability), while k-fold CV with k = o(n) is consistent. Our result is complementary in focus: rather than asking which of a finite list of models has the best predictive ability, we characterize the limiting distribution and risk function of the estimator selected by leave-one-out CV, showing it converges to that of the SURE-tuned normal-means estimator.

Shrinkage estimation originates with James and Stein (1961) and was analyzed in depth by Stein (1981). SURE-based tuning of shrinkage was extended to wavelet thresholding by Donoho and Johnstone (1995). The two penalty families we study, Ridge (Hoerl and Kennard, 1970) and Lasso (Tibshirani, 1996), are the most widely used forms of regularization in practice, and both are commonly tuned by cross-validation. The close relationship between leave-one-out CV and covariance-penalty criteria such as SURE is further illuminated by Zou et al. (2007), who show, using SURE as the analytical tool, that the effective degrees of freedom of the Lasso equals the number of nonzero fitted coefficients.

Using local asymptotic frameworks to characterize decision problems was pioneered by Le Cam (1972); work using this approach is reviewed in Hirano and Porter (2020). The use of shrinkage asymptotics for parametric models in econometrics is discussed in Hansen (2016). We build directly on Wilson et al. (2020), who provide non-asymptotic deterministic guarantees for approximate cross-validation, showing that leave-one-out estimators can be well approximated by a single Newton step from the full-sample estimator.
Roadmap  The remainder of this paper is structured as follows. In Section 2, we introduce our model and assumptions, and define all relevant notation. In Section 3, we first provide a heuristic outline of our proof, and then state a series of intermediate lemmas. Proofs of all lemmas are collected in the appendix. In Appendix A, we prove the lemmas corresponding to the first part of our argument, involving influence function approximations and asymptotic normality. In Appendix B, we prove the second part of the argument, namely the (uniform) approximation of n-fold CV by SURE. In Appendices C and D, we show that the global minimizer of SURE is generically well-separated for both L2 and L1 penalties, thereby proving the third part. In Appendix E, we conclude our derivation, proving the convergence of risk for tuned estimators.

2 Setup

In the following, we first set up our estimators, and their asymptotic counterparts, in a series of definitions. We then state the assumptions that will be invoked to justify our asymptotic approximations.

Throughout this paper, we consider the problem of estimating a parameter vector β0 which is defined as the minimizer of expected loss E[l(β, Z)]. For prediction problems, typically Z = (W, Y), for predictive features W and outcomes Y. Examples include linear OLS regression, where l(β, Z) = (Y − W·β)², as well as the use of neural nets² or other parametric models for classification, where l(β, Z) = −log(f(Y | W, β)), and f(Y | W, β) is the probability assigned to outcome Y by the model.

² With a small number of parameters relative to the sample size.

We use local-to-0 coordinates, θ = √n·β, for sample size n, and correspondingly θ0 = √n·β0. For each n, the random vectors Z_in are i.i.d. draws from the distribution µ_n, across i. Any finite-sample object will be denoted by a subscript n; objects without subscripts correspond to the limiting experiment.

2.1 Definitions

Definition 1 introduces notation for loss functions and for limiting loss functions. We evaluate estimates of θ in terms of their expected loss L̄_n(θ, θ0). For supervised learning, L̄_n(θ, θ0) is the out-of-sample expected prediction error.

Definition 1 (Loss, empirical loss, and expected loss). Given the loss function l(β, z), define the following:

  l_n(θ, z) = l(θ/√n, z)    (loss function in the local parameter)
  L_n(θ) = Σ_{i=1}^n l_n(θ, Z_in)    (empirical loss)
  L̄_n(θ, θ0) = E[L_n(θ) − L_n(θ0)]    (expected loss)
  L̄(θ, θ0) = lim_{n→∞} L̄_n(θ, θ0)    (limiting expected loss)

We assume that the sequence of distributions µ_n is such that the limit in the last definition is well-defined.

Scaling  Note that we could have equivalently defined

  L̄_n(θ, θ0) = n · E[l_n(θ, Z_{n+1,n}) − l_n(θ0, Z_{n+1,n})],

that is, L̄_n(θ, θ0) is the expected regret for out-of-sample predictions, multiplied by the sample size n. The multiplication by n is required because in Definition 1 we do not scale empirical loss L_n(θ) by a factor 1/n, and therefore L_n(θ) diverges. In the definition of the local parameter vector θ we have, however, re-scaled the parameter vector β by a factor of 1/√n. This implies that the Hessian (second derivative) of L_n(θ) with respect to θ, if it exists, is given by

  ∇²_θ L_n(θ) = (1/n) Σ_{i=1}^n ∇²_β l(θ/√n, Z_in).

We can thus expect that this Hessian converges, by a law of large numbers, under some additional regularity conditions.
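As a concrete illustration of these definitions, consider the following minimal Python sketch, under assumptions of our own choosing (squared-error loss and standard normal features; this is not a construction from the paper). It evaluates L_n in local coordinates and shows that its Hessian is a sample average that stabilizes as n grows:

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 2000, 3
    theta0 = np.array([1.0, -0.5, 2.0])        # local parameter; beta0 = theta0 / sqrt(n)
    W = rng.standard_normal((n, d))
    Y = W @ (theta0 / np.sqrt(n)) + rng.standard_normal(n)

    def L_n(theta):
        # empirical loss of Definition 1, with l(beta, z) = (y - w . beta)^2
        resid = Y - W @ (theta / np.sqrt(n))
        return np.sum(resid**2)

    # Hessian of L_n w.r.t. theta is the sample average (2/n) W'W, which converges
    # to 2 E[ww'] = 2I by the law of large numbers (the factor 2 would be absorbed
    # by the coordinate normalization of Assumption 2 below)
    print(np.round(2 * W.T @ W / n, 2))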
Estimators of θ and their asymptotic counterparts  We next specify a series of estimators, in Definition 2. We start with the standard ERM estimator θ̂_n = argmin_θ L_n(θ), and its limiting counterpart θ̂, where the latter is normally distributed with mean θ0 and variance Σ. We then consider the regularized versions of these estimators, that is, the penalized ERM estimator with penalty λ·π(θ), and its limiting counterpart. Our choice of local parametrization ensures that a constant value of λ along the sequence indexed by n leads to a non-degenerate limit for the penalized ERM estimator, where neither variance nor bias of this estimator vanish.

In Definition 2, we furthermore introduce leave-one-out loss, and the corresponding leave-one-out ERM and penalized ERM estimators. These will serve as the building blocks of n-fold cross-validation.

Definition 2 (Estimators of θ0). We will consider the following estimators of θ0, for finite n, and in the limit experiment:

  θ̂_n = argmin_θ L_n(θ)    (ERM estimator)
  θ̂ ∼ N(θ0, Σ)    (limiting ERM estimator)
  θ̂^λ_n = argmin_θ [L_n(θ) + λ·π(θ)]    (penalized ERM estimator)
  θ̂^λ = argmin_θ [½∥θ − θ̂∥² + λ·π(θ)]    (limiting penalized ERM estimator)

where π(·) is convex and attains its minimum at 0. We furthermore consider the following leave-one-out (LOO) loss and estimators of θ0:

  L^{−i}_n(θ) = Σ_{j≠i} l_n(θ, Z_jn)    (LOO empirical loss)
  θ̂^{−i}_n = argmin_θ L^{−i}_n(θ)    (LOO ERM estimator)
  θ̂^{λ,−i}_n = argmin_θ [L^{−i}_n(θ) + λ·π(θ)]    (LOO penalized ERM estimator)

We can rewrite the limiting penalized ERM estimator as

  θ̂^λ = θ̂ + g_λ(θ̂),  where  g_λ(θ) = argmin_g [½∥g∥² + λ·π(θ + g)].

Denote by ∇g_λ(θ) the derivative of g_λ(θ) where it exists, and define ∇g_λ(θ) = 0 at points where g_λ(θ) is not differentiable.³

³ This convention is adopted for convenience; we will use it to handle Lasso (L1) penalties in the proof of Lemma 4 below.

Estimators of risk  The preceding definition introduced penalized estimators for given, fixed values of the tuning parameter λ. We are interested in estimators which choose this tuning parameter in a data-dependent way, where λ minimizes an estimator of risk. For finite sample size n, we consider the n-fold cross-validation (CV) criterion as an estimator of the risk of penalized ERM estimation. For the limit experiment, we consider Stein's Unbiased Risk Estimate (SURE) as an estimator of the risk of the penalized limiting ERM estimator.

Definition 3 (Estimators of risk). We consider the following estimators of risk for penalized ERM estimators with fixed tuning parameter λ:

  CV_n(λ) = Σ_i l_n(θ̂^{λ,−i}_n, Z_in)    (n-fold CV)
  SURE(λ, θ̂, Σ) = trace(Σ) + ∥g_λ(θ̂)∥² + 2·trace(∇g_λ(θ̂)·Σ)    (SURE)
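To make these objects concrete, here is a small Python sketch of our own of g_λ and of the SURE formula in Definition 3. It is an illustration, not the paper's implementation: the Lasso case assumes A = I, and the Ridge Jacobian is obtained from the first-order condition g + λA⁻¹(θ + g) = 0.

    import numpy as np

    def g_ridge(theta, lam, A_inv):
        # argmin_g 0.5||g||^2 + lam * 0.5 (theta+g)' A^{-1} (theta+g);
        # the first-order condition gives g = -(I + lam A^{-1})^{-1} lam A^{-1} theta
        d = len(theta)
        jac = -np.linalg.solve(np.eye(d) + lam * A_inv, lam * A_inv)  # constant Jacobian
        return jac @ theta, jac

    def g_lasso(theta, lam):
        # argmin_g 0.5||g||^2 + lam ||theta + g||_1: soft-thresholding minus the identity
        prox = np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)
        jac = np.diag(-(np.abs(theta) <= lam).astype(float))          # a.e. derivative
        return prox - theta, jac

    def sure(lam, theta_hat, Sigma, penalty="lasso", A_inv=None):
        # SURE(lam, theta_hat, Sigma) = trace(Sigma) + ||g||^2 + 2 trace(grad g . Sigma)
        g, jac = (g_lasso(theta_hat, lam) if penalty == "lasso"
                  else g_ridge(theta_hat, lam, A_inv))
        return np.trace(Sigma) + g @ g + 2 * np.trace(jac @ Sigma)

For Ridge, the map θ ↦ θ + g_λ(θ) is linear shrinkage; for Lasso, it is soft-thresholding, whose kinks are the source of the extra care taken in the proofs below.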
Tuned estimators  We can now formally define our tuned estimators. θ̂*_n is the penalized ERM estimator using a tuning parameter λ*_n which minimizes the n-fold CV estimator of risk. θ̂* is the penalized limiting ERM estimator using a tuning parameter λ* which minimizes the SURE estimator of risk. The tuning parameter is chosen from a set Λ ⊂ R. Later, we will consider Λ = R+ (for Ridge penalties), and Λ arbitrary but finite (for Lasso penalties).

Definition 4 (Tuned estimators of θ).

  θ̂*_n = θ̂^{λ*_n}_n,  λ*_n = argmin_{λ∈Λ} CV_n(λ)    (penalized ERM tuned using CV)
  θ̂* = θ̂^{λ*},  λ* = argmin_{λ∈Λ} SURE(λ, θ̂, Σ)    (limiting penalized ERM tuned using SURE)

We evaluate estimators based on their expected loss for new data-points. For supervised learning, this corresponds to the out-of-sample expected prediction loss. In this paper, we do not consider global criteria such as worst-case risk (maximizing over θ0) or Bayes risk (averaging over a prior distribution for θ0). Instead, we are interested in the dependence of expected loss on the parameter θ0, which is captured by the risk function. Our notation makes this dependence on θ0 explicit. The risk function gives a more fine-grained picture of estimator performance, relative to global criteria such as worst-case risk, Bayes risk, or worst-case regret.

The parameter θ0 enters the following expressions both directly, as an argument of L̄_n and L̄, and implicitly, via the distribution of Z that the expectations are averaging over.

Definition 5 (Risk functions).

  R_n(θ0) = E[L̄_n(θ̂*_n, θ0)]    (finite sample risk, tuned using CV)
  R(θ0) = E[L̄(θ̂*, θ0)]    (limiting risk, tuned using SURE)
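The finite-sample half of Definition 4 can be implemented by brute force. The sketch below is our own illustration under assumptions of our own choosing (Ridge penalty π(θ) = ½∥θ∥², squared-error loss, and a hypothetical grid standing in for Λ); note that the penalty λ·π(θ) in local coordinates becomes λn·½∥β∥² in the original parametrization.

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 60, 5
    W = rng.standard_normal((n, d))
    Y = W @ (rng.standard_normal(d) / np.sqrt(n)) + rng.standard_normal(n)

    def ridge(Wtr, Ytr, lam):
        # penalized ERM in beta-coordinates: 0.5||Y - W b||^2 + 0.5 lam n ||b||^2
        return np.linalg.solve(Wtr.T @ Wtr + lam * n * np.eye(d), Wtr.T @ Ytr)

    def cv_n(lam):
        # n-fold (leave-one-out) CV criterion of Definition 3; the constant factor
        # from the loss-scaling convention does not affect the argmin
        return sum((Y[i] - W[i] @ ridge(np.delete(W, i, 0), np.delete(Y, i, 0), lam)) ** 2
                   for i in range(n))

    grid = np.geomspace(0.01, 100, 40)                  # a stand-in for the set Lambda
    lam_star = min(grid, key=cv_n)                      # lambda*_n = argmin CV_n(lambda)
    theta_star = np.sqrt(n) * ridge(W, Y, lam_star)     # tuned estimator, local coordinates
    print(lam_star, np.round(theta_star, 2))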
2.2 Assumptions

Having defined our estimators and evaluation criteria, we next specify the assumptions invoked in our asymptotic analysis. Assumption 1 sets up a sequence of experiments, indexed by n. We assume that, for each n, the minimizer of expected loss E[l(β, Z_in)] is given by θ0/√n. Put differently, the minimizer β of expected loss drifts towards 0. We furthermore assume that the variance Σ of the score ∇_β l(θ0/√n, Z_in) remains constant along our sequence.

Assumption 1 (Sequence of experiments). For each n, the random vectors Z_in are i.i.d. draws from the distribution µ_n, across i. The distributions µ_n are such that θ0 and Σ do not depend on n, where

  θ0 = argmin_θ E[l(θ/√n, Z_in)],  Σ = Var(∇_β l(θ0/√n, Z_in)).

The limiting Hessian H = ∇²_θ L̄(θ, θ0) is typically non-degenerate because of our scaling of L̄_n(θ, θ0) and of θ. The following Assumption 2 is made for notational convenience. This assumption states that the Hessian H is equal to the identity I. This is a coordinate normalization that can be imposed without loss of generality.⁴

Assumption 2 (Normalized loss function). ∇²_θ L̄(θ, θ0)|_{θ=θ0} = I.

⁴ By suitable choice of coordinates we can normalize either H, or the asymptotic variance Σ of Assumption 1, or the Hessian of the penalty function π (when the latter exists), but not more than one of these three matrices, in general. After normalizing the Hessian, we can however diagonalize one more matrix, without loss of generality.

The last part of our proof requires showing that the global optimum of SURE with respect to λ is generically unique and well-separated. We will prove this fact for both Ridge and Lasso penalties, using separate arguments for either case.

Assumption 3 (Penalty function and grid for tuning). The penalty π(θ) and the set Λ take one of the following two forms:
1. Ridge: π(θ) = ½θ·A⁻¹·θ, where A is positive definite, and Λ = R+.
2. Lasso: π(θ) = ∥A⁻¹·θ∥₁, where A is an invertible matrix, and Λ ⊂ R+ is finite.

The remaining assumptions state regularity conditions. The first item in Assumption 4 is a condition on the loss function which allows us to invoke results from empirical process theory. Similar assumptions are invoked in van der Vaart (2000) when deriving the properties of M-estimators. The second item in Assumption 4 is a weak high-level condition ruling out divergence of ERM estimators, which ensures the applicability of empirical process results.

Assumption 4 (Conditions for convergence of the ERM estimator).
1. Lipschitz loss: The loss function l(β, z) satisfies |l(β1, z) − l(β2, z)| ≤ m(z)·∥β1 − β2∥ for all β1, β2 in a neighborhood of 0, where sup_n Var(m(Z_in)) < ∞. Furthermore, l(β, Z_in) is differentiable w.r.t. β, for all β in a neighborhood of β = 0, with probability 1.
2. Stochastically bounded ERM estimator: The sequence θ̂_n = argmin_θ L_n(θ) is bounded in probability.

To prove the (uniform) convergence of CV_n to SURE, we finally impose the following additional regularity conditions.

Assumption 5 (Conditions for convergence of CV).
1. Conditions on loss: There exist µ > 0, ν < ∞ independent of n such that L_n(θ) is µ-strongly convex and has ν-smooth Hessians with probability approaching 1 under µ_n.⁵
2. Conditions on scores: The function √n∇_θ l_n(θ, Z_in) is Lipschitz continuous almost everywhere, i.e., there exists B_n(Z_in) such that
  ∥√n∇_θ l_n(θ, Z_in) − √n∇_θ l_n(θ′, Z_in)∥ ≤ B_n(Z_in)·∥θ − θ′∥  for all θ, θ′ ∈ Θ,
and E_{µn}[∥B_n(Z_in)∥²] < ∞. Additionally, there exists M < ∞ independent of n such that E_{µn}[∥√n∇_θ l_n(θ0, Z_in)∥⁴] ≤ M.
3. Conditions on Hessians: The Hessian ∇²_θ l_n(θ, Z_in) is such that
  (1/n) Σ_{i=1}^n ∥n∇²_θ l_n(θ0, Z_in)∥² = O_{µn}(1).
Furthermore, there exists C_n(Z_in) such that
  ∥n∇²_θ l_n(θ, Z_in) − n∇²_θ l_n(θ′, Z_in)∥ ≤ C_n(Z_in)·∥θ − θ′∥  for all θ, θ′ ∈ Θ,
and sup_n E_{µn}[C_n(Z_in)²] < ∞.
4. Conditions on fourth derivatives: The fourth derivative tensor D⁴_θ l_n(θ, Z_in) of l_n(·, Z_in) is such that
  sup_n E_{µn}[sup_{θ∈Θ} ∥n²D⁴_θ l_n(θ, Z_in)∥⁴] ≤ M,
for some M < ∞.

⁵ L_n(θ) is µ-strongly convex if ∇²L_n(θ) − µI is positive semi-definite for all θ. A function is L-smooth if its gradients are Lipschitz continuous with Lipschitz constant L.

3 Main result and intermediate lemmas

Our main goal in this paper is to prove the following result:

Theorem 1. L̄_n(θ̂*_n, θ0) →_d ½∥θ̂* − θ0∥².

In words, the distribution of the loss of the penalized ERM estimator tuned using n-fold cross-validation converges to the distribution of the squared error of the corresponding shrinkage estimator in the normal means model, tuned by Stein's Unbiased Risk Estimate. A special case of these limiting estimators are James-Stein shrinkage estimators, for which closed form characterizations of the risk function are known (Stein, 1981).
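A simulation can make Theorem 1 tangible. The following Python sketch is a Monte Carlo illustration of our own, under assumptions of our own choosing: Ridge penalty with A = I, a Gaussian linear model with E[ww'] = I so that Σ = I, a hypothetical grid for Λ, and the quadratic loss ½∥θ̂*_n − θ0∥² used as the proxy for L̄_n(θ̂*_n, θ0) that Lemma 3 below justifies. It compares the finite-sample, CV-tuned loss distribution with the SURE-tuned one in the limit experiment.

    import numpy as np

    rng = np.random.default_rng(3)
    n, d, reps = 100, 5, 200
    theta0 = np.array([2.0, -1.0, 0.0, 0.0, 1.0])
    grid = np.geomspace(0.05, 50, 15)

    def ridge(Wtr, Ytr, lam):
        return np.linalg.solve(Wtr.T @ Wtr + lam * n * np.eye(d), Wtr.T @ Ytr)

    finite, limit = [], []
    for _ in range(reps):
        # finite-sample side: n-fold-CV-tuned penalized ERM
        W = rng.standard_normal((n, d))
        Y = W @ (theta0 / np.sqrt(n)) + rng.standard_normal(n)
        cv = [sum((Y[i] - W[i] @ ridge(np.delete(W, i, 0), np.delete(Y, i, 0), lam)) ** 2
                  for i in range(n)) for lam in grid]
        t_cv = np.sqrt(n) * ridge(W, Y, grid[int(np.argmin(cv))])
        finite.append(0.5 * np.sum((t_cv - theta0) ** 2))
        # limit experiment: SURE-tuned shrinkage of theta_hat ~ N(theta0, I),
        # where theta^lambda = theta_hat / (1 + lambda) for the Ridge penalty with A = I
        th = theta0 + rng.standard_normal(d)
        sure = [d + (lam / (1 + lam)) ** 2 * th @ th - 2 * d * lam / (1 + lam) for lam in grid]
        t_s = th / (1 + grid[int(np.argmin(sure))])
        limit.append(0.5 * np.sum((t_s - theta0) ** 2))

    print(np.round(np.quantile(finite, [0.25, 0.5, 0.75]), 2))  # the two loss distributions
    print(np.round(np.quantile(limit, [0.25, 0.5, 0.75]), 2))   # should be close for large n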
An immediate corollary of Theorem 1 is the convergence of risk functions, subject to possible truncation of tail events.⁶

⁶ Truncation is necessary because, even for estimators such as linear OLS regression, risk is typically undefined, since some eigenvalues of the design matrix might be close to 0, so that the moments of θ̂_n might not exist.

Corollary 1. Let M > 0. Then

  E[min(L̄_n(θ̂*_n, θ0), M)] → E[min(½∥θ̂* − θ0∥², M)].

3.1 Outline of proof

We will build up the argument that proves Theorem 1 in a series of lemmas. Before doing so, however, we first provide an intuitive sketch of our argument, while neglecting remainder terms. Subsequently, we will prove that these remainder terms are indeed asymptotically negligible.

Influence function approximation  We start by noting that empirical risk is asymptotically equivalent to quadratic error loss relative to the sample mean θ̃_n,

  L_n(θ) ≈ const. + ½∥θ − θ̃_n∥²,  θ̃_n = θ0 + (1/√n) Σ_i X_in,

where

  X_in = −∇_β l(θ0/√n, Z_in)

is the influence function. Recall that we have normalized the Hessian in Assumption 2, which simplifies the expression for θ̃_n. This approximation of empirical risk immediately implies an asymptotic linear approximation of the empirical risk minimization (ERM) estimator, θ̂_n ≈ θ̃_n. These are standard approximations that deliver asymptotic normality of ERM estimators; see for instance Theorem 5.21 in van der Vaart (2000).

We then get the corresponding approximation for the penalized ERM estimator, for fixed λ,

  θ̂^λ_n ≈ θ̃^λ_n = θ̃_n + g_λ(θ̃_n),

where we recall the definition g_λ(θ) = argmin_g [½∥g∥² + λ·π(θ + g)].

Convergence of CV to SURE  An analogous approximation holds for leave-one-out (LOO) loss. To obtain the LOO sample mean, the influence function (1/√n)X_in is subtracted from the sample mean θ̃_n, which gives

  L^{−i}_n(θ) ≈ const. + ½∥θ − θ̃^{−i}_n∥²,  where  θ̃^{−i}_n = θ̃_n − (1/√n)X_in.

The penalized LOO estimator is then approximately given by

  θ̂^{λ,−i}_n ≈ θ̃^{−i}_n + g_λ(θ̃^{−i}_n) ≈ θ̃^λ_n − (1/√n)(I + ∇g_λ(θ̃_n))·X_in.

In the last step we have replaced g_λ by its first-order Taylor approximation around θ̃_n, at points θ̃_n where g_λ is differentiable. (This approximation won't hold at kink-points of g_λ, which exist for Lasso penalties, in particular.)

The n-fold cross-validation estimator of the risk of θ̂^λ_n can be approximated by

  CV_n(λ) = Σ_i l_n(θ̂^{λ,−i}_n, Z_in) ≈ const. + (1/n) Σ_i ∥θ̂^{λ,−i}_n − θ0 − √n X_in∥².

When we take this expression, plug in the approximate form of θ̂^{λ,−i}_n (so that the expression inside the norm becomes θ̃_n + g_λ(θ̃_n) − (1/√n)(I + ∇g_λ(θ̃_n))·X_in − θ0 − √n X_in), multiply out the inner products, and omit terms which do not depend on λ, we obtain

  CV_n(λ) ≈ const. + (1/n) Σ_i ∥g_λ(θ̃_n)∥² + (2/n) Σ_i ⟨∇g_λ(θ̃_n)·X_in, X_in⟩
          ≈ const. + ∥g_λ(θ̂_n)∥² + 2·trace(∇g_λ(θ̂_n)·Σ̂_n)
          = const. + SURE(λ, θ̂_n, Σ̂_n)
          ≈ const. + SURE(λ, θ̂_n, Σ),

where Σ̂_n is the sample second moment of X_n. In the second line, const. subsumes any terms that do not depend on λ, while the approximation omits terms that depend on λ but are of order 1/√n. This approximation to CV_n has the form of Stein's Unbiased Risk Estimate: the first term in this approximation to CV_n(λ) corresponds to the average in-sample error, while the second term has the form of a covariance penalty (Efron, 2004).
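The approximation CV_n(λ) ≈ const. + SURE(λ, θ̂_n, Σ) can be checked numerically. In the sketch below (our own illustration: Ridge with A = I and a Gaussian design with Σ = I; constant offsets and loss-scaling conventions drop out of the argmin, which is what matters for tuning), both criteria are evaluated on a common grid and their minimizers compared:

    import numpy as np

    rng = np.random.default_rng(4)
    n, d = 400, 4
    theta0 = np.array([1.5, 0.5, -1.0, 0.0])
    W = rng.standard_normal((n, d))
    Y = W @ (theta0 / np.sqrt(n)) + rng.standard_normal(n)

    def ridge(Wtr, Ytr, lam):
        return np.linalg.solve(Wtr.T @ Wtr + lam * n * np.eye(d), Wtr.T @ Ytr)

    theta_hat = np.sqrt(n) * ridge(W, Y, 0.0)    # unpenalized ERM, local coordinates
    grid = np.geomspace(0.05, 20, 25)
    cv = [sum((Y[i] - W[i] @ ridge(np.delete(W, i, 0), np.delete(Y, i, 0), lam)) ** 2
              for i in range(n)) for lam in grid]
    sure = [d + (lam / (1 + lam)) ** 2 * theta_hat @ theta_hat
            - 2 * d * lam / (1 + lam) for lam in grid]
    # the two criteria differ by a lambda-independent constant plus noise,
    # so their minimizers should roughly agree
    print(grid[int(np.argmin(cv))], grid[int(np.argmin(sure))])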
Convergence of tuning parameter and tuned estimators  We will need to show that this approximation is uniformly valid in λ. We will furthermore need to show that uniform proximity of CV_n to SURE is enough to guarantee proximity of the corresponding optimized tuning parameters,

  argmin_λ CV_n(λ) ≈ argmin_λ SURE(λ, θ̂_n, Σ̂_n).

This latter step is non-trivial, because the criterion function SURE(λ, θ̂_n, Σ̂_n) typically has multiple local minima. For certain values of θ̂_n this function furthermore has multiple global minima. When SURE has multiple global (near-)minima, uniform closeness of CV to SURE is not enough to ensure closeness of the minimizer of CV to the minimizer of SURE.

Figure 2: Examples of multi-modality of SURE. [Left panel: SURE against λ for a Ridge (L2) penalty; right panel: SURE against λ for a Lasso (L1) penalty. Both examples are reproduced from Wilson et al. (2020).]

The plots in Figure 2, which are reproduced from Wilson et al. (2020),⁷ illustrate two numerical examples (realizations of θ̂_n and of Σ̂_n) for which SURE indeed has multiple global minima. Using separate arguments for Ridge (Appendix C) and Lasso (Appendix D), we will prove, however, that the global minimum of SURE with respect to λ is unique and well separated almost everywhere, in a suitable sense. Put differently, cases such as those represented in Figure 2 are non-generic, such that they do not lead to a breakdown of convergence for the optimized tuning parameter. The arguments proving that multiple global minima only occur on a set of measure 0 might be the most non-standard part of our proof.

⁷ The numerical values corresponding to these examples are as follows: (a) SURE for Ridge: θ̂ = (1.3893, 1.5), L(θ) = (θ − θ̂)′ diag(1, 40) (θ − θ̂), π(θ) = ∥θ∥²; (b) SURE for Lasso: θ̂ = (1/√n)(√(1/8), √(9/8), 2), π(θ) = Σ_j |θ_j|.

Well-separation ensures that the argmin functional is continuous at almost every realization of θ̂. From these results we thus conclude that the mapping from θ̂_n and CV_n to the tuned estimate θ̂*_n is almost everywhere continuous. This allows us to invoke the continuous mapping and dominated convergence theorems, and to conclude the proof of Theorem 1.
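The Lasso example of footnote 7 can be reproduced in a few lines. For the Lasso penalty with A = I and Σ = I, the SURE formula of Definition 3 reduces to d + Σ_j min(θ̂_j², λ²) − 2·#{j : |θ̂_j| ≤ λ}. The Python sketch below is our own check (we drop the 1/√n scaling of the footnote's θ̂ for readability); scanning a grid reveals two well-separated near-global minimizers, qualitatively matching the right panel of Figure 2.

    import numpy as np

    theta_hat = np.array([np.sqrt(1 / 8), np.sqrt(9 / 8), 2.0])  # footnote 7(b), scaling dropped
    d = len(theta_hat)

    def sure_lasso(lam):
        # trace(Sigma) + ||g||^2 + 2 trace(grad g . Sigma), with Sigma = I and soft-thresholding
        return d + np.sum(np.minimum(theta_hat**2, lam**2)) - 2 * np.sum(np.abs(theta_hat) <= lam)

    grid = np.linspace(0.0, 2.5, 2501)
    values = np.array([sure_lasso(lam) for lam in grid])
    # prints two clusters of near-minimizers, one just above |theta_hat|_(1) ~ 0.35
    # and one just above |theta_hat|_(2) ~ 1.06
    print(grid[values <= values.min() + 1e-3])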
3.2 Intermediate lemmas

Let us now turn to a more formal exposition of our argument. We will prove Theorem 1 in a series of lemmas. The lemmas are stated in this section, their proofs in the appendices. All results impose the assumptions stated in Section 2.

Lemma 1 (Lipschitz g_λ). For any λ ≥ 0, if π(·) is convex then g_λ(θ) = argmin_g [½∥g∥² + λ·π(θ + g)] is Lipschitz with Lipschitz constant 1.

Lemma 2 (Influence function approximation).

  L_n(θ) − L_n(θ0) = ½∥θ − θ̃_n∥² + ϵ_n(θ),    (1)

where sup_{θ:∥θ∥≤C} |ϵ_n(θ)| = o_{µn}(1) and sup_{θ:∥θ∥≤C} ∥∇ϵ_n(θ)∥ = o_{µn}(1) for any C < ∞, and

  θ̃_n = θ0 + (1/√n) Σ_i X_in,  X_in = −∇_β l(θ0/√n, Z_in).

The ERM and penalized ERM estimators θ̂_n = argmin_θ L_n(θ) and θ̂^λ_n = argmin_θ [L_n(θ) + λ·π(θ)] satisfy

  θ̂_n = θ̃_n + o_{µn}(1),  sup_λ ∥θ̂^λ_n − θ̃_n − g_λ(θ̃_n)∥ = o_{µn}(1).

Lemma 3 (Limiting squared error loss). The limiting expected loss is well defined and given by

  L̄(θ, θ0) = ½∥θ − θ0∥².

Convergence of L̄_n(θ, θ0) to this limit is uniform in any bounded neighborhood of θ0:

  sup_{θ:∥θ−θ0∥≤C} |L̄_n(θ, θ0) − L̄(θ, θ0)| → 0  for all C < ∞.

Corollary 2 (Asymptotic distribution for fixed tuning parameter). The ERM and penalized ERM estimators satisfy

  θ̂_n →_d θ̂ ∼ N(θ0, Σ),  θ̂^λ_n →_d θ̂ + g_λ(θ̂).

Lemma 4 (Convergence of CV to SURE). The n-fold cross-validation criterion satisfies

  sup_{λ∈Λ} |CV_n(λ) − SURE(λ, θ̂_n, Σ)| →_{µn} 0.

Lemma 5 (Joint convergence of tuning parameter and tuned estimators).

  (λ*_n, θ̂_n) →_d (λ*, θ̂)  and  θ̂*_n →_d θ̂*.

From Lemma 5, we then show Theorem 1. The appendices prove each of these lemmas in turn.

References

Arlot, S. and Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79.

Donoho, D. L. and Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association, 90(432):1200–1224.

Efron, B. (2004). The estimation of prediction error: covariance penalties and cross-validation. Journal of the American Statistical Association, 99(467):619–632.

Golub, G. H., Heath, M., and Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223.

Hansen, B. E. (2016). Efficient shrinkage in parametric models. Journal of Econometrics, 190(1):115–132.

Hirano, K. and Porter, J. R. (2020). Asymptotic analysis of statistical decision rules in econometrics. In Handbook of Econometrics, pages 283–354. Elsevier BV.

Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.

James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 361–379.

Le Cam, L. (1972). Limits of experiments. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 245–261. University of California Press.

Li, K.-C. (1987). Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: Discrete index set. The Annals of Statistics, 15(3):958–975.

Mairal, J. and Yu, B. (2012). Complexity analysis of the lasso regularization path. arXiv preprint arXiv:1205.0079.

Mallows, C. L. (1973). Some comments on Cp. Technometrics, 15(4):661–675.

Rudin, W. (1991). Principles of mathematical analysis. McGraw-Hill.

Shao, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association, 88(422):486–494.

Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 9(6):1135–1151.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological), 36(2):111–133.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), pages 267–288.

van der Vaart, A. W. (2000). Asymptotic statistics. Cambridge University Press.

Wilson, A., Kasy, M., and Mackey, L. (2020). Approximate cross-validation: Guarantees for model assessment and selection. Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), volume 108.

Zou, H., Hastie, T., and Tibshirani, R. (2007). On the "degrees of freedom" of the lasso. The Annals of Statistics, 35(5):2173–2192.
A Proofs: Influence function approximations

A.1 Proof of Lemma 1 (Lipschitz g_λ)

Fix 0 ≤ λ < ∞. Recall that

  g_λ(θ) = argmin_g [½∥g∥² + λ·π(θ + g)].

If π(·) is convex, then so is the objective function on the right. This implies that there exists a sub-gradient ∇π of π such that the first order condition

  g + λ·∇π(θ + g) = 0

holds for g = g_λ(θ). Consider two values θ1, θ2 of θ, and the corresponding solutions g1, g2 and sub-gradients ∇π1, ∇π2, as well as the differences Δθ = θ2 − θ1 and Δg = g2 − g1. Taking the difference of the first order condition across the two values yields

  Δg + λ·[∇π2 − ∇π1] = 0.

Convexity of π implies

  ⟨∇π2 − ∇π1, Δθ + Δg⟩ ≥ 0.

Combining the last two equations yields

  ⟨Δg, Δθ + Δg⟩ = −λ·⟨∇π2 − ∇π1, Δθ + Δg⟩ ≤ 0,

and thus (using Cauchy-Schwarz to get the second inequality)

  ∥Δg∥² ≤ ⟨Δg, −Δθ⟩ ≤ ∥Δg∥·∥Δθ∥,

so that ∥Δg∥ ≤ ∥Δθ∥. This proves that g_λ(θ) is Lipschitz with Lipschitz constant 1.
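Lemma 1 is easy to sanity-check numerically. The following sketch is our own check, using the Lasso case with A = I (where g_λ is soft-thresholding minus the identity); it verifies that the Lipschitz ratio never exceeds 1 over random pairs of points:

    import numpy as np

    rng = np.random.default_rng(5)
    lam = 0.7

    def g_lasso(theta, lam):
        # g_lambda(theta) = argmin_g 0.5||g||^2 + lam ||theta + g||_1
        return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0) - theta

    ratios = []
    for _ in range(10000):
        t1, t2 = rng.standard_normal(4), rng.standard_normal(4)
        ratios.append(np.linalg.norm(g_lasso(t1, lam) - g_lasso(t2, lam))
                      / np.linalg.norm(t1 - t2))
    print(max(ratios))   # bounded by 1, as Lemma 1 asserts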
A.2 Proof of Lemma 2 (Influence function approximation)

Assume first that Equation (1) holds, so that L_n(θ) − L_n(θ0) = ½∥θ − θ̃_n∥² + ϵ_n(θ). We show, under this assumption, that sup_λ ∥θ̂^λ_n − θ̃^λ_n∥² = o_{µn}(1). Leveraging convexity of π and Lipschitz continuity of g_λ, we first bound the corresponding difference in penalized squared error, which allows us to bound the difference in squared error, and finally the difference between the estimators themselves.

Bounding the difference in penalized squared error loss  Define

  θ̃^λ_n = argmin_θ [½∥θ − θ̃_n∥² + λ·π(θ)] = θ̃_n + g_λ(θ̃_n).

By definition, θ̂^λ_n = argmin_θ [L_n(θ) + λ·π(θ)], and thus

  L_n(θ̂^λ_n) + λ·π(θ̂^λ_n) ≤ L_n(θ̃^λ_n) + λ·π(θ̃^λ_n).

Substituting for L_n(·) on both sides of this inequality, using Equation (1) applied to both θ = θ̂^λ_n and θ = θ̃^λ_n, and rearranging yields

  [½∥θ̂^λ_n − θ̃_n∥² + λ·π(θ̂^λ_n)] − [½∥θ̃^λ_n − θ̃_n∥² + λ·π(θ̃^λ_n)] ≤ ϵ_n(θ̃^λ_n) − ϵ_n(θ̂^λ_n).    (2)

Bounding the difference in squared error loss  We next prove the following claim: convexity of π, and the definition θ̃^λ_n = argmin_θ [½∥θ − θ̃_n∥² + λ·π(θ)], imply that, for any θ,

  ½∥θ − θ̃^λ_n∥² ≤ [½∥θ − θ̃_n∥² + λ·π(θ)] − [½∥θ̃^λ_n − θ̃_n∥² + λ·π(θ̃^λ_n)].    (3)

To show (3), denote a(θ) = ½∥θ − θ̃_n∥² and b(θ) = λ·π(θ). We can write

  ½∥θ − θ̃^λ_n∥² = a(θ) − a(θ̃^λ_n) − ∇a(θ̃^λ_n)·(θ − θ̃^λ_n).

By convexity of π and optimality of θ̃^λ_n, there exists a subgradient ∇b of b such that

  ∇a(θ̃^λ_n) + ∇b(θ̃^λ_n) = 0.

Eliminating the common terms a(θ) − a(θ̃^λ_n) on the left and right hand side, we can now rewrite (3) as

  −∇a(θ̃^λ_n)·(θ − θ̃^λ_n) ≤ b(θ) − b(θ̃^λ_n).

But since −∇a(θ̃^λ_n) = ∇b(θ̃^λ_n), this inequality holds by convexity of b and the definition of a subgradient, and the claim follows.

Bounding the distance between estimators  Combining the two inequalities (2) and (3) yields

  ½∥θ̂^λ_n − θ̃^λ_n∥² ≤ ϵ_n(θ̃^λ_n) − ϵ_n(θ̂^λ_n).

It follows from Assumption 4, item 2, which states that θ̂^λ_n is bounded in probability, and the Lipschitzness of g_λ (Lemma 1), which implies ∥θ̃^λ_n∥ ≤ 2∥θ̃_n∥, that both θ̂^λ_n and θ̃^λ_n are bounded in probability. Combined with Equation (1), we consequently obtain that with probability approaching 1 under µ_n, there exists C < ∞ such that

  sup_λ ½∥θ̂^λ_n − θ̃^λ_n∥² ≤ 2 sup_{∥θ∥≤C} |ϵ_n(θ)| = o(1).

The statement for the ERM estimator follows as a special case, where λ = 0.

Proving Equation (1), using empirical process theory  It remains to show that (1) holds, where sup_{∥θ∥≤C} |ϵ_n(θ)| = o_{µn}(1). This claim follows from a straightforward generalization of the proof of Lemma 19.31 in van der Vaart (2000) to the case of drifting distributions. Applicability of the arguments of Lemma 19.31 in van der Vaart (2000) is guaranteed by the conditions in Assumption 4, item 1. In particular, pointwise convergence, for fixed θ, follows from almost sure differentiability of l, by dominated convergence, given the uniform bound on the variance of m(Z_in) in Assumption 4. To get uniform convergence across values of θ in any ball of radius δ around 0, tightness needs to be shown. Tightness follows from a bound on the bracketing number of the class of functions {√n(l_n(θ, ·) − l_n(0, ·)) : ∥θ∥ ≤ δ}. The bound in the proof of Lemma 19.31 in van der Vaart (2000) applies verbatim, with a constant C that does not depend on n, based on the uniform bound on the variance of m(Z_in) in Assumption 4. The claim follows.

The claim that sup_{θ:∥θ∥≤C} ∥∇ϵ_n(θ)∥ = o_{µn}(1) follows from the same argument, applied to ∇_θ l_n(θ, Z_in), using the condition on scores in item 2 of Assumption 5.
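To illustrate what Lemma 2 asserts, the following sketch is our own example, using logistic loss with standard normal features; the explicit inverse Hessian in the approximation stands in for the coordinate normalization of Assumption 2, which the raw logistic loss does not satisfy. It compares the ERM estimator in local coordinates with its influence-function approximation θ̃_n:

    import numpy as np

    rng = np.random.default_rng(6)
    n, d = 20000, 3
    theta0 = np.array([1.0, -0.5, 0.5])
    W = rng.standard_normal((n, d))
    p_true = 1 / (1 + np.exp(-W @ (theta0 / np.sqrt(n))))
    Y = (rng.random(n) < p_true).astype(float)

    beta = np.zeros(d)                         # logistic ERM via Newton's method
    for _ in range(25):
        p = 1 / (1 + np.exp(-W @ beta))
        grad = W.T @ (p - Y)                   # gradient of the negative log-likelihood
        hess = (W * (p * (1 - p))[:, None]).T @ W
        beta -= np.linalg.solve(hess, grad)
    theta_hat = np.sqrt(n) * beta

    # influence-function approximation, with H^{-1} in place of Assumption 2's normalization
    H0 = (W * (p_true * (1 - p_true))[:, None]).T @ W / n
    score = W.T @ (Y - p_true)
    theta_tilde = theta0 + np.linalg.solve(H0, score) / np.sqrt(n)
    print(np.linalg.norm(theta_hat - theta_tilde))   # small relative to ||theta_hat - theta0||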
A.3 Proof of Lemma 3 (Limiting squared error loss)

By Definition 1, L̄_n(θ, θ0) = E[L_n(θ) − L_n(θ0)], and L̄_n(θ, θ0) is minimized at θ = θ0. By a second order Taylor expansion around θ = θ0,

  L̄_n(θ, θ0) = ½(θ − θ0)·∇²_θ L̄_n(θ̃, θ0)·(θ − θ0)

for some θ̃ between θ and θ0. By Assumption 2, ∇²_θ L̄(θ, θ0)|_{θ=θ0} = I. By definition,

  ∇²_θ L̄_n(θ, θ0) = ∇²_β E[l(β, Z_in)]|_{β=θ/√n}.

The claim of the lemma then follows from continuity of the Hessian ∇²_β E[l(β, Z_in)] at β = 0. Continuity of the Hessian follows from item 3 of Assumption 5:

  ∥∇²_β E[l(β, Z_in)]|_{β=θ/√n} − ∇²_β E[l(β, Z_in)]|_{β=0}∥
    ≤ E[∥∇²_β l(β, Z_in)|_{β=θ/√n} − ∇²_β l(β, Z_in)|_{β=0}∥] ≤ E[C_n(Z_in)]·∥θ∥/√n,

where sup_n E[C_n(Z_in)] < ∞; this follows from sup_n E[C_n(Z_in)²] < ∞ (Assumption 5.3) via Jensen's inequality.

A.4 Proof of Corollary 2 (Asymptotic distribution for fixed tuning parameter)

Recall that Var(X_1n) = Σ is constant in n, by assumption. Note furthermore that Assumption 4 (item 1) implies the Lindeberg condition

  E[∥X_1n∥²·1(∥X_1n∥ > √n·M)] → 0

for all M > 0, since ∥X_1n∥ ≤ m(Z_1n): we have X_in = −∇_β l(θ0/√n, Z_in), and the Lipschitz condition in Assumption 4.1, together with a.e. differentiability, implies ∥∇_β l(β, Z_in)∥ ≤ m(Z_in) at all points of differentiability, and the variance of the latter is uniformly bounded. The Lindeberg-Feller central limit theorem (Proposition 2.27 in van der Vaart 2000), applied to the triangular array (X_in), therefore implies θ̃_n →_d N(θ0, Σ). The claims of Corollary 2 then follow from Lemma 2 and the continuous mapping theorem, where continuity of g_λ follows from convexity of π, by Lemma 1.

B Proof of Lemma 4

Step 0 (Preliminary observations): We start by stating some useful results for the proof. First, note that by Lemma 1 in Wilson et al. (2020) and Assumption 5(i), it follows that

  sup_λ ∥θ̂^{λ,−i}_n − θ̂^λ_n∥ = O_{µn}((1/µ)∥∇_θ l_n(θ̂_n, Z_in)∥) = O_{µn}(n^{−1/2}).    (4)

An analogous argument implies

  sup_λ ∥θ̃^{λ,−i}_n − θ̃^λ_n∥ = O_{µn}(n^{−1/2}),    (5)

where

  θ̃^{λ,−i}_n := argmin_θ [½∥θ − θ̃^{−i}_n∥² + λ·π(θ)].    (6)

Second, recall from Lemma 2 that

  sup_λ ∥θ̂^λ_n − θ̃^λ_n∥ = o_{µn}(1).    (7)

The next set of results concerns the properties of g_λ(·). Since g_λ(·) is Lipschitz continuous by Lemma 1, it is differentiable almost everywhere (by Rademacher's theorem). In particular, there exists an R_λ(·; θ) such that

  g_λ(θ + δ) = g_λ(θ) + ∇g_λ(θ)⊺δ + R_λ(δ; θ),    (8)

and

  lim_{∥δ∥→0} R_λ(δ; θ)/∥δ∥ = 0  for each λ and (Lebesgue) almost every θ.    (9)

In fact, under Assumption 3, we can strengthen (9) to:

  lim_{∥δ∥→0} sup_{λ∈Λ} R_λ(δ; θ)/∥δ∥ = 0  for (Lebesgue) almost every θ.    (10)

For Ridge, (10) is immediate, since R_λ(δ; θ) = 0. For Lasso, it follows from Lemma 8, which implies R_λ(δ; θ) = 0 for δ small enough, except on a set of θ values with Lebesgue measure 0. For values of θ where ∇g_λ(θ) does not exist, we somewhat arbitrarily set ∇g_λ(θ) = 0 and define R_λ(δ; θ) = g_λ(θ + δ) − g_λ(θ) for these values; that way (8) always holds. Observe that due to Lemma 1, ∥g_λ(θ + δ) − g_λ(θ)∥ ≤ ∥δ∥ and ∥∇g_λ(θ)∥ ≤ 1 (whenever the gradient exists), so

  sup_{λ,θ} ∥R_λ(δ; θ)∥ ≤ 2∥δ∥.    (11)
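Bounds like (4) can be visualized directly. The sketch below is our own illustration for the Ridge case, where the penalized objective is quadratic, so a single Newton step from the full-sample estimator recovers each leave-one-out estimator exactly; for general smooth losses the step is only an approximation, in the spirit of Wilson et al. (2020). It compares exact LOO fits with their Newton-step counterparts and reports the O(n^{−1/2}) distances appearing in (4):

    import numpy as np

    rng = np.random.default_rng(7)
    n, d, lam = 300, 4, 1.0
    W = rng.standard_normal((n, d))
    Y = W @ (rng.standard_normal(d) / np.sqrt(n)) + rng.standard_normal(n)

    def ridge(Wtr, Ytr):
        return np.linalg.solve(Wtr.T @ Wtr + lam * n * np.eye(d), Wtr.T @ Ytr)

    beta = ridge(W, Y)
    H = W.T @ W + lam * n * np.eye(d)           # full-sample penalized Hessian
    gap_newton, dist_loo = [], []
    for i in range(n):
        exact = ridge(np.delete(W, i, 0), np.delete(Y, i, 0))
        grad_i = -(Y[i] - W[i] @ beta) * W[i]   # gradient of the deleted loss term at beta
        newton = beta + np.linalg.solve(H - np.outer(W[i], W[i]), grad_i)
        gap_newton.append(np.linalg.norm(exact - newton))
        dist_loo.append(np.sqrt(n) * np.linalg.norm(exact - beta))  # local coordinates
    print(max(gap_newton))    # ~ machine precision: the Newton step is exact here
    print(np.mean(dist_loo))  # small, of order n^{-1/2}, matching (4)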
Step 1: We first show that

  CV_n(λ) = Σ_{i=1}^n l_n(θ̂^λ_n, Z_in) + (1/√n) Σ_{i=1}^n ⟨θ̃^{λ,−i}_n − θ̃^λ_n, √n∇_θ l_n(θ̃^λ_n, Z_in)⟩ + o_{µn}(1),    (12)

uniformly over λ. By a first order Taylor expansion,

  CV_n(λ) = Σ_{i=1}^n l_n(θ̂^{λ,−i}_n, Z_in)
          = Σ_{i=1}^n l_n(θ̂^λ_n, Z_in) + (1/√n) Σ_{i=1}^n ⟨θ̂^{λ,−i}_n − θ̂^λ_n, √n∇_θ l_n(θ̂^λ_n, Z_in)⟩ + (1/√n) Σ_{i=1}^n r_n(θ̂^{λ,−i}_n, θ̂^λ_n, Z_in),

where

  |r_n(θ̂^{λ,−i}_n, θ̂^λ_n, Z_in)| ≤ B_n(Z_in)·∥θ̂^{λ,−i}_n − θ̂^λ_n∥²

by Assumption 5. Hence, by the Cauchy-Schwarz inequality,

  (1/√n) Σ_{i=1}^n |r_n(θ̂^{λ,−i}_n, θ̂^λ_n, Z_in)| ≤ ((1/n) Σ_{i=1}^n B_n(Z_in)²)^{1/2} · (Σ_{i=1}^n ∥θ̂^{λ,−i}_n − θ̂^λ_n∥⁴)^{1/2}.

The first term on the right hand side of the above expression is O_{µn}(1) by Assumption 5, while the second term is O_{µn}(n^{−1/2}) by (4) and the two requirements of Assumption 5, since

  sup_λ Σ_{i=1}^n ∥θ̂^{λ,−i}_n − θ̂^λ_n∥⁴ ≤ (1/(n²µ⁴)) Σ_{i=1}^n ∥√n∇_θ l_n(θ̂_n, Z_in)∥⁴
    ≤ (8/(n²µ⁴)) Σ_{i=1}^n ∥√n∇_θ l_n(θ̂_n, Z_in) − √n∇_θ l_n(θ0, Z_in)∥⁴ + (8/(n²µ⁴)) Σ_{i=1}^n ∥√n∇_θ l_n(θ0, Z_in)∥⁴ = O_{µn}(n^{−1}),    (13)

so the expression overall is O_{µn}(n^{−1/2}).

We now show that, uniformly over λ, one can approximate

  (1/√n) Σ_{i=1}^n ⟨θ̂^{λ,−i}_n − θ̂^λ_n, √n∇_θ l_n(θ̂^λ_n, Z_in)⟩    (14)

with

  (1/√n) Σ_{i=1}^n ⟨θ̃^{λ,−i}_n − θ̃^λ_n, √n∇_θ l_n(θ̃^λ_n, Z_in)⟩.

To this end, we first argue that (14) can be approximated with

  (1/√n) Σ_{i=1}^n ⟨θ̂^{λ,−i}_n − θ̂^λ_n, √n∇_θ l_n(θ̃^λ_n, Z_in)⟩.

By the Cauchy-Schwarz inequality, the approximation error is bounded by

  sup_{λ∈Λ} (Σ_{i=1}^n ∥θ̂^{λ,−i}_n − θ̂^λ_n∥²)^{1/2} · sup_{λ∈Λ} ((1/n) Σ_{i=1}^n ∥√n∇_θ l_n(θ̂^λ_n, Z_in) − √n∇_θ l_n(θ̃^λ_n, Z_in)∥²)^{1/2}.    (15)

The first term in (15) is O_{µn}(1) by (4) and Assumption 5 (the argument is analogous to (13)). The second term in (15) is o_{µn}(1) under (7) and Assumption 5(ii). It then remains to show

  sup_{λ∈Λ} |(1/√n) Σ_{i=1}^n ⟨θ̂^{λ,−i}_n − θ̂^λ_n, √n∇_θ l_n(θ̃^λ_n, Z_in)⟩ − (1/√n) Σ_{i=1}^n ⟨θ̃^{λ,−i}_n − θ̃^λ_n, √n∇_θ l_n(θ̃^λ_n, Z_in)⟩| = o_{µn}(1).

By the Cauchy-Schwarz inequality, the expression on the left is bounded by

  sup_{λ∈Λ} (Σ_{i=1}^n ∥(θ̂^{λ,−i}_n − θ̃^{λ,−i}_n) − (θ̂^λ_n − θ̃^λ_n)∥²)^{1/2} · sup_λ ((1/n) Σ_{i=1}^n ∥√n∇_θ l_n(θ̃^λ_n, Z_in)∥²)^{1/2}.

The second term in the above expression is O_{µn}(1) by Assumption 5(ii) (the fourth moment bound on scores) and the Lipschitz condition, since θ̃^λ_n is bounded in probability. At the end of this proof, we analyze the first term, showing that

  sup_{λ∈Λ} Σ_{i=1}^n ∥(θ̂^{λ,−i}_n − θ̃^{λ,−i}_n) − (θ̂^λ_n − θ̃^λ_n)∥² = o_{µn}(1).

Combining the above results proves (12).

Step 2: Next, we show that, uniformly over λ, the term

  (1/√n) Σ_{i=1}^n ⟨θ̃^{λ,−i}_n − θ̃^λ_n, √n∇_θ l_n(θ̃^λ_n, Z_in)⟩    (16)

in (12) can be approximated by

  (1/√n) Σ_{i=1}^n ⟨θ̃^{λ,−i}_n − θ̃^λ_n, −X_in⟩.

Indeed, under Assumption 5(iii), we have

  (1/√n) Σ_{i=1}^n ⟨θ̃^{λ,−i}_n − θ̃^λ_n, √n∇_θ l_n(θ̃^λ_n, Z_in)⟩ − (1/√n) Σ_{i=1}^n ⟨θ̃^{λ,−i}_n − θ̃^λ_n, −X_in⟩
    = (1/n) Σ_{i=1}^n ⟨θ̃^{λ,−i}_n − θ̃^λ_n, n∇²_θ l_n(θ0, Z_in)·(θ̃^λ_n − θ0)⟩ + (1/n) Σ_{i=1}^n ⟨θ̃^{λ,−i}_n − θ̃^λ_n, Δ_n(θ̃^λ_n − θ0, Z_in)⟩,    (17)

where ∥Δ_n(θ̃^λ_n − θ0, Z_in)∥ ≤ C_n(Z_in)·∥θ̃^λ_n − θ0∥², and the function C_n(·) is defined in Assumption 5(iii). By the Cauchy-Schwarz inequality, Assumption 5 and (5),

  sup_{λ∈Λ} |(1/n) Σ_{i=1}^n ⟨θ̃^{λ,−i}_n − θ̃^λ_n, n∇²_θ l_n(θ0, Z_in)·(θ̃^λ_n − θ0)⟩|
    ≤ n^{−1/2} · sup_{λ∈Λ} (Σ_{i=1}^n ∥θ̃^{λ,−i}_n − θ̃^λ_n∥²)^{1/2} · sup_{λ∈Λ} ((1/n) Σ_{i=1}^n ∥n∇²_θ l_n(θ0, Z_in)∥²)^{1/2} · sup_{λ∈Λ} ∥θ̃^λ_n − θ0∥
    = n^{−1/2} · O_{µn}(1) · O_{µn}(1) · O_{µn}(1) = O_{µn}(n^{−1/2}).

This proves that the first term on the right hand side of (17) is o_{µn}(1) uniformly over λ. By an analogous argument, the second term on the right hand side of (17) is also o_{µn}(1) uniformly over λ.

Step 3: It thus remains to show that

  (1/√n) Σ_{i=1}^n ⟨θ̃^{λ,−i}_n − θ̃^λ_n, √n∇_θ l_n(θ0, Z_in)⟩

is asymptotically equivalent to the degrees of freedom term in SURE. By the definition of R_λ(·), we may write

  θ̃^{λ,−i}_n − θ̃^λ_n = −(1/√n)X_in − (1/√n)∇g_λ(θ̃_n)⊺X_in + R_λ(θ̃^{−i}_n − θ̃_n; θ̃_n).

We can thus expand

  (1/√n) Σ_{i=1}^n ⟨θ̃^{λ,−i}_n − θ̃^λ_n, √n∇_θ l_n(θ0, Z_in)⟩
    = (1/n) Σ_{i=1}^n ⟨−X_in, √n∇_θ l_n(θ0, Z_in)⟩ + (1/n) Σ_{i=1}^n ⟨∇g_λ(θ̃_n)⊺X_in, −√n∇_θ l_n(θ0, Z_in)⟩
      + (1/√n) Σ_{i=1}^n ⟨R_λ(θ̃^{−i}_n − θ̃_n; θ̃_n), √n∇_θ l_n(θ0, Z_in)⟩.    (18)

The first term in (18) is independent of λ and can therefore be neglected. The second term in (18) is asymptotically equivalent to the degrees of freedom term in SURE. Indeed, since θ̃^{−i}_n − θ̃_n = ∇_θ l_n(θ0, Z_in), we can write

  (1/n) Σ_{i=1}^n ⟨∇g_λ(θ̃_n)⊺X_in, −√n∇_θ l_n(θ0, Z_in)⟩ = (1/n) Σ_{i=1}^n ⟨∇g_λ(θ̃_n)⊺X_in, X_in⟩ = Tr[∇g_λ(θ̃_n)⊺Σ̂_n],

where

  Σ̂_n := (1/n) Σ_{i=1}^n (X_in)(X_in)⊺.
But by the law of large numbers, which can be applied here due to Assumption 5(ii), Σ̂_n = Σ + o_{µn}(1). We thus conclude that

  (1/n) Σ_{i=1}^n ⟨∇g_λ(θ̃_n)⊺X_in, −√n∇_θ l_n(θ0, Z_in)⟩ = Tr[∇g_λ(θ̃_n)⊺Σ] + o_{µn}(1),

uniformly over λ ∈ Λ. It remains to show that the third term in (18) is negligible, i.e.,

  (1/√n) Σ_{i=1}^n ⟨R_λ(θ̃^{−i}_n − θ̃_n; θ̃_n), √n∇_θ l_n(θ0, Z_in)⟩ = o_{µn}(1).

Recall that θ̃^{−i}_n − θ̃_n = −X_in/√n. Fix some a ∈ (0, 1/2) and C < ∞, and expand

  (1/√n) Σ_{i=1}^n ⟨R_λ(θ̃^{−i}_n − θ̃_n; θ̃_n), √n∇_θ l_n(θ0, Z_in)⟩ = (1/√n) Σ_{i=1}^n ⟨R_λ(−X_in/√n; θ̃_n), −X_in⟩
    = (1/√n) Σ_{i=1}^n ⟨R_λ(−X_in/√n; θ̃_n)·I{∥X_in∥ ≥ Cn^a}, −X_in⟩ + (1/√n) Σ_{i=1}^n ⟨R_λ(−X_in/√n; θ̃_n)·I{∥X_in∥ < Cn^a}, −X_in⟩
    := A^λ_{n1} + A^λ_{n2}.

We analyze the terms A^λ_{n1} and A^λ_{n2} separately. For the term A^λ_{n1}, observe that by (11),

  sup_{λ∈Λ} |A^λ_{n1}| ≤ (2/n) Σ_{i=1}^n ∥X_in∥²·I{∥X_in∥ ≥ Cn^a}.

Consequently, under Assumption 5(ii) and the given choice of a,

  E_{µn}|A^λ_{n1}| ≤ (2/(C²n^{2a}))·E_{µn}[∥X_in∥⁴] → 0  as n → ∞.

Thus, sup_{λ∈Λ} A^λ_{n1} = o_{µn}(1).

Next, we show sup_{λ∈Λ} A^λ_{n2} = o_{µn}(1). Due to exchangeability over i, this follows if we show that

  lim_{n→∞} √n·E_{µn}[⟨R_λ(−X_in/√n; θ̃_n)·I_{Γi}, −X_in⟩] = 0,    (19)

where we use I_{Γi} as a short-hand for I{∥X_in∥ < Cn^a}. Now, by the Cauchy-Schwarz inequality,

  √n·E_{µn}[⟨R_λ(−X_in/√n; θ̃_n)·I_{Γi}, −X_in⟩] ≤ E^{1/2}_{µn}[sup_{λ∈Λ} (∥R_λ(−X_in/√n; θ̃_n)∥²/∥−X_in/√n∥²)·I_{Γi}] · E^{1/2}_{µn}[∥X_in∥⁴].

But E^{1/2}_{µn}[∥X_in∥⁴] ≤ M < ∞ under Assumption 5(ii), so (19) would follow if we show that

  lim_{n→∞} E_{µn}[sup_{λ∈Λ} (∥R_λ(−X_in/√n; θ̃_n)∥²/∥−X_in/√n∥²)·I_{Γi}] = 0.    (20)

To prove (20), observe that it is without loss of generality to suppose R_λ(δ; θ) = R_λ(∥δ∥; θ), i.e., that it depends only on ∥δ∥, and that R_λ(∥δ∥; θ)/∥δ∥ is increasing in ∥δ∥. Otherwise, we can simply define

  R̄_λ(∥δ∥; θ) = sup_{∥δ′∥≤∥δ∥} ∥R_λ(δ′; θ)∥/∥δ′∥,

and this would satisfy these two conditions while still retaining the property (9). Consequently, the left hand side of (20) can be bounded as

  E_{µn}[sup_{λ∈Λ} (∥R_λ(∥X_in/√n∥; θ̃_n)∥²/∥X_in/√n∥²)·I_{Γi}] ≤ E_{µn}[sup_{λ∈Λ} (∥R_λ(Cn^{a−1/2}; θ̃_n)∥/(Cn^{a−1/2}))²] = E_{µn}[sup_{λ∈Λ} B_λ(Cn^{a−1/2}, θ̃_n)²],

where B_λ(δ, θ̃_n) := ∥R_λ(δ; θ̃_n)∥/∥δ∥. We now bound E_{µn}[sup_{λ∈Λ} B_λ(Cn^{a−1/2}, θ̃_n)²]. Fix some ϵ > 0. By the requirement that R_λ(∥δ∥; θ)/∥δ∥ is increasing in ∥δ∥, along with the fact that a < 1/2, there exists n̄ large enough so that B_λ(Cn^{a−1/2}, θ) ≤ B_λ(ϵ, θ) for each n ≥ n̄, θ ∈ R^d and λ ∈ Λ. Now, it is straightforward to show that

  θ̃_n →_d Z ∼ N(θ0, Σ).

We then have

  lim_{n→∞} E_{µn}[sup_{λ∈Λ} B_λ(Cn^{a−1/2}, θ̃_n)²] ≤ lim_{n→∞} E_{µn}[sup_{λ∈Λ} B_λ(ϵ, θ̃_n)²] = E[sup_{λ∈Λ} B_λ(ϵ, Z)²],

where the equality follows from the properties of weak convergence, since equation (11) implies sup_{λ∈Λ} B_λ(ϵ, θ) ≤ 2 uniformly over θ. But (10) implies lim_{ϵ′→0} sup_{λ∈Λ} B_λ(ϵ′, θ) = 0 for every θ ∈ R^d excluding a set of Lebesgue measure 0. Since the Gaussian distribution is absolutely continuous with respect to the Lebesgue measure, it then follows by the dominated convergence theorem that

  lim_{ϵ′→0} E[sup_{λ∈Λ} B_λ(ϵ′, Z)²] = 0.
Since ϵ > 0 was arbitrary, we conclude

  lim_{n→∞} E_{µn}[sup_{λ∈Λ} B_λ(Cn^{a−1/2}, θ̃_n)²] = 0.

This proves (20).

It remains to prove the claim, made in Step 1, that

  sup_{λ∈Λ} Σ_{i=1}^n ∥(θ̂^{λ,−i}_n − θ̃^{λ,−i}_n) − (θ̂^λ_n − θ̃^λ_n)∥² = o_{µn}(1).

For the remainder of this proof, we make a case distinction between Ridge penalties and Lasso penalties.

Step 4 (Ridge): We first specialize to the case of quadratic penalties. Define

  L̃^{−i}_n(θ) = ½∥θ − θ̃^{−i}_n∥².

Observe that θ̃^{λ,−i}_n = argmin_θ {L̃^{−i}_n(θ) + λπ(θ)}. Consequently,

  ∇_θ{L̃^{−i}_n(θ̂^{λ,−i}_n) + λπ(θ̂^{λ,−i}_n)} = ∇_θ{L̃^{−i}_n(θ̂^{λ,−i}_n) + λπ(θ̂^{λ,−i}_n)} − ∇_θ{L̃^{−i}_n(θ̃^{λ,−i}_n) + λπ(θ̃^{λ,−i}_n)}
    = {θ̂^{λ,−i}_n − θ̃^{λ,−i}_n} + λ∇_θ{π(θ̂^{λ,−i}_n) − π(θ̃^{λ,−i}_n)}.

But ∇_θ{L^{−i}_n(θ̂^{λ,−i}_n) + λπ(θ̂^{λ,−i}_n)} = 0, so we obtain

  {θ̂^{λ,−i}_n − θ̃^{λ,−i}_n} + λ∇_θ{π(θ̂^{λ,−i}_n) − π(θ̃^{λ,−i}_n)} = ∇_θ L̃^{−i}_n(θ̂^{λ,−i}_n) − ∇_θ L^{−i}_n(θ̂^{λ,−i}_n).

In a similar vein,

  {θ̂^λ_n − θ̃^λ_n} + λ∇_θ{π(θ̂^λ_n) − π(θ̃^λ_n)} = ∇_θ L̃_n(θ̂^λ_n) − ∇_θ L_n(θ̂^λ_n).

When π(θ) is the ridge penalty ½θ⊺A⁻¹θ, we have

  (θ̂^{λ,−i}_n − θ̃^{λ,−i}_n) − (θ̂^λ_n − θ̃^λ_n) = (I + λA⁻¹)⁻¹{∇_θ(L̃^{−i}_n − L^{−i}_n)(θ̂^{λ,−i}_n) − ∇_θ(L̃_n − L_n)(θ̂^λ_n)}.

By a third order Taylor expansion,

  L^{−i}_n(θ) − L^{−i}_n(θ0) = (1/√n)(√n∇_θ L^{−i}_n(θ0))⊺(θ − θ0) + (1/(2n))(θ − θ0)⊺(n∇²_θ L^{−i}_n(θ0))(θ − θ0) + (1/(6n^{3/2}))D³_θ L^{−i}_n(θ̄)[θ − θ0, θ − θ0, θ − θ0],

for some θ̄ between θ and θ0. At the same time,

  L̃^{−i}_n(θ) − L̃^{−i}_n(θ0) = (1/√n)(√n∇_θ L^{−i}_n(θ0))⊺(θ − θ0) + ½(θ − θ0)⊺(θ − θ0).

Therefore, by Assumption 5(iv), which implies E_{µn}[sup_{θ∈Θ} ∥n²D⁴_θ l_n(θ, Z_in)∥⁴] ≤ M, we obtain

  ∇_θ(L̃^{−i}_n − L^{−i}_n)(θ) = (∇²_θ L^{−i}_n(θ0) − I)(θ − θ0) + O_{µn}(n^{−3/2}).

In a similar vein,

  ∇_θ(L̃_n − L_n)(θ) = (∇²_θ L_n(θ0) − I)(θ − θ0) + O_{µn}(n^{−3/2}).

Taken together, we conclude

  {∇_θ(L̃^{−i}_n − L^{−i}_n)(θ̂^{λ,−i}_n) − ∇_θ(L̃_n − L_n)(θ̂^λ_n)}
    = (∇²_θ L^{−i}_n(θ0) − I)(θ̂^{λ,−i}_n − θ0) − (∇²_θ L_n(θ0) − I)(θ̂^λ_n − θ0) + O_{µn}(n^{−3/2})
    = −(1/n)·n∇²_θ l_n(θ0, Z_in)·(θ̂^{λ,−i}_n − θ0) + (∇²_θ L_n(θ0) − I)(θ̂^{λ,−i}_n − θ̂^λ_n) + O_{µn}(n^{−3/2}).

Hence,

  Σ_{i=1}^n ∥(θ̂^{λ,−i}_n − θ̃^{λ,−i}_n) − (θ̂^λ_n − θ̃^λ_n)∥²
    ≤ (2/n²) Σ_i ∥n∇²_θ l_n(θ0, Z_in)∥²·∥θ̂^{λ,−i}_n − θ0∥² + 2∥∇²_θ L_n(θ0) − I∥²·Σ_i ∥θ̂^{λ,−i}_n − θ̂^λ_n∥² + O_{µn}(n^{−3/2}) = o_{µn}(1).

Step 4 (Lasso): We finally prove that for the Lasso penalty, we again have that

  DD_n := Σ_{i=1}^n ∥(θ̂^{λ,−i}_n − θ̃^{λ,−i}_n) − (θ̂^λ_n − θ̃^λ_n)∥² = o_{µn}(1).

We consider the case of fixed λ; taking the supremum over λ is trivial when Λ is finite. We will also assume for notational simplicity that A = I, so that h = θ; the general case (where h = A⁻¹·θ) follows immediately.

Let now

  D_in := (θ̂^{λ,−i}_n − θ̃^{λ,−i}_n) − (θ̂^λ_n − θ̃^λ_n).

Denote η = sign(θ̃^λ_n), and the set of active coordinates as J = {j : η_j ≠ 0}.
Define the following three event indicators, where the sign function is applied component-wise, and ρ_n > 0 is a deterministic sequence such that ρ_n·n^{1/4} → ∞ and ρ_n → 0:

  A_n = 1{sign(θ̂^{λ,−i}_n) = sign(θ̃^{λ,−i}_n) = sign(θ̂^λ_n) = η for all i},
  B_n = (1 − A_n)·1{sign(θ̂^λ_n) ≠ η, or sign(θ̃_n + t + g_λ(θ̃_n + t)) ≠ η for some ∥t∥ < ρ_n},
  C_n = 1 − A_n − B_n.

Then, by construction, A_n + B_n + C_n = 1 for all n, i, so that

  DD_n = A_n·Σ_{i=1}^n ∥D_in∥² + B_n·Σ_{i=1}^n ∥D_in∥² + C_n·Σ_{i=1}^n ∥D_in∥²,

where we refer to the three terms on the right as the first, second, and third sum. To show that DD_n → 0, we will consider each of these three sums separately.

First sum  Conditional on A_n = 1, the signs for all estimators of θ under consideration coincide with η. The first order conditions for the active coordinates for each of the estimators can therefore be written as

  ∇_J L_n(θ̂^λ_n) + λ·η_J = 0,
  ∇_J L_n(θ̂^{λ,−i}_n) − ∇_J l_n(θ̂^{λ,−i}_n, Z_in) + λ·η_J = 0,
  (θ̃^λ_n − θ̃_n)_J + λ·η_J = 0,
  (θ̃^{λ,−i}_n − θ̃_n)_J − ∇_J l_n(θ0, Z_in) + λ·η_J = 0.

Taking differences of the first two and of the second two equations, we get

  ∇_J L_n(θ̂^λ_n) − ∇_J L_n(θ̂^{λ,−i}_n) = ∇_J l_n(θ̂^{λ,−i}_n, Z_in),
  (θ̃^λ_n − θ̃^{λ,−i}_n)_J = ∇_J l_n(θ0, Z_in).

The first of these equations can be rewritten as

  (θ̂^λ_n − θ̂^{λ,−i}_n)_J = (∇²_J L_n(θ̄_in))⁻¹·∇_J l_n(θ̂^{λ,−i}_n, Z_in)

for some intermediate point θ̄_in. Combining, we get D_{in,J^c} = 0 (for the inactive coordinates, the double difference vanishes) and

  D_{in,J} = (∇²_J L_n(θ̄_in))⁻¹·∇_J l_n(θ̂^{λ,−i}_n, Z_in) − ∇_J l_n(θ0, Z_in)
           = ((∇²_J L_n(θ̄_in))⁻¹ − I_J)·∇_J l_n(θ̂^{λ,−i}_n, Z_in) + [∇_J l_n(θ̂^{λ,−i}_n, Z_in) − ∇_J l_n(θ0, Z_in)],

and thus, conditional on A_n = 1,

  Σ_{i=1}^n ∥D_in∥² = Σ_{i=1}^n ∥D_{in,J}∥²
    ≤ 2 max_i ∥(∇²_J L_n(θ̄_in))⁻¹ − I_J∥²·Σ_i ∥∇l_n(θ̂^{λ,−i}_n, Z_in)∥² + 2 Σ_i ∥∇l_n(θ̂^{λ,−i}_n, Z_in) − ∇l_n(θ0, Z_in)∥².

Note that we dropped the J subscript on gradients when taking the upper bound. The max term goes to 0 in probability by Lemma 2. The first sum is O_p(1) given the bound on fourth moments of the score in Assumption 5.2. The second sum is o_p(1) by the Lipschitz condition in Assumption 5.2 and by θ̂^{λ,−i}_n − θ0 = O_p(1/√n). It follows that A_n·Σ_{i=1}^n ∥D_in∥² = o_p(1).

Second sum  To control the second sum, we next show that B_n = o_{µn}(1). Recall the following results that were shown previously:

• By Lemma 2, θ̂^λ_n − θ̃^λ_n →_p 0.
• Also by Lemma 2, sup_{θ:∥θ∥≤C} ∥∇ϵ_n(θ)∥ = o_{µn}(1).
• By Corollary 2, θ̂^λ_n →_d θ̂ + g_λ(θ̂), where θ̂ ∼ N(θ0, Σ).
• By Lemma 8 below, for all δ > 0 there exists a γ > 0 such that g_λ(θ) is linear on S_γ(θ̂) = {θ : ∥θ − θ̂∥ < γ} with probability greater than 1 − δ, where θ̂ ∼ N(θ0, Σ).

We claim that the combination of these results implies that B_n →_p 0 as long as ρ_n → 0. To see this, define

  Θ_γ = {θ : g_λ is linear (affine) on S_γ(θ)},  for γ > 0.

By Lemma 8, P(θ̂ ∈ Θ_γ) > 1 − δ for γ small enough. By Corollary 2, and the definition of convergence in distribution, we therefore get P(θ̂_n ∈ Θ_γ) > 1 − δ for n large enough. By Lemma 2, ∥θ̂^λ_n − θ̃^λ_n∥ < γ with probability greater than 1 − δ for n large enough.
This implies that $\operatorname{sign}(\hat\theta_n^{\lambda})_J = \eta_J$, because $|\tilde\theta_{n,j}^{\lambda}|$ is bounded away from 0 for $j\in J$, by the definition of $\Theta_\gamma$. This takes care of the active coordinates.

Let now $j\in J^c$ be one of the inactive coordinates. If $\tilde\theta_n\in\Theta_\gamma$, then necessarily $|\tilde\theta_{n,j}|+\gamma < \lambda$, by the definition of $\Theta_\gamma$, since the mapping from $\tilde\theta_{n,j}$ to $\tilde\theta_{n,j}^{\lambda}$ has a kink at $\pm\lambda$. Let $\theta,\theta'$ be equal in all coordinates except $j$, where $\theta_j = t$ and $\theta'_j = 0$. Then, by Lemma 2, by $|\tilde\theta_{n,j}|+\gamma < \lambda$, and by the Lipschitz continuity of $\epsilon_n(\theta)$ with constant $\gamma_n < \gamma$ for $n$ large enough (which again follows from Lemma 2),
$$\big(L_n(\theta)+\lambda\|\theta\|_1\big)-\big(L_n(\theta')+\lambda\|\theta'\|_1\big)
= \tfrac12(t-\tilde\theta_{n,j})^2+\epsilon_n(\theta)+\lambda|t|-\tfrac12\tilde\theta_{n,j}^2-\epsilon_n(\theta')
\ge \tfrac12 t^2+|t|\big({-(\lambda-\gamma)}+\lambda-\gamma_n\big)
= \tfrac12 t^2+|t|(\gamma-\gamma_n) \ge 0,$$
with equality only for $t = 0$. It follows that the minimizer of $L_n(\theta)+\lambda\|\theta\|_1$ necessarily has $j$th component equal to zero. This takes care of the inactive coordinates.

Third sum: To control the third sum, we show that $C_n\to_p 0$. Since $\sum_{i=1}^n\|D^i_n\|^2 = O_p(1)$ (which follows from $\|D^i_n\|\le\|\hat\theta_n^{\lambda,-i}-\hat\theta_n^{\lambda}\|+\|\tilde\theta_n^{\lambda,-i}-\tilde\theta_n^{\lambda}\|$, the bounds (4)–(5), and the same argument as for (13)), $C_n\to_p 0$ then implies $C_n\cdot\sum_i\|D^i_n\|^2 = o_p(1)$.

Recall that $C_n = 1$ requires: (a) $\operatorname{sign}(\hat\theta_n^{\lambda}) = \eta$ and $g_\lambda$ is linear on $B_{\rho_n}(\tilde\theta_n)$ (the negation of $B_n$'s condition), but (b) there exists some $i$ such that $\operatorname{sign}(\hat\theta_n^{\lambda,-i})\neq\eta$ or $\operatorname{sign}(\tilde\theta_n^{\lambda,-i})\neq\eta$. We show that, on the event described in (a), neither type of sign disagreement occurs, with probability approaching 1.

Preliminary: rate for the influence-function approximation error. We claim that
$$\sup_\lambda\big\|\hat\theta_n^{\lambda}-\tilde\theta_n^{\lambda}\big\| = O_{\mu_n}(n^{-1/2}) = o_{\mu_n}(\rho_n), \qquad (21)$$
where the second equality uses $\rho_n n^{1/2} = (\rho_n n^{1/4})\cdot n^{1/4}\to\infty$. By Lemma 2, $\hat\theta_n^{\lambda}$ and $\tilde\theta_n^{\lambda}$ are, respectively, the minimizers of $\tfrac12\|\theta-\tilde\theta_n\|^2+\epsilon_n(\theta)+\lambda\pi(\theta)$ and $\tfrac12\|\theta-\tilde\theta_n\|^2+\lambda\pi(\theta)$. By Lemma 1 in Wilson et al. (2020), applied to the perturbation $\epsilon_n$,
$$\sup_\lambda\big\|\hat\theta_n^{\lambda}-\tilde\theta_n^{\lambda}\big\| \le \frac{1}{\mu}\sup_{\|\theta\|\le C}\|\nabla\epsilon_n(\theta)\|. \qquad (22)$$
Since $\nabla L_n(\theta_0) = \theta_0-\tilde\theta_n$ exactly (by the definition $\tilde\theta_n = \theta_0-\sum_i\nabla_\theta l_n(\theta_0,Z^i_n)$ in Lemma 2) and $\nabla\epsilon_n(\theta) = \nabla L_n(\theta)-(\theta-\tilde\theta_n)$, we have $\nabla\epsilon_n(\theta_0) = 0$ exactly. Therefore, for $\|\theta\|\le C$,
$$\|\nabla\epsilon_n(\theta)\| = \|\nabla\epsilon_n(\theta)-\nabla\epsilon_n(\theta_0)\| \le C\cdot\sup_{\|\theta'\|\le C}\|\nabla^2\epsilon_n(\theta')\|.$$
Now $\nabla^2\epsilon_n(\theta) = \nabla^2 L_n(\theta)-I$. The CLT applied to $n\nabla^2_\theta l_n(\theta_0,Z^i_n)$ (which has finite second moments by Assumption 5(iii)) gives $\|\nabla^2 L_n(\theta_0)-I\| = O_{\mu_n}(n^{-1/2})$, using that $\|E_{\mu_n}[n\nabla^2_\theta l_n(\theta_0,Z^i_n)]-I\| = O(n^{-1/2})$ by Assumption 2 and the Lipschitz condition in Assumption 5(iii). The same Lipschitz condition and the functional CLT extend this rate uniformly over bounded neighborhoods:
$$\sup_{\|\theta\|\le C}\|\nabla\epsilon_n(\theta)\| = O_{\mu_n}(n^{-1/2}) = o_{\mu_n}(\rho_n). \qquad (23)$$
Substituting into (22) establishes (21).

Sign agreement for $\tilde\theta_n^{\lambda,-i}$. Since $\tilde\theta_n^{-i}-\tilde\theta_n = -\frac{1}{\sqrt n}X^i_n$, where $X^i_n = -\nabla_\beta l(\theta_0/\sqrt n, Z^i_n)$, the event that $\|\tilde\theta_n^{-i}-\tilde\theta_n\| > \rho_n$ for some $i$ has probability bounded by
$$\sum_{i=1}^n P\left(\frac{1}{\sqrt n}\|X^i_n\| > \rho_n\right) \le n\cdot\frac{E_{\mu_n}\big[\|X^i_n\|^4\big]}{(\rho_n\sqrt n)^4} = \frac{M}{(\rho_n n^{1/4})^4}\to 0,$$
by the fourth-moment bound in Assumption 5(ii) and $\rho_n\cdot n^{1/4}\to\infty$.
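Before proceeding, the displayed union bound can be illustrated numerically. The sketch below is a stand-in under the assumption that the $X^i_n$ are i.i.d. standard normal vectors (so the fourth-moment bound holds), and takes $\rho_n = n^{-1/8}$, which satisfies $\rho_n\to 0$ and $\rho_n n^{1/4}\to\infty$; none of these choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
k, reps = 3, 200
for n in [10, 100, 1000, 10000]:
    rho_n = n ** (-1 / 8)                     # rho_n -> 0, rho_n * n^{1/4} -> infinity
    X = rng.standard_normal((reps, n, k))     # illustrative X_n^i ~ N(0, I_k)
    # Frequency of the "bad" event max_i ||X_n^i|| / sqrt(n) > rho_n across replications.
    bad = (np.linalg.norm(X, axis=2).max(axis=1) / np.sqrt(n)) > rho_n
    print(n, bad.mean())                      # decreases toward zero as n grows
```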
On the complement of this event, $\tilde\theta_n^{-i}\in B_{\rho_n}(\tilde\theta_n)$ for every $i$, so $g_\lambda$ is linear on a neighborhood of each $\tilde\theta_n^{-i}$, and therefore $\operatorname{sign}(\tilde\theta_n^{\lambda,-i}) = \eta$ for all $i$.

Sign agreement for $\hat\theta_n^{\lambda,-i}$: active coordinates. On the event $C_n = 1$, $g_\lambda$ is linear on $B_{\rho_n}(\tilde\theta_n)$. For $j\in J$, this forces $|\tilde\theta_{n,j}| > \lambda+\rho_n$, hence $|\tilde\theta_{n,j}^{\lambda}| = |\tilde\theta_{n,j}|-\lambda > \rho_n$. By (21), $\sup_\lambda\|\hat\theta_n^{\lambda}-\tilde\theta_n^{\lambda}\| = o_{\mu_n}(\rho_n)$, so with probability approaching 1,
$$|\hat\theta_{n,j}^{\lambda}| \ge \rho_n/2 \quad\text{for all } j\in J. \qquad (24)$$
It remains to show that $|\hat\theta_{n,j}^{\lambda,-i}-\hat\theta_{n,j}^{\lambda}| < \rho_n/4$ for all $j\in J$ and all $i$ simultaneously. By (4),
$$\sup_\lambda\big\|\hat\theta_n^{\lambda,-i}-\hat\theta_n^{\lambda}\big\| \le \frac{1}{\mu}\big\|\nabla_\theta l_n(\hat\theta_n,Z^i_n)\big\|.$$
The Lipschitz condition in Assumption 5(ii) gives
$$\big\|\nabla_\theta l_n(\hat\theta_n,Z^i_n)\big\| \le \frac{1}{\sqrt n}\|X^i_n\|+\frac{B_n(Z^i_n)}{\sqrt n}\|\hat\theta_n-\theta_0\|.$$
A Chebyshev union bound on the first term (using the fourth-moment bound in Assumption 5(ii)) and a Chebyshev union bound on the second term (using the second-moment bound $E_{\mu_n}[B_n^2] < \infty$ in Assumption 5(ii), together with $\hat\theta_n-\theta_0 = O_{\mu_n}(1)$) give
$$P\left(\exists i:\ \sup_\lambda\big\|\hat\theta_n^{\lambda,-i}-\hat\theta_n^{\lambda}\big\| > \frac{\rho_n}{4}\right) \le \frac{C_1 M}{(\rho_n n^{1/4})^4}+\frac{C_2 E_{\mu_n}[B_n^2]}{n\rho_n^2}\to 0, \qquad (25)$$
for absolute constants $C_1, C_2$, using $(\rho_n n^{1/4})^4\to\infty$ and $n\rho_n^2 \ge (\rho_n n^{1/4})^2\cdot n^{1/2}\to\infty$. On the complement of the event in (25), and given (24), we have $|\hat\theta_{n,j}^{\lambda,-i}| \ge \rho_n/4 > 0$ with sign $\eta_j$, for all $j\in J$ and all $i$.

Sign agreement for $\hat\theta_n^{\lambda,-i}$: inactive coordinates. It remains to show that $\hat\theta_{n,j}^{\lambda,-i} = 0$ for $j\in J^c$ and all $i$, with probability approaching 1. The argument parallels the one used for the second sum (showing $\operatorname{sign}(\hat\theta_n^{\lambda}) = \eta$), now applied to $L_n^{-i}$ and $\hat\theta_n^{\lambda,-i}$. By Lemma 2,
$$L_n^{-i}(\theta) = \tfrac12\big\|\theta-\tilde\theta_n^{-i}\big\|^2+\epsilon_n^{(i)}(\theta),$$
where
$$\epsilon_n^{(i)}(\theta) = \epsilon_n(\theta)-\Big[l_n(\theta,Z^i_n)-l_n(\theta_0,Z^i_n)-\big\langle\nabla_\theta l_n(\theta_0,Z^i_n),\theta-\theta_0\big\rangle\Big]+\frac{1}{2n}(\theta-\theta_0)^\intercal\big[n\nabla^2_\theta l_n(\theta_0,Z^i_n)\big](\theta-\theta_0).$$
Consider a candidate minimizer $\theta$ of $L_n^{-i}(\theta)+\lambda\|\theta\|_1$ with $\theta_j = t \neq 0$ for some $j\in J^c$, and let $\theta'$ agree with $\theta$ except that $\theta'_j = 0$. On the event that $g_\lambda$ is linear on $B_{\rho_n}(\tilde\theta_n)$, the KKT conditions imply $|\tilde\theta_{n,j}|+\rho_n < \lambda$ for $j\in J^c$. Therefore, by the same calculation as in the second sum,
$$L_n^{-i}(\theta)+\lambda|t|-L_n^{-i}(\theta') \ge \tfrac12 t^2+|t|\left(\lambda-|\tilde\theta_{n,j}^{-i}|-\sup_{\|\theta\|\le C}\big\|\nabla\epsilon_n^{(i)}(\theta)\big\|\right).$$
Since $|\tilde\theta_{n,j}^{-i}| \le |\tilde\theta_{n,j}|+n^{-1/2}\|X^i_n\|$ and the slack is $\lambda-|\tilde\theta_{n,j}| > \rho_n$ on the event under consideration, it suffices to show, simultaneously for all $i$, that
$$n^{-1/2}\|X^i_n\|+\sup_{\|\theta\|\le C}\big\|\nabla\epsilon_n^{(i)}(\theta)\big\| < \rho_n. \qquad (26)$$
Write $\nabla\epsilon_n^{(i)}(\theta) = \nabla\epsilon_n(\theta)-R_3(\theta,Z^i_n)$, where
$$R_3(\theta,Z^i_n) := \nabla_\theta l_n(\theta,Z^i_n)-\nabla_\theta l_n(\theta_0,Z^i_n)-\frac{1}{n}\big[n\nabla^2_\theta l_n(\theta_0,Z^i_n)\big](\theta-\theta_0)$$
is the second-order Taylor remainder of $\nabla_\theta l_n(\cdot,Z^i_n)$ around $\theta_0$. We bound the three contributions to (26) in turn.

Control of $\nabla\epsilon_n$: By (23), $\sup_{\|\theta\|\le C}\|\nabla\epsilon_n(\theta)\| = O_{\mu_n}(n^{-1/2}) = o_{\mu_n}(\rho_n)$; this term is the same for all $i$.

Simultaneous control of $R_3(\cdot,Z^i_n)$: By the Lipschitz condition in Assumption 5(iii),
$$\|R_3(\theta,Z^i_n)\| \le \frac{C_n(Z^i_n)}{2n}\|\theta-\theta_0\|^2,
\quad\text{so}\quad
\sup_{\|\theta\|\le C}\|R_3(\theta,Z^i_n)\| \le \frac{C_n(Z^i_n)\,C^2}{2n}.$$
A Chebyshev union bound gives
$$P\left(\exists i:\ \sup_{\|\theta\|\le C}\|R_3(\theta,Z^i_n)\| > \frac{\rho_n}{3}\right)
\le n\cdot P\left(C_n(Z^i_n) > \frac{2n\rho_n}{3C^2}\right)
\le \frac{9C^4\sup_n E_{\mu_n}\big[C_n(Z^i_n)^2\big]}{4n\rho_n^2}\to 0, \qquad (27)$$
since $n\rho_n^2 \ge (\rho_n n^{1/4})^2\cdot n^{1/2}\to\infty$ and $\sup_n E_{\mu_n}[C_n(Z^i_n)^2] < \infty$ by Assumption 5(iii).

Simultaneous control of $n^{-1/2}\|X^i_n\|$: By the same Chebyshev argument as in the first sign-agreement step,
$$P\left(\exists i:\ n^{-1/2}\|X^i_n\| > \frac{\rho_n}{3}\right) \le \frac{81M}{(\rho_n n^{1/4})^4}\to 0. \qquad (28)$$
On the complement of the three events above, (26) holds for all $i$:
$$n^{-1/2}\|X^i_n\|+\sup_{\|\theta\|\le C}\big\|\nabla\epsilon_n^{(i)}(\theta)\big\| \le \rho_n/3+o_{\mu_n}(\rho_n)+\rho_n/3 < \rho_n$$
for $n$ large enough. The slack is therefore strictly positive, so $\hat\theta_{n,j}^{\lambda,-i} = 0$ for all $j\in J^c$ and all $i$.

Conclusion: Combining the three cases above, we conclude that, on the event described in condition (a) (which has probability approaching 1, given $B_n\to_p 0$), all signs agree: $\operatorname{sign}(\hat\theta_n^{\lambda,-i}) = \operatorname{sign}(\tilde\theta_n^{\lambda,-i}) = \eta$ for every $i$. This contradicts condition (b), so $C_n\to_p 0$. Since $\sum_i\|D^i_n\|^2 = O_p(1)$, we obtain $C_n\cdot\sum_i\|D^i_n\|^2 = o_p(1)$.

Combining the bounds for all three sums, we conclude that $DD_n = o_{\mu_n}(1)$, which completes the proof of Lemma 4.

C Characterizing SURE for Ridge

To prove Lemma 5 for the Ridge penalty $\tfrac12\theta^\intercal A^{-1}\theta$, we start by deriving a series of properties of the function
$$SURE(\lambda,\hat\theta,\Sigma) = \operatorname{trace}(\Sigma)+\|g_\lambda\|^2+2\operatorname{trace}\big(\nabla g_\lambda\cdot\Sigma\big).$$
The properties derived in this section are purely analytic, not probabilistic: they concern the behavior of $SURE$ as a function, not any asymptotic limits. Recall that $g_\lambda(\theta) = \operatorname{argmin}_g\big\{\tfrac12\|g\|^2+\lambda\pi(\theta+g)\big\}$, and that $g_\lambda$ satisfies the first-order condition
$$g_\lambda(\theta) = -\lambda\nabla\pi\big(\theta+g_\lambda(\theta)\big),$$
for a suitable sub-gradient $\nabla\pi$ of $\pi$.

Ridge corresponds to penalties of the form $\pi(\theta) = \tfrac12\theta^\intercal A^{-1}\theta$, where $A$ is positive definite. Denote $C_\lambda = -\big(\tfrac1\lambda A+I\big)^{-1}$. The first-order condition for $g_\lambda(\theta)$ then implies
$$g_\lambda(\theta) = C_\lambda\theta, \qquad \nabla g_\lambda(\theta) = C_\lambda,$$
and thus
$$SURE(\lambda,\theta,\Sigma) = \operatorname{trace}(\Sigma)+\|C_\lambda\theta\|^2+2\operatorname{trace}(C_\lambda\Sigma).$$
The following change of coordinates will be convenient for some of our arguments. Denote $R_n = \|\hat\theta_n\|$ and $\nu_n = \hat\theta_n/R_n$, and similarly for $R$ and $\nu$. In a slight abuse of notation, we shall write
$$SURE(\lambda,R,\nu) = SURE(\lambda,\hat\theta,\Sigma).$$
The following lemma characterizes the behavior of $SURE$ for Ridge. Property 1 and supermodularity are derived directly from the expression for $SURE$; properties 2a, 2b, and 3 are then consequences of supermodularity.

Lemma 6 (Properties of $SURE$ for Ridge). Suppose that $\pi(\theta) = \tfrac12\theta^\intercal A^{-1}\theta$, where $A$ is positive definite. Then the following holds:

1. For every point $\theta$, the function $SURE(\lambda,\theta)$ satisfies $\sup_{\lambda\in\Lambda}|SURE(\lambda,\theta')-SURE(\lambda,\theta)|\to 0$ as $\theta'\to\theta$.

2. The function $SURE(\lambda,R,\nu)$ is strictly supermodular in $\lambda$ and $R$. This implies:
(a) $\lambda(R,\nu) = \operatorname{argmin}_{\lambda\in\mathbb{R}_+} SURE(\lambda,R,\nu)$ is monotonically decreasing in $R$, given $\nu$.
(b) $\lambda(R,\nu)$ has at most countably many discontinuities as a function of $R$, given $\nu$.

3. Fix $\nu$ and $R$ such that $\lambda(\cdot)$ is continuous in $R$ at $(R,\nu)$, and let $\bar\lambda = \lambda(R,\nu)$. Then supermodularity implies that the minimum of $SURE$ is well separated: for any $\epsilon > 0$,
$$\inf_{\lambda\in\mathbb{R}_+\setminus[\bar\lambda-\epsilon,\bar\lambda+\epsilon]} SURE(\lambda,R,\nu)-SURE(\bar\lambda,R,\nu) > 0.$$
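Before turning to the proof, the content of the lemma can be illustrated with a small grid search; `A`, `Sigma`, and `nu` below are arbitrary illustrative choices, not the paper's objects. The printed minimizer $\lambda(R)$ decreases as $R$ grows, as item 2a asserts (for small $R$ it may sit at the upper end of the grid, since weak signals favor heavy shrinkage).

```python
import numpy as np

k = 4
A = np.diag(np.arange(1.0, k + 1))       # positive definite penalty matrix
Sigma = np.eye(k)
nu = np.ones(k) / np.sqrt(k)             # unit direction
lams = np.linspace(1e-3, 50.0, 4000)

def C_lam(lam):
    # C_lambda = -(A/lambda + I)^{-1}, as in the first-order condition above.
    return -np.linalg.inv(A / lam + np.eye(k))

def sure(lam, R):
    # SURE(lambda, R, nu) = trace(Sigma) + ||C_lam theta||^2 + 2 trace(C_lam Sigma).
    C = C_lam(lam)
    theta = R * nu
    return np.trace(Sigma) + np.sum((C @ theta) ** 2) + 2.0 * np.trace(C @ Sigma)

for R in [0.5, 1.0, 2.0, 4.0]:
    vals = [sure(l, R) for l in lams]
    print(R, lams[int(np.argmin(vals))])  # grid minimizer lambda(R), decreasing in R
```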
Proof of Lemma 6 (Properties of $SURE$ for Ridge):

1. We have
$$SURE(\lambda,\theta')-SURE(\lambda,\theta) = \|C_\lambda\theta'\|^2-\|C_\lambda\theta\|^2
\le \|C_\lambda\|^2\cdot\|\theta'-\theta\|\cdot\|\theta'+\theta\|
\le \|\theta'-\theta\|\cdot\|\theta'+\theta\|,$$
where the first inequality uses Cauchy–Schwarz ($\|\theta'\|^2-\|\theta\|^2 = \langle\theta'-\theta,\theta'+\theta\rangle \le \|\theta'-\theta\|\cdot\|\theta'+\theta\|$), and the second uses that $A$ is positive definite, which implies $\|C_\lambda\| \le 1$. The claim follows.

2. The first and last terms in the expression for $SURE$ do not depend on $\theta$. The middle term can be written as $R^2\|C_\lambda\nu\|^2$, and thus
$$\frac{\partial^2}{\partial\lambda\,\partial R} SURE(\lambda,R,\nu) = 2R\cdot\frac{\partial}{\partial\lambda}\|C_\lambda\nu\|^2 > 0,$$
using again the positive definiteness of $A$. This implies that $SURE(\lambda,R,\nu)$ is strictly supermodular in $\lambda$ and $R$; that is, whenever $\epsilon > 0$ and $\delta > 0$,
$$\big[SURE(\lambda+\epsilon,R,\nu)-SURE(\lambda,R,\nu)\big]-\big[SURE(\lambda+\epsilon,R-\delta,\nu)-SURE(\lambda,R-\delta,\nu)\big] > 0.$$
(a) Monotonicity of $\lambda(R,\nu)$ in $R$ follows from supermodularity of $SURE$, by Topkis's theorem.
(b) That $\lambda(R,\nu)$ is continuous in $R$ almost everywhere holds because monotonic functions are continuous almost everywhere: the set of discontinuity points is at most countable, by Theorem 4.30 of Rudin (1991).

3. For the given $R,\nu$, let $\delta$ be such that $|\lambda(R',\nu)-\lambda(R,\nu)| < \epsilon/2$ whenever $|R'-R| \le \delta$; such a $\delta$ exists by continuity. Define
$$A(\lambda_1,\lambda_2,R) = SURE(\lambda_1,R,\nu)-SURE(\lambda_2,R,\nu),$$
let $\bar\lambda = \lambda(R,\nu)$, and let
$$\Delta = \min\Big\{A(\bar\lambda+\epsilon,\bar\lambda+\epsilon/2,R)-A(\bar\lambda+\epsilon,\bar\lambda+\epsilon/2,R-\delta),\ A(\bar\lambda-\epsilon/2,\bar\lambda-\epsilon,R+\delta)-A(\bar\lambda-\epsilon/2,\bar\lambda-\epsilon,R)\Big\}.$$
By strict supermodularity, both of the "double differences" in this definition are positive, and thus $\Delta > 0$. The differences defining $A$ are taken over vertical segments for different $\lambda$ and fixed $R$; the double differences defining $\Delta$ are taken over the corresponding rectangles in the $(R,\lambda)$ plane.

[Figure: the rectangles $[R-\delta,R]\times[\bar\lambda+\epsilon/2,\bar\lambda+\epsilon]$ and $[R,R+\delta]\times[\bar\lambda-\epsilon,\bar\lambda-\epsilon/2]$ over which the double differences defining $\Delta$ are taken.]

We claim that
$$\inf_{\lambda\in\mathbb{R}_+\setminus[\bar\lambda-\epsilon,\bar\lambda+\epsilon]} SURE(\lambda,R,\nu)-SURE(\bar\lambda,R,\nu) \ge \Delta.$$
The following argument corresponds to the "top left" rectangle in the figure; the "bottom right" case is symmetric. Fix $\tilde\lambda > \bar\lambda+\epsilon$.
- Given our assumptions and definitions, $\tilde\lambda > \bar\lambda+\epsilon > \bar\lambda+\epsilon/2 > \lambda(R-\delta) \ge \bar\lambda$. Therefore, by supermodularity,
$$A(\tilde\lambda,\lambda(R-\delta),R)-A(\tilde\lambda,\lambda(R-\delta),R-\delta) \ge A(\bar\lambda+\epsilon,\bar\lambda+\epsilon/2,R)-A(\bar\lambda+\epsilon,\bar\lambda+\epsilon/2,R-\delta).$$
- By the definition of $\Delta$, $A(\bar\lambda+\epsilon,\bar\lambda+\epsilon/2,R)-A(\bar\lambda+\epsilon,\bar\lambda+\epsilon/2,R-\delta) \ge \Delta$.
- By optimality of $\lambda(R-\delta)$ for $R-\delta$, $A(\tilde\lambda,\lambda(R-\delta),R-\delta) \ge 0$.
- Combining the preceding three items, we get $A(\tilde\lambda,\lambda(R-\delta),R) \ge \Delta$, and thus
$$SURE(\tilde\lambda,R,\nu) \ge SURE(\lambda(R-\delta),R,\nu)+\Delta \ge SURE(\bar\lambda,R,\nu)+\Delta.$$
The argument for $\tilde\lambda < \bar\lambda-\epsilon$ is analogous, and the claim follows.

D Characterizing SURE for Lasso

Lasso corresponds to penalties of the form $\pi(\theta) = \|A^{-1}\theta\|_1$, where $A$ is an invertible matrix and $\|\cdot\|_1$ is the $L_1$ norm.⁸ Given $\lambda$, we can characterize $g_\lambda(\theta)$ as follows. Denote $h^\lambda(\theta) = A^{-1}\big(\theta+g_\lambda(\theta)\big)$. The optimal $h^\lambda(\theta)$ solves
$$h^\lambda(\theta) = \operatorname{argmin}_h\ \tfrac12\|Ah-\theta\|^2+\lambda\|h\|_1.$$
The solution to this convex optimization problem is of the form
$$h^\lambda_J(\theta) = (A_J'A_J)^{-1}\big[A_J'\theta-\lambda\eta_J\big],$$
where $\eta_j = \operatorname{sign}(h^\lambda_j)$, $J = \{j:\eta_j\neq 0\}$, and $A_J$ is the subset of columns of $A$ corresponding to the index set $J$. These conditions follow immediately from the first-order conditions for $h^\lambda(\theta)$.

⁸ For Lasso, it is not without loss of generality to assume that $A$ is diagonal.
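For $A = I$, the displayed formula reduces to coordinate-wise soft-thresholding, $h^\lambda_j(\theta) = \operatorname{sign}(\theta_j)\max(|\theta_j|-\lambda,0)$. A minimal sketch (illustrative inputs, not the paper's data) verifies the first-order conditions for this solution directly:

```python
import numpy as np

rng = np.random.default_rng(2)
k, lam = 6, 0.8
theta = 2.0 * rng.standard_normal(k)     # illustrative input vector

# With A = I, h^lam(theta) = argmin_h 0.5*||h - theta||^2 + lam*||h||_1 is
# soft-thresholding, matching h_J = (A_J'A_J)^{-1}[A_J' theta - lam * eta_J].
h = np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

# First-order conditions: (h - theta)_j = -lam * eta_j on active coordinates,
# and |(h - theta)_j| <= lam on inactive ones.
g = h - theta
active = h != 0
assert np.allclose(g[active], -lam * np.sign(h[active]))
assert np.all(np.abs(g[~active]) <= lam + 1e-12)
print("KKT conditions verified for the soft-thresholding solution")
```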
Lemma 7 (Properties of $SURE$ for Lasso). Suppose that $\pi(\theta) = \|A^{-1}\theta\|_1$, where $A$ is an invertible matrix and $\|\cdot\|_1$ is the $L_1$ norm. Let $k = \dim(\theta)$. Then the following holds:

1. As a function of $\lambda$, for every $R,\nu$, the graph of $SURE(\lambda,R,\nu)$ consists of at most $3^k$ continuous segments, indexed by $\eta\in\{-1,0,1\}^k$. On each of these segments, $\eta$ and $J = \{j:\eta_j\neq 0\}$ are constant,
$$\nabla g_\lambda = A_J(A_J'A_J)^{-1}A_J'-I, \qquad \|g_\lambda\|^2 = \|\nabla g_\lambda\cdot\theta\|^2+\lambda^2\,\eta_J'(A_J'A_J)^{-1}\eta_J,$$
and $SURE(\lambda,R,\nu)$ is a monotonically increasing quadratic polynomial in $\lambda$, of the form
$$SURE(\lambda,R,\nu) = \text{const.}+\lambda^2\,\eta_J'(A_J'A_J)^{-1}\eta_J.$$

2. $SURE$ for Lasso scales with $R$ as follows:
$$SURE(R\lambda,R,\nu) = \operatorname{trace}(\Sigma)+R^2\|g_\lambda(\nu)\|^2+2\operatorname{trace}\big(\nabla g_\lambda(\nu)\cdot\Sigma\big).$$
Given $\nu$, let $\lambda_1,\lambda_2,\ldots,\lambda_m$ ($m \le 3^k$) be the local minimizers of $SURE(\lambda,1,\nu)$, corresponding to the values of $\lambda$ where $\eta$ changes. The local minimizers of $SURE(\lambda,R,\nu)$ are then given by $R\lambda_1,R\lambda_2,\ldots,R\lambda_m$.

3. Let $\lambda(R,\nu) = \operatorname{argmin}_{\lambda\in\mathbb{R}_+} SURE(\lambda,R,\nu)$. Then, given $\nu$, $\lambda(R,\nu) = R\cdot\lambda_{j(R)}$, where $j(R)$ is a monotonically decreasing mapping from $R$ to $\{1,2,\ldots,m\}$, and $\lambda_1,\lambda_2,\ldots,\lambda_m$ are as before. The graph of $\lambda(R,\nu)$ thus follows a piecewise linear "sawtooth" pattern with at most $3^k$ jumps.

4. Fix $\nu$ and $R$ such that $\lambda(\cdot)$ is continuous in $R$ at $(R,\nu)$, and let $\bar\lambda = \lambda(R,\nu)$ be such that $\eta\neq 0$. Then the minimum of $SURE$ is well separated: for any $\epsilon > 0$,
$$\inf_{\lambda\in\mathbb{R}_+\setminus[\bar\lambda-\epsilon,\bar\lambda+\epsilon]} SURE(\lambda,R,\nu)-SURE(\bar\lambda,R,\nu) > 0.$$

Proof of Lemma 7 (Properties of $SURE$ for Lasso):

1. Consider two values $\lambda_1,\lambda_2$ of $\lambda$ such that $\eta$ is the same for these two values. It follows from the optimality conditions for $h^\lambda$ that, for any intermediate value of $\lambda$ between $\lambda_1$ and $\lambda_2$, the optimal $h^\lambda$ is a linear interpolation between $h^{\lambda_1}$ and $h^{\lambda_2}$; in particular, $\eta$ remains the same between $\lambda_1$ and $\lambda_2$. For details, see Mairal and Yu (2012), Lemma 2. The vector $\eta\in\{-1,0,1\}^k$ can take $3^k$ possible values.

The gradient of $g_\lambda$ with respect to $\theta$ is a function of $J$, which is in turn a function of $\eta$, but it does not otherwise depend on $\lambda$:
$$\nabla g_\lambda = A\cdot\nabla h^\lambda-I = A_J(A_J'A_J)^{-1}A_J'-I.$$
It follows that the term $2\operatorname{trace}(\nabla g_\lambda\cdot\Sigma)$ in the expression for $SURE$ has at most $3^k-1$ jumps as a function of $\lambda$, for fixed $\theta$, and is constant in between these jumps.⁹

Consider now the term $\|g_\lambda\|^2$, the other term in the expression for $SURE$. Fixing $\lambda$, and the corresponding set $J$ of active coordinates, we get that $\|g_\lambda\|^2 = \|A_Jh^\lambda_J-\theta\|^2$, which is minimized at $\lambda = 0$, holding $J$ fixed. For $\lambda = 0$ we get
$$\|A_Jh^0_J-\theta\|^2 = \big\|\big(A_J(A_J'A_J)^{-1}A_J'-I\big)\theta\big\|^2 = \|\nabla g_\lambda\cdot\theta\|^2.$$
This is the sum of squared errors of an OLS regression of the elements of $\theta$ on the columns of $A_J$. Given $J$, $\|g_\lambda\|^2$ is a quadratic function of $\lambda$, with second derivative
$$\partial^2_\lambda\|A_Jh^\lambda_J-\theta\|^2 = 2\,\eta_J'(A_J'A_J)^{-1}\eta_J.$$
It follows that
$$\|g_\lambda\|^2 = \|\nabla g_\lambda\cdot\theta\|^2+\lambda^2\,\eta_J'(A_J'A_J)^{-1}\eta_J.$$
This last result implies that $\|g_\lambda\|^2$ is monotonically increasing in $\lambda$ on each segment defined by $\eta$.

⁹ This bound can be refined, cf. Mairal and Yu (2012), but it is enough for our purposes.
Since $g_\lambda$ is continuous in $\lambda$ (cf. Lemma 2 in Mairal and Yu 2012), this also implies that $\|g_\lambda\|^2$ is monotonically increasing across all of $\lambda\in\mathbb{R}_+$; we will use this fact below.

2. Multiplying the objective of the optimization problem $h^\lambda(\theta) = \operatorname{argmin}_h\{\tfrac12\|Ah-\theta\|^2+\lambda\|h\|_1\}$ by a factor $1/R^2$ yields
$$h^\lambda(R\nu) = \operatorname{argmin}_h\left\{\tfrac12\big\|A(h/R)-\nu\big\|^2+\frac{\lambda}{R}\cdot\big\|h/R\big\|_1\right\} = R\,h^{\lambda/R}(\nu),$$
and thus also $g_\lambda(R\nu) = R\,g_{\lambda/R}(\nu)$ and
$$\nabla g_\lambda(R\nu) = \frac1R\,\partial_\nu\big[g_\lambda(R\nu)\big] = \nabla g_{\lambda/R}(\nu),$$
which immediately implies
$$SURE(R\lambda,R,\nu) = \operatorname{trace}(\Sigma)+R^2\|g_\lambda(\nu)\|^2+2\operatorname{trace}\big(\nabla g_\lambda(\nu)\cdot\Sigma\big).$$
Turning to the characterization of local minima: since $\|g_\lambda(\nu)\|^2$ is monotonically increasing in $\lambda$ (cf. item 1), the local minima of $SURE(R\lambda,R,\nu)$, as a function of $\lambda$, are exactly the values of $\lambda$ where $2\operatorname{trace}(\nabla g_\lambda(\nu)\cdot\Sigma)$ jumps down. These values are independent of $R$, and the claim follows.

3. That $\lambda(R,\nu) = R\cdot\lambda_{j(R)}$ follows immediately from the preceding item. It remains to show that $j(R)$ is monotonically decreasing. To see this, consider any pair of values $j > j'$. Then $\|g_{\lambda_j}(\nu)\|^2 > \|g_{\lambda_{j'}}(\nu)\|^2$, by monotonicity of $\|g_\lambda(\nu)\|^2$ in $\lambda$ (cf. item 1), and we get that
$$SURE(R\lambda_j,R,\nu)-SURE(R\lambda_{j'},R,\nu) = R^2\big(\|g_{\lambda_j}(\nu)\|^2-\|g_{\lambda_{j'}}(\nu)\|^2\big)+\text{const.},$$
where the constant does not depend on $R$; the difference is therefore increasing in $R$. The claim follows.

4. By the preceding argument, at a point of continuity in $R$,
$$\inf_{\lambda\in\{R\lambda_1,R\lambda_2,\ldots,R\lambda_m\}\setminus\bar\lambda} SURE(\lambda,R,\nu)-SURE(\bar\lambda,R,\nu) > 0.$$
The same holds for $\lambda$ to the right of any of the local minimizers $\{R\lambda_1,R\lambda_2,\ldots,R\lambda_m\}\setminus\bar\lambda$, since $SURE$ is monotonically increasing in $\lambda$ away from the local minimizers. It only remains to verify the condition for $\lambda$ immediately to the right of the global minimizer $\bar\lambda$. This holds because $\|g_\lambda\|^2 = \text{const.}+\lambda^2\,\eta_J'(A_J'A_J)^{-1}\eta_J$ is strictly monotonically increasing in $\lambda$ for $\lambda > 0$ and $\eta\neq 0$.
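The scaling and sawtooth structure in items 2 and 3 can be visualized with a small grid search. The sketch below takes $A = I$ and $\Sigma = I_k$, so that $h^\lambda$ is soft-thresholding and $2\operatorname{trace}(\nabla g_\lambda\cdot\Sigma) = 2(|J|-k)$; the direction `nu` is an arbitrary illustrative choice. Up to grid resolution, the printed ratio $\lambda(R)/R$ is piecewise constant in $R$ and jumps down as $R$ grows.

```python
import numpy as np

k = 3
nu = np.array([1.0, 0.6, 0.25])
nu /= np.linalg.norm(nu)
lams = np.linspace(1e-4, 4.0, 8000)

def sure(lam, theta):
    # A = I: g_lam(theta) = soft_threshold(theta, lam) - theta, and
    # trace(grad g_lam * Sigma) = (# active coordinates) - k for Sigma = I.
    h = np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)
    return k + np.sum((h - theta) ** 2) + 2.0 * ((h != 0).sum() - k)

for R in np.linspace(0.5, 6.0, 12):
    vals = np.array([sure(l, R * nu) for l in lams])
    lam_star = lams[vals.argmin()]
    print(f"R={R:4.2f}  lambda(R)={lam_star:6.3f}  lambda(R)/R={lam_star / R:5.3f}")
```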
Lemma 8 (Local linearity of $g_\lambda$).
1. For all $\gamma > 0$ there exists an $\epsilon > 0$ such that $g_\lambda(\theta)$ is linear on $B_\epsilon(\hat\theta) = \{\theta:\|\theta-\hat\theta\| < \epsilon\}$ with probability greater than $1-\gamma$, where $\hat\theta\sim N(\theta_0,\Sigma)$.
2. For almost every point $\theta$, the function $SURE(\lambda,\theta)$ satisfies $\sup_{\lambda\in\Lambda}|SURE(\lambda,\theta')-SURE(\lambda,\theta)|\to 0$ as $\theta'\to\theta$.

Proof of Lemma 8: To show the first claim, denote $\Theta_\eta = \{\theta:\operatorname{sign}(h^\lambda(\theta)) = \eta\}$. By the KKT conditions for $h^\lambda$, the set $\Theta_\eta$ is convex for every $\eta\in\{-1,0,1\}^k$, and its boundary $\partial\Theta_\eta$ is a finite union of subsets of hyperplanes. Furthermore, $\mathbb{R}^k = \bigcup_\eta\Theta_\eta$, and $h^\lambda(\theta)$ is linear in $\theta$ on each of the sets $\Theta_\eta$. The claim of the lemma therefore follows if we can show that the probability of an $\epsilon$-band around the boundary $\partial\Theta_\eta$,
$$\partial\Theta_\eta^\epsilon = \{\theta: d(\theta,\partial\Theta_\eta) < \epsilon\},$$
where $d$ is the Euclidean distance, vanishes for small $\epsilon$. Because $\partial\Theta_\eta$ is a subset of a finite union of hyperplanes, $P(\hat\theta\in\partial\Theta_\eta) = 0$. By the properties of probability measures, since
$$\partial\Theta_\eta = \bigcap_{\epsilon>0}\partial\Theta_\eta^\epsilon,$$
we get
$$0 = P(\hat\theta\in\partial\Theta_\eta) = \lim_{\epsilon\to0}P(\hat\theta\in\partial\Theta_\eta^\epsilon).$$
Therefore, for $\epsilon$ small enough, $\hat\theta$ is more than $\epsilon$ away from the boundary of any of the sets $\Theta_\eta$ with probability greater than $1-\gamma$, and the claim follows.

We now turn to the second claim. Fix $\lambda\in\Lambda$. By the preceding argument, almost every point $\theta$ lies in the interior of $\Theta_\eta$ for some $\eta$. Thus $\theta'\in\Theta_\eta$ as well, for $\|\theta'-\theta\|$ small enough. By the characterization of $SURE$ for Lasso in item 1 of Lemma 7, given $\lambda$ and $\eta$, $\nabla g_\lambda$ is constant and $SURE = \text{const.}+\|g_\lambda\|^2 = \text{const.}+\|\nabla g_\lambda\cdot\theta\|^2$. This is continuous in $\theta$, and thus $|SURE(\lambda,\theta')-SURE(\lambda,\theta)|\to 0$ as $\theta'\to\theta$. Since almost-everywhere continuity of $SURE$ in $\theta$ thus holds for fixed $\lambda$, it also holds simultaneously for any finite set of $\lambda$, and the claim follows.

E Convergence of risk

The remainder of our proof will draw on some standard results, which we recall here, including the following results from van der Vaart (2000):

1. Joint convergence (item (v) of Theorem 2.6): If $W_{1n}\to_d W_1$ and $W_{2n}\to_p 0$, then $W_n\to_d W$, where $W_n = (W_{1n},W_{2n})$ and $W = (W_1,0)$.

2. Almost-everywhere continuous mapping theorem (item (ii) of Theorem 2.3): Suppose that $W_n\to_d W$ (convergence in distribution), and that $s$ is continuous almost everywhere with respect to the distribution of $W$. Then $s(W_n)\to_d s(W)$.

3. Uniform integrability and convergence of expectations (Theorem 2.20): Suppose that $s(W_n)\to_d s(W)$, and that
$$\lim_{M\to\infty}\limsup_{n\to\infty}E\big[|s(W_n)|\,1(|s(W_n)| > M)\big] = 0. \qquad (29)$$
Then $E[s(W_n)]\to E[s(W)]$.

E.1 Convergence in distribution

Consider some arbitrary function $c(\lambda)$ that is minimized to choose the tuning parameter $\lambda$ (in due time, we will substitute $CV_n$ for this function). Define
$$\Delta(\lambda) = c(\lambda)-SURE(\lambda,\theta,\Sigma).$$
We think of $\Delta$ as an element of the space of bounded functions on $\Lambda\subset\mathbb{R}_+$, endowed with the sup norm. Define furthermore
$$w = (\theta,\Delta), \qquad \|w\| = \|\theta\|+\sup_{\lambda\in\Lambda}|\Delta(\lambda)|.$$
We have proven for Ridge (in Lemma 6), and for Lasso (in Lemma 7), that the minimum of $SURE$ with respect to $\lambda$ is well separated for almost all $\theta$ (with the exception, for Lasso, of values of $\theta$ for which $\lambda$ is so large that $\hat\theta^\lambda = 0$). This implies the following lemma. The "min" in the definition of $\tilde\lambda$ serves as a tie-breaking rule in the case of non-uniqueness of the minimizer.

Lemma 9. For almost every $\theta$, the mapping from $w = (\theta,\Delta)$ to $g_{\tilde\lambda(\theta,\Delta)}(\theta)$, where
$$\tilde\lambda(\theta,\Delta) = \min\operatorname{argmin}_{\lambda\in\Lambda}\big[SURE(\lambda,\theta,\Sigma)+\Delta(\lambda)\big],$$
is continuous at $w = (\theta,0)$ with respect to the norm $\|w\|$.

Proof. We first prove the claim for Ridge, before discussing the modifications needed for the argument to apply to Lasso.

Fix $\nu$. By item 2b of Lemma 6, for almost every $R$, $\lambda(R,\nu)$ is continuous in $R$. Fix such an $R$, and let $\theta = R\nu$. Fix $\epsilon > 0$ and let
$$\gamma = \inf_{\lambda\in\mathbb{R}_+\setminus[\bar\lambda-\epsilon,\bar\lambda+\epsilon]} SURE(\lambda,R,\nu)-SURE(\bar\lambda,R,\nu),$$
where $\bar\lambda = \tilde\lambda(\theta,0)\in\operatorname{argmin}_\lambda SURE(\lambda,R,\nu)$. By item 3 of Lemma 6, $\gamma > 0$.

[Figure: $SURE(\lambda,R,\nu)$ as a function of $\lambda$, with the well-separation gap $\gamma$ between $SURE(\bar\lambda,R,\nu)$ and the values of $SURE$ outside $[\bar\lambda-\epsilon,\bar\lambda+\epsilon]$.]

By item 1 of Lemma 6, there exists a $\delta$ such that if $\|\theta'-\theta\| < \delta$, then $\sup_{\lambda\in\mathbb{R}_+}|SURE(\lambda,\theta')-SURE(\lambda,\theta)| < \gamma/4$. Let $w' = (\theta',\Delta)$ and $w = (\theta,0)$. Suppose that $\|w'-w\| < \min(\delta,\gamma/4)$, so that $\|\theta'-\theta\| < \delta$ and $\sup_\lambda|\Delta(\lambda)| < \gamma/4$. Denote $c(\lambda) = SURE(\lambda,\theta')+\Delta(\lambda)$, so that $\tilde\lambda(\theta',\Delta)$ is a minimizer of $c(\lambda)$. Then
$$SURE\big(\tilde\lambda(\theta',\Delta),\theta\big)
< SURE\big(\tilde\lambda(\theta',\Delta),\theta'\big)+\tfrac14\gamma
< c\big(\tilde\lambda(\theta',\Delta)\big)+\tfrac12\gamma
\le c(\bar\lambda)+\tfrac12\gamma
< SURE(\bar\lambda,\theta')+\tfrac34\gamma
< SURE(\bar\lambda,\theta)+\gamma.$$
It follows that $|\tilde\lambda(\theta',\Delta)-\bar\lambda| < \epsilon$. This proves that $\tilde\lambda(\theta',\Delta)$ is continuous at $(\theta,0)$, and the claim for Ridge follows, since continuity of $g_\lambda(\theta) = -\big(\tfrac1\lambda A+I\big)^{-1}\theta$ in both $\lambda$ and $\theta$ is immediate.
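Before turning to the Lasso modifications, note that the mechanism of this proof — uniform convergence plus a well-separated minimum implies convergence of the argmin — has a simple numerical counterpart. In the toy sketch below, `f` and `delta` are stand-ins for $SURE$ and $\Delta$, not the paper's objects: as the sup norm of the perturbation shrinks, the argmin moves less.

```python
import numpy as np

lams = np.linspace(0.0, 10.0, 10001)
f = (lams - 3.0) ** 2 / (1.0 + lams)        # toy curve with a well-separated minimum at 3
for eps in [0.5, 0.05, 0.005]:
    delta = eps * np.sin(7.0 * lams)        # perturbation with sup norm eps
    shift = abs(lams[np.argmin(f + delta)] - lams[np.argmin(f)])
    print(f"sup|Delta| = {eps:5.3f}  ->  argmin moved by {shift:.4f}")
```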
Turning to Lasso, most of this argument holds verbatim, with the following modifications. Consider first values of $\theta$ such that $\hat\theta^{\tilde\lambda}\neq 0$.
1. By item 4 of Lemma 7, $\gamma > 0$ for almost all such $\theta$.
2. By item 2 of Lemma 8, for almost all $\theta$ there exists a $\delta$ such that if $\|\theta'-\theta\| < \delta$, then $\sup_{\lambda\in\Lambda}|SURE(\lambda,\theta')-SURE(\lambda,\theta)| < \gamma/4$.
3. By the characterization of $g_\lambda$ and $h^\lambda$ given at the outset of Appendix D, $g_\lambda(\theta)$ is continuous in both $\lambda$ and $\theta$.
The claim thus follows for $\theta$ such that $\hat\theta^{\tilde\lambda}\neq 0$.

Consider now values of $\theta$ such that $\hat\theta^{\tilde\lambda} = 0$. For such values, $SURE$ is flat in $\lambda$ for values of $\lambda$ greater than $\tilde\lambda$, because $\hat\theta^\lambda = 0$ and $\nabla g_\lambda = 0$ for all such $\lambda$ (see Figure 2 for an example). Because $SURE$ is flat to the right, continuity of $\tilde\lambda(\theta,\Delta)$ does not necessarily hold at $(\theta,0)$: small perturbations of $\Delta$ can lead to large changes of $\tilde\lambda$. By the same arguments used to prove item 4 of Lemma 7, we obtain, however (for almost all such $\theta$), that
$$\inf_{\lambda<\lambda_m} SURE(\lambda,R,\nu)-SURE(\bar\lambda,R,\nu) > 0,$$
where $\lambda_m$ is defined as in Lemma 7. This, in combination with item 2 of Lemma 8 (for almost all $\theta$, $\sup_{\lambda\in\Lambda}|SURE(\lambda,\theta')-SURE(\lambda,\theta)|\to 0$ as $\theta'\to\theta$), implies that $\tilde\lambda(\theta',\Delta)$ is such that $g_{\tilde\lambda(\theta',\Delta)}(\theta') = -\theta'$ for all $(\theta',\Delta)$ in a neighborhood of $(\theta,0)$, and the claim follows.

E.2 Proof of convergence in distribution

We can now prove Lemma 5, drawing on our preceding lemmas.

Proof of Lemma 5:
- By Lemma 2, $\hat\theta_n\to_d\hat\theta\sim N(\theta_0,\Sigma)$.
- Let $\Delta_n(\lambda) = CV_n(\lambda)-SURE(\lambda,\hat\theta_n,\Sigma)$. By Lemma 4, $\sup_{\lambda\in\Lambda}|\Delta_n(\lambda)|\to_p 0$.
- By joint convergence (van der Vaart 2000, item (v) of Theorem 2.6), $W_n = (\hat\theta_n,\Delta_n)\to_d W = (\hat\theta,0)$.
- By Lemma 9, the mapping from $W_n$ to $\hat\theta_n+g_{\tilde\lambda(\hat\theta_n,\Delta_n)}(\hat\theta_n)$ is almost surely continuous on the support of $W$.
- By definition, $\tilde\lambda(\hat\theta_n,\Delta_n) = \lambda^*_n = \operatorname{argmin}_\lambda CV_n(\lambda)$, and $\tilde\lambda(\hat\theta,0) = \lambda^* = \operatorname{argmin}_\lambda SURE(\lambda,\hat\theta,\Sigma)$.
- The almost-everywhere continuous mapping theorem (van der Vaart 2000, Theorem 2.3) then implies $\hat\theta_n+g_{\lambda^*_n}(\hat\theta_n)\to_d\hat\theta+g_{\lambda^*}(\hat\theta) = \hat\theta^{\lambda^*}$.
- By Lemma 2, $\hat\theta^*_n = \hat\theta_n+g_{\lambda^*_n}(\hat\theta_n)+o_p(1)$, and the claim follows.

E.3 Convergence of loss and risk

Proof of Theorem 1: By Lemma 3, $\bar L_n(\theta,\theta_0)\to\tfrac12\|\theta-\theta_0\|^2$ uniformly in any bounded neighborhood of $\theta_0$. By Lemma 5, $\hat\theta^*_n\to_d\hat\theta^*$. Combining these two results gives
$$\bar L_n(\hat\theta^*_n,\theta_0)\to_d\tfrac12\|\hat\theta^*-\theta_0\|^2.$$
The distributional convergence claim of Theorem 1 follows. The claim of Corollary 1 is then immediate.
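As a final illustration of the limit objects appearing in Theorem 1, the following Monte Carlo sketch checks that SURE is an unbiased estimate of the squared-error risk of a Ridge-type shrinkage rule in the normal means model. The true parameter `theta0`, the choice $\Sigma = I$, and `lam` are illustrative values, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
k, reps = 5, 200_000
theta0 = np.array([2.0, 1.0, 0.5, 0.0, -1.5])
lam = 1.0
c = 1.0 / (1.0 + lam)        # Ridge with A = I: theta_hat^lam = c * theta_hat

theta_hat = theta0 + rng.standard_normal((reps, k))      # theta_hat ~ N(theta0, I)
loss = np.sum((c * theta_hat - theta0) ** 2, axis=1)     # squared-error loss
# SURE(lam, theta_hat, I) = k + ||g_lam||^2 + 2 trace(grad g_lam), with
# g_lam(theta) = (c - 1) * theta, so grad g_lam = (c - 1) * I.
sure = k + (c - 1.0) ** 2 * np.sum(theta_hat ** 2, axis=1) + 2.0 * k * (c - 1.0)
print(loss.mean(), sure.mean())   # the two averages agree up to Monte Carlo error
```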