Asymptotic Theory of $K$-fold Cross-validation in Lasso and the validity of Bootstrap
Least absolute shrinkage and selection operator, or Lasso, is one of the most widely used regularization methods in regression. In practice, statisticians usually implement Lasso by choosing the penalty parameter in a data-dependent way, the most popular choice being $K$-fold cross-validation ($K$-fold CV). However, inferential properties of the $K$-fold CV based Lasso estimator, such as variable selection consistency and $n^{1/2}$-consistency, and the validity of the Bootstrap approximation are still unknown. In this paper, we consider the heteroscedastic linear regression model and show, under only moment-type conditions, that the Lasso estimator with $K$-fold CV based penalty is $n^{1/2}$-consistent, but not variable selection consistent. Additionally, we establish the validity of the Bootstrap in approximating the distribution of the $K$-fold CV based Lasso estimator. Therefore, our results theoretically justify the use of the $K$-fold CV based Lasso estimator to perform statistical inference in linear regression. We validate our Bootstrap method for the $K$-fold CV based Lasso estimator in finite samples through simulations, and we also apply our Bootstrap based inference to a real data set.
💡 Research Summary
This paper investigates the asymptotic properties of the Lasso estimator when its penalty parameter is chosen by K‑fold cross‑validation (CV), and it establishes the validity of a bootstrap procedure for inference in this setting.
**Model and notation**
The authors consider a heteroscedastic linear regression model
$y_i = x_i^\top\beta + \varepsilon_i,\; i=1,\dots,n,$
with a fixed, full-rank design matrix $X$ and independent but not necessarily identically distributed errors $\varepsilon_i$ satisfying $E(\varepsilon_i)=0$. The Lasso estimator with penalty $\lambda>0$ is defined as
$\hat\beta_n(\lambda)=\arg\min_{\beta}\bigl\{(2n)^{-1}\sum_{i=1}^n (y_i-x_i^\top\beta)^2+\lambda\|\beta\|_1\bigr\}.$
For a given integer $K$, the data are split into $K$ equally sized folds $I_1,\dots,I_K$. For each fold $k$, the Lasso is fitted on the training set (all observations except those in $I_k$), yielding $\hat\beta_{n}^{-k}(\lambda)$. The $K$-fold CV criterion is
$H_{n,K}(\lambda)=\frac{1}{2K}\sum_{k=1}^K\sum_{i\in I_k}\bigl(y_i-x_i^\top\hat\beta_{n}^{-k}(\lambda)\bigr)^2.$
The CV-selected penalty $\hat\lambda_{n,K}$ minimizes $H_{n,K}(\lambda)$. The object of study is the CV-based Lasso estimator $\hat\beta_n(\hat\lambda_{n,K})$.
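The selection step above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: the coordinate-descent solver, the fold construction, and the finite candidate grid `lambdas` are all simplifying assumptions (the paper treats $\lambda$ as a continuous tuning parameter).

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding, the scalar proximal map of t * |.|."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent on
    (2n)^{-1} * sum_i (y_i - x_i' beta)^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n            # X_j'X_j / n
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with coordinate j removed
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j / n, lam) / col_sq[j]
    return beta

def kfold_cv_lasso(X, y, lambdas, K=5, seed=0):
    """Select the penalty minimizing the K-fold CV criterion H_{n,K}
    over a finite grid, then refit on the full sample."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    H = np.zeros(len(lambdas))
    for m, lam in enumerate(lambdas):
        for fold in folds:
            train = np.setdiff1d(np.arange(n), fold)
            b = lasso(X[train], y[train], lam)   # beta_hat^{-k}(lam)
            H[m] += 0.5 * ((y[fold] - X[fold] @ b) ** 2).sum()
    H /= K
    lam_hat = lambdas[int(np.argmin(H))]
    return lam_hat, lasso(X, y, lam_hat)
```

The constant $1/(2K)$ does not affect the minimizer, so any positive rescaling of `H` selects the same penalty.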
**Variable selection consistency (VSC) fails**
Classical results (Knight & Fu 2000; Zhao & Yu 2006) show that the Lasso can achieve VSC only if the penalty diverges faster than $n^{1/2}$. The authors prove (Theorem 3.1) that $\hat\lambda_{n,K}=o_p(n)$ and that the scaled sequence $\{n^{-1/2}\hat\lambda_{n,K}\}$ is bounded in probability (Equation 1.5). By invoking Lahiri (2021), who demonstrated that divergence of $\{n^{-1/2}\lambda_n\}$ is necessary for VSC, they conclude that the CV-based Lasso cannot be variable-selection consistent. This resolves a long-standing practical question: $K$-fold CV optimizes predictive performance but does not guarantee correct sparsity recovery.
**$n^{1/2}$-consistency and limiting distribution**
The next step is to understand whether $\hat\beta_n(\hat\lambda_{n,K})$ is $n^{1/2}$-consistent. The authors first establish stochastic equicontinuity of the Lasso solution path as a function of $\lambda$ (Proposition 4.1), strengthening earlier continuity results (Efron et al. 2004; Tibshirani & Taylor 2011). Using Dudley's almost-sure representation theorem and a uniform convergence argument (Newey 1991), they show that
$n^{-1/2}\hat\lambda_{n,K}\xrightarrow{d}\hat\Lambda_{\infty,K}$ (Equation 1.6), where $\hat\Lambda_{\infty,K}$ is a random variable defined in Table 1 that depends on the limiting Gaussian process $W_\infty\sim N(0,S)$ and the limiting design matrices.
With this convergence in hand, they apply a new Argmin theorem for convex stochastic processes (Choudhury & Das 2024) to obtain the asymptotic distribution of the centered and scaled estimator:
$n^{1/2}\bigl(\hat\beta_n(\hat\lambda_{n,K})-\beta\bigr)\xrightarrow{d}\arg\min_{u} V_\infty(u,\hat\Lambda_{\infty,K})$ (Equation 1.7).
The limiting objective function is
$V_\infty(u,\lambda)=\tfrac{1}{2} u^\top L u - u^\top W_\infty + \lambda\sum_{j\le p_0}\operatorname{sgn}(\beta_j)u_j + \lambda\sum_{j>p_0}|u_j|,$
where the coordinates are ordered so that the first $p_0$ coefficients of $\beta$ are non-zero and the remaining ones are zero. This distribution is generally non-Gaussian and depends on the random limit $\hat\Lambda_{\infty,K}$, making direct analytical inference infeasible.
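One way to appreciate the non-Gaussian shape of this limit is to simulate $\arg\min_u V_\infty(u,\lambda)$ directly for a fixed $\lambda$. The objective is quadratic plus a linear term on the non-zero coordinates and an $\ell_1$ term on the true-zero coordinates, so proximal gradient descent with coordinate-wise soft-thresholding minimizes it. Everything below ($L$, $S$, $\lambda$, the sign pattern) is a toy placeholder rather than a quantity from the paper.

```python
import numpy as np

def argmin_V(L, W, lam, signs, step=0.9, n_iter=200):
    """Minimize V(u) = 0.5 u'Lu - u'W + lam * sum_{j<=p0} sgn(beta_j) u_j
    + lam * sum_{j>p0} |u_j| by proximal gradient descent.
    `signs` holds sgn(beta_j) on the non-zero coordinates and 0 on the
    true-zero coordinates, where the |u_j| penalty applies instead."""
    u = np.zeros(len(W))
    zero = signs == 0
    for _ in range(n_iter):
        # gradient step on the smooth (quadratic + linear) part
        u = u - step * (L @ u - W + lam * signs)
        # soft-threshold only the true-zero coordinates
        u[zero] = np.sign(u[zero]) * np.maximum(np.abs(u[zero]) - step * lam, 0.0)
    return u

rng = np.random.default_rng(1)
p, p0, lam = 5, 2, 0.5
L_mat = np.eye(p)                     # toy limiting Gram matrix
S = np.eye(p)                         # toy covariance of W_infty
signs = np.array([1.0, 1.0, 0.0, 0.0, 0.0])   # first p0 coords non-zero
draws = np.array([
    argmin_V(L_mat, rng.multivariate_normal(np.zeros(p), S), lam, signs)
    for _ in range(1000)])
```

With identity $L$, each true-zero coordinate of the minimizer is the soft-thresholded Gaussian $\operatorname{sgn}(W_j)\max(|W_j|-\lambda,0)$, so a positive fraction of the draws is exactly zero there; that point mass is what no Gaussian approximation can capture. In the paper, $\lambda$ is itself the random limit $\hat\Lambda_{\infty,K}$, which adds a further layer of mixing.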
**Bootstrap approximation**
To overcome the intractability of the limiting law, the authors propose a perturbation bootstrap (Das & Lahiri 2019) adapted to the CV setting. For each bootstrap replication, independent weights $G_i$ (e.g., Exp(1)) are drawn and pseudo-responses $y_i^\ast=G_i y_i$ are formed. The same $K$-fold partition is used, and the CV criterion is recomputed to obtain a bootstrap penalty $\hat\lambda_{n,K}^\ast$ and estimator $\hat\beta_n^\ast(\hat\lambda_{n,K}^\ast)$. Theorem 5.1 proves that the bootstrap pivot
$\sqrt{n}\bigl(\hat\beta_n^\ast(\hat\lambda_{n,K}^\ast)-\tilde\beta_n(\hat\lambda_{n,K})\bigr)$
converges in distribution (conditionally on the data) to the same limit as the original estimator, where $\tilde\beta_n(\hat\lambda_{n,K})$ is a linear approximation of $\hat\beta_n(\hat\lambda_{n,K})$. The proof mirrors the earlier arguments: stochastic equicontinuity of the solution path, Dudley's representation, and the Argmin theorem are all re-established in the bootstrap world. Consequently, confidence intervals and hypothesis tests constructed from the bootstrap quantiles are asymptotically valid.
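The bootstrap loop described above can be sketched as follows. This is a toy illustration under stated simplifications: the solver and candidate grid are hypothetical, the statistic is centered at $\hat\beta_n(\hat\lambda_{n,K})$ instead of the linear approximation $\tilde\beta_n(\hat\lambda_{n,K})$ used in Theorem 5.1, and $B$ is far smaller than one would use in practice.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso(X, y, lam, n_iter=150):
    """Cyclic coordinate descent for (2n)^{-1}||y - Xb||^2 + lam ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]
            b[j] = soft_threshold(X[:, j] @ r_j / n, lam) / col_sq[j]
    return b

def cv_fit(X, y, lambdas, folds):
    """Recompute the K-fold CV criterion on the given folds and refit at
    its minimizer (constant factors like 1/(2K) do not move the argmin)."""
    n = X.shape[0]
    H = np.zeros(len(lambdas))
    for m, lam in enumerate(lambdas):
        for f in folds:
            train = np.setdiff1d(np.arange(n), f)
            b = lasso(X[train], y[train], lam)
            H[m] += 0.5 * ((y[f] - X[f] @ b) ** 2).sum()
    lam_hat = lambdas[int(np.argmin(H))]
    return lam_hat, lasso(X, y, lam_hat)

def perturbation_bootstrap(X, y, lambdas, K=5, B=20, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    folds = np.array_split(rng.permutation(n), K)   # one fixed partition
    _, beta_hat = cv_fit(X, y, lambdas, folds)
    roots = np.empty((B, X.shape[1]))
    for b_rep in range(B):
        G = rng.exponential(1.0, size=n)            # Exp(1) weights
        y_star = G * y                              # pseudo-responses
        _, beta_star = cv_fit(X, y_star, lambdas, folds)
        roots[b_rep] = np.sqrt(n) * (beta_star - beta_hat)
    return beta_hat, roots
```

Percentile confidence intervals then follow from the empirical quantiles of `roots`, e.g. `beta_hat - np.percentile(roots, [97.5, 2.5], axis=0) / np.sqrt(n)` for coordinate-wise 95% intervals.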
**Simulation and real data study**
The authors conduct extensive Monte Carlo experiments varying signal-to-noise ratios, sparsity levels, and heteroscedastic error structures. The results show that bootstrap confidence intervals achieve close to nominal coverage (≈95%) across settings, while naive normal approximations based on the Lasso's subgradient conditions severely under-cover. Variable selection performance confirms the theoretical finding: the CV-based Lasso frequently omits true non-zero coefficients, reflecting the lack of VSC.
A real-data illustration (a genomics data set with $p=30$ predictors) demonstrates the practical utility of the bootstrap method. The bootstrap intervals are wider than those obtained from standard asymptotic formulas, reflecting the additional uncertainty introduced by data-dependent penalty selection. The authors argue that this more honest quantification of uncertainty is essential for reliable scientific conclusions.
**Contributions and significance**
- **Theoretical clarification** – The paper rigorously shows that $K$-fold CV yields a penalty that guarantees $n^{1/2}$-consistency but precludes variable-selection consistency. This resolves a gap between the widespread empirical use of CV-selected Lasso and a theoretical literature that traditionally assumes a deterministic penalty.
- **Novel asymptotic distribution** – By establishing stochastic equicontinuity of the Lasso path and applying a new Argmin theorem, the authors derive a non-standard limiting distribution for the CV-based estimator.
- **Bootstrap validity** – The work provides the first proof of bootstrap consistency for a Lasso estimator with a data-dependent penalty, extending perturbation bootstrap theory to a setting where the penalty itself must be recomputed in each bootstrap replication.
- **Practical guidance** – Simulation and real-data results illustrate that bootstrap-based inference is feasible and reliable, offering practitioners a concrete tool for constructing confidence intervals when using the CV-selected Lasso.
In summary, the paper bridges the methodological divide between predictive tuning (K‑fold CV) and inferential rigor (asymptotic theory and bootstrap), delivering both deep theoretical insights and actionable procedures for modern high‑dimensional regression analysis.