New explanations and inference for least angle regression
Efron et al. (2004) introduced least angle regression (LAR) as an algorithm for linear predictions, intended as an alternative to forward selection with connections to penalized regression. However, LAR has remained something of a “black box,” in that some basic behavioral properties of its output are not well understood, including an appropriate termination point for the algorithm. We provide a novel framework for inference with LAR, which also allows LAR to be understood from new perspectives through several newly developed mathematical properties. At the data level, the LAR algorithm can be viewed as estimating a population counterpart “path” that organizes a response mean along regressor variables ordered according to a decreasing series of population “correlation” parameters; these parameters are shown to have meaningful interpretations for explaining variable contributions, whereby zero correlations denote unimportant variables. In the output of LAR, estimates of all non-zero population correlations turn out to have independent normal distributions for use in inference, while estimates of zero-valued population correlations have a certain non-normal joint distribution. These properties help to provide a formal rule for stopping the LAR algorithm. While the standard bootstrap for regression can fail for LAR, a modified bootstrap provides a practical and formally justified tool for interpreting the entrance of variables and quantifying uncertainty in estimation. The LAR inference method is studied through simulation and illustrated with data examples.
💡 Research Summary
Efron et al. (2004) introduced Least Angle Regression (LAR) as an algorithm that moves the fitted response toward the predictor space in a piece‑wise linear fashion, offering an alternative to forward selection and a bridge to Lasso and forward stagewise regression. Despite its popularity, fundamental aspects of LAR have remained opaque: the algorithm’s intrinsic stopping rule, the statistical meaning of the “correlations” it reports, and how to conduct valid inference on its output. Gregory and Nordman address these gaps by developing a population‑level view of LAR, deriving exact distributional results for the quantities LAR estimates, and proposing a modified bootstrap that respects LAR’s unique geometry.
The authors first define a population counterpart Lar(X, µ), where µ is the true mean response. In this setting the algorithm proceeds exactly as in the data‑level version, but the step‑wise inner products between the design columns and the current residual are now genuine population correlations. They call these step correlations C₁, C₂, … and show that, under a “prototypical” path in which a single variable enters at each step, the sequence satisfies C₁ > C₂ > … > C_m > 0, C_{m+1}=0. When C_{m+1}=0 the algorithm can stop because all remaining variables have zero population correlation with the residual and therefore contribute nothing to µ. This provides a formal, data‑driven stopping rule that replaces the ad‑hoc practice of running LAR until all p variables are active.
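The decreasing step-correlation sequence can be seen concretely in the special case of an orthonormal design, where the LAR step correlations reduce to the sorted absolute inner products between the design columns and the mean response. The sketch below (my own illustration, not the authors' code) shows the sequence C₁ > C₂ > … hitting zero exactly at the variables that contribute nothing to µ:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 8, 4
X, _ = np.linalg.qr(rng.standard_normal((n, p)))  # orthonormal columns
beta = np.array([3.0, 1.5, 0.0, 0.0])             # last two variables inactive
mu = X @ beta                                     # true mean response

# With orthonormal X, the LAR step correlations are the sorted |X' mu|.
c = X.T @ mu                     # population "correlations" with the mean
C = np.sort(np.abs(c))[::-1]     # decreasing step-correlation sequence

# Formal stopping point: stop once the next step correlation is zero.
m = int(np.sum(C > 1e-10))       # number of truly contributing variables
print(C)                         # e.g. [3.0, 1.5, ~0, ~0]
print("stop after step", m)
```

In this toy setting the algorithm would stop after two steps, matching the number of variables with non-zero population correlation.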
The central theoretical contribution is a pair of distributional results. For every non‑zero population step correlation, the LAR estimate of that correlation is asymptotically normal and, crucially, independent across different steps. In contrast, estimates of zero‑valued population correlations have a non‑normal joint distribution that cannot be factorized into independent components. These results are proved by expressing the LAR updates as orthogonal projections (a Gram‑Schmidt‑type decomposition) and exploiting the geometry of the equi‑angular direction a_k that the algorithm follows at each step. The independence of the non‑zero estimates enables straightforward construction of confidence intervals for the step correlations and for the contribution coefficients (the β‑type weights) associated with each active variable.
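The equi-angular direction is a standard piece of LAR geometry and is easy to verify numerically. The sketch below (variable names are my own) builds the direction u = A_norm · X_A G⁻¹1 for a set of unit-norm active columns X_A and checks the two defining properties: u has unit length and makes the same inner product, and hence the same angle, with every active column:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 20
X_A = rng.standard_normal((n, 3))
X_A /= np.linalg.norm(X_A, axis=0)        # unit-norm active columns

G = X_A.T @ X_A                           # Gram matrix of the active set
ones = np.ones(3)
w = np.linalg.solve(G, ones)              # G^{-1} 1
A_norm = 1.0 / np.sqrt(ones @ w)          # (1' G^{-1} 1)^{-1/2}
u = X_A @ (A_norm * w)                    # equi-angular unit vector

angles = X_A.T @ u                        # inner product with each column
print(angles)                             # all entries equal A_norm
print(np.linalg.norm(u))                  # unit length
```

Because X_A' u = A_norm · G G⁻¹ 1 = A_norm · 1, the fitted response moves at an equal angle to every active variable, which is the geometric fact the paper's projection arguments exploit.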
A further practical problem tackled in the paper is that the ordinary residual bootstrap fails for LAR. Standard bootstrap draws of (X, y) do not preserve the delicate ordering of variables and the non‑normal joint law of the zero‑correlation estimates, leading to severely biased inference. The authors therefore propose a “modified bootstrap”: residuals from the fitted LAR model are resampled, a new response is formed, and the entire LAR procedure is rerun on each bootstrap replicate. This procedure respects the algorithm’s variable‑entry order and yields bootstrap replicates of the step correlations that accurately reflect their true sampling distribution. Theoretical justification is provided (showing consistency and asymptotic validity), and extensive simulations demonstrate that the modified bootstrap attains nominal coverage for both non‑zero and zero correlation parameters, whereas the naïve bootstrap under‑covers dramatically.
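The resample-residuals-and-rerun loop can be sketched generically. The skeleton below is in the spirit of the modified bootstrap described above but is not the authors' exact scheme: ordinary least squares stands in for a full LAR run, and the percentile intervals are a generic choice.

```python
import numpy as np

rng = np.random.default_rng(2)

n, p, B = 50, 3, 200
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, -1.0, 0.0]) + rng.standard_normal(n)

def fit(X, y):
    # stand-in for rerunning the full LAR procedure on (X, y)
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta_hat = fit(X, y)
resid = y - X @ beta_hat
resid -= resid.mean()                     # centre residuals before resampling

boot = np.empty((B, p))
for b in range(B):
    # form a new response from resampled residuals, rerun the whole procedure
    y_star = X @ beta_hat + rng.choice(resid, size=n, replace=True)
    boot[b] = fit(X, y_star)

# percentile intervals from the bootstrap replicates
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
print(np.round(lo, 2), np.round(hi, 2))
```

The key point the paper makes is that the refitting step must rerun LAR in full on each replicate, so that the variable-entry order and the non-normal law of the zero-correlation estimates are reproduced rather than averaged away.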
Simulation studies cover a range of design matrices (varying collinearity) and signal‑to‑noise ratios in the p < n regime. Results confirm that the normal approximation for non‑zero step correlations is accurate, and that the modified bootstrap confidence intervals have correct empirical coverage. Real‑data applications (including a genomics expression dataset and an economic forecasting example) illustrate how the new inference framework yields interpretable variable‑entry confidence intervals and helps decide when to stop the algorithm, often before all predictors are entered.
The paper also discusses technical subtleties such as “mid‑path ties” (situations where multiple variables become active simultaneously) and notes that while these events have probability zero under continuous responses, they require special handling in theory. Finally, the authors outline how their population‑level perspective and bootstrap methodology could be extended to high‑dimensional settings (p > n), where LAR’s geometry becomes more complex but the same principles should apply.
In sum, Gregory and Nordman re‑conceptualize LAR as a statistical estimator of a population correlation path, derive exact asymptotic distributions for its key outputs, provide a principled stopping rule, and introduce a robust bootstrap scheme for inference. Their work fills a long‑standing methodological void, turning LAR from a heuristic variable‑selection tool into a rigorously inferential procedure.