Adaptive and Stratified Subsampling for High-Dimensional Robust Estimation
We study robust high-dimensional sparse regression under finite-variance heavy-tailed noise, epsilon-contamination, and alpha-mixing dependence via two subsampling estimators: Adaptive Importance Sampling (AIS) and Stratified Subsampling (SS). Under a sub-Gaussian design whose scope is precisely delimited, and finite-variance noise, a subsample of size m achieves the minimax-optimal rate. We close the theory-algorithm gap: Theorem 4.6 applies to AIS at termination conditional on stabilized weights (Proposition 4.1), and SS fits the median-of-means M-estimation framework of Lecué and Lerasle (Proposition 4.3). The de-biasing step is fully specified via the nodewise-Lasso precision estimator under a new sparse-precision assumption, yielding valid coordinate-wise CIs (Theorem 4.14). The alpha-mixing extension uses a calendar-time block protocol that guarantees temporal separation (Theorem 4.12). Empirically, AIS achieves 3.10 times lower error than uniform subsampling at 20% contamination, and 29.5% lower test MSE on Riboflavin (p=4,088, n=71).
💡 Research Summary
The paper tackles the challenging setting of high‑dimensional sparse linear regression (p≫n) where the data are contaminated by heavy‑tailed noise with finite variance, an ε‑contamination adversarial component, and temporal dependence modeled by α‑mixing. Classical full‑sample robust estimators become computationally prohibitive in this regime, motivating the authors to develop two subsampling‑based estimators that retain statistical optimality while drastically reducing computational cost.
Algorithms
- Adaptive Importance Sampling (AIS) – Starting from uniform weights, AIS iteratively draws a subsample of size m according to the current probability vector w^{(t−1)}. On the drawn points it solves a weighted Huber-Lasso problem, then updates the sampling probabilities via an exponential tilt of the loss values (parameter β) and a stabilization step (parameter α) that guarantees every observation retains a probability of at least α/n. After T iterations the final estimate θ̂_m is returned. The algorithm's complexity is O(T·np + T·mp).
- Stratified Subsampling (SS) – Each observation is assigned a Mahalanobis-type distance from the coordinate-wise median. The distances are partitioned into K quantile-based strata; from each stratum a proportional number of points m_k ≈ m·|S_k|/n is drawn uniformly, a Huber-Lasso is fitted on each stratum, and the K resulting estimates are aggregated by a geometric median (geomed). The overall cost is O(np + mK).
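As a minimal sketch of the AIS loop above (not the paper's implementation: the weighted Huber-Lasso solve is replaced by an importance-weighted least-squares placeholder, and the function names and the `beta`/`alpha` defaults are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def ais_weights(losses, w_prev, beta=1.0, alpha=0.1):
    """One AIS weight update: exponential tilt of the per-point losses,
    then a stabilization step flooring every probability at alpha/n."""
    n = len(losses)
    # subtract the min loss for numerical stability; proportionality is unchanged
    tilt = w_prev * np.exp(-beta * (losses - losses.min()))
    tilt /= tilt.sum()
    return (1 - alpha) * tilt + alpha / n  # guarantees w_i >= alpha/n

def ais(X, y, m=40, T=5, beta=1.0, alpha=0.1):
    """Adaptive importance subsampling: draw m points from the current
    weights, refit, re-tilt the weights, repeat T times."""
    n, p = X.shape
    w = np.full(n, 1.0 / n)
    theta = np.zeros(p)
    for _ in range(T):
        idx = rng.choice(n, size=m, replace=True, p=w)
        # placeholder fit: importance-weighted least squares standing in
        # for the paper's weighted Huber-Lasso solver
        sw = 1.0 / np.sqrt(n * w[idx])
        theta, *_ = np.linalg.lstsq(X[idx] * sw[:, None], y[idx] * sw, rcond=None)
        losses = (y - X @ theta) ** 2  # per-point losses on the full sample
        w = ais_weights(losses, w, beta, alpha)
    return theta, w
```

High-loss points (e.g., gross outliers) are exponentially down-weighted, while the α/n floor keeps every observation reachable, which is what the stabilization step in Proposition 4.1 relies on.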
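A corresponding sketch of SS, with the same caveats: the Mahalanobis-type distance uses a diagonal MAD scale as a cheap stand-in for a full covariance, the per-stratum Huber-Lasso is replaced by plain least squares, and the geometric median is computed by Weiszfeld iterations.

```python
import numpy as np

rng = np.random.default_rng(1)

def geomed(points, iters=100, eps=1e-8):
    """Geometric median of row vectors via Weiszfeld iterations."""
    z = points.mean(axis=0)
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(points - z, axis=1), eps)
        z_new = (points / d[:, None]).sum(axis=0) / (1.0 / d).sum()
        if np.linalg.norm(z_new - z) < eps:
            break
        z = z_new
    return z

def stratified_subsample(X, y, m=60, K=3):
    n, p = X.shape
    # Mahalanobis-type distance from the coordinate-wise median,
    # with a diagonal MAD scale as a cheap stand-in for the covariance
    med = np.median(X, axis=0)
    scale = np.median(np.abs(X - med), axis=0) + 1e-12
    dist = np.linalg.norm((X - med) / scale, axis=1)
    edges = np.quantile(dist, np.linspace(0, 1, K + 1))
    fits = []
    for k in range(K):
        # boundary points may fall in two strata; harmless for a sketch
        S_k = np.where((dist >= edges[k]) & (dist <= edges[k + 1]))[0]
        m_k = max(p + 1, int(round(m * len(S_k) / n)))  # proportional allocation
        idx = rng.choice(S_k, size=min(m_k, len(S_k)), replace=False)
        # placeholder fit: plain least squares instead of the per-stratum Huber-Lasso
        th, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        fits.append(th)
    return geomed(np.array(fits))
```

Because the aggregation is a geometric median over the K per-stratum fits, up to ⌊(K−1)/2⌋ strata can be arbitrarily corrupted without destroying the estimate.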
Theoretical Contributions
- Under sub‑Gaussian design (Assumption 1), restricted eigenvalue (Assumption 2), finite‑variance noise (Assumption 3), and bounded sampling probabilities (Assumption 4), Lemma 4.4 establishes a uniform score bound for the weighted empirical gradient, while Lemma 4.5 proves a restricted strong convexity (RSC) condition for the weighted loss when the subsample size satisfies m≥C·(C_0/c_0)^2 s log(p/δ).
- Theorem 4.6 shows that with λ tuned as 4τKc_0·√(log(p/δ)/m), the weighted subsampled Huber-Lasso attains the error bound ‖θ̂_m,q−θ*‖_2 ≤ C·τKc_0·√(s log p/m) with probability 1−2δ. Consequently, a subsample of size m = Ω(s log p) reaches the minimax-optimal rate O(√(s log p/m)).
- Corollary 4.8 provides a proximity result between the subsampled and full‑sample estimators, and Theorem 4.9 supplies a matching minimax lower bound under Gaussian design, confirming optimality up to logarithmic factors.
- Proposition 4.1 bridges AIS to the theoretical framework by proving that, after the stabilization step, the final weight vector q(T) satisfies Assumption 4, so all subsequent theorems apply directly to AIS.
- Proposition 4.3 demonstrates that SS is a special case of the median‑of‑means (MOM) sparse M‑estimator of Lecué & Lerasle (2020). When the number of strata K=O(s log p), the error bound for SS matches that of AIS (Theorem 4.6). The geometric median aggregation tolerates up to ⌊(K−1)/2⌋ corrupted strata, but the authors note that very small stratum sizes (e.g., n_k≤5 in the Riboflavin data) violate the MOM assumptions and cause performance collapse.
- Theorem 4.10 extends the analysis to an ε‑contamination model (mixture (1−ε)P+εQ). The error decomposes into the usual statistical term plus an O(ε) bias term that cannot be eliminated for bounded‑influence estimators. AIS reduces this bias dramatically because the adaptive weights down‑weight corrupted points, as confirmed empirically (error grows roughly as 1.3ε versus 6.9ε for uniform sampling).
- For temporally dependent data, Theorem 4.12 introduces a calendar-time block protocol: blocks of length B are retained and the following B observations are discarded, guaranteeing at least B calendar steps between retained blocks. Using Berbee-Yu coupling, the retained blocks are shown to be approximately independent, allowing the same concentration arguments as in the i.i.d. case. The resulting error bound scales with the effective number of retained blocks M ≈ m/(2B).
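The ε-contamination model of Theorem 4.10 can be simulated with a simple generator (illustrative only: the Student-t degrees of freedom `tail_df` and the outlier shift are arbitrary choices standing in for Q):

```python
import numpy as np

rng = np.random.default_rng(3)

def contaminated_sample(n, p, theta, eps, tail_df=2.5):
    """Draw from the mixture (1 - eps) P + eps Q: inliers follow the sparse
    linear model with heavy-tailed (finite-variance for df > 2) Student-t
    noise, while an eps-fraction is shifted to mimic gross outliers from Q."""
    X = rng.standard_normal((n, p))
    noise = rng.standard_t(tail_df, size=n)  # heavy tails, finite variance
    y = X @ theta + noise
    n_out = int(eps * n)
    out = rng.choice(n, size=n_out, replace=False)
    y[out] += 50.0  # gross outliers standing in for the Q component
    return X, y, out
```

On such data the O(ε) bias term of Theorem 4.10 is unavoidable for bounded-influence estimators; the empirical claim is that AIS shrinks its constant by down-weighting the `out` indices.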
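The calendar-time block protocol of Theorem 4.12 amounts to retaining alternating windows of length B along the time axis; a sketch (function name hypothetical):

```python
import numpy as np

def calendar_blocks(timestamps, B):
    """Keep-one / skip-one block protocol in calendar time: scan the time
    axis in windows of length B, retaining every other window so that any
    two retained blocks are separated by at least B calendar steps."""
    t0, t1 = timestamps.min(), timestamps.max()
    blocks = []
    start, keep = t0, True
    while start <= t1:
        if keep:
            idx = np.where((timestamps >= start) & (timestamps < start + B))[0]
            if idx.size:
                blocks.append(idx)
        keep = not keep
        start += B
    return blocks
```

Because the protocol is defined in calendar time rather than by observation index, irregularly spaced observations still end up with the temporal separation that the Berbee-Yu coupling argument requires.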
Debiasing and Inference
- A new sparse‑precision assumption (Assumption 5) posits that the inverse covariance Ω=Σ⁻¹ is s_0‑sparse, has bounded ℓ_1 norm, and satisfies the irrepresentable condition for the nodewise Lasso.
- The nodewise Lasso is applied to the scaled design x̃_j = x_j/√(nq_j) with tuning μ ≍ √(log p/m), yielding an estimator Θ̂ of the precision matrix.
- The debiased estimator is defined as θ̂_d = θ̂_m,q − Θ̂∇L̂_m,q(θ̂_m,q). Theorem 4.14 proves that, under s log p = o(√m) and s_0 log p = o(m), Θ̂ converges in sup-norm at rate O_p(s_0√(log p/m)) and each active coordinate satisfies √m(θ̂_{d,j}−θ*_j) ⇒ N(0, σ_j²), for an explicit asymptotic variance σ_j² determined by the Huber score and the precision matrix Ω, which yields the coordinate-wise confidence intervals.
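The one-step correction above can be sketched as follows, assuming an already-computed precision estimate Θ̂ (in the paper, from the nodewise Lasso; here simply passed in) and using the Huber score ψ_τ for the gradient; function names are illustrative:

```python
import numpy as np

def huber_score(r, tau):
    """psi_tau: derivative of the Huber loss, clipping residuals at +/- tau."""
    return np.clip(r, -tau, tau)

def debias(X, y, theta_hat, Theta_hat, tau):
    """One-step debiasing: theta_d = theta_hat - Theta_hat @ grad L(theta_hat),
    where the gradient of the Huber loss uses the clipped residuals."""
    n = X.shape[0]
    grad = -X.T @ huber_score(y - X @ theta_hat, tau) / n
    return theta_hat - Theta_hat @ grad
```

As a sanity check on the formula: with τ large enough that no residual is clipped and Θ̂ equal to the exact inverse of the empirical Gram matrix, a single debiasing step from any starting point lands exactly on the ordinary least-squares solution.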