Computationally tractable nonparametric bootstrap of high-dimensional sample covariance matrices


We introduce a new “$(m, mp/n)$ out of $(n, p)$” sampling-with-replacement bootstrap for eigenvalue statistics of high-dimensional sample covariance matrices based on $n$ independent $p$-dimensional random vectors. As it only uses $q = \lfloor mp/n \rfloor$ coordinates of the observations in a subsample of size $m \ll n$ from the original data, it is computationally tractable for large-scale data. In the high-dimensional scenario $p/n \rightarrow c \in (0,\infty)$, this fully nonparametric bootstrap is shown to consistently reproduce the empirical spectral measure if $m/n \rightarrow 0$. If $m^2/n \rightarrow 0$, it correctly approximates the distribution of linear spectral statistics. The crucial component is a suitably defined Representative Subpopulation Condition, which is shown to hold in a large variety of situations. Our proofs are conducted under minimal moment requirements and incorporate delicate results on non-centered quadratic forms and combinatorial trace moment estimates, as well as a conditional bootstrap martingale CLT which may be of independent interest.


💡 Research Summary

The paper addresses a fundamental challenge in high‑dimensional statistics: how to bootstrap eigenvalue‑based functionals of a sample covariance matrix when both the dimension p and the sample size n grow proportionally (p/n → c ∈ (0,∞)). The classical nonparametric bootstrap, which resamples n observations with replacement and recomputes the full p × p covariance matrix, is both computationally prohibitive (O(np² + p³) per replication) and statistically inconsistent in this regime—its limiting spectral distribution (LSD) differs from that of the original matrix, as shown by recent work of El Karoui and Purdom.

To overcome these issues, the authors propose a novel “(m, mp/n) out of (n, p)” bootstrap. One first draws a subsample of size m ≪ n (with replacement) from the original data. Then, for each selected observation, only q = ⌊mp/n⌋ coordinates are randomly chosen, so that the ratio q/m matches the original aspect ratio p/n. The bootstrap covariance matrix is formed from these reduced‑dimension observations. Because the effective dimension‑to‑sample‑size ratio is preserved, the Marčenko–Pastur equation governing the LSD remains valid for the bootstrap sample.
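
A minimal sketch of one such bootstrap replication, following the description above, is given below. The function name mn_bootstrap_cov is ours, and details such as whether the coordinate set is redrawn per observation or fixed once per replication, and whether the sample covariance is centered, are assumptions that should be checked against the paper.

```python
import numpy as np

def mn_bootstrap_cov(Y, m, rng=None):
    """One '(m, mp/n) out of (n, p)' bootstrap replication (sketch).

    Y : (n, p) array, rows are the original observations.
    m : subsample size, m << n.
    Returns a q x q bootstrap sample covariance matrix with
    q = floor(m * p / n), so that q / m matches the original ratio p / n.
    """
    rng = np.random.default_rng(rng)
    n, p = Y.shape
    q = int(m * p // n)

    # Step 1: draw m observations with replacement from the original sample.
    rows = rng.integers(0, n, size=m)

    S = np.zeros((q, q))
    for i in rows:
        # Step 2: keep only q randomly chosen coordinates of this observation.
        coords = rng.choice(p, size=q, replace=False)
        x = Y[i, coords]
        S += np.outer(x, x)   # uncentered, as is common in this random-matrix setting

    return S / m
```

Each replication costs on the order of mq² operations to form S plus q³ for an eigendecomposition, which matches the complexity discussed later in the summary.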

The theoretical contribution rests on two main results. (1) Consistency of the empirical spectral measure: if m/n → 0, the empirical spectral distribution of the bootstrap covariance matrix converges weakly to the same limit as that of the original matrix. (2) Central limit theorem for linear spectral statistics (LSS): if m²/n → 0, any linear statistic L_n = ∑_{i=1}^p f(λ̂_i) with a sufficiently smooth test function f satisfies a conditional bootstrap martingale CLT, yielding the same Gaussian limit as the original LSS. The required assumptions are minimal: the data follow the standard random‑matrix model Y_i = A_n X_i, where X_i has i.i.d. entries with zero mean, unit variance, and a finite fourth moment (the latter needed for the LSS CLT). No higher‑order moment conditions or explicit knowledge of the population spectral distribution are required.
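
For concreteness, a linear spectral statistic is simply a sum of a test function over the eigenvalues. The sketch below (illustrative only, with a hypothetical helper name) shows how the trace and log‑determinant statistics used in the simulations arise as special cases:

```python
import numpy as np

def linear_spectral_statistic(S, f):
    """L = sum_i f(lambda_i) for a symmetric positive semi-definite matrix S."""
    eigvals = np.linalg.eigvalsh(S)
    return np.sum(f(eigvals))

# Example test functions:
#   trace:            f = lambda x: x
#   log-determinant:  f = np.log   (requires strictly positive eigenvalues)
```

Repeating this computation over many replications of the bootstrap covariance matrix yields the bootstrap distribution of the LSS whose Gaussian limit the theorem describes.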

A key device is the Representative Subpopulation Condition (RSC). This condition asserts that the randomly selected q‑dimensional subvector of each observation has a covariance matrix whose spectral distribution approximates that of the full p‑dimensional covariance Σ_n. The authors prove that RSC holds in a wide variety of practically relevant settings, including (i) diagonal Σ_n, (ii) block‑diagonal structures, (iii) low‑rank plus noise models, and (iv) situations where the eigenvalues of Σ_n follow a stable empirical distribution. Thus the method is fully non‑parametric and does not require estimating Σ_n or its eigenvalues.
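
The diagonal case (i) gives the simplest intuition for why the RSC can hold: if Σ_n is diagonal, a uniformly chosen set of q coordinates yields a diagonal q × q principal submatrix whose eigenvalues are a simple random subsample of the diagonal entries, so its spectral distribution concentrates around that of Σ_n. A toy numerical check of this (purely illustrative, with made-up parameters, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 40_000, 4_000

# Illustrative diagonal population covariance with a two-point spectrum.
diag = rng.choice([1.0, 4.0], size=p, p=[0.7, 0.3])

# Random coordinate selection -> the q x q principal submatrix is again diagonal,
# and its eigenvalues are exactly the selected diagonal entries.
coords = rng.choice(p, size=q, replace=False)
sub = diag[coords]

# Compare the two spectral distributions through their first two moments.
print("full p-dim spectrum :", diag.mean(), (diag**2).mean())
print("random q-dim subset :", sub.mean(), (sub**2).mean())
```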

The proofs combine several sophisticated tools: (i) non‑centered quadratic form bounds, (ii) combinatorial trace‑moment estimates to control spectral moments of the reduced matrices, and (iii) a conditional bootstrap martingale CLT that treats the bootstrap resampling as a martingale difference sequence. The analysis shows that the randomness introduced by the coordinate selection does not disturb the limiting Stieltjes transform, while the subsampling size m controls the bias‑variance trade‑off: m = o(n) guarantees LSD consistency, and m = o(√n) guarantees LSS consistency.

Computationally, each bootstrap replication requires O(m q² + q³) operations, which, because q ≈ mp/n ≪ p and m ≪ n, is orders of magnitude cheaper than the O(np² + p³) cost of a full‑data replication. In the authors’ simulations (n = 80 000, p = 40 000, m = 8 000, q = 4 000) the proposed bootstrap reproduces the Marčenko–Pastur density far more accurately than the classical bootstrap, and the empirical distribution of LSS (trace, log‑determinant) matches the theoretical Gaussian limit.
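
Plugging the quoted simulation sizes into the two complexity bounds gives a back-of-the-envelope sense of the per-replication saving (order-of-magnitude flop counts, not measured timings):

```python
# Rough per-replication cost comparison for the simulation sizes quoted above.
n, p = 80_000, 40_000
m, q = 8_000, 4_000

classical = n * p**2 + p**3      # full-data bootstrap: p x p covariance + eigendecomposition
proposed  = m * q**2 + q**3      # (m, mp/n) out of (n, p) bootstrap on the reduced q x q matrix

print(f"classical ~ {classical:.1e} operations")   # ~ 1.9e14
print(f"proposed  ~ {proposed:.1e} operations")    # ~ 1.9e11
print(f"ratio     ~ {classical / proposed:.0f}x")  # ~ 1000x
```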

In summary, the paper delivers a computationally tractable, fully nonparametric bootstrap method for high‑dimensional covariance matrices. By preserving the aspect ratio through joint subsampling of observations and coordinates and establishing the Representative Subpopulation Condition, it achieves both statistical consistency (for LSD and LSS) and dramatic reductions in computational cost. The approach opens the door to reliable bootstrap inference in modern big‑data applications such as high‑dimensional PCA, factor analysis, and covariance‑based hypothesis testing, without imposing restrictive moment or structural assumptions on the underlying population.

