Beyond Low Rank: Fast Low-Rank + Diagonal Decomposition with a Spectral Approach
Low-rank plus diagonal (LRPD) decompositions provide a powerful structural model for large covariance matrices, simultaneously capturing the global shared factors and localized corrections that arise in covariance estimation, factor analysis, and large-scale kernel learning. We introduce an alternating low-rank-then-diagonal (Alt) algorithm that provably reduces approximation error and significantly outperforms gradient descent while remaining cheaper than majorization-minimization methods~\cite{sun2016majorization}. To scale to large matrices, we develop a randomized LRPD variant that combines fixed-rank Nyström sketching~\cite{tropp2017fixed} for the low-rank component with Diag++ stochastic diagonal estimation~\cite{baston2022stochastic}. This hybrid algorithm achieves machine-precision decomposition error using far fewer matrix-vector products than the ambient dimension, and comes with rigorous non-asymptotic error bounds. On synthetic data it exactly recovers LRPD-structured matrices with high efficiency, and on real-world S&P 500 stock-return covariances, whose spectra decay slowly and exhibit strong sector structure, it achieves substantially lower error than pure low-rank approximations.
💡 Research Summary
The paper tackles the problem of approximating large symmetric positive‑semidefinite matrices—most notably covariance matrices—by a sum of a low‑rank component and a diagonal correction (LRPD). While low‑rank approximations capture global shared variation, the diagonal term accounts for variable‑specific variance that cannot be explained by a few latent factors. Existing approaches such as Stein’s scalar‑multiple diagonal, majorization‑minimization, and convex relaxations either assume restrictive diagonal structures or involve expensive iterative schemes that lack strong convergence guarantees.
The authors first present a spectral construction: given Σ=VΛVᵀ, they keep the top‑k eigenpairs, form Uₖ=VₖΛₖ^{1/2}, and define a diagonal matrix Dₖ=diag(Σ−UₖUₖᵀ). They prove (Proposition 2.1) that the residual Rₖ=Σ−UₖUₖᵀ−Dₖ has operator norm at most λ_{k+1}, and strictly smaller than λ_{k+1} when λ_{k+1}>0. This shows that a naïve diagonal subtraction (i.e., setting D=diag(Σ) before the low‑rank step) can destroy positive‑semidefiniteness, whereas the proposed Dₖ has entrywise non‑negative entries and yields a provably better approximation.
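The spectral construction above fits in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function name `spectral_lrpd` and the planted toy Σ are our own, and the final check verifies the Proposition 2.1 bound ∥Rₖ∥₂ ≤ λ_{k+1}.

```python
import numpy as np

def spectral_lrpd(Sigma, k):
    """One-shot spectral LRPD: U_k from the top-k eigenpairs of Sigma,
    then D_k = diag(Sigma - U_k U_k^T)."""
    lam, V = np.linalg.eigh(Sigma)            # Sigma = V diag(lam) V^T
    idx = np.argsort(lam)[::-1][:k]           # indices of the top-k eigenvalues
    Uk = V[:, idx] * np.sqrt(np.maximum(lam[idx], 0.0))  # U_k = V_k Lambda_k^{1/2}
    Dk = np.diag(np.diag(Sigma - Uk @ Uk.T))  # entrywise non-negative since
    return Uk, Dk                             # Sigma - U_k U_k^T is PSD

# Toy example: planted low-rank-plus-diagonal covariance (our own test data).
rng = np.random.default_rng(0)
n, k = 50, 3
G = rng.standard_normal((n, k))
Sigma = G @ G.T + np.diag(rng.uniform(0.5, 1.5, n))

Uk, Dk = spectral_lrpd(Sigma, k)
R = Sigma - Uk @ Uk.T - Dk                    # residual R_k
lam_desc = np.sort(np.linalg.eigvalsh(Sigma))[::-1]
# Proposition 2.1: ||R_k||_2 <= lambda_{k+1}
print(np.linalg.norm(R, 2), "<=", lam_desc[k])
```

By construction the residual has an exactly zero diagonal, so all remaining error lives in the off-diagonal entries that a diagonal correction cannot absorb.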
Motivated by this insight, they introduce the Alternating Low‑Rank‑then‑Diagonal (Alt) algorithm. Each iteration consists of: (i) fixing the current diagonal D, computing the residual R=A−D, and extracting the top‑k eigenvectors of R (the Eckart‑Young‑Mirsky theorem guarantees this is the optimal rank‑k approximation in both Frobenius and spectral norms); (ii) updating the diagonal by setting D=diag(A−U Uᵀ). The authors prove monotonic decrease of the Frobenius objective and, more importantly, a spectral‑gap‑free contraction bound on the diagonal error ∥Δₜ∥₂ (Theorem 2.2). The proof leverages Weyl’s inequalities and the Davis‑Kahan sin Θ theorem, showing that even when eigenvalues are clustered the algorithm still contracts the diagonal error. Remark 2.2 highlights that, unlike gradient descent on the factor U (which suffers from rotational degeneracy), Alt directly computes the eigenspace, making the low‑rank factor uniquely determined up to eigenvector ordering.
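The two alternating steps can be sketched directly from the description above. This is a minimal NumPy rendering under our own naming assumptions (`alt_lrpd`, the iteration count, and the planted test matrix are ours); the paper's implementation may differ in stopping criteria and eigen-solvers.

```python
import numpy as np

def alt_lrpd(A, k, iters=50):
    """Alternating Low-Rank-then-Diagonal (Alt):
    (i) best rank-k approximation of A - D (Eckart-Young-Mirsky),
    (ii) diagonal refit D = diag(A - U U^T)."""
    n = A.shape[0]
    d = np.zeros(n)                           # initial diagonal estimate D = 0
    for _ in range(iters):
        # (i) top-k eigenpairs of the residual A - D
        lam, V = np.linalg.eigh(A - np.diag(d))
        idx = np.argsort(lam)[::-1][:k]
        U = V[:, idx] * np.sqrt(np.maximum(lam[idx], 0.0))
        # (ii) refit the diagonal to the new low-rank factor
        d = np.diag(A - U @ U.T)
    return U, np.diag(d)

# Toy check: exact recovery on a planted LRPD matrix (our own test data).
rng = np.random.default_rng(0)
n, k = 60, 4
G = rng.standard_normal((n, k))
D_true = np.diag(rng.uniform(0.5, 1.5, n))
A = G @ G.T + D_true

U, D = alt_lrpd(A, k)
err = np.linalg.norm(A - U @ U.T - D, "fro")
print("Frobenius residual:", err)
```

On this well-separated example the contraction described in Theorem 2.2 drives the residual to numerical zero within a few dozen iterations; note that, as Remark 2.2 points out, each step computes the eigenspace directly rather than descending on an over-parametrized factor U.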
To scale the method to settings where only matrix‑vector products with A are feasible, the authors develop a randomized variant. The low‑rank part is approximated via a fixed‑rank Nyström sketch: they draw a random test matrix Ω with ℓ columns, form the sketch C=AΩ, and compute a small eigendecomposition of the core matrix W=ΩᵀC=ΩᵀAΩ to obtain an approximate factor Û. For the diagonal, they employ Diag++, a stochastic estimator that uses Hutchinson‑type probes to estimate diag(A−ÛÛᵀ) with unbiasedness and controllable variance. The combined algorithm requires O(kℓ) matrix‑vector products and O(kℓ) memory, yet achieves machine‑precision error under modest ℓ (often ℓ≈2k). Non‑asymptotic error bounds are derived by separately bounding the Nyström approximation error (which decays as O(√(k/ℓ))) and the Diag++ variance (O(1/√s) with s probes), then combining them via the triangle inequality.
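A matrix-free version of this pipeline can be sketched as follows. Two hedges: the function and variable names are ours, and for the diagonal we substitute a plain Hutchinson estimator for Diag++ (Diag++ additionally deflates the low-rank part before probing, which reduces variance; we omit that refinement for brevity). The sketch touches A only through `matvec` and uses ℓ + s matrix-vector products.

```python
import numpy as np

def randomized_lrpd(matvec, n, k, ell, s, rng):
    """Matrix-free LRPD: fixed-rank Nystrom sketch for the low-rank factor,
    Hutchinson probes (simplified stand-in for Diag++) for the diagonal."""
    # --- Nystrom sketch: C = A*Omega, core W = Omega^T A Omega ---
    Omega = rng.standard_normal((n, ell))
    C = matvec(Omega)                         # ell matrix-vector products
    W = 0.5 * ((Omega.T @ C) + (Omega.T @ C).T)
    lw, Vw = np.linalg.eigh(W)
    tol = 1e-12 * max(lw.max(), 1.0)
    inv_sqrt = np.where(lw > tol, 1.0 / np.sqrt(np.maximum(lw, tol)), 0.0)
    F = C @ (Vw * inv_sqrt) @ Vw.T            # A ~= F F^T (Nystrom approximation)
    P, svals, _ = np.linalg.svd(F, full_matrices=False)
    Uhat = P[:, :k] * svals[:k]               # rank-k truncation of the sketch
    # --- Hutchinson estimate of diag(A - Uhat Uhat^T), s Rademacher probes ---
    d = np.zeros(n)
    for _ in range(s):
        z = rng.choice([-1.0, 1.0], size=n)
        Mz = matvec(z[:, None])[:, 0] - Uhat @ (Uhat.T @ z)
        d += z * Mz                           # E[z * (Mz)] = diag(M)
    return Uhat, d / s

# Toy check on a planted LRPD matrix, accessed only via matvecs (our test data).
rng = np.random.default_rng(1)
n, k = 100, 5
G = rng.standard_normal((n, k))
A = G @ G.T + np.diag(rng.uniform(0.1, 1.0, n))

Uhat, d_hat = randomized_lrpd(lambda X: A @ X, n, k, ell=2 * k, s=30, rng=rng)
err_lr = np.linalg.norm(A - Uhat @ Uhat.T, "fro")
err_lrpd = np.linalg.norm(A - Uhat @ Uhat.T - np.diag(d_hat), "fro")
print(err_lrpd, "<", err_lr)
```

Even with this unrefined diagonal estimator, adding the estimated diagonal to the ℓ≈2k Nyström factor cuts the residual well below the pure low-rank error, which is the qualitative effect the paper's bounds quantify.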
Experimental validation proceeds in two stages. First, synthetic matrices of size n=150 with true rank k=5 and random diagonal entries in