A Random Matrix Theory of Masked Self-Supervised Regression

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original paper viewer below or the original arXiv source.

In the era of transformer models, masked self-supervised learning (SSL) has become a foundational training paradigm. A defining feature of masked SSL is that training aggregates predictions across many masking patterns, giving rise to a joint, matrix-valued predictor rather than a single vector-valued estimator. This object encodes how coordinates condition on one another and poses new analytical challenges. We develop a precise high-dimensional analysis of masked modeling objectives in the proportional regime, where the number of samples scales with the ambient dimension. Our results provide explicit expressions for the generalization error and characterize the spectral structure of the learned predictor, revealing how masked modeling extracts structure from data. For spiked covariance models, we show that the joint predictor undergoes a Baik–Ben Arous–Péché (BBP)-type phase transition, identifying when masked SSL begins to recover latent signals. Finally, we identify structured regimes in which masked self-supervised learning provably outperforms PCA, highlighting potential advantages of SSL objectives over classical unsupervised methods.


💡 Research Summary

This paper provides a rigorous high‑dimensional analysis of masked self‑supervised learning (SSL) in its simplest linear form, which the authors call masked self‑supervised regression (SSR). The setting is as follows: given n real‑valued sequences of length d (the “tokens”), a model is trained to predict each coordinate k from the remaining d − 1 coordinates using ridge regression with regularization λ, while explicitly forbidding the use of the target coordinate itself ((â_k)_k = 0). Collecting the d optimal coefficient vectors â_k as columns yields a d × d matrix Â, the “self‑supervised ridge matrix”. By construction diag(Â) = 0, so Â can be interpreted as a learned attention map that captures how each token conditions on all others.
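The per-coordinate construction described above can be sketched directly: fit d ridge regressions, each predicting one coordinate from the rest, and stack the coefficient vectors as columns with a zero diagonal. This is a minimal NumPy illustration assuming standard Gaussian toy data (the paper allows a general population covariance Σ); the variable names are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 500, 8, 0.3          # samples, dimension, ridge penalty
X = rng.standard_normal((n, d))  # toy isotropic data for illustration

# Fit d ridge regressions, each predicting coordinate k from the others
A_hat = np.zeros((d, d))
for k in range(d):
    idx = [j for j in range(d) if j != k]
    Xk, y = X[:, idx], X[:, k]
    # ridge normal equations restricted to the remaining d - 1 coordinates
    b = np.linalg.solve(Xk.T @ Xk / n + lam * np.eye(d - 1),
                        Xk.T @ y / n)
    A_hat[idx, k] = b  # column k; A_hat[k, k] stays 0 by construction

print(np.allclose(np.diag(A_hat), 0.0))  # True: zero-diagonal constraint holds
```

Forbidding the target coordinate is what makes the problem self-supervised rather than trivially solvable by the identity map.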

The first technical contribution is a compact closed‑form expression for Â:  Â = I − Q(λ) Λ, where Q(λ) = (Σ̂ + λI)⁻¹ is the resolvent of the empirical covariance Σ̂ = (1/n)XᵀX, and Λ = diag(Q(λ))⁻¹ is a diagonal matrix built from the inverse of the diagonal entries of Q(λ). This representation shows that the entire predictor depends only on the spectrum of Σ̂, but in a non‑standard way because Q(λ) and Λ do not commute. Consequently, Â is generally non‑symmetric, yet it shares the same eigenvalues as the symmetric matrix I − Λ^{1/2} Q(λ) Λ^{1/2}, guaranteeing a real spectrum.
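Both claims are easy to check numerically: column k of I − Q(λ)Λ reproduces the constrained ridge solution, and conjugating by Λ^{1/2} yields a symmetric matrix with the same spectrum. A hedged sketch, again on toy isotropic data with our own variable names:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 300, 10, 0.5
X = rng.standard_normal((n, d))
Sigma_hat = X.T @ X / n

Q = np.linalg.inv(Sigma_hat + lam * np.eye(d))   # resolvent Q(lambda)
Lam = np.diag(1.0 / np.diag(Q))                  # Lambda = diag(Q)^{-1}
A = np.eye(d) - Q @ Lam                          # closed-form predictor

# Column k agrees with the ridge fit of coordinate k on the other coordinates
k = 3
idx = [j for j in range(d) if j != k]
b = np.linalg.solve(X[:, idx].T @ X[:, idx] / n + lam * np.eye(d - 1),
                    X[:, idx].T @ X[:, k] / n)
print(np.allclose(A[idx, k], b))  # True

# A is non-symmetric, but similar to a symmetric matrix, so its spectrum is real
L_half = np.sqrt(Lam)            # Lambda^{1/2} (entrywise sqrt of a diagonal matrix)
Sym = np.eye(d) - L_half @ Q @ L_half
print(np.allclose(np.sort(np.linalg.eigvals(A).real),
                  np.sort(np.linalg.eigvalsh(Sym))))  # True
```

The similarity is immediate: Λ^{1/2} (I − QΛ) Λ^{−1/2} = I − Λ^{1/2} Q Λ^{1/2}, which is symmetric since Q and Λ^{1/2} are.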

The authors then study the proportional asymptotic regime where n, d → ∞ with n/d → α ∈ (0, ∞). Under standard moment assumptions on the data matrix X = Z Σ^{1/2} (Z has i.i.d. zero‑mean, unit‑variance entries with bounded 4 + ε moments), they prove a deterministic equivalent for Λ: as dimensions grow, Λ converges to a deterministic diagonal matrix Λ̂.
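The paper's explicit expression for Λ̂ is not reproduced here, but the phenomenon behind it, concentration of the diagonal entries of Q(λ) in the proportional regime, can be illustrated empirically. The sketch below (our own construction, assuming isotropic data with n = αd) shows that the relative spread of diag(Q) shrinks as d grows, so Λ is asymptotically deterministic:

```python
import numpy as np

def diag_spread(d, alpha=2.0, lam=0.5, seed=0):
    """Relative spread of diag(Q(lambda)) for isotropic data with n = alpha * d."""
    rng = np.random.default_rng(seed)
    n = int(alpha * d)
    X = rng.standard_normal((n, d))
    Q = np.linalg.inv(X.T @ X / n + lam * np.eye(d))
    dq = np.diag(Q)
    return dq.std() / dq.mean()

# Fluctuations of Lambda's entries shrink as d grows (proportional regime),
# consistent with convergence to a deterministic equivalent
small, large = diag_spread(50), diag_spread(400)
print(small > large)
```

For general Σ the limiting diagonal entries vary across coordinates, which is why Λ̂ is diagonal but not a multiple of the identity.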

