Scaling Laws and Spectra of Shallow Neural Networks in the Feature Learning Regime
Neural scaling laws underlie many of the recent advances in deep learning, yet their theoretical understanding remains largely confined to linear models. In this work, we present a systematic analysis of scaling laws for quadratic and diagonal neural networks in the feature learning regime. Leveraging connections with matrix compressed sensing and LASSO, we derive a detailed phase diagram for the scaling exponents of the excess risk as a function of sample complexity and weight decay. This analysis uncovers crossovers between distinct scaling regimes and plateau behaviors, mirroring phenomena widely reported in the empirical neural scaling literature. Furthermore, we establish a precise link between these regimes and the spectral properties of the trained network weights, which we characterize in detail. As a consequence, we provide a theoretical validation of recent empirical observations connecting the emergence of power-law tails in the weight spectrum with network generalization performance, yielding an interpretation from first principles.
💡 Research Summary
This paper provides a rigorous theoretical analysis of neural scaling laws for shallow two‑layer networks operating in the feature‑learning regime, moving beyond the extensively studied lazy (kernel) setting. The authors consider two specific architectures: (i) a diagonal linear network where the first‑layer weight matrix is diagonal and the activation is linear, and (ii) a quadratic‑activation network whose output can be expressed as a trace of a rank‑p matrix built from the first‑layer weights. By re‑parameterizing the diagonal network, the empirical risk minimization (ERM) problem with ℓ₂ weight decay is shown to be exactly equivalent to a LASSO regression on an effective weight vector θ = a⊙w/√d. Similarly, the quadratic network maps to a low‑rank matrix compressed‑sensing problem with nuclear‑norm regularization on the symmetric matrix S = WᵀW/√(pd).
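The ℓ₂-to-ℓ₁ (and ℓ₂-to-nuclear-norm) mechanism behind these equivalences is a classical factorization identity, sketched here using the summary's reparameterization; the exact constants absorbed into the regularization strength are an assumption.

```latex
% Per coordinate, with \theta_i = a_i w_i / \sqrt{d} (so a_i w_i = \sqrt{d}\,\theta_i):
\frac{\lambda}{2}\left(a_i^2 + w_i^2\right)
  \;\ge\; \lambda\,\lvert a_i w_i\rvert
  \;=\; \lambda\sqrt{d}\,\lvert\theta_i\rvert,
\qquad \text{with equality iff } \lvert a_i\rvert = \lvert w_i\rvert.
```

Minimizing over all factorizations of a fixed θ therefore converts ℓ₂ weight decay into an ℓ₁ penalty proportional to ‖θ‖₁, i.e., a LASSO. The quadratic case is analogous: S = WᵀW/√(pd) is positive semidefinite, so ‖W‖²_F = tr(WᵀW) = √(pd)·tr S, and for a PSD matrix the trace equals the nuclear norm ‖S‖₊, yielding nuclear-norm regularization.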
The target function is generated by a teacher network of the same architecture, with the effective coefficients (θ* for the diagonal case, eigenvalues of S* for the quadratic case) drawn from a heavy‑tailed power‑law distribution characterized by an exponent γ > ½. This “quasi‑sparse” assumption captures realistic signals whose Fourier or wavelet coefficients decay polynomially.
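One way to picture the quasi-sparse teacher assumption is to sample an effective coefficient vector with power-law tails. The sketch below is illustrative only: the symmetrized Pareto (Lomax) parameterization, the tie between its shape parameter and γ, and the chosen sizes are assumptions, since the summary only specifies a heavy-tailed law with exponent γ > ½.

```python
# Illustrative sampling of a "quasi-sparse" teacher vector theta*.
# ASSUMPTION: a symmetrized Pareto (Lomax) law with shape tied to gamma;
# the paper's exact parameterization may differ.
import numpy as np

rng = np.random.default_rng(0)
d, gamma = 1000, 0.75                 # dimension and tail exponent (assumed values)

magnitudes = rng.pareto(2.0 * gamma, size=d)   # heavy-tailed magnitudes
signs = rng.choice([-1.0, 1.0], size=d)        # symmetric random signs
theta_star = signs * magnitudes

# Heavy tails: a few coordinates dominate, the bulk is small,
# mimicking polynomially decaying Fourier/wavelet coefficients.
tail_ratio = np.abs(theta_star).max() / np.median(np.abs(theta_star))
```

A handful of very large coordinates atop a sea of small ones is exactly the "quasi-sparse" structure that makes ℓ₁-type regularization effective.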
Using Approximate Message Passing (AMP) and its state‑evolution (SE) equations, the authors derive deterministic characterizations of the excess risk in the high‑dimensional limit, from which the scaling exponents and the phase diagram in sample complexity and weight decay follow.
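To make the AMP machinery concrete, here is a minimal soft-thresholding AMP iteration for the LASSO problem that the diagonal network reduces to. This is a sketch, not the authors' exact algorithm: the Gaussian design, the threshold schedule (twice the estimated residual level), and the problem sizes are all assumptions.

```python
# Minimal soft-thresholding AMP for a LASSO-type sparse recovery problem.
# ASSUMPTIONS: i.i.d. Gaussian design, noiseless measurements, and an
# ad-hoc threshold schedule; the paper's setting and tuning may differ.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 250, 500, 20                  # samples, dimension, sparsity (assumed)

A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, d))  # unit-norm columns in expectation
x_true = np.zeros(d)
support = rng.choice(d, size=k, replace=False)
x_true[support] = rng.normal(size=k)
y = A @ x_true                           # noiseless measurements

def soft(v, tau):
    """Soft-thresholding: the proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

x = np.zeros(d)
z = y.copy()
for _ in range(100):
    tau = 2.0 * np.sqrt(np.mean(z ** 2))      # threshold ~ current effective noise
    x_new = soft(x + A.T @ z, tau)            # denoising step
    b = np.count_nonzero(x_new) / n           # Onsager correction coefficient
    z = y - A @ x_new + b * z                 # residual with memory term
    x = x_new

rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
```

The Onsager term `b * z` is what distinguishes AMP from plain iterative soft thresholding: it keeps the effective noise in `x + A.T @ z` approximately Gaussian, which is precisely what lets state evolution track the per-iteration risk with deterministic scalar equations.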