Full-Batch Gradient Descent Outperforms One-Pass SGD: Sample Complexity Separation in Single-Index Learning
It is folklore that reusing training data more than once can improve the statistical efficiency of gradient-based learning. However, beyond linear regression, the theoretical advantage of full-batch gradient descent (GD, which always reuses all the data) over one-pass stochastic gradient descent (online SGD, which uses each data point only once) remains unclear. In this work, we consider learning a $d$-dimensional single-index model with a quadratic activation, for which it is known that one-pass SGD requires $n\gtrsim d\log d$ samples to achieve weak recovery. We first show that this $\log d$ factor in the sample complexity persists for full-batch spherical GD on the correlation loss; however, by simply truncating the activation, full-batch GD exhibits a favorable optimization landscape at $n \simeq d$ samples, thereby outperforming one-pass SGD (with the same activation) in statistical efficiency. We complement this result with a trajectory analysis of full-batch GD on the squared loss from small initialization, showing that $n \gtrsim d$ samples and $T \gtrsim\log d$ gradient steps suffice to achieve strong (exact) recovery.
💡 Research Summary
This paper investigates the statistical advantage of reusing training data in gradient‑based learning beyond linear regression, focusing on a d‑dimensional Gaussian single‑index model with a quadratic link function. The authors compare full‑batch gradient descent (GD), which repeatedly accesses the entire dataset, with one‑pass stochastic gradient descent (SGD), which sees each sample only once. For the standard quadratic activation σ(z)=z², they first show that full‑batch spherical GD on the correlation loss suffers the same logarithmic sample‑complexity barrier as one‑pass SGD: when n≪d log d, the iterates converge to a direction orthogonal to the true parameter θ⋆, so even weak recovery fails. This negative result follows from a spectral analysis of the empirical Hessian A⋆, which behaves like a spiked random matrix whose leading eigenvalue remains buried in the bulk unless n exceeds d log d.
The key positive contribution comes from a simple modification of the activation: truncating the quadratic nonlinearity, i.e., σ(z)=min{z²,M}. Under this truncated link, the authors prove a uniform BBP phase transition for the Hessian along the GD trajectory. When the sample‑to‑dimension ratio δ=n/d surpasses a constant threshold, the top eigenvalue separates from the noise bulk, and the associated eigenvector aligns positively with θ⋆. By invoking the stable‑manifold theorem, they show that spherical GD converges to a non‑trivial stationary point with overlap Θ(1), establishing weak recovery with only n≳d samples. Empirically, the required δ is essentially independent of d, confirming the removal of the log d factor.
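The spherical GD dynamics on the correlation loss with the truncated link can be sketched in a few lines of numpy. This is an illustration under simplifying assumptions, not the paper's algorithm verbatim: the truncation level M, step size, and iteration count are arbitrary, and for the demo the labels are also passed through the truncated link so that θ⋆ remains a stationary direction.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, M, eta, T = 200, 800, 9.0, 0.5, 500     # illustrative parameter choices
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)
X = rng.standard_normal((n, d))
y = np.minimum((X @ theta_star) ** 2, M)       # labels through the truncated link (simplification)

def grad_corr(theta):
    """Euclidean gradient of the correlation loss -(1/n) sum_i y_i * min(z_i^2, M)."""
    z = X @ theta
    g = 2.0 * z * (z ** 2 < M)                 # d/dz min(z^2, M): 2z below the cap, 0 above
    return -(X.T @ (y * g)) / n

theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)                 # random start on the unit sphere
for _ in range(T):
    g = grad_corr(theta)
    g -= (g @ theta) * theta                   # project onto the tangent space of the sphere
    theta -= eta * g
    theta /= np.linalg.norm(theta)             # retract back to the unit sphere

print(f"final overlap |<theta, theta_star>|: {abs(theta @ theta_star):.2f}")
```

Tracking the overlap across runs at a fixed ratio δ = n/d but growing d is the natural way to reproduce the paper's empirical claim that the required δ is essentially dimension‑free once the activation is truncated.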
To address strong (exact) recovery, the paper switches to the squared loss ℓ(θ)=(y−σ(⟨x,θ⟩))² and studies Euclidean full‑batch GD from a small random initialization. With the same truncated quadratic activation, the loss landscape becomes globally benign: the Hessian remains uniformly positive‑definite along the trajectory, eliminating the non‑convex traps that plague the untruncated case. The authors prove that after T ≳ log d gradient steps, the iterate satisfies ‖θ_T−θ⋆‖→0 (up to the sign ambiguity inherent to the even link), i.e., strong recovery is achieved at the information‑theoretically optimal sample size n ≳ d. This is the first result showing that, in the proportional regime n ≍ d, full‑batch GD attains exact recovery without algorithmic tricks or loss modifications.
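The squared‑loss trajectory from small initialization can likewise be simulated directly. The sketch below is an illustration under assumed parameter values (step size, truncation level, and horizon are not taken from the paper, and the labels are again passed through the truncated link for simplicity); since the link is even, recovery is measured up to sign.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, M, eta, T = 100, 400, 9.0, 0.02, 3000    # illustrative parameter choices
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)
X = rng.standard_normal((n, d))
y = np.minimum((X @ theta_star) ** 2, M)        # labels through the truncated link (simplification)

sigma = lambda z: np.minimum(z ** 2, M)         # truncated quadratic activation
dsigma = lambda z: 2.0 * z * (z ** 2 < M)       # its (a.e.) derivative

theta = 1e-3 * rng.standard_normal(d)           # small random initialization
for _ in range(T):
    z = X @ theta
    # gradient of the empirical squared loss (1/n) sum_i (y_i - sigma(z_i))^2
    grad = -(2.0 / n) * X.T @ ((y - sigma(z)) * dsigma(z))
    theta -= eta * grad

# sigma is even, so theta_star is identified only up to sign
err = min(np.linalg.norm(theta - theta_star), np.linalg.norm(theta + theta_star))
print(f"recovery error (up to sign): {err:.3f}")
```

The early iterations behave like power iteration on the spiked matrix (the gradient at small θ is approximately linear in θ), which is the mechanism behind the O(log d) escape time from the small initialization.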
Overall, the paper delivers three main insights: (1) for even link functions, full‑batch GD does not automatically beat one‑pass SGD; (2) a modest truncation of the activation creates a favorable spectral structure that removes the log d barrier, allowing full‑batch GD to reach the optimal n ≈ d sample complexity; (3) under the squared loss with small initialization, full‑batch GD attains strong recovery in O(log d) iterations. The analysis blends random matrix theory (the BBP transition), dynamical‑systems tools (the stable‑manifold theorem), and gradient‑flow techniques, offering a new template for studying the benefits of data reuse in non‑linear high‑dimensional learning. Potential extensions include multi‑index models, other non‑linear activations, and connections to multi‑epoch training of deep neural networks.