To Grok Grokking: Provable Grokking in Ridge Regression
We study grokking, the onset of generalization long after overfitting, in a classical ridge regression setting. We prove end-to-end grokking results for learning over-parameterized linear regression models using gradient descent with weight decay. Specifically, we prove that the following stages occur: (i) the model overfits the training data early during training; (ii) poor generalization persists long after overfitting has manifested; and (iii) the generalization error eventually becomes arbitrarily small. Moreover, we show, both theoretically and empirically, that grokking can be amplified or eliminated in a principled manner through proper hyperparameter tuning. To the best of our knowledge, these are the first rigorous quantitative bounds on the generalization delay (which we refer to as the “grokking time”) in terms of training hyperparameters. Lastly, going beyond the linear setting, we empirically demonstrate that our quantitative bounds also capture the behavior of grokking on non-linear neural networks. Our results suggest that grokking is not an inherent failure mode of deep learning, but rather a consequence of specific training conditions, and thus does not require fundamental changes to the model architecture or learning algorithm to avoid.
💡 Research Summary
The paper provides the first rigorous end‑to‑end analysis of “grokking,” the phenomenon where a model’s test performance improves only after a long period of perfect training accuracy, in the simplest possible setting: over‑parameterized linear regression with ℓ₂ regularization (ridge regression) trained by vanilla gradient descent (GD) with weight decay.
Problem setting – A teacher function N*(x)=⟨θ*,ϕ(x)⟩ is assumed realizable by a linear model over an arbitrary feature map ϕ. The student model N(x;θ)=⟨θ,ϕ(x)⟩ is trained on n i.i.d. samples by minimizing the regularized empirical loss
Lₙ(θ;λ) = (1/2n) ∑ᵢ (N(xᵢ;θ) − N*(xᵢ))² + (λ/2)‖θ‖²,
with learning rate η and regularization strength λ>0.
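As a concrete illustration, here is a minimal NumPy sketch of this setting: over-parameterized linear regression with the ridge loss above, trained by full-batch gradient descent with weight decay. The dimensions, the random Gaussian teacher, and the hyperparameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                   # n samples, d features (d > n: over-parameterized)
Phi = rng.normal(size=(n, d))    # data matrix Φ, row i is the feature vector ϕ(xᵢ)
theta_star = rng.normal(size=d)  # teacher parameters θ*
y = Phi @ theta_star             # noiseless teacher labels N*(xᵢ)

eta, lam = 1e-3, 1e-2            # learning rate η, regularization strength λ

def train_loss(theta):
    # Lₙ(θ;λ) = (1/2n) Σᵢ (⟨θ, ϕ(xᵢ)⟩ − yᵢ)² + (λ/2)‖θ‖²
    residual = Phi @ theta - y
    return residual @ residual / (2 * n) + lam * theta @ theta / 2

def gd_step(theta):
    # ∇Lₙ(θ;λ) = Φᵀ(Φθ − y)/n + λθ; GD with weight decay is exactly
    # gradient descent on the ridge-regularized loss.
    grad = Phi.T @ (Phi @ theta - y) / n + lam * theta
    return theta - eta * grad

theta = np.zeros(d)
for _ in range(1000):
    theta = gd_step(theta)
```

Note that because the gradient of the ℓ₂ penalty is λθ, the update can equivalently be written as shrinking θ by a factor (1 − ηλ) before the data-fit gradient step, which is the usual "weight decay" form.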
Main theoretical contributions –
- Fast training‑error decay (Theorem 4.4). The empirical loss shrinks exponentially fast with a rate proportional to η·σ_min²(Φ)/n, where σ_min(Φ) is the smallest singular value of the data matrix Φ. This guarantees that the model overfits (training error ≤ε) after a modest number of steps t₁.
- Slower generalization‑error decay (Theorem 4.5). The population loss L(θ) = 𝔼ₓ[½(N(x;θ) − N*(x))²] decreases at a slower rate than the empirical loss, so the generalization error becomes small only after a much longer training horizon; the gap between the two timescales is the grokking time.
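The two timescales can be observed in a toy run. The sketch below logs both the training error and the population loss at every step, and records the first step t₁ at which the training error drops below a threshold ε. It additionally assumes the identity feature map ϕ(x) = x with x ~ N(0, I_d), so the population loss has the closed form L(θ) = ½‖θ − θ*‖²; the threshold ε and all hyperparameter values are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 100                             # over-parameterized: d > n
Phi = rng.normal(size=(n, d))              # rows are x_i ~ N(0, I_d)
theta_star = rng.normal(size=d) / np.sqrt(d)
y = Phi @ theta_star                       # noiseless teacher labels

eta, lam, eps = 1e-2, 1e-3, 1e-4
theta = np.zeros(d)
t1 = None                                  # first step with training error ≤ ε
for t in range(20000):
    residual = Phi @ theta - y
    train_err = residual @ residual / (2 * n)
    # Closed-form population loss under the Gaussian-feature assumption:
    pop_err = 0.5 * np.sum((theta - theta_star) ** 2)
    if t1 is None and train_err <= eps:
        t1 = t
    theta -= eta * (Phi.T @ residual / n + lam * theta)
```

In such a run the training error falls below ε long before the population loss moves appreciably, which is the qualitative picture behind the two theorems: the training-error rate is governed by η·σ_min²(Φ)/n, while the population loss lags behind.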