Explaining Grokking in Transformers through the Lens of Inductive Bias
We investigate grokking in transformers through the lens of inductive bias: dispositions arising from architecture or optimization that let the network prefer one solution over another. We first show that architectural choices such as the position of Layer Normalization (LN) strongly modulate grokking speed. This modulation is explained by isolating how LN on specific pathways shapes shortcut-learning and attention entropy. Subsequently, we study how different optimization settings modulate grokking, inducing distinct interpretations of previously proposed controls such as readout scale. In particular, we find that using readout scale as a control for lazy training can be confounded by learning rate and weight decay in our setting. Accordingly, we show that features evolve continuously throughout training, suggesting grokking in transformers can be more nuanced than a lazy-to-rich transition of the learning regime. Finally, we show how generalization predictably emerges with feature compressibility in grokking, across different modulators of inductive bias. Our code is released at https://tinyurl.com/y52u3cad.
💡 Research Summary
This paper presents a systematic investigation into the phenomenon of “grokking” in transformers, analyzing it through the conceptual framework of “inductive bias.” Grokking refers to the delayed generalization observed in neural networks, where they achieve near-perfect training accuracy early on but only generalize to unseen test data after a significantly longer period of training. The authors posit that the speed and nature of grokking are governed by the inductive biases—inherent preferences for certain solutions—imbued by the model’s architecture and optimization process.
The study is structured around three core questions. First, it examines how architectural choices serve as a source of inductive bias and modulate grokking. Using a one-layer transformer trained on modular addition, the authors probe the effect of Layer Normalization (LN) placement. They create variants applying LN to inputs of the Multi-Head Self-Attention (MHSA) module, the MLP, both, or neither. The results are striking: the position of LN dramatically alters grokking speed. The configuration with LN only on MLP inputs (“M”) generalizes fastest, while the no-LN configuration is slowest. This demonstrates that LN is not merely a stabilization tool but a key architect of the model’s learning bias.
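The LN-placement variants can be pictured with a minimal single-head NumPy sketch. This is our simplification, not the authors' code: the flags `ln_attn`/`ln_mlp`, the parameter-free LN, and the single-head attention are illustrative assumptions; the paper's "M" variant corresponds to `ln_mlp=True, ln_attn=False`.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-row normalization (no learned affine, for simplicity).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def block(x, params, ln_attn=False, ln_mlp=False):
    """One transformer block with configurable LN placement.

    ln_attn: LN on the MHSA input; ln_mlp: LN on the MLP input.
    Both False reproduces the no-LN baseline; both True applies LN
    on both pathways.
    """
    Wq, Wk, Wv, W1, W2 = params
    h = layer_norm(x) if ln_attn else x
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = attn / attn.sum(axis=-1, keepdims=True)
    x = x + attn @ v                        # residual around attention
    h = layer_norm(x) if ln_mlp else x
    x = x + np.maximum(h @ W1, 0.0) @ W2    # residual around ReLU MLP
    return x

rng = np.random.default_rng(0)
d = 16
params = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]
params += [rng.standard_normal((d, 4 * d)) / np.sqrt(d),
           rng.standard_normal((4 * d, d)) / np.sqrt(4 * d)]
tokens = rng.standard_normal((3, d))        # e.g. the [a, b, "="] sequence
out = block(tokens, params, ln_attn=False, ln_mlp=True)  # the fast "M" variant
```

Running the same inputs through the four flag combinations is exactly the experimental grid the authors sweep, modulo the real model's learned LN parameters and multiple heads.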
The mechanism is dissected by isolating how LN on specific pathways shapes learning. The analysis reveals three critical biases: 1) MLP input scale sensitivity: Without LN, the MLP tends to rely on the norm (scale) of its inputs rather than their direction for shortcut learning, hindering the discovery of generalizable features. LN removes this scale dependence. 2) Attention score entropy: Applying LN to query and key inputs reduces the entropy of attention score distributions, limiting the expressivity needed for the trigonometric compositions central to solving modular addition. 3) Value input scale: LN on the value inputs (“A^v”) reduces radial variation in the attention output passed to the MLP, accelerating grokking, though not as effectively as direct LN on the MLP input.
Second, the paper explores how optimization choices constitute another source of inductive bias. It scrutinizes hyperparameters like learning rate and weight decay, and their interaction with previously proposed metrics like “readout scale,” often used as a proxy for “lazy training” (where parameters change little). The authors find that in their setting, the interpretation of readout scale can be confounded by learning rate and weight decay. This challenges a simplistic narrative of grokking as a sharp transition from a lazy to a rich (feature-learning) regime. Instead, they show that features evolve continuously throughout training, suggesting a more nuanced process.
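The confound can be illustrated with a toy model (our construction, not the paper's experiment): a two-layer linear network whose output is scaled by a readout scale `alpha`. Holding `alpha` fixed while varying learning rate and weight decay changes how far the feature weights travel from initialization, a common proxy for how "lazy" training is.

```python
import numpy as np

def feature_travel(alpha, lr, wd, steps=300, seed=0):
    """Relative parameter travel ||W_T - W_0|| / ||W_0|| for the toy
    model pred = alpha * (X @ W) @ v, trained by full-batch SGD.
    alpha plays the role of the readout scale."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((64, 8))
    y = X @ rng.standard_normal(8)           # linear teacher targets
    W = rng.standard_normal((8, 8)) / np.sqrt(8)
    v = rng.standard_normal(8) / np.sqrt(8)  # frozen readout direction
    W0 = W.copy()
    for _ in range(steps):
        err = alpha * (X @ W) @ v - y                        # residuals
        gW = alpha * np.outer(X.T @ err, v) / len(X) + wd * W
        W -= lr * gW
    return np.linalg.norm(W - W0) / np.linalg.norm(W0)

# Identical readout scale alpha, different optimization settings:
small = feature_travel(alpha=0.5, lr=0.002, wd=0.0)
large = feature_travel(alpha=0.5, lr=0.2, wd=0.01)
print(f"travel (low lr, no wd):   {small:.3f}")
print(f"travel (high lr, wd):     {large:.3f}")
```

The two runs share a readout scale yet differ in how much the features move, sketching why `alpha` alone cannot be read as a laziness dial once learning rate and weight decay vary.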
Finally, the research identifies a unifying principle across different modulators of inductive bias: feature compressibility. Regardless of whether grokking is accelerated by architectural changes (like LN placement) or optimization settings, generalization predictably emerges alongside an increase in the compressibility and structure of the learned features. For instance, as models grok, their embedding matrices develop clear, periodic patterns in the Fourier domain. This indicates that grokking is fundamentally driven by the model’s progressive discovery of structured, compressible representations, guided by its inherent biases.
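The Fourier signature can be sketched with synthetic data. This is illustrative only: the frequencies below are hypothetical stand-ins for the few key frequencies a grokked modular-addition model is known to concentrate on, contrasted with an unstructured random embedding.

```python
import numpy as np

p, d = 97, 64                            # modulus and embedding width
rng = np.random.default_rng(2)
ks = [3, 17, 40]                         # hypothetical key frequencies
n = np.arange(p)

# "Grokked"-style embedding: a few sinusoidal frequencies over tokens.
E = np.zeros((p, d))
for j, k in enumerate(ks):
    E[:, 2 * j] = np.cos(2 * np.pi * k * n / p)
    E[:, 2 * j + 1] = np.sin(2 * np.pi * k * n / p)

# "Memorizing"-style baseline: unstructured random embedding.
E_noisy = rng.standard_normal((p, d))

def spectral_concentration(E, top=6):
    # Fraction of Fourier energy (over the token axis) captured by the
    # top few frequencies: a simple compressibility proxy.
    power = np.abs(np.fft.rfft(E, axis=0)) ** 2
    per_freq = power.sum(axis=1)
    return np.sort(per_freq)[-top:].sum() / per_freq.sum()

print(f"structured  : {spectral_concentration(E):.2f}")
print(f"unstructured: {spectral_concentration(E_noisy):.2f}")
```

The structured embedding concentrates nearly all spectral energy in a handful of frequencies while the random one spreads it out, which is the kind of compressibility gap the paper tracks as generalization emerges.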
In conclusion, this work reframes grokking from a mysterious anomaly to a predictable consequence of a model’s inductive biases. It provides concrete evidence that architectural elements like LN and optimization settings actively shape the learning trajectory, favoring either shortcut memorization or the discovery of generalizable rules. The insights offer levers for potentially controlling generalization speed and deepen our understanding of how transformers learn.