Explaining Grokking and Information Bottleneck through Neural Collapse Emergence

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

The training dynamics of deep neural networks often defy expectations, even as these models form the foundation of modern machine learning. Two prominent examples are grokking, where test performance improves abruptly long after the training loss has plateaued, and the information bottleneck principle, where models progressively discard input information irrelevant to the prediction task as training proceeds. However, the mechanisms underlying these phenomena and their relations remain poorly understood. In this work, we present a unified explanation of such late-phase phenomena through the lens of neural collapse, which characterizes the geometry of learned representations. We show that the contraction of population within-class variance is a key factor underlying both grokking and information bottleneck, and relate this measure to the neural collapse measure defined on the training set. By analyzing the dynamics of neural collapse, we show that distinct time scales between fitting the training set and the progression of neural collapse account for the behavior of the late-phase phenomena. Finally, we validate our theoretical findings on multiple datasets and architectures.


💡 Research Summary

This paper offers a unified theoretical explanation for two puzzling late‑phase phenomena observed in deep neural network training: grokking—where test accuracy suddenly improves long after the training loss has plateaued—and the information‑bottleneck (IB) dynamics—where networks first memorize the training data and later compress irrelevant input information. The authors argue that both phenomena are driven by the same geometric process: the emergence of neural collapse, specifically the contraction of within‑class variance in the learned representation space.

First, they introduce a scale‑invariant “population within‑class variance” metric, defined as the expected squared distance between a normalized feature vector ĝ(x)=g(x)/B_g and its class‑conditional mean, averaged over the data distribution. This quantity captures how tightly representations of the same class cluster, independent of overall feature magnitude.
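This metric is easy to compute from a batch of features. Below is a minimal sketch, assuming `B_g` is the average feature norm (one plausible choice of normalizer; the paper's exact definition may differ) and that labels are integer class indices:

```python
import numpy as np

def population_within_class_variance(features, labels):
    """Scale-invariant within-class variance of features g(x).

    Features are divided by B_g, taken here to be the mean feature
    norm (an illustrative choice of scale), and we then average the
    squared distance of each normalized feature to its
    class-conditional mean.
    """
    B_g = np.linalg.norm(features, axis=1).mean()   # overall feature scale
    g_hat = features / B_g                          # normalized features g-hat(x)
    total = 0.0
    for c in np.unique(labels):
        cls = g_hat[labels == c]
        mu_c = cls.mean(axis=0)                     # class-conditional mean
        total += ((cls - mu_c) ** 2).sum()
    return total / len(features)
```

Because both the features and the class means are divided by the same `B_g`, rescaling the feature extractor leaves the metric unchanged, which is exactly the scale invariance the definition is after.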

Theorem 3.2 shows that, for a fixed feature extractor and linear classifier, the generalization error is bounded above by a term that decreases both when the classifier weights align better with class means and when the population within‑class variance shrinks. Consequently, after the training loss has been driven to zero, further reduction of this variance can explain the abrupt rise in test performance characteristic of grokking.
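The two quantities in the bound can be tracked during training. The sketch below is an illustrative diagnostic, not the paper's exact bound: it measures weight-mean alignment as the average cosine similarity between each class's classifier weight vector and its class-conditional feature mean.

```python
import numpy as np

def classifier_mean_alignment(W, features, labels):
    """Average cosine similarity between each classifier weight
    vector w_c (rows of W, shape (num_classes, feature_dim)) and the
    corresponding class-conditional feature mean mu_c.

    A rough proxy for the alignment term in a Theorem-3.2-style
    generalization bound; the paper's precise quantity may differ.
    """
    sims = []
    for c in range(W.shape[0]):
        mu_c = features[labels == c].mean(axis=0)
        w_c = W[c]
        sims.append(mu_c @ w_c /
                    (np.linalg.norm(mu_c) * np.linalg.norm(w_c)))
    return float(np.mean(sims))
```

During grokking one would expect this alignment to grow toward 1 while the within-class variance shrinks, even though training loss has long been near zero.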

Turning to the IB framework, the authors add infinitesimal Gaussian noise of scale σ to the deterministic representation to avoid infinite mutual information. They prove (Theorem 3.4) that the surplus information I(Z;X)−I(Z;Y)=I(Z;X|Y) is bounded by (1/(2σ²))·E‖ĝ(X)−μ_Y‖², i.e., by the population within‑class variance divided by 2σ². The compression phase of the IB dynamics therefore follows from the same contraction of within‑class variance that drives grokking.

Finally, by analyzing the dynamics of neural collapse, the authors attribute the delayed onset of both phenomena to distinct time scales: the training set is fit quickly, while within‑class variance contracts slowly afterward. They validate these findings on multiple datasets and architectures.
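A bound of this 1/(2σ²) form can be recovered with a standard variational argument (a sketch of one way to obtain a Theorem‑3.4‑style bound, not necessarily the paper's proof). For Z = ĝ(X) + σε with ε ∼ N(0, I):

```latex
% Variational bound on the surplus information I(Z;X|Y).
% Replacing the intractable marginal p(z|y) by the Gaussian surrogate
% q(z|y) = N(mu_y, sigma^2 I) can only increase the expected KL:
\begin{align*}
I(Z;X\mid Y)
  &= \mathbb{E}_{X,Y}\!\left[\mathrm{KL}\big(p(z\mid X)\,\|\,p(z\mid Y)\big)\right] \\
  &\le \mathbb{E}_{X,Y}\!\left[\mathrm{KL}\big(\mathcal{N}(\hat g(X),\sigma^2 I)\,\|\,\mathcal{N}(\mu_Y,\sigma^2 I)\big)\right] \\
  &= \frac{1}{2\sigma^2}\,\mathbb{E}_{X,Y}\!\left[\big\|\hat g(X)-\mu_Y\big\|^2\right].
\end{align*}
```

The final expression is exactly the population within‑class variance scaled by 1/(2σ²), which is why shrinking that variance compresses the representation in the IB sense.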

