Clustering in Deep Stochastic Transformers

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Transformers have revolutionized deep learning across various domains, but understanding the precise token dynamics remains a theoretical challenge. Existing theories of deep Transformers with layer normalization typically predict that tokens cluster to a single point; however, these results rely on deterministic weight assumptions, which fail to capture the standard initialization scheme in Transformers. In this work, we show that accounting for the intrinsic stochasticity of random initialization alters this picture. More precisely, we analyze deep Transformers where noise arises from the random initialization of value matrices. Under diffusion scaling and token-wise RMS normalization, we prove that, as the number of Transformer layers goes to infinity, the discrete token dynamics converge to an interacting-particle system on the sphere where tokens are driven by a *common* matrix-valued Brownian noise. In this limit, we show that initialization noise prevents the collapse to a single cluster predicted by deterministic models. For two tokens, we prove a phase transition governed by the interaction strength and the token dimension: unlike deterministic attention flows, antipodal configurations become attracting with positive probability. Numerical experiments confirm the predicted transition, reveal that antipodal formations persist for more than two tokens, and demonstrate that suppressing the intrinsic noise degrades accuracy.


💡 Research Summary

The paper tackles a fundamental gap in the theoretical understanding of deep Transformer networks: the role of intrinsic stochasticity that arises from the standard random initialization of the value matrices (V). While prior works on deep Transformers have largely assumed deterministic weights (or tied weights across layers) and consequently predicted that token representations inevitably collapse to a single point (rank‑collapse), this assumption ignores the randomness that is present in real implementations.

The authors consider a minimal yet expressive architecture consisting of three components: (i) a self‑attention block (either soft‑max or its unnormalized proxy), (ii) residual connections scaled by 1/√L (the diffusion scaling regime), and (iii) token‑wise RMS normalization that projects each token onto the unit sphere. The only source of randomness is the sequence of i.i.d. value matrices Vₙ, each with i.i.d. zero‑mean entries of variance σ² and bounded support, matching common initialization schemes such as Glorot or He. Queries and keys are kept fixed to keep the analysis tractable.
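The three components above can be sketched in NumPy as a single layer update. This is an illustrative sketch, not the paper's exact implementation: queries and keys are fixed to the identity, and the inverse temperature β is a hypothetical parameter choice.

```python
import numpy as np

def rms_normalize(X):
    # Token-wise RMS normalization: project each token (row of X)
    # onto the unit sphere.
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def attention(X, beta=1.0):
    # Soft-max self-attention with fixed (identity) queries and keys,
    # as in the simplified setting where only the value matrices vary.
    logits = beta * X @ X.T                           # (n_tokens, n_tokens)
    W = np.exp(logits - logits.max(axis=1, keepdims=True))
    W /= W.sum(axis=1, keepdims=True)
    return W @ X                                      # rows: A_beta(X_i, X)

def layer(X, V, L):
    # One residual block under diffusion scaling 1/sqrt(L), followed by
    # token-wise RMS normalization; V is the random value matrix.
    return rms_normalize(X + attention(X) @ V.T / np.sqrt(L))
```

Stacking `layer` L times with fresh i.i.d. draws of `V` reproduces the discrete dynamics studied in the paper.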

Under these settings, the discrete update Xᵢⁿ⁺¹ = Xᵢⁿ + (1/√L) Vₙ₊₁ A_β(Xᵢⁿ, Xⁿ), followed by RMS normalization, is shown to converge, as the depth L → ∞, to a system of stochastic differential equations (SDEs) on the sphere S^{d−1}. The limiting dynamics take the form dYᵢ(t) = P_{⊥Yᵢ(t)}[dB(t) A_β(Yᵢ(t), Y(t))] (up to an Itô correction that keeps Yᵢ on the sphere), where B is a matrix-valued Brownian motion *common* to all tokens and P_{⊥Yᵢ(t)} denotes the projection onto the tangent space of the sphere at Yᵢ(t). The shared noise is what couples the tokens and drives the departure from the deterministic single-cluster prediction.
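The discrete dynamics can be probed numerically. The following is a rough simulation sketch for two tokens; the uniform bounded-support initialization scale and the depth L are illustrative assumptions, not the paper's exact experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def rms_normalize(X):
    # Project each token (row) onto the unit sphere.
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def attention(X, beta=1.0):
    # Soft-max attention with fixed identity queries/keys.
    logits = beta * X @ X.T
    W = np.exp(logits - logits.max(axis=1, keepdims=True))
    W /= W.sum(axis=1, keepdims=True)
    return W @ X

d, L = 8, 2000
a = np.sqrt(3.0 / d)   # uniform(-a, a) entries: variance 1/d, bounded support
X = rms_normalize(rng.standard_normal((2, d)))    # two tokens on S^{d-1}

for _ in range(L):
    # Fresh i.i.d. value matrix per layer, zero-mean with bounded support
    # (a Glorot-uniform-style choice, assumed here for illustration).
    V = rng.uniform(-a, a, size=(d, d))
    X = rms_normalize(X + attention(X) @ V.T / np.sqrt(L))

# Cosine between the two tokens: near +1 indicates clustering,
# near -1 an antipodal configuration.
cos_01 = float(X[0] @ X[1])
```

Sweeping β and d in such a simulation is one way to visualize the phase transition between clustered and antipodal regimes reported in the paper.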

