Transformers Are Born Biased: Structural Inductive Biases at Random Initialization and Their Practical Consequences

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

Transformers underpin modern large language models (LLMs) and are commonly assumed to be behaviorally unstructured at random initialization, with all meaningful preferences emerging only through large-scale training. We challenge this assumption by showing that randomly initialized transformers already exhibit strong and systematic structural biases. In particular, untrained models display extreme token preferences: across random input sequences, certain tokens are predicted with probabilities orders of magnitude larger than others. We provide a mechanistic explanation for this phenomenon by dissecting the transformer architecture at initialization. We show that extreme token preference arises from a contraction of token representations along a random seed-dependent direction. This contraction is driven by two interacting forces: (i) asymmetric nonlinear activations in MLP sublayers induce global (inter-sequence) representation concentration, and (ii) self-attention further amplifies this effect through local (intra-sequence) aggregation. Together, these mechanisms align hidden representations along a direction determined solely by the random initialization, producing highly non-uniform next-token predictions. Beyond mechanistic insight, we demonstrate that these initialization-induced biases persist throughout training, forming a stable and intrinsic model identity. Leveraging this property, we introduce SeedPrint, a fingerprinting method that can reliably distinguish models that differ only in their random initialization, even after extensive training and under substantial distribution shift. Finally, we identify a fundamental positional discrepancy inherent to the attention mechanism’s intra-sequence contraction that is causally linked to the attention-sink phenomenon. This discovery provides a principled explanation for the emergence of sinks and offers a pathway for their control.


💡 Research Summary

The paper overturns the common belief that a randomly‑initialized transformer is a featureless blank slate. By feeding thousands of uniformly random token sequences into untrained nano‑GPT‑2, nano‑LLaMA‑2, and a 1.2 B‑parameter GPT‑2, the authors show that the model’s next‑token prediction is dominated by a tiny subset of vocabulary items. The preferred token can appear up to 58.5 × more often than a uniform baseline would predict, and the effect grows with model size and sequence length.
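The measurement itself is easy to reproduce in miniature. The sketch below is a minimal NumPy illustration, not the paper's code: it builds a toy untrained "language model" (random tied embeddings, one GELU MLP block, a crude RMS-style normalization standing in for LayerNorm — all toy dimensions and pooling choices are our assumptions), feeds it uniformly random sequences, and compares the most-preferred token's average probability against the uniform baseline 1/vocab. At this tiny scale the ratio is far below the paper's 58.5×, but it is reliably above 1.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
VOCAB, D, HIDDEN, SEQ, PROBES = 500, 128, 512, 16, 300

# Untrained toy LM: random embeddings, one GELU MLP block, tied unembedding
E  = rng.normal(0.0, D ** -0.5, (VOCAB, D))
W1 = rng.normal(0.0, D ** -0.5, (D, HIDDEN))
W2 = rng.normal(0.0, HIDDEN ** -0.5, (HIDDEN, D))

avg_probs = np.zeros(VOCAB)
for _ in range(PROBES):
    tokens = rng.integers(0, VOCAB, SEQ)        # uniformly random input sequence
    h = E[tokens].sum(axis=0)                   # crude sequence pooling (assumption)
    h = h * np.sqrt(D) / np.linalg.norm(h)      # crude RMS-style normalization
    avg_probs += softmax((gelu(h @ W1) @ W2) @ E.T)
avg_probs /= PROBES

ratio = avg_probs.max() * VOCAB                 # preferred token vs. uniform 1/VOCAB
print(f"most-preferred token: {ratio:.1f}x the uniform baseline")
```

Under a truly unbiased model the ratio would hover near 1; the GELU asymmetry alone already pushes it well above that, and the paper shows the gap widens dramatically with depth, width, and sequence length.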

Mechanistically, two interacting forces create this bias. First, the asymmetric non‑linearity (e.g., GELU) in the MLP sub‑layer contracts representations of different sequences toward a common direction, a phenomenon the authors call inter‑sequence concentration. Second, self‑attention aggregates value vectors within a sequence, amplifying the already‑aligned direction (intra‑sequence concentration). The combined contraction aligns the final hidden states along a seed‑dependent vector, causing the output logits to assign disproportionately high probability to the token whose embedding aligns best with that direction.
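The inter-sequence concentration can be seen directly in a few lines of NumPy. This is an illustrative sketch under our own assumptions (toy dimensions, a single MLP sublayer with standard Gaussian init), not the paper's experiment: because GELU has a nonzero mean on zero-mean inputs, the outputs of an untrained MLP share a seed-dependent mean direction, so independent random inputs become noticeably aligned after one pass.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

rng = np.random.default_rng(0)
n, d, hidden = 64, 256, 1024

# One untrained MLP sublayer with transformer-style Gaussian init
W1 = rng.normal(0.0, d ** -0.5, (d, hidden))
W2 = rng.normal(0.0, hidden ** -0.5, (hidden, d))

X = rng.normal(0.0, 1.0, (n, d))   # n independent random representations
Y = gelu(X @ W1) @ W2              # outputs of the untrained MLP

def mean_pairwise_cosine(M):
    U = M / np.linalg.norm(M, axis=1, keepdims=True)
    C = U @ U.T
    return (C.sum() - len(M)) / (len(M) * (len(M) - 1))  # off-diagonal mean

print(f"inputs : {mean_pairwise_cosine(X):+.3f}")  # near zero: random directions
print(f"outputs: {mean_pairwise_cosine(Y):+.3f}")  # clearly positive: shared direction
```

The shared direction is determined entirely by W1 and W2, i.e., by the random seed — which is exactly what makes the bias a usable model identity later on.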

Crucially, the bias survives full pre‑training. Models trained on large corpora retain the same seed‑specific token preference, effectively encoding a persistent “model identity.” Leveraging this, the authors introduce SeedPrint, a fingerprinting method that distinguishes models that differ only by their random seed, even after extensive training and under distribution shift, achieving near‑perfect discrimination.
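The fingerprinting idea can be sketched as follows. This is a hypothetical toy reconstruction of the concept, not the authors' SeedPrint implementation: we treat the set of tokens a model prefers on average over random probe sequences as its fingerprint, and check that it is stable across disjoint probe sets for the same model but essentially disjoint across initialization seeds. All names, dimensions, and the Jaccard comparison are our assumptions.

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

VOCAB, D, HIDDEN = 500, 128, 512

def toy_model(seed):
    """Untrained toy LM: random embeddings, one GELU MLP block, tied unembedding."""
    rng = np.random.default_rng(seed)
    E  = rng.normal(0.0, D ** -0.5, (VOCAB, D))
    W1 = rng.normal(0.0, D ** -0.5, (D, HIDDEN))
    W2 = rng.normal(0.0, HIDDEN ** -0.5, (HIDDEN, D))
    def logits(tokens):
        h = E[tokens].sum(axis=0)
        h = h * np.sqrt(D) / np.linalg.norm(h)   # crude RMS-style normalization
        return (gelu(h @ W1) @ W2) @ E.T
    return logits

def fingerprint(model, probe_seed, n_probes=500, seq_len=16, top_k=50):
    """Seed-specific identity: the top-k tokens by mean logit over random probes."""
    rng = np.random.default_rng(probe_seed)
    mean_logits = np.zeros(VOCAB)
    for _ in range(n_probes):
        mean_logits += model(rng.integers(0, VOCAB, seq_len))
    return set(np.argsort(mean_logits)[-top_k:].tolist())

def jaccard(a, b):
    return len(a & b) / len(a | b)

m1, m2 = toy_model(seed=1), toy_model(seed=2)
same = jaccard(fingerprint(m1, probe_seed=10), fingerprint(m1, probe_seed=20))
diff = jaccard(fingerprint(m1, probe_seed=10), fingerprint(m2, probe_seed=10))
print(f"same model, disjoint probe sets: {same:.2f}")  # high: stable identity
print(f"different init seeds:            {diff:.2f}")  # near zero
```

Averaging logits over many probes cancels input-dependent noise while the seed-dependent preference direction survives, which is why the fingerprint tolerates a change of probe distribution.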

The paper also links the bias to the widely observed attention‑sink phenomenon. A positional variance inherent to the attention mechanism’s intra‑sequence contraction is identified as the causal factor behind sinks. Simple architectural tweaks—such as normalizing value vectors or adjusting positional scaling—substantially reduce sink strength, offering a principled mitigation strategy.
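The positional side of the intra-sequence contraction is straightforward to visualize. The sketch below is illustrative only (random single-head causal attention over random token vectors; it demonstrates the positional discrepancy, while the causal link to sinks is the paper's claim, not something this toy proves): at initialization, attention logits are nearly flat, so each position's output is close to the running mean of the value vectors it can see. Position 0 can attend only to itself, so its output stands apart from the shared direction that later positions collapse onto.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 32, 64

X = rng.normal(0.0, d ** -0.5, (T, d))             # T random token vectors
Wq, Wk, Wv = (rng.normal(0.0, d ** -0.5, (d, d)) for _ in range(3))

scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
scores[np.triu_indices(T, k=1)] = -np.inf           # causal mask
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)
Y = A @ (X @ Wv)                                    # attention outputs

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Alignment of each position's output with the last (most averaged) position's output:
# early positions average few values and stand apart -- the positional discrepancy.
align = [cosine(Y[i], Y[-1]) for i in range(T)]
print([round(align[i], 2) for i in (0, 1, 3, 7, 15, 30)])
```

The alignment climbs steadily with position: the more value vectors a position averages, the harder it collapses onto the shared direction, leaving the earliest positions as the outliers around which sink-like behavior can organize.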

Overall, the work provides a thorough empirical and theoretical account of structural inductive biases present at initialization, demonstrates their persistence, and translates the insight into practical tools for model provenance and for controlling undesirable attention behaviors. Limitations include the focus on relatively small models and a single initialization scheme; future work should verify scalability to massive LLMs and explore alternative initializations.

