A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training


We investigate the functional role of emergent outliers in large language models, specifically attention sinks (a few tokens that consistently receive large attention logits) and residual sinks (a few fixed dimensions with persistently large activations across most tokens). We hypothesize that these outliers, in conjunction with the corresponding normalizations (*e.g.*, softmax attention and RMSNorm), effectively rescale other non-outlier components. We term this phenomenon *outlier-driven rescaling* and validate this hypothesis across different model architectures and training token counts. This view unifies the origin and mitigation of both sink types. Our main conclusions and observations include: (1) Outliers function jointly with normalization: removing normalization eliminates the corresponding outliers but degrades training stability and performance; directly clipping outliers while retaining normalization leads to degradation, indicating that outlier-driven rescaling contributes to training stability. (2) Outliers serve as rescaling factors rather than direct contributors, as the final contributions of attention and residual sinks are significantly smaller than those of non-outliers. (3) Outliers can be absorbed into learnable parameters or mitigated via explicit gated rescaling, leading to improved training performance (an average gain of 2 points) and enhanced quantization robustness (only 1.2 points of degradation under W4A4 quantization).


💡 Research Summary

The paper investigates two pervasive outlier phenomena in large language models—attention sinks (a few tokens that consistently receive disproportionately large attention logits) and residual sinks (a small set of hidden‑state dimensions that exhibit abnormally large activations across most tokens). The authors propose that both types of outliers are not pathological bugs but functional components that work together with the model’s normalization layers (softmax in attention and RMSNorm in the residual stream) to rescale the non‑outlier components. They name this mechanism “outlier‑driven rescaling.”
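The rescaling mechanism is easy to see numerically. The following is a minimal sketch (not the paper's code): a single "residual sink" coordinate inflates the shared RMS denominator of RMSNorm, so every non-outlier coordinate is scaled down in proportion to the sink's magnitude.

```python
import numpy as np

def rmsnorm(x, gain):
    """RMSNorm: divide by the root-mean-square of x, then apply a per-dimension gain."""
    return x / np.sqrt(np.mean(x ** 2)) * gain

d = 8
gain = np.ones(d)

# Plant a "residual sink" in dimension 0 and grow its magnitude.
# Because the RMS denominator is shared across dimensions, the
# non-outlier coordinates shrink as the sink grows.
non_outlier_scale = []
for sink in (1.0, 10.0, 100.0):
    x = np.ones(d)
    x[0] = sink
    y = rmsnorm(x, gain)
    non_outlier_scale.append(y[1])

print(non_outlier_scale)  # strictly decreasing as the sink grows
```

Here the sink never "contributes" information itself; it only sets the effective scale of the other coordinates, which is the rescaling role the paper attributes to both sink types.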

Key empirical findings are as follows:

  1. Normalization is essential for outlier formation and for training stability. Replacing softmax with sigmoid attention or substituting RMSNorm with a point‑wise function such as Dynamic tanh (DyT) dramatically reduces the magnitude of both attention and residual sinks, but the models become unstable, diverge, or suffer a large increase in final loss. This demonstrates that the outliers are a by‑product of the scaling effect of the normalizations, not a source of error.
  2. Directly clipping or suppressing outliers while keeping the normalizations harms performance. When the authors clip the extreme logits or activations, the rescaling effect is broken; training either degrades or diverges. This explains why architectural tricks that unintentionally limit outlier magnitude (e.g., using sigmoid‑based GLU) often underperform.
  3. Outlier dimensions act primarily as scaling factors. RMSNorm learns very small gain parameters for the dimensions that host residual sinks (e.g., 0.006 versus a mean of ~1). The authors prove that, under this condition, the norm of the post‑normalization vector is bounded above by a term that decreases as the outlier magnitude grows, confirming the scaling‑only role.
  4. Outliers can be absorbed into learnable parameters. By inserting a lightweight learnable vector before the normalization layer, the model can shift the large values from the activation space into parameters, preserving the rescaling effect while keeping activations modest. This “parameter absorption” is lossless and simplifies downstream quantization.
  5. Explicit gating provides a clean alternative. Adding a lightweight gating mechanism after RMSNorm (GatedNorm) or using GatedAttention (GA) reproduces the rescaling function without relying on extreme logits or activations. Experiments show that models with gating reduce both attention and residual sinks, achieve an average 2‑point gain in downstream evaluation metrics, and improve 4‑bit (W4A4) quantization robustness by reducing the degradation from ~3 points to ~1.8 points. Moreover, once gating is present, the model’s sensitivity to other architectural choices diminishes: sigmoid‑GLU matches or exceeds SwiGLU, and DyT becomes stable.
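The exact GatedNorm parameterization is not reproduced in this summary, but the idea in finding 5 can be sketched as follows: a learnable sigmoid gate applied after RMSNorm supplies an explicit per-dimension rescaling, so the model no longer needs extreme activations to achieve the same effect. The gate projection `W_g` is a stand-in for whatever learned parameterization the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

def rmsnorm(x, gain):
    """Row-wise RMSNorm over the last axis with a per-dimension gain."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True)) * gain

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d = 16
gain = np.ones(d)
# Hypothetical gate projection; learned jointly with the model in practice.
W_g = rng.normal(scale=d ** -0.5, size=(d, d))

def gated_norm(x):
    # GatedNorm-style layer (sketch): the explicit gate in (0, 1) performs
    # the rescaling that extreme outlier activations would otherwise provide.
    return sigmoid(x @ W_g) * rmsnorm(x, gain)

x = rng.normal(size=(4, d))   # a small batch of hidden states
y = gated_norm(x)
print(y.shape)
```

Because the gate is bounded in (0, 1), every output coordinate is at most the corresponding RMSNorm output in magnitude: the rescaling is now carried by parameters rather than by outlier activations, which is what makes the activations easier to quantize.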

The authors validate these observations across a broad spectrum of configurations: models ranging from 1 B to 24 B parameters, trained on datasets from 120 B to 1 T tokens, and using pure softmax, linear, or hybrid attention. Table 1 systematically compares various rescaling strategies (full attention, GA, linear, hybrid, DyT, clipping, GLU variants, Pre‑Affine, GatedNorm, etc.) in terms of outlier magnitude, final loss, and training stability. The consistent pattern is that any method that preserves an outlier‑driven rescaling—whether through traditional normalization, learned gates, or parameter absorption—maintains or improves performance, while methods that merely suppress outliers without providing an alternative scaling mechanism degrade the model.

In summary, the paper reframes attention sinks and residual sinks as intentional scaling mechanisms that enable transformers to control the magnitude of the residual stream and attention outputs. Rather than treating them as anomalies to be eliminated, the work suggests designing models to either harness these outliers via normalization or replace them with explicit, learnable gating. This perspective not only clarifies the functional role of extreme activations but also offers practical recipes for more stable training, better downstream performance, and enhanced quantization robustness.

