Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

Notice: This research summary and analysis were automatically generated using AI. For accuracy, please refer to the original arXiv source.

Large Language Models (LLMs) often assign disproportionate attention to the first token, a phenomenon known as the attention sink. Several recent approaches aim to address this issue, including Sink Attention in GPT-OSS and Gated Attention in Qwen3-Next. However, a comprehensive analysis of the relationship among these attention mechanisms is lacking. In this work, we provide both theoretical and empirical evidence that the sinks in Vanilla Attention and Sink Attention naturally construct a Mixture-of-Experts (MoE) mechanism within attention layers. This insight explains the head collapse phenomenon observed in prior work, where only a fixed subset of attention heads contributes to generation. To mitigate head collapse, we propose a sink-aware training algorithm with an auxiliary load balancing loss designed for attention layers. Extensive experiments show that our method achieves effective head load balancing and improves model performance across Vanilla Attention, Sink Attention, and Gated Attention. We hope this study offers a new perspective on attention mechanisms and encourages further exploration of the inherent MoE structure within attention layers.


💡 Research Summary

The paper investigates the “attention sink” phenomenon—where large language models (LLMs) allocate a disproportionate amount of attention weight to the first token—and reveals that this behavior implicitly implements a Mixture‑of‑Experts (MoE) routing mechanism within the attention layer. By mathematically analyzing three prevalent attention variants—Vanilla Attention, Sink Attention (as used in GPT‑OSS), and Gated Attention (as used in Qwen3‑Next)—the authors show that the attention weight assigned to the sink (either the first token or a learnable “sink” scalar) functions as an implicit gating factor. Specifically, the effective contribution of a head is scaled by (1 − A_sink), which is mathematically equivalent to an explicit gating value in Gated Attention. Consequently, each head acts as an independent expert, and the sink‑derived gating routes information among these experts, forming a native MoE structure.
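The equivalence described above can be sketched numerically. The following is a minimal single-head illustration (not the paper's implementation; shapes and the sink logit value are hypothetical): appending a learnable sink logit to the score row before the softmax yields exactly the same head output as renormalized attention multiplied by an explicit gate G = 1 − A_sink, as in Gated Attention.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, head_dim = 8, 16          # hypothetical sizes for illustration
q = rng.standard_normal(head_dim)
k = rng.standard_normal((seq_len, head_dim))
v = rng.standard_normal((seq_len, head_dim))
sink_logit = 1.5                   # hypothetical learnable sink scalar

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = k @ q / np.sqrt(head_dim)                    # (seq_len,)
weights = softmax(np.concatenate([[sink_logit], scores]))
a_sink, attn = weights[0], weights[1:]

# Sink Attention: the sink slot carries no value vector, so the head's
# output is implicitly down-scaled by (1 - a_sink).
out_sink = attn @ v

# Gated Attention view: renormalized attention times the explicit gate
# G = 1 - a_sink. Both outputs coincide term by term.
out_gated = (1 - a_sink) * (softmax(scores) @ v)

assert np.allclose(out_sink, out_gated)
```

The identity is exact: each weight exp(s_i)/(exp(s_sink) + Σ_j exp(s_j)) factors into the renormalized weight exp(s_i)/Σ_j exp(s_j) times (1 − A_sink), which is why the sink acts as a per-head router.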

This hidden MoE leads to a “head collapse” problem analogous to expert collapse in conventional MoE models: a small subset of heads consistently receives high activation while the majority remain dormant. Empirical analyses across multiple models (LLaMA, Qwen, etc.) confirm that the first token’s value vector collapses to near‑zero, and that head usage statistics are highly skewed. The collapse limits representational capacity, especially in long‑context scenarios where precise query‑key discrimination is essential.

To mitigate head collapse, the authors propose a “Sink‑Aware Training” framework. The key ideas are: (1) reinterpret the sink weight as an explicit gate G = 1 − A_sink, and (2) add an auxiliary load‑balancing loss L_balance = ∑_h (π_h − 1/H)^2, where π_h denotes the average routing probability of head h and H is the total number of heads. During training from scratch, the loss encourages uniform routing across all heads. For fine‑tuning pretrained models where collapse has already emerged, the method fixes the top‑activated heads (preserving learned behavior) and applies the balancing loss only to the remaining heads, preventing disruption of existing knowledge.
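The balancing loss itself is a short expression. Below is a minimal sketch assuming π is a vector of per-head average routing probabilities (e.g. the mean of 1 − A_sink over tokens); the example values are hypothetical, chosen only to contrast a collapsed with a balanced head distribution.

```python
import numpy as np

def load_balance_loss(pi):
    """Auxiliary loss L_balance = sum_h (pi_h - 1/H)^2, which is zero
    exactly when every head's average routing probability equals the
    uniform value 1/H."""
    H = pi.shape[-1]
    return float(np.sum((pi - 1.0 / H) ** 2))

# Hypothetical routing statistics for H = 4 heads.
collapsed = np.array([0.90, 0.05, 0.03, 0.02])  # one dominant head
balanced = np.full(4, 0.25)                     # uniform routing

assert load_balance_loss(balanced) == 0.0
assert load_balance_loss(collapsed) > load_balance_loss(balanced)
```

In training, this term would be added to the language-modeling loss with a small coefficient; for fine-tuning, per the method described above, it would be applied only to the non-top heads.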

Implementation leverages the log‑sum‑exp (LSE) values already computed in Flash Attention, allowing the gating and balancing terms to be calculated without extra memory or compute overhead. Experiments cover models of 0.6 B, 1 B, and 2 B parameters trained under six configurations (Vanilla, Sink, Gated, each with/without the balancing loss) and fine‑tuning of large pretrained checkpoints such as LLaMA‑3 and Qwen‑3. Results demonstrate: (i) a more uniform distribution of head usage; (ii) reductions in perplexity (≈1‑2 %); (iii) consistent gains on long‑context benchmarks (LongBench) and general language understanding (MMLU); and (iv) improved robustness to quantization and pruning, as balanced heads tolerate higher compression ratios.
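The LSE trick can be sketched as follows. Flash Attention already returns, per query and head, the log-sum-exp of the score row; combining it with the sink logit s gives A_sink = exp(s)/(exp(s) + exp(LSE)), so the gate reduces to a single sigmoid with no extra pass over keys or values. This is a minimal illustration of that algebra, not the authors' kernel code; the variable names are hypothetical.

```python
import numpy as np

def sink_gate_from_lse(lse, sink_logit):
    """Gate G = 1 - A_sink from the per-(query, head) log-sum-exp:
        A_sink = exp(s) / (exp(s) + exp(LSE))  =>  G = sigmoid(LSE - s)
    so the gating term costs one sigmoid per (query, head)."""
    return 1.0 / (1.0 + np.exp(-(lse - sink_logit)))

# Check against the direct softmax-with-sink computation.
rng = np.random.default_rng(1)
scores = rng.standard_normal(8)   # hypothetical score row for one query
s = 0.7                           # hypothetical sink logit
lse = np.log(np.exp(scores).sum())

weights = np.exp(np.concatenate([[s], scores]))
weights /= weights.sum()
direct_gate = 1.0 - weights[0]    # 1 - A_sink computed the long way

assert np.isclose(sink_gate_from_lse(lse, s), direct_gate)
```

Because the LSE is a byproduct of the standard Flash Attention forward pass, both the gate and the balancing statistics come essentially for free, consistent with the "no extra memory or compute overhead" claim above.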

The paper concludes that attention sink is not a bug but an implicit MoE routing mechanism. By making this routing explicit and enforcing load balancing, head collapse can be effectively alleviated, leading to better utilization of model capacity and superior performance, particularly in tasks requiring long‑range dependency modeling. This work opens a new perspective on attention mechanisms and suggests that future LLM architectures could deliberately incorporate or refine native MoE structures for efficiency and scalability.

