Attention sinks are tokens, often the beginning-of-sequence (BOS) token, that receive disproportionately high attention despite limited semantic relevance. In this work, we identify a class of attention sinks, which we term secondary sinks, that differ fundamentally from the sinks studied in prior work, which we term primary sinks. While prior work has observed that tokens other than BOS can sometimes become sinks, those tokens were found to behave analogously to the BOS token: they emerge at the same layer as the BOS sink, persist throughout the network, and draw a large amount of attention mass. In contrast, the secondary sinks we identify arise primarily in middle layers, persist for a variable number of layers, and draw a smaller, but still significant, amount of attention mass. Through extensive experiments across 11 model families, we analyze where these secondary sinks appear, their properties, how they are formed, and their impact on the attention mechanism. Specifically, we show that: (1) these sinks are formed by specific middle-layer MLP modules, which map token representations to vectors aligned with the direction of that layer's primary sink; (2) the $\ell_2$-norm of these vectors determines both the sink score of the secondary sink and the number of layers it persists, leading to correspondingly different impacts on the attention mechanism; (3) the primary sink weakens in middle layers, coinciding with the emergence of secondary sinks. We observe that in larger-scale models, the location and lifetime of the sinks, together referred to as sink levels, appear in a more deterministic and frequent manner. Specifically, we identify three sink levels in QwQ-32B and six levels in Qwen3-14B.
Attention sinks were first identified by Xiao et al. (2023), where the BOS token was observed to receive anomalously high attention weights. This phenomenon has since been shown to have broad practical implications, including LLM quantization (Son et al., 2024; Liu et al., 2024), KV-cache optimization (Cai et al., 2025; 2024), efficient LLM serving (Xiao et al., 2023), and model enhancement (Yu et al., 2024).
Many recent studies have investigated the formation and functional role of the BOS sink. Cancedda (2024) analyzes attention sinks from a spectral subspace perspective, while Gu et al. (2024) interprets them as a form of positional bias that mitigates over-mixing. Building on this line of work, Queipo-de Llano et al. (2025) further examines their role in the depth-wise organization of the model, proposing that attention sinks serve as a mechanism for information compression along the depth dimension. Sun et al. (2024) and Yu et al. (2024) show that attention sinks are not limited to the BOS token; instead, multiple tokens can function as attention sinks. Ruscio et al. (2025) further analyzes this phenomenon from a geometric perspective, showing that the emergence of multiple attention sinks is closely tied to the model’s positional embedding scheme. In this view, attention sinks act as reference points in a high-dimensional representation space, enabling the model to establish a stable internal coordinate system. However, the multiple sinks identified in previous work are fundamentally the same as the BOS sink: they emerge at the same layers and persist throughout the network. In contrast, we identify a new type of sink that differs in both its layer of emergence and its lifetime.
In this work, we show that all of the multiple tokens that act as attention sinks can be organized into distinct sink levels. The primary level normally corresponds to the BOS sink: tokens at this level emerge at the same layer as the BOS sink and persist throughout the network. Additional sink levels arise in the middle layers and persist for a variable number of layers; we refer to these as Secondary Sinks. Subsequent sections study these secondary sinks in detail.
We first identify Secondary Sinks by characterizing their similarities to and differences from the BOS sink across a range of models, in particular their sink levels and the token sets and positions in which they frequently occur (Section 3). We then quantify the contributions of different layers to their formation via an empirical analysis of their emergence across network depth (Section 4). Finally, we examine their impact on attention scores throughout the network (Section 5). We draw the following conclusions:
• Unlike the primary sink, which emerges in early layers and persists throughout the entire network, Secondary Sinks arise primarily in middle layers and persist only for a few layers. They can be found at any position in the generated sequence and typically correspond to semantically uninformative tokens (a measurement sketch follows this list).
• Secondary Sinks share a similar direction with the primary sink. This direction is encoded in specific middle-layer MLP modules at layer $l_{\text{start}}$, which map multiple orthogonal directions to the same sink direction. After $l_{\text{start}}$, a set of semantically uninformative tokens is transformed into attention sinks. Meanwhile, the layers preceding $l_{\text{start}}$ play a key role in constructing this set, distinguishing these tokens from other semantically uninformative tokens.
• Different levels of Secondary Sinks exhibit distinct lifetimes and attention sink strengths. Both are strongly correlated with the $\ell_2$-norm of the $l_{\text{start}}$ MLP output. Larger models show clearer differentiation between sink levels, and models that undergo extensive post-training on reasoning data exhibit a stronger Secondary Sink phenomenon.
• Secondary Sinks show a compensating effect relative to the BOS sink: the BOS sink gradually decays and reaches its weakest strength in the middle layers, coinciding with the emergence of the Secondary Sink phenomenon.
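The per-layer sink behavior referred to above can be probed with a short measurement script. The following is a minimal sketch, not the paper's measurement protocol: the model name is a placeholder, attention weights are head-averaged, and the score of a key position is defined here as the attention mass it receives averaged over the queries that can see it under the causal mask.

```python
# Sketch: per-token, per-layer attention "sink scores" for a causal LM.
# Assumptions (not from the paper): eager attention, head-averaged weights,
# and sink score = received attention mass / number of attending queries.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B"  # placeholder; any decoder-only model works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, attn_implementation="eager").eval()

inputs = tok("Attention sinks draw a disproportionate share of attention mass.",
             return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

for layer, attn in enumerate(out.attentions):   # each: (batch, heads, t, t)
    a = attn[0].mean(dim=0)                     # head-averaged attention, (t, t)
    t = a.shape[0]
    received = a.sum(dim=0)                     # total mass received by each key
    num_queries = torch.arange(t, 0, -1)        # key j is visible to t - j queries
    score = received / num_queries
    vals, idxs = torch.topk(score, k=min(3, t))
    toks = [tok.decode([int(inputs.input_ids[0, i])]) for i in idxs]
    print(f"layer {layer:2d}:", list(zip(toks, [round(v.item(), 3) for v in vals])))
```

In such a probe, the BOS position typically dominates in early layers, while non-BOS positions whose scores rise only in middle layers are candidates for the Secondary Sinks described above.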
Let $f_\theta$ be a decoder-only transformer with $L$ layers and hidden size $h$. At each decoder layer $l$, the decoder receives a hidden sequence of length $t$, $H^l = \{h^l_0, h^l_1, \ldots, h^l_{t-1}\}^T \in \mathbb{R}^{t \times h}$, where $h^l_i$ is the hidden representation at position $i$ in layer $l$.
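In practice, these per-layer hidden sequences can be read off directly. A minimal sketch, assuming a HuggingFace-style causal LM (the model name is a placeholder); `output_hidden_states` exposes the tensors playing the role of $H^l$:

```python
# Sketch: inspect the per-layer hidden sequences H^l of a decoder-only LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")      # placeholder model
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B").eval()

inputs = tok("An example sequence.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of L+1 tensors of shape (batch, t, h):
# entry l is the hidden sequence entering decoder layer l, and the last
# entry is the final-layer output.
for l, H in enumerate(out.hidden_states):
    print(f"H^{l}: shape {tuple(H.shape)}")
```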
Decoder blocks Each decoder layer $l$ consists of a multi-head self-attention (MHSA) module and a multi-layer perceptron (MLP), both of which operate on the decoder input hidden state, also referred to as the residual stream $H^l \in \mathbb{R}^{t \times h}$. The MHSA produces an output $O^l \in \mathbb{R}^{t \times h}$, and the MLP produces $F^l \in \mathbb{R}^{t \times h}$; in both cases, the outputs are added back to the residual stream. Decoders may use either a pre-norm or a post-norm architecture, in which normalization is applied before or after each module, respectively. The majority of modern models employ pre-norm, as shown in Figure 4.
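The pre-norm update of the residual stream can be summarized by a minimal block sketch. This is illustrative rather than any particular model's implementation; the choice of LayerNorm, GELU, and a 4x MLP expansion are simplifying assumptions:

```python
# Sketch of a pre-norm decoder block: O^l and F^l are added back to the
# residual stream H^l, yielding H^{l+1}.
import torch
import torch.nn as nn

class PreNormDecoderBlock(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_size)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, mlp_ratio * hidden_size),
            nn.GELU(),
            nn.Linear(mlp_ratio * hidden_size, hidden_size),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: residual stream H^l, shape (batch, t, hidden_size)
        t = h.shape[1]
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=h.device), 1)
        x = self.norm1(h)                              # pre-norm before MHSA
        o, _ = self.attn(x, x, x, attn_mask=causal)    # O^l
        h = h + o                                      # residual update
        f = self.mlp(self.norm2(h))                    # F^l (pre-norm before MLP)
        return h + f                                   # H^{l+1}
```

A post-norm variant would instead apply the normalization to the summed output of each module rather than to its input.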
Position Embedding Position embeddings $P$ provide tokens with positional information in the attention mechanism. Common choices of $P$ include absolute position embeddings and rotary position embeddings (RoPE).
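As one concrete example of how positional information can be injected into the attention mechanism, below is a minimal rotary position embedding (RoPE) sketch applied to a query or key tensor. The base of 10000 and the rotate-half formulation are common defaults assumed here, not values taken from this paper:

```python
# Sketch: rotary position embeddings (RoPE) applied to a query/key tensor.
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (batch, t, head_dim) with even head_dim; each channel pair is rotated
    # by an angle proportional to the token position.
    _, t, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)        # (half,)
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]  # (t, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Example: q = apply_rope(torch.randn(1, 8, 64))
```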