SPLA: Block Sparse Plus Linear Attention for Long Context Modeling

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the [Original Paper Viewer] below or the original arXiv source.

Block-wise sparse attention offers significant efficiency gains for long-context modeling, yet existing methods often suffer from low selection fidelity and cumulative contextual loss by completely discarding unselected blocks. To address these limitations, we introduce Sparse Plus Linear Attention (SPLA), a framework that utilizes a selection metric derived from second-order Taylor expansions to accurately identify relevant blocks for exact attention. Instead of discarding the remaining “long tail,” SPLA compresses unselected blocks into a compact recurrent state via a residual linear attention (RLA) module. Crucially, to avoid IO overhead, we derive an optimized subtraction-based formulation for RLA – calculating the residual as the difference between global and selected linear attention – ensuring that unselected blocks are never explicitly accessed during inference. Our experiments demonstrate that SPLA closes the performance gap in continual pretraining, surpassing dense attention models on long-context benchmarks like RULER while maintaining competitive general knowledge and reasoning capabilities.


💡 Research Summary

The paper introduces SPLA (Sparse Plus Linear Attention), a hybrid attention framework designed to enable efficient long‑context modeling for large language models while preserving the quality of dense attention. The authors first identify two fundamental shortcomings of existing block‑wise sparse attention methods: (1) low fidelity in block selection, because current metrics are heuristic and loosely tied to the original token‑level attention objective, and (2) cumulative loss of contextual information, as unselected blocks are simply discarded, causing a “long‑tail” divergence that degrades generation quality as sequence length grows.

To address these issues, SPLA proposes a principled block‑selection metric derived from a second‑order Taylor expansion of the softmax attention function. By computing the mean (\bar{k}) and a diagonal approximation of the covariance Cov(k) for each block, the method approximates the block’s total unnormalized attention mass as (\exp(q^\top \bar{k})\bigl(1+\frac{1}{2}q^\top \text{Cov}(k) q\bigr)). This captures both first‑order (mean) and second‑order (variance) information, yielding a more accurate estimate of each block’s contribution without requiring any additional learned parameters. The resulting scores are fed into a Top‑k or Top‑p selector, producing a high‑recall set of “exact” blocks that will be processed with full (dense) attention.
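The scoring rule above can be sketched directly from its formula. The snippet below is a minimal NumPy illustration, not the paper's kernel: function names, the block size, and the use of a per-dimension variance as the diagonal covariance approximation are our assumptions.

```python
import numpy as np

def block_scores(q, K, block_size):
    """Score each key block by the second-order Taylor estimate of its
    unnormalized attention mass: exp(q^T k_bar) * (1 + 0.5 * q^T Cov(k) q),
    with Cov(k) approximated by its diagonal (per-dimension variance)."""
    scores = []
    for start in range(0, K.shape[0], block_size):
        blk = K[start:start + block_size]          # keys in this block
        k_bar = blk.mean(axis=0)                   # block mean \bar{k}
        var = blk.var(axis=0)                      # diagonal of Cov(k)
        first = np.exp(q @ k_bar)                  # first-order (mean) term
        second = 1.0 + 0.5 * np.sum(var * q * q)   # diagonal quadratic form
        scores.append(first * second)
    return np.array(scores)

def top_k_blocks(q, K, block_size, k):
    """Indices of the k highest-scoring blocks, to be given exact attention."""
    return np.argsort(block_scores(q, K, block_size))[::-1][:k]
```

Because the estimate uses only per-block means and variances, it adds no learned parameters and can be computed once per block rather than per token.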

The second innovation is Residual Linear Attention (RLA), which handles the remaining “approximate” blocks. Instead of discarding them, SPLA runs a standard linear‑attention recurrence over the entire sequence to obtain a global state (\bar{S}_t). Simultaneously, a second recurrence (\tilde{S}_t) accumulates contributions only from the selected blocks; this second state is computed inside the same sparse kernel that already loads the selected blocks into SRAM. The residual output is then defined as the difference:
\(o^{\mathrm{res}}_t = \phi(q_t)^\top\bigl(\bar{S}_t - \tilde{S}_t\bigr)\),
which recovers the contribution of the unselected "long-tail" blocks without ever loading them from memory.

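The subtraction trick can be illustrated with a toy recurrence. This is a sketch under our own assumptions (a fixed per-key selection mask and an elu+1-style feature map `phi`), not the paper's fused kernel, which instead accumulates \(\tilde{S}_t\) inside the sparse-attention pass and handles per-query block selection.

```python
import numpy as np

def phi(x):
    # Hypothetical positive feature map for linear attention (elu(x) + 1).
    return np.where(x > 0, x + 1.0, np.exp(x))

def rla_outputs(Q, K, V, selected_mask):
    """Subtraction-based residual linear attention: maintain the global
    recurrent state and the selected-only state, and emit the residual
    output from their difference, so unselected blocks are never read
    individually. selected_mask[i] is True if key i belongs to a block
    chosen for exact attention."""
    n, d = K.shape
    dv = V.shape[1]
    S_global = np.zeros((d, dv))     # \bar{S}_t: all keys seen so far
    S_selected = np.zeros((d, dv))   # \tilde{S}_t: selected keys only
    outputs = np.zeros((n, dv))
    for t in range(n):
        kv = np.outer(phi(K[t]), V[t])
        S_global += kv
        if selected_mask[t]:
            S_selected += kv
        # Residual output depends only on the difference of the two states.
        outputs[t] = phi(Q[t]) @ (S_global - S_selected)
    return outputs
```

The design point is that \(\tilde{S}_t\) reuses the selected blocks already resident in SRAM, so the residual costs one extra state update rather than a second pass over the unselected keys.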
