A Provable Expressiveness Hierarchy in Hybrid Linear-Full Attention
Transformers serve as the foundation of most modern large language models. To mitigate the quadratic complexity of standard full attention, various efficient attention mechanisms, such as linear and hybrid attention, have been developed. A fundamental gap remains: their expressive power relative to full attention lacks a rigorous theoretical characterization. In this work, we theoretically characterize the performance differences among these attention mechanisms. Our theory applies to all linear attention variants that can be formulated as a recurrence, including Mamba and DeltaNet. Specifically, we establish an expressiveness hierarchy: for sequential function composition (a multi-step reasoning task that must occur within a model's forward pass), an $(L+1)$-layer full-attention network is sufficient, whereas any hybrid network interleaving $L-1$ layers of full attention with a substantially larger number ($2^{3L^2}$) of linear-attention layers cannot solve it. This result demonstrates a clear separation in expressive power between the two types of attention. Our work provides the first provable separation between hybrid attention and standard full attention, offering a theoretical perspective on the fundamental capabilities and limitations of different attention mechanisms.
💡 Research Summary
This paper provides the first rigorous theoretical comparison between full‑attention Transformers and the increasingly popular efficient variants that incorporate linear or sparse attention. The authors focus on two canonical tasks that capture essential aspects of multi‑step reasoning: (1) L‑Sequential Function Composition (L‑FuncComp), a task that requires applying a sequence of L functions in order, and (2) the 2‑Sum problem, which demands a global view of all input tokens to decide whether any two numbers sum to a target value.
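For intuition, the two tasks can be sketched in plain Python (an illustration only, not the paper's formalization, which states both tasks over tokenized prompts):

```python
from typing import Callable, List, Sequence

def l_func_comp(funcs: List[Callable[[int], int]], x: int) -> int:
    """L-FuncComp: apply L functions in order, f_L(...f_2(f_1(x))...)."""
    for f in funcs:
        x = f(x)
    return x

def two_sum(nums: Sequence[int], target: int) -> bool:
    """2-Sum: decide whether any two numbers (at distinct positions) sum to target."""
    seen = set()
    for v in nums:
        if target - v in seen:
            return True
        seen.add(v)
    return False

# L-FuncComp is inherently sequential (each step needs the previous result),
# while 2-Sum needs a global view of all inputs at once.
result = l_func_comp([lambda x: x + 1, lambda x: 2 * x], 3)  # (3+1)*2 = 8
found = two_sum([1, 4, 6, 9], 10)  # 1 + 9 = 10, so True
```

The contrast between the two is what drives the paper's lower bounds: the first stresses depth (sequential composition), the second stresses width (uniform access to all tokens).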
For L‑FuncComp, prior work (Chen et al., 2025) showed that an (L + 1)‑layer full‑attention network can solve the task, while an L‑layer full‑attention network cannot. Building on this, the authors define a hybrid architecture, denoted (L, a₁,…,a_L)‑Hybrid, consisting of L full‑attention layers where the i‑th full‑attention layer is followed by aᵢ linear‑attention layers. Their main result (Theorem 1.1) proves that for any integer L, a hybrid model with only L − 1 full‑attention layers and an astronomically larger number of linear layers (specifically, 2^{3L²} linear layers after each full‑attention layer) fails to solve L‑FuncComp whenever the product of head count, hidden dimension, and precision (H·d·p) is bounded by n² − 4L − 2, where n is the prompt length. In other words, even an exponential increase in linear‑attention depth cannot compensate for the loss of global token interactions provided by full attention.
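The layer layout of such a hybrid can be pictured with a small helper (a hypothetical function for intuition only; the paper defines the architecture formally):

```python
from typing import List

def hybrid_layout(a: List[int]) -> List[str]:
    """Layer sequence of an (L, a_1, ..., a_L)-Hybrid: each of the L
    full-attention layers is followed by its own run of linear layers."""
    layers = []
    for a_i in a:
        layers.append("full")
        layers.extend(["linear"] * a_i)
    return layers

# A (2, 2, 1)-Hybrid: full, linear, linear, full, linear.
layout = hybrid_layout([2, 1])
```

Theorem 1.1 concerns the regime where every aᵢ is as large as 2^{3L²}, i.e., the linear runs vastly outnumber the L − 1 full-attention layers.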
The proof hinges on interpreting a single linear‑attention head as a recurrent neural network (RNN) with hidden dimension d² + d (Lemma 2.2). By treating each linear block as an RNN, the authors apply the “indistinguishable decomposition” technique from Chen et al. (2025) to show that the information that can be propagated forward through the linear stacks is fundamentally limited. Consequently, the hybrid model cannot maintain the distinct representations needed to correctly compose L functions.
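The RNN view of a linear-attention head can be sketched as follows (vanilla linear attention with a sum-normalizer; the paper's Lemma 2.2 covers a broader family of recurrences, so the exact update here is one representative instance):

```python
import numpy as np

def linear_attention_step(S, z, q, k, v, eps=1e-6):
    """One recurrent step of a vanilla linear-attention head.

    State: S is d x d (running sum of outer products v_t k_t^T) and
    z is d-dimensional (running sum of keys), so the total recurrent
    state has dimension d^2 + d, matching the hidden size in Lemma 2.2.
    """
    S = S + np.outer(v, k)
    z = z + k
    out = S @ q / (z @ q + eps)  # normalized linear-attention output
    return S, z, out

d = 4
rng = np.random.default_rng(0)
S, z = np.zeros((d, d)), np.zeros(d)
for _ in range(5):
    q, k, v = rng.random(d), rng.random(d), rng.random(d)
    S, z, out = linear_attention_step(S, z, q, k, v)
```

The key point is that everything the head carries forward is this fixed-size state, which is exactly what lets the authors treat linear blocks as RNNs and bound the information they can propagate.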
The second contribution concerns sparse attention. The paper defines a (B, k)‑sparse attention layer that partitions the input into blocks of size B, compresses each block, scores them against the current token, and selects the top‑k blocks for attention. Theorem 1.2 establishes a lower bound for solving the 2‑Sum problem with such a layer: any successful sparse‑attention model must satisfy H·d·p = Ω(B log n). By contrast, a single‑layer full‑attention model can solve 2‑Sum with just H = 1, d = 3, and p = log n, i.e., O(log n) resources. The authors prove this using communication‑complexity arguments, showing that block compression inevitably discards the global information required for 2‑Sum unless the block size (and thus resource usage) is large.
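A (B, k)-sparse attention layer, for a single query token, might look like the following sketch (mean-pooling as the block compressor is an assumption here; the paper allows other block summaries):

```python
import numpy as np

def bk_sparse_attention(q, K, V, B, k_blocks):
    """Illustrative (B, k)-sparse attention for one query token.

    Partition the n keys into blocks of size B, compress each block
    (here: mean key), score blocks against the query, and run full
    softmax attention only over tokens in the top-k blocks.
    """
    n, d = K.shape
    n_blocks = n // B
    block_keys = K[: n_blocks * B].reshape(n_blocks, B, d)
    block_vals = V[: n_blocks * B].reshape(n_blocks, B, d)
    scores = block_keys.mean(axis=1) @ q          # one score per block
    top = np.argsort(scores)[-k_blocks:]          # select top-k blocks
    Ks = block_keys[top].reshape(-1, d)
    Vs = block_vals[top].reshape(-1, d)
    logits = Ks @ q
    w = np.exp(logits - logits.max())
    w = w / w.sum()                               # softmax over kept tokens
    return w @ Vs

n, d, B, kb = 16, 4, 4, 2
rng = np.random.default_rng(1)
q, K, V = rng.random(d), rng.random((n, d)), rng.random((n, d))
out = bk_sparse_attention(q, K, V, B, kb)
```

The lower bound says that this compress-then-select step is exactly where global information is lost: for 2-Sum, no block summary of size compatible with H·d·p = o(B log n) can preserve enough about the discarded blocks.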
Overall, the paper makes three key contributions:
- It establishes a provable expressiveness hierarchy between full‑attention and hybrid linear‑full attention models, demonstrating that linear attention—even in massive depth—cannot replace the global mixing power of full attention for multi‑step compositional tasks.
- It formalizes linear attention as an RNN, linking the analysis to a rich body of work on RNN‑Transformer comparisons and highlighting that hybrid models inherit the representational constraints of recurrent architectures.
- It provides the first hardness result for single‑layer sparse attention on a global‑interaction task, reinforcing the intuition that full attention remains optimal when uniform token access is required.
The practical implication is clear: designers of large language models must be cautious when substituting full attention with linear or sparse variants. While such substitutions reduce computational cost, they incur a provable loss in expressive power that cannot be mitigated simply by adding more layers. For tasks demanding deep reasoning or global token relationships, hybrid models should either retain a sufficient number of full‑attention layers or be augmented with additional mechanisms (e.g., chain‑of‑thought prompting, external memory) to bridge the gap. Future work may explore hybrid designs that strategically allocate full‑attention capacity where it is most needed, or develop new attention kernels that preserve global information without incurring quadratic cost.