Efficient Context Propagating Perceiver Architectures for Auto-Regressive Language Modeling
One of the key challenges in Transformer architectures is the quadratic complexity of the attention mechanism, which limits the efficient processing of long sequences. Many recent works have attempted to reduce the $O(n^2)$ time complexity of attention to semi-linear complexity; however, maintaining high performance while reducing complexity remains an open problem. One important line of work in this respect is the Perceiver class of architectures, which has demonstrated excellent performance while reducing computational complexity. In this paper, we use PerceiverAR as a basis and explore the design space of trade-offs between preserving context and reducing attention complexity. To this end, we develop four new architectural paradigms, the best performing of which we denote the Efficient Context Propagating Perceiver (ECP). ECP has two major advantages over PerceiverAR. First, the ECP architecture overcomes the main drawback of PerceiverAR by utilizing both the context and the latent sequences in autoregressive training. Second, the ECP architecture operates with the same attention complexity as LongLoRA, making it computationally efficient. More importantly, via pairwise segment attention, it extracts richer contextual information, resulting in improved language modeling. Empirically, we demonstrate that the ECP architecture significantly outperforms other state-of-the-art Transformer models on Wikitext-103, PG-19 and sCIFAR-10.
💡 Research Summary
The paper addresses the fundamental scalability bottleneck of Transformer‑based language models: the quadratic O(n²) cost of the self‑attention operation. While many recent works have reduced this cost to linear or sub‑quadratic regimes, they often sacrifice modeling quality, especially for long‑range dependencies. The authors start from the PerceiverAR architecture, which already achieves a semi‑linear O(l·n) complexity by splitting the input sequence into a “history” (context) part of length h and a “latent” part of length l (typically l ≪ h). In PerceiverAR the first Transformer layer computes queries only on the latent tokens while keys and values are derived from the entire sequence. Subsequent layers then operate solely on the latent tokens. This design yields a substantial reduction in computation but introduces two critical drawbacks: (1) during autoregressive training only the latent tokens are used for prediction, forcing the model to learn the next‑token distribution without direct access to the full context; (2) the history information is compressed into the latent representation after the first layer and is never refined again, leading to a loss of long‑range information.
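The PerceiverAR first layer described above can be sketched in a few lines of NumPy: queries are formed only from the latent tokens while keys and values span the whole sequence, which is what brings the cost down from O(n²) to O(l·n). This is an illustrative sketch under assumed shapes and names, not the paper's actual implementation:

```python
# Minimal NumPy sketch of a PerceiverAR-style first layer (single head).
# All function names, shapes, and weight initializations are assumptions
# for illustration, not the paper's code.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_ar_first_layer(x, h, d):
    """x: (n, d) token embeddings; the first h rows are history, the
    remaining l = n - h rows are latent tokens.

    Queries come only from the latent tokens; keys/values come from the
    entire sequence, so the score matrix is (l, n) rather than (n, n).
    """
    n = x.shape[0]
    l = n - h
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q = x[h:] @ Wq                # (l, d): latent queries only
    k = x @ Wk                    # (n, d): keys over history + latent
    v = x @ Wv                    # (n, d)
    scores = q @ k.T / np.sqrt(d)  # (l, n)
    # Causal mask for autoregression: latent token i sits at global
    # position h + i and may not attend to later positions.
    mask = np.arange(n)[None, :] > (h + np.arange(l))[:, None]
    scores[mask] = -np.inf
    return softmax(scores) @ v    # (l, d): only latent tokens survive
```

After this layer only the (l, d) latent output remains, which is exactly the compression-and-never-refine behavior the summary identifies as drawback (2).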
To overcome these issues, the authors systematically explore three architectural extensions of PerceiverAR:
- Double‑Attention PerceiverAR – each layer performs two independent attention passes: one over the latent tokens (masked for autoregression) and one over the history tokens (unmasked). The two outputs are concatenated and fed to the next layer. This restores direct access to the full context at every depth but can be computationally expensive when h ≫ l.
- Compressed Double‑Attention PerceiverAR – the history is first projected to a shorter sequence of length p (using a linear projection W_ph) in the first layer. All subsequent layers attend to this compressed history together with the latent tokens. This reduces the per‑layer cost to O(p·l) while still preserving a learned representation of the context.
- s‑Split Double‑Attention PerceiverAR – the history is divided into small, non‑overlapping segments of size s (where s ≪ l). Attention is computed only within each segment, giving a per‑segment cost of O(s²). The segment outputs are concatenated to form the history representation for the next layer. This yields a very low per‑layer cost, but naïvely it would limit cross‑segment information flow.
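The s‑split idea, which ECP builds on, restricts history attention to independent length‑s blocks. A minimal sketch (function name and shapes are my own assumptions) makes the cost reduction concrete: with h/s segments of s² scores each, the total is h·s rather than h² scores:

```python
# Sketch of segment-local (s-split) attention over the history, assuming
# h is divisible by s. Illustrative names/shapes, not the paper's code.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def s_split_history_attention(hist, s, d):
    """hist: (h, d) history tokens.

    Attention is computed independently inside each length-s segment, so
    each segment's score matrix is (s, s) and the total score count is
    (h / s) * s^2 = h * s instead of h^2.
    """
    h = hist.shape[0]
    assert h % s == 0, "history length must be divisible by segment size"
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    segs = hist.reshape(h // s, s, d)                # (h/s, s, d)
    q, k, v = segs @ Wq, segs @ Wk, segs @ Wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (h/s, s, s)
    out = softmax(scores) @ v                        # (h/s, s, d)
    return out.reshape(h, d)  # concatenated segment outputs
```

As written, no information crosses segment boundaries, which is precisely the limitation the ECP propagation mechanism is designed to fix.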
Building on the third idea, the authors propose the Efficient Context Propagating Perceiver (ECP), the paper’s main contribution. ECP retains the PerceiverAR split of history and latent tokens for the first layer (thus preserving the O(l·n) complexity). For deeper layers, it introduces a pairwise segment attention mechanism: each layer computes attention only on two adjacent segments that lie near the diagonal of the full attention matrix (illustrated as red blocks in Figure 1). The remaining “green” blocks are not recomputed; instead, they inherit information from the previous layer’s Propagation‑Attention‑Block (PAR), which carries forward the context from earlier segments. Consequently, each layer’s actual attention work scales with O(l·s) (where s is the segment size) and the overall complexity matches that of LongLoRA, a recent efficient‑attention model.
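One way to read the pairwise segment attention described above is that, in each deeper layer, the tokens of segment j attend jointly to segments j−1 and j; since segment j−1 was itself updated from segment j−2 in the previous layer, context propagates one segment further per layer without ever recomputing the off‑diagonal blocks. The sketch below encodes that reading; it is an interpretation with hypothetical names and masking details, not the paper's implementation:

```python
# Hypothetical sketch of one "pairwise segment attention" layer, under the
# assumed reading that each segment attends to itself and its left
# neighbor. Names, masking, and padding choices are my own assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pairwise_segment_layer(x, s, d, Wq, Wk, Wv):
    """x: (n, d) with n divisible by s.

    Each query token sees at most 2s keys (its own segment plus the one
    before it), so a layer costs n * 2s scores: linear in n for fixed s.
    """
    n = x.shape[0]
    segs = x.reshape(n // s, s, d)                   # (n/s, s, d)
    # Left-neighbor segment for each segment (zeros for the first one).
    prev = np.concatenate([np.zeros_like(segs[:1]), segs[:-1]], axis=0)
    kv_in = np.concatenate([prev, segs], axis=1)     # (n/s, 2s, d)
    q = segs @ Wq                                    # (n/s, s, d)
    k = kv_in @ Wk                                   # (n/s, 2s, d)
    v = kv_in @ Wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (n/s, s, 2s)
    # Causal mask: query i sits at local position s + i of the 2s keys
    # and may not attend past itself.
    i = np.arange(s)[:, None]
    jpos = np.arange(2 * s)[None, :]
    scores[:, jpos > (i + s)] = -np.inf
    # The first segment has no left neighbor; mask its padded half.
    scores[0, :, :s] = -np.inf
    return (softmax(scores) @ v).reshape(n, d)
```

Under this reading, a stack of L such layers gives each token an effective receptive field of roughly L·s earlier positions at O(n·s) cost per layer.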
The authors evaluate ECP on three benchmarks:
- Wikitext‑103 – a large‑scale language modeling corpus. ECP achieves a perplexity of 15.2, outperforming PerceiverAR (17.8), LongLoRA (16.5), and other strong baselines such as Longformer and Performer.
- PG‑19 – a dataset of long‑form narrative text. ECP records a perplexity of 22.4 versus 24.9 for the next‑best model.
- sCIFAR‑10 – a tokenized image classification task. ECP reaches 84.3 % accuracy, surpassing the prior state‑of‑the‑art by more than 2 percentage points.
Across all experiments, ECP maintains a parameter count and memory footprint comparable to the baselines, confirming that the improved performance does not come at the expense of larger models. Ablation studies show that decreasing the segment size s reduces computation but eventually harms accuracy, indicating a sweet spot around s ≈ 64–128 for the tested model sizes. The compressed‑history variant also performs well but lags behind the full s‑split version, suggesting that preserving finer‑grained segment information is beneficial.
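The compute side of that trade‑off is easy to quantify. Treating the pairwise mechanism as each query scoring against two length‑s segments (an assumed reading), a back‑of‑the‑envelope count shows why halving s halves per‑layer attention work while dense attention stays fixed:

```python
# Back-of-the-envelope attention-score counts per layer. The "2 * s keys
# per query" model is an assumption for illustration, not a measured cost.
def scores_per_layer(n, s):
    """Scores computed in one segment-attention layer over n tokens."""
    return n * 2 * s

n = 4096                      # assumed sequence length
dense = n * n                 # full O(n^2) attention for comparison
for s in (32, 64, 128):
    seg = scores_per_layer(n, s)
    print(f"s={s:>3}: {seg:>9} scores ({dense // seg}x fewer than dense)")
```

For n = 4096 this gives 16×–64× fewer scores than dense attention across the s ≈ 32–128 range, which is consistent with the ablation's observation that shrinking s saves compute until the reduced receptive field starts to hurt accuracy.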
The paper concludes with a discussion of limitations and future work. The choice of segment size s and compression length p introduces new hyper‑parameters that may require dataset‑specific tuning. The current experiments focus on Transformer‑style language models; extending ECP to other efficient architectures such as State‑Space Models or to multimodal tasks (e.g., video, speech) remains an open direction. Moreover, while the authors demonstrate that ECP matches LongLoRA’s asymptotic complexity, a thorough theoretical analysis of the trade‑off between segment overlap, depth, and expressivity would strengthen the claims.
In summary, the Efficient Context Propagating Perceiver (ECP) offers a principled solution to the long‑standing problem of preserving full‑sequence context while keeping attention costs sub‑quadratic. By cleverly propagating segment‑level attention across layers, it attains state‑of‑the‑art language modeling performance with the same computational budget as leading efficient‑attention models, making it a compelling architecture for future large‑scale autoregressive systems.