In-Context Learning in Linear vs. Quadratic Attention Models: An Empirical Study on Regression Tasks

Notice: This research summary and analysis were generated automatically with AI. For full accuracy, please refer to the original arXiv source.

Recent work has demonstrated that transformers and linear attention models can perform in-context learning (ICL) on simple function classes, such as linear regression. In this paper, we empirically study how these two attention mechanisms differ in their ICL behavior on the canonical linear-regression task of Garg et al. We evaluate learning quality (MSE), convergence, and generalization behavior of each architecture. We also analyze how increasing model depth affects ICL performance. Our results illustrate both the similarities and limitations of linear attention relative to quadratic attention in this setting.
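The ICL setup from Garg et al. referenced above can be sketched as follows: each prompt is a sequence of input-output pairs from a freshly sampled linear function, and the model must infer the function in context. This is a minimal illustrative sketch; the input dimension of 5 matches the setup described below, while `n_points` is an assumed value, not taken from the paper.

```python
import numpy as np

def sample_icl_prompt(n_points=40, dim=5, rng=None):
    """Sample one in-context linear-regression prompt in the style of
    Garg et al.: inputs x_i ~ N(0, I_d), weight vector w ~ N(0, I_d),
    and targets y_i = w^T x_i (noiseless)."""
    rng = rng or np.random.default_rng()
    w = rng.standard_normal(dim)
    xs = rng.standard_normal((n_points, dim))
    ys = xs @ w
    return xs, ys, w

xs, ys, w = sample_icl_prompt()
# The model is fed (x_1, y_1, ..., x_{t-1}, y_{t-1}, x_t)
# and is trained to predict y_t at each position t.
```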


💡 Research Summary

This paper presents a systematic empirical comparison of two attention mechanisms—standard quadratic (soft‑max) attention and linear (kernel‑based) attention—on the canonical in‑context learning (ICL) benchmark introduced by Garg et al., which consists of learning linear regression functions from a prompt of input‑output pairs. The authors implement both families of transformers with identical architectural hyper‑parameters (input dimension 5, embedding size 256, 4 attention heads, MLP expansion ×4) and evaluate three depths (1, 3, 6 layers). Training settings are matched as closely as possible: quadratic models use a learning rate of 1e‑4, batch size 32, and 30 000 steps; linear models use a learning rate of 3e‑4, batch size 64, and depth‑dependent steps (7.5k–10k). The linear attention implementation follows Katharopoulos et al., employing a squared‑ReLU feature map (ϕ(x)=ReLU(x)²) to guarantee non‑negativity, a √d scaling before the map, and a recurrent formulation that maintains a running key‑value outer‑product matrix Sₜ and a normalization vector zₜ, achieving O(T d²) complexity.
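The recurrent linear-attention formulation described above, with the squared-ReLU feature map ϕ(x) = ReLU(x)², the √d scaling, and the running state (Sₜ, zₜ), can be sketched for a single head as below. The exact placement of the scaling and the numerical-stability epsilon are assumptions for illustration, not details confirmed by the paper.

```python
import numpy as np

def relu_sq(x):
    # Feature map phi(x) = ReLU(x)^2: elementwise, guarantees non-negativity.
    return np.maximum(x, 0.0) ** 2

def linear_attention(q, k, v, eps=1e-6):
    """Causal linear attention in recurrent form (Katharopoulos et al. style).

    Maintains a running key-value outer-product matrix S_t (d x d) and a
    normalization vector z_t (d,), so the total cost over a length-T prompt
    is O(T d^2). Single-head sketch; q, k, v each have shape (T, d).
    """
    T, d = q.shape
    scale = 1.0 / np.sqrt(d)          # sqrt(d) scaling applied before the map
    S = np.zeros((d, d))              # running sum of phi(k_t) v_t^T
    z = np.zeros(d)                   # running sum of phi(k_t)
    out = np.zeros((T, d))
    for t in range(T):
        phi_k = relu_sq(k[t] * scale)
        phi_q = relu_sq(q[t] * scale)
        S += np.outer(phi_k, v[t])
        z += phi_k
        # Output: numerator phi(q_t)^T S_t, normalized by phi(q_t)^T z_t.
        out[t] = (phi_q @ S) / (phi_q @ z + eps)
    return out
```

The recurrent form produces the same outputs as the parallel (masked) formulation but makes the O(T d²) complexity explicit, since each step only updates the fixed-size state (Sₜ, zₜ).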

Performance is measured by normalized mean-squared error (MSE) on the final query point of each prompt, with additional evaluation on an anisotropic test distribution where inputs are drawn from a diagonal covariance Σ = diag(…).

