Bayes optimal learning of attention-indexed models


We introduce the attention-indexed model (AIM), a theoretical framework for analyzing learning in deep attention layers. Inspired by multi-index models, AIM captures how token-level outputs emerge from layered bilinear interactions over high-dimensional embeddings. Unlike prior tractable attention models, AIM allows full-width key and query matrices, aligning more closely with practical transformers. Using tools from statistical mechanics and random matrix theory, we derive closed-form predictions for Bayes-optimal generalization error and identify sharp phase transitions as a function of sample complexity, model width, and sequence length. We propose a matching approximate message passing algorithm and show that gradient descent can reach optimal performance. AIM offers a solvable playground for understanding learning in self-attention layers, which are key components of modern architectures.


💡 Research Summary

This paper introduces the Attention‑Indexed Model (AIM), a high‑dimensional synthetic data framework designed to capture the essential structure of self‑attention layers in modern transformers. In AIM each token is represented by a d‑dimensional Gaussian embedding x_a (a = 1…T). For each of L layers a learnable symmetric matrix S^ℓ∈ℝ^{d×d} interacts bilinearly with the embeddings to produce an “attention index”
h^{ℓ}_{ab} = (x_a^{⊤} S^{ℓ} x_b − δ_{ab} Tr S^{ℓ}) / √d.
When S^{ℓ}= (1/√{r_ℓ d}) W_ℓW_ℓ^{⊤} with W_ℓ∈ℝ^{d×r_ℓ}, the model reproduces the key‑query product of a transformer; r_ℓ is the width of the ℓ‑th attention head. Crucially, the authors allow r_ℓ to scale proportionally with d (ρ_ℓ=r_ℓ/d=Θ(1)), a regime they call extensive‑width, which matches practical architectures where key and query matrices are full‑rank.
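The definitions above can be sketched numerically. The following is a minimal NumPy illustration (dimensions `d`, `T`, `r` are arbitrary example values, not taken from the paper) of one layer's attention index: it builds S = W W⊤/√(r d) from a Gaussian W and evaluates h_ab = (x_a⊤ S x_b − δ_ab Tr S)/√d for all token pairs at once.

```python
import numpy as np

rng = np.random.default_rng(0)

d, T, r = 64, 8, 32                      # embedding dim, tokens, head width (rho = r/d = 0.5)

X = rng.standard_normal((T, d))          # Gaussian token embeddings x_a, a = 1..T
W = rng.standard_normal((d, r))          # key/query factor W in R^{d x r}
S = W @ W.T / np.sqrt(r * d)             # S = W W^T / sqrt(r d), symmetric by construction

# Attention index h_ab = (x_a^T S x_b - delta_ab Tr S) / sqrt(d), as a T x T matrix;
# the trace subtraction centers the diagonal, keeping all entries O(1) as d grows.
H = (X @ S @ X.T - np.eye(T) * np.trace(S)) / np.sqrt(d)
```

Because S is symmetric, H is symmetric as well, reflecting that the index encodes an unordered pairwise interaction between tokens.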

The data generation process stacks L such indices through a nonlinear map g that typically involves a softmax with inverse temperature β. The output y∈ℝ^{T×T} encodes pairwise token relationships. The authors study two learning tasks: (i) estimating the hidden matrices S^{ℓ} (estimation error) and (ii) predicting y for a new input (generalization error). They assume a Bayes‑optimal learner who knows the generative prior P_S and the nonlinearity g, and they seek the posterior mean estimator.
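A sketch of the generative process described above, with one loudly flagged assumption: the paper only specifies that g is a nonlinear map typically involving a softmax with inverse temperature β, so the particular composition used here (summing the L indices, then applying a row-wise softmax) is illustrative, not the paper's exact g.

```python
import numpy as np

def softmax_rows(Z, beta=1.0):
    """Row-wise softmax with inverse temperature beta (max-subtracted for stability)."""
    E = np.exp(beta * (Z - Z.max(axis=1, keepdims=True)))
    return E / E.sum(axis=1, keepdims=True)

def aim_sample(d=64, T=8, r=32, L=2, beta=1.0, rng=None):
    """Draw one (X, y) pair from an AIM-style generator.

    ASSUMPTION: the map g is taken to be a row-wise softmax of the sum of the
    L attention indices; the paper's actual composition may differ.
    """
    if rng is None:
        rng = np.random.default_rng()
    X = rng.standard_normal((T, d))                # Gaussian embeddings
    H_sum = np.zeros((T, T))
    for _ in range(L):                             # fresh S^l per layer
        W = rng.standard_normal((d, r))
        S = W @ W.T / np.sqrt(r * d)
        H_sum += (X @ S @ X.T - np.eye(T) * np.trace(S)) / np.sqrt(d)
    y = softmax_rows(H_sum, beta)                  # y in R^{T x T}, rows sum to 1
    return X, y
```

Under this choice each row of y is a probability distribution over the T tokens, matching the softmax-attention interpretation of the output as pairwise token relationships.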

Working in the high‑dimensional limit d→∞ with sample size n scaling as n = α d² (α = Θ(1)) and fixed token length T and depth L, the authors apply tools from statistical mechanics (the replica method) and random matrix theory. They reduce the problem to a set of low‑dimensional order parameters, chief among them the posterior overlap matrix q_{ℓk} = (1/d) E_post[…] (the source text is truncated here).

