A Capacity-Based Rationale for Multi-Head Attention


We study the capacity of the self-attention key-query channel: for a fixed budget, how many distinct token-token relations can a single layer reliably encode? We introduce Relational Graph Recognition, where the key-query channel encodes a directed graph and, given a context (a subset of the vertices), must recover the neighbors of each vertex in the context. We measure resources by the total key dimension $D_K = h\,d_k$. In a tractable multi-head model, we prove matching information-theoretic lower bounds and upper bounds via explicit constructions showing that recovering a graph with $m'$ relations in $d_{\text{model}}$-dimensional embeddings requires $D_K$ to grow essentially as $m'/d_{\text{model}}$ up to logarithmic factors, and we obtain corresponding guarantees for scaled-softmax attention. This analysis yields a new, capacity-based rationale for multi-head attention: even in permutation graphs, where all queries attend to a single target, splitting a fixed $D_K$ budget into multiple heads increases capacity by reducing interference from embedding superposition. Controlled experiments mirror the theory, revealing sharp phase transitions at the predicted capacity, and the multi-head advantage persists when adding softmax normalization, value routing, and a full Transformer block trained with frozen GPT-2 embeddings.


💡 Research Summary

The paper provides a rigorous capacity analysis of the self-attention key-query (Q-K) channel in Transformers. It asks: for a fixed budget of total key dimension $D_K = h\,d_k$ (where $h$ is the number of heads and $d_k$ the per-head key dimension), how many distinct token-to-token relationships can a single attention layer reliably encode? To answer this, the authors introduce a formal task called Relational Graph Recognition (RGR). In RGR, a directed graph $G=(V,E)$ with $m=|V|$ vertices and $m' = |E|$ edges is encoded in the embeddings of the vertices. Given any context $C\subseteq V$ (an ordered subset of vertices), the attention layer must output, for each vertex in the context, the set of its in-context neighbors $N_G(v;C)=\{v'\in C\mid (v,v')\in E\}$. The task isolates the Q-K computation (the "where-to-attend" part) while ignoring the value (V) pathway, thereby focusing on the resource that directly determines attention patterns.
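As a minimal illustration of the RGR recovery target, the in-context neighbor sets $N_G(v;C)$ can be computed directly from an edge list. This is a hypothetical sketch (the graph, context, and function name below are invented for illustration, not taken from the paper):

```python
def in_context_neighbors(edges, context):
    """For each vertex v in the ordered context C, return
    N_G(v; C) = {v' in C | (v, v') in E}."""
    ctx = set(context)
    return {v: {w for (u, w) in edges if u == v and w in ctx}
            for v in context}

# Toy directed graph on vertices {0, 1, 2, 3} with m' = 4 edges.
edges = {(0, 1), (0, 2), (1, 3), (2, 0)}
context = [0, 1, 2]  # vertex 3 is out of context

print(in_context_neighbors(edges, context))
# → {0: {1, 2}, 1: set(), 2: {0}}   (edge (1, 3) is dropped: 3 ∉ C)
```

The attention layer's job is to reproduce exactly this map using only the vertex embeddings and the Q-K projections, for every possible context.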

Two variants of the attention model are studied. The first, called "max-over-heads", equips each head $k$ with its own projection matrices $W_Q^{(k)}, W_K^{(k)}$ and computes a scalar score $S^{(k)}_{pq}=q^{(k)}_p\cdot k^{(k)}_q$. The final score for a pair $(p,q)$ is the maximum over heads, $S^{\max}_{pq}= \max_k S^{(k)}_{pq}$. An edge is declared present if this score exceeds a global threshold. This aggregation introduces a minimal non-linearity that mimics the competitive routing of softmax while remaining analytically tractable. The second variant is standard scaled-softmax attention, where each head produces a probability distribution over the context and the per-head probabilities are summed.
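The max-over-heads scoring rule can be sketched in a few lines of numpy. All sizes and the threshold below are illustrative placeholders, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, h, n = 16, 4, 2, 5   # illustrative sizes (not from the paper)

X = rng.standard_normal((n, d_model))          # token embeddings
W_Q = rng.standard_normal((h, d_model, d_k))   # per-head query projections W_Q^{(k)}
W_K = rng.standard_normal((h, d_model, d_k))   # per-head key projections   W_K^{(k)}

Q = np.einsum('nd,hdk->hnk', X, W_Q)           # q_p^{(k)} for each head k
K = np.einsum('nd,hdk->hnk', X, W_K)           # k_q^{(k)} for each head k
S = np.einsum('hpk,hqk->hpq', Q, K)            # per-head scores S^{(k)}_{pq}
S_max = S.max(axis=0)                          # S^{max}_{pq} = max_k S^{(k)}_{pq}

threshold = 0.0                                # placeholder global threshold
edges_pred = S_max > threshold                 # declared adjacency matrix
print(S_max.shape, edges_pred.shape)           # both (n, n)
```

Note that the max is taken over raw dot-product scores, before any normalization; this is what keeps the model analytically tractable compared to the softmax variant.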

The authors first derive an information-theoretic lower bound. To recover any graph from the family $\mathcal{G}_{m,m'}$ (all directed graphs on $m$ vertices with $m'$ edges) for all possible contexts, the total key dimension must satisfy, up to logarithmic factors,
$$D_K \gtrsim \frac{m'}{d_{\text{model}}}.$$
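As a back-of-the-envelope illustration of the $D_K \gtrsim m'/d_{\text{model}}$ scaling (the numbers below are invented for illustration, not results from the paper):

```python
# Illustrative capacity arithmetic: the lower bound says the total key
# dimension D_K must scale roughly like m' / d_model, up to log factors.
d_model = 768        # e.g. the GPT-2 embedding width mentioned in the abstract
m_prime = 100_000    # hypothetical number of relations to store

D_K_needed = m_prime / d_model
print(round(D_K_needed, 1))  # → 130.2
```

So storing on the order of $10^5$ relations in 768-dimensional embeddings would require a total key dimension in the low hundreds, however that budget is split across heads.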

