Sparse Attention as Compact Kernel Regression
Recent work has revealed a link between self-attention mechanisms in transformers and test-time kernel regression via the Nadaraya-Watson estimator, with standard softmax attention corresponding to a Gaussian kernel. However, a kernel-theoretic understanding of sparse attention mechanisms is currently missing. In this paper, we establish a formal correspondence between sparse attention and compact (bounded support) kernels. We show that normalized ReLU and sparsemax attention arise from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively. More generally, we demonstrate that widely used kernels in nonparametric density estimation – including Epanechnikov, biweight, and triweight – correspond to $\alpha$-entmax attention with $\alpha = 1 + \frac{1}{n}$ for $n \in \mathbb{N}$, while the softmax/Gaussian relationship emerges in the limit $n \to \infty$. This unified perspective explains how sparsity naturally emerges from kernel design and provides principled alternatives to heuristic top-$k$ attention and other associative memory mechanisms. Experiments with a kernel-regression-based variant of transformers – Memory Mosaics – show that kernel-based sparse attention achieves competitive performance on language modeling, in-context learning, and length generalization tasks, offering a principled framework for designing attention mechanisms.
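The abstract's limit claim can be illustrated numerically. With the bandwidth rescaled by a factor of $\sqrt{2n}$, the compact polynomial kernel $(1 - u^2/(2n))_+^n$ converges pointwise to the Gaussian $e^{-u^2/2}$ as $n \to \infty$, via $(1 - x/n)^n \to e^{-x}$. The sketch below is our own illustration, not code from the paper; the function name `compact_kernel` and the particular rescaling are ours:

```python
import numpy as np

def compact_kernel(u, n):
    # Compact-support polynomial kernel of order n, with bandwidth
    # rescaled by sqrt(2n) so that it approaches exp(-u**2 / 2).
    # Up to this rescaling, n=1 is the Epanechnikov shape,
    # n=2 biweight, n=3 triweight.
    return np.maximum(1.0 - u**2 / (2 * n), 0.0) ** n

u = 1.3
for n in (1, 10, 1000):
    print(n, compact_kernel(u, n))
print("gaussian", np.exp(-0.5 * u**2))
```

For small $n$ the kernel has bounded support (it is exactly zero for large $|u|$), while the Gaussian limit assigns positive weight everywhere; this is precisely the sparsity-versus-density distinction the paper develops.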
💡 Research Summary
This paper bridges the gap between sparse attention mechanisms in transformers and classical non‑parametric kernel regression. Building on the known equivalence between softmax attention and Gaussian‑kernel Nadaraya‑Watson estimation, the authors systematically replace the Gaussian kernel with compact‑support kernels (Epanechnikov, biweight, triweight, top‑k truncated Gaussian, and uniform kernels) and derive the corresponding attention transformations.
Key contributions are:
- Normalized ReLU as Epanechnikov regression – By expressing the Epanechnikov kernel in dot‑product form, the authors show that a ReLU activation followed by normalization yields exactly the weighting of Nadaraya‑Watson regression with an Epanechnikov kernel. Sparsity arises because the ReLU truncates negative similarities to zero.
- Auto‑normalization leading to sparsemax – Instead of fixing the kernel bandwidth, they adapt it so that the denominator of the Nadaraya‑Watson estimator is constant (equal to one). The normalized weights are then the rectified responses themselves, which matches the sparsemax formulation with threshold τ. Sparsemax can thus be interpreted as Epanechnikov regression with an adaptive bandwidth determined solely by the keys.
- α‑entmax ↔ higher‑order compact kernels – For any α > 1, α‑entmax's power‑law rectification corresponds to a compact polynomial kernel of order n = (α − 1)⁻¹: n = 1 gives Epanechnikov (sparsemax), n = 2 biweight, n = 3 triweight, and so on. As α → 1 (n → ∞) the kernel converges to a Gaussian, recovering softmax. This provides a unified view in which the sparsity level is controlled directly by the kernel's order.
- Top‑k and uniform kernels – Top‑k softmax is shown to be equivalent to a Gaussian kernel truncated to the k nearest neighbors, while a uniform kernel yields a top‑k uniform attention in which all points within a fixed radius receive equal weight. These formulations give a principled kernel‑theoretic justification for heuristic top‑k truncation.
- ReLUmax – A novel "max‑anchored" Epanechnikov variant that centers the kernel support on the highest‑scoring key, offering stronger sparsity while remaining differentiable.
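The first two correspondences above can be made concrete in a few lines. The sketch below is illustrative rather than the paper's implementation: it works on a raw score vector and omits bandwidth/temperature handling, and the function names are ours. Softmax yields dense weights, normalized ReLU truncates negative similarities, and sparsemax (adaptive normalization) additionally zeroes out low-scoring positions via the learned-free threshold τ:

```python
import numpy as np

def softmax_attn(scores):
    # Dense weights: Gaussian-kernel Nadaraya-Watson normalization
    e = np.exp(scores - scores.max())
    return e / e.sum()

def relu_attn(scores):
    # Normalized ReLU: Epanechnikov-kernel weights under a fixed
    # normalization; negative similarities become exact zeros
    r = np.maximum(scores, 0.0)
    s = r.sum()
    return r / s if s > 0 else np.full_like(scores, 1.0 / len(scores))

def sparsemax_attn(scores):
    # Sparsemax: Epanechnikov kernel with adaptive normalization.
    # Threshold tau is chosen so the rectified weights sum to one.
    z = np.sort(scores)[::-1]                # scores in descending order
    k = np.arange(1, len(z) + 1)
    css = np.cumsum(z)
    support = k * z > css - 1.0              # positions kept in the support
    k_max = k[support][-1]
    tau = (css[k_max - 1] - 1.0) / k_max
    return np.maximum(scores - tau, 0.0)
```

For example, on scores `[2.0, 1.0, -1.0]`, softmax assigns positive weight everywhere, while both sparse variants give the last position exactly zero weight.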
The authors embed these kernels into the Memory Mosaics architecture, which explicitly implements the Nadaraya‑Watson estimator for each query. Experiments on three fronts demonstrate the practical benefits:
- Language modeling (WikiText‑103) – Compact‑kernel attention (α‑entmax with α = 4/3 and 3/2, as well as ReLUmax) matches or slightly outperforms standard softmax in perplexity.
- In‑context learning – Sparse kernels reduce the influence of irrelevant context tokens, leading to higher accuracy on prompt‑based reasoning tasks.
- Length generalization – When sequence length is increased 2× or 4×, compact‑kernel attention mitigates attention dispersion, preserving representation quality and yielding more stable predictions than dense softmax.
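The Nadaraya‑Watson estimator that Memory Mosaics implement per query can be sketched as follows. This is an illustrative distance‑based version (the names `nadaraya_watson`, `epanechnikov`, and `gaussian` are ours), whereas the architecture scores keys via dot products; the point is only the structural contrast between a compact‑support kernel, which drops distant keys entirely, and a Gaussian kernel, which weights every key:

```python
import numpy as np

def epanechnikov(q, k, h=1.0):
    # Compact-support kernel: exactly zero once ||q - k|| exceeds the
    # bandwidth h, so far-away keys drop out of the estimate (sparsity)
    u = np.linalg.norm(q - k) / h
    return max(1.0 - u * u, 0.0)

def gaussian(q, k, h=1.0):
    # Dense kernel: every key receives positive weight (softmax-like)
    u = np.linalg.norm(q - k) / h
    return float(np.exp(-0.5 * u * u))

def nadaraya_watson(query, keys, values, kernel):
    # y(q) = sum_i K(q, k_i) v_i / sum_j K(q, k_j)
    w = np.array([kernel(query, k) for k in keys])
    s = w.sum()
    if s == 0.0:
        # Empty support (all keys outside the bandwidth): fall back
        # to the unweighted mean of the values
        return values.mean(axis=0)
    return (w @ values) / s
```

With a query near the first of two keys, the Epanechnikov estimate returns that key's value exactly, while the Gaussian estimate is pulled slightly toward the other value; this is the attention-dispersion effect the length-generalization experiments probe.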
Overall, the paper establishes a rigorous, unified framework: sparse attention = compact‑support kernel regression. It shows that kernel design directly dictates sparsity, locality, and adaptivity of attention, offering principled alternatives to heuristic sparsification methods and opening avenues for designing new attention mechanisms grounded in kernel theory.