Beyond Linearity in Attention Projections: The Case for Nonlinear Queries

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Recent algebraic analysis shows that in decoder-only and encoder-only transformers, the query projection $W_Q$ may be set to the identity without noticeable performance deterioration. This is possible because attention depends on $X$ only through the products $XW_Q, XW_K, XW_V$, allowing basis transformations to be absorbed by adjacent layers and propagated through the network. We replace $W_Q \in \mathbb{R}^{d \times d}$ with a nonlinear residual of the form $Q(X) = (X + f_\theta(X))/2$, where $f_\theta$ is a bottleneck MLP with $d^2 + O(d)$ parameters. The identity term anchors the nonlinearity to a known-good prior. Experiments on GPT-3-small-style models show consistent improvement over the baseline, comfortably outperforming a model with 12.5% more non-embedding parameters. These results motivate investigation at larger scales and across modalities.


💡 Research Summary

The paper revisits the role of the three linear projection matrices (W_Q, W_K, W_V) in transformer self‑attention. Building on the algebraic result of Karbevski and Mijoski (2025), the authors note that attention depends on the input X only through the products XW_Q, XW_K, and XW_V. Because these matrices are almost surely invertible, a change of basis Θ can be applied to the entire network without altering its output. Choosing Θ = W_Q effectively reduces the query projection to the identity matrix, demonstrating that a learned linear W_Q is redundant and can be absorbed by adjacent layers.
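The absorption argument can be checked numerically: attention logits depend on the projections only through $(XW_Q)(XW_K)^\top$, so a learned $W_Q$ can be folded into the key projection. A toy NumPy sketch (ours, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                       # toy sequence length and model dimension
X = rng.standard_normal((n, d))
W_Q = rng.standard_normal((d, d))
W_K = rng.standard_normal((d, d))

# Attention logits depend on the projections only via (X W_Q)(X W_K)^T.
logits = (X @ W_Q) @ (X @ W_K).T

# Fold W_Q into the keys: with W_Q' = I and W_K' = W_K @ W_Q.T,
# the logits are identical, so the linear W_Q is redundant.
logits_folded = X @ (X @ (W_K @ W_Q.T)).T
assert np.allclose(logits, logits_folded)
```

The same reasoning applies per head; the paper's basis-change result extends this absorption through adjacent layers of the full network.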

Motivated by this redundancy, the authors replace the linear query projection with a nonlinear residual function: Q(X) = (X + f_θ(X)) / 2. Here f_θ is a bottleneck MLP that first reduces the dimensionality from d to r = d/2 using a linear layer, applies RMSNorm, a GELU activation, a second linear layer back to d, and finally LayerNorm. The total number of learnable parameters in this module is d² + O(d), matching the parameter budget of the original W_Q. The 1/2 scaling follows the original theoretical analysis and ensures that the identity term provides a stable skip connection for gradient flow.
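A minimal NumPy sketch of this module, following the order of operations described above (linear down-projection, RMSNorm, GELU, linear up-projection, LayerNorm). The learnable norm gains and biases that make up the $O(d)$ terms are omitted, and the function names are ours, not the paper's:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def nonlinear_query(X, W_down, W_up):
    """Q(X) = (X + f_theta(X)) / 2 with a d -> d/2 -> d bottleneck."""
    h = layer_norm(gelu(rms_norm(X @ W_down)) @ W_up)
    return (X + h) / 2

d, r = 768, 384                    # bottleneck ratio r = d/2
rng = np.random.default_rng(0)
X = rng.standard_normal((16, d))
W_down = rng.standard_normal((d, r)) * 0.02
W_up = rng.standard_normal((r, d)) * 0.02

Q = nonlinear_query(X, W_down, W_up)
assert Q.shape == X.shape
# Weight parameters: d*r + r*d = d^2, matching the budget of a linear W_Q.
assert W_down.size + W_up.size == d * d
```

With r = d/2, the two bottleneck matrices together contain exactly d² weights, so the module stays within the original query projection's parameter budget up to the O(d) normalization terms.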

Experiments are conducted on a GPT‑3‑small‑style model (12 layers, 12 heads, model dimension d = 768) using the NanoGPT codebase and the OpenWebText dataset. Training runs for 60 k steps (≈29 B tokens) on a single RTX 5090 GPU, far beyond the Chinchilla‑optimal token count for a 124 M‑parameter model. Three configurations are compared: (1) the baseline with standard linear W_Q, (2) a widened MLP baseline where the feed‑forward hidden dimension is increased to 4.75 d (adding 12.5 % non‑embedding parameters), and (3) the proposed nonlinear query model. Hyper‑parameters such as learning rate and weight decay are tuned; the nonlinear model tolerates a higher learning rate (up to 3 × 10⁻³) and a lower weight decay (≈0.03) while remaining stable, whereas the baseline diverges at weight decay 0.05.
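The 12.5% figure for the widened baseline is consistent with a standard GPT block layout, assuming 4d² attention parameters (W_Q, W_K, W_V, W_O) and a feed-forward hidden width of 4d, and ignoring biases and norm parameters. A quick arithmetic check under those assumptions:

```python
d = 768
attn = 4 * d * d           # W_Q, W_K, W_V, W_O
mlp = 2 * 4 * d * d        # up- and down-projections with hidden width 4d
base = attn + mlp          # 12 d^2 per layer (biases and norms ignored)

mlp_wide = 2 * int(4.75 * d) * d   # hidden width widened to 4.75 d
wide = attn + mlp_wide             # 13.5 d^2 per layer

print(round((wide - base) / base * 100, 2))   # -> 12.5
```

Going from 12d² to 13.5d² per layer is exactly a 12.5% increase in non-embedding parameters, matching the configuration described above.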

Results show that the nonlinear query model achieves a validation loss of 2.915 at step 59 k, a 1.40 % relative improvement over the baseline loss of 2.956. The widened MLP baseline improves only 0.98 % despite having 12.5 % more parameters. The gain from the nonlinear query is most pronounced during the warm‑up phase, dips slightly in the middle of training, and re‑emerges in the final phase. Moreover, the nonlinear model remains stable under aggressive learning‑rate and weight‑decay settings that cause the baseline to diverge, confirming the stabilizing effect of the identity anchor.

The authors acknowledge several limitations: experiments are limited to a single model scale and a single random seed; the hyper‑parameter space (normalization choices, activation functions, bottleneck ratio) is only sparsely explored; inference speed and memory overhead are not measured; and extensions to nonlinear K and V projections are left for future work. They also note that the bottleneck MLP introduces a serial dependency before attention, which may require custom kernel implementations for efficient deployment.

Future directions include systematic exploration of alternative normalizations (e.g., removing LN/RMSNorm), testing other activations (ReLU, ReLU²), applying similar residual nonlinearities to K and V, scaling experiments on larger models and multiple seeds, integrating the approach with LoRA‑style adapters for fine‑tuning, and fusing the bottleneck MLP with the K/V projections to reduce overall FLOPs.

In summary, the paper provides a theoretically grounded yet practically simple modification: replace the redundant linear query projection with a residual nonlinear MLP anchored to the identity. This change preserves the parameter budget, improves validation loss, and enhances training stability on a standard language‑model benchmark. The results suggest that even at modest scales, introducing nonlinearity into the query pathway yields measurable benefits, and the approach opens a new line of inquiry into more expressive, yet efficient, transformer architectures.

