Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers

The dense output projection in multi-head attention scales quadratically with model dimension, contributing significantly to parameter count, memory footprint, and inference cost. We propose replacing this projection with a fixed, parameter-free Walsh-Hadamard Transform (WHT) followed by a diagonal affine transformation. This approach eliminates approximately 25% of the attention parameters per block while maintaining global cross-head interaction through an orthogonal, norm-preserving transformation. Our results demonstrate that WHT-augmented models exhibit a steeper validation-loss curve relative to training FLOPs than dense baselines, suggesting superior compute utilization during training. Crucially, we show that the efficiency gains, including reduced memory footprint and increased throughput, grow monotonically with model size, batch size, and sequence length. We evaluate performance across both prefill and decoding stages, finding that the structured transform consistently outperforms dense projections as complexity increases. Our findings indicate that replacing dense projections with structured transforms yields more compute-efficient architectures that achieve lower loss than dense models at an equivalent training budget.


💡 Research Summary

The paper tackles a largely overlooked source of inefficiency in modern Transformer architectures: the dense output projection in multi‑head attention (MHA). This projection, which mixes the concatenated head outputs back into the model dimension, requires a (d_{\text{model}}\times d_{\text{model}}) weight matrix, contributing roughly 25 % of the parameters in each attention block and incurring (O(d_{\text{model}}^{2})) FLOPs. As models scale, this becomes a significant bottleneck in both memory consumption and compute cost.
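As a quick sanity check on the 25% figure, here is some illustrative arithmetic (my own sketch, not taken from the paper; the `d_model` value is hypothetical): a standard MHA block holds four dense (d_{\text{model}}\times d_{\text{model}}) projections (W_Q, W_K, W_V, W_O), so dropping the output projection removes one quarter of them.

```python
# Illustrative parameter/FLOP arithmetic for a standard MHA block
# (hypothetical d_model; the paper's exact configurations are not assumed here).
d_model = 4096

proj_params = d_model * d_model   # one dense d_model x d_model projection
attn_params = 4 * proj_params     # W_Q, W_K, W_V and the output W_O

# Fraction of attention parameters removed by dropping W_O:
fraction_removed = proj_params / attn_params
print(fraction_removed)           # -> 0.25

# Per-token FLOPs of the dense output projection
# (counting a multiply-accumulate as 2 FLOPs):
dense_flops = 2 * d_model * d_model
print(dense_flops)                # -> 33554432
```

The same quarter-share argument holds for any head count, since the concatenated head outputs always span the full model dimension.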

To address this, the authors replace the learned dense projection with a fixed, parameter‑free Walsh‑Hadamard Transform (WHT) followed by a learned diagonal scaling vector (\alpha) and bias (\beta). The WHT is an orthogonal matrix composed of (\pm1) entries; after normalisation it satisfies (H^{\top}H=I) and preserves the (\ell_{2}) norm of its input. The forward operation can be written as
\[
y = \alpha \odot (Hx) + \beta, \qquad H = \frac{1}{\sqrt{d_{\text{model}}}}\,H_{d_{\text{model}}},
\]
where (x) is the concatenation of the head outputs and (\odot) denotes elementwise multiplication.
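A minimal NumPy sketch of this replacement (my own illustration, assuming the standard fast Walsh-Hadamard butterfly; the function names are not from the paper): the transform runs in O(d log d) rather than the O(d^2) of a dense matrix multiply, and with the 1/sqrt(d) normalisation it is orthogonal, so it preserves the l2 norm as stated above.

```python
import numpy as np

def fwht(x):
    """Normalised fast Walsh-Hadamard transform along the last axis.

    The last-axis length must be a power of two. Costs O(d log d)
    versus O(d^2) for a dense projection.
    """
    x = np.array(x, dtype=np.float64)  # work on a copy
    d = x.shape[-1]
    assert d & (d - 1) == 0, "last-axis length must be a power of two"
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b           # butterfly: sums
            x[..., i + h:i + 2 * h] = a - b   # butterfly: differences
        h *= 2
    return x / np.sqrt(d)  # with this scaling, H is orthogonal

def wht_output_projection(heads_concat, alpha, beta):
    """y = alpha * (H x) + beta: parameter-free mixing via the WHT,
    followed by a learned diagonal scale `alpha` and bias `beta`
    (2*d parameters instead of d^2)."""
    return alpha * fwht(heads_concat) + beta
```

Because the normalised Hadamard matrix is symmetric as well as orthogonal, `fwht` is its own inverse, which gives a cheap correctness check alongside norm preservation.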

