WUSH: Near-Optimal Adaptive Transforms for LLM Quantization
Quantizing LLM weights and activations is a standard approach for efficient deployment, but a few extreme outliers can stretch the dynamic range and amplify low-bit quantization errors. Prior transform-based mitigations (e.g., Hadamard rotations) are fixed and data-agnostic, and their optimality for quantization has remained unclear. We derive closed-form optimal linear blockwise transforms for joint weight-activation quantization under standard RTN AbsMax-scaled block quantizers, covering both integer and floating-point formats. The resulting construction, WUSH, combines a Hadamard backbone with a data-dependent second-moment component to form a non-orthogonal transform that is provably near-optimal for FP and INT quantizers under mild assumptions while admitting an efficient fused GPU implementation. Empirically, WUSH improves W4A4 accuracy over the strongest Hadamard-based baselines (e.g., on Llama-3.1-8B-Instruct in MXFP4, it gains +2.8 average points with RTN and +0.7 with GPTQ) while delivering up to 6.6$\times$ per-layer throughput over BF16 via FP4 MatMul. Source code is available at https://github.com/IST-DASLab/WUSH.
💡 Research Summary
Quantizing large language models (LLMs) to ultra‑low precision (e.g., FP4 or INT4) is hampered by a handful of extreme outliers in weights and activations. These outliers inflate the dynamic range, forcing the quantization scale to be large and dramatically increasing rounding error. Prior work has mitigated this problem by applying fixed, data‑agnostic orthogonal transforms—most commonly Hadamard rotations—either globally or block‑wise. While effective, such transforms do not adapt to the actual statistics of a given layer, and their optimality for quantization has never been formally established.
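As a minimal illustration of the outlier problem, the toy example below quantizes a vector containing one extreme outlier on an AbsMax-scaled symmetric INT4 grid, with and without a Hadamard rotation; the helper names (`rtn_absmax_int4`, `hadamard`) are illustrative, not the paper's code:

```python
import numpy as np

def rtn_absmax_int4(x):
    """Round-to-nearest quantization on a symmetric INT4 grid with one AbsMax scale."""
    scale = np.abs(x).max() / 7.0          # symmetric INT4 range: [-7, 7]
    return np.clip(np.round(x / scale), -7, 7) * scale

def hadamard(n):
    """Normalized Sylvester Hadamard matrix (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
x[0] = 50.0                                # a single extreme outlier

H = hadamard(64)
err_plain = np.linalg.norm(x - rtn_absmax_int4(x))
# Quantize in the rotated basis, then rotate back before measuring the error.
err_rot = np.linalg.norm(x - H.T @ rtn_absmax_int4(H @ x))
print(err_plain, err_rot)
```

The outlier forces a large scale, so the other 63 coordinates all round away; after the rotation the outlier's energy is spread evenly across coordinates and the rounding error drops substantially.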
WUSH (Weighted‑U‑S‑Hadamard) addresses this gap by deriving a closed‑form, data‑dependent linear transform that is provably near‑optimal for standard round‑to‑nearest (RTN) AbsMax‑scaled block quantizers. The authors formulate the quantization loss for a single block as the Frobenius norm of the difference between the quantized output and the exact output. By assuming unbiased stochastic quantization, they split the loss into two non‑negative terms that can be minimized independently. The key insight is that each term can be expressed as an expectation over a transformed random variable y = W′ᵀ x, where W′ and X′ are lower‑triangular Cholesky factors of the second‑moment matrices of the weights and calibration activations, respectively.
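The whitening ingredients above can be sketched with NumPy; the shapes, sample counts, and variable names below are illustrative assumptions, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_calib, n_out = 8, 256, 16

W = rng.standard_normal((d, n_out))        # weight matrix (columns = output units)
X = rng.standard_normal((d, n_calib))      # calibration activations (columns = samples)

# Empirical second-moment matrices of the weights and activations.
Sigma_w = W @ W.T / n_out
Sigma_x = X @ X.T / n_calib

# Lower-triangular Cholesky factors: Sigma = L @ L.T
W_prime = np.linalg.cholesky(Sigma_w)
X_prime = np.linalg.cholesky(Sigma_x)

assert np.allclose(W_prime @ W_prime.T, Sigma_w)
```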
The optimal transforms for a block are then shown to be

T_wush = H S^{−½} Uᵀ W′ᵀ
T_xvsh = H S^{−½} Vᵀ X′ᵀ

where H is a normalized Hadamard matrix (the sole data‑agnostic component), S is the diagonal matrix of singular values from the SVD W′ᵀ X′ = U S Vᵀ, and the Cholesky factors capture the second‑moment structure of the data. Importantly, T_xvsh = T_wush^{−⊤}, which guarantees that the transformed weight and activation spaces are perfectly aligned: the two transforms cancel in the exact (unquantized) product.
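Given these definitions, the construction can be checked numerically. The snippet below builds both transforms from random stand-ins for the Cholesky factors W′ and X′ and verifies the inverse-transpose identity (a sketch, not the paper's implementation):

```python
import numpy as np

def hadamard(n):
    """Normalized Sylvester Hadamard matrix (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(2)
d = 8

# Random SPD second moments stand in for the weight/activation statistics.
A = rng.standard_normal((d, 2 * d))
B = rng.standard_normal((d, 2 * d))
W_prime = np.linalg.cholesky(A @ A.T / (2 * d))
X_prime = np.linalg.cholesky(B @ B.T / (2 * d))

# SVD of W'^T X' supplies U, S, V; H is the data-agnostic backbone.
U, s, Vt = np.linalg.svd(W_prime.T @ X_prime)
H = hadamard(d)
S_inv_sqrt = np.diag(s ** -0.5)

T_wush = H @ S_inv_sqrt @ U.T @ W_prime.T
T_xvsh = H @ S_inv_sqrt @ Vt @ X_prime.T

# Inverse-transpose identity: (T_wush w)^T (T_xvsh x) == w^T x exactly.
assert np.allclose(T_xvsh, np.linalg.inv(T_wush).T)
```

The final assertion is exactly the alignment property claimed above: because T_xvsh = T_wush^{−⊤}, transforming both sides leaves the unquantized matmul unchanged.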
The authors prove that for floating‑point (FP) block quantizers the above construction exactly minimizes the expected quantization error, and for integer (INT) quantizers it becomes asymptotically optimal as the block dimension grows. This theoretical result explains why Hadamard alone works reasonably well (it is the only data‑independent orthogonal factor) and why augmenting it with a data‑driven whitening/scaling step yields substantial gains.
Algorithm 1 details a practical pipeline. First, the second‑moment matrix of the calibration activations is estimated (either directly from data or via the Hessian used in GPTQ), and a Cholesky decomposition yields X′. The product Y = W′ᵀ X′ is computed, and for each block the SVD of the corresponding sub‑matrix of Y provides U, S, and V. The block‑wise transform T_wush and the transformed weight block \bar{W} = H S^{½} Uᵀ are then assembled. The method can be combined with GPTQ: after constructing the transform, GPTQ is applied to the transformed weight block using the transformed Hessian, with error propagation split into intra‑block and inter‑block updates to keep the schedule tractable. If GPTQ is not used, a simple RTN fallback quantizes \bar{W} directly.
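The RTN fallback path can be sketched per block as follows, assuming a square d×d block of Y and a symmetric integer grid for illustration; `wush_rtn_fallback` and `rtn_absmax` are hypothetical helper names, not the paper's code:

```python
import numpy as np

def hadamard(n):
    """Normalized Sylvester Hadamard matrix (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def rtn_absmax(x, levels=7):
    """RTN AbsMax quantizer on a symmetric integer grid (INT4-style for illustration)."""
    scale = np.abs(x).max() / levels
    if scale == 0.0:
        return x
    return np.clip(np.round(x / scale), -levels, levels) * scale

def wush_rtn_fallback(Y_block):
    """One block of the RTN fallback: SVD of the block of Y, assemble the
    transformed weights W_bar = H S^{1/2} U^T, and quantize W_bar directly."""
    d = Y_block.shape[0]
    U, s, _ = np.linalg.svd(Y_block)
    H = hadamard(d)
    W_bar = H @ np.diag(np.sqrt(s)) @ U.T
    return rtn_absmax(W_bar)

rng = np.random.default_rng(3)
Y_block = rng.standard_normal((8, 8))
W_bar_q = wush_rtn_fallback(Y_block)
print(W_bar_q.shape)  # (8, 8)
```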
Experimental evaluation spans several recent LLMs (Llama‑3.1‑8B‑Instruct, Llama‑2‑13B, etc.) and compares WUSH against the strongest Hadamard‑based baselines (plain Hadamard, calibrated Hadamard, WUS). In the MXFP4 (FP4) regime with RTN, WUSH improves W4A4 accuracy by an average of +2.8 points; with GPTQ it adds +0.7 points. Throughput measurements on a modern GPU show that the fused WUSH + FP4 MatMul kernel achieves up to 6.6× the per‑layer speed of BF16, while the overhead of the transform is comparable to that of a plain Hadamard kernel. Visualizations (Figure 1) illustrate how the transform reshapes the quantization error ellipsoids, aligning the major axis of the data distribution with regions of error reduction.
The paper also discusses practical considerations: the transform is block‑diagonal, so it can be pre‑computed per layer and stored alongside quantized weights; the only extra memory cost is the Hadamard matrix (which can be generated on‑the‑fly). The method does not require any fine‑tuning of the model after quantization, making it attractive for rapid deployment pipelines.
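Because the transform is block-diagonal, it can be applied as a batched matmul over the blocks rather than a full matrix product. A minimal NumPy sketch of this idea (illustrative only; the paper's fused GPU kernel is not shown here):

```python
import numpy as np

def apply_blockwise(T_blocks, x):
    """Apply a block-diagonal transform stored as a stack of (b, b) blocks
    to a vector of length n_blocks * b, without materializing the full matrix."""
    n_blocks, b, _ = T_blocks.shape
    return (T_blocks @ x.reshape(n_blocks, b, 1)).reshape(-1)

rng = np.random.default_rng(4)
n_blocks, b = 4, 8
T_blocks = rng.standard_normal((n_blocks, b, b))
x = rng.standard_normal(n_blocks * b)

# Cross-check against the explicit block-diagonal matrix.
T_full = np.zeros((n_blocks * b, n_blocks * b))
for i in range(n_blocks):
    T_full[i * b:(i + 1) * b, i * b:(i + 1) * b] = T_blocks[i]
assert np.allclose(apply_blockwise(T_blocks, x), T_full @ x)
```

The batched form costs O(n_blocks · b²) instead of O((n_blocks · b)²), which is what makes the per-layer overhead comparable to a plain Hadamard kernel.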
In summary, WUSH provides a mathematically grounded, data‑adaptive alternative to fixed orthogonal rotations. By blending a Hadamard backbone with a second‑moment whitening step, it attains provable near‑optimality for both floating‑point and integer quantizers, integrates cleanly with existing PTQ tools like GPTQ, and delivers substantial accuracy and speed gains on real‑world LLM workloads. This work bridges the gap between theory and practice in low‑bit LLM quantization and opens avenues for further research on larger block sizes, alternative quantization schemes, and hardware‑specific optimizations.