ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

Large language models have massive memory footprints that severely limit deployment on consumer hardware. Quantization reduces memory through lower numerical precision, but extreme 2-bit quantization suffers catastrophic performance loss due to outliers in activations. Rotation-based methods such as QuIP and QuaRot apply orthogonal transforms to eliminate outliers before quantization, exploiting computational invariance: $\mathbf{y} = \mathbf{Wx} = (\mathbf{WQ}^T)(\mathbf{Qx})$ for orthogonal $\mathbf{Q}$. However, these methods use fixed transforms (Hadamard matrices, which achieve the optimal worst-case coherence $\mu = 1/\sqrt{n}$) that cannot adapt to specific weight distributions. We identify that different transformer layers exhibit distinct outlier patterns, motivating layer-adaptive rotations rather than one-size-fits-all approaches. In this work, we propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles. Unlike Hadamard's discrete $\{+1, -1\}$ entries, which are non-differentiable and thus prohibit gradient-based learning, the butterfly transform's continuous parameterization enables smooth optimization while guaranteeing orthogonality by construction. This orthogonal constraint preserves theoretical guarantees on outlier suppression while achieving $O(n \log n)$ computational complexity with only $\frac{n \log n}{2}$ learnable parameters. We further introduce a uniformity regularization on post-transformation activations to promote smoother distributions amenable to quantization. Learning requires only 128 calibration samples and converges in minutes on a single GPU.


💡 Research Summary

ButterflyQuant tackles the longstanding challenge of extreme (2‑bit) quantization for large language models (LLMs) by replacing the fixed Hadamard rotations used in prior rotation‑based methods with learnable butterfly transforms. The authors first identify that different transformer layers—attention, early MLP, and late MLP—exhibit distinct outlier patterns in their activation distributions, which a single, data‑agnostic rotation cannot adequately suppress. While Hadamard matrices achieve the optimal worst‑case coherence μ = 1/√n, their discrete {+1, −1} entries prevent gradient‑based adaptation, leading to a one‑size‑fits‑all limitation.
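The coherence claim is easy to check numerically: the Sylvester-constructed Hadamard matrix, once normalized, has every entry of magnitude exactly 1/√n. A small NumPy check (not from the paper):

```python
import numpy as np

# Sylvester construction of a normalized n x n Hadamard matrix.
H = np.array([[1.0]])
for _ in range(4):  # n = 2^4 = 16
    H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
n = H.shape[0]
H = H / np.sqrt(n)  # orthonormal rows and columns

# Every entry has magnitude 1/sqrt(n): the optimal worst-case coherence.
assert np.allclose(np.abs(H), 1 / np.sqrt(n))
assert np.allclose(H @ H.T, np.eye(n))
```

The entries are all ±1/√n, which is exactly what makes the matrix optimal for spreading outlier energy evenly, and exactly what makes it non-differentiable.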

ButterflyQuant introduces a structured orthogonal matrix factorized into log₂n sparse layers of 2×2 Givens rotations. Each layer contains n/2 independent rotations, parameterized by continuous angles θ∈ℝ. This parameterization guarantees orthogonality by construction, enables smooth back‑propagation, and requires only n·log₂n⁄2 learnable parameters—orders of magnitude fewer than full‑matrix approaches (e.g., SpinQuant). The computational cost remains O(n log n), identical to the fast Hadamard transform, allowing the reuse of existing optimized kernels with minimal overhead.
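The factorization described above can be sketched in NumPy: each of the log₂ n layers applies n/2 independent 2×2 Givens rotations, and because every rotation is orthogonal, the composed transform preserves norms by construction. This is an illustrative sketch (the function name, angle layout, and pairing order are ours), not the paper's implementation:

```python
import numpy as np

def butterfly_transform(x, thetas):
    """Apply log2(n) layers of 2x2 Givens rotations to vector x.

    thetas has shape (log2(n), n // 2): one angle per rotation,
    n/2 rotations per layer -> (n log2 n)/2 parameters in total.
    Illustrative sketch only, not the paper's implementation.
    """
    n = x.shape[0]
    levels = int(np.log2(n))
    x = x.copy()
    for level in range(levels):
        stride = 1 << level  # pairing distance doubles at each layer
        k = 0
        for start in range(0, n, 2 * stride):
            for j in range(start, start + stride):
                a, b = x[j], x[j + stride]
                c, s = np.cos(thetas[level, k]), np.sin(thetas[level, k])
                x[j], x[j + stride] = c * a - s * b, s * a + c * b
                k += 1
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
thetas = rng.standard_normal((3, 4))  # log2(8) = 3 layers, 4 angles each
y = butterfly_transform(x, thetas)
```

Since each layer is a direct sum of plane rotations, the full transform is orthogonal for any choice of angles, so the norm of `y` equals the norm of `x` exactly; in practice the layers would be fused into an O(n log n) kernel rather than looped in Python.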

The paper proves that any Hadamard matrix of size 2ᵏ can be exactly expressed as a butterfly transform with specific angle choices (θ = π/4 for the base 2×2 block) and sign‑diagonal matrices, establishing that butterfly transforms are strictly more expressive than fixed Hadamard rotations.
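The base case of this equivalence is easy to verify: a 2×2 Givens rotation at θ = π/4, composed with a sign-diagonal matrix, reproduces the normalized 2×2 Hadamard block. A minimal NumPy check:

```python
import numpy as np

theta = np.pi / 4
G = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # Givens rotation
S = np.diag([1.0, -1.0])                         # sign-diagonal correction

H2 = (1 / np.sqrt(2)) * np.array([[1.0, 1.0],
                                  [1.0, -1.0]])  # normalized Hadamard block

# G @ S flips the sign of G's second column, recovering H2 exactly.
assert np.allclose(G @ S, H2)
```

Because the Hadamard point lies in the interior of the continuous angle space, gradient descent can start from it and move away wherever the calibration data favors a different rotation.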

Training proceeds on a tiny calibration set (128 samples). The authors apply the butterfly transform to both weights and activations (y = (WQᵀ)(Qx)), preserving the original output while redistributing activation magnitudes. To further aid quantization, they add a uniformity regularization term that encourages post‑transform activations to follow a smoother distribution, reducing the dynamic range that 2‑bit quantizers must represent. Convergence is achieved within minutes on a single GPU.
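The invariance identity y = (WQᵀ)(Qx) holds for any orthogonal Q, which is why the rotation can be folded into the weights without changing the model's outputs. A minimal numerical check, using a random orthogonal matrix from a QR decomposition as a stand-in for the learned butterfly transform:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
W = rng.standard_normal((n, n))  # toy weight matrix
x = rng.standard_normal(n)       # toy activation vector

# Any orthogonal Q (here from QR) leaves the layer output unchanged:
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
y_ref = W @ x
y_rot = (W @ Q.T) @ (Q @ x)
```

`y_rot` matches `y_ref` up to floating-point error; quantization then acts on the rotated pair (WQᵀ, Qx), whose redistributed magnitudes are far friendlier to a 2-bit grid.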

Empirical evaluation spans several state‑of‑the‑art LLMs (LLaMA‑70B, OPT‑66B, GPT‑NeoX). Compared with prior rotation‑based baselines (QuIP, QuaRot) and other PTQ methods, ButterflyQuant consistently yields higher downstream task accuracy, with improvements ranging from 1.2% to 2.0% absolute on standard benchmarks. The method also demonstrates negligible inference overhead (≈2‑3% extra compute) and maintains the O(n log n) runtime profile. For non‑power‑of‑two dimensions (e.g., 5120), the authors construct composite transforms using Kronecker products, preserving orthogonality and efficiency.
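The Kronecker construction for non-power-of-two dimensions rests on the fact that the Kronecker product of two orthogonal matrices is itself orthogonal. A small-scale NumPy check, where 8 × 5 = 40 stands in for a factorization such as 1024 × 5 = 5120 (the factor sizes here are our illustration, not necessarily the paper's choice):

```python
import numpy as np

def random_orthogonal(n, rng):
    """Draw a random n x n orthogonal matrix via QR decomposition."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q

rng = np.random.default_rng(0)
A = random_orthogonal(8, rng)  # power-of-two factor (butterfly-sized)
B = random_orthogonal(5, rng)  # small odd factor
Q = np.kron(A, B)              # 40 x 40, orthogonal by construction
```

Because (A ⊗ B)ᵀ(A ⊗ B) = (AᵀA) ⊗ (BᵀB) = I ⊗ I, orthogonality is preserved, and the product can be applied in two cheap stages rather than as one dense matrix multiply.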

Limitations include dependence on the calibration data distribution—significant shifts may require re‑learning the angles—and the need for additional bookkeeping of per‑layer angle parameters, which modestly increases model storage. Nevertheless, the memory savings from 2‑bit quantization (4‑8× reduction) far outweigh these costs.

In summary, ButterflyQuant bridges the gap between fixed, theoretically optimal rotations and fully learnable orthogonal matrices. By leveraging a continuous, structured butterfly parameterization, it adapts to layer‑specific outlier characteristics while retaining orthogonal guarantees and computational efficiency, setting a new state‑of‑the‑art for ultra‑low‑bit LLM quantization.

