LATMiX: Learnable Affine Transformations for Microscaling Quantization of LLMs
Post-training quantization (PTQ) is a widely used approach for reducing the memory and compute costs of large language models (LLMs). Recent studies have shown that applying invertible transformations to activations can significantly improve quantization robustness by reducing activation outliers; however, existing approaches are largely restricted to rotation- or Hadamard-based transformations. Moreover, most studies have focused primarily on traditional quantization schemes, whereas modern hardware increasingly supports the microscaling (MX) data format. Attempts to combine the two have shown severe performance degradation, leading prior work to impose restrictive assumptions on the transformations. In this work, we take a complementary perspective. First, we provide a theoretical analysis of transformations under MX quantization by deriving a bound on the quantization error. Our analysis emphasizes the importance of accounting for both the activation distribution and the underlying quantization structure. Building on this analysis, we propose LATMiX, a method that generalizes outlier reduction to learnable invertible affine transformations optimized with standard deep learning tools. Experiments show consistent improvements in average accuracy for low-bit MX quantization over strong baselines on a wide range of zero-shot benchmarks and across multiple model sizes.
💡 Research Summary
The paper tackles a pressing problem in the deployment of large language models (LLMs): how to combine the emerging microscaling (MX) quantization format with activation outlier mitigation techniques without sacrificing accuracy. MX quantization partitions tensors into small blocks, each with its own dynamic scaling factor, which is highly effective for modern low‑precision formats such as FP4 or INT4. However, prior work that simply applies global rotation or Hadamard transforms before MX quantization suffers severe performance drops because the block‑wise scaling conflicts with the global mixing of channels. Existing remedies restrict the transformation to block‑diagonal matrices, thereby preventing cross‑block redistribution of activation mass and limiting outlier suppression.
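The core failure mode described above is easy to reproduce. The sketch below is a toy NumPy illustration of MX-style block quantization (function names and the symmetric INT4 range are our assumptions, not the paper's code): each block of contiguous elements shares one power-of-two scale, so a single outlier inflates its block's scale and flushes that block's small values to zero.

```python
import numpy as np

def mx_int4_quantize(x, block_size=32):
    """Toy MX-style INT4 quantization: each block of `block_size`
    contiguous elements shares one power-of-two scale (as in MX)."""
    x = np.asarray(x, dtype=np.float64)
    blocks = x.reshape(-1, block_size)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    amax = np.where(amax == 0.0, 1.0, amax)          # avoid log2(0)
    # Smallest power-of-two scale covering the block max in [-7, 7] (INT4).
    scale = 2.0 ** np.ceil(np.log2(amax / 7.0))
    q = np.clip(np.round(blocks / scale), -7, 7)
    return (q * scale).reshape(x.shape)

# Two blocks of small activations; the second block contains one outlier.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(size=32) * 0.1,
                    [50.0],
                    rng.normal(size=31) * 0.1])
err = np.abs(mx_int4_quantize(x) - x)
# The outlier-free block is quantized finely; the outlier's block is not.
```

Here the outlier raises its block's shared scale by several binades, so every small element in that block incurs a much larger error than in the clean block. This is exactly the per-block sensitivity that makes MX interact badly with global channel mixing.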
Theoretical contribution
The authors formalize a general affine transformation (T(x)=Ax+v) and define a mean‑squared‑error (MSE) metric for the quantized output. They prove an upper bound (Theorem 3.3) showing that the error depends on two factors: the spectral norm of the inverse transformation (\|A^{-1}\|_2) and the average, across MX blocks, of the expected maximum absolute value after transformation. For sub‑Gaussian activation distributions, the bound further decomposes into a term involving the sub‑Gaussian norm of the centered transformed activations and a logarithmic dependence on the block size. This analysis reveals a trade‑off: making (A) well‑conditioned (small (\|A^{-1}\|_2)) can increase per‑block maxima, while reducing block‑wise maxima may worsen the conditioning. Consequently, a naïve block‑diagonal approach is only a special case and can be sub‑optimal when cross‑block mixing is beneficial.
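A rough schematic of the bound's shape, in our own notation (the exact constants, norms, and conditions are those of the paper's Theorem 3.3; this only records the two factors described above): with MX blocks (\mathcal{B}_1,\dots,\mathcal{B}_K) of size (B),

```latex
\mathrm{MSE}
\;\lesssim\;
\|A^{-1}\|_2^{2}\,\cdot\,
\frac{1}{K}\sum_{k=1}^{K}
\mathbb{E}\Big[\max_{i\in\mathcal{B}_k}\big|(Ax+v)_i\big|\Big]^{2},
\qquad
\mathbb{E}\Big[\max_{i\in\mathcal{B}_k} |z_i|\Big]
\;\lesssim\;
\|z\|_{\psi_2}\sqrt{\log B}
\quad \text{for sub-Gaussian } z.
```

The second inequality is the standard sub-Gaussian maximal bound, which is where the sub-Gaussian norm of the centered transformed activations and the (\sqrt{\log B}) dependence on block size enter.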
Method – LATMiX
Motivated by the theory, LATMiX learns full‑rank affine transformations rather than restricting to orthogonal or block‑diagonal matrices. The bias term (v) is set to (-A\mu) (where (\mu) is the activation mean) so that the transformed mean is zero, eliminating the first term of the bound. The matrix (A) is parameterized via free‑form LU and QR decompositions, allowing gradient‑based optimization with standard deep‑learning toolkits. The loss combines:
- A distillation term that forces the MX‑quantized, transformed model to match the full‑precision teacher’s logits, and
- A volume‑preserving regularizer ((\det A - 1)^2) to keep the transformation invertible and numerically stable.
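The parameterization and regularizer above can be sketched numerically. The snippet below is our illustration of the LU variant (names like `lu_affine` are hypothetical, and a real implementation would use an autodiff framework rather than NumPy): writing A = L U with L unit lower‑triangular makes det A the product of U's diagonal, so the volume‑preserving penalty is cheap to evaluate.

```python
import numpy as np

def lu_affine(l_params, u_params, d):
    """Free-form LU parameterization of A (illustrative sketch):
    A = L @ U with L unit lower-triangular and U upper-triangular,
    so A is invertible iff diag(U) has no zeros and
    det(A) = det(L) * det(U) = prod(diag(U))."""
    L = np.eye(d)
    L[np.tril_indices(d, k=-1)] = l_params
    U = np.zeros((d, d))
    U[np.triu_indices(d)] = u_params
    return L, U, L @ U

def volume_reg(A):
    """The volume-preserving regularizer (det A - 1)^2 from the loss."""
    return (np.linalg.det(A) - 1.0) ** 2

d = 4
rng = np.random.default_rng(0)
L_, U_, A = lu_affine(rng.normal(size=d * (d - 1) // 2),
                      rng.normal(size=d * (d + 1) // 2), d)
# det(A) is just the product of U's diagonal, so the regularizer (and,
# in an autodiff framework, its gradient) costs O(d) to evaluate.
reg = volume_reg(A)
```

With gradient-based training, the distillation term from the first bullet would be added to `reg` to form the full loss; the free-form triangular factors are what distinguish this from orthogonality-constrained (Stiefel-manifold) rotations.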
After training, the learned transformation can be folded into the adjacent linear layers' weights (e.g., (W' = A^{-1} W A)) and biases, so inference incurs no runtime overhead; when the original layers lack bias terms, absorbing the offset (v) requires adding one, at negligible extra cost.
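A minimal numerical check of this folding idea, in our own notation (the paper's exact folding formula may differ; `W_fold`/`b_fold` are illustrative names for absorbing the input-side transformation of one layer): if the quantizer sees z = Ax + v, then a layer with weights W A^{-1} and a compensating bias reproduces the original output Wx exactly in full precision.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 8, 5
W = rng.normal(size=(d_out, d_in))                       # original linear layer
mu = rng.normal(size=d_in)                               # calibration activation mean
A = np.eye(d_in) + 0.1 * rng.normal(size=(d_in, d_in))   # toy learned transform
v = -A @ mu                                              # paper's bias choice: E[Ax + v] = 0

x = mu + rng.normal(size=d_in)
z = A @ x + v                      # transformed activation fed to the quantizer

# Fold A^{-1} and v into the layer once, offline:
W_fold = W @ np.linalg.inv(A)
b_fold = -W_fold @ v
y_fold = W_fold @ z + b_fold       # equals W @ x up to floating-point error
```

The fold is a one-time offline matrix multiplication, which is why no per-token cost remains at inference; quantization error then enters only through z, which the learned A has shaped to be MX-friendly.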
Empirical evaluation
Experiments cover three popular LLM families (Llama‑3.2‑1B, Llama‑2‑7B, Mistral‑7B), a suite of seven zero‑shot benchmarks (BIG‑Bench, GSM‑8K, ARC‑E, HellaSwag, etc.), and Wikitext‑2 perplexity. MX quantization is applied in FP4 and INT4 modes with block sizes (B = 16, 32, 64). Baselines include: (i) no transformation; (ii) a global Hadamard transform; (iii) a block‑diagonal Hadamard transform; and (iv) learned rotation matrices (optimized on the Stiefel manifold, as in prior work). LATMiX consistently outperforms all baselines, achieving 0.5–1.2 percentage‑point gains in accuracy and up to a 12% reduction in perplexity at larger block sizes, where other methods collapse. A detailed analysis of block‑wise MSE (Figure 2c) shows that LATMiX lowers error uniformly across blocks, whereas rotation‑based methods produce uneven error profiles and block‑diagonal Hadamard helps only the early blocks. Ablation studies confirm the importance of both the bias term and the volume‑preserving regularizer.
Significance and limitations
LATMiX provides the first principled framework that jointly respects MX’s block‑wise scaling and the statistical properties of activations. By learning full affine transformations, it unlocks cross‑block energy redistribution while maintaining invertibility, leading to superior quantization robustness. The method incurs no inference‑time penalty after folding, making it attractive for production deployments on hardware that supports MX formats (e.g., ARM, Intel, NVIDIA). Limitations include the need for a calibration dataset to learn the transformation and the scaling of LU/QR parameters for extremely large models (>100 B parameters), which may demand additional memory or specialized training tricks. Future work could explore parameter sharing, low‑rank approximations, or meta‑learning to reduce the calibration burden.
In summary, LATMiX advances the state of the art in low‑bit LLM quantization by delivering a theoretically grounded, practically effective, and deployment‑friendly solution for microscaling formats.