A Comprehensive Evaluation on Quantization Techniques for Large Language Models


For large language models (LLMs), post-training quantization (PTQ) can significantly reduce memory footprint and computational overhead. Model quantization is evolving rapidly, and although many papers report breakthrough results, they are often evaluated under different settings because each method typically combines multiple components. Analyzing the connections among existing methods is therefore important for a deeper understanding. To bridge these gaps, we conduct an extensive review of state-of-the-art methods and perform comprehensive evaluations under identical conditions for fair comparison; to our knowledge, such a fair and extensive investigation remains critically underexplored. To expose these connections, we first decouple published quantization methods into two steps: pre-quantization transformation and quantization error mitigation. The former is a preprocessing step that reduces the impact of outliers by flattening the data distribution; the latter offsets quantization errors to improve performance. Second, we evaluate and analyze the impact of different settings, including granularity and symmetry. Third, we analyze and evaluate the latest MXFP4 and NVFP4 data formats and their performance. Our experiments demonstrate, first, that optimized rotation and scaling yield the best pre-quantization performance, and that combining low-rank compensation with GPTQ can occasionally outperform GPTQ alone for error mitigation. Second, finer granularity improves performance but increases storage overhead. Third, the format and precision of the scaling factor greatly affect FP4 performance, and rotation-based strategies that are effective for INT4 offer limited gains for MXFP4 and NVFP4, motivating further study.


💡 Research Summary

This paper presents a systematic review and extensive empirical evaluation of post‑training quantization (PTQ) techniques for large language models (LLMs). Recognizing that many recent works combine multiple components, the authors first decompose every method into two conceptual stages: (1) pre‑quantization transformation, which preprocesses weights and activations to mitigate outliers (via shifting, scaling, rotation, or combinations thereof); and (2) quantization error mitigation, which compensates for the loss introduced by quantization (using GPTQ, OBQ, low‑rank compensation, etc.).
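The two-stage decomposition can be illustrated on a single linear layer. The sketch below is a minimal, hypothetical example (all names and the alpha = 0.5 smoothing exponent are assumptions, not the paper's exact procedure): stage 1 applies a per-channel scaling transformation that migrates activation outliers into the weights, in the spirit of SmoothQuant-style smoothing; stage 2, which would mitigate the remaining error (GPTQ, low-rank terms), is omitted here.

```python
import numpy as np

# Stage 1 demo: per-channel smoothing before INT4 quantization (illustrative).
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))            # weight matrix of a linear layer
X = rng.normal(size=(128, 64))           # calibration activations
X[:, 3] *= 50.0                          # inject one outlier channel

# Pre-quantization transformation: scale per channel so X_t @ W_t.T == X @ W.T.
s = np.abs(X).max(axis=0) ** 0.5         # smoothing factors (alpha = 0.5, assumed)
X_t, W_t = X / s, W * s

def quantize_int4(A):
    """Symmetric per-tensor INT4 round-to-nearest."""
    scale = np.abs(A).max() / 7
    return np.clip(np.round(A / scale), -8, 7) * scale

out = X @ W.T                            # exact layer output
err_raw = np.abs(quantize_int4(X) @ W.T - out).mean()
err_smooth = np.abs(quantize_int4(X_t) @ W_t.T - out).mean()
assert err_smooth < err_raw              # flattening the outlier channel helps INT4
```

Because the transformation is mathematically equivalent (the scales cancel in the matrix product), any error reduction comes purely from the flatter distribution seen by the quantizer.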

Using a unified experimental pipeline, the authors benchmark a broad set of state‑of‑the‑art methods on the W4A4 (INT4) configuration, which is the most aggressive and widely studied precision setting. Their results show that the best pre‑quantization strategy combines optimized rotation with channel‑wise scaling, flattening the data distribution and dramatically reducing quantization error. In the error‑mitigation stage, low‑rank compensation added to GPTQ occasionally outperforms GPTQ alone, indicating that a lightweight corrective sub‑space can capture residual errors that GPTQ’s greedy selection misses.

The study also investigates quantization granularity (per‑tensor, per‑channel, per‑group) and symmetry (symmetric vs. asymmetric). Finer granularity consistently improves accuracy but incurs higher storage overhead, confirming the classic accuracy‑efficiency trade‑off. Asymmetric quantization yields substantial gains for activations but only marginal benefits for weights, supporting the common practice of applying asymmetry primarily to activations.
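The granularity and symmetry effects can be reproduced on a toy tensor. The following is an illustrative sketch (not the paper's benchmark): finer scale granularity tracks local ranges better, and a zero-point lets asymmetric quantization use the full grid on skewed data such as post-ReLU activations.

```python
import numpy as np

# Granularity and symmetry demo (illustrative toy example).
rng = np.random.default_rng(3)
# Rows with very different scales, mimicking per-channel magnitude variation.
W = rng.normal(size=(64, 256)) * (1 + 9 * rng.random((64, 1)))

def q_sym(A, axis=None, bits=4):
    """Symmetric round-to-nearest with one scale per slice along `axis`."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(A).max(axis=axis, keepdims=axis is not None) / qmax
    return np.clip(np.round(A / scale), -qmax - 1, qmax) * scale

err_tensor = np.abs(q_sym(W) - W).mean()             # one scale for everything
err_channel = np.abs(q_sym(W, axis=1) - W).mean()    # one scale per row
G = W.reshape(64, 2, 128)                            # groups of 128 within a row
err_group = np.abs(q_sym(G, axis=2) - G).mean()      # one scale per group
print(err_tensor, err_channel, err_group)            # error shrinks as scales get finer

# Asymmetric quantization on skewed (e.g. post-ReLU) activations.
X = np.maximum(rng.normal(size=2048), 0)             # non-negative, skewed
lo, hi = X.min(), X.max()
sc = (hi - lo) / 15                                  # 16 levels across the real range
zp = np.round(-lo / sc)                              # zero-point
X_asym = (np.clip(np.round(X / sc) + zp, 0, 15) - zp) * sc
err_sym_act = np.abs(q_sym(X) - X).mean()            # symmetric wastes half the grid
err_asym_act = np.abs(X_asym - X).mean()
assert err_asym_act < err_sym_act
```

The storage cost of finer granularity is visible here too: per-tensor needs one scale, per-channel 64, and per-group 128.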

A novel contribution is the analysis of the newest FP4 formats—MXFP4 and NVFP4—supported by NVIDIA’s Blackwell GPUs. While INT4‑oriented rotation‑based methods boost performance for integer quantization, they provide limited improvement for FP4. Instead, the format and precision of the scaling‑factor representation dominate FP4 performance, suggesting that FP4 requires dedicated scaling‑factor design rather than a direct transfer of INT4 tricks.
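The scaling-factor sensitivity can be illustrated with a simplified model of the two block formats (a hedged sketch, not the exact Blackwell semantics): both map blocks of values onto the FP4 (E2M1) grid, but MXFP4 stores a power-of-two (E8M0) scale per 32-value block, whereas NVFP4 stores a higher-precision scale per 16-value block (modeled here as exact rather than FP8 E4M3).

```python
import numpy as np

# Simplified MXFP4-vs-NVFP4 block quantization demo (illustrative assumptions).
FP4_GRID = np.array([0, 0.5, 1, 1.5, 2, 3, 4, 6])   # E2M1 magnitudes

def to_fp4(a):
    """Round magnitudes to the nearest FP4 (E2M1) value, keeping the sign."""
    idx = np.abs(np.abs(a)[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(a) * FP4_GRID[idx]

def quantize_blocks(x, block, pow2_scale):
    b = x.reshape(-1, block)
    scale = np.abs(b).max(axis=1, keepdims=True) / 6.0  # map block max to FP4 max
    if pow2_scale:                                      # E8M0-style: power of two only
        scale = 2.0 ** np.ceil(np.log2(scale))          # round up so values still fit
    return (to_fp4(b / scale) * scale).reshape(x.shape)

rng = np.random.default_rng(4)
x = rng.normal(size=4096)
err_mx = np.abs(quantize_blocks(x, 32, pow2_scale=True) - x).mean()   # MXFP4-like
err_nv = np.abs(quantize_blocks(x, 16, pow2_scale=False) - x).mean()  # NVFP4-like
assert err_nv < err_mx    # finer blocks + more precise scales cut FP4 error
```

Rounding the scale up to a power of two can inflate the effective step size by nearly 2x, which is one intuition for why scaling-factor format and precision dominate FP4 accuracy.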

Overall, the paper offers a fair, reproducible benchmark suite, clarifies the relationships among existing techniques, and provides actionable recommendations: (i) employ rotation + scaling for pre‑processing; (ii) consider low‑rank correction alongside GPTQ; (iii) choose granularity based on storage constraints; and (iv) develop FP4‑specific scaling‑factor schemes. These insights advance the practical deployment of LLMs on resource‑constrained hardware and outline promising avenues for future quantization research.
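Recommendation (ii) can be sketched concretely: after quantizing a weight matrix, store a small low-rank corrector fitted to the quantization residual. In this hypothetical example, plain round-to-nearest stands in for GPTQ, and the rank is an arbitrary assumption.

```python
import numpy as np

# Low-rank error-compensation demo (round-to-nearest stands in for GPTQ).
rng = np.random.default_rng(2)
W = rng.normal(size=(64, 64))
X = rng.normal(size=(128, 64))                    # calibration activations

def quantize_int4(A):
    """Symmetric per-tensor INT4 round-to-nearest."""
    scale = np.abs(A).max() / 7
    return np.clip(np.round(A / scale), -8, 7) * scale

W_q = quantize_int4(W)
E = W - W_q                                       # quantization residual
U, s, Vt = np.linalg.svd(E)
r = 8                                             # rank: small extra storage (assumed)
L, R = U[:, :r] * s[:r], Vt[:r]                   # best rank-r fit to the residual

out = X @ W.T
err_plain = np.abs(X @ W_q.T - out).mean()
err_corrected = np.abs(X @ (W_q + L @ R).T - out).mean()
assert err_corrected < err_plain                  # corrector absorbs part of E
```

The corrector costs only 2·r·d extra parameters per d x d layer, which is the kind of lightweight corrective subspace the summary refers to.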

