LoaQ: Layer-wise Output Approximation Quantization

Notice: This research summary and analysis were generated automatically with AI assistance. For authoritative details, please refer to the original arXiv source.

A natural and intuitive idea in model quantization is that each component’s quantized output should closely match its original output. Motivated by this idea, most layer-wise post-training quantization (PTQ) methods focus on weight approximation at the linear-layer level. However, this local objective often yields insufficient approximations and, in practice, deviates from the guiding intuition. Recent work has improved the approximation of linear-layer outputs within the layer-wise PTQ framework, but such refinements remain inadequate for achieving alignment with the full-model output. Based on a deeper understanding of the structure of mainstream LLMs, we propose LoaQ, which incorporates output-matching factors when quantizing linear layers within the layer-wise PTQ framework. It aligns better with this intuition, admits a simple closed-form solution, and is orthogonal to existing techniques, making it readily integrable into existing quantization pipelines. Experiments on the LLaMA and Qwen model families demonstrate that LoaQ performs effectively in both weight-only and weight-activation quantization. By integrating seamlessly with existing quantization strategies, it further improves overall quantization quality and shows strong potential to advance the frontier of post-training quantization.


💡 Research Summary

LoaQ (Layer‑wise Output Approximation Quantization) addresses a fundamental limitation of existing layer‑wise post‑training quantization (PTQ) methods for large language models (LLMs). Traditional layer‑wise PTQ focuses on weight approximation at the linear‑layer level, minimizing a loss of the form L(Q)=‖X(Q−W)‖²_F where X is the activation input. This local objective ignores the residual connections and RMSNorm layers that are ubiquitous in modern transformer blocks, leading to a mismatch between the quantized model’s intermediate outputs and the original model’s outputs.
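To make the objective concrete, here is a minimal numpy sketch of this loss, with a naive round-to-nearest quantizer standing in for a real PTQ method (the shapes and the quantizer are illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 16))   # calibration activations (tokens x features)
W = rng.standard_normal((16, 16))    # original linear-layer weight

def quantize_rtn(W, n_bits):
    """Naive round-to-nearest uniform quantizer (illustration only)."""
    scale = np.abs(W).max() / (2 ** (n_bits - 1) - 1)
    return np.round(W / scale) * scale

def layerwise_loss(Q):
    # L(Q) = ||X (Q - W)||_F^2
    return np.linalg.norm(X @ (Q - W), ord="fro") ** 2

print(layerwise_loss(quantize_rtn(W, 3)))  # coarser grid -> larger loss
print(layerwise_loss(quantize_rtn(W, 8)))  # finer grid   -> smaller loss
```

Because the loss only sees X and the weight difference, any quantizer can be plugged in; the point of the summary is that minimizing this quantity alone ignores the residual and RMSNorm structure around the layer.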

The authors propose a three‑stage hierarchical approach that progressively aligns the quantized model with the original model at increasingly coarse granularity, from single linear layers up to normalized sub‑block outputs:

  1. Linear‑layer output approximation – A correction term C = Xᵀ(X′ − X) is added to the standard GPTQ loss, and the Hessian H = XᵀX is used to pre‑condition the weight matrix. By updating the weight as (I + H⁻¹C)·W, the problem reduces to the classic GPTQ formulation, allowing the use of existing GPTQ solvers without modification.

  2. Sub‑block output approximation – Each transformer sub‑block (self‑attention or MLP) consists of an RMSNorm, the sub‑block module, and a residual addition. The authors explicitly compensate for the residual‑induced error by adding H⁻¹Xᵀ(h′ − h) to the weight update, where h and h′ are the hidden states of the quantized and original sub‑blocks respectively. This aligns the entire sub‑block output rather than just the linear transformation.

  3. Normalized sub‑block output approximation – Because RMSNorm normalizes the sub‑block output before it feeds the next layer, the authors target the normalized outputs ρ(h + XQ) and ρ(h′ + X′W). By expressing the RMSNorm as a column‑wise root‑mean‑square scaling operator R(·) and decoupling the scaling from the quantized weight, the loss again collapses to a linear form amenable to GPTQ.
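The stage‑1 update has a clean least‑squares reading consistent with the definitions above: if X′ denotes the activations arriving through the already‑quantized path, then (I + H⁻¹C)·W is exactly the minimizer of ‖XQ − X′W‖²_F. A small numpy check on synthetic data (all names and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 256, 8
X  = rng.standard_normal((n, d))                  # original activations
Xp = X + 0.05 * rng.standard_normal((n, d))       # quantized-path activations X'
W  = rng.standard_normal((d, d))                  # original weight

H = X.T @ X                                       # Hessian H = X^T X
C = X.T @ (Xp - X)                                # correction C = X^T (X' - X)
W_corr = (np.eye(d) + np.linalg.solve(H, C)) @ W  # (I + H^{-1} C) W

# The corrected weight equals the least-squares minimizer of
# ||X Q - X' W||_F^2, i.e. the output-matching target before quantization:
W_ls, *_ = np.linalg.lstsq(X, Xp @ W, rcond=None)
print(np.allclose(W_corr, W_ls))  # True
```

This is why the problem "reduces to the classic GPTQ formulation": once the weight is pre-corrected, an unmodified GPTQ solver quantizes W_corr against the usual Hessian H.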

The final update rule is:
 W̃ = (I + α H⁻¹C)·W + β H⁻¹XᵀΔh,
where α and β are tunable scaling factors that prevent over‑compensation. Empirically, α≈0.4–0.6 and β≈1.0 yield the best trade‑off between error reduction and numerical stability.
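Taken literally, the update rule is a few lines of numpy. The statistics below are synthetic stand-ins, and the α, β values are placeholders within the range reported above, not tuned values from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 256, 8
X  = rng.standard_normal((n, d))                  # original activations
Xp = X + 0.05 * rng.standard_normal((n, d))       # quantized-path activations X'
W  = rng.standard_normal((d, d))                  # original weight
h  = rng.standard_normal((n, d))                  # quantized-path hidden state h
hp = h + 0.05 * rng.standard_normal((n, d))       # original hidden state h'

H, C, dh = X.T @ X, X.T @ (Xp - X), hp - h
Hinv = np.linalg.inv(H)

def loaq_update(alpha, beta):
    # W~ = (I + alpha H^{-1} C) W + beta H^{-1} X^T (h' - h)
    return (np.eye(d) + alpha * Hinv @ C) @ W + beta * Hinv @ X.T @ dh

W_tilde = loaq_update(alpha=0.5, beta=1.0)  # W_tilde is then quantized with GPTQ
```

Setting α = β = 0 recovers the original weight W, so the scaling factors interpolate between plain GPTQ and full output-matching compensation.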

LoaQ’s algorithm (Algorithm 1) iterates over all layers, collects the necessary statistics (X, X′, h, h′, RMSNorm scaling), computes H, C, and Δh, applies the above update, and finally quantizes the corrected weight using the standard GPTQ routine. Because the correction is performed before quantization, LoaQ is fully compatible with any existing GPTQ‑based pipeline and incurs negligible additional computational cost.
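A hypothetical sketch of that per-layer loop, with a naive round-to-nearest quantizer standing in for the GPTQ routine and synthetic statistics in place of the collected (X, X′, h, h′) — the function names are assumptions, not the authors' API:

```python
import numpy as np

def quantize_rtn(W, n_bits=3):
    """Stand-in for the GPTQ routine: naive round-to-nearest quantizer."""
    scale = np.abs(W).max() / (2 ** (n_bits - 1) - 1)
    return np.round(W / scale) * scale

def loaq_layer(W, X, Xp, h, hp, alpha=0.5, beta=1.0, quantizer=quantize_rtn):
    """Correct one linear layer's weight from collected statistics, then quantize."""
    H    = X.T @ X                      # Hessian
    C    = X.T @ (Xp - X)               # input-mismatch correction
    dh   = hp - h                       # residual-stream mismatch
    Hinv = np.linalg.inv(H)
    W_tilde = (np.eye(W.shape[0]) + alpha * Hinv @ C) @ W + beta * Hinv @ X.T @ dh
    return quantizer(W_tilde)           # correction happens before quantization

# Toy usage on one synthetic "layer":
rng = np.random.default_rng(3)
n, d = 256, 8
X, W = rng.standard_normal((n, d)), rng.standard_normal((d, d))
Xp   = X + 0.05 * rng.standard_normal((n, d))
h    = rng.standard_normal((n, d))
hp   = h + 0.05 * rng.standard_normal((n, d))
Q = loaq_layer(W, X, Xp, h, hp)
print(Q.shape)  # (8, 8)
```

Because the correction touches only the weight handed to the quantizer, swapping `quantize_rtn` for a real GPTQ implementation is the only change needed to match the described pipeline.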

The method is evaluated on several LLM families: LLaMA‑2 (7B, 13B, 70B), LLaMA‑3 (8B, 70B), and Qwen‑3 (8B, 14B, 32B). Experiments cover 2‑bit and 3‑bit channel‑wise weight‑only quantization as well as weight‑activation quantization, the latter combined with recent techniques such as Hadamard transforms and NeUQI. Evaluation metrics include perplexity on WikiText‑2 and C4, and zero‑shot accuracy on five benchmarks (ARC‑E, ARC‑C, PiQA, HellaSwag, Winogrande).

Results show that LoaQ consistently outperforms strong baselines (GPTQ, Qronos, GPT‑AQ). For example, on LLaMA‑2 70B, LoaQ reduces WikiText‑2 perplexity from 71.05 (GPTQ) to 41.62 and lifts ARC‑C accuracy from 19.97 % to 21.59 %. Across all model sizes, LoaQ achieves 10–30 % lower perplexity and 2–5 % higher zero‑shot accuracy than the baselines. Ablation studies confirm that each of the three approximation stages contributes additively to the final performance gain, and that the RMSNorm‑based normalization step is crucial for deep models, where residual‑induced errors would otherwise accumulate.

In summary, LoaQ introduces a principled, closed‑form correction framework that extends layer‑wise PTQ from weight‑only alignment to full sub‑block and normalized output alignment. Its compatibility with existing GPTQ implementations, modest computational overhead, and strong empirical gains make it a compelling addition to the PTQ toolbox, especially for quantizing very large transformer models where preserving output fidelity is critical.

