A Case Study of Selected PTQ Baselines for Reasoning LLMs on Ascend NPU
Notice: This research summary and analysis were automatically generated using AI technology. For authoritative details, please refer to the original arXiv source.

Post-Training Quantization (PTQ) is crucial for efficient model deployment, yet its effectiveness on the Ascend NPU remains under-explored compared to GPU architectures. This paper presents a case study of representative PTQ baselines applied to reasoning-oriented models, namely the DeepSeek-R1-Distill-Qwen series (1.5B/7B/14B) and QwQ-32B. We evaluate four representative algorithms, AWQ, GPTQ, SmoothQuant, and FlatQuant, covering the spectrum from weight-only compression to advanced rotation-based methods. Our empirical results reveal significant platform sensitivity. While 4-bit weight-only quantization proves viable for larger models, aggressive 4-bit weight-activation schemes suffer from layer-wise calibration instability on the NPU, leading to logic collapse in long-context reasoning tasks. Conversely, standard 8-bit quantization remains numerically stable. Furthermore, a real-world INT8 deployment demonstrates that although optimized kernels reduce latency, dynamic quantization overheads currently limit end-to-end acceleration. These findings offer a practical reference on the feasibility and limitations of deploying quantized reasoning models on the Ascend NPU.


💡 Research Summary

This paper conducts a systematic case study of post‑training quantization (PTQ) techniques for large language models (LLMs) on Huawei’s Ascend NPU, a hardware platform that has received far less attention than NVIDIA GPUs in the quantization literature. The authors focus on reasoning‑oriented models—specifically the open‑source DeepSeek‑R1‑Distill‑Qwen series (1.5 B, 7 B, 14 B) and the larger QwQ‑32 B model—and evaluate four representative PTQ algorithms that span the spectrum from weight‑only compression to full weight‑activation quantization with rotation‑based transforms: AWQ, GPTQ, SmoothQuant, and FlatQuant.

Methodology
The study adopts a three‑tier selection strategy. First, the weight‑only baselines AWQ and GPTQ are tested with group size 128 at 4‑bit and 3‑bit precision (W4A16g128, W3A16g128). Second, SmoothQuant serves as the primary 8‑bit weight‑activation baseline (W8A8KV8, where the KV cache is also quantized to 8 bits). Third, FlatQuant is employed both in its standard 8‑bit configuration (W8A8KV8) and in an aggressive 4‑bit configuration (W4A4KV4) to probe the limits of rotation‑based quantization on Ascend. The authors benchmark on a comprehensive suite of reasoning tasks (AIME‑120, Math‑500, GSM8K, GPQA‑Diamond, LiveCodeBench) and QA tasks (ARC‑Easy/Challenge, HellaSwag, LAMBADA, PIQA, Winogrande). Evaluation metrics include perplexity, task‑specific accuracy, and average drop relative to BF16 baselines. Experiments are run on the Ascend 910B NPU and, for comparison, on an X2000 GPU‑class reference platform.
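The weight‑only configurations above (e.g. W4A16g128) rest on symmetric per‑group quantization: each group of 128 consecutive weights shares one scale chosen from that group's maximum magnitude. A minimal NumPy sketch of the idea, not the paper's implementation, might look like:

```python
import numpy as np

def quantize_weights_groupwise(w, bits=4, group_size=128):
    """Symmetric per-group weight quantization (e.g. W4A16g128).

    w: 1-D weight vector whose length is a multiple of group_size.
    Returns signed integer codes plus one scale per group.
    """
    qmax = 2 ** (bits - 1) - 1                     # 7 for signed 4-bit
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)    # guard all-zero groups
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax)
    return q.astype(np.int8), scales

def dequantize(q, scales):
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, s = quantize_weights_groupwise(w, bits=4, group_size=128)
w_hat = dequantize(q, s)
# Rounding error is bounded by half the group scale.
print(np.abs(w - w_hat).max() <= s.max() / 2 + 1e-6)
```

Shrinking the group size tightens each scale to its local weight range, which is exactly why the 3‑bit step (qmax = 3) studied in the paper becomes too coarse: the same rounding bound doubles for every bit removed.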

Key Findings

  1. Weight‑only 4‑bit quantization is viable for larger models – Both AWQ and GPTQ with 4‑bit weights (group size 128) retain most of the original performance on models ≥ 7 B, with average accuracy drops of ≤ 4 % across all reasoning benchmarks. The 1.5 B model suffers a larger drop, especially for AWQ, indicating that smaller models are more sensitive to the coarse quantization step.

  2. 3‑bit weight‑only quantization collapses performance – Across all model sizes, the 3‑bit configurations trigger severe degradation (often > 20 % drop), confirming that the quantization step becomes too coarse to preserve the delicate weight distribution needed for reasoning.

  3. 8‑bit weight‑activation quantization (SmoothQuant) is numerically stable – SmoothQuant’s W8A8KV8 variant stays within a 1‑3 % accuracy gap relative to BF16 on both reasoning and QA tasks. However, the NPU results are consistently a few percentage points lower than published GPU numbers, suggesting that Ascend’s fixed‑point arithmetic introduces modest additional noise.

  4. FlatQuant‑W8A8KV8 mirrors SmoothQuant performance – The rotation‑based method in its 8‑bit mode achieves comparable stability, indicating that the lack of specialized FHT kernels on Ascend does not hinder 8‑bit rotation‑based quantization when standard Triton‑compatible kernels are used.

  5. FlatQuant‑W4A4KV4 exhibits strong platform sensitivity – In the 4‑bit configuration, performance on QA benchmarks can be partially recovered through extensive hyper‑parameter tuning (group size, calibration sample count). However, long‑context reasoning tasks still suffer from “logic collapse,” where the model’s output becomes incoherent after a few hundred tokens. Larger models (QwQ‑32 B) are more robust than the 14 B and 7 B variants, but the issue remains pronounced.

  6. Real‑world INT8 deployment reveals kernel bottlenecks – A prototype INT8 inference path on Ascend 910B, using optimized INT8×INT8 matmul kernels, reduces latency for certain layers by ~10‑15 %. Nevertheless, dynamic quantization overhead (runtime scaling, zero‑point correction) accounts for > 30 % of total inference time, limiting end‑to‑end speed‑ups. The prototype also lacks full operator coverage, leaving further optimization opportunities.

  7. Absence of FHT‑based rotation methods – State‑of‑the‑art rotation techniques (QuaRot, SpinQuant, OstQuant) rely on Fast Hadamard Transform kernels that are not yet available on Ascend. Consequently, the study uses FlatQuant as a proxy to estimate an upper bound for rotation‑based quantization on NPUs.
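The smoothing transform behind SmoothQuant (finding 3) migrates activation outliers into the weights through per‑channel scales, leaving the layer's output mathematically unchanged. A hedged NumPy sketch of this idea (the alpha value and tensor shapes are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def smooth_scales(x, w, alpha=0.5):
    """Per-input-channel smoothing scales: s_j = max|X_j|^a / max|W_j|^(1-a).

    x: (tokens, in_features) calibration activations
    w: (in_features, out_features) weights
    """
    act_max = np.abs(x).max(axis=0)     # per-channel activation range
    wgt_max = np.abs(w).max(axis=1)     # per-channel weight range
    s = (act_max ** alpha) / (wgt_max ** (1 - alpha))
    return np.where(s == 0, 1.0, s)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 64))
x[:, 3] *= 50.0                         # inject an outlier channel
w = rng.normal(size=(64, 16))

s = smooth_scales(x, w)
x_s, w_s = x / s, w * s[:, None]        # fold scales into both operands

# The matmul is equivalent: (X / s) @ (diag(s) W) == X @ W.
print(np.allclose(x @ w, x_s @ w_s))
# The activation range is rebalanced, easing 8-bit activation quantization.
print(np.abs(x_s).max() < np.abs(x).max())
```

Because the transform is exact in full precision, any remaining accuracy gap (such as the few points the authors observe on the NPU versus published GPU numbers) must come from the quantizers and the platform's arithmetic, not from the smoothing itself.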

Implications
The results suggest a practical deployment hierarchy for Ascend NPU:

  • 8‑bit quantization (W8A8KV8) is the safest and most broadly applicable choice, delivering near‑baseline accuracy with modest latency gains.
  • 4‑bit weight‑only quantization can be employed for models ≥ 7 B when memory constraints are tight, but developers must accept a small accuracy penalty and verify stability on a per‑model basis.
  • 4‑bit weight‑activation quantization is currently unsuitable for long‑context reasoning due to calibration instability and logic collapse.
  • Rotation‑based 4‑bit methods require dedicated kernel support (e.g., FHT) and extensive hyper‑parameter tuning before they become viable on Ascend.
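The dynamic‑quantization overhead that caps end‑to‑end INT8 speed‑ups (finding 6) comes from computing activation scales at runtime, on every forward pass. A minimal sketch of symmetric per‑token INT8 quantization, an illustrative scheme rather than the paper's actual kernel:

```python
import numpy as np

def dynamic_int8_per_token(x):
    """Symmetric per-token INT8 quantization computed at inference time.

    The max-reduction and the rescale below are the runtime cost that
    dynamic schemes pay on every call, unlike static weight quantization,
    which is performed once offline.
    """
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 256)).astype(np.float32)
q, scale = dynamic_int8_per_token(x)
x_hat = q.astype(np.float32) * scale
print(np.abs(x - x_hat).max() <= scale.max() / 2 + 1e-6)
```

Fusing this scale computation into the preceding operator, rather than running it as a separate pass, is one plausible way to attack the >30 % overhead the deployment study reports.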

Future Directions
The authors recommend several research avenues: (1) development of Ascend‑optimized Fast Hadamard Transform kernels to unlock high‑accuracy rotation‑based PTQ; (2) tighter integration of dynamic quantization into the NPU runtime to reduce overhead; (3) automated, hardware‑aware calibration pipelines that adapt group sizes and scaling factors per layer; and (4) co‑design of quantization algorithms that respect Ascend’s mixed‑precision accumulation constraints.
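The Fast Hadamard Transform at the heart of direction (1) can be sketched in a few lines; this is the textbook O(n log n) butterfly recursion, not Ascend kernel code:

```python
import numpy as np

def fht(v):
    """Unnormalized Fast Hadamard Transform; n must be a power of two.

    Rotation-based PTQ (QuaRot, SpinQuant, OstQuant) multiplies weights
    and activations by such an orthogonal matrix to flatten outliers; the
    missing piece on Ascend is a fused kernel for this butterfly loop.
    """
    v = np.array(v, dtype=np.float64)
    n = len(v)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):      # butterfly over blocks of size 2h
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    return v

# A single outlier is spread evenly by the orthonormal rotation H / sqrt(n):
x = np.zeros(8)
x[0] = 8.0
rotated = fht(x) / np.sqrt(8)
print(rotated)                            # magnitude drops from 8 to 8/sqrt(8)
# Applying the unnormalized transform twice recovers n * x (H @ H = n * I):
print(np.allclose(fht(fht(x)), 8 * x))
```

An Ascend‑optimized version would fuse this recursion into the surrounding matmul, which is precisely the kernel gap the authors identify for rotation‑based 4‑bit methods.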

In summary, this case study provides the first comprehensive benchmark of PTQ methods for reasoning‑focused LLMs on Ascend NPU, highlighting both the promise of 8‑bit quantization and the current limitations of aggressive low‑bit schemes on this emerging hardware platform.

