BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Large language model (LLM) inference is often bounded by memory footprint and memory bandwidth in resource-constrained deployments, making quantization a fundamental technique for efficient serving. While post-training quantization (PTQ) maintains high fidelity at 4-bit, it deteriorates at 2-3 bits. Fundamentally, existing methods enforce a shape-invariant quantization grid (e.g., the fixed uniform intervals of UINT2) for each group, severely restricting the feasible set for error minimization. To address this, we propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients, and iteratively refines them using approximate second-order information while progressively compensating quantization errors to minimize output discrepancy. In the 2-bit regime, BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit). Moreover, we provide theoretical analysis showing that the variable grid expands the feasible set, and that the quantization process consistently aligns with the optimization objective in Hessian-induced geometry. Code: github.com/KingdalfGoodman/BPDQ.


💡 Research Summary

The paper tackles the long‑standing problem of severe quality loss when post‑training quantization (PTQ) pushes large language models (LLMs) into the 2‑ to 3‑bit regime. Existing PTQ methods rely on a “shape‑invariant” quantization grid: every weight group shares the same relative spacing (e.g., the four uniformly spaced values of UINT2). While this design simplifies hardware implementation, it severely restricts the feasible set of quantized weights under the output‑aligned objective (minimizing ‖WX – cWX‖). The authors argue that the degradation observed at ultra‑low bit‑widths is not a failure of the optimization objective itself but a mismatch between that objective and the rigid grid.

To overcome this, they introduce Bit‑Plane Decomposition Quantization (BPDQ). The core idea is to construct a variable quantization grid for each group by decomposing the weight matrix into binary bit‑planes and learning group‑wise scalar coefficients. Formally, a quantized weight block is expressed as

cW = REP(C₀) + ∑_{i=1}^k REP(C_i) ⊙ B_i,

where B_i∈{0,1} are the selected bit‑planes, C_i are scalar coefficients repeated across the group, and ⊙ denotes element‑wise multiplication. The number of bit‑planes k determines the effective bit‑width (k=1 for 2‑bit). This representation allows each group to have its own spacing pattern, breaking the shape‑invariance constraint and expanding the feasible solution space (proved in Appendix A).
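For concreteness, the dequantization step in the equation above can be sketched in a few lines of NumPy. This is a toy sketch of the representation only, not the authors' kernel; the function name is illustrative:

```python
import numpy as np

def reconstruct_group(C, B):
    """Dequantize one weight group from its BPDQ representation.

    C : (k+1,) group-wise scalar coefficients [C0, C1, ..., Ck]
    B : (k, n)  binary bit-planes for a group of n weights

    Returns cW = REP(C0) + sum_i REP(C_i) * B_i, where REP broadcasts
    each scalar coefficient across the group.
    """
    return C[0] + C[1:] @ B
```

With k = 1 (the 2-bit regime), each weight in the group takes one of two values, C0 or C0 + C1, and the spacing C1 is free to differ from group to group.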

The method proceeds in three stages:

  1. Initialization – The full‑precision weights are first quantized to 8‑bit integers, then decomposed into bit‑planes. The k most significant planes are retained as the initial B_i; the remaining LSBs are discarded, incurring only a small truncation error. With B_i fixed, the scalar coefficients C_i are obtained in closed form by solving a weighted least‑squares problem under the Hessian metric H = XXᵀ (X are calibration inputs). This step yields a grid that is already optimal for the current bit‑planes.

  2. Iterative Refinement – For each group, BPDQ alternates:

    • Bit‑plane update – Keeping C_i fixed, each column is examined exhaustively over the 2^k possible binary vectors b. The candidate value v(b) = C₀ + ∑ C_i·b_i is evaluated against the current working column using the Hessian‑induced error norm; the best b* is selected, and the column is quantized. This operation is performed column‑wise with the same triangular error‑propagation used in GPTQ, ensuring that the remaining free coordinates are updated consistently.
    • Coefficient refitting – After all columns in the group have been updated, the scalar coefficients are recomputed by solving the same weighted LS problem (now with the new B_i). This step readjusts the variable grid to better match the original weights.
  3. Delta Correction – Refitting C_i changes the quantized block from cW_old to cW_new, breaking the accumulated error‑propagation state E. To restore consistency, the authors compute a correction ΔE that satisfies ΔE·U_loc = cW_old – cW_new, where U_loc is the local Cholesky factor of H for the group. Adding ΔE to the stored error vectors guarantees that subsequent updates remain on the same Hessian‑induced manifold. Appendix B provides a formal proof of equivalence.
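Assuming a diagonal Hessian and omitting the GPTQ-style column-wise error propagation and delta correction, the alternating refinement in stages 1 and 2 can be sketched in NumPy. The function names and the uniform initialization are illustrative stand-ins, not the paper's implementation (which initializes from the MSB planes of an 8-bit quantization):

```python
import numpy as np
from itertools import product

def fit_coefficients(w, B, H):
    """Closed-form coefficient fit for fixed bit-planes under the Hessian
    metric H = X X^T: minimize (w - A c)^T H (w - A c) with design matrix
    A = [1, B_1^T, ..., B_k^T], giving c* = (A^T H A)^{-1} A^T H w.
    lstsq guards against rank deficiency when bit-planes coincide."""
    A = np.vstack([np.ones(w.shape[0]), B]).T          # (n, k+1)
    return np.linalg.lstsq(A.T @ H @ A, A.T @ H @ w, rcond=None)[0]

def update_bit_planes(w, C):
    """For each coordinate, exhaustively search the 2^k binary vectors b
    and keep the one whose value v(b) = C0 + sum_i C_i b_i is closest to
    the target (a diagonal-Hessian simplification of the paper's search,
    which scores candidates under the full Hessian-induced norm)."""
    k = len(C) - 1
    cands = [np.array(b) for b in product((0, 1), repeat=k)]
    vals = [C[0] + C[1:] @ b for b in cands]
    B = np.empty((k, w.shape[0]), dtype=int)
    for j, t in enumerate(w):
        best = min(range(len(cands)), key=lambda i: abs(vals[i] - t))
        B[:, j] = cands[best]
    return B

def bpdq_group(w, H, k=2, iters=3):
    """Alternate bit-plane updates and coefficient refits for one group."""
    # Illustrative initialization: binary-weighted coefficients giving a
    # uniform 2^k-level grid over [min(w), max(w)].
    lo, hi = w.min(), w.max()
    C = np.concatenate([[lo], (hi - lo) / (2**k - 1) * 2.0 ** np.arange(k)])
    for _ in range(iters):
        B = update_bit_planes(w, C)
        C = fit_coefficients(w, B, H)
    return C, B
```

When a group's values happen to be exactly expressible as C0 + Σ C_i·b_i, the alternation converges to zero reconstruction error; in general it only bends each group's grid toward its own weight distribution, which is precisely the freedom a shape-invariant grid lacks.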

The authors embed BPDQ into the GPTQModel library, enabling direct comparison with GPTQ and AWQ. Experiments span several open-source LLM families: Qwen-3 (0.6B to 72B), Qwen-2.5 (7B and 72B), and Mistral-3 (3B and 8B). Benchmarks include GSM8K, MATH500, ARC-C, BoolQ, HellaSwag, MMLU, and LongBench. Key results:

  • 72B Qwen-2.5 quantized to 2-bit achieves 83.85% GSM8K accuracy, versus 90.83% for the 16-bit baseline; GPTQ and AWQ collapse below 41% under the same bit budget.
  • The quantized 72‑B model fits into 22.69 GB VRAM, allowing inference on a single RTX 3090. A custom bit‑plane lookup‑table kernel yields low‑latency decoding suitable for interactive generation.
  • Activation‑statistics analysis shows that BPDQ naturally preserves outlier activations that are critical for LLM performance, unlike many distribution‑aware PTQ methods that must explicitly mask outliers.
  • Across all model sizes, BPDQ consistently outperforms GPTQ and AWQ in the 2‑ and 3‑bit regimes, while matching or exceeding them at 4‑bit.

The theoretical contributions are twofold: (1) demonstrating that a variable grid strictly enlarges the set of attainable quantized weights, thereby reducing the lower bound on reconstruction error; (2) proving that the iterative bit-plane/coefficient updates, together with the delta correction, remain consistent with the Hessian-induced quadratic objective, effectively extending the "nearest-plane" interpretation of GPTQ to a richer, adaptive lattice.
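A toy numeric check (constructed for this summary, not taken from the paper) makes the feasible-set claim concrete: a group whose weights cluster at {0, 0.1, 2.0, 2.1} is exactly representable on a variable grid with k = 2 bit-planes and coefficients (C0, C1, C2) = (0.0, 0.1, 2.0), while no uniform 4-level grid z + s·{0, 1, 2, 3} can hit all four values:

```python
import numpy as np
from itertools import product

# Weights clustered in two narrow bands far apart.
w = np.array([0.0, 0.1, 2.0, 2.1])

# Variable grid: levels C0 + C1*b1 + C2*b2 with non-uniform spacing allowed.
C0, C1, C2 = 0.0, 0.1, 2.0
levels_var = [C0 + C1 * b1 + C2 * b2 for b1, b2 in product((0, 1), repeat=2)]
err_var = max(min(abs(x - l) for l in levels_var) for x in w)

def uniform_err(s, z):
    """Worst-case rounding error of w on the uniform grid z + s*{0,1,2,3}."""
    levels = [z + s * q for q in range(4)]
    return max(min(abs(x - l) for l in levels) for x in w)

# Brute-force the best shape-invariant (uniform) grid over a scale/zero sweep.
err_uni = min(uniform_err(s, z)
              for s in np.linspace(0.01, 1.0, 100)
              for z in np.linspace(-0.5, 0.5, 101))
```

Here `err_var` is zero while `err_uni` stays bounded away from zero: with only four uniformly spaced levels, no scale can simultaneously resolve the 0.1 gap inside each band and the 1.9 gap between bands.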

In summary, BPDQ offers a principled, hardware‑friendly pathway to ultra‑low‑bit LLM deployment. By decoupling the quantization grid from a fixed shape and aligning the optimization process with second‑order information, it achieves high fidelity even at 2‑bit precision, dramatically lowering memory requirements without sacrificing practical performance. This work is likely to influence both academic research on quantization theory and industry practice for cost‑effective LLM serving.

