ARB-LLM: Alternating Refined Binarizations for Large Language Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Large Language Models (LLMs) have greatly pushed forward advancements in natural language processing, yet their high memory and computational demands hinder practical deployment. Binarization, as an effective compression technique, can shrink model weights to just 1 bit, significantly reducing the high demands on computation and memory. However, current binarization methods struggle to narrow the distribution gap between binarized and full-precision weights, while also overlooking the column deviation in LLM weight distribution. To tackle these issues, we propose ARB-LLM, a novel 1-bit post-training quantization (PTQ) technique tailored for LLMs. To narrow the distribution shift between binarized and full-precision weights, we first design an alternating refined binarization (ARB) algorithm to progressively update the binarization parameters, which significantly reduces the quantization error. Moreover, considering the pivotal role of calibration data and the column deviation in LLM weights, we further extend ARB to ARB-X and ARB-RC. In addition, we refine the weight partition strategy with a column-group bitmap (CGB), which further enhances performance. Equipping ARB-X and ARB-RC with CGB, we obtain ARB-LLM$_\text{X}$ and ARB-LLM$_\text{RC}$ respectively, which significantly outperform state-of-the-art (SOTA) binarization methods for LLMs. As a binary PTQ method, our ARB-LLM$_\text{RC}$ is the first to surpass FP16 models of the same size. The code and models will be available at https://github.com/ZHITENGLI/ARB-LLM.


💡 Research Summary

Large language models (LLMs) have achieved remarkable performance across a wide range of natural‑language tasks, but their billions of parameters impose prohibitive memory and compute requirements for real‑world deployment. Binarization—representing each weight with a single bit—offers the most extreme compression, yet existing binary post‑training quantization (PTQ) methods suffer from two fundamental shortcomings. First, the distribution of binarized weights often deviates significantly from that of the original full‑precision weights, leading to a large quantization error. Second, LLM weight matrices exhibit noticeable column‑wise deviations, but most prior work only applies row‑wise scaling, ignoring this column bias.

The paper introduces ARB‑LLM, a novel 1‑bit PTQ framework specifically designed for LLMs. The core component, Alternating Refined Binarization (ARB), iteratively refines three key parameters: the mean offset μ, the row‑wise scaling factor α, and the binary matrix B. Starting from a standard sign‑based binarization, ARB computes the residual matrix R = W − (αB + μ), updates μ by adding the mean of R, then recomputes α and B to be optimal under the updated μ. This alternating update is repeated for T iterations; the authors prove (Theorem 1) that the Frobenius‑norm quantization error monotonically decreases with each iteration. A group mask M (bitmap) is incorporated to separate salient from non‑salient weights, allowing the algorithm to focus refinement on the most influential parameters.
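The alternating update described above can be sketched as a small NumPy routine. This is an illustration of the coordinate-descent structure, not the paper's exact algorithm: the group mask M is omitted, and the function and variable names are ours. Each pass first re-centers μ on the current residual, then recomputes the jointly optimal α and B for the new μ, so the Frobenius error cannot increase.

```python
import numpy as np

def arb_binarize(W, T=10):
    """Simplified alternating refined binarization of a weight matrix W.

    Approximates each row of W as alpha * B + mu, where B is {-1, +1},
    alpha is a per-row scale, and mu is a per-row mean offset.
    """
    # Initialization: standard sign-based binarization around the row mean
    mu = W.mean(axis=1, keepdims=True)
    B = np.where(W - mu >= 0, 1.0, -1.0)
    alpha = np.abs(W - mu).mean(axis=1, keepdims=True)
    for _ in range(T):
        # Residual between full-precision weights and current binary approximation
        R = W - (alpha * B + mu)
        # Refine mu: adding the row mean of R is the exact minimizer for fixed alpha, B
        mu = mu + R.mean(axis=1, keepdims=True)
        # For the updated mu, the jointly optimal pair is B = sign(W - mu),
        # alpha = mean |W - mu| (the classic 1-bit least-squares solution)
        B = np.where(W - mu >= 0, 1.0, -1.0)
        alpha = np.abs(W - mu).mean(axis=1, keepdims=True)
    return alpha, B, mu

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 64))
alpha, B, mu = arb_binarize(W)
err = np.linalg.norm(W - (alpha * B + mu))
```

Because each of the two sub-steps is an exact minimization with the other parameters held fixed, the quantization error is non-increasing across iterations, which mirrors the monotonicity statement of Theorem 1.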

While ARB already reduces the gap between binary and full‑precision models, the authors extend it in two directions to address practical LLM characteristics:

  1. ARB‑X (Calibration‑aware ARB) – Recognizing that quantization error measured only in weight space does not capture the impact on model outputs, ARB‑X introduces a calibration dataset X. The loss now includes the discrepancy between the model’s activations on X before and after binarization. μ and α are updated using gradients that reflect this data‑driven loss, ensuring that the refined binary representation preserves the input‑output behavior of the original model. Experiments show that even a modest calibration set (128–512 samples) yields noticeable gains, especially for smaller models where data‑driven refinement compensates for limited capacity.

  2. ARB‑RC (Row‑Column Refined ARB) – To handle column‑wise deviations, ARB‑RC adds a column‑wise scaling factor β alongside the row‑wise α. The alternating refinement now proceeds as μ → α → β → B, with closed‑form updates derived from setting the partial derivatives of the quantization error with respect to α and β to zero. This dual‑scaling scheme effectively aligns both row and column statistics of the binary approximation with those of the full‑precision weights. The authors report that ARB‑RC consistently outperforms ARB‑X on larger models (≥13 B parameters), where column variance is more pronounced.
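The dual row/column scaling of ARB‑RC can be sketched as alternating least squares. The closed-form updates below follow from setting the partial derivatives of the Frobenius error to zero (using B² = 1 elementwise); this is our own simplified rendering with the mean offset and group mask omitted, and the names are ours.

```python
import numpy as np

def arb_rc(W, T=10):
    """Sketch of row-column refined binarization: W ~ alpha_i * beta_j * B_ij."""
    B = np.where(W >= 0, 1.0, -1.0)
    alpha = np.abs(W).mean(axis=1)   # per-row scales (initialized as in row-only ARB)
    beta = np.ones(W.shape[1])       # per-column scales
    for _ in range(T):
        # Closed-form least-squares update of alpha with beta, B fixed:
        # alpha_i = sum_j W_ij B_ij beta_j / sum_j beta_j^2
        alpha = (W * B) @ beta / np.sum(beta ** 2)
        # Symmetric update of beta with alpha, B fixed
        beta = alpha @ (W * B) / np.sum(alpha ** 2)
        # With positive scales, the optimal binary code stays sign(W)
        B = np.where(W >= 0, 1.0, -1.0)
    return alpha, beta, B

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 32))
alpha, beta, B = arb_rc(W)
W_hat = alpha[:, None] * B * beta[None, :]
err_rc = np.linalg.norm(W - W_hat)
```

Since the initialization (β = 1) reproduces plain row-wise scaling and every subsequent update is an exact minimization, the row-column approximation is never worse than the row-only one on the same B.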

Both extensions are combined with a refined bitmap strategy called Column‑Group Bitmap (CGB). Traditional binary PTQ methods use a salient bitmap to flag important weights and a separate magnitude‑based grouping to reduce storage. CGB merges these ideas: weights are first grouped by column, then within each column a salient bitmap selects the most critical elements, and a shared scaling factor is stored per group. This design dramatically reduces bitmap overhead while preserving the ability to treat important weights with higher fidelity.
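A minimal sketch of the partition side of CGB is given below. It is our own illustrative version, assuming a fixed column-group width and a magnitude threshold for saliency; the paper's actual grouping criterion and storage layout may differ.

```python
import numpy as np

def column_group_bitmap(W, group_size=16, salient_frac=0.1):
    """Partition columns into fixed-width groups; within each group, flag the
    highest-magnitude weights as salient (bitmap) and keep one shared scale
    for the remaining non-salient weights of the group."""
    n_rows, n_cols = W.shape
    bitmap = np.zeros_like(W, dtype=bool)
    scales = []
    for start in range(0, n_cols, group_size):
        g = W[:, start:start + group_size]
        k = max(1, int(salient_frac * g.size))
        # Threshold at the k-th largest magnitude in this column group
        thresh = np.partition(np.abs(g).ravel(), -k)[-k]
        mask = np.abs(g) >= thresh
        bitmap[:, start:start + group_size] = mask
        # One shared scaling factor for the group's non-salient weights
        scales.append(np.abs(g[~mask]).mean())
    return bitmap, np.array(scales)

rng = np.random.default_rng(2)
W = rng.standard_normal((8, 64))
bitmap, scales = column_group_bitmap(W)
```

Storing one bitmap entry per weight but only one scale per column group is what keeps the metadata overhead small relative to per-weight or per-row side information.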

Experimental Evaluation
The authors evaluate ARB‑LLM on several prominent LLM families: the OPT series (1.3 B–66 B), LLaMA‑2 (7 B–70 B), and LLaMA‑3 (8 B–70 B). Seven zero‑shot question‑answering benchmarks (e.g., BoolQ, ARC‑Easy, OpenBookQA) serve as the primary accuracy metric. Baselines include full‑precision FP16 models of the same size, the state‑of‑the‑art binary PTQ methods PB‑LLM and BiLLM, and higher‑bit PTQ methods such as GPTQ and SmoothQuant.

Key findings:

  • Accuracy – ARB‑LLM RC (ARB‑RC + CGB) surpasses the same‑size FP16 baseline on all model scales, achieving up to 2.1 % absolute gain on a 13 B model and 0.9 % on a 66 B model. Compared with BiLLM, ARB‑LLM RC delivers 4.8 % (13 B) to 3.2 % (66 B) higher accuracy.
  • Memory Footprint – The binary weights require 1 bit per parameter; the additional row/column scaling vectors and CGB bitmap together occupy only a few kilobytes, yielding a total storage reduction of >75 % relative to 4‑bit or 8‑bit PTQ.
  • Inference Speed – When paired with a custom 1‑bit matrix‑multiply kernel, ARB‑LLM achieves roughly 1.8× speed‑up over FP16 inference on the same hardware, confirming that the theoretical compute reduction translates into practical gains.
  • Ablation – Removing the calibration step (ARB‑X → ARB) reduces accuracy by ~1 % on small models; omitting column scaling (ARB‑RC → ARB‑X) drops performance by 1–2 % on models larger than 13 B, highlighting the complementary nature of the two extensions.
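As a rough sanity check on the storage figures, the per-parameter cost of the binarized representation can be estimated with back-of-the-envelope arithmetic. The matrix size and FP16 scale format below are our own assumptions, and the (compressed) CGB bitmap overhead is ignored for simplicity:

```python
# Storage estimate for one 4096 x 4096 weight matrix (sizes are illustrative).
rows = cols = 4096
binary_bits = rows * cols            # 1 bit per binarized weight
row_scale_bits = rows * 16           # assumed FP16 per-row scales (alpha)
col_scale_bits = cols * 16           # assumed FP16 per-column scales (beta)

total_bits = binary_bits + row_scale_bits + col_scale_bits
bits_per_param = total_bits / (rows * cols)   # ~1.008 bits per parameter

# Reduction relative to a uniform 4-bit PTQ representation
reduction_vs_4bit = 1.0 - bits_per_param / 4.0
```

Even with both scale vectors stored in FP16, the cost stays close to 1 bit per parameter, which is consistent with the roughly 75 % savings over 4‑bit PTQ quoted above.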

Limitations and Future Work
The calibration‑aware variant depends on the availability of representative data; domain shift between calibration and downstream tasks can diminish its benefits. The column scaling currently uses a simple linear factor; exploring non‑linear or multi‑bit column transformations could further narrow the distribution gap. Finally, widespread deployment of 1‑bit inference requires hardware support (e.g., specialized bit‑serial MAC units), which remains an open engineering challenge.

Conclusion
ARB‑LLM presents a systematic solution to the long‑standing distribution shift problem in binary LLM quantization. By alternating refinement of mean, row, and column scaling parameters, and by integrating calibration data and a compact column‑group bitmap, the method dramatically improves binary PTQ accuracy while preserving the extreme memory savings of 1‑bit representation. The experimental results demonstrate that ARB‑LLM RC not only outperforms existing binary PTQ approaches but also exceeds full‑precision FP16 baselines, establishing 1‑bit quantization as a viable path for deploying massive LLMs in resource‑constrained environments.

