Benford's Law as a Distributional Prior for Post-Training Quantization of Large Language Models
The rapid growth of Large Language Models (LLMs) intensifies the need for effective compression, with weight quantization being the most widely adopted technique. Standard uniform quantizers assume that parameters are evenly distributed, an assumption at odds with the highly skewed distributions observed in practice. We propose Benford-Quant, a simple, data-free non-uniform quantizer inspired by Benford's Law, which predicts that leading digits follow a logarithmic distribution. Benford-Quant replaces the uniform grid with a log-spaced codebook, dedicating more resolution to the frequent small-magnitude weights. We provide both theoretical intuition and empirical evidence: (i) weights in the transformational layers of transformers adhere closely to Benford statistics, while normalization layers systematically deviate; (ii) on Small Language Models (SLMs), Benford-Quant consistently improves perplexity, reducing 4-bit perplexity on Gemma-270M by more than 10%; and (iii) on larger LLMs, it remains competitive, with differences explained by over-parameterization effects. Our results indicate that incorporating a Benford-inspired prior into quantization grids is a low-cost modification that yields accuracy gains in aggressive few-bit regimes. Although it does not surpass the state of the art on perplexity or on tasks such as LAMBADA, Benford-Quant can be hybridized with other quantization methods, such as SmoothQuant and Activation-Aware Quantization, without major pipeline modification, potentially improving their performance.
💡 Research Summary
The paper addresses the growing need for compressing large language models (LLMs) by proposing a data‑free, non‑uniform post‑training quantization method called Benford‑Quant (BENQ). The authors observe that weights in transformer “transformational” layers (linear, attention, feed‑forward) exhibit first‑digit distributions that closely follow Benford’s Law, whereas normalization layers (LayerNorm) and embeddings do not. This dichotomy is theoretically justified: multiplicative updates during training and repeated matrix multiplications cause the logarithms of weight magnitudes to become approximately uniformly distributed, a condition that yields Benford‑like leading digits.
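The argument above (multiplicative dynamics make the logarithms of weight magnitudes approximately uniform, which in turn yields Benford-like leading digits) can be checked empirically. The sketch below, an illustration not taken from the paper, measures the first-digit distribution of a synthetic multiplicative process and compares it to the Benford prediction P(d) = log10(1 + 1/d); the same histogram function could be applied to a real layer's weight tensor.

```python
import numpy as np

def first_digit_histogram(weights):
    """Empirical distribution of first significant digits (1..9)."""
    mags = np.abs(weights[weights != 0])
    # The leading digit of x is x scaled into [1, 10), truncated.
    digits = (mags / 10.0 ** np.floor(np.log10(mags))).astype(int)
    return np.bincount(digits, minlength=10)[1:10] / len(digits)

# Benford's Law: P(d) = log10(1 + 1/d) for d = 1..9.
benford = np.log10(1.0 + 1.0 / np.arange(1, 10))

# A product of many random factors has log-magnitudes that are close to
# uniform modulo 1, mirroring the paper's argument for trained weights.
rng = np.random.default_rng(0)
w = np.prod(rng.uniform(0.5, 2.0, size=(20, 100_000)), axis=0)
emp = first_digit_histogram(w)
print(np.round(emp, 3))
print(np.round(benford, 3))
```

On this synthetic process the empirical frequencies track the Benford curve closely, with digit 1 appearing roughly 30% of the time; by the paper's observation, normalization-layer parameters would not show this pattern.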
BENQ replaces the conventional uniform quantization grid with a log‑spaced codebook containing 2^B symmetric levels. More resolution is allocated to small‑magnitude weights, which are far more frequent under the Benford prior. Quantization proceeds in a group‑wise fashion: each block of weights is normalized by its maximum absolute value, then each normalized element is mapped to the nearest codebook level. The dequantization simply looks up the level and rescales by the block’s scale. Crucially, the method is selective: only the transformational layers are quantized with the log‑spaced grid, while LayerNorm and embedding parameters remain in FP16 to preserve stability.
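The group-wise procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the codebook floor `min_exp` (smallest magnitude 10**min_exp), the group size of 128, and the use of 2^B - 1 symmetric levels (log-spaced positives, their negatives, and zero) are all choices made here for concreteness.

```python
import numpy as np

def make_log_codebook(bits, min_exp=-3):
    """Symmetric log-spaced levels in [-1, 1].
    min_exp is an assumed floor: the smallest magnitude is 10**min_exp."""
    pos = np.logspace(min_exp, 0, 2 ** (bits - 1) - 1)
    return np.concatenate([-pos[::-1], [0.0], pos])  # 2^bits - 1 levels

def quantize_blockwise(w, codebook, group_size=128):
    """Normalize each block by its max |w|, then snap to the nearest level."""
    blocks = w.reshape(-1, group_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0  # avoid dividing an all-zero block by zero
    normed = blocks / scales
    idx = np.abs(normed[..., None] - codebook).argmin(axis=-1)
    return idx.astype(np.uint8), scales

def dequantize_blockwise(idx, scales, codebook, shape):
    """Look up each level and rescale by the block's scale."""
    return (codebook[idx] * scales).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4, 256))      # stand-in for a weight matrix
cb = make_log_codebook(bits=4)
idx, scales = quantize_blockwise(w, cb)
w_hat = dequantize_blockwise(idx, scales, cb, w.shape)
```

Because the levels cluster near zero, small-magnitude weights (the bulk of the distribution under the Benford prior) are reconstructed with finer resolution than under a uniform grid; in a selective deployment as described above, this grid would be applied only to the transformational layers.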
Experiments span models from 270M to 72B parameters (Gemma, OPT, BLOOM, Qwen). On small models, BENQ reduces 4-bit perplexity by more than 10% (e.g., Gemma-270M: 39.5 → 32.3). On mid-size and large models it remains competitive with uniform round-to-nearest (RTN) and with more sophisticated methods such as GPTQ. An ablation comparing the Benford-inspired log grid to a generic non-uniform linear grid shows that the logarithmic spacing is essential for the observed gains. The authors also note that BENQ can be combined with activation-aware schemes like SmoothQuant, offering a drop-in improvement without heavy calibration.
Limitations include diminishing returns for very large models where weight spectra flatten, and the method does not surpass state‑of‑the‑art results on tasks like LAMBADA. Hardware implementation of the log‑spaced codebook may require careful trade‑offs between precision and efficiency. Nonetheless, BENQ provides a principled, low‑overhead way to align quantization granularity with the natural statistical distribution of transformer weights, delivering notable accuracy recovery in aggressive few‑bit regimes.