LittleBit: Ultra Low-Bit Quantization via Latent Factorization
The deployment of large language models (LLMs) is frequently hindered by prohibitive memory and computational requirements. While quantization mitigates these bottlenecks, maintaining model fidelity in the sub-1-bit regime remains a persistent challenge. In this paper, we introduce LittleBit, a novel framework for extreme LLM compression. We target quantization rates as low as $0.1$ bits per weight (BPW), achieving a memory reduction of approximately $31\times$, which effectively compresses Llama2-13B to under $0.9$ GB. We represent weights via low-rank latent matrix factorization and subsequently binarize the resulting factors. To counteract the information loss inherent to such drastic precision reduction, we integrate a multi-scale compensation mechanism that learns importance parameters across row, column, and latent dimensions. Two primary contributions enable effective training: Dual Sign-Value-Independent Decomposition (Dual-SVID) for quantization-aware training (QAT) initialization, and Residual Compensation to minimize approximation errors. Extensive experiments confirm the superiority of LittleBit in the sub-1-bit domain; for instance, our method at $0.1$ BPW surpasses the performance of leading techniques operating at $0.7$ BPW on Llama2-7B. We establish a new size-performance trade-off – unlocking a potential $11.6\times$ inference speedup relative to FP16 – and render powerful LLMs practical for resource-constrained environments. Our code is available at https://github.com/SamsungLabs/LittleBit.
💡 Research Summary
LittleBit tackles the pressing problem of deploying large language models (LLMs) on memory‑constrained hardware by pushing quantization into the sub‑1‑bit regime. The authors observe that weight matrices in Transformers exhibit strong low‑rank structure, which motivates a factorization‑first approach: each weight matrix W is approximated as W ≈ U Vᵀ, where U ∈ ℝ^{d_out×r} and V ∈ ℝ^{d_in×r} with r much smaller than the original dimensions. After factorization, both U and V are binarized to ±1, and three sets of learnable FP16 scaling factors are introduced: a row scale h ∈ ℝ^{d_out}, a column scale g ∈ ℝ^{d_in}, and a latent scale ℓ ∈ ℝ^{r}. The effective weight used in the forward pass becomes Ŵ_pri = diag(h) · U_sign · diag(ℓ) · V_signᵀ · diag(g). This representation replaces a large high‑precision GEMM with two small binary matrix multiplications and element‑wise scaling, dramatically reducing both memory footprint and compute cost.
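The factorized forward pass described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the released implementation: the dimensions, random factors, and variable names below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2  # toy sizes; the real rank r is set by the target BPW

# Illustrative binary factors and learnable FP16-style scales (hypothetical values)
U_sign = np.where(rng.standard_normal((d_out, r)) >= 0, 1.0, -1.0)
V_sign = np.where(rng.standard_normal((d_in, r)) >= 0, 1.0, -1.0)
h = rng.random(d_out)  # row scale
g = rng.random(d_in)   # column scale
l = rng.random(r)      # latent scale

# Dense effective weight: diag(h) @ U_sign @ diag(l) @ V_sign.T @ diag(g)
W_eff = (h[:, None] * U_sign) @ (l[:, None] * V_sign.T * g[None, :])

# The forward pass never needs to materialize W_eff: it scales the input,
# runs two skinny (binary) matmuls, and scales the output.
x = rng.standard_normal(d_in)
y = h * (U_sign @ (l * (V_sign.T @ (g * x))))
assert np.allclose(y, W_eff @ x)
```

The assertion checks that the scale-then-binary-matmul pipeline reproduces the dense product, which is why only the ±1 factors and three small scale vectors need to be stored.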
A major obstacle for such an aggressive quantization is the initialization of the highly constrained parameters. To address this, the paper proposes Dual‑Sign‑Value‑Independent Decomposition (Dual‑SVID). First, a truncated SVD of the original weight yields low‑rank factors U′ and V′. The signs of these factors become the binary components U_sign and V_sign. The magnitudes |U′| and |V′| are each approximated by a rank‑1 decomposition, providing initial estimates for the row scale h₀, column scale g₀, and latent scale ℓ₀ (the element‑wise product of the two rank‑1 latent vectors). This initialization ensures that the initial effective weight Ŵ_{pri,0} closely matches W, stabilizing quantization‑aware training (QAT).
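The initialization procedure can be sketched as follows. This is a minimal NumPy reconstruction from the description above; the function names, the √s factor split, and the rank‑1 fitting via a second SVD are my assumptions, not the authors' released code.

```python
import numpy as np

def rank1_magnitude(M):
    """Best rank-1 fit to |M| via its leading singular triplet (illustrative)."""
    u, s, vt = np.linalg.svd(np.abs(M), full_matrices=False)
    # For a nonnegative matrix the leading singular vectors can be taken
    # entrywise nonnegative, so the absolute values below are safe.
    return np.abs(u[:, 0]) * s[0], np.abs(vt[0])

def dual_svid_init(W, r):
    """Sketch of Dual-SVID initialization (assumed structure, not official code)."""
    u, s, vt = np.linalg.svd(W, full_matrices=False)
    A = u[:, :r] * np.sqrt(s[:r])   # d_out x r low-rank factor U'
    B = vt[:r].T * np.sqrt(s[:r])   # d_in  x r low-rank factor V'
    U_sign = np.where(A >= 0, 1.0, -1.0)
    V_sign = np.where(B >= 0, 1.0, -1.0)
    h0, p = rank1_magnitude(A)      # |U'| ~= h0 p^T
    g0, q = rank1_magnitude(B)      # |V'| ~= g0 q^T
    l0 = p * q                      # latent scale: element-wise product
    return U_sign, V_sign, h0, g0, l0

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 12))
U_sign, V_sign, h0, g0, l0 = dual_svid_init(W, r=4)
W_hat0 = (h0[:, None] * U_sign) @ (l0[:, None] * V_sign.T * g0[None, :])
rel_err = np.linalg.norm(W - W_hat0) / np.linalg.norm(W)  # QAT starting point
```

Since U′ ≈ U_sign ⊙ (h₀pᵀ) and V′ ≈ V_sign ⊙ (g₀qᵀ), the product U′V′ᵀ collapses to diag(h₀)·U_sign·diag(p·q)·V_signᵀ·diag(g₀), which is exactly the effective-weight form used in the forward pass.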
Even with Dual‑SVID, the low‑rank binary approximation can leave a non‑trivial residual error, especially at extreme compression rates such as 0.1 bits per weight (BPW). LittleBit therefore introduces Residual Compensation: a parallel secondary path with the same factorized‑binary structure learns to model the residual W_res = W − Ŵ_{pri,0}. This secondary path, also initialized via Dual‑SVID on the residual, contributes its own binary factors and row, column, and latent scales. The final effective weight is the sum Ŵ = Ŵ_pri + Ŵ_res. By reallocating the bit budget from a single higher‑rank approximation to two lower‑rank paths, the method improves expressive power without increasing the overall parameter count.
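The two-path structure can be sketched by fitting one binary-factorized path, then fitting a second identical path to whatever it missed. As before, this is an assumed reconstruction from the prose, not the authors' implementation:

```python
import numpy as np

def svid_path(W, r):
    """One binary-factorized path (sign factors + row/column/latent scales),
    initialized from a truncated SVD. Illustrative sketch only."""
    u, s, vt = np.linalg.svd(W, full_matrices=False)
    A, B = u[:, :r] * np.sqrt(s[:r]), vt[:r].T * np.sqrt(s[:r])
    Us, Vs = np.where(A >= 0, 1.0, -1.0), np.where(B >= 0, 1.0, -1.0)
    ua, sa, va = np.linalg.svd(np.abs(A), full_matrices=False)
    ub, sb, vb = np.linalg.svd(np.abs(B), full_matrices=False)
    h, p = np.abs(ua[:, 0]) * sa[0], np.abs(va[0])
    g, q = np.abs(ub[:, 0]) * sb[0], np.abs(vb[0])
    return (h[:, None] * Us) @ ((p * q)[:, None] * Vs.T * g[None, :])

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 24))

# Primary path, then a secondary path fit to the leftover residual.
W_pri = svid_path(W, r=3)
W_res_hat = svid_path(W - W_pri, r=3)  # Dual-SVID applied to W - W_pri
W_total = W_pri + W_res_hat            # final effective weight

rel = lambda M: np.linalg.norm(W - M) / np.linalg.norm(W)
# rel(W_total) is typically below rel(W_pri): the second path absorbs
# structure the primary path could not capture at its rank budget.
```

Note the bit-budget framing: two rank-r paths store roughly the same number of sign bits and scale parameters as one rank-2r path, so the comparison in the paper is at matched storage cost.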
Experiments span Llama2‑13B, Llama2‑7B, and a 32‑billion‑parameter model, evaluating bit‑widths from 0.1 BPW up to 0.7 BPW. At 0.1 BPW, LittleBit achieves a perplexity of 4.88 on WikiText‑2, outperforming the prior state‑of‑the‑art sub‑1‑bit method STBLLM (which degrades sharply below 0.55 BPW). The 13B model is compressed to under 0.9 GB (≈31× reduction) and shows an estimated 11.6× inference speedup versus FP16. The 32B model at 0.3 BPW retains performance comparable to 0.7 BPW baselines, confirming scalability. Ablation studies demonstrate that removing Dual‑SVID leads to unstable training, while omitting the residual path causes a steep performance drop at the lowest bit rates.
The paper acknowledges limitations: the choice of latent rank r is critical and currently hand‑tuned; hardware support for binary matrix multiplication is still emerging; and the approach is applied only to linear layers, leaving attention‑score quantization as future work. Nonetheless, LittleBit establishes a practical pathway to run powerful LLMs on edge devices or low‑memory servers by achieving ultra‑low‑bit quantization (down to 0.1 BPW) without sacrificing accuracy.