ELUTQ: Optimizing Quantization Accuracy under LUT-Based Computation for Edge LLMs
Weight quantization effectively reduces memory consumption and enables the deployment of large language models on edge devices, yet existing hardware-friendly methods often rely on uniform quantization, which fits weight distributions poorly and incurs high dequantization overhead at low bit-widths. In this paper, we propose ELUTQ, an efficient quantization framework featuring a novel quantization format termed Hierarchical Linear Quantization (HLQ). HLQ is designed to better capture the statistical characteristics of weights and to eliminate dequantization overhead via bit-serial LUT-based GEMM operations. HLQ significantly improves model accuracy at low bit-widths and matches the performance of QAT methods without any retraining of the weights. Moreover, ELUTQ integrates an optimized quantization pipeline that quantizes LLaMA 3.1-70B using only 64 GB of CPU memory and 48 GB of VRAM, lowering the hardware requirements for large-scale model quantization. For efficient deployment on edge devices, ELUTQ provides high-performance kernels supporting end-to-end inference. Our 2-bit LLaMA 3.1-8B achieves a 1.5x speedup over AWQ on an RTX 3090. Code is available at https://github.com/Nkniexin/ELUTQ.
💡 Research Summary
The paper introduces ELUTQ, a comprehensive framework for quantizing large language models (LLMs) to run efficiently on edge devices. The authors identify two major shortcomings of existing hardware‑friendly quantization methods: (1) uniform quantization poorly matches the bell‑shaped distribution of weights, especially at very low bit‑widths (2‑3 bits), leading to significant approximation error; and (2) low‑bit weights must be de‑quantized to 8‑bit or FP16 before matrix multiplication, incurring substantial unpacking overhead that can negate the speed benefits of quantization.
To address these issues, ELUTQ proposes a novel quantization format called Hierarchical Linear Quantization (HLQ). HLQ is a non‑uniform scheme that represents a q‑bit weight as a linear combination of q binary vectors (bit‑planes) multiplied by learned scale factors and a zero‑point. The method builds a codebook of all 2^q binary patterns, then selects for each weight the pattern that minimizes the L2 distance to the original value (bit‑pattern selection). After fixing the binary pattern, HLQ updates the continuous scale and zero‑point parameters via a closed‑form least‑squares solution (linear reconstruction). This alternating optimization is performed group‑wise, enabling massive parallelism and low memory footprint. Experiments show that HLQ reduces the mean‑squared quantization error from the 1e‑4 level (uniform) to the 1e‑6 level, a two‑order‑of‑magnitude improvement.
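The alternating optimization described above can be sketched in a few lines of NumPy. This is a toy illustration of the idea for a single weight group; the function and variable names are our own, not the paper's API, and the real implementation is group-wise and heavily parallelized:

```python
import numpy as np

def hlq_quantize_group(w, q=2, iters=10):
    """Toy HLQ fit for one weight group.

    Each weight is modeled as w_hat = sum_i s_i * b_i + z, where the b_i are
    q binary bits, s holds q learned scales, and z is a zero-point. We
    alternate bit-pattern selection (nearest codebook value) with a
    closed-form least-squares update of (s, z).
    """
    # Codebook of all 2^q binary bit patterns, shape (2^q, q).
    patterns = np.array([[(k >> i) & 1 for i in range(q)]
                         for k in range(2 ** q)], dtype=np.float64)
    # Initialize scales/zero-point so the codebook matches uniform levels.
    z = float(w.min())
    s = (w.max() - w.min()) / (2 ** q - 1) * (2.0 ** np.arange(q))
    for _ in range(iters):
        # Bit-pattern selection: nearest codebook value for each weight.
        codebook = patterns @ s + z                      # (2^q,)
        idx = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1)
        B = patterns[idx]                                # (n, q)
        # Linear reconstruction: closed-form least squares for (s, z).
        A = np.hstack([B, np.ones((len(w), 1))])         # bias column for z
        sol, *_ = np.linalg.lstsq(A, w, rcond=None)
        s, z = sol[:q], float(sol[q])
    return patterns[idx] @ s + z, idx, s, z
```

Because the initialization reproduces uniform quantization and each alternation step is non-increasing in reconstruction error, the fitted HLQ error can only match or beat the uniform baseline on the same group.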
A second key contribution is the integration of HLQ with a bit‑serial lookup‑table (LUT) based GEMM kernel. Instead of de‑quantizing weights, the kernel stores all possible dot‑products between an activation vector and a single‑bit weight plane in a small LUT. The q‑bit weight matrix is decomposed into q binary planes; inference proceeds by performing q table lookups per output element followed by accumulation. This eliminates the de‑quantization step entirely, reduces memory traffic, and yields linear latency scaling with bit‑width. The authors implement pure C++ kernels for both CPUs (using SIMD) and GPUs (using CUDA warp‑level primitives), achieving high throughput across a range of edge hardware.
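A toy NumPy version of the bit-serial LUT trick may clarify the mechanism. The names, the chunk size `g`, and the GEMV (single-vector) framing are illustrative assumptions; the actual kernels are C++/CUDA and operate on full matrices:

```python
import numpy as np

def lut_gemv_bitserial(planes, scales, zero, x, g=4):
    """Toy bit-serial LUT-based GEMV.

    planes: (q, out, in) binary weight bit-planes
    scales: (q,) per-plane scales; zero: scalar zero-point
    x: (in,) activation vector
    Instead of dequantizing, we precompute, for every g consecutive
    activations, the partial sums of all 2^g activation subsets; each
    binary plane then costs one table lookup per g-element chunk.
    """
    q, out, n = planes.shape
    assert n % g == 0
    chunks = x.reshape(-1, g)                            # (n/g, g)
    # lut[c, m] = sum of activations in chunk c selected by bitmask m.
    masks = np.array([[(m >> i) & 1 for i in range(g)]
                      for m in range(2 ** g)])           # (2^g, g)
    lut = chunks @ masks.T                               # (n/g, 2^g)
    y = np.full(out, zero * x.sum())                     # zero-point term
    for p in range(q):                                   # q lookups/output
        bits = planes[p].reshape(out, -1, g)             # (out, n/g, g)
        # Pack each g-bit group into a LUT index.
        idx = (bits * (1 << np.arange(g))).sum(-1).astype(int)
        y += scales[p] * lut[np.arange(n // g), idx].sum(axis=1)
    return y
```

The q-fold loop over planes is what gives the linear latency scaling with bit-width mentioned above: dropping from 3-bit to 2-bit removes one full lookup pass.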
ELUTQ also redesigns the quantization pipeline to be memory‑efficient. By reshaping weights into groups and streaming them, the entire LLaMA 3.1‑70B model can be quantized using only 48 GB of GPU memory and 64 GB of CPU RAM, completing in roughly 40 hours. This contrasts sharply with prior methods such as Quip# that require >1 TB of CPU memory and >200 hours.
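The streaming idea can be illustrated with a memory-mapped pass over a weight matrix. This is a minimal sketch under our own assumptions; `quantize_streaming` and the per-group statistics are placeholders, not the paper's actual pipeline:

```python
import numpy as np

def quantize_streaming(path, shape, group=128, dtype=np.float32):
    """Sketch of a memory-bounded, group-wise quantization pass.

    The weight matrix is memory-mapped rather than loaded whole, and
    visited one row and one group at a time, so peak RAM stays at a
    single row regardless of the full matrix size.
    """
    w = np.memmap(path, dtype=dtype, mode="r", shape=shape)
    qparams = []
    for r in range(shape[0]):
        row = np.asarray(w[r], dtype=np.float64)   # only one row resident
        for start in range(0, shape[1], group):
            g = row[start:start + group]
            # Stand-in per-group statistics; the HLQ fit would go here.
            qparams.append((float(g.min()), float(g.max())))
    return qparams
```

Because each group is processed independently, the same loop parallelizes trivially across workers without growing the resident working set.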
For further accuracy gains, the authors add an efficient fine‑tuning stage tailored to HLQ. It consists of (a) block‑wise reconstruction, where each linear layer’s integer weights, scales, and zero‑points are optimized to minimize the block output error, and (b) end‑to‑end tuning, where the whole model is fine‑tuned with the integer weights fixed, adjusting only the scales and zero‑points. This two‑stage process improves accuracy without introducing extra parameters or large computational overhead, achieving performance comparable to quantization‑aware training (QAT) methods that require retraining.
Experimental results cover LLaMA 3.1‑8B and LLaMA 3.1‑70B models at 2‑bit and 3‑bit precision. The 2‑bit LLaMA 3.1‑8B quantized with ELUTQ runs 1.5× faster on an RTX 3090 than the state‑of‑the‑art uniform method AWQ, while delivering 1–2 % higher accuracy. Compared with codebook‑based non‑uniform methods, HLQ’s accuracy is slightly lower but its hardware friendliness and inference speed are superior. The 70B model quantization demonstrates the pipeline’s scalability, completing with modest hardware resources.
In summary, ELUTQ unifies a high‑accuracy non‑uniform quantization scheme (HLQ) with de‑quantization‑free LUT‑based matrix multiplication, delivering a practical solution for deploying large LLMs on resource‑constrained edge devices. The framework achieves a rare combination of quantization accuracy, inference speed, and low memory requirements, and opens avenues for further extensions to other bit‑widths and specialized edge accelerators.