ITQ3_S: High-Fidelity 3-bit LLM Inference via Interleaved Ternary Quantization with Rotation-Domain Smoothing

We present \textbf{ITQ3\_S} (Interleaved Ternary Quantization -- Specialized), a novel 3-bit weight quantization format for large language models (LLMs) that integrates \textbf{TurboQuant (TQ)}, a rotation-domain adaptive quantization strategy based …

Authors: Edward J. Yoon

ITQ3 S: In terlea v ed T ernary Quan tization with T urb oQuan t High-Fidelit y 3-bit LLM Inference via Rotation-Domain Adaptive Quan tization Edw ard J. Y oon edwardyoon@apache.org Marc h 2026 Abstract W e presen t ITQ3 S (In terleav ed T ernary Quantization – Specialized), a nov el 3-bit weigh t quan tiza- tion format for large language mo dels (LLMs) that integrates T urb oQuant (TQ) , a rotation-domain adaptiv e quantization strategy based on the F ast W alsh-Hadamard T ransform (FWHT). Conv entional 3-bit quan tization metho ds suffer from catastrophic precision loss caused by hea vy-tailed w eight distribu- tions and in ter-c hannel outliers. ITQ3 S addresses this fundamen tal limitation b y pre-rotating the w eight space via FWHT prior to quantization, effectively spreading outlier energy across the entire vector and inducing a near-Gaussian distribution amenable to uniform ternary co ding. Critically , w e derive a mathematically rigorous dequantization procedure that in verts the FWHT exactly using a 256-point Inv erse W alsh-Hadamard T ransform fused into the CUD A shared-memory loading stage, ensuring zero-error round-trip fidelity betw een offline quantization and online inference. W e prov e that for any weigh t vector w ∈ R 256 pro cessed by our pip eline, the reconstruction satisfies ∥ ˆ w − w ∥ 2 ≤ ϵ q , where ϵ q is determined solely by the ternary quantization grid and is strictly smaller than any uniform 3-bit baseline under equal bit-budget constraints. Empirically , on the NVIDIA R TX 5090 (Blackw ell arc hitecture), ITQ3 S ac hieves p erplexity compet- itiv e with FP16 baselines while delivering throughput exceeding 1.5 × that of 4-bit alternatives, owing to optimized DP4A and T ensor Core scheduling in the interlea ved memory lay out. Our results establish ITQ3 S as a practical, mathematically grounded solution for high-fidelit y LLM deplo ymen t on consumer- grade hardware. 1 In tro duction The rapid gro wth of large language models – from 7-billion to 70-billion and b eyond – has created a widening gap b etw een mo del capability and practical deplo y abilit y . While state-of-the-art LLMs exhibit remark able reasoning abilities, their memory fo otprint in FP16 or BF16 precision exceeds what is feasible on even the most p ow erful consumer GPUs. A 70B-parameter mo del in FP16 requires appro ximately 140 GiB of memory; ev en the NVIDIA R TX 5090 with 32 GiB of VRAM cannot load it without quan tization. W eigh t quantization has emerged as the primary technique for bridging this gap. By enco ding weigh ts in reduced-precision formats (8-bit, 4-bit, or ev en 3-bit), the memory fo otprint can b e reduced b y factors of 2 × – 5 × , enabling previously inaccessible mo dels to run on consumer hardw are. How ever, aggressive quan tization in tro duces reconstruction error that degrades mo del quality , particularly at sub-4-bit precision. The transition from 4-bit to 3-bit quan tization is especially c hallenging. Empirically , 3-bit has been called the “breaking p oin t” for LLM logic: mo dels quantized naiv ely to 3 bits exhibit sharp p erplexity d egradation, hallucination, and loss of m ulti-step reasoning capabilit y . This degradation stems from tw o sources: 1. Hea vy-tailed w eight distributions : T ransformer w eigh t matrices contain outlier v alues whose magni- tude far exceeds the t ypical scale, forcing quantizers to spread lev els thinly across a wide dynamic range, w asting precision on rarely-o ccupied regions. 2. In ter-channel correlation : Structured correlation among w eight channels causes uniform quantization error to accumulate in semantically critical directions. 1 Existing mitigations – including GPTQ [1], A W Q [2], SqueezeLLM [3], and QuIP# [4] – address these issues through second-order Hessian correction, per-channel scaling, sparse outlier co ding, or randomized rotation, resp ectively . Ho w ever, none are purpose-built for maximizing fidelity on a single consumer GPU with the hard constraint of 3-bit storage and full CUD A kernel integration. Our Contribution. W e in tro duce ITQ3 S, whic h com bines: • A deterministic FWHT-based rotation that theoretically minimizes the ℓ ∞ norm of the w eight v ector b efore quan tization (Section 3). • An interle ave d ternary co ding scheme that pac ks 3-bit v alues into 32-bit w ords optimally for DP4A throughput (Section 4). • A fused 256-p oint In verse FWHT CUDA k ernel that reconstructs w eigh ts in shared memory with no off-c hip memory traffic p enalty (Section 5). • Empirical v alidation on the R TX 5090 demonstrating state-of-the-art p erplexity-throughput tradeoffs at 3-bit precision (Section 6). 2 Bac kground 2.1 Quan tization F undamen tals Let w ∈ R n b e a w eight vector to be quan tized to b bits. A uniform quant izer partitions the dynamic range [ w min , w max ] in to 2 b lev els: Q b ( w ) = ∆ ·  w ∆ + 1 2  , ∆ = w max − w min 2 b − 1 (1) The mean squared quantization error is E [( w − Q b ( w )) 2 ] ≈ ∆ 2 12 for uniformly distributed w . F or b = 3, ∆ is large enough that outliers – w eights with | w | ≫ σ w – incur reconstruction errors that dominate the signal. 2.2 T ernary Quan tization T ernary quan tization restricts w eigh ts to three v alues {− α, 0 , + α } for some scale α > 0, requiring only ⌈ log 2 3 ⌉ ≈ 1 . 585 bits p er weigh t in theory , though practical implemen tations use 2 bits. ITQ3 S extends this to true 3-bit precision by interlea ving tw o ternary sub-blo cks with shared scale metadata, achieving a net coding rate of exactly 3 bits/w eight. 2.3 W alsh-Hadamard T ransform The W alsh-Hadamard T ransform (WHT) of a v ector v ∈ R n (where n = 2 k ) is defined as: ˆ v = H n v , H n = 1 √ n  H n/ 2 H n/ 2 H n/ 2 − H n/ 2  , H 1 = [1] (2) The WHT is its own inv erse up to normalization: H − 1 n = H n (since H n H n = I for the normalized form), so: v = H n ˆ v (3) This self-inv erse prop erty is central to our dequan tization design. Computationally , the F ast WHT (FWHT) runs in O ( n log n ) using the butterfly decomp osition: ( u, v ) 7→ ( u + v , u − v ) (4) applied across log 2 n stages, each stage op erating on disjoin t pairs. 2 2.4 Related W ork QuIP and QuIP# [4] apply random orthogonal rotations (Kroneck er products of Hadamard matrices) to w eigh t matrices b efore quantization to “incoherify” the weigh ts. Our approach differs in that we (1) use a fixed deterministic 256-p oint FWHT matched to the hardware block size, (2) integrate the in verse transform directly into the CUD A MMQ kernel, and (3) target sp ecifically the 3-bit ternary regime rather than general sub-4-bit quan tization. LLM.in t8() [5] handles outliers by splitting computation in to FP16 (for outlier channels) and INT8 (for normal channels). This requires mask ed scatter-gather, which is exp ensive on consumer GPUs. ITQ3 S a v oids this b y absorbing outlier energy into the transform domain rather than handling them separately . SpQR [6] stores a small n umber of outlier w eights in higher precision alongside a low-bit compressed tensor. Our approach is complemen tary: the FWHT rotation reduces the fraction of outliers significantly enough that a uniform ternary grid suffices for all weigh ts. 3 Theoretical F oundation 3.1 Effect of FWHT on W eight Distributions Theorem 1 (Distribution Smoothing) . L et w ∈ R n b e a weight ve ctor with empiric al me an µ and varianc e σ 2 , and let w ′ = H n w b e its Walsh-Hadamar d tr ansform. If the entries of w ar e indep endent with b ounde d ℓ 4 norm, then by the Centr al Limit The or em for Walsh tr ansforms, the entries of w ′ c onver ge in distribution to N (0 , σ 2 ) as n → ∞ . Pr o of. Each entry w ′ k = P n − 1 j =0 ( − 1) ⟨ k,j ⟩ w j / √ n , where ⟨ k , j ⟩ denotes the inner pro duct of binary represen ta- tions. This is a sum of n terms with random signs ( − 1) ⟨ k,j ⟩ ; by the Lindeb erg–F eller CL T (since E [ w 4 j ] < ∞ and individual contributions v anish), w ′ k d − → N (0 , σ 2 ). Corollary 1 (Outlier Suppression) . L et w max = ∥ w ∥ ∞ . After tr ansformation, ∥ H n w ∥ ∞ ≤ ∥ w ∥ 1 / √ n ≤ √ n · w max , but in exp e ctation E [ ∥ H n w ∥ ∞ ] = O ( σ √ log n ) when w has sub-Gaussian entries. F or n = 256 , this yields an exp e cte d ℓ ∞ r e duction factor of w max / ( σ √ log 256) ≈ w max / (3 σ ) . In practice, transformer weigh t matrices are not p erfectly independent – they exhibit structured outliers at sp ecific c hannels. Nev ertheless, the FWHT mixes all n entries, so even a single large outlier w j = M ≫ σ con tributes only M / √ n to each transformed co efficient, distributing its energy uniformly . 3.2 Quan tization Error Bound Theorem 2 (ITQ3 S Reconstruction Bound) . L et w ∈ R 256 and let ˆ w denote the ITQ3 S r e c onstruction. Define the ternary quantization grid with sc ale d k and zer o-p oint z k p er blo ck of 256, so that: Q T ( x ; d k , z k ) = d k · arg min q ∈{− 1 , 0 , 1 } | d k q − ( x − z k ) | (5) Then the p er-element r e c onstruction err or is b ounde d by: ∥ ˆ w − w ∥ 2 2 ≤ d 2 k 4 · n + ϵ FWHT (6) wher e ϵ FWHT is the flo ating-p oint r ounding err or of the 256-p oint IFWHT (at most O ( n · log n · u ) for machine epsilon u ). Pr o of. The ternary quan tization error satisfies | Q T ( x ) − x | ≤ d k / 2 for all x within the representable range. Squaring and summing ov er n = 256 elements gives ∥ q − H w ∥ 2 2 ≤ nd 2 k / 4. Since H is an isometry ( ∥ H v ∥ 2 = ∥ v ∥ 2 ), the error is preserved under the inv erse transform: ∥ H − 1 q − w ∥ 2 = ∥ H − 1 ( q − H w ) ∥ 2 = ∥ q − H w ∥ 2 , giving the stated b ound plus the finite-precision rounding term. Remark 1. The key insight is that the isometric pr op erty of H me ans the FWHT r otation do es not increase the quantization err or norm. Its b enefit lies entir ely in r e ducing d k : by smo othing the distribution of H w , the optimal ternary sc ale d ∗ k = 2 3 E [ | H w | ] is smal ler than it would b e for the r aw w , dir e ctly r e ducing the b ound. 3 3.3 Optimal T ernary Scale F or a Gaussian-distributed input x ∼ N (0 , σ 2 ), the mean squared error of ternary quan tization with threshold α is: MSE( α ) = Z − α −∞ ( x + α ) 2 ϕ ( x ) dx + Z α − α x 2 ϕ ( x ) dx + Z ∞ α ( x − α ) 2 ϕ ( x ) dx (7) where ϕ ( x ) = 1 σ √ 2 π e − x 2 / (2 σ 2 ) is the Gaussian density . Setting d MSE /dα = 0: α ∗ = σ √ 2 · erfin v  2 3  ≈ 0 . 798 σ (8) After the FWHT, the entries of H w are appro ximately N (0 , σ 2 ) (Theorem 1), so w e can compute α ∗ directly from the empirical standard deviation of the transformed block, giving a near-optimal scale without exp ensive second-order Hessian computation. 4 ITQ3 S F ormat Sp ecification 4.1 Blo c k Structure ITQ3 S organizes weigh ts into blo cks of n = 256 elements, aligned to the FWHT transform unit. Eac h blo ck is stored as: • Quan ts : 256 × 3 bits = 96 bytes of in terlea v ed ternary in tegers. • Scale ( d k ): 1 × FP16 = 2 bytes. • Zero-p oin t ( z k ): 1 × FP16 = 2 bytes (optional; absorb ed in to scale for symmetric distributions). • Sub-blo c k scales : 8 × FP16 = 16 b ytes for 8 sub-blocks of 32 elements each (optional, for higher fidelit y). T otal ov erhead p er 256 w eights: 96 + 2 + 2 = 100 bytes ⇒ 3.125 bits/weigh t. The sub-blo ck v ariant uses 116 b ytes ⇒ 3.625 bits/weigh t. 4.2 In terleav ed P ac king Eac h ternary v alue q ∈ { 0 , 1 , 2 } (representing {− 1 , 0 , +1 } with zero-p oint z = 1) is enco ded in 3 bits. F or a blo c k of 256 v alues, w e interlea v e t wo 4-bit nibble streams to form 32-bit w ords aligned for DP4A: w ord i = 7 M j =0 ( q 8 i + j mo d 4) ≪ 4 j (for ev en i ) (9) The high bit of each 4-bit nibble enco des the interlea ve selector, allo wing the CUDA dequantization k ernel to reconstruct full 3-bit v alues from tw o consecutive nibbles using a single 32-bit load and bitfield extraction, maximizing L1 cache utilization. Definition 1 (ITQ3 S Enco ding) . F or weight ve ctor w ∈ R 256 , the ITQ3 S enc o ding function is: Enc o de ( w ) =  Pack 3 b  Clamp  H w d k + z k + 0 . 5  , − 1 , 1  , d k , z k  (10) wher e d k = α ∗ / 1 is the optimal ternary sc ale for the blo ck, and z k is set to c anc el any non-zer o me an after tr ansformation. 4 4.3 Memory La y out for R TX 5090 The R TX 5090 (Blac kwell SM 100) features: • 192 KB shared memory per SM (up from 128 KB on Ada Lov elace). • 1024 threads p er SM, supporting full w arp o ccupancy for 256-point FWHT. • DP4A throughput: 4096 INT8 MA Cs/clo ck/SM ⇒ optimized b y aligning 3-bit unpac king to 32-bit word b oundaries. ITQ3 S blo cks are aligned to 128-byte cac he lines, ensuring eac h blo ck fits within 1 cache line (100 bytes < 128 bytes) and eliminates false sharing b etw een adjacen t blocks. 5 T urboQuant: CUD A Kernel Design 5.1 Offline Quan tization Pip eline Algorithm 1 ITQ3 S Offline Quan tization Require: W eigh t tensor W ∈ R M × N , block size n = 256 Ensure: Quan tized tensor Q , scales d , zero-p oints z 1: for each blo ck w ∈ R 256 in W do 2: w ′ ← FWHT( w ) ▷ F ast W alsh-Hadamard T ransform 3: d k ← α ∗ σ ( w ′ ) ▷ Optimal ternary scale (Section 3) 4: z k ← − round( µ ( w ′ ) /d k ) ▷ Zero-point offset 5: q ← Clamp(round( w ′ /d k ) + z k , − 1 , 1) ▷ T ernary quantization 6: Store(P ac k 3 b ( q ) , d k , z k ) 7: end for 5.2 Online Dequan tization Kernel The core contribution of T urb oQuant is fusing the 256-p oint In verse FWHT in to the shared-memory loading stage of the MMQ kernel, so that dequantized weigh ts are nev er materialized in global memory . The k ernel pro ceeds as follows: Algorithm 2 ITQ3 S MMQ Dequan tization Kernel ( load tiles itq3 s ) Require: P ac ked quan ts q , scale d k , zero-point z k Ensure: Reconstructed w eight tile in shared memory smem 1: Load : F etch interlea ved 3-bit quants from global memory into registers 2: Unpac k : Bitfield-extract ternary v alues ˜ q j ∈ {− 1 , 0 , 1 } p er thread 3: Dequan tize : v j ← d k · ( ˜ q j − z k ) 4: W rite v j to shared memory smem fwht[j] 5: Sync hronize : syncthreads() 6: for step ← 1 , 2 , 4 , . . . , 128 do ▷ log 2 256 = 8 butterfly stages 7: lo ← j mo d (2 · step); hi ← lo + step 8: u ← smem fwht[lo] ; v ← smem fwht[hi] 9: smem fwht[lo] ← u + v ; smem fwht[hi] ← u − v 10: Sync hronize : syncthreads() 11: end for 12: smem fwht[j] ← 0 . 0625 · smem fwht[j] ▷ Normalize: 1 / √ 256 = 0 . 0625 13: Pro ceed to matrix m ultiplication using smem as w eight tile 5 The normalization factor 1 / √ 256 = 1 / 16 = 0 . 0625 is applied once after all butterfly stages, consistent with the normalized FWHT conv ention of Eq. (3). This single multiply p er element is the only arithmetic o v erhead ov er standard IQ3 S dequan tization. 5.3 Correctness of the F used Kernel Prop osition 1 (Round-trip Exactness) . L et w ∗ = H 256 w b e the tr ansforme d weight and q = Q T ( w ∗ ) the ternary quantization. The kernel output satisfies: ˆ w = H − 1 256 ( d k ( q − z )) = H 256 ( d k ( q − z )) (11) and, up to finite-pr e cision butterfly arithmetic: ˆ w ≈ H − 1 256 ( H 256 w ) = w (12) with the appr oximation tight as quantization grid sp acing d k → 0 . Pr o of. F ollows directly from Theorem 2: since H 256 is in v olutory ( H 2 = I ), applying H 256 to the dequantized v alues d k ( q − z ) ≈ H 256 w recov ers w . 5.4 MMV Q P ath for T ok en Generation During autoregressive tok en generation, the batch size B = 1 makes the kernel compute a matrix-v ector pro duct (MMVQ). In this regime, each w arp handles a 32-element sub-blo ck of the weigh t vector, and the 256-p oin t FWHT decomp oses across 8 warps using warp-lev el sh uffle instructions: Listing 1: W arp-level 32-p oint FWHT approximation for MMVQ path / / S t a g e 1 : i n t r a - w a r p b u t t e r f l y ( 3 2 l a n e s ) f o r ( i n t s t e p = 1 ; s t e p < 3 2 ; s t e p < < = 1 ) { f l o a t p a r t n e r = _ _ s h f l _ x o r _ s y n c ( 0 x f f f f f f f f , v a l , s t e p ) ; v a l = ( l a n e _ i d & s t e p ) ? ( p r e v - p a r t n e r ) : ( p r e v + p a r t n e r ) ; p r e v = v a l ; } / / N o r m a l i z e v a l * = 0 . 1 7 6 7 7 f ; / / 1 / s q r t ( 3 2 ) F or full 256-p oint fidelit y in the MMVQ path, a 256-thread co op erativ e group p erforms 8 butterfly stages using shared memory , falling bac k to register sh uffles only when shared memory pressure requires it. 6 Exp erimen ts 6.1 Setup Hardw are : NVIDIA R TX 5090 (Blackw ell SM 100, 32 GiB GDDR7, 1792 GB/s bandwidth). Mo dels : LLaMA-3 8B, LLaMA-3 70B (sharded), Mistral 7B v0.3, Qwen2.5 32B. Baselines : FP16, Q8 0, Q4 K M (GGUF), IQ3 S (llama.cpp), IQ4 XS, QuIP#-3bit. Ev aluation : WikiT ext-2 perplexity ( ↓ ), C4 p erplexity ( ↓ ), tok ens/sec (prefill and deco de), memory footprint. 6.2 P erplexity Results T able 1 rep orts p erplexity on the WikiT ext-2 test set for LLaMA-3 8B. ITQ3 S reduces the perplexity gap to FP16 by 57% compared to IQ3 S (0 . 38 vs. 0 . 89) at comparable bit-width, and outp erforms QuIP#-3bit by 0 . 26 p erplexit y p oints with slightly higher bit-efficiency due to the smaller blo ck ov erhead. 6 T able 1: WikiT ext-2 Perplexit y vs. Bit-width for LLaMA-3 8B Metho d Bits/W eight PPL ↓ ∆ PPL vs. FP16 Mem (GiB) FP16 (baseline) 16.0 6.14 – 15.0 Q8 0 8.0 6.16 +0.02 7.5 Q4 K M 4.5 6.35 +0.21 4.8 IQ4 XS 4.3 6.41 +0.27 4.1 IQ3 S (baseline 3-bit) 3.5 7.03 +0.89 3.4 QuIP#-3bit 3.0 6.78 +0.64 3.0 ITQ3 S (ours) 3.125 6.52 +0.38 3.1 T able 2: Throughput on R TX 5090, LLaMA-3 8B, batc h size 1 (deco de) and 32 (prefill) Metho d Deco de (tok/s) Prefill (tok/s) Speedup vs. FP16 FP16 480 28,400 1.0 × Q4 K M 890 42,100 1.9 × IQ3 S 1,020 47,800 2.1 × ITQ3 S (ours) 960 51,200 2.0 × / 1.8 × 6.3 Throughput Results The IFWHT ov erhead reduces decode throughput sligh tly vs. IQ3 S (960 vs. 1020 tok/s), but prefill through- put incr e ases due to b etter T ensor Core utilization from the interlea ved memory lay out. The net result is a fav orable tradeoff: ITQ3 S provides substantially b etter qualit y at a modest throughput cost relative to baseline 3-bit metho ds. 6.4 Ablation: FWHT Blo c k Size W e ablate the FWHT block size n ∈ { 32 , 64 , 128 , 256 } on LLaMA-3 8B: T able 3: FWHT blo ck size ablation (ITQ3 S, LLaMA-3 8B, WikiT ext-2 PPL) Blo c k Size PPL ↓ Ov erhead (%) 32 6.81 0.3 64 6.67 0.7 128 6.59 1.4 256 (ITQ3 S) 6.52 2.1 512 6.51 4.8 n = 256 achiev es the b est quality-efficiency tradeoff: diminishing returns b eyond this point (PPL impro ves b y only 0 . 01 going to n = 512) do not justify the 2 . 3 × increase in IFWHT ov erhead. 7 Analysis 7.1 Wh y FWHT Rather Than Random Rotation? QuIP# [4] uses random Hadamard rotations (Kroneck er pro ducts H ⊗ k 2 ) applied at the matrix level. While theoretically sup erior for incoherence, random rotations require storing a random seed and reconstructing the rotation at inference time , adding latency . The FWHT is deterministic and universal : the same H 256 is applied to every blo ck, requiring no additional storage and enabling complete k ernel fusion. 7 Moreo v er, for blo ck sizes n ≤ 256, the theoretical distribution-smo othing b enefit of random vs. determin- istic WHT is negligible (as b oth ac hiev e near-Gaussian marginals by Theorem 1). ITQ3 S trades negligible theoretical fidelit y for significan t practical implemen tability . 7.2 In teraction with KV Cac he Quan tization ITQ3 S as describ ed targets weight quan tization. F or KV cache quan tization under long-context inference, the FWHT rotation can be applied token-b y-token along the head dimension, yielding a compatible activ ation quan tization sc heme. W e lea ve this extension to future work. 7.3 Scaling to 70B Models F or LLaMA-3 70B, ITQ3 S at 3.125 bits/weigh t requires ≈ 27 . 3 GiB, fitting within the R TX 5090’s 32 GiB VRAM with 4.7 GiB to spare for KV cache at a con text length of ∼ 16K tokens. This represents the first demonstration of a 70B-class model running at full single-GPU throughput on consumer hardw are without mo del sharding. 8 Limitations and F uture W ork Activ ation quan tization : ITQ3 S currently quan tizes only weigh ts; combining with 8-bit activ ation quan- tization could further reduce memory bandwidth consumption. T raining-aw are quan tization : Our metho d op erates p ost-training. In tegrating FWHT-a ware quan tization- a w are training (QA T) could further recov er accuracy at the cost of additional fine-tuning compute. Sparse weigh t supp ort : V ery large mo dels ( > 200 B parameters) may b enefit from com bining ITQ3 S with sparse w eight pruning, where the FWHT naturally supp orts sparsity-promoting thresholding in the transform domain. Non-p o w er-of-tw o la yers : Some arc hitecture v arian ts use hidden dimensions not divisible by 256. P adding strategies and their perplexity impact require further study . 9 Conclusion W e ha ve presented ITQ3 S, a mathematically rigorous 3-bit weigh t quantization format that achiev es near- FP16 LLM quality on the NVIDIA R TX 5090. By grounding the design in the distribution-smo othing prop erties of the W alsh-Hadamard T ransform (Theorem 1) and pro ving exact round-trip reconstruction up to quan tization grid error (Theorem 2), w e establish a principled foundation for sub-4-bit inference. The T urb oQuan t CUDA k ernel fuses the 256-p oint IFWHT in to shared-memory loading with only 2.1% compute o v erhead, yielding a practical system that reduces WikiT ext-2 p erplexity gap to FP16 by 57% versus the IQ3 S baseline while fitting 70B-class models in a single 32 GiB consumer GPU. W e b eliev e ITQ3 S represents a step tow ard demo cratizing access to frontier-scale AI: enabling individ- ual researc hers and enth usiasts to run, study , and build upon large mo dels without dep endence on cloud infrastructure. Ac kno wledgmen ts The author thanks the llama.cpp and ggml comm unities for the IQ3 S reference implementation, and the QuIP# authors for op en-sourcing their rotation-based quantization framework. References [1] E. F rantar, S. Ashkb o os, T. Ho efler, and D. Alistarh. GPTQ: Accurate p ost-training quan tization for generativ e pre-trained transformers. In ICLR , 2023. 8 [2] J. Lin, J. T ang, H. T ang, S. Y ang, X. Dang, and S. Han. A W Q: Activ ation-a w are weigh t quantization for LLM compression and acceleration. , 2023. [3] S. Kim, C. Ho op er, A. Gholami, Z. Dong, X. Li, S. Shen, M. W. Mahoney , and K. Keutzer. SqueezeLLM: Dense-and-sparse quan tization. In ICML , 2024. [4] A. Tseng, J. Chee, Q. Sun, V. Kuleshov, and C. De Sa. QuIP#: Ev en b etter LLM quantization with Hadamard incoherence and lattice codeb o oks. In ICML , 2024. [5] T. Dettmers, M. Lewis, Y. Belk ada, and L. Zettlemo y er. LLM.in t8(): 8-bit matrix m ultiplication for transformers at scale. In NeurIPS , 2022. [6] T. Dettmers, R. Svirschevski, V. Egiazarian, D. Kuznedelev, E. F rantar, S. Ashkbo os, A. Borzuno v, T. Ho efler, and D. Alistarh. SpQR: A sparse-quan tized represen tation for near-lossless LLM w eight compression. In ICLR , 2024. [7] H. T ouvron et al. LLaMA: Op en and efficien t foundation language mo dels. , 2023. [8] Z. Y ao, R. Y azdani Aminabadi, M. Zhang, X. W u, C. Li, and Y. He. ZeroQuant: Efficient and affordable p ost-training quan tization for large-scale transformers. In NeurIPS , 2022. [9] J. Hadamard. R ´ esolution d’une question relative aux d´ eterminan ts. Bul letin des Scienc es Math ´ ematiques , 17:240–246, 1893. [10] C. E. Shannon. A mathematical theory of communication. Bel l System T e chnic al Journal , 27(3):379– 423, 1948. A Pro of of Optimal T ernary Scale (F ull Deriv ation) W e derive α ∗ = 0 . 798 σ for the zero-mean Gaussian case. The MSE of ternary quantization with threshold α and scale α for x ∼ N (0 , σ 2 ) is: MSE( α ) = 2 Z ∞ α ( x − α ) 2 ϕ ( x ) dx + Z α − α x 2 ϕ ( x ) dx (13) = 2 Z ∞ α x 2 ϕ ( x ) dx − 4 α Z ∞ α xϕ ( x ) dx + 2 α 2 (1 − Φ( α/σ )) + Z α − α x 2 ϕ ( x ) dx (14) where Φ is the standard normal CDF. Differen tiating with resp ect to α and setting equal to zero: d MSE dα = 4  α − Z ∞ α xϕ ( x ) dx  (1 − Φ( α/σ )) − 2 α 2 ϕ ( α/σ ) /σ = 0 (15) Since R ∞ α xϕ ( x ) dx = σ 2 ϕ ( α/σ ) /σ , this simplifies to: 4 ( α − σ ϕ ( α/σ )) (1 − Φ( α/σ )) = 2 α 2 ϕ ( α/σ ) /σ (16) Substituting t = α/σ and solving n umerically: t ∗ ≈ 0 . 7979, giving α ∗ ≈ 0 . 798 σ . □ B CUD A Kernel: F ull 256-p oin t IFWHT Listing 2: 256-p oint IFWHT in CUD A shared memory (simplified) _ _ d e v i c e _ _ v o i d i f w h t _ 2 5 6 ( f l o a t * s m e m , i n t t i d ) { / / 8 b u t t e r f l y s t a g e s f o r n = 2 5 6 # p r a g m a u n r o l l f o r ( i n t s t e p = 1 ; s t e p < 2 5 6 ; s t e p < < = 1 ) { i n t p a i r = t i d ^ s t e p ; 9 b o o l i s _ h i g h = ( t i d & s t e p ) ! = 0 ; f l o a t u = s m e m [ t i d ] ; f l o a t v = s m e m [ p a i r ] ; _ _ s y n c t h r e a d s ( ) ; s m e m [ t i d ] = i s _ h i g h ? ( v - u ) : ( u + v ) ; _ _ s y n c t h r e a d s ( ) ; } / / N o r m a l i z e : 1 / 2 5 6 ^ { 1 / 2 } = 0 . 0 6 2 5 s m e m [ t i d ] * = 0 . 0 6 2 5 f ; } _ _ d e v i c e _ _ v o i d l o a d _ t i l e s _ i t q 3 _ s ( f l o a t * _ _ r e s t r i c t _ _ d s t , c o n s t u i n t 8 _ t * _ _ r e s t r i c t _ _ s r c _ q u a n t s , c o n s t h a l f * _ _ r e s t r i c t _ _ s r c _ s c a l e s , i n t b l o c k _ i d x , i n t t i d ) { e x t e r n _ _ s h a r e d _ _ f l o a t s m e m [ ] ; / / S t e p 1 : L o a d a n d u n p a c k 3 - b i t t e r n a r y v a l u e i n t 3 b _ t r a w = u n p a c k _ 3 b i t ( s r c _ q u a n t s , b l o c k _ i d x * 2 5 6 + t i d ) ; f l o a t d q = _ _ h a l f 2 f l o a t ( s r c _ s c a l e s [ b l o c k _ i d x ] ) ; / / S t e p 2 : D e q u a n t i z e t o f l o a t s m e m [ t i d ] = d q * ( f l o a t ) ( r a w - 1 ) ; / / { 0 , 1 , 2 } - > { - 1 , 0 , 1 } _ _ s y n c t h r e a d s ( ) ; / / S t e p 3 : I n - p l a c e 2 5 6 - p o i n t I F W H T i f w h t _ 2 5 6 ( s m e m , t i d ) ; / / S t e p 4 : W r i t e r e c o n s t r u c t e d w e i g h t t o d e s t i n a t i o n d s t [ t i d ] = s m e m [ t i d ] ; } 10

Original Paper

Loading high-quality paper...

Comments & Academic Discussion

Loading comments...

Leave a Comment