A Practical Tensor-Network Compression Pipeline for Production-Scale Large Language Models
Large language models are limited in deployment by GPU memory and inference latency. We present Minima, a production compression pipeline that learns where and how to structurally compress a Transformer and turns that compression into real serving gains. Minima trains a lightweight convolutional predictor to estimate layer- and patch-level sensitivity, applies a mixture of Tucker, tensor-train, and tensor-ring decompositions to low-sensitivity regions, performs a short healing fine-tune, and executes the resulting operators with custom Triton and CUDA kernels. The reduced memory footprint enables speculative decoding with a small draft model and a larger verifier. On Qwen3-32B at an 8k-token context window, Minima reduces peak VRAM from 64 GiB to 40 GiB. For a single active request, throughput increases from 40 tokens per second (baseline) to 50 tokens per second (Minima) and 75 tokens per second (Minima with speculative decoding). Under 50 parallel requests, throughput is 34, 44, and 53 tokens per second respectively, showing that Minima remains effective under high concurrency even when speculative decoding gains compress. We position Minima relative to recent tensor-network, low-rank plus quantization, and cross-layer sharing methods, and argue that it is a practical step toward more aggressive structural compression via shared tensor backbones with tiny per-layer adapters.
💡 Research Summary
The paper introduces Minima, a production‑ready compression pipeline for large language models (LLMs) that combines tensor‑network (TN) decompositions, a learned sensitivity predictor, runtime kernel engineering, and speculative decoding to achieve substantial memory savings and speed‑ups without sacrificing quality. The pipeline consists of five stages:
- **Analyze** – A lightweight convolutional neural network (CNN) predicts a per‑layer, per‑patch sensitivity score and a recommended compression budget (including which TN family to use and the target rank). The CNN is trained on a small set of patches where various decompositions and ranks have been evaluated, using cheap statistics such as local singular‑value spectra, condition numbers, magnitude, sparsity, and positional information. This replaces manual profiling and can generate a full compression plan for a 32‑billion‑parameter model in roughly 20 minutes.
- **Compress** – Guided by the sensitivity map, low‑sensitivity patches are compressed with a mixture of Tucker, Tensor‑Train (TT), and Tensor‑Ring (TR) decompositions. Tucker is chosen for moderately rectangular matrices, TT for very long skinny or fat matrices, and TR where cyclic symmetry is natural. Ranks are set to meet the predicted compression ratio, yielding a 35–40 % reduction in parameters and a 37 % drop in peak VRAM (64 GiB → 40 GiB for Qwen3‑32B at an 8k context).
- **Heal** – A short fine‑tuning phase (a few thousand batches) restores most of the lost quality. The relative perplexity increase is limited to ≤ 3 %, and benchmark accuracies stay within ±1 percentage point of the dense baseline.
- **Optimize Kernels** – Custom Triton/CUDA kernels are written for the structured matrix multiplications that arise from the TN factors. By re‑ordering data, exploiting warp‑level parallelism, and avoiding dense BLAS calls, inference throughput improves an additional ~10 % beyond the raw compression gains, reaching ~50 tokens per second (TPS) from the original ~40 TPS.
- **Speculative Decoding** – The freed memory enables a small draft model to generate candidate tokens, which a larger "verifier" (the compressed model) validates in batches. This two‑stage generation reduces the number of expensive forward passes per token, pushing throughput to ~75 TPS for a single request. The benefit persists at high concurrency, though it narrows: with 50 parallel requests, the baseline, compressed‑only, and compressed + speculative configurations deliver 34, 44, and 53 TPS respectively.
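To make the Analyze stage concrete, the cheap per‑patch statistics it feeds to the sensitivity CNN can be sketched in NumPy. The paper names the statistics but not their implementation, so the exact feature set below is an illustrative assumption:

```python
import numpy as np

def patch_features(patch: np.ndarray, eps: float = 1e-3) -> dict:
    """Cheap per-patch statistics of the kind the sensitivity CNN consumes
    (hypothetical feature set; positional features omitted for brevity)."""
    s = np.linalg.svd(patch, compute_uv=False)           # local singular-value spectrum
    return {
        "top_sv_ratio": float(s[0] / s.sum()),           # near 1.0 => patch is close to low rank
        "cond": float(s[0] / max(s[-1], 1e-12)),         # condition number
        "mean_abs": float(np.abs(patch).mean()),         # magnitude
        "sparsity": float((np.abs(patch) < eps).mean()), # fraction of near-zero entries
    }

rng = np.random.default_rng(0)
# A nearly rank-1 patch: a good candidate for aggressive compression.
low_rank_patch = (rng.normal(size=(64, 1)) @ rng.normal(size=(1, 64))
                  + 0.01 * rng.normal(size=(64, 64)))
feats = patch_features(low_rank_patch)
```

A patch dominated by one singular value yields a `top_sv_ratio` near 1, which a predictor can learn to map to a deep rank cut.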
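The Compress stage's shape‑based family choice and rank budgeting can likewise be sketched as a simple heuristic. The aspect‑ratio threshold and the `rank_for_ratio` helper are assumptions for illustration, not the paper's actual rules:

```python
def choose_tn_family(rows: int, cols: int, cyclic: bool = False) -> str:
    """Pick a tensor-network family from matrix shape (threshold is an assumption)."""
    if cyclic:
        return "tensor-ring"        # TR where cyclic symmetry is natural
    aspect = max(rows, cols) / min(rows, cols)
    if aspect >= 8:
        return "tensor-train"       # TT for very long skinny/fat matrices
    return "tucker"                 # Tucker for moderately rectangular ones

def rank_for_ratio(rows: int, cols: int, target_ratio: float) -> int:
    """Largest rank r such that a two-factor split (rows*r + r*cols parameters)
    stays within target_ratio of the dense rows*cols parameter count."""
    return max(1, int(target_ratio * rows * cols / (rows + cols)))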
Ablation studies show that (i) the learned sensitivity predictor allows more aggressive rank reduction than uniform selection, (ii) the mixed‑TN approach outperforms any single decomposition, and (iii) speculative decoding synergizes with compression because the verifier can be smaller and faster.
Compared with prior work—such as CompactifAI, TensorLLM, low‑rank + quantization schemes, and cross‑layer sharing methods—Minima distinguishes itself by (a) using a heterogeneous TN toolbox rather than a single MPO or block‑term format, (b) automating the compression decision process, (c) delivering end‑to‑end latency improvements through custom kernels, and (d) integrating speculative decoding as a complementary speed‑up. The authors also discuss how Minima could evolve toward a shared‑tensor backbone with per‑layer adapters, further increasing compression ratios while preserving adaptability.
In summary, Minima demonstrates that practical, production‑scale structural compression of LLMs is achievable: it halves memory usage, roughly doubles inference throughput, requires only a brief post‑training fine‑tune, and can be combined with other techniques such as quantization for even greater efficiency. This makes it a compelling step toward more aggressive, globally shared tensor architectures for future LLM deployments.
Comments & Academic Discussion
Loading comments...
Leave a Comment