Efficient Post-Training Pruning of Large Language Models with Statistical Correction
Post-training pruning is an effective approach for reducing the size and inference cost of large language models (LLMs), but existing methods often face a trade-off between pruning quality and computational efficiency. Heuristic pruning methods are efficient but sensitive to activation outliers, while reconstruction-based approaches improve fidelity at the cost of heavy computation. In this work, we propose a lightweight post-training pruning framework based on first-order statistical properties of model weights and activations. During pruning, channel-wise statistics are used to calibrate magnitude-based importance scores, reducing bias from activation-dominated channels. After pruning, we apply an analytic energy compensation to correct distributional distortions caused by weight removal. Both steps operate without retraining, gradients, or second-order information. Experiments across multiple LLM families, sparsity patterns, and evaluation tasks show that the proposed approach improves pruning performance while maintaining computational cost comparable to heuristic methods. The results suggest that simple statistical corrections can be effective for post-training pruning of LLMs.
💡 Research Summary
The paper tackles the long‑standing trade‑off in post‑training pruning of large language models (LLMs) between pruning quality and computational cost. Existing heuristic methods (e.g., magnitude‑based or weight‑activation products such as WANDA) are fast but suffer from sensitivity to activation outliers, leading to sub‑optimal sparsity patterns. Reconstruction‑based approaches (e.g., SparseGPT) achieve higher fidelity by using second‑order information to adjust remaining weights, but they incur heavy memory and runtime overhead due to Hessian approximations and iterative updates.
The authors propose a lightweight, two‑stage framework that relies solely on first‑order statistics of pretrained weights and activations, thereby avoiding gradients, second‑order tensors, or any retraining.
Stage 1 – Variance‑Calibrated Importance (CVR).
For each input channel (j), they compute the variance of the corresponding weight column (v_j = \frac{1}{d_{out}}\sum_i W_{ij}^2) and form a calibration factor (c_j = (v_j+\epsilon)^{-\alpha/2}). Simultaneously, they estimate the activation variance (v^x_j) on a small calibration dataset and derive an activation‑based factor (a_j = (v^x_j)^{1/4}). The final importance score for weight (W_{ij}) becomes
(S_{ij}=|W_{ij}|\cdot a_j \cdot c_j).
Thus, channels with high weight variance (unstable) are down‑weighted, while the fourth‑root activation factor dampens the influence of channels with extreme activation variance. The mask is built by removing the lowest‑scoring weights until the target sparsity is reached. Importantly, this step requires only a single pass over the calibration data and the pretrained weight matrix.
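The two factors and the resulting mask can be sketched in a few lines of NumPy. This is a minimal illustration of the formulas above, not the authors' implementation; the function names and the default (\alpha=1) are assumptions.

```python
import numpy as np

def cvr_importance(W, x_var, alpha=1.0, eps=1e-8):
    """Variance-calibrated importance S_ij = |W_ij| * a_j * c_j.

    W     : (d_out, d_in) weight matrix
    x_var : (d_in,) per-channel activation variance from calibration data
    alpha : calibration exponent (hyperparameter; default is an assumption)
    """
    v = (W ** 2).mean(axis=0)        # v_j = (1/d_out) * sum_i W_ij^2
    c = (v + eps) ** (-alpha / 2)    # weight calibration factor c_j
    a = x_var ** 0.25                # activation factor a_j = (v^x_j)^{1/4}
    return np.abs(W) * (a * c)       # broadcasts (d_in,) over rows

def prune_mask(S, sparsity):
    """Remove the lowest-scoring weights to hit the target sparsity exactly."""
    k = int(round(S.size * sparsity))          # number of weights to drop
    mask = np.ones(S.size, dtype=bool)
    mask[np.argsort(S, axis=None)[:k]] = False # zero out the k lowest scores
    return mask.reshape(S.shape)
```

A single pass over the calibration set to estimate `x_var`, plus one pass over each weight matrix, is all the data this step consumes.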
Stage 2 – Energy‑Compensated Weight Adjustment (EC).
After the mask is fixed, the remaining weights (\tilde{W}=M\odot W) typically exhibit reduced ℓ₂ energy, causing a systematic shrinkage of layer outputs. The authors address two effects: (1) mean shift, and (2) energy collapse. They compute column‑wise and row‑wise means of the original weight matrix ((\mu_{col},\mu_{row})), center both original and pruned matrices, and then rescale each column and row by a factor that matches the original centered energy:
(s_{col,j}= \sqrt{\frac{E^{orig}_{col}(j)}{E^{pruned}_{col}(j)}+\epsilon}) and similarly for rows.
The corrected weight matrix is obtained by applying these scalings and re‑applying the mask, preserving exact sparsity. This analytic correction is closed‑form, non‑iterative, and independent of the pruning criterion.
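A NumPy sketch of the correction, under stated assumptions: the exact order of the column and row passes, how the centering mean is restored, and the placement of (\epsilon) as a division guard are choices made here for illustration, not details confirmed by the paper.

```python
import numpy as np

def energy_compensate(W, M, eps=1e-8):
    """Closed-form energy compensation (EC) sketch.

    Centers the pruned matrix, rescales each column (then each row) so its
    centered l2 energy matches the original matrix, and re-applies the mask
    so the sparsity pattern is preserved exactly.
    """
    Wp = W * M
    # Column pass: match centered column energy of the original matrix.
    mu_c, mu_cp = W.mean(axis=0), Wp.mean(axis=0)
    E_c = ((W - mu_c) ** 2).sum(axis=0)
    E_cp = ((Wp - mu_cp) ** 2).sum(axis=0)
    s_c = np.sqrt(E_c / (E_cp + eps))        # eps guards fully pruned columns
    Wc = ((Wp - mu_cp) * s_c + mu_cp) * M
    # Row pass: analogous correction on the column-corrected matrix.
    mu_r = W.mean(axis=1, keepdims=True)
    mu_rp = Wc.mean(axis=1, keepdims=True)
    E_r = ((W - mu_r) ** 2).sum(axis=1, keepdims=True)
    E_rp = ((Wc - mu_rp) ** 2).sum(axis=1, keepdims=True)
    s_r = np.sqrt(E_r / (E_rp + eps))
    return ((Wc - mu_rp) * s_r + mu_rp) * M
```

Because the final multiplication by `M` re-zeros any entries touched by the mean shift, the output has exactly the same sparsity pattern as the input mask, regardless of which criterion produced it.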
Experimental Evaluation.
The method is tested on three families of LLMs—Meta’s LLaMA‑2 (7B/13B/70B), LLaMA‑3 (8B/70B), and Qwen2.5 (7B/14B/32B/72B)—across a range of sparsity levels (50% to 90%) and both semi‑structured (e.g., 2:4, 4:8) and unstructured patterns. Perplexity on WikiText‑2 and downstream task metrics are reported. Across the board, the CVR + EC pipeline outperforms plain magnitude pruning and WANDA, often narrowing the gap to reconstruction‑based methods while retaining the computational footprint of heuristic approaches. For example, at 50% sparsity, CVR + EC reduces perplexity by roughly 10–20% relative to WANDA and matches or beats SparseGPT’s performance without any second‑order computation. Runtime analysis shows only a modest (≈5%) increase over pure heuristic pruning, confirming the method’s practicality for large‑scale deployment.
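For the semi‑structured settings, an n:m pattern constrains each contiguous group of m weights along the input dimension to keep at most n nonzeros (2:4 is the pattern accelerated by GPU sparse tensor cores). A minimal sketch of building such a mask from any importance score matrix—the function name is illustrative, not from the paper:

```python
import numpy as np

def nm_mask(S, n=2, m=4):
    """Semi-structured n:m mask: keep the n highest-scoring weights in
    each contiguous group of m along the input dimension."""
    d_out, d_in = S.shape
    assert d_in % m == 0, "input dim must be divisible by m"
    groups = S.reshape(d_out, d_in // m, m)
    order = np.argsort(groups, axis=-1)          # ascending within each group
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, order[..., -n:], True, axis=-1)  # keep top-n
    return mask.reshape(d_out, d_in)
```

Because the scoring and the mask construction are decoupled, the same routine works whether the scores come from plain magnitudes, WANDA‑style products, or a calibrated criterion like CVR.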
Contributions.
- Introduces a variance‑calibrated importance metric that mitigates activation‑driven bias without extra gradient or Hessian calculations.
- Proposes an analytic energy‑matching correction that restores layer‑wise signal scale after pruning.
- Demonstrates that both steps are criterion‑agnostic, enabling seamless integration with existing pruning pipelines.
- Provides extensive empirical evidence of consistent gains across multiple model families, sparsity regimes, and evaluation tasks, all with negligible overhead.
Implications and Future Work.
The work shows that simple statistical corrections—weight variance and activation variance for selection, and energy matching for post‑pruning adjustment—are sufficient to achieve high‑quality sparse LLMs without costly reconstruction. This opens the door to fast, on‑device model compression, rapid iteration cycles, and broader accessibility of LLMs on resource‑constrained hardware. Future research may extend the framework to other architectural components (e.g., multi‑head attention matrices, layer‑norm parameters), explore adaptive calibration of the hyper‑parameter (\alpha), or combine the approach with structured block‑pruning and hardware‑aware sparsity patterns.