CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Vision Transformers


Vision Transformers achieve strong accuracy but incur high compute and memory cost. Structured pruning can reduce inference cost, but most methods rely on retraining or multi-stage optimization. These requirements limit post-training deployment. We propose **CORP**, a closed-form one-shot structured pruning framework for Vision Transformers. CORP removes entire MLP hidden dimensions and attention substructures without labels, gradients, or fine-tuning. It operates under strict post-training constraints using only a small unlabeled calibration set. CORP formulates structured pruning as a representation recovery problem. It models removed activations and attention logits as affine functions of retained components and derives closed-form ridge regression solutions that fold compensation into model weights. This minimizes expected representation error under the calibration distribution. Experiments on ImageNet with DeiT models show strong redundancy in MLP and attention representations. Without compensation, one-shot structured pruning causes severe accuracy degradation. With CORP, models preserve accuracy under aggressive sparsity. On DeiT-Huge, CORP retains 82.8% Top-1 accuracy after pruning 50% of both MLP and attention structures. CORP completes pruning in under 20 minutes on a single GPU and delivers substantial real-world efficiency gains.


💡 Research Summary

Vision Transformers achieve state‑of‑the‑art performance but are computationally heavy, limiting deployment on resource‑constrained devices. Structured pruning—removing whole channels, heads, or MLP dimensions—offers practical speedups, yet most existing methods rely on retraining, multi‑stage optimization, or access to labels and gradients. This paper introduces CORP (Closed‑Form One‑shot Representation‑Preserving structured pruning), a post‑training framework that prunes Vision Transformers in a single forward pass without any fine‑tuning.

The key insight is that the primary cause of accuracy collapse after one‑shot structured pruning is representation error, not poor importance ranking. When a set of channels or head dimensions is removed, the remaining network receives biased intermediate activations that accumulate across layers. CORP treats pruning as a representation recovery problem: it explicitly models the activations (for MLPs) or attention logits (for self‑attention) that are lost due to pruning as affine functions of the retained components, and then solves for the affine parameters in closed form using a small unlabeled calibration set.

For MLP blocks, the original output is $y = W_S x_S + W_P x_P + b$, where $x_S$ and $x_P$ are the retained and pruned hidden activations. The pruned activations $x_P$ are approximated by the affine predictor $B x_S + c$. Ridge regression on calibration data yields $B = \Sigma_{PS}(\Sigma_{SS} + \lambda I)^{-1}$ and $c = \mu_P - B\mu_S$. Substituting back gives compensated weights $\hat{W}_S = W_S + W_P B$ and bias $\hat{b} = b + W_P c$, so the pruned channels never need to be computed at inference time.
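This closed-form compensation is easy to check numerically. The following minimal NumPy sketch (toy dimensions, synthetic data; not the paper's code) constructs pruned activations that are exactly affine in the retained ones, solves the ridge system, and verifies that the folded weights reproduce the original output:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_s, d_p, d_out = 512, 8, 4, 6

# Retained activations x_S, plus pruned activations x_P that are exactly
# affine in x_S, so the compensation should be (numerically) exact.
x_s = rng.normal(size=(n, d_s))
B_true = rng.normal(size=(d_p, d_s))
c_true = rng.normal(size=d_p)
x_p = x_s @ B_true.T + c_true

W_s = rng.normal(size=(d_out, d_s))
W_p = rng.normal(size=(d_out, d_p))
b = rng.normal(size=d_out)

# Closed-form ridge solution: B = Sigma_PS (Sigma_SS + lam I)^-1, c = mu_P - B mu_S
lam = 1e-6
mu_s, mu_p = x_s.mean(0), x_p.mean(0)
xs_c, xp_c = x_s - mu_s, x_p - mu_p
Sigma_ss = xs_c.T @ xs_c / n
Sigma_ps = xp_c.T @ xs_c / n
B = Sigma_ps @ np.linalg.inv(Sigma_ss + lam * np.eye(d_s))
c = mu_p - B @ mu_s

# Fold the compensation into the retained weights and bias
W_hat = W_s + W_p @ B
b_hat = b + W_p @ c

y_full = x_s @ W_s.T + x_p @ W_p.T + b   # original output path
y_comp = x_s @ W_hat.T + b_hat           # pruned + compensated path
print(np.max(np.abs(y_full - y_comp)))   # near zero
```

Note that the compensated path touches only the retained columns, so the inference-time cost is exactly that of the pruned model.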

For attention, only the query and key projections are pruned (the value projection is kept to preserve downstream dimensions). The missing logit term $Q_P K_P^\top$ is approximated by $Q_S M K_S^\top$, where $M$ satisfies the regularized Sylvester equation $(Q_S^\top Q_S)\, M\, (K_S^\top K_S) + \lambda M = (Q_S^\top Q_P)(K_P^\top K_S)$. Factorizing $I + M = U \Sigma V^\top$ by SVD and splitting the singular values symmetrically yields compensated projections $\hat{W}_{Q,S} = W_{Q,S}\, U \Sigma^{1/2}$ and $\hat{W}_{K,S} = W_{K,S}\, V \Sigma^{1/2}$, so that $\hat{Q}_S \hat{K}_S^\top = Q_S (I + M) K_S^\top$. Again, no extra computation is required during inference.
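A sketch of this step, again with toy sizes and synthetic $Q$, $K$ (an illustration, not the paper's implementation): because the per-head dimension is small, the regularized Sylvester system can be solved directly by Kronecker vectorization, and the SVD split then folds the correction into the retained projections.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_s, d_p = 64, 6, 3        # tokens, retained / pruned per-head dims (toy sizes)
Q = rng.normal(size=(n, d_s + d_p))
K = rng.normal(size=(n, d_s + d_p))
Q_s, Q_p = Q[:, :d_s], Q[:, d_s:]
K_s, K_p = K[:, :d_s], K[:, d_s:]

# Regularized Sylvester system:
# (Q_s^T Q_s) M (K_s^T K_s) + lam M = (Q_s^T Q_p)(K_p^T K_s)
lam = 1e-3
A = Q_s.T @ Q_s
Bk = K_s.T @ K_s
C = (Q_s.T @ Q_p) @ (K_p.T @ K_s)

# Small system, so vectorize: vec(A M Bk) = (Bk^T kron A) vec(M), column-major vec
lhs = np.kron(Bk.T, A) + lam * np.eye(d_s * d_s)
M = np.linalg.solve(lhs, C.reshape(-1, order="F")).reshape(d_s, d_s, order="F")

# Factor I + M by SVD and split the singular values between the Q and K sides
U, s, Vt = np.linalg.svd(np.eye(d_s) + M)
Fq = U * np.sqrt(s)           # would right-multiply W_{Q,S} in a real model
Fk = Vt.T * np.sqrt(s)        # would right-multiply W_{K,S}

logits_full = Q @ K.T                    # logits with all head dimensions
logits_naive = Q_s @ K_s.T               # pruned, no compensation
logits_comp = (Q_s @ Fq) @ (K_s @ Fk).T  # pruned with folded compensation
err_naive = np.linalg.norm(logits_full - logits_naive)
err_comp = np.linalg.norm(logits_full - logits_comp)
print(err_naive, err_comp)
```

Since $M = 0$ is always feasible for the regularized objective, the compensated logit error is never worse than uncompensated pruning, and is strictly smaller whenever the pruned and retained dimensions are correlated.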

The entire CORP pipeline consists of: (1) collecting first‑ and second‑order activation statistics from a few thousand unlabeled images; (2) ranking channels or head dimensions by simple importance scores (e.g., variance for MLP channels, expected logit energy for attention dimensions); (3) pruning the lowest‑scoring structures; (4) computing the closed‑form compensation parameters; and (5) folding these parameters back into the model weights. The computational overhead is modest: forming covariance matrices costs $O(N d^2)$ and solving small linear systems costs $O(|S|^3)$ per layer; for attention, the dominant cost is solving a Sylvester equation of size $d'_h \times d'_h$. In practice, pruning a full DeiT‑Huge model on a single GPU finishes in under 20 minutes.
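The five steps above can be sketched end-to-end for a single MLP block. The function below is a simplified NumPy illustration (the name `corp_prune_mlp`, the ReLU default, and plain variance scoring are assumptions for this sketch, not the paper's exact implementation):

```python
import numpy as np

def corp_prune_mlp(W1, b1, W2, b2, x_cal, sparsity=0.5, lam=1e-4, act=None):
    """CORP-style one-shot pruning of an MLP block y = W2 act(W1 x + b1) + b2.

    Simplified sketch: variance-based importance, ridge compensation.
    Returns the pruned first layer and the compensated second layer.
    """
    act = act or (lambda z: np.maximum(z, 0.0))
    h = act(x_cal @ W1.T + b1)                  # (1) calibration activations
    n_keep = int(round(h.shape[1] * (1.0 - sparsity)))
    order = np.argsort(h.var(axis=0))           # (2) importance = variance
    P, S = order[:-n_keep], order[-n_keep:]     # (3) prune the low scorers

    mu = h.mean(axis=0)                         # first-order statistics
    hc = h - mu
    Sigma_ss = hc[:, S].T @ hc[:, S] / len(h)   # second-order statistics
    Sigma_ps = hc[:, P].T @ hc[:, S] / len(h)
    B = Sigma_ps @ np.linalg.inv(Sigma_ss + lam * np.eye(n_keep))  # (4) ridge
    c = mu[P] - B @ mu[S]

    W2_hat = W2[:, S] + W2[:, P] @ B            # (5) fold into the weights
    b2_hat = b2 + W2[:, P] @ c
    return W1[S], b1[S], W2_hat, b2_hat
```

A call like `corp_prune_mlp(W1, b1, W2, b2, x_cal, sparsity=0.5)` returns drop-in replacement weights for a block with half the hidden dimensions; nothing else in the network has to change.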

Experiments on ImageNet‑1k with DeiT‑Tiny, Small, Base, Large, and Huge models demonstrate the effectiveness of CORP. Without compensation, 50% structured sparsity (pruning half of both MLP hidden dimensions and attention head dimensions) reduces Top‑1 accuracy dramatically (e.g., DeiT‑Tiny collapses to roughly 36%). With CORP, the same sparsity retains high accuracy: DeiT‑Huge keeps 82.8% Top‑1 (vs. 85% unpruned), and DeiT‑Base retains 81.7%, matching its unpruned accuracy. FLOPs are roughly halved, and real‑world throughput improves by 2–3× on both GPU and CPU hardware.

The paper’s contributions are: (1) identifying representation error as the bottleneck for one‑shot structured pruning; (2) reformulating pruning as a closed‑form representation recovery problem; (3) providing ridge‑regression‑based compensation for MLPs and Sylvester‑equation‑based compensation for attention; (4) demonstrating that, under strict post‑training constraints, the quality of the compensation dominates over the choice of importance ranking.

Limitations include dependence on a representative calibration set; if the calibration data does not capture the full data distribution, the learned affine predictors may generalize poorly. Moreover, the current design only prunes query and key projections, leaving value projections untouched; extending compensation to value pruning would require additional derivations.

In summary, CORP offers a practical, label‑free, one‑shot structured pruning solution for Vision Transformers, achieving aggressive sparsity while preserving accuracy and delivering tangible inference speedups, thereby advancing the state of post‑training model compression.

