GHOST: Unmasking Phantom States in Mamba2 via Grouped Hidden-state Output-aware Selection & Truncation
While Mamba2’s expanded state dimension enhances temporal modeling, it incurs substantial inference overhead: the larger recurrent state saturates memory bandwidth during autoregressive generation. Standard pruning methods fail to address this bottleneck: unstructured sparsity leaves activations dense, magnitude-based selection ignores runtime dynamics, and gradient-based methods impose prohibitive costs. We introduce GHOST (Grouped Hidden-state Output-aware Selection and Truncation), a structured pruning framework that approximates control-theoretic balanced truncation using only forward-pass statistics. By jointly measuring controllability and observability, GHOST rivals the fidelity of gradient-based methods without requiring backpropagation. On models ranging from 130M to 2.7B parameters, our approach achieves a 50% state-dimension reduction with roughly a 1-point perplexity increase on WikiText-2. Code is available at https://anonymous.4open.science/r/mamba2_ghost-7BCB/.
💡 Research Summary
The paper addresses a critical bottleneck in the recently introduced Mamba2 architecture: the expansion of the state dimension from 16 to 128 dramatically improves temporal modeling but also inflates the recurrent state memory from roughly 12 MB to 100 MB for a 1.3 B‑parameter model. This increase saturates memory bandwidth during autoregressive generation, making inference prohibitively slow despite the model’s strong performance. Existing compression techniques—unstructured sparsity, magnitude‑based pruning, and gradient‑based structured pruning—are inadequate. Unstructured sparsity leaves activations dense, magnitude pruning relies on static weight norms that do not correlate with actual runtime usage (leading to the loss of “phantom states” and retention of inert “corporeal states”), and gradient‑based methods demand excessive GPU memory (≈45 GB for a 1.3 B model) and suffer from distribution shift when applied layer‑by‑layer.
To overcome these limitations, the authors propose GHOST (Grouped Hidden‑state Output‑aware Selection and Truncation), a structured pruning framework that approximates control‑theoretic balanced truncation using only forward‑pass statistics. The key insight is to treat each state channel as a controllability‑observability pair: controllability is estimated by the empirical covariance of the hidden state H_t (i.e., how much the input history excites the channel), while observability is approximated by the Hessian of the output energy with respect to the state, which reduces to the squared norm of the projection matrix C′. The product of these two diagonal terms yields a saliency score S_i that mirrors the Hankel singular values used in classic balanced truncation.
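The saliency score described above can be sketched in a few lines of NumPy. This is an illustrative approximation under stated assumptions, not the paper's implementation: `H` stands for hidden-state samples collected over the calibration set and `C` for the output projection; the diagonal covariance serves as the controllability proxy and the squared column norms of `C` as the observability proxy.

```python
import numpy as np

def ghost_saliency(H, C):
    """Per-channel saliency S_i as a controllability-observability product.

    A sketch of the idea in the summary (names and shapes are assumptions):
    H : (T, N) hidden-state samples over a calibration set
    C : (P, N) output projection matrix
    """
    # Controllability proxy: diagonal of the empirical state covariance,
    # i.e. how strongly the input history excites each channel.
    controllability = np.mean(H ** 2, axis=0)   # shape (N,)
    # Observability proxy: squared column norms of C, i.e. how much each
    # channel contributes to the output energy (the Hessian diagonal
    # mentioned in the summary reduces to this).
    observability = np.sum(C ** 2, axis=0)      # shape (N,)
    # The product mirrors the Hankel singular values of balanced truncation.
    return controllability * observability      # shape (N,)
```

A channel that is never excited by the inputs (zero covariance) or never read out (zero column in `C`) receives a zero score and becomes a pruning candidate, which is exactly the balanced-truncation intuition.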
Because Mamba2 employs Grouped Query Attention (GQA), the method aggregates scores across heads that share dynamics parameters. For each layer, scores from all G groups (each containing K heads) are pooled, sorted, and a global threshold τ_j is set to achieve a target sparsity κ. Channels with scores below τ_j are masked out, zeroing the corresponding columns in the B and C projection matrices as well as any associated convolution filters. Pruning proceeds sequentially layer‑by‑layer, with each pruned layer’s activations used to calibrate the next, thereby mitigating distribution shift without any back‑propagation.
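The pooling-and-threshold step can be sketched as follows. Assuming per-group score arrays of shape `(G, N)` (an illustrative layout, not the paper's code), a single global threshold τ over the pooled scores yields the target sparsity κ, and the resulting boolean mask marks which columns of the B and C projections to zero:

```python
import numpy as np

def ghost_mask(scores, kappa):
    """Global-threshold channel mask for one layer (a sketch).

    scores : (G, N) saliency scores pooled over the K heads of each group
    kappa  : target sparsity in [0, 1)

    Returns a boolean keep-mask of the same shape: channels with scores
    below the global threshold tau are pruned (their B/C columns zeroed).
    """
    flat = np.sort(scores.ravel())
    n_prune = int(kappa * flat.size)     # number of channels to drop
    tau = flat[n_prune]                  # global threshold achieving kappa
    return scores >= tau
```

Because the threshold is set over all G groups jointly rather than per group, sparsity can concentrate in groups whose channels are uniformly low-saliency, which matches the summary's description of a single per-layer τ_j.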
Algorithmically, GHOST requires only two forward passes over a small calibration set (128 WikiText‑2 samples in the experiments) per layer, yielding a time complexity comparable to a normal inference pass (O(|D_cal|·L·G·K·P·N)) and a space complexity of O(G·N). No gradients, Hessians, or large covariance matrices are stored, making it feasible on a single H100 GPU with 80 GB memory.
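The sequential calibration loop above can be illustrated end-to-end with a toy model. The `ToyLayer` class below is a deliberately simplified stand-in (a plain linear map, not a Mamba2 block), but the control flow mirrors the summary: two forward passes per layer, a saliency score from forward statistics only, in-place zeroing of pruned channels, and the pruned activations feeding the next layer's calibration.

```python
import numpy as np

class ToyLayer:
    """Stand-in for a Mamba2 block: a linear map whose output columns play
    the role of state channels (purely illustrative)."""
    def __init__(self, rng, d_in, d_out):
        self.W = rng.standard_normal((d_in, d_out))

    def forward(self, x):
        return x @ self.W

def prune_sequentially(layers, calib, kappa):
    """Layer-by-layer pruning with propagated calibration (a sketch).

    No gradients anywhere: each layer needs one forward pass to score its
    channels and a second forward pass, after masking, to produce the
    calibration activations for the next layer.
    """
    x = calib
    for layer in layers:
        y = layer.forward(x)                 # pass 1: collect statistics
        scores = np.mean(y ** 2, axis=0)     # output-energy saliency proxy
        tau = np.quantile(scores, kappa)     # threshold for target sparsity
        layer.W[:, scores < tau] = 0.0       # zero pruned channels in place
        x = layer.forward(x)                 # pass 2: pruned activations
    return x
```

Scoring each layer on the *pruned* outputs of its predecessors is what mitigates the distribution shift that the summary attributes to naive layer-by-layer gradient pruning.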
Empirical results span models from 130 M to 2.7 B parameters. At 50 % state‑dimension reduction, GHOST incurs roughly a 1‑point increase in perplexity on WikiText‑2 while cutting the recurrent state memory by half, directly reducing bandwidth consumption. Compared against magnitude pruning, random pruning, and Taylor (gradient‑based) pruning, GHOST matches or exceeds the latter’s accuracy while using a fraction of the computational and memory budget. Additional experiments demonstrate robustness across varying sequence lengths, out‑of‑distribution datasets (e.g., PTB, enwik8), and downstream tasks evaluated via the EleutherAI LM Evaluation Harness.
In summary, GHOST offers a practical, data‑driven approach to structured pruning of large state‑space models, achieving balanced‑truncation‑like fidelity without the need for expensive gradient calculations. It effectively identifies and preserves high‑activity “phantom” states while discarding low‑utility “corporeal” states, delivering substantial inference speed‑ups and bandwidth savings, thereby democratizing deployment of large‑scale SSMs such as Mamba2. The code and reproducibility details are publicly released.