PASER: Post-Training Data Selection for Efficient Pruned Large Language Model Recovery
Model pruning is an effective approach for compressing large language models (LLMs). However, this process often leads to significant degradation of model capabilities. While post-training techniques such as instruction tuning are commonly employed to recover model performance, existing methods often overlook the uneven deterioration of model capabilities and incur high computational costs. Moreover, some irrelevant instructions may even hinder capability recovery. To address these challenges, we propose the \textbf{P}ost-training d\textbf{A}ta \textbf{S}election method for \textbf{E}fficient pruned large language model \textbf{R}ecovery (\textbf{PASER}). PASER aims to identify, within a given data budget, the instructions that recover the most severely compromised model capabilities. Our approach first applies manifold learning and spectral clustering to group recovery instructions in the semantic space, revealing capability-specific instruction sets. The data budget is then adaptively allocated across clusters according to the degree of degradation of the corresponding model capability. Within each cluster, we prioritize the data samples on which the pruned model's performance declines most. To mitigate potential negative tuning effects, we also detect and filter out conflicting or irrelevant recovery data. Extensive experiments demonstrate that PASER significantly outperforms conventional baselines, effectively recovering the general capabilities of pruned LLMs while using merely 4%-20% of the original post-training data. We provide the code repository in \href{https://github.com/BokwaiHo/PASER}{Link}.
💡 Research Summary
The paper introduces PASER, a post‑training data selection framework designed to efficiently recover the capabilities of large language models (LLMs) that have been compressed through pruning. While pruning dramatically reduces model size, it inevitably degrades performance, and the degradation is often uneven across different functional abilities (e.g., reasoning, mathematics, code generation). Existing recovery methods typically fine‑tune the pruned model on the full instruction‑tuning dataset, which incurs high computational cost and may still fail to restore the most damaged capabilities.
PASER addresses these issues through three tightly coupled components. First, it clusters the entire instruction set into capability‑specific groups. Each instruction is embedded with Sentence‑BERT, then a diffusion kernel is applied to capture non‑linear relationships and produce a low‑dimensional manifold representation. Non‑negative‑matrix‑factorization (NMF) based spectral clustering is performed on these representations, yielding K clusters that correspond to distinct model abilities.
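The clustering stage can be sketched as follows. This is a minimal illustration, not the paper's implementation: random vectors stand in for Sentence-BERT embeddings, the diffusion kernel is approximated by a two-step row-normalized Gaussian affinity, and scikit-learn's NMF assigns each instruction to its dominant factor.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Stand-in for Sentence-BERT embeddings of N recovery instructions
# (in PASER these would come from a sentence-transformers model).
N, d, K = 60, 16, 3
embeddings = rng.normal(size=(N, d))

# Gaussian affinity over pairwise distances, raised to a diffusion step,
# as a crude approximation of the diffusion-kernel manifold representation.
sq_dists = ((embeddings[:, None, :] - embeddings[None, :, :]) ** 2).sum(-1)
sigma = np.median(sq_dists)
A = np.exp(-sq_dists / sigma)
P = A / A.sum(axis=1, keepdims=True)  # row-normalized transition matrix
P_t = np.linalg.matrix_power(P, 2)    # two-step diffusion

# NMF-based spectral clustering: factorize the non-negative diffusion
# matrix and assign each instruction to its dominant latent factor.
W = NMF(n_components=K, init="nndsvda", random_state=0,
        max_iter=500).fit_transform(P_t)
labels = W.argmax(axis=1)  # one cluster id per instruction
```

Each of the K factors then corresponds to one capability-specific instruction group.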
Second, PASER quantifies how much each capability has deteriorated after pruning. For every cluster, it computes a Capability Degradation Score (CDS) by measuring the Jensen‑Shannon Divergence (JSD) between the token‑level output distributions of the original model and the pruned model. JSD’s symmetry, bounded range, and robustness to outliers make it a reliable indicator of subtle performance loss. The overall data budget (typically 4%–20% of the original instruction set) is then allocated proportionally to the CDS of each cluster, ensuring that the most severely damaged abilities receive more training examples. Within each cluster, PASER prioritizes samples that cause the greatest loss in the pruned model’s predictions, while also accounting for the computational cost of each sample (e.g., token length) to maximize training efficiency.
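The CDS computation and proportional budget allocation can be illustrated with a small sketch. The distributions below are toy placeholders; in PASER the JSD is computed over token-level output distributions of the original and pruned models, aggregated per cluster.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float((a * np.log(a / b)).sum())
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical per-cluster output distributions (original vs. pruned model).
original = [np.array([0.7, 0.2, 0.1])] * 3
pruned = [np.array([0.6, 0.3, 0.1]),   # mild degradation
          np.array([0.2, 0.5, 0.3]),   # severe degradation
          np.array([0.7, 0.2, 0.1])]   # no degradation

# Capability Degradation Score per cluster, then proportional allocation
# of a fixed sample budget across clusters.
cds = np.array([js_divergence(o, p) for o, p in zip(original, pruned)])
budget = 100
alloc = np.floor(budget * cds / cds.sum()).astype(int)
```

Clusters whose output distribution shifted most after pruning receive the largest share of the recovery data budget; an undamaged cluster receives essentially none.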
Third, PASER mitigates negative tuning effects by constructing a concept‑consistency graph among the selected samples. Nodes represent instructions, and edges encode semantic similarity and alignment with the target capability. Graph analysis identifies and removes conflicting or irrelevant instructions that could confuse the model during fine‑tuning, thereby reducing the risk of performance degradation caused by noisy data.
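One simple way to realize such filtering — a hedged sketch rather than the paper's exact graph construction — is to build a cosine-similarity graph over the selected samples, score each node by its mean similarity to the rest, and drop statistical outliers as conflicting data.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in embeddings for 20 selected samples in one capability cluster;
# one deliberately conflicting sample is injected at index 7.
emb = rng.normal(loc=1.0, size=(20, 8))
emb[7] = rng.normal(loc=-5.0, size=8)  # semantically opposed outlier

# Cosine-similarity graph over the selected samples.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sim = unit @ unit.T
np.fill_diagonal(sim, 0.0)

# Consistency score: mean edge weight to all other nodes. Samples far
# below the cluster consensus are treated as conflicting and removed.
consistency = sim.sum(axis=1) / (len(emb) - 1)
threshold = consistency.mean() - 2 * consistency.std()
keep = np.where(consistency >= threshold)[0]
```

The injected outlier ends up with strongly negative similarity to the rest of the cluster and falls below the threshold, so it is excluded from fine-tuning.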
The authors provide a theoretical analysis of PASER’s time complexity (approximately O(N log N)) and derive an error bound for the sampling process, offering guarantees on both efficiency and effectiveness. Empirical evaluation spans a diverse set of LLMs—including LLaMA 2/3/3.1, Baichuan 2, Qwen 2.5/3, and Mixtral 8×7B—under various pruning schemes (unstructured, semi‑structured, and structured). Compared with baselines that use the full dataset, random subsets, or existing data‑selection methods (e.g., LESS, IFD), PASER consistently achieves higher recovery performance while using a fraction of the data. Notably, for high‑difficulty tasks such as mathematical reasoning and code generation, PASER restores performance to within a few percentage points of the original unpruned model, whereas baselines lag significantly. Training time is reduced by 30%–45%, and memory consumption is also lowered due to the smaller dataset.
In summary, PASER offers a principled, capability‑aware approach to post‑pruning recovery: it clusters instructions by semantic structure, allocates data budget based on measured degradation, selects the most impactful samples, and filters out harmful data. This results in a cost‑effective recovery pipeline that can be readily applied in production settings where pruned LLMs need to be redeployed with minimal loss of functionality.