Influence-Preserving Proxies for Gradient-Based Data Selection in LLM Fine-tuning
Supervised fine-tuning (SFT) relies critically on selecting training data that most benefits a model’s downstream performance. Gradient-based data selection methods such as TracIn and Influence Functions leverage influence to identify useful samples, but their computational cost scales poorly, making them impractical for multi-billion-parameter large language models (LLMs). A common alternative is to use off-the-shelf smaller models as proxies, but they remain suboptimal since their learning dynamics are unclear, their sizes cannot be flexibly adjusted, and they cannot be further aligned with the target model in terms of gradient-based influence estimation. To address these challenges, we introduce Iprox, a two-stage framework that derives influence-preserving proxies directly from the target model. It first applies a low-rank compression stage to preserve influence information of the target model, and then an aligning stage to align both model gradients and logits, thereby constructing proxies that flexibly control computational cost while retaining the target model’s influence. Experimental results across diverse LLM families and evaluation tasks show that Iprox consistently outperforms off-the-shelf proxies and baseline methods. On Qwen3-4B, a 1.5B proxy constructed with Iprox achieves stronger performance than the larger 1.7B off-the-shelf proxy. Notably, on Llama3.2, Iprox achieves better performance than baselines while reducing computational cost by more than half relative to the full 3B model. These results show that Iprox provides effective influence-preserving proxies, making gradient-based data selection more scalable for LLMs.
💡 Research Summary
Fine‑tuning large language models (LLMs) hinges on the quality of the training data that is fed into the model. Gradient‑based data selection methods such as TracIn and Influence Functions quantify the “influence” of each training example on a validation set by using gradients (and, for Influence Functions, an inverse‑Hessian term). While effective, these methods require storing many checkpoints, repeated back‑propagation, or costly Hessian‑vector products, making them infeasible for models with billions of parameters. A common workaround is to use a smaller, off‑the‑shelf model as a proxy to estimate influence scores, but this approach suffers from three major drawbacks: (1) the learning dynamics of the proxy are not guaranteed to match those of the target model; (2) only a handful of fixed‑size proxies are available, limiting flexibility in budgeting computational resources; and (3) there is no systematic way to align the proxy’s gradients with those of the target model, leading to inaccurate influence estimates.
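The gradient dot product at the heart of TracIn-style influence can be sketched in a few lines. The toy example below uses a simple squared-error model (all names and the loss are illustrative, not the paper's setup): at a single checkpoint, the influence of a training example on a validation example is the inner product of their per-example loss gradients.

```python
import numpy as np

def per_example_grad(w, x, y):
    """Gradient of the squared loss 0.5 * (w @ x - y)**2 with respect to w."""
    return (w @ x - y) * x

def tracin_influence(w, x_train, y_train, x_val, y_val):
    """TracIn-style influence at a single checkpoint: the dot product
    between the training-example and validation-example gradients."""
    return float(per_example_grad(w, x_train, y_train)
                 @ per_example_grad(w, x_val, y_val))

# Toy 4-dimensional model; real LLMs have billions of parameters.
rng = np.random.default_rng(0)
w = rng.normal(size=4)
x_tr, x_va = rng.normal(size=4), rng.normal(size=4)
score = tracin_influence(w, x_tr, 1.0, x_va, 0.5)
```

For an LLM, `w` has billions of entries and this dot product must be formed for every training/validation pair (summed over checkpoints, in full TracIn), which is precisely the cost a smaller proxy is meant to avoid.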
The paper introduces Iprox, a two‑stage framework that constructs an influence‑preserving proxy directly from the target LLM. The key idea is to compress the target model in a way that deliberately retains the components most relevant for gradient‑based influence, and then to fine‑tune the compressed model so that its gradients (and logits) align with those of the original. This yields a proxy whose size can be freely chosen (by adjusting the low‑rank dimension) while still faithfully reproducing the target’s influence landscape.
Stage 1 – Influence‑Preserving SVD (IPSVD).
Standard singular value decomposition (SVD) minimizes reconstruction error but ignores how the decomposition affects influence scores. Iprox instead re‑weights the SVD objective using second‑moment statistics of hidden activations (C_h) and back‑propagated gradients (C_δ) for each layer. By minimizing the expected squared “directional effect” of the low‑rank perturbation on the loss, the authors derive a weighted Frobenius norm objective (Equation 5). Proposition 4.1 shows that the expected change in pairwise influence is bounded by a constant times this weighted norm, providing a theoretical guarantee that minimizing the IPSVD objective preserves influence. Empirically, IPSVD retains far higher Spearman correlation with oracle influence scores than vanilla SVD at the same compression ratio.
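A two‑sided weighted Frobenius objective of this form admits a closed‑form solution when the weight matrices are symmetric positive definite: whiten the layer weights by the square roots of C_δ and C_h, truncate the ordinary SVD, and un‑whiten. The sketch below illustrates this standard construction, not the paper's exact Equation 5, which may differ in details such as how C_h and C_δ are estimated:

```python
import numpy as np

def sym_sqrt(C):
    """Symmetric square root of an SPD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(C)
    return (vecs * np.sqrt(vals)) @ vecs.T

def ipsvd_sketch(W, C_h, C_d, r):
    """Rank-r minimizer of ||S_d (W - W_r) S_h||_F, with S_h = C_h^{1/2}
    and S_d = C_d^{1/2}: whiten, truncate the SVD, un-whiten."""
    S_h, S_d = sym_sqrt(C_h), sym_sqrt(C_d)
    U, s, Vt = np.linalg.svd(S_d @ W @ S_h, full_matrices=False)
    A_r = (U[:, :r] * s[:r]) @ Vt[:r]          # rank-r truncation in whitened space
    return np.linalg.inv(S_d) @ A_r @ np.linalg.inv(S_h)

# Toy layer: W maps d_in=6 activations to d_out=5 outputs.
rng = np.random.default_rng(1)
W = rng.normal(size=(5, 6))
C_h = (lambda M: M @ M.T + np.eye(6))(rng.normal(size=(6, 6)))  # activation 2nd moments
C_d = (lambda M: M @ M.T + np.eye(5))(rng.normal(size=(5, 5)))  # gradient 2nd moments
W_r = ipsvd_sketch(W, C_h, C_d, r=2)
```

By construction, `W_r` never does worse than plain truncated SVD in the weighted norm, even though it is generally worse in the unweighted reconstruction error, which is exactly the trade Iprox makes in favor of influence preservation.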
Stage 2 – Gradient Alignment & Logit Anchoring.
After IPSVD produces an initial low‑rank proxy, the second stage refines it by aligning gradients in the low‑rank space. For a set of sampled training and validation examples, the proxy’s gradients are forced to match those of the full model via an L2 loss. Simultaneously, a KL‑divergence term keeps the proxy’s output logits close to the target’s, preventing drift in the predictive distribution. The combined loss L = λ₁ L_grad + λ₂ L_logit is optimized, yielding a model that is not only smaller but also mirrors the target’s learning dynamics.
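In NumPy, the combined objective might look as follows. The λ values, the mean reductions, and the direction of the KL term are illustrative assumptions; the paper may weight or normalize the terms differently:

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def alignment_loss(g_proxy, g_target, logits_proxy, logits_target,
                   lam1=1.0, lam2=0.1):
    """L = lam1 * L_grad + lam2 * L_logit.
    L_grad: L2 matching of (low-rank-space) gradients.
    L_logit: KL(target || proxy) over the output distribution."""
    l_grad = np.mean((g_proxy - g_target) ** 2)
    p = np.exp(log_softmax(logits_target))
    l_logit = np.mean(np.sum(
        p * (log_softmax(logits_target) - log_softmax(logits_proxy)), axis=-1))
    return lam1 * l_grad + lam2 * l_logit

# Toy gradients and a batch of 3 examples over a 10-way output.
rng = np.random.default_rng(2)
g_t, g_p = rng.normal(size=64), rng.normal(size=64)
z_t, z_p = rng.normal(size=(3, 10)), rng.normal(size=(3, 10))
loss = alignment_loss(g_p, g_t, z_p, z_t)
```

Both terms are nonnegative and vanish when the proxy exactly matches the target, so the loss directly measures how far the proxy's learning signal has drifted.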
Experiments.
The authors evaluate Iprox on several recent LLM families, including Qwen3‑4B and Llama 3.2, across a variety of downstream tasks (text classification, summarization, code generation, etc.). They compare against: (a) off‑the‑shelf proxies of comparable or larger size; (b) naïve low‑rank SVD compressions; and (c) the full‑size target model used directly for influence estimation. Results show that a 1.5 B Iprox proxy for Qwen3‑4B outperforms a 1.7 B off‑the‑shelf proxy, achieving a 2.3 percentage‑point gain in validation accuracy. For Llama 3.2, Iprox reduces computational overhead by more than 50 % relative to the full 3 B model while delivering equal or better performance. Moreover, when plugged into different influence‑based selection algorithms (TracIn, Influence Functions), Iprox consistently yields data subsets whose fine‑tuned performance matches that obtained using the original model’s influence scores.
Strengths and Contributions.
- Principled Compression: IPSVD explicitly optimizes for influence preservation rather than mere reconstruction, backed by a theoretical bound.
- Flexible Proxy Size: The rank r of the low‑rank factorization can be chosen to meet any compute budget, unlike fixed‑size off‑the‑shelf models.
- Gradient‑Level Alignment: The second stage ensures that the proxy’s gradients (the core of influence methods) are tightly aligned with the target, dramatically improving selection quality.
- Broad Applicability: Iprox can be used with any gradient‑based influence estimator, making it a drop‑in replacement that cuts cost without sacrificing accuracy.
Limitations and Future Directions.
- Estimating the second‑moment matrices C_h and C_δ requires sampling a substantial amount of data, which adds an upfront cost.
- The alignment stage involves additional fine‑tuning of the proxy, so the overall pipeline is not instantaneous.
- The current formulation focuses on linear low‑rank factorization; extending the approach to more complex architectures (Mixture‑of‑Experts, gating networks) remains open.
- Future work could explore meta‑learning or online adaptation to further reduce proxy‑construction overhead, or integrate quantization alongside low‑rank compression for even greater efficiency.
Conclusion.
Iprox demonstrates that it is possible to create a small, computationally cheap proxy that faithfully reproduces the gradient‑based influence landscape of a massive LLM. By coupling influence‑preserving low‑rank compression with gradient and logit alignment, the framework achieves superior data‑selection performance compared to traditional off‑the‑shelf proxies while halving the computational burden. This work paves the way for scalable, influence‑driven data selection in the era of ever‑larger language models.