Gradient-Sign Masking for Task Vector Transport Across Pre-Trained Models
When a new release of a foundation model is published, practitioners typically need to repeat fine-tuning, even if the same task was already tackled in the previous version. A promising alternative is to reuse the parameter changes (i.e., task vectors) that capture how a model adapts to a specific task. However, these vectors often fail to transfer across different pre-trained models because their parameter spaces are misaligned. In this work, we show that successful transfer depends strongly on the gradient-sign structure of the new model. Based on this insight, we propose GradFix, which approximates the ideal sign structure and leverages it to transfer knowledge using only a handful of labeled samples. Notably, this requires no additional fine-tuning: we only compute a few target-model gradients without parameter updates and mask the source task vector accordingly. This yields an update that is locally aligned with the target loss landscape, effectively rebasing the task vector onto the new pre-training. We provide a theoretical guarantee that our method ensures first-order descent. Empirically, we demonstrate significant performance gains on vision and language benchmarks, consistently outperforming naive task vector addition and few-shot fine-tuning. We further show that transporting task vectors improves multi-task and multi-source model merging. Code is available at https://github.com/fillo-rinaldi/GradFix.
💡 Research Summary
The paper tackles the costly redundancy that arises when a new version of a foundation model is released and practitioners must re‑fine‑tune it for the same downstream tasks. While "task vectors" – the parameter differences between a fine‑tuned model and its base – have been proposed as a reusable artifact, they typically fail to transfer across different pre‑trained checkpoints because the underlying parameter spaces are misaligned. The authors identify the key obstacle as the mismatch between the source update and the local loss geometry of the target model. They observe that the sign of the target model's gradient provides a robust, inexpensive per‑coordinate proxy for the descent direction of its loss surface.
Building on this insight, they introduce GradFix, a method that transports a source task vector τ_A to a target pre‑trained model B by masking it with the sign of the target's gradient. Concretely, they compute a few gradients g = ∇_θ L(θ_B) on a small labeled set (or even a single batch), form a binary mask m_i = 1{sign(τ_{A,i}) = sign(−g_i)}, and apply the masked update δ_A = α·(m ⊙ τ_A) directly to θ_B, yielding θ_B^trans = θ_B + δ_A. No parameter updates or full fine‑tuning are performed.
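The masking step can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the authors' implementation; the function name `gradfix_transfer` and its arguments are hypothetical, and real usage would operate on flattened network parameters with a gradient estimated from a few labeled samples.

```python
import numpy as np

def gradfix_transfer(theta_b, tau_a, grad_b, alpha=1.0):
    """Transport a source task vector onto a target model via gradient-sign masking.

    theta_b : target pre-trained parameters (flattened)
    tau_a   : source task vector (fine-tuned minus base, from model A)
    grad_b  : gradient of the target loss at theta_b, estimated on a few samples
    alpha   : scaling factor for the masked update
    """
    # Keep only coordinates where the task vector points along the
    # descent direction -grad_b of the target loss.
    mask = (np.sign(tau_a) == np.sign(-grad_b)).astype(tau_a.dtype)
    delta = alpha * mask * tau_a
    return theta_b + delta, mask

# Toy usage with random stand-in vectors
rng = np.random.default_rng(0)
theta_b = rng.normal(size=8)
tau_a = rng.normal(size=8)
grad_b = rng.normal(size=8)
theta_trans, mask = gradfix_transfer(theta_b, tau_a, grad_b, alpha=0.5)
```

Note that masked-out coordinates of θ_B are left untouched, and every retained coordinate of τ_A agrees in sign with −g by construction.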
The authors provide a rigorous first‑order analysis: expanding L(θ_B + δ_A) ≈ L(θ_B) + gᵀδ_A, they show that each retained coordinate contributes a non‑positive term −α|g_i||τ_{A,i}|, guaranteeing that for sufficiently small α the update is a descent direction. They also prove a concentration lemma for the majority‑vote estimator of the gradient sign when only N ≪ |D| samples are available, showing that the probability of sign mismatch decays exponentially with N. Hence, even a handful of examples suffices to obtain a reliable mask.
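Both properties can be checked numerically on a toy quadratic loss. The setup below is an illustrative assumption (the real method acts on network parameters, not a synthetic objective): the exact gradient sign is used for the mask, a majority vote over noisy per-sample gradients stands in for the few-shot estimator, and the first-order term g·δ_A is verified to equal −α·Σ m_i|g_i||τ_{A,i}|.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy target loss L(theta) = 0.5 * ||theta - theta_star||^2, whose exact
# gradient at theta is simply theta - theta_star.
theta_star = rng.normal(size=16)
def loss(th):
    return 0.5 * np.sum((th - theta_star) ** 2)

theta_b = rng.normal(size=16)   # stand-in for the target pre-training
tau_a = rng.normal(size=16)     # stand-in for a source task vector
g = theta_b - theta_star        # exact gradient at theta_b

# Majority-vote sign estimate from N noisy per-sample gradients.
N = 25
samples = g[None, :] + rng.normal(scale=0.5, size=(N, g.size))
g_sign_hat = np.sign(np.sign(samples).sum(axis=0))

# Mask with the exact gradient sign and take a small masked step.
mask = (np.sign(tau_a) == np.sign(-g)).astype(float)
alpha = 1e-2
delta = alpha * mask * tau_a

# First-order term: each kept coordinate contributes -alpha * |g_i| * |tau_i|.
first_order = g @ delta
```

For this quadratic, a small α step strictly lowers the loss, and the majority vote recovers most coordinate signs even from noisy estimates.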
Empirically, GradFix is evaluated on vision (CIFAR‑10/100, ImageNet‑R) and language (GLUE, SQuAD) benchmarks. Compared against three baselines—naïve addition of τ_A, few‑shot fine‑tuning of the entire model, and more elaborate re‑basin methods—the proposed approach consistently achieves higher accuracy and lower loss, especially in the low‑data regime (5–20 labeled examples). In multi‑task and multi‑source merging scenarios, masked task vectors combine more harmoniously than standard task‑arithmetic or model‑soup techniques, reducing conflicts and improving overall generalization.
Ablation studies explore the effect of mask sparsity and the scaling factor α, confirming that overly large α can violate the first‑order guarantee, while moderate values retain the descent property. Because GradFix only requires gradient signs, its computational and memory overhead is negligible, making it applicable to extremely large models (e.g., 175 B‑parameter LLMs) without any additional training.
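The breakdown of the descent guarantee at large α can be reproduced on a toy quadratic loss, where the threshold is available in closed form: the first-order gain αΣ|g_i||τ_{A,i}| is eventually dominated by the second-order penalty. All names and the objective below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy quadratic target loss: L(theta) = 0.5 * ||theta - theta_star||^2.
theta_star = rng.normal(size=32)
def loss(th):
    return 0.5 * np.sum((th - theta_star) ** 2)

theta_b = rng.normal(size=32)   # stand-in target pre-training
tau_a = rng.normal(size=32)     # stand-in source task vector
g = theta_b - theta_star        # exact gradient at theta_b

mask = (np.sign(tau_a) == np.sign(-g)).astype(float)
S = np.sum(mask * np.abs(g) * np.abs(tau_a))  # first-order gain per unit alpha
T = np.sum(mask * tau_a ** 2)                 # second-order penalty (quadratic loss)
alpha_crit = 2 * S / T                        # on this loss, descent holds iff alpha < alpha_crit

def delta_loss(alpha):
    """Change in loss after applying the masked update with scale alpha."""
    d = alpha * mask * tau_a
    return loss(theta_b + d) - loss(theta_b)
```

Evaluating `delta_loss` below and above `alpha_crit` shows the loss decreasing for moderate α and increasing once α is too large, mirroring the ablation's finding.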
In summary, GradFix offers a simple, theoretically grounded, and empirically validated solution for transferring fine‑tuning knowledge across heterogeneous pre‑trained models. By leveraging gradient‑sign masking, it sidesteps the need for costly permutation alignment or full re‑training, dramatically reducing adaptation costs in rapidly evolving AI ecosystems and enabling effective low‑resource deployment of the latest foundation models.