Grow, Don't Overwrite: Fine-tuning Without Forgetting


Adapting pre-trained models to specialized tasks often leads to catastrophic forgetting, where new knowledge overwrites foundational capabilities. Existing methods either compromise performance on the new task or struggle to balance training stability with efficient reuse of pre-trained knowledge. We introduce a novel function-preserving expansion method that resolves this dilemma. Our technique expands model capacity by replicating pre-trained parameters within transformer submodules and applying a scaling correction that guarantees the expanded model is mathematically identical to the original at initialization, enabling stable training while exploiting existing knowledge. Empirically, our method eliminates the trade-off between plasticity and stability, matching the performance of full fine-tuning on downstream tasks without any degradation of the model’s original capabilities. Furthermore, we demonstrate the modularity of our approach, showing that by selectively expanding a small subset of layers we can achieve the same performance as full fine-tuning at a fraction of the computational cost.


💡 Research Summary

The paper tackles the pervasive problem of catastrophic forgetting that occurs when large pre‑trained language models are fine‑tuned on downstream tasks. Existing solutions fall into two categories: regularization‑based methods that penalize deviation from the original parameters (trading off capacity for stability) and capacity‑growth methods that add new parameters while freezing the original model (often requiring function‑preserving initialization and ignoring the knowledge already embedded in the pre‑trained weights). The authors propose a novel function‑preserving expansion technique that bridges this gap by expanding only the MLP sub‑modules of each Transformer layer.

The core idea is mathematically simple yet powerful. For a given layer's MLP, the up‑projection matrix $W^{(1)}\in\mathbb{R}^{h\times p}$ is duplicated $k$ times and concatenated horizontally, producing an expanded matrix $\hat{W}^{(1)}\in\mathbb{R}^{h\times kp}$. To keep the overall function unchanged, the down‑projection matrix $W^{(2)}\in\mathbb{R}^{p\times h}$ is duplicated $k$ times vertically and each copy is scaled by $1/k$, yielding $\hat{W}^{(2)}\in\mathbb{R}^{kp\times h}$. Because the activation after the up‑projection is simply repeated $k$ times, the scaled down‑projection exactly averages the $k$ identical contributions, resulting in the same output as the original MLP. Thus, at initialization the expanded model is mathematically identical to the base model, guaranteeing function preservation.
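The identity above can be checked in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the ReLU below stands in for the model's actual activation, and the argument holds for any elementwise activation, since repeated pre-activations remain repeated after it.

```python
import numpy as np

rng = np.random.default_rng(0)
h, p, k = 8, 32, 2  # model dim, MLP hidden dim, expansion factor (the paper uses k=2)

W1 = rng.standard_normal((h, p))  # up-projection W^(1)
W2 = rng.standard_normal((p, h))  # down-projection W^(2)

def mlp(x, up, down):
    return np.maximum(x @ up, 0.0) @ down  # ReLU stand-in for the model's activation

# Duplicate W^(1) horizontally; duplicate W^(2) vertically with a 1/k scale.
W1_hat = np.concatenate([W1] * k, axis=1)      # shape (h, k*p)
W2_hat = np.concatenate([W2 / k] * k, axis=0)  # shape (k*p, h)

x = rng.standard_normal((4, h))
print(np.allclose(mlp(x, W1, W2), mlp(x, W1_hat, W2_hat)))  # → True
```

The pre-activation $x\hat{W}^{(1)}$ is the original activation repeated $k$ times, and each repeated block is multiplied by $W^{(2)}/k$, so the $k$ contributions average back to the original output exactly.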

Two fine‑tuning regimes are explored. G‑Freeze freezes all original parameters and trains only the newly added weights in both $\hat{W}^{(1)}$ and $\hat{W}^{(2)}$. This strategy preserves the original knowledge completely while allowing the model to acquire new skills. G‑Train is designed for cognitively demanding tasks; it trains the entire expanded up‑projection matrix while keeping the down‑projection (and all original parameters) frozen, based on the hypothesis that factual knowledge resides mainly in the down‑projection.

Experiments are conducted on the Gemma‑3‑1B model (roughly one billion parameters) across four downstream tasks with varying domain shifts: English‑French translation (MTNT), scientific entailment (SciTail), science question answering (QASC), and mathematical reasoning (MathQA). Knowledge retention is measured on the WinoGrande benchmark. Results show that G‑Freeze matches or exceeds standard supervised fine‑tuning (SFT) on the new tasks while preserving WinoGrande performance almost perfectly, effectively eliminating catastrophic forgetting. G‑Train outperforms G‑Freeze on MathQA, confirming that additional plasticity in the up‑projection benefits complex reasoning tasks.

A key practical contribution is the demonstration of parameter efficiency. Expanding all layers doubles the MLP hidden dimension, but only about 60 % of the original parameters become trainable (the new copies). Moreover, by selecting a small subset of layers—identified via a simple heuristic that ranks layers by the magnitude of weight updates during a brief SFT run—the authors achieve comparable performance while reducing trainable parameters to roughly 30 % of the full model. This modularity enables substantial computational savings.
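The layer-selection heuristic can be sketched as follows. The summary only says layers are ranked by the "magnitude of weight updates" during a brief SFT run, so the Frobenius norm used here, and the function name, are assumptions.

```python
import numpy as np

def rank_layers_by_update(weights_before, weights_after, top_k):
    """Rank layers by the Frobenius norm of their weight change during a
    brief SFT probe run; return the indices of the top_k movers.
    Inputs are dicts mapping layer index -> weight matrix."""
    scores = {
        i: float(np.linalg.norm(weights_after[i] - weights_before[i]))
        for i in weights_before
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy example: layer 2 moved the most during the probe run, layer 0 barely moved.
before = {i: np.zeros((4, 4)) for i in range(3)}
after = {0: np.full((4, 4), 0.01), 1: np.full((4, 4), 0.1), 2: np.full((4, 4), 1.0)}
print(rank_layers_by_update(before, after, top_k=2))  # → [2, 1]
```

Only the layers returned by such a ranking would then be expanded, leaving the rest of the model at its original width.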

Further analysis reveals scaling behavior: performance on downstream tasks improves monotonically with the number of expanded layers, especially for high‑complexity tasks like MathQA. The authors also examine the rank of the weight‑update matrices during fine‑tuning. High‑rank updates are localized to a few layers for simpler tasks (entailment, translation) but spread across almost all layers for MathQA, explaining why broader expansion is beneficial for reasoning‑heavy tasks.
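A simple SVD-based proxy makes the rank analysis concrete. The paper's exact rank measure is not given in the summary, so the 99%-energy threshold below is an assumption; the point is only that update matrices $\Delta W$ concentrated in a few singular directions score low, while spread-out updates score high.

```python
import numpy as np

def effective_rank(delta_w, energy=0.99):
    """Smallest number of singular values whose squared sum captures
    `energy` of the update's total spectral energy (illustrative measure)."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cum, energy) + 1)

rng = np.random.default_rng(0)
low = np.outer(rng.standard_normal(16), rng.standard_normal(16))  # rank-1 update
high = rng.standard_normal((16, 16))                              # spread-out update
print(effective_rank(low), effective_rank(high))
```

Under this proxy, the paper's observation reads as: for entailment and translation, only a few layers show high-effective-rank $\Delta W$, whereas for MathQA nearly every layer does.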

In summary, the paper makes four major contributions: (1) a mathematically guaranteed function‑preserving expansion method that reuses pre‑trained weights, (2) empirical evidence that this method matches SFT performance while completely preventing catastrophic forgetting, (3) a modular framework that achieves full performance with a fraction of trainable parameters, and (4) a thorough analysis of how expansion size and task complexity interact. Limitations include the focus on MLP sub‑modules (the approach has not yet been extended to attention heads) and the fixed expansion factor $k=2$ used throughout experiments. Future work could explore applying the same principle to attention layers, automating layer selection, and investigating non‑integer or adaptive expansion factors. Overall, the proposed technique offers a simple, theoretically sound, and practically effective solution to the stability‑plasticity dilemma in fine‑tuning large language models.

