SFT Doesn't Always Hurt General Capabilities: Revisiting Domain-Specific Fine-Tuning in LLMs

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv paper.

Supervised Fine-Tuning (SFT) on domain-specific datasets is a common approach to adapt Large Language Models (LLMs) to specialized tasks but is often believed to degrade their general capabilities. In this work, we revisit this trade-off and present both empirical and theoretical insights. First, we show that SFT does not always hurt: using a smaller learning rate can substantially mitigate general performance degradation while preserving comparable target-domain performance. We then provide a theoretical analysis that explains these phenomena and further motivates a new method, Token-Adaptive Loss Reweighting (TALR). Building on this, and recognizing that smaller learning rates alone do not fully eliminate general-performance degradation in all cases, we evaluate a range of strategies for reducing general capability loss, including L2 regularization, LoRA, model averaging, FLOW, and our proposed TALR. Experimental results demonstrate that while no method completely eliminates the trade-off, TALR consistently outperforms these baselines in balancing domain-specific gains and general capabilities. Finally, we distill our findings into practical guidelines for adapting LLMs to new domains: (i) use a small learning rate to achieve a favorable trade-off, and (ii) when a stronger balance is desired, adopt TALR as an effective strategy.


💡 Research Summary

This paper revisits the widely observed trade‑off between domain‑specific supervised fine‑tuning (SFT) of large language models (LLMs) and the degradation of their general capabilities. While many recent works have reported substantial drops in performance on benchmarks such as GSM8K, HumanEval, and IFEval after fine‑tuning on specialized datasets (e.g., medical or e‑commerce corpora), the authors argue that this degradation is not inevitable and is heavily influenced by the choice of learning rate.

The authors conduct systematic experiments on two low‑performing domain datasets: MedCalc (clinical note reasoning) and ESCI (product classification). They fine‑tune several open‑source LLMs (Qwen‑3‑8B, Qwen‑2.5‑7B, Gemma‑3‑4B, etc.) using three learning rates: 1e‑6, 5e‑6, and 2e‑5. Across all models, the smallest learning rate consistently yields a Pareto‑optimal point: domain accuracy remains comparable to that achieved with larger rates, while the drop in general‑purpose benchmark scores is dramatically reduced. The authors also note that when the target sequence contains only the final label (no chain‑of‑thought), the acceptable learning‑rate range widens, suggesting that the presence of intermediate reasoning steps makes the training dynamics more sensitive to step size.
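The Pareto-optimality claim above can be made concrete with a small selection routine. The sweep numbers below are purely illustrative placeholders, not the paper's reported figures; each learning rate maps to a (domain accuracy, general-benchmark average) pair, and a setting is kept only if no other setting matches or beats it on both axes:

```python
# Hypothetical sweep results: learning rate -> (domain accuracy, general
# benchmark average). Values are illustrative, not the paper's numbers.
sweep = {
    1e-6: (0.61, 0.72),
    5e-6: (0.62, 0.55),
    2e-5: (0.62, 0.31),
}

def pareto_front(results):
    """Return the learning rates whose score pair is not dominated
    (>= on both axes, and not identical) by any other setting."""
    def dominated(p, q):
        return q[0] >= p[0] and q[1] >= p[1] and q != p
    return [lr for lr, p in results.items()
            if not any(dominated(p, q) for q in results.values())]
```

With these placeholder numbers, the largest rate (2e-5) drops off the front because a smaller rate matches its domain accuracy while retaining far more general capability, mirroring the paper's qualitative finding.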

To explain why small learning rates preserve general abilities, the paper offers an information‑theoretic view of LLMs as compressors. A shift from one model distribution to another changes the expected code length by the difference in KL divergences to the true data distribution. Small learning rates produce modest KL reductions, limiting catastrophic drift from the pretrained knowledge. However, even with modest updates, certain “hard tokens” (low‑probability, domain‑specific terms) can dominate the loss and cause selective over‑fitting, leading to residual general‑capability loss.
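The compression view can be stated precisely. For data drawn from the true distribution $p$, the expected code length under a model $q$ is the cross-entropy, so moving from the pretrained model to the fine-tuned one shifts it by a difference of KL divergences (a standard identity; the notation here is ours, not necessarily the paper's):

```latex
\mathbb{E}_{x \sim p}\,[-\log q(x)] \;=\; H(p) + \mathrm{KL}(p \,\|\, q)
```

```latex
\Delta \;=\; \mathbb{E}_{p}\,[-\log q_{\mathrm{ft}}(x)] \;-\; \mathbb{E}_{p}\,[-\log q_{\mathrm{pre}}(x)]
\;=\; \mathrm{KL}(p \,\|\, q_{\mathrm{ft}}) \;-\; \mathrm{KL}(p \,\|\, q_{\mathrm{pre}})
```

Because the entropy term $H(p)$ cancels, the change in expected code length depends only on how far each model sits from the true distribution, which is why bounding the size of the update (via a small learning rate) bounds the drift.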

Addressing this, the authors propose Token‑Adaptive Loss Reweighting (TALR). TALR computes a per‑token difficulty estimate based on the model’s current probability versus the empirical token frequency, then solves a constrained optimization problem that yields closed‑form weights. Hard tokens receive lower weights early in training, preventing them from pulling the model away from its general knowledge. As training progresses and the model becomes more confident on these tokens, their weights automatically increase, creating a curriculum‑like dynamic.
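The mechanism can be sketched in a few lines. The weighting rule below is an illustrative monotone form, not the paper's exact closed-form solution of the constrained problem: each target token's weight grows with the model's current probability for it, so low-probability "hard" tokens are down-weighted early and automatically gain weight as the model grows confident, producing the curriculum-like dynamic described above. The temperature `tau` is our own knob for how aggressively hard tokens are suppressed:

```python
import math

def talr_weights(token_probs, tau=1.0):
    """Illustrative token-adaptive weights: monotone in the model's current
    probability for each target token, normalized to mean weight 1."""
    raw = [p ** tau for p in token_probs]       # low prob -> small raw weight
    z = sum(raw)
    return [len(raw) * r / z for r in raw]      # rescale so weights average 1

def talr_loss(token_probs, tau=1.0):
    """Weighted negative log-likelihood over the target tokens."""
    w = talr_weights(token_probs, tau)
    return sum(wi * -math.log(pi)
               for wi, pi in zip(w, token_probs)) / len(token_probs)
```

As training raises a hard token's probability, its weight rises toward (and past) the average, so the loss gradually shifts focus onto exactly the tokens that were initially suppressed.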

The paper benchmarks TALR against four established mitigation strategies: L2 regularization (parameter‑space penalty), LoRA (low‑rank adapters), model averaging (convex combination of pretrained and fine‑tuned weights), and FLOW (sample‑level loss reweighting). Across all evaluated metrics—domain accuracy, balanced accuracy for ESCI, and average scores on GSM8K, HumanEval, and IFEval—TALR consistently achieves the highest combined performance, reducing general‑capability loss more than any baseline while maintaining domain gains. The authors acknowledge that no method completely eliminates the trade‑off, but TALR offers the best practical balance.
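Of the baselines, model averaging is the simplest to write down: a convex combination of pretrained and fine-tuned parameters, with a single coefficient trading off the two endpoints. A minimal sketch over plain float lists (real implementations interpolate whole weight tensors; the function name and signature are ours):

```python
def average_models(pretrained, finetuned, alpha=0.5):
    """Convex combination of two parameter dicts: alpha=0 recovers the
    pretrained model, alpha=1 the fully fine-tuned one."""
    return {name: [(1 - alpha) * a + alpha * b
                   for a, b in zip(pretrained[name], finetuned[name])]
            for name in pretrained}
```

Sweeping `alpha` traces a one-parameter family between general and domain-specialized behavior, which is why model averaging serves as a natural post-hoc baseline for the trade-off TALR targets during training.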

A token‑level analysis reveals two key observations: (1) most tokens in the domain datasets are “easy” for the pretrained model, contributing little to degradation; only a small subset of hard tokens drives the forgetting effect. (2) TALR’s dynamic weighting creates a smooth shift of focus from easy to hard tokens, effectively injecting domain knowledge without erasing previously learned abilities.

Finally, the authors distill their findings into actionable guidelines for practitioners: (i) start with a small learning rate (≈1e‑6 to 5e‑6) when fine‑tuning on domain data to achieve a favorable trade‑off; (ii) if a stronger balance is required—especially when hard tokens are prevalent—apply TALR as an additional loss‑reweighting step. These recommendations require no extra data, minimal computational overhead, and are compatible with standard fine‑tuning pipelines.

In summary, the paper demonstrates that careful learning‑rate selection dramatically mitigates general‑capability loss, and that the proposed token‑adaptive loss reweighting further refines this balance, establishing a simple yet effective two‑step strategy for domain‑specific adaptation of large language models.

