Charting Empirical Laws for LLM Fine-Tuning in Scientific Multi-Discipline Learning
While large language models (LLMs) have achieved strong performance through fine-tuning within individual scientific domains, their learning dynamics in multi-disciplinary contexts remain poorly understood, despite the promise of improved generalization and broader applicability through cross-domain knowledge synergy. In this work, we present the first systematic study of multi-disciplinary LLM fine-tuning, constructing a five-discipline corpus and analyzing the learning patterns of full fine-tuning, LoRA, LoRA-MoE, and LoRA compositions. In particular, our study shows that multi-disciplinary learning is substantially more variable than single-discipline training and distills four consistent empirical laws: (1) Balance-then-Diversity: low-resource disciplines degrade performance unless mitigated via diversity-aware upsampling; (2) Merge-then-Align: restoring instruction-following ability is critical for cross-discipline synergy; (3) Optimize-then-Scale: parameter scaling offers limited gains without prior design optimization; and (4) Share-then-Specialize: asymmetric LoRA-MoE yields robust gains with minimal trainable parameters via shared low-rank projection. Together, these laws form a practical recipe for principled multi-discipline fine-tuning and provide actionable guidance for developing generalizable scientific LLMs.
💡 Research Summary
This paper presents the first systematic investigation of multi‑disciplinary fine‑tuning for scientific large language models (LLMs). The authors construct a corpus that spans five scientific domains—mathematics, chemistry, biology, medicine, and geography—totaling 3.3 million text samples (≈188 million unique tokens) with highly imbalanced data sizes (e.g., 2 M math samples versus 40 K geography samples). Using Qwen2.5‑7B Instruct as the base model, they evaluate four fine‑tuning strategies: (1) full‑model tuning (FT), which updates all parameters; (2) LoRA, a low‑rank adaptation method that trains only small projection matrices; (3) LoRA‑MoE, which combines multiple LoRA experts with a lightweight gating network; and (4) LoRA‑Comp, which composes pre‑trained discipline‑specific LoRA adapters and trains only a router to blend them. All experiments share the same hyper‑parameters (learning rate, weight decay, a single epoch) and a LoRA rank of 16; the number of experts equals the number of disciplines (five).
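To make the parameter-efficiency gap between full-model tuning and LoRA concrete, here is a minimal NumPy sketch of a rank-16 low-rank update. This is illustrative toy code, not the authors' implementation; the matrix dimensions and the scaling factor `alpha` are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions for one projection layer of a 7B-class model;
# rank r = 16 matches the paper's setup.
d_out, d_in, r = 3584, 3584, 16

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-initialized
alpha = 32                                  # assumed scaling hyper-parameter

# Effective weight after adaptation: W stays frozen; only A and B are trained.
W_adapted = W + (alpha / r) * (B @ A)

full_params = W.size
lora_params = A.size + B.size
print(f"trainable fraction per layer: {lora_params / full_params:.4%}")
```

Because `B` is zero-initialized, the adapted weight equals the pretrained weight at the start of training, and the trainable parameters amount to well under 1% of the layer — the source of LoRA's efficiency relative to full-model tuning.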
Evaluation is performed on in‑domain benchmarks: GSM8K for mathematics, ChemBench for chemistry, Mol‑Instruction for biology, MedMCQA for medicine, and GeoBench for geography. Accuracy is the primary metric. Results show that single‑discipline fine‑tuning scales smoothly with data size, achieving steady improvements as more data are added. In contrast, multi‑disciplinary fine‑tuning exhibits markedly higher variance, especially for low‑resource fields (medicine, biology, geography), and on average yields lower accuracy than the single‑discipline baseline. Full‑model tuning performs surprisingly poorly in the multi‑disciplinary setting despite updating the largest number of parameters; it suffers from conflicting gradient signals and over‑fitting to dominant data sources. Among parameter‑efficient methods, LoRA‑Comp provides the most stable learning curves because it only trains a lightweight router, yet its limited capacity restricts cross‑disciplinary interaction, leading to lower peak performance than LoRA or LoRA‑MoE trained from scratch on the aggregated data.
From these observations the authors distill four empirical “laws” that together form a practical recipe for multi‑disciplinary scientific LLM fine‑tuning:
- Balance‑then‑Diversity – Low‑resource disciplines disproportionately harm overall learning. Simple duplication up‑sampling is insufficient; instead, diversity‑aware up‑sampling that preserves topic variety while balancing contributions mitigates degradation.
- Merge‑then‑Align – Multi‑disciplinary fine‑tuning tends to erode the model’s instruction‑following ability. Mixing a modest proportion of general instruction data (e.g., Alpaca‑style prompts) restores alignment and unlocks synergistic transfer across domains.
- Optimize‑then‑Scale – Scaling the number of trainable parameters yields marginal gains unless the underlying architecture and training schedule are first optimized (e.g., choosing appropriate LoRA rank, expert count, gating design). Parameter scaling should follow, not precede, such design optimizations.
- Share‑then‑Specialize – Asymmetric parameter sharing in LoRA‑MoE (e.g., sharing the A projection across experts while allowing expert‑specific B projections) first encourages cross‑discipline knowledge sharing and then permits stable expert specialization. This approach attains performance comparable to full‑model fine‑tuning while using only a small fraction of trainable parameters.
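The diversity-aware up-sampling of the first law can be sketched with a simple round-robin over topic buckets: duplicates are spread evenly across topics rather than repeating whichever items come first. This is a minimal illustration of the idea, not the authors' actual procedure; the topic key and sample names are hypothetical.

```python
import itertools
from collections import defaultdict

def diversity_upsample(samples, topic_of, target_size):
    """Up-sample by cycling round-robin over topic buckets, so each
    topic contributes evenly instead of one item being duplicated."""
    buckets = defaultdict(list)
    for s in samples:
        buckets[topic_of(s)].append(s)
    # One infinite cycle per topic; then cycle over the topics themselves.
    topic_cycles = [itertools.cycle(b) for b in buckets.values()]
    out = []
    for cyc in itertools.cycle(topic_cycles):
        if len(out) >= target_size:
            break
        out.append(next(cyc))
    return out

# Toy low-resource discipline: four geography samples across two topics.
samples = ["glacier-1", "glacier-2", "climate-1", "climate-2"]
topic = lambda s: s.split("-")[0]
upsampled = diversity_upsample(samples, topic, 6)
```

Naive duplication of a flat pool can over-represent a few items or topics; the round-robin keeps every topic's share balanced as the pool is inflated to the target size.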
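The asymmetric sharing of the fourth law can likewise be sketched in a few lines: one A projection is shared by all experts, each expert keeps its own B projection, and a lightweight gate mixes the expert outputs. Dimensions and initializations below are toy assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, n_experts = 64, 8, 5  # toy sizes; five experts = five disciplines

A_shared = rng.standard_normal((r, d)) * 0.01        # single A shared across experts
B = [np.zeros((d, r)) for _ in range(n_experts)]     # expert-specific B, zero-initialized
W_gate = rng.standard_normal((n_experts, d)) * 0.01  # lightweight gating network

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def moe_delta(x):
    """Gated sum of expert low-rank updates for input x."""
    g = softmax(W_gate @ x)   # per-expert mixing weights
    h = A_shared @ x          # shared low-rank projection (knowledge sharing)
    return sum(g[i] * (B[i] @ h) for i in range(n_experts))  # specialization

x = rng.standard_normal(d)
delta = moe_delta(x)

asym_params = A_shared.size + sum(b.size for b in B)
sym_params = n_experts * (A_shared.size + d * r)  # if every expert had its own A and B
```

Sharing `A` cuts the trainable budget relative to a fully symmetric LoRA-MoE while still letting each expert's `B` specialize, which is the mechanism the law credits for robust gains at low parameter cost.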
The paper concludes that multi‑disciplinary LLM fine‑tuning cannot be achieved by naïvely aggregating data or enlarging model size. Instead, a staged strategy—balancing data, preserving diversity, re‑aligning instruction behavior, optimizing architecture before scaling, and employing asymmetric shared‑expert MoE—provides a robust pathway toward generalizable scientific language models. The four empirical laws offer actionable guidance for future research aiming to build foundation models that seamlessly integrate knowledge across disparate scientific fields.