When pre-training hurts LoRA fine-tuning: a dynamical analysis via single-index models


Pre-training on a source task is usually expected to facilitate fine-tuning on similar downstream problems. In this work, we mathematically show that this naive intuition is not always true: excessive pre-training can computationally slow down fine-tuning optimization. We study this phenomenon for low-rank adaptation (LoRA) fine-tuning on single-index models trained under one-pass SGD. Leveraging a summary statistics description of the fine-tuning dynamics, we precisely characterize how the convergence rate depends on the initial fine-tuning alignment and the degree of non-linearity of the target task. The key takeaway is that even when the pre-training and downstream tasks are well aligned, strong pre-training can induce a prolonged search phase and hinder convergence. Our theory thus provides a unified picture of how pre-training strength and task difficulty jointly shape the dynamics and limitations of LoRA fine-tuning in a nontrivial tractable model.


💡 Research Summary

The paper investigates a counter‑intuitive phenomenon: excessive pre‑training can actually slow down the fine‑tuning of large models when low‑rank adaptation (LoRA) is used. While the community generally assumes that a well‑aligned pre‑trained representation always accelerates downstream learning, the authors demonstrate that, under certain conditions, stronger pre‑training prolongs the early “search phase” of stochastic gradient descent (SGD) and thus increases the sample complexity required for convergence.

To obtain a mathematically tractable setting, the authors consider a teacher–student framework based on single-index models. The teacher's target function is \(f^\star(x)=\phi(\omega^\star \cdot x)\), where \(\omega^\star\) is a unit vector in \(\mathbb{R}^d\) and \(\phi\) is a scalar activation. The pre-trained weight is modeled as a single direction \(\tilde\omega = \mu\,\omega^\star\), with \(\mu\in(0,1)\) quantifying the alignment between the pre-trained representation and the downstream task. LoRA fine-tuning is represented by a rank-one correction \(u\omega\), where \(u\in\mathbb{R}\) is a scalar and \(\omega\in\mathbb{S}^{d-1}\) is a unit vector that is learned together with \(u\). The student predictor therefore reads \(f(x) = \phi\big((\tilde\omega + u\,\omega)\cdot x\big)\).
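The teacher–student setup above can be sketched in a few lines of NumPy. The dimension `d`, alignment `mu`, LoRA scale `u`, and the choice of activation `phi` below are illustrative placeholders, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50        # ambient dimension (illustrative)
mu = 0.7      # pre-training alignment, mu in (0, 1)
phi = np.tanh # scalar activation (illustrative choice)

# Teacher direction omega_star: a unit vector in R^d.
omega_star = rng.standard_normal(d)
omega_star /= np.linalg.norm(omega_star)

# Pre-trained weight: a single direction aligned with the teacher.
w_pre = mu * omega_star

# LoRA rank-one correction: scalar u times a unit vector omega.
u = 0.1
omega = rng.standard_normal(d)
omega /= np.linalg.norm(omega)

def teacher(x):
    """Target f_star(x) = phi(omega_star . x)."""
    return phi(x @ omega_star)

def student(x):
    """Student f(x) = phi((w_pre + u * omega) . x)."""
    return phi(x @ (w_pre + u * omega))

# A batch of Gaussian inputs, as in the one-pass SGD setting.
x = rng.standard_normal((8, d))
print(teacher(x).shape, student(x).shape)
```

During fine-tuning, only `u` and `omega` would be updated by SGD on the squared loss between `student(x)` and `teacher(x)`, while `w_pre` stays frozen.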

