Adapt before Continual Learning
Continual Learning (CL) seeks to enable neural networks to incrementally acquire new knowledge (plasticity) while retaining existing knowledge (stability). Although pre-trained models (PTMs) have provided a strong foundation for CL, existing approaches face a fundamental challenge in balancing these two competing objectives. Current methods typically address stability by freezing the PTM backbone, which severely limits the model's plasticity, particularly when the incoming data distribution diverges significantly from the pre-training data. Alternatively, sequentially fine-tuning the entire PTM can adapt to new knowledge but often leads to catastrophic forgetting, highlighting the critical stability-plasticity trade-off in PTM-based CL. To address this limitation, we propose Adapting PTMs before the core CL process (ACL), a novel framework that introduces a plug-and-play adaptation phase prior to learning each new task. During this phase, ACL refines the PTM backbone by aligning embeddings with their original class prototypes while distancing them from those of irrelevant classes. We show theoretically and empirically that this mechanism achieves a desirable balance between stability and plasticity, significantly improving CL performance across benchmarks and integrated methods. Code is available at https://github.com/byyx666/ACL_code.
💡 Research Summary
Continual learning (CL) aims to enable neural networks to acquire new knowledge sequentially while preserving previously learned information. Existing CL approaches fall into three broad categories: replay‑based methods that store a subset of past data, regularization‑based methods that constrain weight updates, and architecture‑based methods that allocate separate parameters for each task. With the rise of large pre‑trained models (PTMs), a new line of PTM‑based CL has emerged. Most of these methods freeze the PTM backbone to retain its general knowledge and only train lightweight modules such as prompts or adapters. While this yields strong stability, it often sacrifices plasticity because the frozen feature space may be suboptimal for downstream tasks, especially when the incoming data distribution diverges from the pre‑training data. Conversely, fully fine‑tuning the PTM improves plasticity but typically leads to catastrophic forgetting, destroying the very stability that PTMs provide.
The paper introduces “Adapt before Continual Learning” (ACL), a two‑phase framework designed to reconcile this stability‑plasticity dilemma. For each new task k, ACL first performs an adaptation phase on the PTM backbone (ϕ) and any existing lightweight modules (Θ) using only the current task’s data Dk. The adaptation loss, called LACL, encourages each adapted embedding ϕ∗(x) to be close (high cosine similarity) to its class prototype py and far (low similarity) from prototypes of other classes. Formally, LACL is a temperature‑scaled softmax cross‑entropy over cosine similarities, with all vectors ℓ2‑normalized on the unit hypersphere. Prototypes are computed as the mean of the frozen backbone’s embeddings for each class.
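The adaptation loss described above can be sketched as follows. This is a minimal reconstruction from the summary's description, not the authors' implementation: prototypes are the per-class means of frozen-backbone embeddings, and the loss is a softmax cross-entropy over temperature-scaled cosine similarities. The temperature value `tau=0.1` is an assumed default, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def class_prototypes(frozen_backbone, loader, num_classes, dim):
    """Mean frozen-backbone embedding per class (the paper's prototype definition)."""
    sums = torch.zeros(num_classes, dim)
    counts = torch.zeros(num_classes)
    with torch.no_grad():
        for x, y in loader:
            z = frozen_backbone(x)                      # frozen features phi(x)
            sums.index_add_(0, y, z)
            counts.index_add_(0, y, torch.ones(y.shape[0]))
    return sums / counts.clamp(min=1.0).unsqueeze(1)

def acl_loss(z, y, prototypes, tau=0.1):
    """Temperature-scaled softmax cross-entropy over cosine similarities.
    z: adapted embeddings phi*(x), shape (B, D); prototypes: shape (C, D).
    tau=0.1 is an assumed value for illustration."""
    z = F.normalize(z, dim=1)          # l2-normalize onto the unit hypersphere
    p = F.normalize(prototypes, dim=1)
    logits = z @ p.t() / tau           # scaled cosine similarities to each prototype
    return F.cross_entropy(logits, y)  # pull toward p_y, push from other prototypes
```

Minimizing `acl_loss` drives each adapted embedding toward its own class prototype (high cosine similarity) and away from the others, which is exactly the behavior the loss is designed to induce.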
Two theoretical results underpin the method. Proposition 1 shows that the probability of misclassifying a sample is upper‑bounded by the expected LACL divided by log 2; thus minimizing LACL directly reduces an upper bound on the current‑task error, enhancing plasticity. Proposition 2 demonstrates that LACL implicitly regularizes the deviation between the adapted and original features. By bounding the squared Euclidean distance ‖ϕ∗(x)−ϕ(x)‖² with terms involving distances to the class prototype, and noting that LACL minimizes the distance between adapted features and their prototypes, the authors prove that the expected feature drift is tightly controlled, preserving stability.
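The two bounds can be written compactly as follows. This is a sketch reconstructed from the summary's paraphrase (using the elementary inequality $(a+b)^2 \le 2a^2 + 2b^2$), not the paper's exact statements:

```latex
% Proposition 1 (plasticity): misclassification probability is bounded
% by the expected adaptation loss.
\Pr\big[\text{misclassify}(x)\big] \;\le\; \frac{\mathbb{E}\big[\mathcal{L}_{\mathrm{ACL}}\big]}{\log 2}

% Proposition 2 (stability): feature drift is controlled via the class
% prototype p_y, since (a+b)^2 <= 2a^2 + 2b^2 gives
\|\phi^{*}(x) - \phi(x)\|^{2}
  \;\le\; 2\,\|\phi^{*}(x) - p_{y}\|^{2} \;+\; 2\,\|p_{y} - \phi(x)\|^{2}
```

In the second bound, the term $\|p_y - \phi(x)\|^2$ is fixed once the prototypes are computed from the frozen backbone, while the term $\|\phi^{*}(x) - p_y\|^2$ is precisely what minimizing $\mathcal{L}_{\mathrm{ACL}}$ drives down; together they cap the expected feature drift.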
After adaptation, the adapted backbone ϕ∗k−1 is frozen. The second phase proceeds exactly as any existing CL algorithm would: the classification head C and the lightweight modules Θ∗k−1 are fine‑tuned on Dk while the backbone remains unchanged. This design makes ACL a plug‑and‑play wrapper that can be combined with a wide range of CL methods (e.g., L2P, DualPrompt, SSIA‑T, MOS) without altering their core mechanisms.
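The two-phase control flow above can be sketched as a generic wrapper. Here `adapt_fn` and `cl_update_fn` are hypothetical callbacks standing in for, respectively, the adaptation phase (minimizing the adaptation loss on the current task) and one round of whichever base CL method is being wrapped; neither name comes from the paper's code.

```python
from copy import deepcopy

def acl_train(backbone, modules, head, tasks, adapt_fn, cl_update_fn):
    """Plug-and-play ACL wrapper (sketch): adapt the backbone first,
    then run the underlying CL algorithm with the backbone frozen."""
    history = []
    for k, task_data in enumerate(tasks):
        # Phase 1: adapt the backbone on the current task's data only.
        # A frozen copy of the pre-adaptation backbone supplies the prototypes.
        frozen = deepcopy(backbone)
        backbone = adapt_fn(backbone, modules, frozen, task_data)
        # Phase 2: the adapted backbone stays fixed; only the classification
        # head and lightweight modules train, exactly as the base CL method
        # (e.g. L2P, DualPrompt, MOS) prescribes.
        head, modules = cl_update_fn(head, modules, backbone, task_data)
        history.append(k)
    return backbone, modules, head, history
```

Because Phase 2 is an opaque callback, the wrapper never touches the base method's internals, which is what makes ACL plug-and-play.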
Empirical evaluation spans several benchmarks, including ImageNet‑A‑Inc20, CIFAR‑100, and TinyImageNet, and integrates ACL with multiple state‑of‑the‑art CL approaches. Across all settings, ACL consistently improves three key metrics: (1) plasticity, measured as the average optimal accuracy per task; (2) stability, measured as average forgetting after the final task; and (3) overall CL performance, measured as the final average accuracy. Reported gains range from 2 to 5 percentage points in overall accuracy, with forgetting reduced by roughly 10 % relative to the frozen‑backbone baseline. Notably, the prototype‑based adaptation achieves these improvements without requiring replay buffers or additional memory, highlighting its efficiency.
The authors acknowledge certain limitations. Computing class prototypes incurs a cost linear in the number of classes, which may become significant for very large‑scale problems. The adaptation phase adds extra training time for each task, potentially increasing overall training duration. Moreover, the current formulation assumes a cosine‑based classifier and ℓ2‑normalized embeddings; extending the method to other similarity measures or unnormalized feature spaces may require further adaptation.
In summary, ACL offers a simple yet effective solution to the stability‑plasticity trade‑off in PTM‑based continual learning. By inserting a lightweight, prototype‑driven adaptation step before each task, it realigns the feature space to better suit new data while keeping the backbone’s original knowledge intact. The framework’s modularity allows seamless integration with existing CL pipelines, making it a promising direction for future research aimed at scalable, memory‑efficient continual learning with large pre‑trained models.