Dual-Phase Continual Learning: Supervised Adaptation Meets Unsupervised Retention
Foundational Vision-Language Models (VLMs) excel across diverse tasks, but adapting them to new domains without forgetting prior knowledge remains a critical challenge. Continual Learning (CL) addresses this challenge by enabling models to learn sequentially from new data while mitigating the forgetting of prior information, typically under supervised settings involving label shift. Nonetheless, abrupt distribution shifts can still cause substantial forgetting, potentially nullifying the benefits of supervised updates, especially when storing or replaying past data is infeasible. In this work, we propose leveraging unlabeled test-time data in an unsupervised manner to reinforce prior task performance without requiring replay or stored examples. Unlike traditional Test-Time Adaptation (TTA), which primarily focuses on domain shift or corruption, our method improves performance on earlier tasks by exploiting representative test samples encountered during deployment. We introduce a simple Teacher-Student framework with gradient-based sparse parameter updates, and show that it effectively mitigates forgetting in class-incremental CL for VLMs, offering a memory-free alternative to episodic replay with strong empirical results.
💡 Research Summary
The paper introduces a novel continual learning framework called DoSAPP (Double Smoothing via Affine Projected Parameters) that tackles catastrophic forgetting in class‑incremental learning (CIL) without relying on any replay buffer. The key insight is to exploit the unlabeled test‑time data that naturally arrives during deployment to reinforce knowledge of previously learned tasks. DoSAPP operates in two alternating phases: a supervised phase where a Vision‑Language model (based on CLIP) is fine‑tuned on the current task, and an unsupervised test‑time phase where the model processes each incoming sample once, discarding it afterward.
During the supervised phase, only a sparse subset (≈10 % of parameters) of the model is updated. The authors focus on the first MLP layer of each transformer block, scoring each parameter by the magnitude of its gradient on the current task’s loss. The top‑K parameters form a binary mask m, and only these parameters are updated via SGD. This sparsity limits the drift of the bulk of the network, preserving generic representations learned from the large pre‑training corpus.
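The selection-and-update step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: gradients are taken as a flat array for one MLP layer, and the 10 % sparsity level and learning rate are placeholders.

```python
import numpy as np

def topk_gradient_mask(grads, sparsity=0.10):
    """Build a binary mask selecting the top fraction of parameters
    by gradient magnitude on the current task's loss.

    grads: flat array of gradients for the chosen layer.
    """
    k = max(1, int(sparsity * grads.size))
    # Indices of the k entries with the largest |gradient|.
    idx = np.argsort(np.abs(grads))[-k:]
    mask = np.zeros_like(grads)
    mask[idx] = 1.0
    return mask

def masked_sgd_step(params, grads, mask, lr=0.01):
    # Only masked parameters move; the rest of the network stays frozen.
    return params - lr * mask * grads
```

Because the mask zeroes out roughly 90 % of the update, the bulk of the pre-trained weights never drifts, which is the mechanism the summary credits for preserving generic representations.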
A teacher‑student architecture is employed to stabilize updates. The student (M_S) receives the sparse updates, while the teacher (M_T) tracks the student through an exponential moving average (EMA). Crucially, the EMA uses dual momentum: parameters that are being actively updated (as indicated by mask m) receive a larger smoothing coefficient (δ), frozen parameters receive a smaller one (γ), and during the unsupervised phase an intermediate coefficient (λ) is used. This affine‑projected EMA allows the teacher to adapt quickly where needed while remaining inert elsewhere, mitigating the “drift‑vs‑stability” trade‑off.
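A sketch of the dual-momentum EMA, assuming the common convention `teacher += coeff * (student - teacher)`, in which a larger coefficient means faster tracking (so active parameters with the larger δ adapt quickly, frozen ones with the smaller γ stay nearly inert). The coefficient values are illustrative, not the paper's tuned hyperparameters.

```python
import numpy as np

def dual_momentum_ema(teacher, student, mask, delta=0.5, gamma=0.01):
    """Blend student weights into the teacher with per-parameter momentum.

    mask==1 marks actively updated parameters (coefficient delta);
    mask==0 marks frozen parameters (coefficient gamma, gamma < delta).
    During the unsupervised phase, an intermediate coefficient lambda
    would replace delta/gamma per the summary.
    """
    coeff = np.where(mask == 1.0, delta, gamma)
    # Larger coeff -> teacher tracks the student faster at that parameter.
    return teacher + coeff * (student - teacher)
```

With this rule, the teacher moves noticeably only where the sparse mask allowed the student to change, which is exactly the "adapt quickly where needed, inert elsewhere" behaviour the summary describes.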
In the unsupervised test‑time phase, each incoming sample x_i is passed through both teacher and student. The model compares the confidence (max logit) of the two; the higher‑confidence prediction becomes a pseudo‑label ȳ_i. The student then updates the same sparse parameter set using this pseudo‑label, and the teacher’s EMA is refreshed with the dual‑momentum rule. Because each sample is used only once, the method respects privacy constraints and incurs negligible storage overhead.
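The confidence-based pseudo-label selection can be sketched directly from the description above: whichever of the two models is more confident (by max logit) supplies the label for the student's single update on that sample.

```python
import numpy as np

def pseudo_label(teacher_logits, student_logits):
    """Return the pseudo-label from the more confident model.

    Confidence is the max logit, as stated in the summary; ties go
    to the teacher (an arbitrary choice for this sketch).
    """
    if teacher_logits.max() >= student_logits.max():
        return int(np.argmax(teacher_logits))
    return int(np.argmax(student_logits))
```

After this single update the sample is discarded, which is what gives the method its privacy-friendly, near-zero storage profile.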
Algorithm 1 formalizes the whole pipeline: (1) select sparse parameters, (2) supervised SGD on current task, (3) EMA update, (4) accumulate masks across tasks, (5) online pseudo‑labeling on test‑time data, (6) second EMA update.
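Step (4), accumulating masks across tasks, is the one stage not obvious from the per-phase descriptions; a plausible reading is a running union of the per-task binary masks, sketched below (an assumption, since the summary does not spell out the accumulation rule).

```python
import numpy as np

def accumulate_masks(mask_history):
    """Union of per-task binary masks: a parameter counts as 'active'
    if any task so far selected it for sparse updates.
    """
    acc = np.zeros_like(mask_history[0])
    for m in mask_history:
        acc = np.maximum(acc, m)
    return acc
```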
The authors evaluate DoSAPP on several CIL benchmarks (CIFAR‑100, ImageNet‑R, and domain‑shifted variants) with 5, 10, and 20 incremental tasks. They compare against a suite of replay‑based and test‑time adaptation baselines (EcoTT‑A, RMT, PSMT, etc.). DoSAPP consistently achieves higher average accuracy and lower forgetting rates, even though it uses no replay memory. For example, in the 20‑task scenario DoSAPP improves average accuracy by 3–5 percentage points over the best replay‑free baseline and reduces forgetting by more than 30 %.
Ablation studies reveal that (a) the exact sparsity level (5 %–20 %) has limited impact, confirming robustness to the choice of the sparsity ratio c, (b) dual‑momentum is essential—using a single EMA coefficient degrades performance sharply, and (c) the pseudo‑label selection rule (choosing the higher‑confidence prediction between teacher and student) yields more stable updates than using either model alone.
Limitations include dependence on the representativeness of test‑time streams (highly biased streams could generate poor pseudo‑labels), the current focus on CLIP’s image‑text architecture (extension to pure vision or pure language models would need adaptation), and the fact that only one SGD step per sample is performed, which may slow convergence on very large or high‑dimensional data.
Overall, DoSAPP proposes a practical “supervised‑unsupervised interleaved” learning paradigm that enables continual adaptation in memory‑constrained, privacy‑sensitive settings such as robotics or edge devices. Future work could explore confidence‑based filtering of pseudo‑labels, multi‑modal extensions, and cryptographically secure test‑time learning.