Routing without Forgetting

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Continual learning in transformers is commonly addressed through parameter-efficient adaptation: prompts, adapters, or LoRA modules are specialized per task while the backbone remains frozen. Although effective in controlled multi-epoch settings, these approaches rely on gradual gradient-based specialization and struggle in Online Continual Learning (OCL), where data arrive as a non-stationary stream and each sample may be observed only once. We recast continual learning in transformers as a routing problem: under strict online constraints, the model must dynamically select the appropriate representational subspace for each input without explicit task identifiers or repeated optimization. We thus introduce Routing without Forgetting (RwF), a transformer architecture augmented with energy-based associative retrieval layers inspired by Modern Hopfield Networks. Instead of storing or merging task-specific prompts, RwF generates dynamic prompts through single-step associative retrieval over the transformer token embeddings at each layer. Retrieval corresponds to the closed-form minimization of a strictly convex free-energy functional, enabling input-conditioned routing within each forward pass, independently of iterative gradient refinement. Across challenging class-incremental benchmarks, RwF improves over existing prompt-based methods. On Split-ImageNet-R and Split-ImageNet-S, RwF outperforms prior prompt-based approaches by a large margin, even in few-shot learning regimes. These results indicate that embedding energy-based associative routing directly within the transformer backbone provides a principled and effective foundation for OCL.


💡 Research Summary

In this paper the authors address the challenging problem of online continual learning (OCL) for vision transformers, where data arrive as a non‑stationary stream and each sample can be seen only once. Traditional parameter‑efficient adaptation methods—prompt pools, adapters, LoRA—rely on gradual, gradient‑driven specialization of task‑specific modules while keeping the backbone frozen. Such approaches work well in multi‑epoch incremental settings but struggle under strict OCL constraints because there is insufficient time for task‑specific parameters to converge before the data distribution shifts.

The authors propose to view continual learning in transformers as a routing problem: the model must instantly select the appropriate representational subspace for each input without explicit task identifiers or repeated optimization. To this end they introduce Routing without Forgetting (RwF), a transformer architecture augmented with energy‑based associative retrieval layers inspired by Modern Hopfield Networks. Instead of storing static prompts, RwF generates dynamic prompts on the fly via a single‑step associative retrieval over the token embeddings at each transformer layer.

Core Mechanism

At layer $\ell$, the token matrix $Z^{\ell}\in\mathbb{R}^{L\times d}$ is first projected to keys $K^{\ell}=Z^{\ell}W_K$ and values $V^{\ell}=Z^{\ell}W_V$. A small set of learnable query vectors $Q^{\ell}\in\mathbb{R}^{m\times d}$ (with $m\ll L$) is also projected to $\tilde Q^{\ell}=Q^{\ell}W_Q$. The associative routing operator computes a routing matrix and retrieves the dynamic prompt in a single step:

$$A^{\ell}=\operatorname{softmax}\!\bigl(\beta\,\tilde Q^{\ell}(K^{\ell})^{\top}\bigr),\qquad P^{\ell}=A^{\ell}V^{\ell},$$

where $\beta>0$ is an inverse-temperature parameter and the rows of $P^{\ell}\in\mathbb{R}^{m\times d}$ serve as the layer's dynamic prompt. This softmax-weighted readout is the one-step update of a Modern Hopfield Network and corresponds to the closed-form minimizer of the strictly convex free-energy functional mentioned above, so routing happens within the forward pass without iterative refinement.
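The retrieval step above can be sketched in a few lines of NumPy. This is a minimal illustration of the shapes involved, not the authors' implementation: the weight matrices, the value of $\beta$, and how $P^{\ell}$ is injected back into the transformer are assumptions for the sake of the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def associative_route(Z, Q, W_K, W_V, W_Q, beta=1.0):
    """Single-step associative retrieval of a dynamic prompt.

    Z : (L, d) token embeddings at the current layer
    Q : (m, d) learnable query vectors, m << L
    Returns P : (m, d) retrieved prompt rows.
    """
    K = Z @ W_K                              # keys    (L, d)
    V = Z @ W_V                              # values  (L, d)
    Qt = Q @ W_Q                             # queries (m, d)
    A = softmax(beta * Qt @ K.T, axis=-1)    # routing matrix (m, L), rows sum to 1
    return A @ V                             # dynamic prompt (m, d)

# Toy usage with random (untrained) weights, shapes only.
rng = np.random.default_rng(0)
L_tok, d, m = 8, 16, 2
Z = rng.normal(size=(L_tok, d))
Q = rng.normal(size=(m, d))
W_K, W_V, W_Q = (rng.normal(size=(d, d)) for _ in range(3))
P = associative_route(Z, Q, W_K, W_V, W_Q, beta=0.5)
print(P.shape)  # (2, 16)
```

Because the retrieval is a single convex-energy minimization rather than a gradient loop, the cost per layer is one $(m\times L)$ attention-style product, which is what makes the scheme viable under the one-pass constraint of OCL.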

