Continual Learning for Monolingual End-to-End Automatic Speech Recognition
Adapting Automatic Speech Recognition (ASR) models to new domains results in a deterioration of performance on the original domain(s), a phenomenon called Catastrophic Forgetting (CF). Even monolingual ASR models cannot be extended to new accents, dialects, topics, etc. without suffering from CF, making them unable to be continually enhanced without storing all past data. Fortunately, Continual Learning (CL) methods, which aim to enable continual adaptation while overcoming CF, can be used. In this paper, we implement an extensive number of CL methods for End-to-End ASR and test and compare their ability to extend a monolingual Hybrid CTC-Transformer model across four new tasks. We find that the best performing CL method closes the gap between the fine-tuned model (lower bound) and the model trained jointly on all tasks (upper bound) by more than 40%, while requiring access to only 0.6% of the original data.
💡 Research Summary
This paper investigates continual learning (CL) for monolingual end‑to‑end (E2E) automatic speech recognition (ASR), focusing on the catastrophic forgetting (CF) problem that arises when a model is adapted to new domains such as different accents, dialects, or topics. The authors use a hybrid CTC‑Transformer architecture (CTC loss weighted 0.3, decoder cross‑entropy 0.7) and evaluate a wide range of CL strategies on four sequential tasks derived from the Dutch Corpus Gesproken Nederlands (CGN): NL‑main, VL‑main, NL‑rest, and VL‑rest.
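The 0.3/0.7 loss weighting of the hybrid CTC-Transformer can be sketched as a simple weighted sum of the two training objectives. The following is a minimal PyTorch sketch, not the authors' implementation; tensor shapes and the helper name `hybrid_loss` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(encoder_logits, decoder_logits, targets,
                input_lengths, target_lengths, ctc_weight=0.3):
    """L = w * L_CTC + (1 - w) * L_CE with w = 0.3, the weighting
    reported in the summary. Assumed shapes: encoder_logits (T, N, V),
    decoder_logits (N, S, V), targets (N, S)."""
    # CTC branch: frame-level alignment-free loss on the encoder outputs
    log_probs = encoder_logits.log_softmax(dim=-1)
    l_ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                       blank=0, zero_infinity=True)
    # Attention-decoder branch: token-level cross-entropy
    l_ce = F.cross_entropy(decoder_logits.transpose(1, 2), targets)
    return ctc_weight * l_ctc + (1.0 - ctc_weight) * l_ce
```

In practice the same interpolation is typically also applied at decode time to combine CTC and decoder scores.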
Two families of CL methods are implemented: regularization‑based (Elastic Weight Consolidation (EWC), Memory‑Aware Synapses (MAS), Continual Learning with Sampled Quasi‑Newton (CSQN), and Learning Without Forgetting (LWF)) and replay‑based (Experience Replay (ER), weighted ER, Batch‑level ER (BER), A‑GEM, and Knowledge Distillation (KD)). A novel hyper‑parameter selection scheme for the regularization weight λ is proposed that requires no validation data from previous tasks; instead, it uses the TER on the new task before and after a short adaptation to adjust λ automatically.
Memory constraints are realistic: after each task, only 500 utterances (≈0.6% of the original training data) are stored. The authors evaluate models using Average Word Error Rate (AWER), Backward Transfer (BWT), Forward Transfer (FWT), Coverage (COV) – the percentage of the gap closed between fine‑tuning (FT, lower bound) and continued joint training (CJT, upper bound) – and storage overhead measured in model equivalents.
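These metrics can be computed from the matrix of per-task WERs. The sketch below uses one common convention from the CL literature (lower-is-better WER; BWT measured as the average WER increase on old tasks); the paper's exact definitions may differ in sign or reference model.

```python
import numpy as np

def cl_metrics(R, awer_ft, awer_cjt):
    """R[i, j] = WER (%) on task j after sequentially training through task i.

    AWER: mean WER over all tasks after training on the last one.
    BWT:  mean change in WER on earlier tasks (>0 means forgetting
          under this lower-is-better convention).
    COV:  percentage of the FT-CJT gap closed by the method.
    """
    T = R.shape[0]
    awer = R[-1].mean()
    bwt = np.mean([R[-1, j] - R[j, j] for j in range(T - 1)])
    cov = 100.0 * (awer_ft - awer) / (awer_ft - awer_cjt)
    return awer, bwt, cov
```

With this convention, a method with AWER between the FT and CJT bounds gets a COV between 0% and 100%.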
Results show that regularization‑based methods struggle in this setting. EWC and MAS actually perform worse than the fine‑tuned baseline, while CSQN variants only achieve marginal improvements. The difficulty stems from the high similarity between tasks, which leads to overlapping important parameters; preserving them either blocks learning of new tasks or causes forgetting of old ones. LWF improves learning of new tasks but, lacking any replay, reduces forgetting only modestly (COV ≈ 12%).
Replay‑based approaches benefit significantly from the small memory. Knowledge Distillation (KD) – which applies the same teacher‑student loss as LWF but computes the teacher outputs on memory samples – achieves the best overall performance: AWER = 25.0% (vs. 27.3% for FT), BWT = ‑1.2% (substantial reduction of forgetting), and COV ≈ 42%, closing more than 40% of the FT–CJT gap. KD also maintains the forward transfer of FT (FWT ≈ 0) while dramatically improving backward transfer. ER with a weighting factor (ER(λ)) and BER also improve over FT, but ER without weighting overfits the memory (0% WER on memory but poor test performance). A‑GEM learns new tasks reasonably well but suffers from severe forgetting (COV ≈ 22%).
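The KD term applied to replayed memory samples is essentially a cross-entropy between the old model's soft outputs and the current model's predictions. A minimal NumPy sketch, with the temperature parameter and function names assumed for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kd_replay_loss(student_logits, teacher_logits, temperature=1.0):
    """Teacher-student loss on memory samples: cross-entropy between the
    old (teacher) model's soft targets and the new (student) model's
    predictions. Added to the new-task loss during training."""
    p_teacher = softmax(teacher_logits / temperature)
    log_p_student = np.log(softmax(student_logits / temperature) + 1e-12)
    return -np.mean(np.sum(p_teacher * log_p_student, axis=-1))
```

The loss is minimized (equal to the teacher's entropy) when the student reproduces the teacher's distribution, which is what anchors the model to its pre-adaptation behavior on the stored utterances.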
A second set of experiments (Table II) examines how well each method preserves the original task (NL‑main) after learning VL‑main. KD keeps the memory WER essentially at the level of the original model (10.5% vs. 10.7%) while clearly reducing the test WER (29.4% vs. 33.0%). In contrast, ER memorizes the stored utterances perfectly (0% memory WER) yet fails to generalize (32.7% test WER).
The authors also test a fixed‑size memory (still 500 utterances total) across all four tasks. KD, ER(λ), and A‑GEM show only minor performance changes compared with the growing‑memory scenario, confirming that even a tiny, fixed memory (≈0.2 % of total data after four tasks) suffices for effective CL. Storage overhead remains low (≈1–3 model equivalents, i.e., ~105 MB per model).
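Keeping the memory at a fixed 500 utterances across tasks implies evicting old samples as new tasks arrive. One simple scheme, giving each seen task an equal share of the budget, can be sketched as follows; the function name and equal-share policy are assumptions, not necessarily the paper's exact selection strategy.

```python
import random

def update_fixed_memory(memory, new_task_samples, capacity=500, seed=0):
    """Fixed-capacity memory update (hypothetical equal-share policy).

    `memory` is a list of per-task sample lists. After task t, each of
    the t tasks keeps roughly capacity // t slots; surplus entries are
    dropped uniformly at random to make room for the new task.
    """
    rng = random.Random(seed)
    tasks = memory + [new_task_samples]
    per_task = capacity // len(tasks)
    return [rng.sample(s, min(len(s), per_task)) for s in tasks]
```

Under this policy the total memory never exceeds the capacity, so storage stays constant no matter how many tasks are learned.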
Key insights: (1) regularization alone is insufficient when tasks are highly similar; (2) replay with a very small memory dramatically mitigates CF; (3) applying knowledge distillation on replayed samples (KD) yields the best trade‑off between learning new tasks and retaining old knowledge; (4) the proposed λ‑selection method works without needing past validation data, making the approach realistic for production systems.
In conclusion, the paper provides a comprehensive benchmark of CL techniques for monolingual E2E ASR and demonstrates that replay‑based methods—especially KD—can close a large portion of the performance gap between fine‑tuning and joint training while using only a fraction of the original data. Future work could explore smarter memory selection, dynamic memory sizing, multi‑language or multi‑modal extensions, and integration with external language models.