CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR
We present CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker automatic speech recognition (ASR). In personalized AI scenarios, the joint availability of acoustic and linguistic cues naturally motivates the integration of target-speaker conditioning with contextual biasing in overlapping conversations. CALM implements this integration in an end-to-end framework through speaker embedding-driven target-speaker extraction and dynamic vocabulary-based contextual biasing. We evaluate CALM on simulated English (LibriSpeechMix) and Japanese (Corpus of Spontaneous Japanese mixtures, CSJMix). On two-speaker mixtures, CALM reduces biased word error rate (B-WER) from 12.7 to 4.7 on LibriSpeech2Mix and biased character error rate (B-CER) from 16.6 to 8.4 on CSJMix2 (eval3), demonstrating the effectiveness of joint acoustic-linguistic modeling across languages. We additionally report results on the AMI corpus (IHM-mix condition) to validate performance on standardized speech mixtures.
💡 Research Summary
The paper introduces CALM (Contextual Acoustic-Linguistic Modeling), an end-to-end framework that jointly leverages speaker-specific acoustic cues and dynamic linguistic biasing to improve multi-speaker automatic speech recognition (ASR). Traditional approaches treat acoustic and linguistic challenges separately: acoustic models focus on speaker attribution and separation, while contextual biasing methods inject domain-specific vocabularies through shallow fusion or external language models. In overlapping-speech scenarios, however, these two sources of error are tightly coupled: attributing speech to the wrong speaker often leads to incorrect recognition of speaker-relevant terms, especially rare names or technical jargon. CALM addresses this coupling by conditioning the acoustic encoder on a target-speaker embedding and simultaneously expanding the output vocabulary with a dynamic, bias-list-driven token set.
The system pipeline consists of four main components. First, a frozen WavLM-Large model extracts frame-level acoustic features from the mixed signal. Second, a speaker encoder (ECAPA-TDNN followed by a RawNet3 projector) processes a short enrollment utterance to produce a fixed-dimensional speaker embedding E_s. This embedding modulates every Conformer encoder layer via FiLM (Feature-wise Linear Modulation), amplifying target-speaker characteristics while suppressing interference from other speakers. Third, a bias encoder, a lightweight Transformer, encodes a biasing list B (e.g., rare words, proper nouns) into a set of embedding vectors, one per dynamic token. The static vocabulary V_stat and the dynamic token set V_d-vocab are concatenated, and two parallel linear projections generate logits for static and dynamic tokens. A weighted softmax with a bias weight μ balances the probability mass between the two vocabularies, preventing over-biasing toward dynamic tokens. Fourth, the model is trained with a multitask loss: CTC (with self-conditioning at intermediate layers), an attention-based sequence loss, and an auxiliary voice-activity-detection (VAD) loss that regularizes frame-level speaker-activity predictions.
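The FiLM conditioning step above can be sketched compactly: the speaker embedding predicts a per-channel scale (gamma) and shift (beta), which are applied to every time frame of an encoder layer's features. The following is a minimal numpy sketch; the projection weights, dimensions, and near-identity initialization are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def film(features, speaker_emb, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: per-channel scale (gamma) and shift (beta) predicted from
    the speaker embedding, applied to all time frames of the features."""
    gamma = speaker_emb @ w_gamma + b_gamma   # (d_model,) scale
    beta = speaker_emb @ w_beta + b_beta      # (d_model,) shift
    return gamma * features + beta            # broadcasts over (T, d_model)

# Toy dimensions, for illustration only.
rng = np.random.default_rng(0)
T, d_model, d_spk = 6, 8, 4
features = rng.standard_normal((T, d_model))   # one encoder layer's output
speaker_emb = rng.standard_normal(d_spk)       # enrollment embedding E_s
w_gamma = rng.standard_normal((d_spk, d_model))
w_beta = rng.standard_normal((d_spk, d_model))
b_gamma = np.ones(d_model)    # bias init so gamma ≈ 1 (near identity)
b_beta = np.zeros(d_model)    # bias init so beta ≈ 0
modulated = film(features, speaker_emb, w_gamma, b_gamma, w_beta, b_beta)
```

With zero projection weights and the identity biases, FiLM reduces to a pass-through, which is why near-identity initialization is a common choice: conditioning is learned gradually rather than disrupting the pretrained encoder.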
Experiments were conducted on three corpora: LibriSpeechMix (English 2‑ and 3‑speaker mixtures), CSJMix (Japanese 2‑ and 3‑speaker mixtures), and AMI (real meeting recordings with 4‑5 speakers). Enrollment utterances of 5 seconds (or 15 seconds for AMI) were used to obtain speaker embeddings. Biasing lists of varying sizes (N = 0–2000) were constructed per speaker (or per utterance for AMI) by selecting rare words/characters from the training set and adding distractors. Evaluation metrics included overall WER, unbiased WER (U‑WER), and biased WER (B‑WER), where B‑WER counts errors on words present in the bias list.
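B-WER restricts error counting to words in the bias list, while U-WER covers the rest. A minimal sketch of this split, assuming the common convention that substitutions and deletions are attributed by the reference word and insertions by the hypothesis word (the paper's exact scoring script may differ):

```python
def biased_error_split(ref, hyp, bias_set):
    """Align reference and hypothesis word lists via Levenshtein DP,
    then split the errors into biased vs. unbiased counts."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match / substitution
    errs = {"biased": 0, "unbiased": 0}
    i, j = n, m
    while i > 0 or j > 0:   # backtrace, bucketing each error
        if i > 0 and j > 0 and d[i][j] == d[i-1][j-1] + (ref[i-1] != hyp[j-1]):
            if ref[i - 1] != hyp[j - 1]:  # substitution: use ref word
                errs["biased" if ref[i - 1] in bias_set else "unbiased"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:  # deletion: use ref word
            errs["biased" if ref[i - 1] in bias_set else "unbiased"] += 1
            i -= 1
        else:                                       # insertion: use hyp word
            errs["biased" if hyp[j - 1] in bias_set else "unbiased"] += 1
            j -= 1
    return errs

ref = "the quick brown fox".split()
hyp = "the quik brown box".split()
errs = biased_error_split(ref, hyp, {"fox"})
n_biased = sum(w in {"fox"} for w in ref)
b_wer = errs["biased"] / n_biased  # biased errors over biased ref words
```

Here "fox" → "box" is a biased substitution and "quick" → "quik" an unbiased one, so the two buckets each receive one error.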
Results demonstrate substantial gains. On LibriSpeech2Mix, the baseline target-speaker ASR (A1) achieved a B-WER of 12.7 %; adding the auxiliary VAD loss alone (A2) left B-WER essentially unchanged at 12.7 %. CALM variants with dynamic vocabularies (A3) lowered B-WER to 4.3 % (N = 100) and 4.7 % (N = 2000). The full CALM model with both dynamic vocabularies and VAD (A4) improved further to 3.6 % B-WER at N = 2000, while maintaining competitive overall WER. Similar trends were observed for LibriSpeech3Mix and for Japanese CSJMix2, where B-CER dropped from 16.6 % to 8.4 % with CALM. On the AMI IHM-mix condition, CALM also outperformed prior state-of-the-art multi-speaker ASR systems, confirming its robustness in realistic meeting scenarios.
Ablation studies on the bias weight μ revealed that μ = 0.1 offers the best trade‑off between overall WER and bias‑specific accuracy; larger μ values overly favor dynamic tokens, improving B‑WER but harming overall performance, while smaller μ values preserve overall WER but under‑utilize the bias list. Error analysis showed that CALM primarily reduces substitution errors on biasing words, indicating that the speaker‑conditioned acoustic features help the model correctly associate rare lexical items with the appropriate speaker.
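The μ trade-off can be made concrete with a toy version of the weighted softmax from the architecture description. The interpolation below, where μ is the total probability mass reserved for dynamic tokens, is one common formulation and an assumption here, not necessarily the paper's exact scheme; it shows why large μ hurts overall WER: dynamic tokens receive mass μ even when no bias word is being spoken.

```python
import numpy as np

def biased_softmax(static_logits, dynamic_logits, mu):
    """Merge static-vocabulary and dynamic-token distributions;
    mu is the total probability mass given to dynamic tokens."""
    def softmax(z):
        e = np.exp(z - z.max())  # subtract max for numerical stability
        return e / e.sum()
    return np.concatenate([(1.0 - mu) * softmax(static_logits),
                           mu * softmax(dynamic_logits)])

static_logits = np.array([2.0, 0.5, -1.0])  # |V_stat| = 3 (toy)
dynamic_logits = np.array([1.0, 1.0])       # two dynamic (bias) tokens
p = biased_softmax(static_logits, dynamic_logits, mu=0.1)
# p[:3] carries mass 0.9; p[3:] carries exactly mu = 0.1
```

Sweeping μ in this sketch reproduces the ablation's qualitative behavior: μ = 0 disables biasing entirely, while μ near 1 drains mass from the static vocabulary regardless of the acoustics.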
In summary, CALM presents a novel integration of speaker-aware acoustic modeling and dynamic linguistic biasing within a single end-to-end architecture. By conditioning the acoustic encoder on speaker embeddings while decoding over a bias-driven dynamic vocabulary, the framework aligns acoustic and linguistic information, yielding significant reductions in both overall and bias-specific error rates across languages and datasets. The work opens avenues for further research, such as scaling to larger language models, exploring streaming inference, and extending to more diverse conversational domains.