Adapting Diarization-Conditioned Whisper for End-to-End Multi-Talker Speech Recognition
We propose a speaker-attributed (SA) Whisper-based model for multi-talker speech recognition that combines target-speaker modeling with serialized output training (SOT). Our approach leverages a Diarization-Conditioned Whisper (DiCoW) encoder to extract target-speaker embeddings, which are concatenated into a single representation and passed to a shared decoder. This enables the model to transcribe overlapping speech as a serialized output stream with speaker tags and timestamps. In contrast to target-speaker ASR systems such as DiCoW, which decode each speaker separately, our approach performs joint decoding, allowing the decoder to condition on the context of all speakers simultaneously. Experiments show that the model outperforms existing SOT-based approaches and surpasses DiCoW on multi-talker mixtures (e.g., LibriMix).
💡 Research Summary
The paper introduces a novel speaker‑attributed Whisper model (SA‑DiCoW) that unifies target‑speaker conditioning (via Diarization‑Conditioned Whisper, DiCoW) with Serialized Output Training (SOT) to tackle multi‑talker automatic speech recognition (MT‑ASR). Traditional Whisper excels on single‑speaker audio but lacks mechanisms for overlapping speech and speaker attribution. DiCoW adds diarization‑conditioned layers to focus on a single target speaker, yet it decodes each speaker independently, which is computationally inefficient and prevents the decoder from exploiting cross‑speaker context. SOT, on the other hand, enables a single decoder to emit a serialized transcript containing speaker‑change tokens, but it does not incorporate explicit speaker modeling.
SA‑DiCoW bridges these gaps by running the DiCoW encoder separately for each speaker in a mixture, using STNO (Silence, Target, Non‑target, Overlap) masks derived from diarization. Each speaker‑specific encoder output (a “speaker‑channel” embedding) is passed through a learned affine transformation that injects a global speaker identity vector. All speaker‑channel embeddings are then concatenated along the time dimension, forming a unified encoder representation H̄ that preserves each speaker’s temporal structure. The authors compare alternative aggregation strategies (weighted sum, average, masked average) and find that concatenation yields the lowest concatenated minimum‑permutation word error rate (cpWER), especially on realistic meeting data with 4‑8 speakers, where it reduces cpWER by 64 % relative to the weighted sum.
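The channel construction above can be sketched in a few lines. This is a minimal illustration with made-up names (`concat_speaker_channels`, the shared `W`, `b`), not the authors' code: each per-speaker encoder output gets a speaker identity injected via an affine transform, and the results are stacked along the time axis.

```python
import numpy as np

def concat_speaker_channels(encoder_outputs, speaker_vectors, W, b):
    """Hypothetical sketch of SA-DiCoW's aggregation step.

    encoder_outputs: list of (T, D) arrays, one per speaker (the DiCoW
        encoder run once per speaker with that speaker's STNO mask).
    speaker_vectors: list of (D,) global speaker identity vectors.
    W, b: stand-ins for the learned affine transform, shapes (D, D), (D,).
    Returns a single (S*T, D) representation H_bar.
    """
    channels = []
    for H, s in zip(encoder_outputs, speaker_vectors):
        # Affine transform plus a broadcast speaker identity vector.
        channels.append(H @ W + b + s)
    # Time-wise concatenation: each speaker keeps its own contiguous
    # temporal segment inside the unified representation.
    return np.concatenate(channels, axis=0)
```

A weighted sum or average would instead collapse the speaker axis, which is the alternative the paper finds markedly worse on meeting data.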
The decoder remains essentially the original Whisper transformer decoder, with minimal extensions: three output heads for lexical tokens, speaker IDs, and timestamps. When an output position corresponds to a speaker‑timestamp token (e.g., <|s1_0.3|>), the token is first embedded as a standard timestamp token and then transformed by a speaker‑specific affine matrix, implicitly encoding speaker identity. The combined logits o_spk‑time = o_spk + o_time are used to select the final speaker‑timestamp token.
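The logit combination amounts to an outer sum over the two heads followed by a joint argmax. A minimal sketch, assuming independent speaker and timestamp logits (function name is illustrative):

```python
import numpy as np

def select_speaker_timestamp(o_spk, o_time):
    """Combine speaker-head and timestamp-head logits.

    o_spk: (S,) logits over speaker IDs.
    o_time: (K,) logits over timestamp bins.
    The joint score of token <|s_i at time t_k|> is o_spk[i] + o_time[k];
    the argmax over the (S, K) grid picks the emitted token.
    """
    joint = o_spk[:, None] + o_time[None, :]  # outer sum, shape (S, K)
    i, k = np.unravel_index(np.argmax(joint), joint.shape)
    return int(i), int(k)
```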
Training proceeds in two stages. First, only the newly introduced layers (speaker‑channel affine transforms, speaker‑timestamp output heads) are trained for 1 000 steps while freezing the pre‑trained Whisper parameters (learning rate 2e‑4, linear warm‑up over 500 steps). Second, the entire model is fine‑tuned with a reduced learning rate (2e‑6) for the Whisper backbone, allowing the model to retain its strong linguistic knowledge while adapting to multi‑talker scenarios. To prevent the model from over‑fitting to a fixed ordering of speaker IDs, a “speaker‑order augmentation” randomly permutes the diarization‑assigned speaker labels during training, forcing the network to rely on the actual acoustic embeddings rather than token‑ID shortcuts.
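The speaker-order augmentation is a simple label-permutation step. The sketch below uses hypothetical names (`permute_speaker_labels`, the `(speaker_id, text)` segment format) to show the idea: remap speaker IDs at random so the decoder cannot learn a fixed ID ordering.

```python
import random

def permute_speaker_labels(segments, num_speakers, rng=random):
    """Speaker-order augmentation (illustrative sketch).

    segments: list of (speaker_id, text) pairs from diarization.
    Randomly remaps the speaker IDs with a bijection, forcing the
    model to rely on acoustic embeddings rather than token-ID
    shortcuts tied to a particular speaker ordering.
    """
    ids = list(range(num_speakers))
    shuffled = ids[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(ids, shuffled))  # bijective relabeling
    return [(mapping[spk], text) for spk, text in segments]
```

Applied once per training example, this keeps the transcript content and segment order intact while decoupling speaker identity from any fixed ID assignment.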
Experiments are conducted on three English multi‑talker corpora: LibriMix (synthetic 2‑ and 3‑speaker mixes), AMI (meeting recordings, single distant microphone), and NOTSOFAR (real‑world meetings with 4‑8 participants). All evaluations use cpWER, which jointly penalizes word errors and speaker‑attribution mistakes. Results show that SA‑DiCoW achieves 3.9 % cpWER on Libri2Mix, 5.0 % on Libri3Mix, and 21.0 % on NOTSOFAR, outperforming prior SOT‑based Whisper variants (≈6 % on Libri2Mix) and matching or surpassing DiCoW on synthetic mixtures. On NOTSOFAR, DiCoW still yields a lower cpWER (18.4 %) because it decodes each speaker independently, reducing omission errors; however, SA‑DiCoW’s joint decoding improves handling of overlapping speech. By increasing the cross‑entropy loss weight on speaker‑timestamp tokens fivefold, the authors further reduce cpWER to 20.8 % on NOTSOFAR, demonstrating that stronger supervision on speaker changes mitigates leakage (incorrect speaker assignment) errors.
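For readers unfamiliar with the metric, cpWER concatenates each speaker's utterances into one reference stream per speaker, then takes the word error rate under the best assignment of hypothesis streams to reference streams. A minimal sketch (brute-force over permutations, so only practical for small speaker counts; names are illustrative):

```python
from itertools import permutations

def wer_errors(ref, hyp):
    """Word-level edit distance (substitutions + insertions + deletions)."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[-1][-1]

def cpwer(refs, hyps):
    """Concatenated minimum-permutation WER over per-speaker streams.

    refs, hyps: lists of per-speaker transcripts (equal length assumed).
    Tries every assignment of hypothesis streams to reference speakers
    and returns the lowest total-error rate, so speaker-attribution
    mistakes count as word errors.
    """
    total_words = sum(len(r.split()) for r in refs)
    best = min(
        sum(wer_errors(refs[i], hyps[p]) for i, p in enumerate(perm))
        for perm in permutations(range(len(hyps)))
    )
    return best / max(total_words, 1)
```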
Key contributions:
- A unified architecture that extracts multiple speaker‑conditioned embeddings via DiCoW and feeds them into a single Whisper decoder, enabling joint decoding of overlapping speech.
- Introduction of speaker‑timestamp tokens with dedicated output heads, preserving Whisper’s decoding pipeline while adding speaker attribution.
- Empirical evidence that time‑wise concatenation of speaker embeddings is the most effective aggregation method for preserving speaker‑specific cues.
- Training tricks—two‑stage fine‑tuning, speaker‑order augmentation, and weighted speaker‑timestamp loss—that improve robustness to speaker label errors.
Limitations include reliance on oracle diarization during training and evaluation; the performance with automatic diarization (which may produce noisy STNO masks) remains to be quantified. The model processes audio in 30‑second chunks, which could introduce boundary artifacts in long‑form streaming scenarios. Future work should explore end‑to‑end integration with online diarization, multi‑channel microphone arrays, and real‑time inference optimizations.
Overall, SA‑DiCoW demonstrates that a modest extension of Whisper—adding diarization‑conditioned encoder branches and speaker‑aware token handling—can yield a powerful, speaker‑attributed MT‑ASR system that balances accuracy, computational efficiency, and compatibility with existing large‑scale pre‑trained speech models.