asr_eval: Algorithms and tools for multi-reference and streaming speech recognition evaluation
We propose several improvements to speech recognition evaluation. First, we propose a string alignment algorithm that supports multi-reference labeling, arbitrary-length insertions, and better word alignment. This is especially useful for labeling cluttered or longform speech in non-Latin languages and languages with rich word formation. Second, we collect DiverseSpeech-Ru, a novel test set of longform in-the-wild Russian speech with careful multi-reference labeling. We also perform multi-reference relabeling of a popular Russian test set and study fine-tuning dynamics on its corresponding train set. We demonstrate that the model often adapts to dataset-specific labeling, creating an illusion of metric improvement. Based on the improved word alignment, we develop tools to evaluate streaming speech recognition and to align multiple transcriptions for visual comparison. Additionally, we provide uniform wrappers for many offline and streaming speech recognition models. Our code will be made publicly available.
💡 Research Summary
This paper presents significant advancements in the methodology and tooling for automatic speech recognition (ASR) evaluation, addressing key limitations in handling linguistic diversity, speech disfluencies, and streaming scenarios.
The core technical contribution is the “MWER” string alignment algorithm, an extension of the classic Needleman-Wunsch algorithm. MWER introduces three critical enhancements for practical ASR evaluation. First, it natively supports multi-reference ground truth transcriptions using a syntax like {option1|option2|option3}, allowing multiple acceptable orthographic variants (e.g., numerals vs. words, different inflections, minor typos) to be evaluated without relying on potentially error-prone text normalization. Second, it incorporates a wildcard symbol <*> that can align with any sequence of words without penalty, effectively allowing evaluators to mark poorly heard or indecipherable speech segments and prevent annotation bias from affecting scores. Third, it improves word-to-word alignment by using a tuple-based scoring system. When multiple alignments yield the same minimum word error count, the algorithm prioritizes those with a higher number of correct word matches and lower aggregate character error rate (CER) for those matches. This leads to more intuitive alignments crucial for streaming latency analysis and visual comparison. Additionally, a relaxed penalty for long, oscillatory insertions is proposed to stabilize WER/CER metrics against rare but catastrophic hallucinations from generative models.
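The ideas above can be illustrated with a small dynamic-programming sketch in the spirit of the described algorithm. The `{option1|option2}` syntax and the `<*>` wildcard follow the summary; the `parse_token` helper and the exact tuple cost `(word_errors, -correct_matches)` are illustrative assumptions, not the authors' implementation (which additionally breaks ties by the matches' character error rate and relaxes penalties for long oscillatory insertions).

```python
# Illustrative multi-reference alignment sketch (not the paper's exact MWER code).
# Reference tokens may be plain words, {a|b} option sets, or the <*> wildcard,
# which absorbs any number of hypothesis words at zero cost.

WILDCARD = "<*>"

def parse_token(tok):
    """Return the set of acceptable surface forms for a reference token."""
    if tok.startswith("{") and tok.endswith("}"):
        return set(tok[1:-1].split("|"))
    return {tok}

def align(reference, hypothesis):
    """Return (word_errors, correct_matches) for the best alignment.

    Cost is the tuple (errors, -matches): among alignments with the same
    minimum error count, the one with more exact word matches wins,
    mirroring the tuple-based tie-breaking described above.
    """
    ref = [tok if tok == WILDCARD else parse_token(tok) for tok in reference]
    n, m = len(ref), len(hypothesis)
    INF = (float("inf"), 0)
    # dp[i][j] = best (errors, -matches) aligning ref[:i] with hypothesis[:j]
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0)
    for i in range(n + 1):
        for j in range(m + 1):
            cur = dp[i][j]
            if cur == INF:
                continue
            if i < n and ref[i] == WILDCARD:
                # Wildcard consumes zero or more hypothesis words for free.
                dp[i + 1][j] = min(dp[i + 1][j], cur)
                if j < m:
                    dp[i][j + 1] = min(dp[i][j + 1], cur)
            elif i < n:
                # Deletion: reference word missing from hypothesis.
                dp[i + 1][j] = min(dp[i + 1][j], (cur[0] + 1, cur[1]))
            if j < m and (i == n or ref[i] != WILDCARD):
                # Insertion: extra hypothesis word.
                dp[i][j + 1] = min(dp[i][j + 1], (cur[0] + 1, cur[1]))
            if i < n and j < m and ref[i] != WILDCARD:
                # Match (any accepted variant) or substitution.
                hit = hypothesis[j] in ref[i]
                cand = (cur[0] + (0 if hit else 1), cur[1] - (1 if hit else 0))
                dp[i + 1][j + 1] = min(dp[i + 1][j + 1], cand)
    errors, neg_matches = dp[n][m]
    return errors, -neg_matches

ref = "he paid {5|five} dollars <*> yesterday".split()
hyp = "he paid five dollars um uh yesterday".split()
print(align(ref, hyp))  # → (0, 5): the filler words fall under the wildcard
```

Note how the multi-reference option set accepts `five` without text normalization, and the wildcard absorbs the disfluent `um uh` segment without inflating the insertion count.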
Empirically, the authors collect and release “DiverseSpeech-Ru,” a novel dataset of long-form, in-the-wild Russian speech annotated with the proposed multi-reference and wildcard syntax. They further re-annotate an existing popular Russian test set using the same principles. A key finding from fine-tuning experiments on this data is that models exhibit different learning dynamics and final performance when evaluated on the original (normalized) dataset versus the multi-reference re-annotated version. This suggests that models can overfit to dataset-specific labeling conventions, creating an illusion of metric improvement that may be mistaken for genuine model advancement. This underscores the importance of multi-reference evaluation for fair model comparison, especially for non-Latin languages with rich morphology.
To operationalize these ideas, the authors develop asr_eval, an open-source Python library. It provides a comprehensive suite of tools including: the full evaluation pipeline with the MWER algorithm; an interactive dashboard for visualizing and comparing multiple model transcriptions with error highlighting; a unified wrapper interface for numerous offline and streaming ASR models to facilitate consistent inference; building blocks for handling long-form audio (e.g., VAD wrappers); and specialized tools for streaming ASR evaluation, such as time-remapping for efficiency and streaming alignment diagrams for per-sample analysis.
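A minimal sketch of the streaming-latency idea (purely conceptual, not the `asr_eval` API): once an alignment pairs each correctly matched word's audio end time with the moment the streaming model emitted it, per-word latency is just the difference, and summary statistics follow directly.

```python
# Conceptual illustration of word-level streaming latency, assuming an
# upstream alignment has produced (ref_end_time_s, emit_time_s) pairs for
# correctly matched words. The function name and input shape are assumptions.
from statistics import mean, median

def word_latencies(matched_pairs):
    """Latency per matched word: emission time minus reference word end time."""
    return [emit - ref_end for ref_end, emit in matched_pairs]

# Hypothetical matched pairs from one utterance.
matched = [(0.8, 1.1), (1.5, 1.9), (2.4, 2.6)]
lats = word_latencies(matched)
print(round(mean(lats), 3), round(median(lats), 3))  # → 0.3 0.3
```

This is why the improved word-to-word alignment matters for streaming evaluation: a spurious match pairs a hypothesis word with the wrong reference word, corrupting exactly these per-word timing differences.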
In summary, this work elevates ASR evaluation from a simplistic string edit distance calculation to a more nuanced framework that acknowledges the inherent variability and uncertainty in real-world speech. The proposed algorithms, datasets, and tools address critical gaps in evaluating ASR systems for diverse languages and challenging conditions, paving the way for more robust and reliable assessment in both research and production settings.