Chain of Correction for Full-text Speech Recognition with Large Language Models
Full-text error correction with Large Language Models (LLMs) for Automatic Speech Recognition (ASR) is attracting increased attention for its ability to address a wide range of error types, such as punctuation restoration and inverse text normalization, across long contexts. However, challenges remain regarding stability, controllability, completeness, and fluency. To mitigate these issues, this paper proposes the Chain of Correction (CoC), which uses a multi-turn chat format to correct errors segment by segment, guided by pre-recognized text and full-text context for better semantic understanding. Utilizing the open-sourced ChFT dataset, we fine-tune a pre-trained LLM to evaluate CoC’s performance. Experiments show that CoC significantly outperforms baseline and benchmark systems in correcting full-text ASR outputs. We also analyze correction thresholds to balance under-correction and over-rephrasing, extrapolate CoC to extra-long ASR outputs, and explore using other types of information to guide error correction.
💡 Research Summary
The paper introduces a novel paradigm called Chain of Correction (CoC) for full‑text error correction of Automatic Speech Recognition (ASR) outputs using Large Language Models (LLMs). Unlike previous approaches that process an entire transcript at once—often outputting error‑correction pairs in JSON—CoC treats the correction as a multi‑turn chat. The full pre‑recognized transcript and a concise correction instruction are provided in the first turn. The transcript is then split into short segments (1‑5 sentences). In each subsequent turn the user supplies one segment; the assistant (the LLM) returns a corrected version, using both the segment and the whole pre‑recognized text as context. Corrected segments from earlier turns are added to the context for later turns, forming a “chain” of incremental improvements.
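The turn structure described above can be sketched as follows. This is a minimal illustration assuming an OpenAI-style message list and a caller-supplied `correct_fn` that wraps the LLM call; the instruction wording and segmentation are illustrative, not the paper's exact prompt.

```python
def run_coc(full_transcript: str, segments: list[str], correct_fn) -> str:
    """Chain of Correction over a pre-segmented transcript (sketch).

    Turn 1 carries the full pre-recognized text plus a correction
    instruction; each later user turn supplies one segment, and the
    assistant's corrected segment stays in context for subsequent turns.
    """
    messages = [{
        "role": "user",
        "content": ("Correct the ASR errors in this transcript; I will "
                    "send it to you segment by segment.\n\n" + full_transcript),
    }]
    corrected_segments = []
    for seg in segments:
        messages.append({"role": "user", "content": seg})
        fixed = correct_fn(messages)  # one LLM call per segment
        # Appending the assistant turn is what forms the "chain":
        # earlier corrections are visible when later segments are fixed.
        messages.append({"role": "assistant", "content": fixed})
        corrected_segments.append(fixed)
    return "".join(corrected_segments)
```

In a real system `correct_fn` would call the fine-tuned model; here any callable over the message list works, which makes the control flow easy to test in isolation.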
Key advantages of CoC are:
- Stability – By focusing on short segments, the model avoids hallucinations and over‑rephrasing that often occur when generating long outputs.
- Controllability – A “Correction Threshold” parameter governs how aggressive the correction is. After each turn the error rate between the original and corrected segment is computed; if it exceeds the threshold the correction is rejected or revised. Experiments show thresholds of 0.3–0.4 give the best trade‑off.
- Completeness – The full pre‑recognized transcript is always present, allowing the model to discover errors without needing explicit position tags, which reduces confusion and missed errors.
- Fluency – Instead of swapping isolated error tokens, the model regenerates the whole segment, leveraging the LLM’s next‑token prediction to produce naturally fluent text.
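The Correction Threshold check from the list above can be sketched in a few lines. Here the error rate is taken as Levenshtein distance normalized by the original segment length, which may differ in detail from the paper's exact metric; the accept/reject rule is the key idea.

```python
def char_error_rate(ref: str, hyp: str) -> float:
    """Levenshtein distance between two strings, normalized by len(ref)."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,            # deletion
                         cur[j - 1] + 1,         # insertion
                         prev[j - 1] + (ref[i - 1] != hyp[j - 1]))  # substitution
        prev = cur
    return prev[n] / max(m, 1)

def apply_threshold(original: str, corrected: str, threshold: float = 0.3) -> str:
    """Reject a correction that diverges too far from the recognized segment."""
    return corrected if char_error_rate(original, corrected) <= threshold else original
```

With the paper's adopted threshold of 0.3, a light edit passes through while a wholesale rewrite falls back to the original segment, which is the over-rephrasing guard described above.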
The authors fine‑tuned a 7‑billion‑parameter internal LLM (Hunyuan‑7B‑Dense‑Pretrain‑256k‑V2) on the Chinese Full‑text Error Correction Dataset (ChFT), which contains 41,651 articles generated via a TTS‑ASR pipeline. Training used 16 A100 GPUs for roughly one epoch. Evaluation was performed on three test splits: Homogeneous (general), Hard (noisy, challenging), and Up‑to‑date (articles published after July 2024, unseen during pre‑training). CoC was compared against the baseline ASR, a prior segment‑JSON method, and a massive 671‑billion‑parameter model (DeepSeek‑R1).
Results (Table 1) show that CoC consistently outperforms all baselines. For the Homogeneous set, Mandarin error rate drops from 6.16 % (baseline) to 4.06 % (CoC), a 34.09 % relative reduction; overall error reduction reaches 44.25 %. Similar gains are observed on Hard and Up‑to‑date sets, with CoC still achieving a 29.82 % reduction on the newest data, demonstrating good generalization. DeepSeek‑R1 underperforms due to over‑rephrasing, highlighting the value of task‑specific fine‑tuning even with smaller models.
The paper also investigates the effect of the Correction Threshold. Raising the threshold increases the proportion of accepted corrections and generally improves performance, but overly high values cause over‑rephrasing and a slight dip in accuracy. A threshold of 0.3 is adopted for subsequent experiments.
To test scalability, the authors built an extra‑long test set from the IndustryCorpus2 collection, selecting 100 articles with at least 12 k characters (up to ~80 k characters, ~4 h audio). The model’s 256 k token context window allowed processing messages up to ~160 k tokens per article. On this set, CoC reduces Mandarin error rate by 18.48 % and overall error by 13.58 %, confirming its applicability to very long documents.
An additional experiment replaces the original hypothesis guidance with pinyin representations of the pre‑recognized segments. While pinyin‑guided CoC yields slightly higher error rates than hypothesis‑guided CoC, it still surpasses the baseline, suggesting that phonetic or other speech‑derived cues can serve as useful auxiliary signals.
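A minimal sketch of the pinyin-guided variant: the user turn carries a romanized form of the segment instead of the raw hypothesis text. A real pipeline would use a full converter such as the pypinyin library; the tiny character map below is purely illustrative.

```python
# Illustrative character-to-pinyin map; a real system would use a complete
# converter (e.g. pypinyin) covering the full Chinese character inventory.
TOY_PINYIN = {"你": "ni", "好": "hao", "世": "shi", "界": "jie"}

def segment_to_pinyin(segment: str) -> str:
    """Replace each known character with its pinyin; pass others through."""
    return " ".join(TOY_PINYIN.get(ch, ch) for ch in segment)

def pinyin_guided_turn(segment: str) -> dict:
    """Build a user turn carrying pinyin guidance instead of the hypothesis."""
    return {"role": "user", "content": segment_to_pinyin(segment)}
```

Because pinyin preserves pronunciation while discarding the (possibly wrong) characters, it gives the model a phonetic cue without anchoring it to the recognizer's mistakes, consistent with the finding that pinyin guidance still beats the baseline.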
Qualitative analysis highlights CoC’s ability to fix VAD‑induced premature punctuation, restore special Chinese punctuation (e.g., 《》), remove filler words and repetitions, correct case in code‑switched English, resolve coreferences, and amend named entities—all tasks that are difficult for sentence‑level methods.
In conclusion, Chain of Correction offers a robust, controllable, and fluent framework for full‑text ASR error correction. It achieves substantial error reductions across diverse test conditions, scales to extra‑long inputs, and can incorporate alternative guidance signals such as pinyin. Future work will explore multilingual extensions, dynamic segment sizing, integration of external knowledge sources (search engines, user history), and broader objectives like spoken‑to‑written conversion.