Huntington Disease Automatic Speech Recognition with Biomarker Supervision
Automatic speech recognition (ASR) for pathological speech remains underexplored, especially for Huntington’s disease (HD), where irregular timing, unstable phonation, and articulatory distortion challenge current models. We present a systematic HD-ASR study using a high-fidelity clinical speech corpus not previously used for end-to-end ASR training. We compare multiple ASR families under a unified evaluation, analyzing WER as well as substitution, deletion, and insertion patterns. HD speech induces architecture-specific error regimes, with Parakeet-TDT outperforming encoder-decoder and CTC baselines. HD-specific adaptation reduces WER from 6.99% to 4.95%. We further propose biomarker-based auxiliary supervision and show that it reshapes error behavior in severity-dependent ways rather than uniformly improving WER. We open-source all code and models.
💡 Research Summary
This paper addresses the largely unexplored problem of automatic speech recognition (ASR) for Huntington’s disease (HD), a hyperkinetic dysarthria characterized by irregular timing, unstable phonation, and articulatory distortion. The authors assemble a high‑fidelity clinical corpus comprising 4.5 hours of recordings from 94 HD‑positive individuals and 36 healthy controls, stratified across disease stages (pre‑HD, prodromal, manifest). Using a speaker‑independent 70/10/20 split, they evaluate five state‑of‑the‑art ASR models spanning three families—Whisper (small, medium, large‑v2), Parakeet‑TDT, and Meta Omnilingual CTC—under a unified preprocessing and scoring pipeline. Parakeet‑TDT (a transducer model) achieves a strikingly low zero‑shot word error rate (WER) of 6.99 %, far outperforming Whisper (18–26 %) and the CTC baseline (30 %). Error‑type analysis reveals that Whisper models are insertion‑dominated (≈ 75 % of errors), whereas Parakeet‑TDT exhibits a more balanced distribution of substitutions, deletions, and insertions, indicating architecture‑specific failure modes.
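The error-type analysis above rests on a word-level edit-distance alignment that splits total errors into substitutions (S), deletions (D), and insertions (I). A minimal sketch of such a breakdown (illustrative only, not the authors' scoring pipeline) might look like:

```python
# Sketch: word-level Levenshtein alignment that decomposes WER into
# substitutions (S), deletions (D), and insertions (I), as in the
# paper's error-type analysis. Illustrative, not the authors' code.

def wer_breakdown(reference: str, hypothesis: str):
    """Return (wer, S, D, I) for a reference/hypothesis sentence pair."""
    ref, hyp = reference.split(), hypothesis.split()
    R, H = len(ref), len(hyp)
    # dp[i][j] = (total_errors, subs, dels, ins) aligning ref[:i] to hyp[:j]
    dp = [[(0, 0, 0, 0)] * (H + 1) for _ in range(R + 1)]
    for i in range(1, R + 1):
        dp[i][0] = (i, 0, i, 0)          # delete all remaining reference words
    for j in range(1, H + 1):
        dp[0][j] = (j, 0, 0, j)          # insert all remaining hypothesis words
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]          # match: no error added
                continue
            c_sub = dp[i - 1][j - 1][0]
            c_del = dp[i - 1][j][0]
            c_ins = dp[i][j - 1][0]
            best = min(c_sub, c_del, c_ins)
            if best == c_sub:
                _, s, d, n = dp[i - 1][j - 1]
                dp[i][j] = (c_sub + 1, s + 1, d, n)  # substitution
            elif best == c_del:
                _, s, d, n = dp[i - 1][j]
                dp[i][j] = (c_del + 1, s, d + 1, n)  # deletion
            else:
                _, s, d, n = dp[i][j - 1]
                dp[i][j] = (c_ins + 1, s, d, n + 1)  # insertion
    errors, S, D, I = dp[R][H]
    return errors / max(R, 1), S, D, I

# Example: one substitution ("slowly" -> "slow") plus one inserted word.
print(wer_breakdown("the patient spoke slowly today",
                    "the patient spoke slow today today"))
```

An insertion-dominated model (like Whisper here) would show I dwarfing S and D in this decomposition, whereas a balanced profile (like Parakeet-TDT's) spreads errors across all three counts.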
Building on this finding, the authors fine‑tune Parakeet‑TDT for HD speech via parameter‑efficient fine‑tuning (PEFT), placing adapter modules on the encoder while keeping the backbone frozen. This adaptation reduces WER to 4.95 % and markedly cuts deletion errors, demonstrating that modest, targeted updates can yield substantial gains even with limited pathological data.
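A minimal PyTorch sketch of the idea: bottleneck adapters with a residual connection are wrapped around frozen encoder layers, so only the small adapter weights train. The toy `nn.Linear` stack stands in for the real Parakeet‑TDT encoder, whose actual adapter API (via NVIDIA NeMo) differs; the Houlsby‑style down‑project/activation/up‑project shape is an assumption about the paper's setup.

```python
# Sketch: adapter-based PEFT with a frozen backbone. The stand-in
# encoder and adapter shape are assumptions, not the authors' code.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project -> GELU -> up-project, residual."""
    def __init__(self, dim: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)   # zero-init so the adapter starts
        nn.init.zeros_(self.up.bias)     # as an identity mapping

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class AdaptedLayer(nn.Module):
    """Wraps one frozen encoder layer with a trainable adapter."""
    def __init__(self, layer: nn.Module, dim: int):
        super().__init__()
        self.layer, self.adapter = layer, Adapter(dim)

    def forward(self, x):
        return self.adapter(self.layer(x))

dim = 64
encoder = nn.Sequential(*[nn.Linear(dim, dim) for _ in range(4)])  # toy backbone
for p in encoder.parameters():
    p.requires_grad = False              # freeze the pretrained weights

adapted = nn.Sequential(*[AdaptedLayer(layer, dim) for layer in encoder])
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
total = sum(p.numel() for p in adapted.parameters())
print(f"trainable parameters: {trainable}/{total}")
```

Because the up-projection is zero-initialized, the adapted model reproduces the frozen backbone exactly at the start of fine-tuning, which keeps early training stable with limited pathological data.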
The third contribution explores auxiliary supervision derived from clinically validated biomarkers. Seven interpretable features are extracted across three motor‑speech subsystems: (1) prosody (speech rate proxy, pause‑to‑speech ratio, fundamental‑frequency variance), (2) phonation (local jitter, local shimmer, harmonics‑to‑noise ratio), and (3) articulation (vowel space area variance). After z‑normalization against controls, the continuous values are discretized into low/medium/high bins and used as auxiliary classification targets. During training, a linear head predicts these labels from a masked mean‑pooled encoder representation, and the total loss combines the standard ASR cross‑entropy with a weighted biomarker loss (λ = 0.1).
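The z-normalization and binning step described above can be sketched as follows. The ±1 z-score bin edges and the 1e-8 variance guard are assumptions for illustration; the paper does not specify its exact bin boundaries here.

```python
# Sketch of the biomarker-label pipeline: z-normalize each feature
# against the healthy-control distribution, discretize into
# low/medium/high auxiliary targets, and combine the losses.
# Bin edges of +/-1 z are an assumed choice, not from the paper.
import numpy as np

def biomarker_bins(values, control_values, lo=-1.0, hi=1.0):
    """Map raw feature values to 0=low, 1=medium, 2=high via control z-scores."""
    mu = np.mean(control_values)
    sigma = np.std(control_values) + 1e-8    # guard against zero variance
    z = (np.asarray(values, dtype=float) - mu) / sigma
    return np.digitize(z, [lo, hi])          # z < lo -> 0, lo <= z < hi -> 1, z >= hi -> 2

def total_loss(asr_loss, biomarker_loss, lam=0.1):
    """Combined objective: ASR loss plus weighted auxiliary biomarker loss."""
    return asr_loss + lam * biomarker_loss

# Example: jitter-like feature values scored against a control cohort.
controls = [1.0, 2.0, 3.0, 4.0, 5.0]
print(biomarker_bins([3.0, 0.0, 10.0], controls))  # one value per bin
print(total_loss(1.0, 0.5))                        # lambda = 0.1 weighting
```

In the paper's setup the biomarker loss itself is a classification loss over these bins, computed by a linear head on the masked mean-pooled encoder representation; the scalar combination with λ = 0.1 matches the weighting stated above.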
Results show that none of the biomarker‑augmented models surpass the plain HD‑adapted Parakeet in overall WER (they range from 6.07 % to 6.44 %). However, each auxiliary condition reshapes the error profile: prosody supervision reduces insertions, phonation supervision lowers substitutions, and articulation supervision slightly increases deletions. This “severity‑dependent reshaping” indicates that biomarker signals steer the encoder toward clinically relevant acoustic structure without uniformly improving transcription accuracy.
In summary, the paper makes four key contributions: (1) introduction of the first publicly described HD speech corpus for end‑to‑end ASR, (2) systematic cross‑architecture benchmarking that highlights the superiority of transducer models for hyperkinetic speech, (3) demonstration that lightweight adapter‑based adaptation yields significant WER reductions, and (4) evidence that clinically grounded biomarker supervision can modulate error types even if it does not further lower overall WER. All code, models, and data processing scripts are open‑sourced, providing a solid foundation for future research on pathological speech recognition across neurological disorders.