A Comparison of Online Automatic Speech Recognition Systems and the Nonverbal Responses to Unintelligible Speech
Automatic Speech Recognition (ASR) systems have proliferated in recent years to the point that free platforms such as YouTube now provide speech recognition services. Given the wide selection of ASR systems, we contribute to the field of automatic speech recognition by comparing the relative performance of two sets of manual transcriptions and five sets of automatic transcriptions (Google Cloud, IBM Watson, Microsoft Azure, Trint, and YouTube) to help researchers select an accurate transcription service. In addition, we identify nonverbal behaviors that are associated with unintelligible speech, as indicated by high word error rates. We show that manual transcriptions remain superior to current automatic transcriptions and that, among the automatic transcription services, YouTube offers the most accurate results. Regarding nonverbal behavior, we provide evidence that the variability of the listener's smile intensity is high when the speaker is clear and low when the speaker is unintelligible. These findings are derived from videoconferencing interactions between student doctors and simulated patients; we therefore contribute to both the ASR literature and the healthcare communication skills teaching community.
💡 Research Summary
This paper investigates two intertwined research questions: (1) which of today’s publicly available online automatic speech recognition (ASR) services provides the most accurate transcription for real‑world medical‑education video consultations, and (2) what non‑verbal facial cues are associated with high transcription error, i.e., unintelligible speech. The authors collected 24 video‑conferenced consultations between twelve second‑year medical students and two professional simulated patients (SPs) using the EQ‑Clinic platform. Each session lasted 12–18 minutes, was recorded at 640 × 480 px and 25 fps, and yielded a total of 28 480 spoken words.
For every video, the audio track was extracted and fed to seven transcription pipelines: two manual services (a professional transcriber, “CB”, and a crowdsourced service, “Rev”) and five commercial ASR providers—Google Cloud Speech‑to‑Text, IBM Watson Speech‑to‑Text, Microsoft Azure Speech, Trint, and YouTube’s automatic captions. The authors documented each provider’s required input format (FLAC, WAV, MP4), the conversion steps (performed with FFmpeg), and the per‑minute cost (YouTube and Google free, Trint $0.25/min, IBM $0.20/min, Azure $0.08/min).
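As a rough illustration of this conversion step, the sketch below uses FFmpeg (called from Python) to strip the audio track from a consultation video and re‑encode it into FLAC and WAV containers. The file names, codec settings, and format‑to‑provider mapping are illustrative assumptions rather than the paper’s exact pipeline.

```python
import subprocess
from pathlib import Path

def extract_audio(video_path: str, out_dir: str = "audio") -> dict:
    """Strip the audio track from a consultation video and re-encode it
    for upload to the ASR providers. File names, codec settings, and the
    format-to-provider mapping below are illustrative assumptions."""
    Path(out_dir).mkdir(exist_ok=True)
    stem = Path(video_path).stem
    outputs = {
        "flac": Path(out_dir) / f"{stem}.flac",  # e.g. Google Cloud accepts FLAC
        "wav": Path(out_dir) / f"{stem}.wav",    # e.g. IBM Watson / Azure accept WAV
    }
    for out_path in outputs.values():
        # -vn drops the video stream; -ac 1 / -ar 16000 give mono 16 kHz audio,
        # a common (assumed) target format for speech-to-text APIs
        subprocess.run(
            ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1",
             "-ar", "16000", str(out_path)],
            check=True,
        )
    return outputs
```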
Transcription quality was measured with Word Error Rate (WER). Manual transcriptions achieved low WERs (CB ≈ 3 %, Rev ≈ 4 %). All automatic services performed worse, with WERs ranging from 9 % (YouTube) to 20 % (Trint). Google Cloud and IBM Watson produced WERs around 15–17 %, while Azure was the best among the three cloud APIs at ≈ 12 %. The authors note that YouTube’s superior performance may stem from its video‑based pipeline that jointly processes audio and visual cues, even though the overall error rates are higher than those reported on clean benchmark datasets.
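For reference, word error rate is the word‑level Levenshtein (edit) distance between the reference and hypothesis transcripts, normalised by the number of reference words. The minimal implementation below follows that standard definition; the tokenisation and text normalisation choices are assumptions, not details taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed via word-level Levenshtein alignment."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substituted word out of four reference words -> WER = 0.25
print(word_error_rate("the patient feels dizzy", "the patient feels busy"))
```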
To link transcription errors with listener behavior, the study extracted facial Action Units (AUs) using OpenFace: AU12 (smile), AU4 (frown), head nods, and head shakes. Each utterance was labeled as “high‑error” (WER > 30 %) or “low‑error” (WER < 10 %). In high‑error segments, smile intensity averaged 0.32 (SD 0.07) and showed markedly reduced variability, whereas low‑error segments exhibited higher, more variable smiles (mean 0.58, SD 0.12). Frowning and head shaking showed modest increases in high‑error conditions, but these effects did not reach statistical significance given the limited sample size. The pattern suggests that listeners suppress positive facial expressions when they cannot understand the speaker, possibly reflecting increased cognitive load or affective disengagement.
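A sketch of how per‑utterance smile statistics could be aggregated from OpenFace output is shown below. The "AU12_r" and "timestamp" column names follow OpenFace’s CSV conventions, the 30 % / 10 % WER cut‑offs mirror the thresholds quoted above, and the segment structure is a hypothetical reconstruction rather than the authors’ analysis code.

```python
import pandas as pd

def smile_stats(openface_csv: str, segments: list) -> pd.DataFrame:
    """Summarise AU12 (smile) intensity per utterance segment and group by
    WER label. Each item in `segments` is assumed to be a dict of the form
    {"start": seconds, "end": seconds, "wer": float}."""
    frames = pd.read_csv(openface_csv)
    frames.columns = frames.columns.str.strip()  # OpenFace headers carry padding
    rows = []
    for seg in segments:
        clip = frames[(frames["timestamp"] >= seg["start"]) &
                      (frames["timestamp"] < seg["end"])]
        if seg["wer"] > 0.30:
            label = "high-error"
        elif seg["wer"] < 0.10:
            label = "low-error"
        else:
            continue  # mid-range WER segments are excluded from the comparison
        rows.append({
            "label": label,
            "smile_mean": clip["AU12_r"].mean(),  # AU12 intensity (OpenFace "_r" column)
            "smile_sd": clip["AU12_r"].std(),
        })
    return pd.DataFrame(rows).groupby("label").mean()
```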
The authors discuss five plausible sources of ASR error in this context: (1) recording quality degradation due to unstable network or distant microphones, (2) audio compression artifacts introduced during format conversion, (3) domain mismatch between ASR training corpora (general speech) and medical‑consultation dialogue, (4) speaker articulation issues (e.g., dysarthria, heavy accent), and (5) cultural or linguistic factors influencing prosody. They acknowledge that the present study does not isolate these factors, recommending future work to manipulate each variable systematically.
Importantly, the paper proposes a feedback loop where non‑verbal cues inform the ASR system about potential unintelligibility. For instance, a sudden drop in smile intensity could trigger a “low‑confidence” flag, prompting human review or alternative processing. Such a multimodal approach could improve real‑time transcription reliability in tele‑medicine and remote education settings.
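One hypothetical way to realise such a flag is to compare the mean AU12 intensity across consecutive short windows and raise a low‑confidence signal whenever it drops sharply; the window length and threshold below are illustrative values, not parameters proposed in the paper.

```python
def flag_low_confidence(smile_intensities, window=25, drop_threshold=0.2):
    """Hypothetical heuristic: compare mean AU12 intensity across consecutive
    windows (~1 s at 25 fps) and flag a window when the mean drops sharply.
    The window length and threshold are illustrative, not values from the paper."""
    flags = []
    for i in range(window, len(smile_intensities) - window + 1, window):
        prev = sum(smile_intensities[i - window:i]) / window
        curr = sum(smile_intensities[i:i + window]) / window
        flags.append(prev - curr > drop_threshold)
    return flags
```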
In conclusion, while current online ASR services remain inferior to human transcription, YouTube’s free service currently offers the lowest error among the tested platforms. Moreover, the study provides empirical evidence that high transcription error correlates with reduced facial expressivity in listeners, opening avenues for integrating affective signals into ASR pipelines to detect and mitigate unintelligible speech in real‑time communication.