Enabling Automatic Disordered Speech Recognition: An Impaired Speech Dataset in the Akan Language

Notice: This research summary and analysis were automatically generated using AI. For accuracy, please refer to the original arXiv source.

The lack of impaired speech data hinders the development of inclusive speech technologies, particularly in low-resource languages such as Akan. To address this gap, this study presents a curated corpus of speech samples from native Akan speakers with speech impairments. The dataset comprises 50.01 hours of audio recordings spanning four classes of impaired speech: stammering, cerebral palsy, cleft palate, and stroke-induced speech disorder. Recordings were made in controlled, supervised environments where participants described pre-selected images in their own words. The resulting dataset is a collection of audio recordings, transcriptions, and associated metadata on speaker demographics, class of impairment, recording environment, and device. The dataset is intended to support research in low-resource automatic disordered speech recognition systems and assistive speech technology.


💡 Research Summary

The paper addresses a critical gap in speech technology: the scarcity of impaired‑speech data for low‑resource languages, focusing on Akan, a major Ghanaian language. Existing commercial speech assistants such as Alexa and Siri are trained on standard speech and fail to recognize the atypical acoustic patterns exhibited by speakers with speech disorders. Moreover, no publicly available corpus exists for automatic disordered speech recognition (ADSR) in Akan. To remedy this, the authors present a newly curated dataset—UGAkan‑ImpairedSpeechData—comprising 14,312 audio recordings that total 50.01 hours of speech from native Akan speakers with four distinct impairment classes: stammering, cerebral palsy, cleft palate, and stroke‑induced speech disorder.

Data collection was carried out using a modified version of the UGSpeechData mobile application, originally designed for standard Akan speech. Key modifications include: removal of pause restrictions to preserve natural hesitations and filler sounds; expansion of the visual prompt set from 1,000 to 1,200 images, many of which are simpler to ensure that participants with cognitive or motor challenges can describe them; extension of the maximum recording length from 30 seconds to 60 seconds to accommodate slower articulation and repetitions; addition of aetiology fields to the user profile for precise labeling of impairment type; and disabling of in‑app validation/transcription features to allow offline, expert‑driven transcription. The app also incorporates a real‑time noise‑level check that blocks recording when ambient sound exceeds a predefined threshold, thereby improving signal‑to‑noise ratio.
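The paper does not detail how the app's noise-level check is implemented; a minimal sketch of the idea, assuming 16-bit PCM samples and a hypothetical RMS threshold, might look like this:

```python
import math

NOISE_RMS_THRESHOLD = 500  # hypothetical cutoff on 16-bit PCM amplitude


def ambient_rms(samples):
    """Root-mean-square amplitude of a window of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def recording_allowed(samples, threshold=NOISE_RMS_THRESHOLD):
    """Gate recording: return False when ambient noise exceeds the threshold."""
    return ambient_rms(samples) <= threshold
```

In practice the check would run on a short pre-recording capture window; the threshold value here is illustrative, not the one used by UGSpeechData.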

For transcription, the authors developed a custom desktop application built with Python (Tkinter for GUI, Pygame for audio playback, Pillow for image rendering). The tool displays the audio file alongside its associated image prompt, provides standard playback controls, and integrates a dedicated Akan keyboard that supports characters such as “ɔ” and “ɛ”. Transcriptions are entered into an editable Excel spreadsheet that autosaves after each entry, preventing data loss. The application is packaged with PyInstaller for easy distribution across Windows machines.

The collection pipeline followed a multi‑stage methodology: (1) participant screening and consent, conducted via convenience sampling from speech‑therapy clinics in Kumasi and Accra; (2) recording sessions where participants described images in their own words, with assistance from therapists when necessary; (3) validation and transcription performed by eight linguistically trained native Asante Twi speakers. Transcription was a two‑phase process: each audio file was independently transcribed by two experts; if the transcripts matched (or were near‑identical) the file was accepted. In cases of disagreement, the file was reassigned to a new pair of experts for a third independent review. This inter‑rater reliability design helped to surface systematic ambiguities and reduce subjective bias.

The authors provide exhaustive transcription guidelines. Transcribers are instructed to capture every utterance exactly as heard, including repetitions, self‑corrections, elongated sounds, and non‑verbal vocalizations (e.g., “eeee”, “aaaa”). They must retain original spelling, punctuation, and capitalization according to standard Akan orthography, avoid correcting grammatical errors, and spell out numbers while flagging personally identifiable information (PII) with the keyword “reject”. English loanwords must be transliterated into Akan (e.g., “kɔmputa” for “computer”), and any uncertain transliteration is marked with the keyword “language” for later review. Specific rules are also given for handling stretched words, split words, unfinished words, and stammering‑related blockings, ensuring consistent treatment of disorder‑specific phenomena across the corpus.

Metadata accompanying each recording includes speaker ID, age, gender, impairment class, recording date, environment description, device used, and the image prompt identifier. This rich annotation enables downstream researchers to perform condition‑specific analyses, domain adaptation, or multi‑task learning that leverages speaker or environment attributes.

The dataset is released under an open‑access license via a GitHub repository and assigned a Zenodo DOI, facilitating reproducibility and community adoption. The authors anticipate several research avenues: (a) training acoustic models that are robust to disfluencies and atypical phonation patterns; (b) exploring transfer learning from the larger standard‑Akan corpora to the impaired‑speech domain; (c) evaluating multilingual or cross‑lingual ADSR techniques that can benefit other low‑resource languages; and (d) developing assistive applications (e.g., speech‑to‑text, voice‑controlled interfaces) tailored for speakers with speech disorders in Ghana and the broader West African region.

In conclusion, this work delivers the first comprehensive, high‑quality impaired‑speech corpus for Akan, complete with detailed transcription protocols, rigorous validation procedures, and extensive speaker metadata. By openly sharing the resource, the authors lay a solid foundation for inclusive speech technology research in low‑resource settings, encouraging the community to build more equitable ASR systems that serve speakers with diverse communication abilities.

