Disability-First AI Dataset Annotation: Co-designing Stuttered Speech Annotation Guidelines with People Who Stutter

Despite efforts to increase the representation of disabled people in AI datasets, accessibility datasets are often annotated by crowdworkers without disability-specific expertise, leading to inconsistent or inaccurate labels. This paper examines these annotation challenges through a case study of annotating speech data from people who stutter (PWS). Given the variability of stuttering and differing views on how it manifests, annotating and transcribing stuttered speech remains difficult, even for trained professionals. Through interviews and co-design workshops with PWS and domain experts, we identify challenges in stuttered speech annotation and develop practices that integrate the lived experiences of PWS into the annotation process. Our findings highlight the value of embodied knowledge in improving dataset quality, while revealing tensions between the complexity of disability experiences and the rigidity of static labels. We conclude with implications for disability-first and multiplicity-aware approaches to data interpretation across the AI pipeline.


💡 Research Summary

This paper tackles a persistent source of bias in accessibility-focused AI systems: noisy or inaccurate annotations of data contributed by disabled users. While recent efforts have emphasized "disability-first" data collection, where datasets are sourced directly from people with disabilities, the annotation stage remains dominated by crowdworkers or experts who lack lived experience with the target community. The authors illustrate this problem through a case study of stuttered speech, a domain where the variability of disfluencies and the subtlety of non-verbal cues make reliable labeling especially difficult.

The authors first document the shortcomings of existing stuttered-speech corpora. The largest publicly available set, SEP-28k, contains over 28,000 three-second clips harvested from podcasts featuring people who stutter (PWS). However, the annotators' backgrounds are undocumented, the annotation schema is coarse (only a few binary tags for blocks, prolongations, etc.), and inter-rater agreement is alarmingly low (Cohen's κ ranging from 0.11 to 0.39). Similar problems have been reported in other accessibility datasets such as VizWiz (photos taken by blind users) and sign-language corpora, where crowdworkers often misinterpret the content because they lack contextual knowledge.
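For context, Cohen's κ corrects raw agreement for the agreement expected by chance, which is why values between 0.11 and 0.39 signal a serious labeling problem. A minimal, self-contained sketch of the statistic using scikit-learn and invented labels (not SEP-28k data):

```python
# Toy illustration of Cohen's kappa, the inter-rater agreement statistic
# cited above. The labels are invented for demonstration, not drawn from
# SEP-28k or from the paper.
from sklearn.metrics import cohen_kappa_score

# Two annotators tag the same ten clips for "block" (1) vs. "no block" (0).
annotator_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# Raw agreement is 7/10, but chance agreement here is 0.5, so kappa lands
# at (0.7 - 0.5) / (1 - 0.5) = 0.40, roughly the low range cited above.
print(f"Cohen's kappa = {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```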

To address these gaps, the study adopts a “disability‑first” co‑design methodology that involves PWS directly in the annotation pipeline. The research proceeds in three iterative phases:

  1. Formative Study – Interviews with AI professionals who stutter and with speech‑language pathologists (SLPs) to surface current annotation practices, pain points, and divergent understandings of what counts as a stutter.
  2. Co-Design Workshops – Joint sessions where PWS, SLPs, and the research team develop a new annotation guideline. Key innovations include (a) explicit inclusion of non-verbal acoustic cues (breathing patterns, micro-pauses, subtle articulatory movements) as annotatable events, (b) allowance for multiple simultaneous labels on a single utterance (e.g., a segment can be both a prolongation and a block; see the record sketch after this list), and (c) a "subjective consistency" metric that captures how consistently annotators apply the guidelines rather than forcing a single "ground truth".
  3. Evaluation Sessions – Participants apply the co‑designed guidelines to annotate their own speech. The resulting annotations are compared with those produced by conventional annotators.
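To make the multi-label design concrete, below is a hypothetical annotation record in Python; the class and field names are illustrative assumptions, not the schema published in the paper's appendix:

```python
# Hypothetical record illustrating three ideas from the co-designed
# guideline: non-verbal cues as annotatable events, multiple simultaneous
# labels per segment, and per-annotator records that preserve disagreement
# instead of forcing one "ground truth". All names are illustrative.
from dataclasses import dataclass, field

@dataclass
class StutterAnnotation:
    clip_id: str
    start_s: float                                          # segment start (seconds)
    end_s: float                                            # segment end (seconds)
    labels: set[str] = field(default_factory=set)           # e.g. {"block", "prolongation"}
    nonverbal_cues: set[str] = field(default_factory=set)   # e.g. {"audible_breath"}
    annotator_id: str = ""                                  # keeps disagreement visible
    confidence: float = 1.0                                 # annotator's own certainty (0-1)

# A single segment can legitimately carry several labels at once.
seg = StutterAnnotation(
    clip_id="ep12_0043", start_s=1.20, end_s=2.05,
    labels={"block", "prolongation"},
    nonverbal_cues={"audible_breath", "micro_pause"},
    annotator_id="pws_03", confidence=0.8,
)
print(sorted(seg.labels))  # ['block', 'prolongation']: two labels coexist
```

Keeping annotator_id and confidence on every record is what enables a "subjective consistency" style of evaluation: one can ask whether each annotator applies the guideline stably over time, rather than whether all annotators converge on a single label.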

The empirical findings are striking. When the new guidelines are used, inter-annotator agreement improves to an average κ of 0.58, a substantial jump from the original scores. Moreover, the proportion of annotations that capture non-verbal cues rises from 73% to 91%, and the frequency of "uncertain" tags drops from 22% to 8%. These metrics demonstrate that incorporating embodied knowledge from PWS not only yields more reliable labels but also enriches the dataset with information that standard pipelines routinely discard (e.g., breathing pauses that are often trimmed during ASR training).

Beyond the technical contributions, the paper makes three broader scholarly claims:

  • Methodological – It extends the disability‑first paradigm from data collection to data interpretation, showing that community involvement can be systematized throughout the annotation lifecycle.
  • Artifact – The co-designed guideline set (provided in the appendix) is the first publicly documented, PWS-centered schema for stuttered-speech annotation. It explicitly warns against the common practice of excising non-speech audio during model training, thereby preserving signals that are meaningful to the speaker (a preprocessing sketch follows this list).
  • Empirical – The study surfaces a set of under‑explored challenges (e.g., the inherent subjectivity of stutter identification, the limits of acoustic cues alone) that have implications for any accessibility dataset where the target phenomenon is dynamic, context‑dependent, and socially constructed.
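As a concrete illustration of the preprocessing pitfall flagged in the Artifact point, the sketch below uses librosa's standard energy-based trimming as a stand-in for the generic "excise non-speech audio" step; the file name and the 20 dB threshold are assumptions:

```python
# Sketch of the common ASR preprocessing step the guideline warns against:
# energy-based trimming removes low-energy spans, which can delete silent
# blocks and breathing pauses that are meaningful stuttering events.
import librosa

y, sr = librosa.load("clip.wav", sr=16000)        # "clip.wav" is a placeholder

# Typical (problematic) step: trim anything quieter than 20 dB below peak.
y_trimmed, _ = librosa.effects.trim(y, top_db=20)

# Guideline-aligned alternative: keep the full waveform and surface the
# low-energy spans as candidate pause events for annotation instead.
voiced = librosa.effects.split(y, top_db=20)      # voiced intervals (in samples)
pauses = [(voiced[i][1] / sr, voiced[i + 1][0] / sr)
          for i in range(len(voiced) - 1)]        # gaps between voiced spans
print(f"kept {len(y) / sr:.1f}s of audio; {len(pauses)} candidate pauses")
```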

The authors situate their work within a broader critique of AI pipelines that treat “ground truth” as an objective, static artifact. They argue that disability experiences, like stuttering, are fluid and situated, resisting the binary, categorical labels that dominate most machine‑learning datasets. Consequently, they call for a shift toward multiplicity‑aware annotation practices that respect interpretive flexibility and embed the lived expertise of disabled communities at every stage—from data sourcing to model evaluation.

Limitations are acknowledged: the participant pool is relatively small and primarily U.S.-based, and the study focuses on English‑language speech. Future work is proposed to scale the co‑design process across languages and cultures, and to integrate the guidelines into semi‑automated annotation tools that can assist both expert and crowd annotators.

In sum, the paper demonstrates that a disability‑first, co‑design approach can dramatically improve the quality and relevance of accessibility datasets. By foregrounding the embodied knowledge of people who stutter, the authors produce a richer, more reliable annotation schema and set a precedent for inclusive, interpretive‑flexible data practices that can be generalized to other domains of AI for accessibility.

