Audio MultiChallenge: A Multi-Turn Evaluation of Spoken Dialogue Systems on Natural Human Interaction


End-to-end (E2E) spoken dialogue systems are increasingly replacing cascaded pipelines for voice-based human-AI interaction, processing raw audio directly without intermediate transcription. Existing benchmarks primarily evaluate these models on synthetic speech and single-turn tasks, leaving realistic multi-turn conversational ability underexplored. We introduce Audio MultiChallenge, an open-source benchmark that evaluates E2E spoken dialogue systems under natural multi-turn interaction patterns. Building on the text-based MultiChallenge framework, which evaluates Inference Memory, Instruction Retention, and Self Coherence, we introduce a new axis, Voice Editing, that tests robustness to mid-utterance speech repairs and backtracking. We further adapt each axis to the audio modality, for example by introducing Audio-Cue challenges for Inference Memory that require recalling ambient sounds and paralinguistic signals beyond semantic content. We curate 452 conversations from 47 speakers with 1,712 instance-specific rubrics through a hybrid audio-native agentic and human-in-the-loop pipeline that exposes model failures at scale while preserving the natural disfluencies found in unscripted human speech. Our evaluation of proprietary and open-source models reveals that even frontier models struggle on our benchmark, with our highest-performing model, Gemini 3 Pro Preview (Thinking), achieving only a 54.65% pass rate. Error analysis shows that models fail most often on our new axes and that Self Coherence degrades with longer audio context. These failures reflect the difficulty of tracking edits, audio cues, and long-range context in natural spoken dialogue. Audio MultiChallenge provides a reproducible testbed to quantify these failures and drive improvements in audio-native multi-turn interaction capability.


💡 Research Summary

Audio MultiChallenge introduces a comprehensive, open‑source benchmark designed to evaluate end‑to‑end (E2E) spoken dialogue systems under realistic, multi‑turn human interaction. Building on the text‑based MultiChallenge framework, the authors define four evaluation axes: Inference Memory, Instruction Retention, Self Coherence, and a novel Voice Editing axis that specifically targets the challenges of mid‑utterance speech repairs and backtracking unique to spoken language. Inference Memory is further expanded with Audio‑Cue tasks that require models to recall ambient sounds, speaker emotion, or other paralinguistic cues that are not explicitly verbalized. Instruction Retention tests the ability to follow complex, evolving user directives across turns, while Self Coherence measures whether a model maintains factual consistency and persona stability and avoids unwarranted self‑contradictions throughout a conversation.

To create a dataset that reflects natural disfluencies, the authors employ a two‑stage hybrid pipeline. First, a multi‑agent synthetic loop automatically generates failure‑inducing conversation blueprints for each axis by iteratively probing an Audio LM with TTS‑generated user turns and detecting failures. Once a failure is triggered, the blueprint (including topic, persona, and instruction set) is handed to human contributors who record unscripted, spontaneous speech, preserving real‑world acoustic variability such as background noise, accents, and spontaneous self‑corrections. This process yields 452 multi‑turn dialogues from 47 speakers, totaling roughly 15 hours of raw 48 kHz audio, and 1,712 fine‑grained rubrics that decompose each ideal response into atomic requirements.
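The first stage of this pipeline can be sketched roughly as follows. The paper's implementation is not included in this summary, so the function names, the stubbed model, and the failure check below are illustrative assumptions: the real pipeline probes an Audio LM with TTS-generated audio turns and uses an LLM-based failure detector rather than these toy stand-ins.

```python
def probe_for_failure(blueprint, query_model, detect_failure, max_turns=8):
    """Iteratively probe a dialogue model with user turns drawn from a
    conversation blueprint until a failure is detected or turns run out.
    Returns the index of the first failing turn, or None if no failure
    was triggered. All names here are illustrative assumptions, not the
    authors' code."""
    history = []
    for turn_idx, user_turn in enumerate(blueprint["turns"][:max_turns]):
        history.append(("user", user_turn))
        reply = query_model(history)          # stand-in for the Audio LM call
        history.append(("assistant", reply))
        if detect_failure(blueprint, turn_idx, reply):  # stand-in for the judge
            return turn_idx  # blueprint is then handed to human recorders
    return None

# Toy stand-ins so the sketch runs end to end.
def toy_model(history):
    # A model that "forgets" an earlier constraint once the context grows.
    return "ignores earlier instruction" if len(history) > 4 else "ok"

def toy_detector(blueprint, turn_idx, reply):
    return "ignores" in reply

blueprint = {"axis": "Instruction Retention",
             "turns": ["set a rule", "chat", "chat", "test the rule"]}
print(probe_for_failure(blueprint, toy_model, toy_detector))  # → 2
```

Only blueprints that actually trigger a failure survive this loop, which is what lets the pipeline expose model weaknesses at scale before human speakers re-record the conversations naturally.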

Evaluation is performed with an LLM‑as‑a‑judge system that scores each rubric; the automatic scores achieve 93% agreement with human judges, providing high‑resolution failure attribution while remaining scalable. The benchmark is released publicly along with a leaderboard for reproducible comparison.
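As a rough illustration of how rubric‑level judgments roll up into the reported scores, the sketch below aggregates per‑rubric verdicts into per‑axis pass rates, treating a conversation as passed only when every one of its atomic rubrics is satisfied. This aggregation rule and the data schema are assumptions for illustration, not the benchmark's published scoring code.

```python
from collections import defaultdict

def pass_rates(judgments):
    """judgments: one dict per atomic rubric with keys 'conversation',
    'axis', and 'passed' (the LLM judge's verdict). A conversation passes
    only if all of its rubrics pass; this strict rule is an assumption."""
    per_conv = defaultdict(lambda: {"axis": None, "all_passed": True})
    for j in judgments:
        entry = per_conv[j["conversation"]]
        entry["axis"] = j["axis"]
        entry["all_passed"] &= j["passed"]
    totals, passed = defaultdict(int), defaultdict(int)
    for entry in per_conv.values():
        totals[entry["axis"]] += 1
        passed[entry["axis"]] += entry["all_passed"]
    return {axis: passed[axis] / totals[axis] for axis in totals}

# Toy verdicts: c1 fails one of its two rubrics, c2 and c3 pass.
judgments = [
    {"conversation": "c1", "axis": "Voice Editing", "passed": True},
    {"conversation": "c1", "axis": "Voice Editing", "passed": False},
    {"conversation": "c2", "axis": "Voice Editing", "passed": True},
    {"conversation": "c3", "axis": "Self Coherence", "passed": True},
]
print(pass_rates(judgments))
# → {'Voice Editing': 0.5, 'Self Coherence': 1.0}
```

Because each rubric decomposes the ideal response into atomic requirements, this kind of aggregation is what gives the benchmark high‑resolution failure attribution: a low axis score can be traced back to the specific rubrics that failed.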

Experimental results on a suite of proprietary and open‑source models reveal substantial performance gaps. The best model, Gemini 3 Pro Preview (Thinking), attains only a 54.65% overall pass rate. Performance varies markedly across axes: Voice Editing is the most difficult (25.9% pass), Audio‑Cue Inference Memory lags 36.5% behind pure semantic memory, and Self Coherence degrades sharply with longer audio context, dropping from 33.3% on short tasks to 20.0% on conversations spanning 3–5 minutes. These findings indicate that current E2E speech models struggle with long‑range context retention, dynamic filtering of retracted speech, and integration of non‑verbal acoustic cues.

The paper discusses limitations, notably the current focus on English and reliance on human‑crafted blueprints, and outlines future work such as multilingual expansion, automated rubric generation, and training strategies that explicitly model memory and editing mechanisms for speech. In sum, Audio MultiChallenge provides the first reproducible, audio‑native testbed that captures the full complexity of natural spoken dialogue, offering a clear target for the next generation of robust, multi‑turn speech assistants.

