TidyVoice 2026 Challenge Evaluation Plan
The performance of speaker verification systems degrades significantly under language mismatch, a critical challenge exacerbated by the field’s reliance on English-centric data. To address this, we propose the TidyVoice Challenge for cross-lingual speaker verification. The challenge leverages the TidyVoiceX dataset from the novel TidyVoice benchmark, a large-scale, multilingual corpus derived from Mozilla Common Voice, and specifically curated to isolate the effect of language switching across approximately 40 languages. Participants will be tasked with building systems robust to this mismatch, with performance primarily evaluated using the Equal Error Rate on cross-language trials. By providing standardized data, open-source baselines, and a rigorous evaluation protocol, this challenge aims to drive research towards fairer, more inclusive, and language-independent speaker recognition technologies, directly aligning with the Interspeech 2026 theme, “Speaking Together.”
💡 Research Summary
The paper presents the TidyVoice 2026 Challenge, an evaluation campaign designed to benchmark speaker verification systems under cross‑lingual conditions. Recognizing that most existing speaker verification research relies heavily on English‑centric corpora, the authors introduce a multilingual benchmark derived from Mozilla Common Voice, called TidyVoiceX. This dataset contains recordings from 4,474 speakers across 40 languages, amounting to roughly 457 hours of read speech. The challenge is organized around an open‑training condition: participants may use the official TidyVoiceX training partition, any publicly available speech corpora (e.g., VoxCeleb, VoxBlink), pre‑trained models (e.g., wav2vec 2.0, HuBERT, WavLM), and non‑speech augmentation data such as MUSAN. The only restriction is that no data outside the designated TidyVoiceX training split may be taken from the Mozilla Common Voice source, which prevents data leakage.
The task is classic text‑independent speaker verification. For each trial, an enrollment segment (or multiple segments) belonging to a target speaker is paired with a test segment; the system must output a similarity score, preferably a log‑likelihood ratio (LLR). Trials are organized into two evaluation lists: tv26_eval‑A, where enrollments come from languages seen during training but tests are from unseen languages; and tv26_eval‑U, where both enrollment and test utterances are from languages never encountered in training. In total, the final evaluation set uses 38 languages that are completely disjoint from the 40 languages present in training and development, ensuring a strict test of language‑independent capabilities. Language labels are hidden during evaluation to avoid any language‑aware cheating.
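The per-trial scoring described above can be sketched with embeddings and cosine similarity. This is a minimal illustration, not the challenge's scoring method: the function name, the 192-dimensional embeddings, and the use of cosine similarity (a common stand-in, whereas the challenge prefers calibrated LLR scores) are all assumptions for the sake of the example.

```python
import numpy as np

def cosine_score(enroll_embs: np.ndarray, test_emb: np.ndarray) -> float:
    """Score one trial: average the (possibly multiple) enrollment
    embeddings for the target speaker, then take the cosine similarity
    with the test-segment embedding."""
    enroll = np.atleast_2d(enroll_embs).mean(axis=0)
    num = float(np.dot(enroll, test_emb))
    den = float(np.linalg.norm(enroll) * np.linalg.norm(test_emb))
    return num / den

# Toy usage: two enrollment segments for one target speaker.
rng = np.random.default_rng(0)
enroll = rng.standard_normal((2, 192))               # two 192-dim embeddings
same = enroll.mean(axis=0) + 0.01 * rng.standard_normal(192)  # "same speaker"
diff = rng.standard_normal(192)                      # unrelated speaker
assert cosine_score(enroll, same) > cosine_score(enroll, diff)
```

A real system would replace the random vectors with embeddings from a speaker encoder and, ideally, map the raw similarity to an LLR via a calibration back-end.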
Performance is primarily measured by Equal Error Rate (EER). During development, four sub‑categories are reported: (1) target‑target same language, (2) target‑target different language, (3) non‑target same language, and (4) non‑target different language. This granularity allows analysis of how language mismatch affects both false‑accept and false‑reject rates. As a secondary metric, Minimum Detection Cost Function (minDCF) is computed with costs C_miss = C_fa = 1 and target prior P_tar = 0.01. The minDCF serves as a tie‑breaker and provides a cost‑weighted view of system performance.
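The two metrics above can be computed with a single threshold sweep over the pooled scores. The sketch below is a common reference implementation, not the challenge's official scoring tool; it uses the stated operating point (C_miss = C_fa = 1, P_tar = 0.01) as defaults and normalizes minDCF by the cost of the best trivial system.

```python
import numpy as np

def eer_and_mindcf(target_scores, nontarget_scores,
                   p_tar=0.01, c_miss=1.0, c_fa=1.0):
    """Compute EER and normalized minDCF by sweeping a decision
    threshold over every position in the sorted score list."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]          # sort labels by score
    n_tar = labels.sum()
    n_non = len(labels) - n_tar
    # With the threshold placed before the i-th smallest score, the first
    # i trials are rejected: misses are targets among them, false alarms
    # are the non-targets that remain accepted.
    miss = np.concatenate([[0.0], np.cumsum(labels)]) / n_tar
    fa = np.concatenate([[n_non], n_non - np.cumsum(1 - labels)]) / n_non
    eer_idx = np.argmin(np.abs(miss - fa))
    eer = (miss[eer_idx] + fa[eer_idx]) / 2
    dcf = c_miss * p_tar * miss + c_fa * (1 - p_tar) * fa
    min_dcf = dcf.min() / min(c_miss * p_tar, c_fa * (1 - p_tar))
    return float(eer), float(min_dcf)

# Perfectly separated scores give zero error under both metrics.
assert eer_and_mindcf([2.0, 3.0], [0.0, 1.0]) == (0.0, 0.0)
```

The same function can be run separately on each of the four development sub-categories to see how language mismatch shifts the miss and false-alarm rates.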
The challenge defines strict rules to guarantee fairness and reproducibility. Participants may submit up to three distinct systems per trial list, for a maximum of three systems overall. Re‑identification of real‑world speakers or any form of data recombination (linking challenge speakers to external records) is prohibited. All external data, pre‑trained models, and augmentation techniques must be fully disclosed in a system description paper. Submissions must include a plain‑text score file that mirrors the order of the provided trial list, and a validation script checks format compliance before acceptance. After the evaluation phase, teams must submit a detailed system description following the IEEE ICASSP or Interspeech template, covering front‑end processing, back‑end modeling, data partitions, external resources, and computational cost (CPU/GPU time and memory per trial). The top‑ranking team for each task must also provide the trained model and an inference script in a zip archive to enable independent reproduction; failure to reproduce results leads to disqualification.
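A format check like the one the organizers describe could look as follows. The exact file layout is not specified in the summary, so this sketch assumes a hypothetical whitespace-separated format in which each line carries the enrollment ID, the test ID, and a float score, mirroring the trial list line by line.

```python
def validate_scores(trial_path: str, score_path: str) -> list[str]:
    """Check that a score file has one line per trial, in trial-list
    order, with a parseable float score in the last field. Returns a
    list of error messages (empty means the file passes)."""
    errors = []
    with open(trial_path) as tf, open(score_path) as sf:
        trials = [line.split() for line in tf if line.strip()]
        scores = [line.split() for line in sf if line.strip()]
    if len(scores) != len(trials):
        errors.append(f"expected {len(trials)} lines, got {len(scores)}")
        return errors
    for i, (trial, row) in enumerate(zip(trials, scores), start=1):
        if row[:2] != trial[:2]:        # enrollment/test IDs must match
            errors.append(f"line {i}: trial IDs differ from the trial list")
        try:
            float(row[-1])              # score must parse as a number
        except ValueError:
            errors.append(f"line {i}: score {row[-1]!r} is not a float")
    return errors
```

Running such a check locally before uploading avoids rejected submissions for trivial formatting mistakes.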
To lower the entry barrier, the organizers release an official baseline system built with the WeSpeaker toolkit, a widely used ASV framework. The baseline fine‑tunes a publicly available pre‑trained model on the TidyVoiceX training data and employs standard components such as voice activity detection, filter‑bank or MFCC features, and PLDA scoring. Participants can use this baseline as a reference point and improve upon it with novel architectures, self‑supervised learning, or advanced data‑augmentation strategies.
Data access is managed through the Mozilla Data Collective. After registering for the challenge, participants receive credentials to download the TidyVoiceX package (≈50 GB), which includes the training and development partitions along with metadata (speaker IDs and language tags). Audio files are 16 kHz WAV, organized hierarchically by speaker ID and language ID, which makes scripting straightforward.
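A speaker-and-language hierarchy like the one described makes it easy to build an index of the corpus. The sketch below assumes a `<root>/<speaker_id>/<language_id>/*.wav` layout; the exact nesting order is an assumption here, so the released package's documentation should be checked before relying on it.

```python
from collections import defaultdict
from pathlib import Path

def index_corpus(root: str) -> dict:
    """Map speaker ID -> language ID -> list of WAV paths, assuming a
    <root>/<speaker_id>/<language_id>/*.wav layout (hypothetical; verify
    against the actual TidyVoiceX package structure)."""
    index = defaultdict(lambda: defaultdict(list))
    for wav in Path(root).glob("*/*/*.wav"):
        speaker, language = wav.parts[-3], wav.parts[-2]
        index[speaker][language].append(wav)
    return index
```

Such an index is a convenient starting point for sampling same-language versus cross-language trial pairs during development.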
In summary, the TidyVoice 2026 Challenge provides a rigorously defined, multilingual benchmark that isolates the effect of language mismatch on speaker verification. By offering a large, publicly available dataset, clear evaluation protocols, open‑training conditions, and mandatory reproducibility artifacts, the challenge aims to stimulate research on language‑independent speaker embeddings, promote fairness across linguistic communities, and set a new standard for cross‑lingual speaker verification evaluation.