ESDD2: Environment-Aware Speech and Sound Deepfake Detection Challenge Evaluation Plan


Audio recorded in real-world environments often contains a mixture of foreground speech and background environmental sounds. With rapid advances in text-to-speech, voice conversion, and other generative models, either component can now be modified independently. Such component-level manipulations are harder to detect: the remaining unaltered component can mislead systems designed to detect fully spoofed audio, and the results often sound more natural to human listeners. To address this gap, we propose the CompSpoofV2 dataset and a separation-enhanced joint learning framework. CompSpoofV2 is a large-scale curated dataset designed for component-level audio anti-spoofing; it contains over 250k audio samples with a total duration of approximately 283 hours. Based on CompSpoofV2 and the separation-enhanced joint learning framework, we launch the Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2), which focuses on component-level spoofing, where both speech and environmental sounds may be manipulated or synthesized, creating a more challenging and realistic detection scenario. The challenge will be held in conjunction with the IEEE International Conference on Multimedia and Expo 2026 (ICME 2026).


💡 Research Summary

The paper addresses a critical gap in audio deepfake detection: most existing systems focus on whole‑audio forgeries, yet in realistic scenarios audio consists of two distinct components—foreground speech and background environmental sounds—that can be manipulated independently. Modern text‑to‑speech, voice‑conversion, and other generative models now allow attackers to alter either component while leaving the other untouched, producing “component‑level” forgeries that sound natural to human listeners and easily evade detectors trained on fully synthetic or fully genuine recordings.

To tackle this problem the authors introduce two major contributions. First, they release CompSpoofV2, a large‑scale, curated dataset specifically designed for component‑level anti‑spoofing research. CompSpoofV2 contains over 250 000 four‑second audio clips (≈283 h total) drawn from a wide variety of public corpora such as AudioCaps, VGGSound, CommonVoice, LibriTTS, and several environmental‑sound collections (TAU Urban, TUT‑SED, UrbanSound, etc.). Each clip is labeled into one of five mutually exclusive classes that cover all combinations of genuine (bona‑fide) and spoofed speech and background: (0) original (no manipulation), (1) genuine speech + genuine background (mixed from different sources), (2) spoofed speech + genuine background, (3) genuine speech + spoofed background, and (4) spoofed speech + spoofed background. The training/validation sets share the same source distribution, while the evaluation and test sets contain newly generated mixtures that are unseen during training, thereby assessing generalization to novel attack conditions.
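The five-class scheme above can be captured in a short sketch. The enum and function names below are illustrative, not the dataset's official identifiers; the mapping simply follows the class definitions listed in the summary.

```python
# Minimal sketch of the five-class CompSpoofV2 labeling scheme described above.
# Enum names are illustrative, not the dataset's official identifiers.
from enum import IntEnum

class CompSpoofLabel(IntEnum):
    ORIGINAL = 0        # untouched recording, no manipulation
    BONAFIDE_MIX = 1    # genuine speech + genuine background, mixed from different sources
    SPOOF_SPEECH = 2    # spoofed speech + genuine background
    SPOOF_ENV = 3       # genuine speech + spoofed background
    SPOOF_BOTH = 4      # spoofed speech + spoofed background

def label_for(speech_spoofed: bool, env_spoofed: bool, remixed: bool) -> CompSpoofLabel:
    """Map component-level spoofing flags to the five-class label."""
    if speech_spoofed and env_spoofed:
        return CompSpoofLabel.SPOOF_BOTH
    if speech_spoofed:
        return CompSpoofLabel.SPOOF_SPEECH
    if env_spoofed:
        return CompSpoofLabel.SPOOF_ENV
    return CompSpoofLabel.BONAFIDE_MIX if remixed else CompSpoofLabel.ORIGINAL
```

Note that classes 0 and 1 differ only in provenance: both are fully bona fide, but class 1 mixes speech and background from different sources.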

Second, the authors propose a separation‑enhanced joint learning framework as a baseline solution. The pipeline consists of: (1) a pre‑filter that decides whether a mixture is potentially spoofed; (2) a neural source‑separation module that splits the audio into speech and environmental streams; (3) two dedicated anti‑spoofing classifiers—one specialized for speech, the other for environmental sounds; and (4) a fusion layer that maps the combined outputs to the five target classes. Crucially, the separation stage is trained jointly with the downstream classifiers so that spoofing cues (e.g., subtle spectral artifacts) are preserved rather than washed out. The loss function balances per‑class F1 scores, encouraging equal treatment of all categories despite any class imbalance.

Evaluation is based on Macro‑F1 across the five classes, ensuring each class contributes equally to the final score. In addition, three auxiliary Equal Error Rate (EER) metrics are reported for diagnostic purposes: EER_original (distinguishing class 0 from the rest), EER_speech (detecting any spoofed speech component), and EER_env (detecting any spoofed environmental component). These auxiliary metrics are not used for leaderboard ranking but help participants understand where their systems succeed or fail. Baseline results show strong performance on the validation set (Macro‑F1 = 0.9462) but a notable drop on the evaluation and test sets (Macro‑F1 ≈ 0.62–0.63), with auxiliary EERs ranging from 0.017 to 0.43, highlighting the difficulty of component‑level detection.
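The ranking metric is standard macro-averaged F1: a per-class F1 is computed for each of the five classes and averaged with equal weight, so a majority class cannot dominate the score. A minimal dependency-free version:

```python
# Macro-F1 over the five classes: per-class F1 averaged with equal weight.
# Equivalent in spirit to sklearn's f1_score(..., average="macro").
from collections import Counter
from typing import Sequence

def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], num_classes: int = 5) -> float:
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted p, but true class was something else
            fn[t] += 1   # true class t was missed
    f1s = []
    for c in range(num_classes):
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1s.append(2 * tp[c] / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    return sum(f1s) / num_classes
```

A class that never occurs and is never predicted contributes an F1 of 0 here; whether the official scorer instead skips such classes is a detail worth checking against the challenge's evaluation script.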

The challenge, named ESDD2 (Environment‑Aware Speech and Sound Deepfake Detection Challenge), will run in conjunction with IEEE ICME 2026. Participants submit predictions via the CodaBench platform; each submission must contain the audio ID, the predicted class ID, and confidence scores for the original, speech, and environmental components. Up to ten submissions are allowed in the final ranking phase, with the best three averaged for the final score. The competition enforces strict data‑usage rules: the evaluation and test sets may not be used for training, only public models released before 1 January 2026 may be used without explicit disclosure, and any additional public datasets require prior organizer approval. Synthetic audio generated by TTS, VC, or other generative models is explicitly forbidden for training.
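A submission file following the per-row fields listed above could be assembled as follows. The column names, order, and CSV format are assumptions for illustration; the official CodaBench starter kit defines the exact layout.

```python
# Hedged sketch of a submission file: one row per audio clip with the ID,
# predicted class, and the three component confidence scores described above.
# Column names and CSV layout are assumptions, not the official format.
import csv
import io
from typing import Iterable, Tuple

Row = Tuple[str, int, float, float, float]  # (audio_id, class_id, orig, speech, env)

def write_submission(rows: Iterable[Row]) -> str:
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["audio_id", "class_id", "conf_original", "conf_speech", "conf_env"])
    for audio_id, class_id, c_orig, c_speech, c_env in rows:
        writer.writerow([audio_id, class_id,
                         f"{c_orig:.4f}", f"{c_speech:.4f}", f"{c_env:.4f}"])
    return buf.getvalue()
```

Writing to an in-memory buffer keeps the sketch testable; in practice the string would be saved to the file expected by the CodaBench scoring program.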

A detailed timeline is provided: registration opens 10 Jan 2026, data and baseline release on 30 Jan 2026, dataset‑approval deadline 20 Feb 2026, test set release and final ranking update on 20 Mar 2026, leaderboard freeze and result notification on 25 Apr 2026, followed by paper submission deadlines in late April and early May. The first‑place winner receives a US $1,000 prize from OfSpectrum, Inc., an AI company specializing in imperceptible watermarking for content provenance.

In summary, the paper makes three pivotal contributions to the audio deepfake field: (1) a realistic, large‑scale component‑level dataset that reflects real‑world mixing conditions; (2) a novel separation‑enhanced joint learning baseline that explicitly models the two audio components; and (3) a well‑structured international challenge that encourages the community to develop robust, component‑aware detection methods. By exposing the limitations of whole‑audio detectors and providing the tools and incentives to overcome them, the work is poised to drive significant advances in secure, trustworthy audio technologies.

