Zero-Shot TTS with Enhanced Audio Prompts: BSC Submission for the 2026 WildSpoof Challenge TTS Track
We evaluate two non-autoregressive architectures, StyleTTS2 and F5-TTS, to address the spontaneous nature of in-the-wild speech. Our models use flexible duration modeling to improve prosodic naturalness. To handle acoustic noise, we implement a multi-stage enhancement pipeline built on the Sidon model, which significantly outperforms standard Demucs in signal quality. Experimental results show that fine-tuning on enhanced audio yields superior robustness, reaching up to 4.21 UTMOS and 3.47 DNSMOS-Pro. Furthermore, we analyze the impact of reference prompt quality and length on zero-shot synthesis performance, demonstrating the effectiveness of our approach for realistic speech generation.
💡 Research Summary
The paper presents the authors’ entry to the 2026 WildSpoof Challenge TTS track, focusing on zero‑shot text‑to‑speech synthesis from “in‑the‑wild” speech recordings that contain environmental noise, hesitations, fillers, and irregular prosody. Two non‑autoregressive models are evaluated: StyleTTS2, which uses a style‑diffusion framework and an acoustic encoder that directly consumes a reference audio prompt, and F5‑TTS, a mask‑based transformer that treats the reference as a conditioning prompt for masked regions. Both models are fine‑tuned on the TITW‑Easy subset after a dedicated speech‑enhancement stage.
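To make the second conditioning strategy concrete, the following minimal sketch illustrates mask-based infilling in the style of F5-TTS: the reference mel-spectrogram is kept intact as the conditioning prompt, and the frames to be synthesized are zeroed out with a mask telling the transformer which region to fill. All names here are illustrative assumptions, not the actual F5-TTS code path.

```python
import torch

def build_infilling_input(prompt_mel: torch.Tensor, target_len: int):
    """Schematic mask-based conditioning (illustrative, not the real
    F5-TTS API): prompt frames are kept, frames to generate are masked."""
    n_mels, prompt_len = prompt_mel.shape
    # Frames to be synthesized start as zeros (the "masked" region).
    masked_target = torch.zeros(n_mels, target_len)
    model_input = torch.cat([prompt_mel, masked_target], dim=1)
    # False over the prompt (pure conditioning), True over the
    # frames the transformer must infill.
    infill_mask = torch.cat([
        torch.zeros(prompt_len, dtype=torch.bool),
        torch.ones(target_len, dtype=torch.bool),
    ])
    return model_input, infill_mask
```

StyleTTS2, by contrast, would route the reference through its acoustic/style encoder to produce a fixed style vector rather than concatenating prompt frames into the generation sequence.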
The enhancement pipeline replaces the standard Demucs denoiser with the recently released Sidon model, a fast multilingual speech restoration system. Sidon is shown to increase Mean Opinion Score (MOS) and to recover high‑frequency energy (>8 kHz) that Demucs often suppresses. The authors report that Sidon‑enhanced prompts improve objective quality metrics—UTMOS rises from ~3.5 to ~4.0 and DNSMOS‑Pro from ~3.1 to ~3.5—while also reducing Word Error Rate (WER) from ~0.14 to ~0.07. However, speaker similarity measured by Speaker Encoder Cosine Similarity (SECS) drops modestly (≈0.07–0.12), indicating a trade‑off between noise removal and preservation of speaker‑specific spectral cues.
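SECS itself is just the cosine similarity between speaker embeddings of the reference and the synthesized audio. A minimal sketch follows; the paper does not name its speaker encoder, so resemblyzer is used here purely as an illustrative stand-in.

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav  # one possible speaker encoder

def secs(ref_path: str, syn_path: str, encoder: VoiceEncoder) -> float:
    """Speaker Encoder Cosine Similarity between a reference recording
    and a synthesized utterance. Encoder choice is an assumption here."""
    ref_emb = encoder.embed_utterance(preprocess_wav(ref_path))
    syn_emb = encoder.embed_utterance(preprocess_wav(syn_path))
    return float(np.dot(ref_emb, syn_emb) /
                 (np.linalg.norm(ref_emb) * np.linalg.norm(syn_emb)))

encoder = VoiceEncoder()
# A drop of roughly 0.07-0.12 in this score after enhancement would
# mirror the trade-off the paper reports (file names are placeholders).
print(secs("prompt_raw.wav", "synth_from_enhanced_prompt.wav", encoder))
```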
Training details: F5‑TTS is fine‑tuned from the public F5‑TTS v1 Base checkpoint for 75 k steps (learning rate 1e‑5, 5 k warm‑up, batch size 76 800 audio frames) without altering the tokenizer vocabulary. StyleTTS2 is fine‑tuned from a LibriTTS‑pretrained checkpoint for 12 k steps (learning rate 1e‑4, batch size 16, max token length 800), with diffusion training introduced after the first epoch. A smaller “tiny” F5‑TTS variant (12 layers, 16 heads) trained from scratch for 1 M steps performed poorly (UTMOS 3.27, SECS 0.10), underscoring the importance of large‑scale pretraining.
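For reference, the reported hyperparameters can be collected into config stubs like the following; the field names are illustrative and do not match either repository's actual config schema.

```python
# Hyperparameters as reported in the summary; keys are illustrative.
F5_TTS_FINETUNE = {
    "init_checkpoint": "F5-TTS_v1_Base",
    "steps": 75_000,
    "learning_rate": 1e-5,
    "warmup_steps": 5_000,
    "batch_size_frames": 76_800,   # batching by audio frames, not utterances
    "tokenizer": "unchanged",      # vocabulary kept from the base model
}

STYLETTS2_FINETUNE = {
    "init_checkpoint": "LibriTTS-pretrained",
    "steps": 12_000,
    "learning_rate": 1e-4,
    "batch_size": 16,
    "max_token_length": 800,
    "diffusion_start_epoch": 2,    # diffusion training enabled after epoch 1
}
```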
Prompt length experiments compare “long” (≈7.7 s) versus “short” (≈5.5 s) reference audios. Both models suffer a decrease in SECS when the prompt is shortened, but StyleTTS2 exhibits a pronounced increase in WER (from 0.21 to 0.49), reflecting its heavier reliance on the reference for speaker modeling. F5‑TTS’s WER remains stable, suggesting its conditioning mechanism is more robust to limited reference duration.
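A simple way to reproduce the two prompt-length conditions is to trim each reference to a maximum duration, as in the sketch below. The paper does not describe its exact trimming protocol; keeping the leading audio is an assumption made here for illustration.

```python
import soundfile as sf

def trim_prompt(path: str, max_seconds: float, out_path: str) -> float:
    """Trim a reference prompt to at most `max_seconds` and return
    the resulting duration. Keeps the leading audio (an assumption;
    the paper's trimming protocol is unspecified)."""
    wav, sr = sf.read(path)
    wav = wav[: int(max_seconds * sr)]
    sf.write(out_path, wav, sr)
    return len(wav) / sr

trim_prompt("prompt.wav", 7.7, "prompt_long.wav")   # long condition (~7.7 s)
trim_prompt("prompt.wav", 5.5, "prompt_short.wav")  # short condition (~5.5 s)
```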
The authors evaluate two test scenarios: KSKT (known speaker, known text) and KSUT (known speaker, unknown text). In the KSUT condition, the best results are achieved with F5‑TTS using Sidon‑enhanced long prompts (UTMOS 4.02, DNSMOS‑Pro 3.47, WER 0.13) and with StyleTTS2 under the same conditions (UTMOS 4.21, DNSMOS‑Pro 2.99, WER 0.10). Spectrogram visualizations confirm that Sidon restores high‑frequency harmonics, widening the bandwidth for F5‑TTS and making the spectral content more consistent for StyleTTS2.
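A spectrogram comparison of this kind can be reproduced with librosa; the file names below are placeholders for Demucs- and Sidon-enhanced versions of the same prompt, and restored energy above 8 kHz should be visible in the right panel.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def plot_spectrogram(path: str, title: str, ax):
    """Log-magnitude spectrogram at the file's native sample rate,
    so high-frequency content (>8 kHz) is visible if present."""
    wav, sr = librosa.load(path, sr=None)
    stft_db = librosa.amplitude_to_db(np.abs(librosa.stft(wav)), ref=np.max)
    librosa.display.specshow(stft_db, sr=sr, x_axis="time", y_axis="hz", ax=ax)
    ax.set_title(title)

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
plot_spectrogram("prompt_demucs.wav", "Demucs-enhanced", axes[0])
plot_spectrogram("prompt_sidon.wav", "Sidon-enhanced", axes[1])
plt.tight_layout()
plt.show()
```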
The paper concludes that (1) leveraging large pretrained TTS models via fine‑tuning is essential for robust zero‑shot synthesis on noisy, spontaneous speech; (2) the quality and duration of the reference audio are critical factors influencing both speaker similarity and overall audio quality; and (3) Sidon‑based enhancement outperforms traditional denoisers like Demucs, delivering higher perceptual quality while preserving intelligibility. Future work is suggested on adaptive enhancement strategies that balance noise reduction with speaker identity preservation, and on meta‑learning approaches that dynamically select optimal prompt length and quality for each target speaker.