AccentBox: Towards High-Fidelity Zero-Shot Accent Generation
While recent Zero-Shot Text-to-Speech (ZS-TTS) models have achieved high naturalness and speaker similarity, they fall short in accent fidelity and control. To address this issue, we propose zero-shot accent generation that unifies Foreign Accent Conversion (FAC), accented TTS, and ZS-TTS, with a novel two-stage pipeline. In the first stage, we achieve state-of-the-art (SOTA) on Accent Identification (AID) with a 0.56 F1 score on unseen speakers. In the second stage, we condition a ZS-TTS system on the pretrained speaker-agnostic accent embeddings extracted by the AID model. The proposed system achieves higher accent fidelity on inherent/cross accent generation, and enables unseen accent generation.
💡 Research Summary
Background and Motivation
Recent advances in zero‑shot text‑to‑speech (ZS‑TTS) have enabled the synthesis of any unseen speaker’s voice from a few seconds of reference audio, achieving naturalness comparable to human recordings. However, these systems are typically trained on predominantly American English data and lack any mechanism to control or faithfully reproduce accents. Accent is a crucial marker of personal and regional identity, and preserving it is essential for inclusive speech technologies, language learning tools, and personalized virtual assistants. The authors therefore define a new task—zero‑shot accent generation—that unifies foreign accent conversion (FAC), multi‑accent TTS, and ZS‑TTS, and they propose a two‑stage pipeline to address it.
Related Work and Gaps
- Foreign Accent Conversion (FAC) converts a source speaker’s L2 accent to a target L1 accent but cannot generate speech from arbitrary text nor generalize to unseen accent pairs.
- Accented TTS conditions on text, speaker ID, and accent ID, yet it cannot handle unseen speakers or unseen accents.
- Current ZS‑TTS (e.g., VALL‑E X and other LLM‑based models) conditions only on speaker embeddings, ignoring accent information and often producing accent‑mismatched or hallucinated speech for accented speakers.
A further obstacle is speaker‑accent entanglement: most corpora contain a single accent per speaker, causing models to memorize speaker‑to‑accent mappings rather than learning accent‑specific cues. This entanglement harms both accent identification (AID) and zero‑shot TTS.
Stage 1 – GenAID: Generalisable Accent Identification
The authors build on a wav2vec‑2.0 (XLSR‑large) backbone and introduce five key modifications:
- Unseen‑Speaker Validation – The dataset is re‑split so that validation and test sets contain speakers never seen during training, forcing the model to learn accent‑discriminative features rather than speaker memorization.
- Weighted Sampling – Inverse‑frequency sampling balances the highly skewed distribution of accent labels, ensuring each accent appears equally often per mini‑batch.
- Data Augmentation – Speed perturbation and additive noise broaden the acoustic conditions, making the model robust to recording device and environment variations.
- Information Bottleneck – A two‑layer MLP (GELU) compresses the 1024‑dim XLSR output to a 64‑dim continuous accent embedding, discarding redundant speaker information.
- Adversarial Training – An auxiliary speaker classifier is trained to predict a uniform distribution over speakers (MSE loss against an even prior), encouraging the encoder to be speaker‑agnostic. The total loss is
L = L_acc + α·L_MSE, with α = 10.
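The adversarial objective above can be sketched numerically. The snippet below is an illustrative NumPy implementation, not the authors' code: the function names (`cross_entropy`, `uniform_prior_mse`, `total_loss`) are hypothetical, and it assumes the speaker classifier emits softmax probabilities that are regressed toward a uniform prior with MSE, alongside a standard cross‑entropy accent loss and α = 10 as stated in the summary.

```python
import numpy as np

def cross_entropy(accent_probs, accent_label):
    """Accent classification loss for one example (L_acc)."""
    return -np.log(accent_probs[accent_label])

def uniform_prior_mse(spk_probs):
    """MSE between the speaker classifier's output and a uniform
    prior over speakers (L_MSE) -- the adversarial term."""
    uniform = np.full_like(spk_probs, 1.0 / spk_probs.size)
    return np.mean((spk_probs - uniform) ** 2)

def total_loss(accent_probs, accent_label, spk_probs, alpha=10.0):
    """L = L_acc + alpha * L_MSE, with alpha = 10 per the summary."""
    return cross_entropy(accent_probs, accent_label) + alpha * uniform_prior_mse(spk_probs)

# A perfectly speaker-agnostic encoder yields uniform speaker
# predictions, so the adversarial term vanishes and only the
# accent loss -log(p_correct) remains:
spk_uniform = np.full(8, 1.0 / 8)        # 8 speakers, uniform output
accent_probs = np.array([0.7, 0.2, 0.1]) # 3 accents
loss = total_loss(accent_probs, 0, spk_uniform)
```

Intuitively, the encoder is rewarded when the speaker classifier cannot do better than chance, which pushes speaker cues out of the accent embedding.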
Training uses the CommonAccent subset of CommonVoice 17.0, filtered to retain accents with ≥10 speakers (≥50 utterances each for training/seen‑speaker validation, plus ≥20 speakers for unseen‑speaker validation).

The final model (system #6) achieves F1 = 0.56 and accuracy = 0.56 on unseen speakers, far above the random baseline (0.08). The gap between seen and unseen speaker accuracy shrinks from 0.53 to 0.06, and the Silhouette Coefficient for speaker clusters drops from 0.236 to 0.079, confirming effective disentanglement. t‑SNE visualisations show well‑separated accent clusters with overlapping speaker clusters.
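A filtering rule of this kind is easy to state in code. The sketch below is an assumption‑laden illustration (the function name `filter_accents` and the flat `(speaker, accent)` input format are invented here): it keeps an accent only if enough speakers each contribute enough utterances.

```python
from collections import defaultdict

def filter_accents(utterances, min_utts=50, min_speakers=10):
    """Keep accents that have at least `min_speakers` speakers,
    each contributing at least `min_utts` utterances.
    `utterances` is an iterable of (speaker_id, accent) pairs."""
    per_speaker = defaultdict(int)
    speaker_accent = {}
    for spk, acc in utterances:
        per_speaker[spk] += 1
        speaker_accent[spk] = acc
    qualified = defaultdict(set)   # accent -> speakers passing the bar
    for spk, n in per_speaker.items():
        if n >= min_utts:
            qualified[speaker_accent[spk]].add(spk)
    return {acc for acc, spks in qualified.items() if len(spks) >= min_speakers}
```

The actual CommonAccent preparation additionally reserves ≥20 speakers per accent for unseen‑speaker validation, which this toy sketch does not model.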
Stage 2 – AccentBox: Zero‑Shot Accent Generation
AccentBox leverages the pretrained GenAID encoder to provide a continuous accent embedding for each reference utterance. The authors adopt the open‑source YourTTS architecture (non‑autoregressive, flow‑based decoder with HiFi‑GAN vocoder) and modify it as follows:
- Replace the one‑hot language embedding with the 64‑dim accent embedding, feeding it into both the text encoder and the stochastic duration predictor.
- Keep the speaker encoder unchanged; speaker embeddings are still derived from a separate pretrained speaker verification model.
- During training, the same text is spoken by multiple speakers with different accents, allowing the model to learn to separate content, speaker identity, and accent style.
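One simple way to realize the first modification, feeding a fixed 64‑dim accent vector into sequence‑level modules, is to broadcast it across the phoneme sequence and concatenate. This is a hedged sketch of the idea, not the YourTTS code: the helper name `condition_on_accent` and the concatenation strategy are assumptions (the actual model may inject the embedding differently).

```python
import numpy as np

def condition_on_accent(phoneme_emb, accent_emb):
    """Broadcast an utterance-level accent embedding across the
    phoneme sequence and concatenate it to each position, so both
    the text encoder and the duration predictor see accent
    information at every step."""
    seq_len = phoneme_emb.shape[0]
    tiled = np.tile(accent_emb, (seq_len, 1))          # (T, 64)
    return np.concatenate([phoneme_emb, tiled], axis=-1)

phonemes = np.random.randn(12, 192)  # 12 phonemes, 192-dim text embeddings
accent = np.random.randn(64)         # 64-dim GenAID accent embedding
conditioned = condition_on_accent(phonemes, accent)
# conditioned has shape (12, 256): original features plus accent dims
```

Because the accent vector is continuous rather than a one‑hot ID, unseen accents map to nearby points in the same embedding space, which is what makes zero‑shot accent generation possible at all.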
Three inference scenarios are explored (Table II):
- Inherent Accent Generation – Speaker and accent match (baseline for fidelity).
- Cross Accent Generation – Speaker and accent are mismatched, testing the model’s ability to impose a new accent while preserving speaker characteristics.
- Unseen Accent Generation – Both speaker and accent are unseen, probing true zero‑shot capability.
Evaluation
- Objective Metrics:
  - Accent Cosine Similarity (AccCos) – cosine similarity between accent embeddings of reference and synthesized speech (using both GenAID systems #4 and #6).
  - Speaker Cosine Similarity (SpkCos) – cosine similarity of Resemblyzer speaker embeddings.
- Subjective Listening Tests: 10 listeners per utterance (recruited via Prolific from the target accent regions) rated (i) accent similarity, (ii) speaker similarity, and (iii) naturalness. Two test formats were used: ABC ranking for inherent accent generation (Baseline vs Accent_ID vs Proposed) and AB preference for cross‑accent generation (Accent_ID vs Proposed).
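Both objective metrics reduce to the same computation on different embeddings. As a minimal sketch (the function name `cosine_similarity` is generic, not taken from the paper's code):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, used for
    both AccCos (GenAID accent embeddings) and SpkCos (Resemblyzer
    speaker embeddings)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

ref = np.array([1.0, 0.0, 0.0])  # embedding of the reference audio
syn = np.array([1.0, 1.0, 0.0])  # embedding of the synthesized audio
cosine_similarity(ref, syn)      # ~0.707; 1.0 means identical direction
```

Scores near 1.0 indicate that the synthesized speech carries the same accent (or speaker) characteristics as the reference, regardless of embedding magnitude.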
Results show that AccentBox outperforms the Accent_ID baseline (which uses one‑hot accent IDs) by 57.4 %–70.0 % in accent‑similarity preference across both inherent and cross‑accent conditions. Speaker similarity remains comparable, and naturalness scores are on par with the strong baselines, indicating that the added accent conditioning does not degrade overall audio quality.
Contributions and Impact
- First systematic quantification of speaker‑accent entanglement in AID datasets and demonstration of accent‑mismatch/hallucination problems in existing ZS‑TTS.
- Introduction of a speaker‑agnostic, continuous accent embedding via GenAID, employing information bottleneck and adversarial training to disentangle accent from speaker.
- Definition of the zero‑shot accent generation task and establishment of a benchmark that unifies FAC, accented TTS, and ZS‑TTS.
- State‑of‑the‑art results: 0.56 F1 on unseen‑speaker AID and 57.4 %–70.0 % accent‑similarity preference in zero‑shot generation.
Limitations and Future Directions
- Evaluation of unseen accents was limited to two accents (American and Irish) due to budget constraints; broader multilingual testing is needed.
- The current pipeline relies on a non‑LLM TTS backbone; integrating large language model decoders could further improve prosody and naturalness.
- Real‑world deployment would benefit from on‑device efficiency optimizations and robustness to noisy reference audio.
Future work may explore (a) scaling GenAID to dozens of languages, (b) hierarchical accent representations that capture fine‑grained regional variants, and (c) joint training of accent and speaker encoders to further tighten the disentanglement while preserving expressive flexibility.