MOSS-TTSD: Text to Spoken Dialogue Generation

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original arXiv source.

Spoken dialogue generation is crucial for applications like podcasts, dynamic commentary, and entertainment content, but poses significant challenges compared to single-utterance text-to-speech (TTS). Key requirements include accurate turn-taking, cross-turn acoustic consistency, and long-form stability, which current models often fail to address due to a lack of dialogue context modeling. To bridge this gap, we present MOSS-TTSD, a spoken dialogue synthesis model designed for expressive, multi-party conversational speech across multiple languages. With enhanced long-context modeling, MOSS-TTSD generates long-form spoken conversations from dialogue scripts with explicit speaker tags, supporting up to 60 minutes of single-pass synthesis, multi-party dialogue with up to 5 speakers, and zero-shot voice cloning from a short reference audio clip. The model supports various mainstream languages, including English and Chinese, and is adapted to several long-form scenarios. Additionally, to address limitations of existing evaluation methods, we propose TTSD-eval, an objective evaluation framework based on forced alignment that measures speaker attribution accuracy and speaker similarity without relying on speaker diarization tools. Both objective and subjective evaluation results show that MOSS-TTSD surpasses strong open-source and proprietary baselines in dialogue synthesis.


💡 Research Summary

The paper introduces MOSS‑TTSD, a novel spoken‑dialogue synthesis system that extends text‑to‑speech (TTS) technology from single‑utterance generation to full‑length, multi‑speaker conversations. The authors identify three core challenges that differentiate dialogue synthesis from conventional TTS: (1) accurate turn‑taking and natural speaker switching, (2) cross‑turn acoustic consistency (i.e., preserving a speaker’s timbre and prosody across many turns), and (3) long‑form stability, meaning the model must generate minutes‑long audio without stitching artifacts or degradation. Existing TTS models largely ignore these issues because they are trained on short, single‑speaker clips and lack explicit dialogue context modeling.

Model Architecture
MOSS‑TTSD builds on the Qwen3‑8B‑base large language model (LLM) as its autoregressive backbone and uses the MOSS‑Audio‑Tokenizer, a residual‑vector‑quantization (RVQ) tokenizer that operates at 2 kbps with a 12.5 Hz frame rate. Only the first 16 RVQ layers are modeled, which keeps the number of tokens to predict per frame low while preserving enough acoustic detail for high‑fidelity speech. The architecture follows the “multi‑head delay” pattern introduced in MusicGen, enabling efficient long‑context generation. By restricting modeling to the first 16 layers, the system can handle up to 3 600 seconds (≈ 60 minutes) of continuous context in a single forward pass, a capability far beyond prior open‑source TTS models.
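As a rough illustration of the MusicGen-style delay pattern mentioned above, the sketch below shifts each RVQ layer by one extra decoding step, so all layers of a frame are emitted across consecutive steps instead of requiring a separate refinement pass. The function name, padding value, and codebook size are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the "delay pattern" used to interleave RVQ codebook
# layers for autoregressive modeling. The 16 layers and 12.5 Hz frame rate
# follow the summary above; PAD and the function name are illustrative.
import numpy as np

PAD = -1  # placeholder token for positions not yet defined by the delay

def apply_delay_pattern(codes: np.ndarray) -> np.ndarray:
    """codes: (num_layers, num_frames) RVQ indices -> delayed layout.

    Layer k is shifted right by k steps, so at step t the model predicts
    layer 0 of frame t, layer 1 of frame t-1, and so on.
    """
    num_layers, num_frames = codes.shape
    out = np.full((num_layers, num_frames + num_layers - 1), PAD, dtype=codes.dtype)
    for k in range(num_layers):
        out[k, k:k + num_frames] = codes[k]
    return out

# Example: 16 RVQ layers at 12.5 Hz -> 750 frames per minute of audio.
frames_per_minute = int(12.5 * 60)
codes = np.random.randint(0, 1024, size=(16, frames_per_minute))
delayed = apply_delay_pattern(codes)
print(delayed.shape)  # (16, 765): 750 frames + 15 extra steps from the delays
```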

Data Pipeline
The training data are constructed through a multi‑stage pipeline: raw audio is normalized, diarized, and segmented into clips containing 1‑5 speakers with a maximum duration of 3 600 s. Each clip receives quality scores (DNSMOS), language labels (via Whisper‑large‑v3), and estimated sample rates. The authors then apply MOSS Transcribe Diarize, an end‑to‑end ASR system that simultaneously outputs transcripts and explicit speaker tags, eliminating the mismatch between diarization and transcription that plagued earlier releases. Low‑quality or heavily noisy segments (e.g., movies, esports commentary) are further denoised with MossFormer2, and only clips with DNSMOS ≥ 2 are retained for training.
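To make the filtering step concrete, here is a minimal sketch of how segment metadata might be screened against the thresholds stated above (DNSMOS ≥ 2, at most 3 600 s, 1–5 speakers). The Clip dataclass and field names are hypothetical, not the paper's actual pipeline code.

```python
# Illustrative sketch of the segment-filtering step: each clip carries a
# DNSMOS score, a language label, and an estimated sample rate, and only
# clips meeting the quality threshold are kept for training.
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    duration_s: float   # <= 3600 s after segmentation
    num_speakers: int   # 1-5 speakers per clip
    dnsmos: float       # DNSMOS quality score
    language: str       # e.g. "en", "zh" (labeled via Whisper-large-v3)
    sample_rate: int    # estimated sample rate in Hz

DNSMOS_TRAIN_THRESHOLD = 2.0  # quality floor stated in the summary

def keep_for_training(clip: Clip) -> bool:
    return (
        clip.dnsmos >= DNSMOS_TRAIN_THRESHOLD
        and clip.duration_s <= 3600
        and 1 <= clip.num_speakers <= 5
    )

clips = [
    Clip("a.wav", 1250.0, 2, 3.1, "en", 24000),
    Clip("b.wav", 300.0, 1, 1.6, "zh", 16000),  # rejected: DNSMOS below threshold
]
train_set = [c for c in clips if keep_for_training(c)]
```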

Speaker Cloning & Synthetic Augmentation
To enable zero‑shot multi‑speaker voice cloning, the pipeline extracts non‑overlapping single‑speaker segments from the same recording and maps them to reference‑audio slots in the prompt template. For multi‑speaker training data, the authors augment the corpus by concatenating single‑speaker clips that share the same diarization ID, thereby creating synthetic dialogues with controlled acoustic transitions. All synthetic samples are filtered to ensure identical sample rates and high DNSMOS (≥ 3). Additionally, rule‑based text augmentation introduces diverse punctuation symbols to improve robustness to varied textual inputs.
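The sketch below illustrates, under stated assumptions, how single-speaker clips could be interleaved into a synthetic two-party dialogue while enforcing the filters named above (identical sample rates, DNSMOS ≥ 3). The [S1]/[S2] tag strings and the function signature are illustrative and not taken from the released code.

```python
# Hypothetical sketch of synthetic-dialogue construction: interleave clips
# from two speakers, rejecting candidates that mix sample rates or contain
# low-quality audio.
from itertools import zip_longest

def build_synthetic_dialogue(spk_a_clips, spk_b_clips, min_dnsmos=3.0):
    """Each clip is a dict: {"speaker": id, "dnsmos": float, "sr": int, "text": str}."""
    clips = spk_a_clips + spk_b_clips
    if len({c["sr"] for c in clips}) != 1:
        return None  # reject: mixed sample rates
    if any(c["dnsmos"] < min_dnsmos for c in clips):
        return None  # reject: low-quality segment
    # Alternate turns A, B, A, B, ... and tag each turn with a speaker label,
    # producing a script with explicit speaker tags for training.
    dialogue = []
    for a, b in zip_longest(spk_a_clips, spk_b_clips):
        if a is not None:
            dialogue.append(("[S1]", a["text"]))
        if b is not None:
            dialogue.append(("[S2]", b["text"]))
    return dialogue
```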

Curriculum Learning
Training proceeds in three stages (summarized in the configuration sketch after this list):

  1. Stage 1 continues pre‑training from a single‑speaker TTS checkpoint, expanding sequence length to 65 k tokens and incorporating all single‑ and two‑speaker data (DNSMOS ≥ 2) along with voice‑cloning references.
  2. Stage 2 focuses on high‑quality data (DNSMOS ≥ 3, ≥ 24 kHz), reduces the proportion of single‑speaker examples, and lowers the learning rate to boost audio fidelity.
  3. Stage 3 mixes real multi‑speaker recordings with the synthetic dialogues, teaching the model to handle 1‑5 speakers, maintain turn‑taking stability, and preserve speaker identity over long spans without sacrificing naturalness.
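
A hypothetical configuration-style view of this curriculum is sketched below, using only the thresholds stated above; the field names and structure are illustrative, not the authors' training code.

```python
# Illustrative summary of the three-stage curriculum as data-selection rules.
CURRICULUM = [
    dict(  # Stage 1: continue from a single-speaker TTS checkpoint
        max_seq_len=65_000,         # expanded sequence length (tokens)
        min_dnsmos=2.0,             # quality floor for training clips
        max_speakers=2,             # single- and two-speaker data
        include_cloning_refs=True,  # voice-cloning reference audio in prompts
    ),
    dict(  # Stage 2: high-quality subset, lower learning rate
        min_dnsmos=3.0,
        min_sample_rate=24_000,     # Hz
        reduce_single_speaker=True,
        lower_learning_rate=True,
    ),
    dict(  # Stage 3: real multi-speaker recordings + synthetic dialogues
        max_speakers=5,
        mix_synthetic_dialogues=True,
    ),
]
```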

TTSD‑eval: Objective Evaluation Framework
Traditional spoken‑dialogue metrics (cpWER, cpSIM) rely on external speaker‑diarization tools, which introduce cascading errors as the number of speakers grows. The authors propose TTSD‑eval, which bypasses diarization by using forced alignment (MMS‑FA) to align the generated audio with the input script at the word level. Speaker tags are taken directly from the script, and speaker‑embedding similarity (computed with a WeSpeaker SimAM‑ResNet100 model) is measured between each utterance fragment and the reference audio of every candidate speaker. The highest‑scoring speaker is taken as the prediction, yielding Speaker Attribution Accuracy (ACC). Speaker Similarity (SIM) is the average similarity between each fragment and its ground‑truth speaker’s reference. Word Error Rate (WER) is also reported using Whisper‑large‑v3 after stripping tags and normalizing text. This framework provides a diarization‑independent, reproducible measure of both attribution and voice fidelity.
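The scoring logic can be summarized in a short sketch: given word-aligned utterance fragments and one reference embedding per speaker, the predicted speaker is the one with the highest embedding similarity (yielding ACC), and SIM averages the similarity to the ground-truth speaker. The generic embed function and cosine similarity below are assumptions; the MMS-FA alignment and WeSpeaker embedder themselves are not reproduced here.

```python
# Minimal sketch of the TTSD-eval ACC/SIM computation described above.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ttsd_eval_scores(fragments, reference_embeddings, embed):
    """fragments: list of (waveform, true_speaker_id) pairs, one per aligned
    utterance; reference_embeddings: {speaker_id: embedding of reference audio};
    embed: any speaker-embedding function. Returns (ACC, SIM)."""
    correct, sims = 0, []
    for wav, true_spk in fragments:
        frag_emb = embed(wav)
        # Similarity to every candidate speaker's reference audio.
        scores = {spk: cosine(frag_emb, ref) for spk, ref in reference_embeddings.items()}
        predicted = max(scores, key=scores.get)  # highest-scoring speaker -> prediction
        correct += int(predicted == true_spk)    # contributes to Speaker Attribution Accuracy
        sims.append(scores[true_spk])            # similarity to ground-truth speaker -> SIM
    return correct / len(fragments), float(np.mean(sims))
```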

Experimental Results
The authors evaluate on 50 English and 50 Chinese dialogue samples (30 s–720 s each) covering podcasts, dubbing, sports commentary, and animation. Objective metrics show that MOSS‑TTSD achieves ACC of 0.958–0.963 and SIM of 0.73–0.82, outperforming strong open‑source baselines such as VibeVoice (1.5 B/7 B) and FireRedTTS‑2, and matching or exceeding proprietary systems like Eleven V3. WER is competitive (≈ 5 %–10 %). Human listening tests using Elo‑rating confirm that MOSS‑TTSD leads in perceived speaker attribution, voice similarity, rhythm, and overall quality across all tested scenarios.

Key Contributions

  1. Long‑form single‑pass dialogue synthesis up to 60 minutes, eliminating stitching artifacts.
  2. Zero‑shot multi‑speaker voice cloning for up to five participants, with consistent timbre across turns.
  3. Multilingual support (English, Chinese, Spanish, Portuguese, German, French, Japanese, Korean, Russian) and adaptation to diverse application domains.
  4. TTSD‑eval, a forced‑alignment based evaluation suite that independently measures speaker attribution and similarity without external diarization.

Implications and Future Work
MOSS‑TTSD demonstrates that a discrete‑token, LLM‑driven architecture can scale to the demanding requirements of spoken dialogue generation. By releasing the model, data pipeline, and evaluation code, the authors provide a foundation for downstream applications such as automated podcast creation, dynamic sports commentary, and multilingual audio‑book production. Future research directions include real‑time streaming synthesis, finer‑grained emotional and prosodic control, and expanding the language repertoire further. The TTSD‑eval framework may become a de‑facto standard for benchmarking dialogue‑oriented TTS systems.

