DMP-TTS: Disentangled multi-modal Prompting for Controllable Text-to-Speech with Chained Guidance
Controllable text-to-speech (TTS) systems face significant challenges in achieving independent manipulation of speaker timbre and speaking style, often suffering from entanglement between these attributes. We present DMP-TTS, a latent Diffusion Transformer (DiT) framework with explicit disentanglement and multi-modal prompting. A CLAP-based style encoder (Style-CLAP) aligns cues from reference audio and descriptive text in a shared space and is trained with contrastive learning plus multi-task supervision on style attributes. For fine-grained control during inference, we introduce chained classifier-free guidance (cCFG) trained with hierarchical condition dropout, enabling independent adjustment of content, timbre, and style guidance strengths. Additionally, we employ Representation Alignment (REPA) to distill acoustic-semantic features from a pretrained Whisper model into intermediate DiT representations, stabilizing training and accelerating convergence. Experiments show that DMP-TTS delivers stronger style controllability than open-source baselines while maintaining competitive intelligibility and naturalness. Code and demos will be available at https://y61329697.github.io/DMP-TTS/.
💡 Research Summary
The paper “DMP-TTS: Disentangled multi-modal Prompting for Controllable Text-to-Speech with Chained Guidance” introduces a novel framework designed to overcome a persistent challenge in controllable TTS: the entanglement between speaker timbre and speaking style. Existing methods often leak timbre information when using a reference audio for style control, or support only a single modality for style prompting. DMP-TTS addresses these limitations by enabling explicit disentanglement and flexible multi-modal control.
The core of DMP-TTS is a latent Diffusion Transformer (DiT) model that generates mel-spectrogram latents conditioned on three separate inputs: content (text), timbre (speaker), and style. For style conditioning, the authors propose Style-CLAP, a unified multi-modal encoder. Built upon a pre-trained CLAP model, it aligns style cues from reference audio and descriptive text into a shared embedding space. Its training is enhanced with a multi-task objective that predicts specific style attributes (emotion, energy, speech rate), ensuring the learned representations are both cross-modal and highly discriminative for style, while carefully excluding timbre-related descriptors.
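The combined objective described above — contrastive alignment of paired audio/text style embeddings plus per-attribute classification heads — can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the function names, the InfoNCE-style symmetric contrastive loss, and the temperature value are assumptions.

```python
import numpy as np

def softmax_xent(logits, labels):
    """Cross-entropy over rows of a logit matrix."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def style_clap_loss(audio_emb, text_emb, attr_logits, attr_labels, temp=0.07):
    """Contrastive alignment of paired audio/text style embeddings (CLAP-style),
    plus multi-task cross-entropy on style-attribute heads
    (e.g. emotion, energy, speech rate). Hypothetical sketch."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = a @ t.T / temp                      # (B, B); matched pairs on the diagonal
    pair_ids = np.arange(len(a))
    contrastive = 0.5 * (softmax_xent(sims, pair_ids)
                         + softmax_xent(sims.T, pair_ids))
    # one classification head per style attribute; logits come from the encoder
    multitask = sum(softmax_xent(lg, lb) for lg, lb in zip(attr_logits, attr_labels))
    return contrastive + multitask
```

The multi-task heads are what push the shared embedding to stay discriminative for style while the contrastive term keeps the two modalities aligned; timbre descriptors are simply excluded from the attribute label set.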
To achieve fine-grained, independent control during inference, the paper introduces chained Classifier-Free Guidance (cCFG). During training, a hierarchical condition dropout strategy is employed, where conditions are dropped in the order: style, then timbre (if style is dropped), then text (if both style and timbre are dropped). This structured training enables a chained guidance formula at inference time, allowing users to independently adjust the guidance scales for content (s_text), timbre (s_spk), and style (s_style). This mechanism is key to disentangling the attributes and enabling prompts like “speak this text with speaker A’s voice but in speaker B’s style.”
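One plausible form of the chained guidance combination, consistent with the hierarchical dropout order described above, decomposes the guided noise prediction into per-condition increments. The summary does not give the exact formula, so this sketch is an assumption; `eps_*` denote the denoiser's outputs under progressively richer condition sets.

```python
import numpy as np

def chained_cfg(eps_uncond, eps_text, eps_text_spk, eps_full,
                s_text=1.0, s_spk=1.0, s_style=1.0):
    """Chained classifier-free guidance (sketch): each scale weights one
    increment along the condition hierarchy content -> timbre -> style,
    mirroring the style -> timbre -> text dropout order used in training."""
    return (eps_uncond
            + s_text  * (eps_text     - eps_uncond)     # content guidance
            + s_spk   * (eps_text_spk - eps_text)       # timbre guidance
            + s_style * (eps_full     - eps_text_spk))  # style guidance
```

With all scales set to 1 this reduces to the fully conditioned prediction, while raising `s_style` alone sharpens style adherence without touching the content or timbre terms — which is what makes independent adjustment possible.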
Furthermore, to stabilize training and improve linguistic fidelity, the authors adopt Representation Alignment (REPA). This technique distills the rich acoustic-semantic features from a pre-trained Whisper model into an intermediate layer of the DiT student model via a cosine similarity loss, acting as a form of knowledge distillation.
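The cosine-similarity alignment loss has a simple shape: project the DiT's intermediate hidden states into the teacher's feature space and penalize per-frame misalignment. A minimal sketch, assuming a learned linear projection `proj` and frame-aligned Whisper features (the projection form and epsilon are assumptions):

```python
import numpy as np

def repa_loss(dit_hidden, whisper_feat, proj):
    """REPA-style alignment (sketch): project intermediate DiT states
    (T, D_student) into the teacher's feature space via `proj`
    (D_student, D_teacher), then penalize 1 - cosine similarity per frame."""
    z = dit_hidden @ proj                                  # (T, D_teacher)
    cos = (z * whisper_feat).sum(-1) / (
        np.linalg.norm(z, axis=-1) * np.linalg.norm(whisper_feat, axis=-1) + 1e-8)
    return (1.0 - cos).mean()
```

Because the loss only constrains direction, not magnitude, it nudges the student's representations toward Whisper's acoustic-semantic geometry without forcing an exact feature match.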
Experiments were conducted on a 300-hour internal Chinese speech dataset. In zero-shot evaluations, DMP-TTS was compared against large-scale open-source baselines like CosyVoice and IndexTTS2. The results demonstrated that DMP-TTS achieves superior style controllability, with higher accuracy in transferring emotion, energy, and speech rate via both text and audio prompts. It maintained competitive naturalness (MOS) and intelligibility (Word Error Rate). An ablation study confirmed the contributions of the multi-task supervision in Style-CLAP for style accuracy and REPA for lowering WER and stabilizing convergence. The paper also notes that text prompts offer more stable style control, while audio prompts lead to slightly higher naturalness, highlighting the complementary strengths of each modality.
In summary, DMP-TTS presents a flexible and effective framework for disentangled, multi-modal controllable TTS, achieving a strong balance between precise attribute control, speech naturalness, and intelligibility through its innovative Style-CLAP encoder, chained CFG mechanism, and representation alignment technique.