CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering
Emotional expression in human speech is nuanced and compositional, often involving multiple, sometimes conflicting, affective cues that may diverge from linguistic content. In contrast, most expressive text-to-speech systems enforce a single utterance-level emotion, collapsing affective diversity and suppressing mixed or text-emotion-misaligned expression. While activation steering via latent direction vectors offers a promising solution, it remains unclear whether emotion representations are linearly steerable in TTS, where steering should be applied within hybrid TTS architectures, and how such complex emotion behaviors should be evaluated. This paper presents the first systematic analysis of activation steering for emotional control in hybrid TTS models, introducing a quantitative, controllable steering framework and multi-rater evaluation protocols that enable composable mixed-emotion synthesis and reliable text-emotion mismatch synthesis. Our results demonstrate, for the first time, that emotional prosody and expressive variability are primarily synthesized by the TTS language module rather than the flow-matching module, and also provide a lightweight steering approach for generating natural, human-like emotional speech.
💡 Research Summary
CoCoEmo tackles a fundamental limitation of current emotional text‑to‑speech (TTS) systems: the reliance on a single, utterance‑level emotion label that collapses the rich, compositional nature of human affect. The authors observe that natural speech often conveys multiple, sometimes conflicting, affective cues that may diverge from the literal meaning of the text. To address this, they introduce activation steering – a lightweight, post‑training technique that manipulates the latent activations of a pretrained hybrid TTS model with direction vectors representing specific emotions.
The paper first dissects modern hybrid TTS architectures, which consist of a speech language model (SLM) that generates discrete speech tokens from text and an acoustic flow‑matching decoder (Flow) that converts those tokens into mel‑spectrograms. By designing a “cross‑conditioning” experiment, the authors compare two extremes: injecting emotion only at the SLM stage (SLM‑driven) versus injecting it only at the Flow stage (Flow‑driven). They evaluate the resulting speech on three emotion‑correlated acoustic dimensions – fundamental frequency (F0), energy, and speaking rate – using the concordance correlation coefficient (CCC) and standard deviation (STD) as metrics. The SLM‑driven condition yields markedly different prosodic patterns across emotions, while the Flow‑driven condition produces nearly identical contours, indicating that emotional prosody is primarily encoded in the SLM and that Flow mainly refines acoustic rendering.
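The CCC metric used here has a simple closed form. The sketch below is a hypothetical helper (not the authors' implementation) for comparing two acoustic contours, e.g. two F0 tracks from the SLM‑driven and Flow‑driven conditions; a CCC near 1 means the contours agree in both shape and scale:

```python
import numpy as np

def ccc(x, y):
    """Lin's concordance correlation coefficient between two 1-D contours.

    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return 2.0 * cov / (x.var() + y.var() + (mx - my) ** 2)
```

Unlike plain Pearson correlation, CCC penalizes mean and scale offsets, which is why it is a natural fit for checking whether two conditions produce the *same* prosodic contour rather than merely correlated ones.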
Having identified the SLM as the primary driver of emotional variability, the authors next locate the most “steerable” sub‑components within the SLM. They train linear probes on the hidden activations of each transformer layer and each operation (e.g., self‑attention, feed‑forward) to predict emotion labels. Layers in the middle‑to‑late range (approximately 10–17 for CosyVoice2, 5–10 for IndexTTS2) and especially the attention output vectors achieve the highest linear separability. This linear separability serves as a proxy for steerability: the more linearly separable the emotion representations, the more reliably a single direction vector can shift activations without disturbing other attributes.
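The layer-wise probing step can be sketched as follows. This is a minimal stand-in, not the authors' code: it fits a one-vs-rest least-squares linear probe on one layer's activations and reports training accuracy as a rough separability proxy; the synthetic activations and the injected class signal are purely illustrative:

```python
import numpy as np

def probe_accuracy(acts, labels, n_classes):
    """Fit a one-vs-rest least-squares linear probe; return training accuracy."""
    X = np.hstack([acts, np.ones((len(acts), 1))])  # add bias column
    Y = np.eye(n_classes)[labels]                   # one-hot emotion targets
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    preds = (X @ W).argmax(axis=1)
    return (preds == labels).mean()

# Hypothetical activations: 200 utterances, 16-dim hidden states, 4 emotions.
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=200)
acts = rng.normal(size=(200, 16))
acts[np.arange(200), labels] += 5.0  # pretend this layer encodes emotion linearly
print(probe_accuracy(acts, labels, 4))
```

In the paper's setting, one would run such a probe over every (layer, operation) pair of the SLM and pick the spot with the highest accuracy as the steering site.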
The steering vectors themselves are constructed via a mean‑difference approach. For each target emotion, the authors pair neutral utterances with emotion‑laden utterances from the same speaker and identical transcripts, thereby isolating acoustic emotion variation while controlling for lexical and speaker factors. The vector is the difference between the mean activation of the emotion set and the mean activation of its neutral counterpart, computed on the selected layer and operation (typically the attention output of a middle‑late layer). During inference, the vector is scaled by a user‑specified strength and added to the activation, effectively biasing the model toward the desired affect.
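The mean-difference construction and the inference-time injection reduce to a few lines. Below is a sketch under the assumption that paired emotional/neutral activations at the chosen layer have already been collected into arrays; the function names are hypothetical:

```python
import numpy as np

def steering_vector(emotion_acts, neutral_acts):
    """Mean difference between paired emotional and neutral activations,
    computed at the selected layer/operation (e.g., an attention output)."""
    return np.mean(emotion_acts, axis=0) - np.mean(neutral_acts, axis=0)

def apply_steering(hidden, vector, strength=1.0):
    """Bias a hidden activation toward the target emotion at inference time."""
    return hidden + strength * vector
```

In a real model, `apply_steering` would typically be wired in via a forward hook on the chosen transformer sub-module, with `strength` exposed as the user-facing control knob.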
Mixed‑emotion synthesis is achieved by linearly combining single‑emotion vectors with user‑defined weights that sum to one. This enables quantitative control over the proportion of each emotion (e.g., 70 % happiness, 30 % anger). Because the steering operates directly on the latent space, it can also override the emotion implied by the text, allowing controlled text‑emotion mismatch scenarios (e.g., a nervous laugh while delivering bad news).
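The weighted combination is a plain convex mixture of direction vectors. A minimal sketch, with hypothetical unit directions standing in for the learned emotion vectors:

```python
import numpy as np

def mixed_steering_vector(vectors, weights):
    """Linear combination of single-emotion directions; weights must sum to one."""
    w = np.asarray(weights, dtype=float)
    if not np.isclose(w.sum(), 1.0):
        raise ValueError("mixture weights should sum to one")
    return np.tensordot(w, np.stack(list(vectors)), axes=1)

# e.g. a 70% happiness / 30% anger blend (toy 2-D directions for illustration)
happy, angry = np.array([1.0, 0.0]), np.array([0.0, 1.0])
blend = mixed_steering_vector([happy, angry], [0.7, 0.3])
```

The blended vector is then injected exactly like a single-emotion vector, which is what makes the control quantitative: changing the weights changes the perceived proportion of each emotion.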
To evaluate these capabilities, the authors propose a multi‑rater annotation framework. Instead of forcing a single emotion label, listeners are allowed to assign multiple emotion tags and rate the degree of each, providing a richer picture of mixed‑emotion perception and of how well the synthesized speech diverges from the textual semantics. Experiments across multiple datasets and backbone models demonstrate that CoCoEmo’s steering yields higher perceived emotional diversity, naturalness, and controllability compared with conventional static conditioning methods.
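One simple way to aggregate such multi-tag ratings, shown here as an illustrative sketch rather than the paper's exact protocol, is to average each emotion's rated degree across raters, treating absent tags as zero:

```python
from collections import defaultdict

def perceived_mix(ratings):
    """Average per-emotion intensity over raters.

    Each rater supplies a dict of {emotion: degree}; emotions a rater
    did not tag implicitly contribute zero to that rater's scores.
    """
    totals = defaultdict(float)
    for rater in ratings:
        for emotion, degree in rater.items():
            totals[emotion] += degree
    return {e: s / len(ratings) for e, s in totals.items()}

# Two hypothetical raters judging the same synthesized clip
ratings = [{"happy": 0.8, "angry": 0.2}, {"happy": 0.6}]
```

The resulting per-emotion distribution can then be compared against the steering weights to measure how faithfully a requested mixture was perceived.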
In summary, the paper makes three key contributions: (1) an empirical demonstration that emotional prosody resides chiefly in the SLM of hybrid TTS systems; (2) a systematic method for locating and exploiting highly linearly separable layers/operations to extract effective steering vectors; and (3) a novel evaluation protocol that quantifies mixed‑emotion synthesis and text‑emotion mismatch. Together, these advances provide a principled roadmap for building TTS systems that can generate human‑like, composable, and controllable emotional speech without retraining the underlying model.