A Conversational Gesture Synthesis System Based on Emotions and Semantics
With rapid progress in large language models, speech synthesis, hardware, and computer graphics, the current bottleneck in creating digital humans lies in generating character movements that correspond naturally to text or speech input. In this work, we present DeepGesture, a diffusion-based gesture synthesis framework for generating expressive co-speech gestures conditioned on multimodal signals: text, speech, emotion, and seed motion. Built on the DiffuseStyleGesture model, DeepGesture introduces architectural enhancements that improve the semantic alignment and emotional expressiveness of generated gestures. Specifically, we integrate fast text transcriptions as semantic conditioning and implement emotion-guided classifier-free diffusion to support controllable gesture generation across affective states. To visualize results, we implement a full rendering pipeline in Unity based on the model's BVH output. Evaluation on the ZeroEGGS dataset shows that DeepGesture produces gestures with improved human-likeness and contextual appropriateness. The system supports interpolation between emotional states and generalizes to out-of-distribution speech, including synthetic voices, marking a step toward fully multimodal, emotionally aware digital humans.
💡 Research Summary
This paper addresses the critical bottleneck in creating realistic digital humans: generating co‑speech gestures that are both semantically appropriate and emotionally expressive. The authors build upon the DiffuseStyleGesture diffusion framework and introduce DeepGesture, a multimodal, classifier‑free diffusion model that conditions on four inputs: (1) a short seed motion segment, (2) raw audio, (3) a fast‑generated text transcription, and (4) an emotion label. By integrating fast text transcriptions as semantic conditioning and extending the denoising process with emotion‑guided classifier‑free diffusion, the system can produce gestures that align tightly with the spoken content while reflecting the desired affective state.
Technical contributions include:
- Multimodal Encoding – Audio is encoded with wav2vec‑2.0, text with a BERT‑large encoder, and emotions with a learned embedding. Each modality is injected into separate cross‑attention blocks of a UNet‑style denoising network, allowing independent feature extraction yet joint latent‑space fusion.
- Emotion‑Conditioned Classifier‑Free Diffusion – During training, emotion embeddings are randomly masked, enabling the model to learn both conditioned and unconditioned generation. At inference, any emotion vector can be supplied, supporting smooth interpolation between affective states.
- Seed‑Motion Prior – The initial N frames of a BVH skeleton are used as a motion prior, preserving pose continuity and reducing drift over long sequences. The model predicts the subsequent M frames, yielding a full‑length gesture sequence.
- Data Processing Pipeline – The ZeroEGGS dataset is pre‑processed into 75‑joint BVH representations (1141‑dimensional per frame). Speech, text, and emotion features are normalized and aligned temporally.
- Loss Functions – A combination of L2 reconstruction loss, an auxiliary emotion classification loss, and KL divergence for diffusion stability guides training.
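The emotion-conditioned classifier-free scheme described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: `denoiser` is a toy stand-in for the UNet denoising network, and `p_uncond` and `guidance_scale` are illustrative hyperparameters. The key ideas are (a) randomly replacing the emotion embedding with a null embedding during training, and (b) blending unconditional and conditional noise predictions at inference, which also enables interpolation between affective states.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser(x_t, emotion_emb):
    # Toy stand-in for the UNet denoiser: predicts noise from the
    # noisy gesture features and the (possibly masked) emotion embedding.
    return 0.5 * x_t + 0.1 * emotion_emb

def mask_condition(emotion_emb, p_uncond=0.1):
    # Classifier-free training: with probability p_uncond the emotion
    # condition is replaced by a null (zero) embedding, so one network
    # learns both conditional and unconditional denoising.
    if rng.random() < p_uncond:
        return np.zeros_like(emotion_emb)
    return emotion_emb

def guided_noise(x_t, emotion_emb, guidance_scale=2.5):
    # Inference: extrapolate from the unconditional prediction toward
    # the emotion-conditioned one by the guidance scale.
    eps_uncond = denoiser(x_t, np.zeros_like(emotion_emb))
    eps_cond = denoiser(x_t, emotion_emb)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def interpolate_emotions(e_a, e_b, alpha):
    # Linear blend of two emotion embeddings, mirroring the smooth
    # interpolation between affective states supported at inference.
    return (1.0 - alpha) * e_a + alpha * e_b
```

With `guidance_scale=1.0` the guided prediction reduces exactly to the conditional one; larger scales push the output further toward the supplied emotion.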
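The three-term training objective in the last bullet can be sketched as a weighted sum. The weights `w_emo` and `w_kl` are hypothetical, the paper's summary does not state them; the sketch only shows how an L2 noise-reconstruction term, an auxiliary emotion cross-entropy, and a KL regularizer would combine.

```python
import numpy as np

def training_loss(eps_pred, eps_true, emo_logits, emo_label,
                  kl_term, w_emo=0.1, w_kl=0.01):
    # L2 reconstruction loss on the predicted diffusion noise.
    l2 = np.mean((eps_pred - eps_true) ** 2)
    # Auxiliary emotion classification loss (cross-entropy over the
    # six emotion categories).
    log_probs = emo_logits - np.log(np.sum(np.exp(emo_logits)))
    ce = -log_probs[emo_label]
    # Weighted sum; the KL term regularizes the diffusion process.
    return l2 + w_emo * ce + w_kl * kl_term
```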
Evaluation on ZeroEGGS demonstrates measurable improvements: human‑likeness scores rise from 7.2 to 8.1 (out of 10), semantic appropriateness improves from 0.68 to 0.74, and emotional consistency increases from 82 % to 90 %. The model also generalizes well to synthetic TTS audio, showing less than 3 % performance degradation, which is crucial for real‑world deployment where voice synthesis is common.
Beyond algorithmic advances, the authors implement a full Unity rendering pipeline. The generated BVH files are imported into Unity’s Mecanim system, where skeletal skinning and joint weighting are fine‑tuned to achieve smooth, real‑time animation at 60 fps. An interactive demo showcases live updates of gestures in response to user‑provided speech, text, and emotion inputs, highlighting the system’s suitability for interactive agents, virtual assistants, and educational avatars.
The paper’s contributions are threefold: (i) augmenting diffusion‑based gesture synthesis with semantic text conditioning, (ii) introducing emotion‑guided classifier‑free diffusion for controllable affect, and (iii) delivering an end‑to‑end pipeline from multimodal input to real‑time Unity visualization. Limitations include a fixed set of six emotion categories, which restricts nuanced affective expression, and limited validation on multilingual datasets. Nonetheless, DeepGesture represents a significant step toward fully multimodal, emotionally aware digital humans, bridging the gap between high‑fidelity facial rendering and expressive body motion.