ExpGest: Expressive Speaker Generation Using Diffusion Model and Hybrid Audio-Text Guidance


Existing gesture generation methods primarily focus on upper-body gestures driven by audio features, neglecting speech content, emotion, and locomotion. These limitations result in stiff, mechanical gestures that fail to convey the true meaning of the audio content. We introduce ExpGest, a novel framework leveraging synchronized text and audio information to generate expressive full-body gestures. Unlike AdaIN or one-hot encoding methods, we design a noise emotion classifier for optimizing adversarial direction noise, avoiding melody distortion and guiding results toward specified emotions. Moreover, aligning semantics and gestures in the latent space provides better generalization capabilities. ExpGest, a diffusion model-based gesture generation framework, is the first attempt to offer mixed generation modes, including audio-driven gestures and text-shaped motion. Experiments show that our framework effectively learns from combined text-driven motion and audio-induced gesture datasets, and preliminary results demonstrate that ExpGest achieves more expressive, natural, and controllable global motion for speakers compared to state-of-the-art models.


💡 Research Summary

ExpGest introduces a diffusion‑based framework that simultaneously leverages textual transcripts and acoustic signals to generate expressive full‑body co‑speech gestures. The authors identify three major shortcomings of prior work: (1) most methods focus only on upper‑body motion, (2) they ignore speech content, emotion, and locomotion, and (3) they treat audio features as a monolithic cue, leading to stiff, mechanical gestures. ExpGest addresses these gaps by (i) unifying diverse motion datasets (BEAT, AMASS, 100‑STYLE) into a common SMPL‑X‑based representation (rot6D rotation + 3‑D position, velocity, and contact signals) yielding a 994‑dimensional per‑frame feature vector; (ii) synthesizing 20 K artificial text‑audio‑motion pairs to compensate for the scarcity of mixed‑modality data; and (iii) designing a diffusion process that respects human kinematic constraints, replacing the usual noise‑prediction step with a reconstruction of the original human pose at each denoising iteration.
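The unified per-frame representation can be sketched as a simple flattening of per-joint features into one vector. The joint count and contact dimensions below are illustrative assumptions (the summary only states that rot6D rotations, 3-D positions, velocities, and contact signals combine into a 994-D vector), and the function name is mine:

```python
import numpy as np

def frame_features(rot6d, pos, vel, contacts):
    """Flatten one motion frame into a single feature vector, in the spirit
    of the paper's SMPL-X-based representation: 6-D rotations, 3-D joint
    positions, 3-D joint velocities, and binary contact signals.
    Dimensions here are illustrative, not the paper's exact 994-D layout."""
    return np.concatenate([
        rot6d.ravel(),     # (J, 6) continuous rot6D rotations
        pos.ravel(),       # (J, 3) joint positions
        vel.ravel(),       # (J, 3) joint velocities
        contacts.ravel(),  # e.g. foot-contact flags
    ])
```

For 55 SMPL-X-style joints and 4 contact flags this yields a 664-D vector; the paper's 994-D vector presumably includes additional channels not detailed in the summary.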

The diffusion backbone is a 12-layer Transformer encoder that, at each denoising iteration, reconstructs the clean gesture x̂0 (rather than predicting noise ε) given (a) the current noise step t, (b) a seed pose, (c) a CLIP-encoded text description, (d) a WavLM-encoded audio sequence (interpolated to match the gesture frame rate), and (e) a semantic latent code produced by a dedicated alignment module. The forward diffusion adds Gaussian noise according to a standard β-schedule; the reverse process reconstructs x̂0 from the noisy x_t, then recomposes x_{t−1} from the reconstructed pose, ensuring physical plausibility.

A key insight is the differential sensitivity of hands and arms to melodic versus semantic cues. Empirical observation shows that subtle pitch variations primarily affect hand articulation, whereas semantic emphasis (e.g., “important”) drives larger arm swings. ExpGest encodes this insight by assigning distinct weights to melody and semantics for hand and arm streams, allowing the model to generate gestures that faithfully follow both prosody and meaning.
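A toy illustration of this decoupling, assuming a simple weighted fusion of the two cue streams per body part (the actual weights and fusion mechanism are not specified in the summary; these numbers are placeholders):

```python
import torch

def fuse_cues(melody_feat, semantic_feat, part):
    """Illustrative hand/arm decoupling: hands follow melody (pitch)
    more strongly, arms follow semantics more strongly.
    The (melody, semantic) weights below are assumptions for illustration."""
    weights = {"hand": (0.8, 0.2), "arm": (0.3, 0.7)}
    w_mel, w_sem = weights[part]
    return w_mel * melody_feat + w_sem * semantic_feat
```

The point is only that the two streams receive different mixing coefficients, so a pitch spike moves the hands while a semantically stressed word drives the arms.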

Semantic alignment is achieved through a contrastive latent‑space learning scheme. A VAE‑based gesture encoder and a BERT‑based transcript encoder map gestures and transcriptions into a shared embedding space. Global average pooling extracts a compact representation, and NT‑Xent loss maximizes similarity for matching pairs while minimizing it for mismatched pairs. After training, only the transcript encoder is kept and fed into the diffusion model, guaranteeing that generated gestures remain semantically consistent with the input text.
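The NT-Xent objective over pooled gesture/transcript embeddings can be sketched as follows; the encoders themselves are omitted, and the symmetric two-direction form is my assumption about the details:

```python
import torch
import torch.nn.functional as F

def nt_xent(gesture_emb, text_emb, temperature=0.07):
    """NT-Xent (normalized temperature-scaled cross-entropy) over a batch
    of paired pooled embeddings: the i-th gesture should match the i-th
    transcript (diagonal) and mismatch every other one in the batch."""
    g = F.normalize(gesture_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = g @ t.T / temperature     # (B, B) scaled cosine similarities
    targets = torch.arange(g.size(0))  # matching pairs lie on the diagonal
    # Symmetric loss: gesture -> text and text -> gesture
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

After this contrastive pretraining, only the transcript encoder survives into the diffusion model, so semantic consistency is enforced without running the gesture VAE at inference time.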

Emotion control is handled via a novel “noise emotion classifier”. Instead of concatenating one-hot emotion vectors to the conditioning, the authors train a classifier on noisy gesture–emotion pairs. During sampling, the classifier evaluates the current noisy gesture x_t, computes the gradient of an emotion loss with respect to x_t, and updates x_t by a scaled gradient step (α·∇L). This decouples emotion guidance from the diffusion graph, preserving the original semantic and melodic information while steering the motion toward the desired affective style. The approach also enables smooth transitions between emotions, which one-hot embeddings cannot provide.
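This is classifier guidance in the style of guided diffusion; a minimal sketch of the per-step update, with illustrative names (the paper's exact loss and step scaling are not given in the summary):

```python
import torch

def emotion_guided_step(x_t, classifier, target_emotion, alpha=0.5):
    """Nudge a noisy gesture x_t toward a target emotion class by ascending
    the classifier's log-probability of that class: x_t + alpha * grad.
    The classifier is assumed to be trained on noisy gesture-emotion pairs."""
    x = x_t.detach().requires_grad_(True)
    logits = classifier(x)
    log_prob = torch.log_softmax(logits, dim=-1)[..., target_emotion].sum()
    grad, = torch.autograd.grad(log_prob, x)  # gradient w.r.t. x_t only
    return (x + alpha * grad).detach()
```

Because the update touches only x_t and not the diffusion network's conditioning, the semantic and melodic content of the sample is left intact, and interpolating the target class (or α) over time yields smooth emotion transitions.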

Training details: motion data are down‑sampled to 20 FPS, truncated or padded to a uniform length of 180 frames, and the maximum CLIP text length is set to 20 tokens. The diffusion process runs for 1,000 steps; the model is trained for roughly 72 hours on four NVIDIA V100 GPUs (≈800 K steps). The loss function combines a Huber reconstruction loss on the predicted clean pose with the semantic alignment and emotion guidance components.
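The reconstruction term is a standard Huber (smooth-L1) loss on the predicted clean pose; a sketch, with the hyperparameters above collected into an illustrative config dict (the dict keys are mine):

```python
import torch
import torch.nn.functional as F

# Hyperparameters as stated in the summary (key names are illustrative)
CONFIG = {
    "fps": 20,
    "clip_len_frames": 180,
    "max_text_tokens": 20,
    "diffusion_steps": 1000,
}

def reconstruction_loss(x0_pred, x0_true):
    """Huber (smooth-L1) loss on the predicted clean pose x0.
    The full objective adds semantic-alignment and emotion-guidance terms."""
    return F.smooth_l1_loss(x0_pred, x0_true)
```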

Evaluation uses three metrics: (1) Fréchet Gesture Distance (FGD) to measure distributional similarity between generated and real gestures; (2) a newly proposed Semantic Alignment (SA) score, computed as the cosine similarity between averaged latent embeddings of generated gestures and corresponding audio transcripts; and (3) emotion accuracy, assessing how well the generated motion matches the target affect. On the BEAT dataset (76 h of multimodal speech spanning 8 emotions and 4 languages; the experiments use the English subset) plus locomotion data from AMASS and 100‑STYLE, ExpGest achieves lower FGD, higher SA, and superior emotion alignment compared with state‑of‑the‑art baselines such as DiffuseStyleGesture and EMoG. Qualitative examples demonstrate expressive hand gestures synchronized with pitch contours, arm movements aligned with semantic emphasis, and coherent full‑body locomotion when a textual description of walking or running is provided.
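FGD follows the same formula as FID: the Fréchet distance between Gaussians fit to real and generated feature sets. A sketch in plain NumPy, assuming features have already been extracted (the feature extractor is not part of this snippet); the trace of the matrix square root is computed via eigenvalues of the covariance product:

```python
import numpy as np

def frechet_gesture_distance(real_feats, gen_feats):
    """Fréchet distance between Gaussians fitted to real vs. generated
    gesture features: ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^(1/2)).
    Tr((C1 C2)^(1/2)) equals the sum of square roots of the (real,
    nonnegative) eigenvalues of C1 @ C2 for PSD C1, C2."""
    mu1, mu2 = real_feats.mean(0), gen_feats.mean(0)
    c1 = np.cov(real_feats, rowvar=False)
    c2 = np.cov(gen_feats, rowvar=False)
    eigvals = np.linalg.eigvals(c1 @ c2)
    tr_sqrt = np.sum(np.sqrt(np.maximum(eigvals.real, 0.0)))
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(c1) + np.trace(c2) - 2.0 * tr_sqrt)
```

The SA score is simpler: cosine similarity between the pooled latent embedding of a generated gesture and that of its transcript, using the encoders from the alignment module.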

In summary, ExpGest makes four major contributions: (i) a unified full‑body gesture generation framework that accepts audio‑only, text‑only, or mixed audio‑text inputs; (ii) a hand‑arm decoupling strategy that assigns melody‑relevant weights to hands and semantics‑relevant weights to arms; (iii) a contrastive semantic alignment module that ensures generated motions faithfully reflect textual meaning; and (iv) a gradient‑based noise emotion classifier that provides smooth, controllable affect without compromising content. The paper also discusses limitations, notably the reliance on synthetic mixed‑modality data and the need for inference‑time speed optimizations for real‑time applications. Nonetheless, ExpGest represents a significant step toward more natural, expressive, and controllable virtual speakers for entertainment, education, and human‑computer interaction.

