HolisticSemGes: Semantic Grounding of Holistic Co-Speech Gesture Generation with Contrastive Flow-Matching

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

While the field of co-speech gesture generation has seen significant advances, producing holistic, semantically grounded gestures remains a challenge. Existing approaches rely on external semantic retrieval methods, which limit their generalisation capability due to dependency on predefined linguistic rules. Flow-matching-based methods produce promising results; however, the network is optimised using only semantically congruent samples, without exposure to negative examples, so it learns rhythmic gestures rather than sparse motions such as iconic and metaphoric gestures. Furthermore, by modelling body parts in isolation, the majority of methods fail to maintain cross-modal consistency. We introduce a Contrastive Flow Matching-based co-speech gesture generation model that uses mismatched audio-text conditions as negatives, training the velocity field to follow the correct motion trajectory while repelling semantically incongruent trajectories. Our model ensures cross-modal coherence by embedding text, audio, and holistic motion into a composite latent space via cosine and contrastive objectives. Extensive experiments and a user study demonstrate that our proposed approach outperforms state-of-the-art methods on two datasets, BEAT2 and SHOW.


💡 Research Summary

The paper addresses a central limitation in co‑speech gesture generation: the inability to produce holistic, semantically grounded gestures that include sparse movements such as iconic and metaphoric gestures. Existing approaches either rely on external semantic retrieval pipelines, which restrict generalisation, or they employ flow‑matching or diffusion models that are trained only on semantically congruent samples. Consequently, these models tend to generate rhythmic, beat‑like gestures that lack semantic specificity, and they often suffer from cross‑modal inconsistency because body parts are processed in isolation.
To overcome these issues, the authors propose HolisticSemGes, a two‑stage framework that (1) learns a shared semantic latent space for audio, text, and full‑body motion via the Semantics‑Aware Composite Module (SACM), and (2) trains a Contrastive Flow Matching (CFM) model that explicitly uses mismatched audio‑text pairs as negative examples.
Stage 1 – Motion Prior Learning. Each body region (hands, upper body, lower body, face) is encoded with a hierarchical Residual VQ‑VAE (RVQ‑VAE). The encoder compresses SMPL‑X joint trajectories into discrete token sequences; a decoder reconstructs the motion, and a composite loss (reconstruction + VQ commitment) ensures high fidelity while preserving a compact latent representation. All regional latents are concatenated and normalised to form a holistic motion latent Z_G.
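The residual quantisation step above can be sketched as follows. This is a minimal NumPy illustration of how a hierarchical RVQ works in general, not the paper's implementation: each codebook layer quantises the residual left over by the previous layers, so later layers refine the reconstruction. Function and variable names here are hypothetical.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Residual VQ sketch: each layer quantises the residual left by the
    previous layers, yielding one token per layer plus a reconstruction.
    z: (d,) latent vector; codebooks: list of (K, d) arrays."""
    tokens = []
    quantized = np.zeros_like(z)
    residual = z.copy()
    for cb in codebooks:
        # pick the codeword nearest to the current residual
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        tokens.append(idx)
        quantized += cb[idx]          # accumulate coarse-to-fine codewords
        residual = z - quantized      # what remains for the next layer
    return tokens, quantized
```

In a full RVQ‑VAE, the token sequences from each body region's quantiser would be what gets concatenated into the holistic motion latent Z_G.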
Stage 2 – Semantic Alignment (SACM). Text transcripts are embedded with BERT, audio waveforms with HuBERT, and both are projected into a common d‑dimensional space using modality‑specific heads followed by ℓ₂‑normalisation. A weighted barycentric fusion (parameter α) combines the text and audio embeddings into a fused semantic vector \bar{Z}. Two alignment objectives are applied: (i) a sequence‑level cosine loss that aligns \bar{Z} and the motion latent across all time steps, and (ii) a CLIP‑style InfoNCE loss that treats other samples in the mini‑batch as negatives, encouraging discriminative cross‑modal embeddings. The total SACM loss L_sem = λ_cos L_cos + λ_clp L_clp simultaneously reduces semantic drift and enforces cross‑region coherence.
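A compact sketch of the two SACM objectives, assuming batch-level (B, d) embeddings that are already ℓ₂-normalised; the fusion weight α and loss weights follow the description above, while the temperature τ and all function names are assumptions for illustration.

```python
import numpy as np

def fuse(z_text, z_audio, alpha=0.5):
    """Weighted barycentric fusion of text and audio embeddings, renormalised."""
    z = alpha * z_text + (1.0 - alpha) * z_audio
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def sacm_loss(z_fused, z_motion, lam_cos=1.0, lam_clp=1.0, tau=0.07):
    """Sketch of L_sem = lam_cos * L_cos + lam_clp * L_clp.
    z_fused, z_motion: (B, d), l2-normalised embeddings."""
    # (i) cosine loss: pull each fused speech embedding toward its motion latent
    l_cos = (1.0 - (z_fused * z_motion).sum(axis=1)).mean()
    # (ii) CLIP-style InfoNCE: other samples in the mini-batch act as negatives
    logits = z_fused @ z_motion.T / tau           # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    l_clp = -np.diag(log_prob).mean()             # matched pairs on the diagonal
    return lam_cos * l_cos + lam_clp * l_clp
```

With matched pairs, the cosine term vanishes and the diagonal of the similarity matrix dominates the softmax, so the loss is lower than for permuted (mismatched) pairs, which is exactly the discriminative pressure the module is after.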
Contrastive Flow Matching. Standard flow‑matching learns a deterministic velocity field that transports Gaussian noise Z₀ to the target motion Z₁ via linear interpolation Z_t = (1‑t)Z₀ + tZ₁. The authors augment this with a contrastive term: for each training step, a mismatched audio‑text pair (e.g., audio from utterance i with text from utterance j) is fed to the conditioning network, producing a “negative” trajectory. The velocity field is trained to follow the true trajectory for the matched pair while being repelled from the negative trajectory. This contrastive pressure makes the generation path semantically faithful and stabilises training, especially for sparse gestures that would otherwise be averaged out.
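The contrastive flow-matching idea can be written down compactly. For linear interpolation paths Z_t = (1−t)Z₀ + tZ₁, the target velocity is simply Z₁ − Z₀; the sketch below attracts the predicted velocity to the matched trajectory and repels it from the mismatched one. The repulsion weight lam and the exact loss form are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def cfm_loss(v_pred, z0, z1_pos, z1_neg, lam=0.05):
    """Contrastive flow-matching sketch.
    v_pred: predicted velocity at Z_t = (1-t)*Z0 + t*Z1 (linear path);
    z1_pos: motion latent for the matched audio-text condition;
    z1_neg: motion latent conditioned on a mismatched audio-text pair."""
    v_pos = z1_pos - z0              # true velocity for the matched pair
    v_neg = z1_neg - z0              # velocity of the "negative" trajectory
    attract = ((v_pred - v_pos) ** 2).mean()
    repel = ((v_pred - v_neg) ** 2).mean()
    return attract - lam * repel     # follow the true path, flee the negative
```

A prediction equal to the matched velocity scores strictly lower than one equal to the negative velocity, which is what pushes the generation path toward semantically faithful motion.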
Multimodal conditioning is performed by concatenating the fused audio‑text embedding O and feeding it, together with the noisy motion latent Z_t, into a Temporal Cross‑Attention Module (TCAM). A learned body‑structure positional embedding p is added to preserve intra‑frame skeletal relationships. The TCAM output Z_s is then decoded by the motion decoder to obtain the final SMPL‑X parameters.
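A single-head sketch of the cross-attention pattern described above: the noisy motion latents (offset by the body-structure positional embedding p) form the queries, while the fused audio-text condition supplies keys and values. Shapes, projection matrices, and names are hypothetical simplifications of the TCAM.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_cross_attention(z_t, cond, p, w_q, w_k, w_v):
    """Minimal cross-attention sketch.
    z_t: (T, d) noisy motion latents; cond: (T, d) fused audio-text tokens;
    p: (T, d) body-structure positional embedding; w_*: (d, d) projections."""
    q = (z_t + p) @ w_q               # queries carry skeletal structure via p
    k, v = cond @ w_k, cond @ w_v     # keys/values come from the condition
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (T, T) attention weights
    return attn @ v                   # conditioned latent Z_s, shape (T, d)
```

The real module would stack several such layers with residual connections before handing Z_s to the motion decoder.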
Experiments. The method is evaluated on two benchmark datasets: BEAT2, a conversational gesture corpus, and SHOW, a presentation‑style speech corpus. Quantitative metrics include Fréchet Inception Distance (FID), diversity scores, and a novel Motion‑Text Alignment Score based on cosine similarity of the learned embeddings. Human evaluations assess naturalness, semantic appropriateness, and full‑body consistency. HolisticSemGes outperforms strong baselines—Diffusion‑based DiSHEG, standard Flow‑Matching GestureLSM, and semantic‑aware models such as GestureDiffCLIP and SemTalk—by a substantial margin on all metrics. Notably, the generation of iconic/metaphoric gestures improves dramatically, confirming that the contrastive component successfully prevents the model from collapsing to generic rhythmic motion. Ablation studies show that removing SACM degrades cross‑modal alignment, while omitting the contrastive term leads to semantically ambiguous gestures, underscoring the complementary nature of the two modules.
Limitations and Future Work. The negative samples are created by random audio‑text swapping; more sophisticated semantic negatives (e.g., antonymic sentences) could further sharpen the contrastive signal. The approach relies on pre‑trained BERT and HuBERT models, which may limit applicability to low‑resource languages. Future directions include integrating large‑scale multimodal pre‑training, exploring adaptive α weighting, and extending the framework to real‑time interactive settings.
In summary, HolisticSemGes introduces a principled way to embed semantics directly into the generative flow, achieving holistic, semantically grounded co‑speech gestures with improved diversity, realism, and cross‑modal consistency. The extensive experiments validate its superiority and open new avenues for research in multimodal human‑computer interaction.

