Event-T2M: Event-level Conditioning for Complex Text-to-Motion Synthesis
Text-to-motion generation has advanced with diffusion models, yet existing systems often collapse complex multi-action prompts into a single embedding, leading to omissions, reordering, or unnatural transitions. In this work, we shift perspective by introducing a principled definition of an event as the smallest semantically self-contained action or state change in a text prompt that can be temporally aligned with a motion segment. Building on this definition, we propose Event-T2M, a diffusion-based framework that decomposes prompts into events, encodes each with a motion-aware retrieval model, and integrates them through event-based cross-attention in Conformer blocks. Existing benchmarks mix simple and multi-event prompts, making it unclear whether models that succeed on single actions generalize to multi-action cases. To address this, we construct HumanML3D-E, the first benchmark stratified by event count. Experiments on HumanML3D, KIT-ML, and HumanML3D-E show that Event-T2M matches state-of-the-art baselines on standard tests while outperforming them as event complexity increases. Human studies validate the plausibility of our event definition, the reliability of HumanML3D-E, and the superiority of Event-T2M in generating multi-event motions that preserve order and naturalness close to ground-truth. These results establish event-level conditioning as a generalizable principle for advancing text-to-motion generation beyond single-action prompts.
💡 Research Summary
Event‑T2M: Event‑Level Conditioning for Complex Text‑to‑Motion Synthesis introduces a principled way to handle multi‑action textual prompts that have long plagued diffusion‑based text‑to‑motion models. The authors first define an event as the smallest semantically self‑contained action or state change that can be temporally isolated and aligned with a contiguous motion segment. This definition draws inspiration from temporal action segmentation literature and serves as an intermediate granularity between individual words and whole sentences.
To operationalize the definition, a large language model (Gemini 2.5 Flash) parses a prompt into clauses. Each clause is judged on three criteria: (1) it describes a single agent’s action or state change, (2) it is semantically understandable without surrounding context, and (3) it maps to a coherent motion segment. The resulting sequence of clauses constitutes the event list.
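As a rough intuition for what the LLM-based parser produces, the clause splitting can be approximated by a connective-based heuristic. This is only an illustrative proxy, not the paper's method: the actual system uses Gemini 2.5 Flash with the three criteria above, and a regex cannot handle ambiguous phrasing or verify semantic self-containedness.

```python
import re

def split_into_events(prompt: str) -> list[str]:
    # Hypothetical heuristic stand-in for the paper's LLM-based parser:
    # split on connectives that typically separate sequential actions.
    connectives = r",?\s*\b(?:and then|then|before|after that|while)\b\s*"
    clauses = re.split(connectives, prompt.strip().rstrip("."))
    # Drop empty fragments and leftover punctuation.
    return [c.strip(" ,") for c in clauses if c.strip(" ,")]

events = split_into_events("A person walks forward, then sits down, then waves.")
# Each element would then be checked against the three criteria
# (single agent, self-contained, maps to one motion segment).
```

A real implementation would instead prompt the LLM with the three criteria and validate each returned clause, falling back to the whole sentence as a single event when no valid split exists.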
Instead of the ubiquitous CLIP encoder that collapses an entire prompt into a single global vector, the authors employ a motion‑specific Text‑to‑Motion Retrieval (TMR) encoder. Each event clause is embedded by TMR into a D_y‑dimensional event token E_k, while the whole prompt is also encoded into a global token G. The global token provides holistic context when event cues are ambiguous, ensuring overall coherence.
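The shapes involved can be sketched as follows. The `encode` function below is a dummy stand-in for the TMR text encoder (a deterministic random projection, purely to illustrate dimensions); `D_y` is an illustrative value, not taken from the paper.

```python
import numpy as np

D_y = 16  # event-token dimension (illustrative, not from the paper)

def encode(text: str) -> np.ndarray:
    # Dummy stand-in for the TMR text encoder: a pseudo-random vector
    # keyed on the text, used only to make the shapes concrete.
    local = np.random.default_rng(abs(hash(text)) % (2**32))
    return local.normal(size=D_y)

events = ["walks forward", "sits down", "waves"]
E = np.stack([encode(e) for e in events])  # event tokens, shape (K, D_y)
G = encode("A person walks forward, then sits down, then waves.")  # global token, (D_y,)
```

The generator then conditions on both: the event matrix `E` drives the event-level cross-attention, while the global token `G` supplies holistic context.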
The core generation model is a 10‑step denoising diffusion probabilistic model (DDPM) built from stacked Conformer blocks. Within each block the following modules are applied in order:
- Local Information Modeling Module (LIMM) – a depthwise‑pointwise 1‑D convolution stack that enforces short‑range smoothness with negligible parameter overhead.
- Adaptive Textual Information Injector (ATII) – motion tokens are down‑sampled, then combined with G through a channel‑wise gating mechanism, producing locally‑aware text features that are fused back into the motion stream.
- Conformer Self‑Attention (ConformerSA) – multi‑head self‑attention with relative temporal bias captures long‑range dependencies.
- Conformer Convolution (ConformerConv) – depthwise‑separable convolution with GLU models fine‑grained phase dynamics.
- Event‑Based Cross‑Attention (ECA) – replaces the standard self‑attention sublayer; motion tokens act as queries, while the set of event tokens serves as keys and values. Multi‑head cross‑attention thus injects event‑level semantics directly into the motion representation. A learnable scaling factor γ (initialized near zero) stabilizes training.
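The ECA sublayer can be sketched in a few lines. The version below is a simplified single-head, numpy-only illustration under assumed identity projections; the paper's module is multi-head and learned, but the query/key/value roles and the γ-gated residual are as described above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def event_cross_attention(motion, events, Wq, Wk, Wv, gamma):
    # motion: (T, D) motion tokens act as queries;
    # events: (K, D) event tokens act as keys and values.
    Q, Km, V = motion @ Wq, events @ Wk, events @ Wv
    attn = softmax(Q @ Km.T / np.sqrt(Q.shape[-1]))  # (T, K) weights over events
    # gamma is the learnable scale from the paper, initialized near zero
    # so the event pathway ramps up gradually and training stays stable.
    return motion + gamma * (attn @ V)

rng = np.random.default_rng(0)
T, K, D = 5, 3, 8                 # 5 motion tokens, 3 events (toy sizes)
motion = rng.normal(size=(T, D))
events = rng.normal(size=(K, D))
W = np.eye(D)                     # identity projections, for illustration only
out = event_cross_attention(motion, events, W, W, W, gamma=0.1)
```

With γ at its near-zero initialization the sublayer is close to an identity mapping, which is exactly the stabilization effect the authors describe.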
The diffusion loss is the usual L2 reconstruction of the clean motion x₀ from a noisy sample x_t, conditioned on both G and the event matrix E. Random text dropout (probability τ) creates an unconditional branch, enabling Classifier‑Free Guidance (CFG) at inference time to sharpen alignment without sacrificing diversity.
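The training-time dropout and inference-time guidance can be sketched as below. The null conditioning is shown as a zero embedding, which is an assumption (the paper does not specify the null token); the CFG combination rule itself is the standard formula.

```python
import numpy as np

def maybe_drop_text(cond, tau, rng):
    # Random text dropout: with probability tau the conditioning is
    # replaced by a null embedding (zeros here, an assumption), so the
    # same network also learns an unconditional denoising branch.
    return np.zeros_like(cond) if rng.random() < tau else cond

def cfg_combine(eps_uncond, eps_cond, w):
    # Classifier-Free Guidance at inference: extrapolate from the
    # unconditional prediction toward the conditional one with scale w.
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = np.array([0.1, -0.2])     # toy unconditional prediction
eps_c = np.array([0.3, 0.0])      # toy conditional prediction
guided = cfg_combine(eps_u, eps_c, w=2.5)
```

At `w = 1` the guided prediction reduces to the conditional one; larger scales sharpen text alignment at some cost to diversity, which is the trade-off the authors exploit.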
A major contribution is the new benchmark HumanML3D‑E, derived from the original HumanML3D dataset but stratified by the number of events (1‑5). By holding overall motion length constant and only varying event count, the benchmark isolates compositional difficulty and permits fair comparison of models’ ability to preserve order and transition quality.
Experimental Findings
- On standard HumanML3D and KIT‑ML test sets, Event‑T2M matches or slightly exceeds state‑of‑the‑art baselines (AttT2M, MotionDiffuse, etc.) across R‑Precision, FID, MM‑Dist, Multi‑Modality, and Top‑k metrics.
- On HumanML3D‑E, performance gaps widen dramatically as event count rises. While baselines see R‑Precision drop from ~0.5 (single‑event) to ~0.3 (five‑event) and FID increase from ~0.68 to >0.90, Event‑T2M maintains R‑Precision around 0.71 and reduces FID to 0.45 for five‑event prompts, demonstrating robust compositional handling.
- Human user studies (over 200 participants) evaluate three aspects: (i) plausibility of the event definition, (ii) reliability of HumanML3D‑E as a difficulty stratifier, and (iii) perceptual quality of generated motions. Event‑T2M receives the highest scores on order preservation (≈85% positive) and naturalness (≈4.6/5), outperforming all baselines.
Insights and Implications
The paper shows that collapsing a multi‑step description into a single embedding is the root cause of omitted or reordered actions. By explicitly modeling events, the system can attend to each semantic unit, enforce temporal ordering via cross‑attention, and still benefit from a global context token. The use of a motion‑specific retrieval encoder further aligns textual semantics with the motion manifold, something CLIP’s image‑text pretraining cannot provide.
Event‑T2M’s architecture also illustrates a balanced design: LIMM handles fine‑grained kinematics, ATII supplies locally relevant textual cues, Conformer layers capture both long‑range dependencies and short‑range dynamics, and ECA injects high‑level compositional guidance. The learnable scaling γ and residual weighting (0.5) are practical tricks that stabilize training under strong event supervision.
Limitations and Future Directions
- Event segmentation relies on a pre‑trained LLM and heuristic rules; unusual phrasing, ambiguous language, or multi‑agent scenarios may lead to incorrect event splits.
- The 10‑step DDPM, while faster than earlier diffusion pipelines, still incurs non‑trivial compute cost, limiting real‑time deployment.
- HumanML3D‑E focuses on single‑agent actions; extending the benchmark to multi‑agent interactions, hierarchical tasks, or longer narratives would further stress‑test compositional capabilities.
Conclusion
Event‑T2M establishes event‑level conditioning as a generalizable principle for text‑to‑motion synthesis. By decomposing prompts into semantically atomic events, encoding them with a motion‑aware retriever, and integrating them through dedicated cross‑attention, the model preserves action order, respects transitions, and generates motions that are perceptually close to ground truth even for highly compositional prompts. The newly released HumanML3D‑E benchmark provides the community with a rigorous tool to evaluate compositionality, and the paper’s findings pave the way for more reliable animation pipelines, video production tools, and embodied agents that must follow complex textual instructions.