FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation
We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency. Unlike existing methods that rely on chunk-by-chunk generation or auto-regressive models with a diffusion head, we adopt a diffusion forcing framework to model this time-series generation task under time-varying control events. We find that a straightforward implementation of vanilla diffusion forcing (as proposed for video models) fails to model real motion distributions. We demonstrate that, to guarantee modeling of the output distribution, vanilla diffusion forcing must be tailored to: (i) train with bi-directional attention instead of causal attention; (ii) implement a lower-triangular time scheduler instead of a random one; (iii) introduce text conditioning in a continuous, time-varying manner. With these improvements, we demonstrate for the first time that a diffusion forcing-based framework achieves state-of-the-art performance on the streaming motion generation task, reaching an FID of 0.057 on the HumanML3D benchmark. Models, code, and weights are available. https://shandaai.github.io/FloodDiffusion/
💡 Research Summary
FloodDiffusion introduces a novel framework for real‑time, text‑driven streaming human motion generation. The problem setting differs from traditional text‑to‑motion tasks: instead of a static caption, the system receives a sequence of time‑varying prompts (e.g., “raise knees” followed by “squats”) and must instantly reflect each new instruction while maintaining smooth, physically plausible motion. Existing streaming approaches either rely on chunk‑by‑chunk diffusion (e.g., PRIMAL), which incurs high “first‑token” latency because a full chunk must be filled before any output can be produced, or on auto‑regressive models with a diffusion head (e.g., MotionStreamer), which struggle to exploit long‑range motion history effectively.
The authors turn to diffusion forcing, a technique originally proposed for video generation where each frame is assigned a distinct noise level, allowing flexible, frame‑wise denoising. A naïve application of vanilla diffusion forcing to motion fails to capture the true motion distribution. To remedy this, FloodDiffusion makes three essential modifications, each backed by theoretical analysis:
- Bidirectional attention – In streaming, the active window at time t contains not only past frames but also frames that have not yet been fully denoised. A strictly causal mask would discard useful future context within this window, preventing the model from incorporating the most recent text prompt. By employing bidirectional self‑attention inside the active window, the model can attend to all frames that are currently “alive,” ensuring that the newest textual instruction immediately influences the denoising of all relevant frames.
- Lower‑triangular (cascading) time schedule – Vanilla diffusion forcing samples random timesteps for each frame, which creates a mismatch between training and inference schedules and destroys the clean factorization needed for exact likelihood. The authors define vectorized schedules αₖ(t)=clamp(t−k·nₛ,0,1) and βₖ(t)=1−αₖ(t), where nₛ is the streaming step size. This yields a deterministic, lower‑triangular pattern: at any global time t, frames with index k < m(t) are already fully denoised (αₖ=1, βₖ=0), frames with k ≥ n(t) are still pure noise (αₖ=0, βₖ=1), and only frames in the interval m(t) ≤ k < n(t) are partially denoised, forming the sliding active window.
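The cascading schedule above is simple to compute. The sketch below evaluates αₖ(t) and βₖ(t) for all frames at a given global time t, using the clamp formula from the summary; the function name, frame count, and step size are illustrative choices, not from the paper.

```python
import numpy as np

def cascade_schedule(t, num_frames, n_s):
    """Per-frame clean fraction alpha_k(t) = clamp(t - k * n_s, 0, 1)
    and noise fraction beta_k(t) = 1 - alpha_k(t)."""
    k = np.arange(num_frames)
    alpha = np.clip(t - k * n_s, 0.0, 1.0)
    beta = 1.0 - alpha
    return alpha, beta

alpha, beta = cascade_schedule(t=2.0, num_frames=6, n_s=0.5)
# alpha == [1.0, 1.0, 1.0, 0.5, 0.0, 0.0]
# Frames with alpha == 1 are fully denoised (past context),
# frames with alpha == 0 are still pure noise,
# and frames with 0 < alpha < 1 form the active window in which
# bidirectional attention operates.
```

Advancing t by one denoising step shifts the window one frame to the right, which is the lower-triangular pattern: each frame's noise level decreases monotonically and deterministically, matching training and inference schedules.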