Riemannian Motion Generation: A Unified Framework for Human Motion Representation and Generation via Riemannian Flow Matching
Human motion generation is typically learned in flat Euclidean space, even though valid motions lie on structured non-Euclidean geometry. We present Riemannian Motion Generation (RMG), a unified framework that represents motion on a product manifold and learns dynamics via Riemannian flow matching. RMG factorizes motion into several manifold factors, yielding a scale-free representation with intrinsic normalization, and uses geodesic interpolation, tangent-space supervision, and manifold-preserving ODE integration for training and sampling. On HumanML3D, RMG achieves state-of-the-art FID in the HumanML3D format (0.043) and ranks first on all reported metrics under the MotionStreamer format. On MotionMillion, it also surpasses strong baselines (FID 5.6, R@1 0.86). Ablations show that the compact $\mathscr{T}+\mathscr{R}$ (translation + rotations) representation is the most stable and effective, highlighting geometry-aware modeling as a practical and scalable route to high-fidelity motion generation.
💡 Research Summary
The paper introduces Riemannian Motion Generation (RMG), a novel framework that treats human motion as points on a product Riemannian manifold rather than in a flat Euclidean space. The authors decompose each motion frame into two natural factors: global translation (T) residing in ℝ³ and a set of per‑joint rotations (R) represented by unit quaternions on the 3‑sphere S³. By stacking J joint quaternions they obtain the rotation manifold (S³)ᴶ, yielding the overall motion manifold M = ℝ³ × (S³)ᴶ. This representation is intrinsically scale‑free and normalized, eliminating the need for external mean‑std normalization, and it reduces per‑joint dimensionality relative to common 6‑D rotation encodings (4 vs. 6 components per joint).
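The factorization above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the function name and the toy frame are our own, and the only operation needed to land on the manifold is per‑row quaternion normalization, which is exactly why no external mean‑std statistics are required.

```python
import numpy as np

def to_manifold(translation, quats):
    """Place a raw motion frame on M = R^3 x (S^3)^J.

    translation: (3,) global root translation, unconstrained in R^3.
    quats:       (J, 4) per-joint rotations; each row is renormalized onto S^3,
                 so no external mean-std normalization is needed.
    """
    quats = np.asarray(quats, dtype=float)
    return (np.asarray(translation, dtype=float),
            quats / np.linalg.norm(quats, axis=-1, keepdims=True))

# A toy frame with J = 2 joints: the second quaternion has the wrong scale
# and is snapped back onto the unit sphere.
t, q = to_manifold([0.1, 0.9, 0.0], [[1.0, 0.0, 0.0, 0.0],
                                     [0.0, 2.0, 0.0, 0.0]])
```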
To generate motions, RMG adopts Riemannian flow matching, a continuous‑time generative technique that learns a time‑dependent velocity field on the manifold. For a data sample x₁ and a prior sample x₀ drawn from a Riemannian Gaussian (wrapped Gaussian centered at the rest pose), the method constructs the geodesic interpolation xₜ = Expₓ₀(t·Logₓ₀(x₁)). The target velocity at any intermediate state is vₜ(xₜ|x₁) = (1/(1‑t))·Logₓₜ(x₁), which lives in the tangent space TₓₜM. A neural network v_θ predicts this velocity; its Euclidean output is projected onto the tangent space via the appropriate projection operator. Training minimizes the mean‑squared error between predicted and target tangent velocities. At inference, the learned vector field is integrated using a Riemannian Euler (or higher‑order) scheme: xₜ₊ₕ = Expₓₜ(h·Π_TₓₜM v_θ(xₜ, t)), guaranteeing that every intermediate state remains on the manifold.
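For a single S³ factor, the geodesic path, the tangent‑space regression target, and the manifold‑preserving Euler step all have closed forms. The sketch below is our own NumPy illustration under the formulas quoted above (the sphere Exp/Log maps and function names are not taken from the paper's code); the sanity check at the end verifies that integrating the exact target field from t = 0 to 1 lands back on x₁ without ever leaving the sphere.

```python
import numpy as np

def sphere_log(x, y, eps=1e-8):
    """Log map on S^3: the tangent vector at x pointing along the geodesic to y."""
    dot = np.clip(np.dot(x, y), -1.0, 1.0)
    theta = np.arccos(dot)
    if theta < eps:
        return np.zeros_like(x)
    return theta / np.sin(theta) * (y - dot * x)

def sphere_exp(x, v, eps=1e-8):
    """Exp map on S^3: follow the geodesic from x with initial velocity v."""
    norm = np.linalg.norm(v)
    if norm < eps:
        return x
    return np.cos(norm) * x + np.sin(norm) * (v / norm)

def project_tangent(x, u):
    """Project an ambient R^4 vector u onto the tangent space T_x S^3."""
    return u - np.dot(u, x) * x

def geodesic_interp(x0, x1, t):
    """Training path x_t = Exp_{x0}(t * Log_{x0}(x1))."""
    return sphere_exp(x0, t * sphere_log(x0, x1))

def target_velocity(xt, x1, t):
    """Regression target v_t(x_t | x1) = Log_{x_t}(x1) / (1 - t)."""
    return sphere_log(xt, x1) / (1.0 - t)

def euler_step(xt, v_ambient, h):
    """Riemannian Euler step: project onto T_{x_t}S^3, scale by h, Exp back."""
    return sphere_exp(xt, h * project_tangent(xt, v_ambient))

# Sanity check: interpolate between the identity quaternion and a rotation
# about the x-axis; the midpoint stays exactly on S^3, and ten Euler steps
# through the exact target field recover x1.
x0 = np.array([1.0, 0.0, 0.0, 0.0])
x1 = np.array([np.cos(np.pi / 4), np.sin(np.pi / 4), 0.0, 0.0])
xt = geodesic_interp(x0, x1, 0.5)

x, h = x0.copy(), 0.1
for k in range(10):
    x = euler_step(x, target_velocity(x, x1, k * h), h)
```

In RMG itself the velocity comes from the learned network v_θ rather than the closed‑form target, but the projection‑then‑Exp structure of the integration step is the same.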
The prior distribution is a product of a standard Gaussian on ℝ³ for translation and independent wrapped Gaussians on each S³ factor for rotations. The mean is set to the neutral rest pose (zero translation, identity quaternion), ensuring that sampled priors correspond to plausible static skeletons.
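Sampling from this product prior reduces to one Gaussian draw in ℝ³ plus one wrapped‑Gaussian draw per joint. The sketch below is a minimal NumPy illustration (the joint count of 22, which matches the HumanML3D skeleton, and the sigma value are our own assumptions, not the paper's hyperparameters):

```python
import numpy as np

def sample_wrapped_quaternion(mu, sigma, rng):
    """Wrapped Gaussian on S^3: draw a Gaussian tangent vector at mu,
    project it onto T_mu S^3, then push it onto the sphere via the Exp map."""
    v = sigma * rng.standard_normal(4)
    v -= np.dot(v, mu) * mu                      # tangent projection at mu
    norm = np.linalg.norm(v)
    if norm < 1e-8:
        return mu.copy()
    return np.cos(norm) * mu + np.sin(norm) * (v / norm)

def sample_prior(num_joints, sigma, rng):
    """Prior on M = R^3 x (S^3)^J centered at the rest pose
    (zero translation, identity quaternion for every joint)."""
    translation = rng.standard_normal(3)         # standard Gaussian on R^3
    identity = np.array([1.0, 0.0, 0.0, 0.0])
    quats = np.stack([sample_wrapped_quaternion(identity, sigma, rng)
                      for _ in range(num_joints)])
    return translation, quats

# 22 joints as in HumanML3D skeletons; sigma = 0.3 is an illustrative choice.
trans0, quats0 = sample_prior(num_joints=22, sigma=0.3, rng=np.random.default_rng(0))
```

Because every sampled quaternion is the Exp of a tangent vector at the identity, small sigma keeps prior samples near the rest pose, which is the "plausible static skeleton" property noted above.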
Extensive experiments are conducted on two benchmarks. On HumanML3D, a text‑to‑motion dataset, RMG achieves an FID of 0.043, the best reported to date, and ranks first across all MotionStreamer metrics. On the much larger MotionMillion dataset, RMG attains FID = 5.6 and R@1 = 0.86, surpassing strong diffusion‑based baselines. Ablation studies reveal that the compact (T + R) representation is the most stable; adding pre‑shape (P) or temporal‑difference (d·) components yields diminishing returns and sometimes harms training stability. Moreover, using unit quaternions instead of 6‑D rotation representations simplifies geodesic computation, avoids re‑orthogonalization, and improves sampling fidelity.
The authors argue that human motion inherently lives on a low‑dimensional product manifold, and that respecting this geometry during both representation and generative modeling yields superior sample quality, training efficiency, and physical plausibility. Limitations include the current focus on translation and rotation only—contact dynamics, muscle forces, and multi‑person interactions are not explicitly modeled. Future work may extend the manifold to incorporate contact manifolds, explore higher‑order integration schemes, and integrate non‑visual sensor modalities.
In summary, RMG demonstrates that geometry‑aware modeling via Riemannian manifolds and flow matching is not only theoretically elegant but also practically scalable, setting a new direction for high‑fidelity human motion generation.