MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation
World-model-based imagine-then-act has become a promising paradigm for robotic manipulation, yet existing approaches typically support either purely image-based forecasting or reasoning over partial 3D geometry, limiting their ability to predict complete 4D scene dynamics. This work proposes a novel embodied 4D world model that enables geometrically consistent, arbitrary-view RGBD generation: given only a single-view RGBD observation as input, the model imagines the remaining viewpoints, which can then be back-projected and fused to assemble a more complete 3D structure across time. To learn this multi-view, cross-modality generation efficiently, we explicitly design cross-view and cross-modality feature fusion that jointly encourages consistency between RGB and depth and enforces geometric alignment across views. Beyond prediction, converting generated futures into actions is often delegated to an inverse dynamics model, which is ill-posed because multiple actions can explain the same transition. We address this with a test-time action optimization strategy that backpropagates through the generative model to infer a trajectory-level latent best matching the predicted future, and a residual inverse dynamics model that turns this trajectory prior into accurate, executable actions. Experiments on three datasets demonstrate strong performance on both 4D scene generation and downstream manipulation, and ablations provide practical insights into the key design choices.
💡 Research Summary
The paper introduces MVISTA‑4D, a novel “imagine‑then‑act” framework that generates view‑consistent 4D (spatio‑temporal) RGB‑D predictions from a single‑view observation and converts these predictions into executable robot actions. Existing world‑model approaches for manipulation either operate purely in image space or rely on partial 3D representations, which limits their ability to capture complete scene geometry and leads to inconsistencies when planning actions. MVISTA‑4D addresses these gaps through three core technical contributions.
First, the authors design a cross‑view, cross‑modality feature fusion pipeline built on a latent video diffusion model. Each view’s RGB and depth maps are encoded by a 3‑D VAE into latent tensors. Within a view, RGB and depth tokens are concatenated width‑wise so that they become adjacent in the token sequence, enabling a lightweight local cross‑modality attention module to exchange information between appearance and geometry. Learnable modality tokens are added to each stream, and gated residual updates (γ_app, γ_geo) control the strength of the exchange, suppressing noisy or mis‑aligned cues. Across views, tokens are concatenated height‑wise, allowing a variable number of viewpoints to be processed simply by stacking additional streams.
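The gated cross-modality exchange described above can be sketched as follows. This is a minimal interpretation, not the paper's implementation: a shared multi-head attention module lets appearance tokens query geometry tokens and vice versa, with learnable gates (standing in for γ_app and γ_geo) initialized to zero so the residual update starts as an identity mapping; the learnable modality tokens and the width-wise/height-wise concatenation layout are omitted for brevity.

```python
import torch
import torch.nn as nn

class GatedCrossModalAttention(nn.Module):
    """Sketch of gated cross-modality attention between appearance (RGB)
    and geometry (depth) token streams. Hypothetical names; the gates
    play the role of the paper's gamma_app / gamma_geo."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        # one shared attention module for both directions (a simplification)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # gates start at zero so the exchange is initially suppressed
        self.gamma_app = nn.Parameter(torch.zeros(1))
        self.gamma_geo = nn.Parameter(torch.zeros(1))

    def forward(self, app: torch.Tensor, geo: torch.Tensor):
        # appearance queries geometry, geometry queries appearance
        app_upd, _ = self.attn(app, geo, geo)
        geo_upd, _ = self.attn(geo, app, app)
        # gated residual updates damp noisy or mis-aligned cues
        return app + self.gamma_app * app_upd, geo + self.gamma_geo * geo_upd
```

Because the gates are zero-initialized, the module behaves as an identity at the start of training and only gradually opens the cross-modality channel as the gates are learned.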
Second, to enforce geometric consistency across viewpoints, the authors replace naïve extrinsic flattening with a compact 13‑dimensional camera embedding. This embedding encodes spherical coordinates (yaw, pitch, roll) and the logarithm of the distance to a common look‑at point, making scale information explicit and disentangled from rotation. Using these view tokens, a geometry‑aware deformable cross‑view attention is applied: for each query token, the corresponding epipolar line in every other view is computed, and only K candidate locations along that line are sampled for multi‑head attention. This sparse, epipolar‑restricted attention drastically reduces computational cost while preserving true geometric correspondences.
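The epipolar-restricted sampling behind the deformable cross-view attention can be illustrated with standard two-view geometry. The sketch below (an assumption about the mechanism, not the paper's code) computes the epipolar line of a query pixel in another view from a fundamental matrix and samples K candidate locations along it; in the actual model these candidates would feed multi-head attention instead of dense all-pairs attention.

```python
import numpy as np

def epipolar_candidates(F: np.ndarray, q, width: int, K: int = 8) -> np.ndarray:
    """Sample K candidate pixel locations along the epipolar line of query
    pixel q (from view A) inside view B.

    F is the 3x3 fundamental matrix mapping homogeneous view-A points to
    view-B epipolar lines: l = F @ [qx, qy, 1], with l = (a, b, c) the
    line a*x + b*y + c = 0.
    """
    a, b, c = F @ np.array([q[0], q[1], 1.0])
    xs = np.linspace(0, width - 1, K)
    # solve the line equation for y; assumes the line is not vertical (b != 0)
    ys = -(a * xs + c) / b
    return np.stack([xs, ys], axis=1)  # (K, 2) candidate locations
```

Restricting attention to these K samples per view turns a dense H×W comparison into a K-term one, which is the source of the cost reduction the paragraph describes.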
Third, the paper tackles the ill‑posed nature of inverse dynamics. Instead of predicting per‑step actions directly, MVISTA‑4D encodes an entire manipulation trajectory into a low‑dimensional latent code (z_traj). At test time, the generated 4D future is compared to a desired future (derived from the language instruction), and the loss is back‑propagated through the diffusion generator to optimize z_traj. The optimized latent serves as an initialization for a residual inverse dynamics model, which learns only small corrective action terms. This two‑stage process (latent optimization + residual refinement) yields actions that are both globally consistent with the imagined trajectory and locally accurate for execution.
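The test-time latent optimization step can be sketched generically. In the snippet below, `generator` is any differentiable mapping from a trajectory latent to a predicted future (a stand-in for the paper's diffusion generator), and the loss is a simple MSE against the desired future; the function name and hyperparameters are illustrative, and the residual inverse dynamics refinement stage is not shown.

```python
import torch
import torch.nn.functional as F

def optimize_traj_latent(generator, z_init: torch.Tensor,
                         target_future: torch.Tensor,
                         steps: int = 50, lr: float = 1e-2) -> torch.Tensor:
    """Gradient-based refinement of a trajectory latent z_traj so that the
    generated future matches a target future. Backpropagates through the
    (frozen) generator into the latent only."""
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        pred = generator(z)                     # imagine a future from z
        loss = F.mse_loss(pred, target_future)  # compare to desired future
        opt.zero_grad()
        loss.backward()                         # gradients flow into z only
        opt.step()
    return z.detach()
```

The optimized latent would then initialize the residual inverse dynamics model, which only has to learn small corrective action terms on top of this trajectory-level prior.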
Experiments are conducted on three benchmarks: a synthetic 4D video dataset, a newly collected real‑robot multi‑view dataset covering 14 manipulation tasks, and a public RGB‑D video benchmark. MVISTA‑4D outperforms state‑of‑the‑art baselines in RGB‑PSNR/SSIM, depth RMSE, 3D reconstruction IoU, and manipulation success rate. Notably, adding more viewpoints improves depth accuracy by over 30 % and raises success rates on complex tasks (e.g., stacking, insertion) from 85 % to 96 %. Ablation studies confirm that (a) cross‑modality attention, (b) geometry‑aware cross‑view attention, and (c) the trajectory‑latent plus residual IDM each contribute significantly to performance.
The authors acknowledge limitations: the diffusion inference is not yet real‑time, and integration with advanced physics simulators for deformable or fluid objects remains future work. Nonetheless, MVISTA‑4D demonstrates that a unified, geometry‑aware 4D world model combined with test‑time latent action optimization can bridge the gap between high‑fidelity imagination and reliable robot execution, paving the way for more capable, long‑horizon manipulation systems.