BridgeV2W: Bridging Video Generation Models to Embodied World Models via Embodiment Masks

Notice: This research summary and analysis were generated automatically using AI. For full accuracy, please refer to the original arXiv source.

Embodied world models have emerged as a promising paradigm in robotics, most of which leverage large-scale Internet videos or pretrained video generation models to enrich visual and motion priors. However, they still face key challenges: a misalignment between coordinate-space actions and pixel-space videos, sensitivity to camera viewpoint, and non-unified architectures across embodiments. To this end, we present BridgeV2W, which converts coordinate-space actions into pixel-aligned embodiment masks rendered from the URDF and camera parameters. These masks are then injected into a pretrained video generation model via a ControlNet-style pathway, which aligns the action control signals with predicted videos, adds view-specific conditioning to accommodate camera viewpoints, and yields a unified world model architecture across embodiments. To mitigate overfitting to static backgrounds, BridgeV2W further introduces a flow-based motion loss that focuses on learning dynamic and task-relevant regions. Experiments on single-arm (DROID) and dual-arm (AgiBot-G1) datasets, covering diverse and challenging conditions with unseen viewpoints and scenes, show that BridgeV2W improves video generation quality compared to prior state-of-the-art methods. We further demonstrate the potential of BridgeV2W on downstream real-world tasks, including policy evaluation and goal-conditioned planning. More results can be found on our project website at https://BridgeV2W.github.io .


💡 Research Summary

BridgeV2W tackles three fundamental shortcomings of current embodied world models: the action‑video gap, viewpoint sensitivity, and the lack of a unified architecture across different robot embodiments. The core idea is to convert low‑dimensional coordinate‑space actions (e.g., end‑effector poses or joint angles) into high‑dimensional pixel‑aligned “embodiment masks”. These masks are rendered from the robot’s URDF together with known camera intrinsics and extrinsics, producing a per‑frame silhouette that directly reflects the robot’s geometry and motion from the current viewpoint. By feeding these masks into a pretrained video generation model via a ControlNet‑style conditioning branch, BridgeV2W aligns the conditioning space with the model’s visual priors, preserves viewpoint‑specific information, and enables a single architecture to handle both single‑arm and dual‑arm systems.
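The core rendering step can be sketched in a few lines. This is a minimal illustration, not the paper's renderer: it assumes points sampled from the robot's link surfaces (posed via forward kinematics from the URDF) are already available, and simply projects them through a pinhole camera model to produce a per-frame binary mask. The function name and point-splatting approach are illustrative assumptions; a real pipeline would rasterize the link meshes.

```python
import numpy as np

def render_embodiment_mask(points_world, K, T_world_to_cam, hw=(480, 640)):
    """Project 3D robot-surface points into a binary pixel mask.

    points_world: (N, 3) points sampled from the posed link meshes
    K: (3, 3) camera intrinsics; T_world_to_cam: (4, 4) camera extrinsics.
    """
    h, w = hw
    # Transform points into the camera frame (homogeneous coordinates).
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])
    pts_cam = (T_world_to_cam @ pts_h.T).T[:, :3]
    # Keep only points in front of the camera.
    pts_cam = pts_cam[pts_cam[:, 2] > 1e-6]
    # Perspective projection through the intrinsics.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    # Splat each projected point onto the nearest pixel.
    mask = np.zeros((h, w), dtype=np.uint8)
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    mask[v[valid], u[valid]] = 1
    return mask
```

Because the mask is rendered from the *actual* camera intrinsics and extrinsics, the same coordinate-space action yields a different mask under a different viewpoint, which is exactly the view-specific conditioning the paper exploits.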

The video generation backbone is CogVideoX‑5B‑I2V, a diffusion‑based image‑to‑video model that uses a 3‑D VAE encoder/decoder and a Diffusion Transformer (DiT) for denoising latent video representations. The mask sequence is encoded by the same VAE into latent masks, which are injected into selected DiT blocks through zero‑initialized convolutional layers, following the ControlNet paradigm. This design keeps the pretrained weights largely unchanged at the start of fine‑tuning while allowing the model to gradually learn to respect the spatial guidance provided by the masks.
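The key property of zero-initialized injection is that, at the start of fine-tuning, the added conditioning branch is an exact identity on the pretrained block's output. A minimal numpy sketch (using a 1x1 convolution, i.e. a per-pixel linear map, as a stand-in for the paper's zero-initialized convolutional layers; all shapes and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def zero_init_injection(block_out, mask_latent, W):
    """ControlNet-style residual: add a 1x1 convolution of the mask latent
    to the block output. With W initialised to zero, the residual is zero
    and the pretrained block's output passes through unchanged."""
    # mask_latent: (C, H, W_px); W: (C_out, C) weights of the 1x1 conv.
    residual = np.einsum('oc,chw->ohw', W, mask_latent)
    return block_out + residual

block_out = rng.normal(size=(8, 4, 4))    # pretrained DiT block activations
mask_latent = rng.normal(size=(8, 4, 4))  # VAE-encoded embodiment mask
W_zero = np.zeros((8, 8))                 # zero-initialised injection weights
out = zero_init_injection(block_out, mask_latent, W_zero)
assert np.allclose(out, block_out)  # identity at initialisation
```

As fine-tuning updates `W` away from zero, the mask signal gradually steers the denoising trajectory without having disrupted the pretrained priors early on.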

Training combines three loss terms: (1) the standard diffusion loss that predicts latent velocity, (2) a dynamics‑consistency loss that enforces coherent motion across multiple temporal offsets in latent space, and (3) a novel flow‑based motion loss. The flow loss uses a frozen RAFT optical‑flow estimator to compute both direction (cosine similarity) and magnitude (Huber) discrepancies between predicted and ground‑truth videos, focusing supervision on regions that actually move (the robot body and manipulated objects) and reducing over‑fitting to static backgrounds. The flow loss is activated after an initial warm‑up period to avoid destabilizing early training.
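A per-pixel sketch of the flow-based motion loss, under the assumptions that flow fields for predicted and ground-truth videos have already been extracted (in the paper, by a frozen RAFT estimator) and that motion-weighted averaging is how static pixels are down-weighted; the exact weighting and combination in the paper may differ:

```python
import numpy as np

def flow_motion_loss(flow_pred, flow_gt, delta=1.0, eps=1e-8):
    """Direction (cosine) + magnitude (Huber) discrepancy between flows.

    flow_pred, flow_gt: (H, W, 2) optical-flow fields.
    """
    mag_pred = np.linalg.norm(flow_pred, axis=-1)
    mag_gt = np.linalg.norm(flow_gt, axis=-1)
    # Direction term: 1 - cosine similarity per pixel.
    dot = (flow_pred * flow_gt).sum(-1)
    dir_loss = 1.0 - dot / (mag_pred * mag_gt + eps)
    # Magnitude term: Huber penalty on the flow-magnitude difference.
    diff = mag_pred - mag_gt
    huber = np.where(np.abs(diff) <= delta,
                     0.5 * diff ** 2,
                     delta * (np.abs(diff) - 0.5 * delta))
    # Weight by ground-truth motion so static background contributes little.
    w = mag_gt / (mag_gt.sum() + eps)
    return float((w * (dir_loss + huber)).sum())
```

Identical flow fields give a near-zero loss, while flows pointing in opposite directions are penalized heavily, which is what concentrates supervision on the moving robot body and manipulated objects.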

Experiments are conducted on two robotic datasets: DROID (single‑arm manipulation) and AgiBot‑G1 (dual‑arm manipulation). Both datasets contain diverse scenes, unseen camera viewpoints, and varying lighting conditions. BridgeV2W is compared against several baselines, including Action‑Conditioned Video Diffusion, Video‑World‑Model, and prior ControlNet‑based approaches. Quantitative metrics (PSNR, SSIM, FVD) show that BridgeV2W consistently outperforms baselines, achieving 2–3 dB higher PSNR, 0.02–0.04 higher SSIM, and roughly 15 % lower FVD. Qualitative results demonstrate sharper object motion and better preservation of robot geometry across novel viewpoints.
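Of the three reported metrics, PSNR is the simplest to state precisely; SSIM and FVD require windowed statistics and a pretrained feature network, respectively. A reference PSNR implementation for frames normalized to [0, 1]:

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB between arrays in [0, max_val]."""
    mse = np.mean((pred - gt) ** 2)
    if mse == 0:
        return float('inf')  # identical inputs
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

On this scale, the reported 2-3 dB gain corresponds to roughly a 37-50% reduction in mean squared error relative to the baselines.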

Beyond video quality, the paper evaluates downstream utility. In a policy‑evaluation setting, the correlation between predicted video success (derived from a simple classifier on generated frames) and real‑world execution success reaches r ≈ 0.85, indicating that the model’s predictions are reliable proxies for actual performance. In a goal‑conditioned planning task, BridgeV2W is used to generate candidate future videos conditioned on target images; a planner selects actions that lead to videos most similar to the goal. This pipeline achieves over 70 % success on both single‑arm and dual‑arm tasks, outperforming baselines that lack mask conditioning or flow supervision.
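The goal-conditioned planning loop described above reduces to scoring candidate rollouts against the goal image. A minimal sketch, assuming `rollout_fn` stands in for the mask-conditioned video model and using last-frame L2 distance as the (illustrative) similarity measure; the paper's planner may use a different score:

```python
import numpy as np

def select_action(candidate_actions, rollout_fn, goal_image):
    """Pick the action whose predicted final frame best matches the goal.

    rollout_fn(action) -> predicted video of shape (T, H, W, C).
    """
    best_action, best_dist = None, np.inf
    for a in candidate_actions:
        video = rollout_fn(a)
        # Score by L2 distance between the last predicted frame and the goal.
        dist = np.linalg.norm(video[-1] - goal_image)
        if dist < best_dist:
            best_action, best_dist = a, dist
    return best_action
```

Because scoring happens entirely in the generated-video space, the same loop works unchanged for single-arm and dual-arm embodiments, which is one payoff of the unified architecture.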

The authors discuss limitations: the approach relies on accurate URDF models and calibrated camera parameters; errors in these inputs can degrade mask quality. The flow loss depends on a pretrained RAFT model, which may struggle with extremely fast motions or severe illumination changes. Future work could explore learning the mask rendering pipeline end‑to‑end, integrating self‑supervised flow estimation, and extending the method to mobile robots, human‑robot interaction, or multi‑modal sensing.

In summary, BridgeV2W presents a principled and practical solution for bridging large‑scale pretrained video generation models with embodied world modeling. By translating actions into pixel‑aligned masks, injecting them via ControlNet, and emphasizing dynamic regions through a flow‑based loss, the framework delivers robust, viewpoint‑agnostic video predictions and demonstrates tangible benefits for downstream robotic decision‑making. This work opens a clear pathway for leveraging the vast visual and motion priors embedded in internet‑scale video models within the robotics domain.

