ChronoDreamer: Action-Conditioned World Model as an Online Simulator for Robotic Planning

Reading time: 4 minutes

📝 Original Info

  • Title: ChronoDreamer: Action-Conditioned World Model as an Online Simulator for Robotic Planning
  • ArXiv ID: 2512.18619
  • Date: 2025-12-21
  • Authors: Zhenhao Zhou, Dan Negrut

📝 Abstract

We present ChronoDreamer, an action-conditioned world model for contact-rich robotic manipulation. Given a history of egocentric RGB frames, contact maps, actions, and joint states, ChronoDreamer predicts future video frames, contact distributions, and joint angles via a spatial-temporal transformer trained with MaskGIT-style masked prediction. Contact is encoded as depth-weighted Gaussian splat images that render 3D forces into a camera-aligned format suitable for vision backbones. At inference, predicted rollouts are evaluated by a vision-language model that reasons about collision likelihood, enabling rejection sampling of unsafe actions before execution. We train and evaluate on DreamerBench, a simulation dataset generated with Project Chrono that provides synchronized RGB, contact splat, proprioception, and physics annotations across rigid and deformable object scenarios. Qualitative results demonstrate that the model preserves spatial coherence during non-contact motion and generates plausible contact predictions, while the LLM-based judge distinguishes collision from non-collision trajectories.

💡 Deep Analysis

Figure 1

📄 Full Content

Robots operating in contact-rich environments require trajectory planning that respects both task objectives and collision constraints. Classical approaches rely on explicit geometric models and physics simulators, but high-fidelity simulation is often too slow for online replanning, and sim-to-real gaps persist even with careful calibration. Learned world models offer a complementary path: by predicting future states conditioned on actions, they enable planning by imagination at speeds compatible with real-time control. However, most video prediction models focus on visual plausibility and neglect the physical quantities (contact forces, friction modes, joint states) that determine whether a trajectory is safe.

This work addresses the gap between video generation and contact-aware planning. We introduce ChronoDreamer, an action-conditioned world model that jointly predicts future RGB frames, contact maps, and joint angles. The model operates on discrete visual tokens from a pretrained encoder and uses a spatial-temporal transformer with factorized attention to maintain tractable complexity over long horizons. Contact is represented as a camera-aligned image via depth-weighted Gaussian splats, encoding force magnitude and direction in a form directly consumable by vision backbones. At inference time, predicted rollouts are passed to a vision-language model that reasons about collision likelihood, enabling rejection sampling of unsafe actions before execution.
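To make the contact encoding concrete, the minimal sketch below renders 3D contact forces as a depth-weighted Gaussian splat image. The function name, the pinhole intrinsics, the image size, and the exact weighting (force magnitude divided by depth) are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def render_contact_splats(contacts, K, img_hw=(64, 64), sigma_px=2.0):
    """Render 3D contact forces as a depth-weighted Gaussian splat image.

    contacts: list of (point_cam, force) pairs, both 3-vectors in the
    camera frame. K: 3x3 pinhole intrinsics. Returns an (H, W) float
    image where each contact contributes a Gaussian splat whose
    amplitude is |force| / depth (an assumed weighting for illustration).
    """
    H, W = img_hw
    img = np.zeros((H, W), dtype=np.float32)
    ys, xs = np.mgrid[0:H, 0:W]
    for p, f in contacts:
        p = np.asarray(p, dtype=np.float64)
        z = p[2]
        if z <= 1e-6:               # behind or on the camera plane: skip
            continue
        uvw = K @ p                 # pinhole projection to pixel coordinates
        u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
        amp = np.linalg.norm(f) / z # depth-weighted force magnitude
        img += amp * np.exp(-((xs - u) ** 2 + (ys - v) ** 2)
                            / (2 * sigma_px ** 2))
    return img

# Hypothetical intrinsics for a 64x64 egocentric view.
K = np.array([[60.0, 0.0, 32.0],
              [0.0, 60.0, 32.0],
              [0.0, 0.0, 1.0]])
splat = render_contact_splats(
    [(np.array([0.0, 0.0, 0.5]), np.array([0.0, 0.0, 4.0]))], K)
```

A single contact on the optical axis at 0.5 m with a 4 N force yields a splat centered in the image; nearer or stronger contacts splat more brightly, which is what makes the representation "image-native" for a vision backbone.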

• A contact encoding scheme that renders 3D contact forces as depth-weighted Gaussian splat images aligned with the robot’s egocentric camera, providing dense contact supervision in an image-native format.

• ChronoDreamer, a spatial-temporal transformer world model that predicts video tokens, contact tokens, and joint angles jointly via MaskGIT-style masked prediction with factorized vocabulary.

• Integration of world-model rollouts with an LLM-based collision judge that reasons over predicted frames and contact maps to filter unsafe action candidates online.

• Evaluation on DreamerBench, a multi-scenario dataset with rigid and deformable objects, demonstrating spatial coherence preservation and contact-event prediction.
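The MaskGIT-style masked prediction named above can be sketched as iterative, confidence-based decoding: start with every token masked, predict all positions in parallel, keep the most confident predictions on a cosine schedule, and re-mask the rest. The dummy predictor, vocabulary size, and step count below are toy stand-ins, not the paper's spatial-temporal transformer.

```python
import numpy as np

MASK = -1  # sentinel id for masked positions

def maskgit_decode(predict_logits, seq_len, steps=4):
    """MaskGIT-style iterative decoding (toy sketch).

    predict_logits(tokens) -> (seq_len, vocab) logits for all positions.
    Each step fills every masked slot greedily, then keeps only the most
    confident predictions according to a cosine unmasking schedule and
    re-masks the remainder for the next pass.
    """
    tokens = np.full(seq_len, MASK, dtype=np.int64)
    for t in range(steps):
        logits = predict_logits(tokens)
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        pred = probs.argmax(axis=-1)
        conf = probs.max(axis=-1)
        conf[tokens != MASK] = np.inf  # already-fixed tokens stay fixed
        # cosine schedule: fraction of positions still masked after step t
        frac_masked = np.cos(np.pi / 2 * (t + 1) / steps)
        n_keep = seq_len - int(np.floor(frac_masked * seq_len))
        keep = np.argsort(-conf)[:n_keep]
        new_tokens = np.full(seq_len, MASK, dtype=np.int64)
        new_tokens[keep] = np.where(tokens[keep] != MASK,
                                    tokens[keep], pred[keep])
        tokens = new_tokens
    return tokens

def dummy_predictor(tokens):
    """Stand-in for the transformer: position i prefers token i % 8."""
    seq_len, vocab = tokens.shape[0], 8
    logits = np.zeros((seq_len, vocab))
    for i in range(seq_len):
        logits[i, i % vocab] = 5.0
    return logits

out = maskgit_decode(dummy_predictor, seq_len=16)
```

The appeal over autoregressive decoding is that each pass predicts all remaining tokens in parallel, so a full video-token frame is produced in a handful of passes rather than one token at a time.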

State-of-the-art text- and image-conditioned video generators now produce photorealistic textures and intricate geometry, yet they still violate basic mechanics: objects interpenetrate, trajectories defy gravity, and energy appears or vanishes. For robotics, scientific visualization, or simulation-aware media, these failures are structural rather than aesthetic, and they motivate returning to the world-model viewpoint, in which a generator maintains an internal state and transition dynamics coherent enough to support imagination and planning. Classic systems such as World Models, PlaNet, and Dreamer, along with large-scale successors like Genie, demonstrated that long-horizon behavior hinges on latent dynamics that stay faithful to first principles [1]. Reframing modern video generation through this lens suggests that perceptual fidelity must be paired with state updates governed by constraints resembling Newtonian and, more broadly, physical laws.

One route to lawful dynamics is to make physically plausible motion the default attractor via architectural bias. PhyDNet and PhyLoNet introduce dual pathways that disentangle a PDE-inspired “physics” cell from a residual appearance branch; the former evolves smooth, conservative dynamics, while the latter handles nuisance visual factors [2]. Although these designs predate diffusion models, the lesson transfers: inserting lightweight, physics-structured latent updates between denoising steps, or constraining attention flows to respect locality, regularizes temporal evolution, cuts down implausible transitions, and keeps the latent state easier to plan in, much like traditional world-model agents that rely on carefully structured transition models instead of unconstrained recurrence.
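The dual-pathway split can be illustrated on a 1-D latent field: a "physics" pathway evolves the state with a smooth, dissipative PDE step (a discrete heat equation here, chosen for illustration rather than taken from PhyDNet), while an "appearance" pathway contributes a residual. The function and parameter names are hypothetical.

```python
import numpy as np

def dual_pathway_step(state, appearance_residual, nu=0.1):
    """PhyDNet-style split update on a 1-D latent field (toy sketch).

    The physics pathway applies one explicit heat-equation step with a
    periodic finite-difference Laplacian, a smooth, mass-conserving
    update; the appearance pathway adds a learned residual (passed in
    here) that absorbs nuisance visual factors.
    """
    lap = np.roll(state, 1) - 2 * state + np.roll(state, -1)
    physics = state + nu * lap
    return physics + appearance_residual

# A spike of latent "mass" diffuses outward; with a zero appearance
# residual the total is conserved, mimicking a conservative dynamics cell.
state = np.zeros(8)
state[3] = 1.0
next_state = dual_pathway_step(state, np.zeros(8))
```

The point of the split is that the physics cell alone cannot produce energy from nowhere; anything non-physical must flow through the appearance residual, where it can be regularized separately.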

Another strand delegates state evolution to a mechanistic simulator and reserves the generator for photorealistic rendering. PhysGen exemplifies a staged pipeline in which geometry and material parameters are inferred from a single image and user command, a rigid-body simulator propagates contact-rich dynamics, and a diffusion renderer produces temporally consistent imagery guided by the simulated motion [3]. Variants insert deformable or continuum solvers, or periodically project denoising intermediates back onto the simulator’s feasible set, tightening constraints while preserving generative diversity. From the world-model perspective, these simulator-in-the-loop systems externally scaffold the transition model: a trusted dynamical engine maintains or corrects the latent state, whereas the generator focuses on appearance. Design choices revolve around the governing simulator class, how tightly it couples to the denoiser, and how physical parameters are estimated or adapted over time, but even simple staged interfaces
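The "project denoising intermediates onto the feasible set" step admits a very small sketch. The constraint below (sphere centers must sit above a floor plane) is a hypothetical stand-in for whatever constraint set the governing simulator actually defines; the function and parameter names are ours.

```python
import numpy as np

def project_to_feasible(positions, radius=0.05, floor_z=0.0):
    """Snap predicted object centers back onto a feasible set (sketch).

    Hypothetical constraint: spheres of fixed radius may not penetrate a
    floor plane at z = floor_z. Any denoising intermediate that violates
    it is projected to the nearest feasible point, mimicking the periodic
    projection step used in simulator-in-the-loop pipelines.
    """
    out = np.asarray(positions, dtype=np.float64).copy()
    out[:, 2] = np.maximum(out[:, 2], floor_z + radius)  # clamp to surface
    return out

# One center penetrates the floor, one is already feasible.
pts = np.array([[0.0, 0.0, -0.1],
                [0.2, 0.1, 0.5]])
projected = project_to_feasible(pts)
```

Because the projection only moves infeasible samples, and only to the nearest feasible point, it tightens physical constraints without collapsing the generator's diversity on samples that were already valid.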


Reference

This content is AI-processed based on open access ArXiv data.
