Parallel Stochastic Gradient-Based Planning for World Models

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv paper.

World models simulate environment dynamics from raw sensory inputs like video. However, using them for planning can be challenging due to the vast and unstructured search space. We propose a robust and highly parallelizable planner that leverages the differentiability of the learned world model for efficient optimization, solving long-horizon control tasks from visual input. Our method treats states as optimization variables (“virtual states”) with soft dynamics constraints, enabling parallel computation and easier optimization. To facilitate exploration and avoid local optima, we introduce stochasticity into the states. To mitigate sensitive gradients through high-dimensional vision-based world models, we modify the gradient structure to descend towards valid plans while only requiring action-input gradients. Our planner, which we call GRASP (Gradient RelAxed Stochastic Planner), can be viewed as a stochastic version of a non-condensed or collocation-based optimal controller. We provide theoretical justification and experiments on video-based world models, where our resulting planner outperforms existing planning algorithms like the cross-entropy method (CEM) and vanilla gradient-based optimization (GD) on long-horizon experiments, both in success rate and time to convergence.


💡 Research Summary

The paper introduces GRASP (Gradient RelAxed Stochastic Planner), a novel planning algorithm designed for visual world‑model based control tasks. Traditional planners for learned world models fall into two categories. Zero‑order methods such as the Cross‑Entropy Method (CEM) or MPPI rely on random sampling of action sequences and evaluate them via roll‑outs; they are robust but become sample‑inefficient as the horizon length and action dimensionality increase. Gradient‑based planners exploit the differentiability of the world model (F_\theta) to directly optimize actions, offering better sample efficiency, yet they suffer from two fundamental issues: (1) back‑propagation through many sequential model calls leads to numerical instability and poor conditioning, and (2) in high‑dimensional visual latent spaces the gradients with respect to the state input (\nabla_s F_\theta) are often brittle or adversarial, causing the optimizer to exploit unrealistic directions in state space.
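To make the zero-order baseline concrete, here is a minimal NumPy sketch of CEM planning (not the paper's code): a toy damped linear dynamics function `f` is a hypothetical stand-in for the learned world model (F_\theta), and a quadratic terminal cost stands in for the task objective. Note that each CEM iteration requires rolling out every sampled action sequence sequentially through the model, which is what becomes expensive as the horizon grows.

```python
import numpy as np

def cem_plan(f, s0, goal, horizon=10, action_dim=2,
             n_samples=200, n_elites=20, n_iters=30, seed=0):
    """Cross-Entropy Method: sample action sequences from a Gaussian,
    roll them out through the model, refit the Gaussian to the elites."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        actions = rng.normal(mu, sigma, size=(n_samples, horizon, action_dim))
        costs = np.empty(n_samples)
        for i in range(n_samples):
            s = s0
            for t in range(horizon):          # sequential rollout per sample
                s = f(s, actions[i, t])
            costs[i] = np.sum((s - goal) ** 2)  # terminal cost only
        elites = actions[np.argsort(costs)[:n_elites]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu  # mean action sequence of the final sampling distribution

# Toy damped linear dynamics: a stand-in for a learned world model F_theta.
def f(s, a):
    return 0.9 * s + 0.5 * a

plan = cem_plan(f, s0=np.zeros(2), goal=np.array([1.0, -1.0]))
```

On this fully actuated toy problem CEM converges easily; the point of the sketch is the cost structure: `n_samples * horizon` model calls per iteration, with no use of the model's gradients.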

GRASP addresses these problems by "lifting" the optimization: instead of treating the state trajectory as a deterministic rollout, it introduces auxiliary "virtual states" (z_1,\dots,z_T) as independent optimization variables. Dynamics consistency is enforced softly through a loss term of the form

\[ \mathcal{L}_{\text{dyn}} = \sum_{t=0}^{T-1} \lVert z_{t+1} - F_\theta(z_t, a_t) \rVert^2, \]

so each virtual state only needs to approximately match the model's one-step prediction from its predecessor. Because each penalty term couples only adjacent time steps, all T residuals and their gradients can be computed in parallel rather than through one long sequential rollout. To encourage exploration and escape local optima, GRASP injects stochasticity into the virtual states, and it modifies the gradient structure so that only gradients with respect to the action inputs are required, sidestepping the brittle state-input gradients (\nabla_s F_\theta). In this sense the planner can be read as a stochastic analogue of a non-condensed, collocation-based optimal controller.
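The lifted formulation can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: a linear model stands in for (F_\theta), central finite differences stand in for autodiff, and plain gradient descent stands in for GRASP's stochastic optimizer (the state-noise and action-only-gradient modifications are omitted). The key structural point survives: states and actions are free variables, and dynamics enter only as a penalty.

```python
import numpy as np

# Hypothetical linear stand-in for the learned world model F_theta.
A, B = 0.9 * np.eye(2), 0.5 * np.eye(2)

def dyn(z, a):
    return A @ z + B @ a

def lifted_loss(params, z0, goal, T, lam=10.0):
    """Soft-constrained 'lifted' objective: virtual states z_1..z_T and
    actions a_0..a_{T-1} are independent variables; dynamics consistency
    is a quadratic penalty, not a hard rollout constraint."""
    z = params[:2 * T].reshape(T, 2)    # virtual states z_1..z_T
    a = params[2 * T:].reshape(T, 2)    # actions a_0..a_{T-1}
    z_prev = np.vstack([z0, z[:-1]])    # z_0..z_{T-1}
    # All T one-step residuals z_{t+1} - F(z_t, a_t), computed in parallel.
    resid = z - (z_prev @ A.T + a @ B.T)
    return lam * np.sum(resid ** 2) + np.sum((z[-1] - goal) ** 2)

def num_grad(fn, x, eps=1e-5):
    """Central finite differences (a toy substitute for autodiff)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (fn(x + d) - fn(x - d)) / (2 * eps)
    return g

T = 5
z0, goal = np.zeros(2), np.array([1.0, -1.0])
loss = lambda p: lifted_loss(p, z0, goal, T)
params = 0.1 * np.random.default_rng(0).normal(size=4 * T)
for _ in range(6000):                   # plain gradient descent
    params -= 0.01 * num_grad(loss, params)
```

Because the objective couples only adjacent time steps, its gradient never requires backpropagating through a chain of T sequential model calls, which is exactly the conditioning problem the lifting is meant to avoid.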

