Describe-Then-Act: Proactive Agent Steering via Distilled Language-Action World Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Deploying safety-critical agents requires anticipating the consequences of actions before they are executed. While world models offer a paradigm for this proactive foresight, current approaches relying on visual simulation incur prohibitive latencies, often exceeding several seconds per step. In this work, we challenge the assumption that visual processing is necessary for failure prevention. We show that a trained policy’s latent state, combined with its planned actions, already encodes sufficient information to anticipate action outcomes, making visual simulation redundant for failure prevention. To this end, we introduce DILLO (DIstiLLed Language-ActiOn World Model), a fast steering layer that shifts the paradigm from “simulate-then-act” to “describe-then-act.” DILLO is trained via cross-modal distillation, where a privileged Vision Language Model teacher annotates offline trajectories and a latent-conditioned Large Language Model student learns to predict semantic outcomes. This creates a text-only inference path, bypassing heavy visual generation entirely, achieving a 14x speedup over baselines. Experiments on MetaWorld and LIBERO demonstrate that DILLO produces high-fidelity descriptions of the next state and is able to steer the policy, improving episode success rate by up to 15 pp and 9.3 pp on average across tasks.


💡 Research Summary

The paper tackles the critical problem of enabling safety‑critical agents to anticipate the consequences of their actions before execution, a requirement that traditional model‑predictive control (MPC) and model‑based reinforcement learning address through visual world simulation. Existing visual world models, however, suffer from prohibitive latency—often several seconds per decision—making them unsuitable for real‑time control on embedded platforms.

The authors propose the “Latent Sufficiency Hypothesis”: a policy’s internal latent representation (zₜ), produced by its encoder from raw observations, already contains the task‑relevant information (object geometry, relative distances, contact dynamics) needed to predict future outcomes. Consequently, visual rendering of future frames is redundant for proactive failure prevention.
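The hypothesis can be made concrete with a toy policy whose encoder exposes its latent state. This is a minimal NumPy sketch under assumed dimensions (64-d observation, 32-d latent, 4-d action); none of the names or shapes come from the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy policy: encoder produces latent z_t, a head maps z_t to an action.
# The Latent Sufficiency Hypothesis: z_t (plus the planned actions) already
# carries the task-relevant state, so a world model can condition on z_t
# instead of rendering future pixels.
W_enc = rng.standard_normal((64, 32))   # encoder weights (obs_dim -> latent_dim)
W_head = rng.standard_normal((32, 4))   # action head (latent_dim -> act_dim)

def policy_step(obs):
    z_t = np.tanh(obs @ W_enc)          # latent state z_t
    action = z_t @ W_head
    return action, z_t                  # expose z_t for the steering layer

obs = rng.standard_normal(64)
action, z_t = policy_step(obs)
```

The only change relative to a standard policy is that `z_t` is returned alongside the action, so a downstream module can consume it without touching the observation pipeline.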

To test this hypothesis, they introduce DILLO (Distilled Language‑Action World Model), a two‑stage teacher‑student framework that replaces visual simulation with a text‑only prediction pipeline. The teacher is a privileged Vision‑Language Model (VLM) that has full access to the simulator’s RGB frames, 6‑DoF object and end‑effector poses, and binary success signals. For each transition, the teacher generates a natural‑language description (d_T) of the expected physical interaction and a binary verdict (c_T ∈ {Positive, Negative}) indicating whether the action chunk advances the task.
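The teacher's annotation step can be sketched as a small labeling function over offline transitions. The structure below is illustrative (a mocked VLM stands in for the real teacher, and all names are assumptions), but it mirrors the privileged inputs and the (d_T, c_T) output described above:

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    description: str   # d_T: natural-language outcome description
    verdict: str       # c_T in {"Positive", "Negative"}

class MockVLM:
    """Stand-in for the privileged Vision-Language Model teacher."""
    def describe(self, frame, obj_pose, ee_pose):
        # A real VLM would ground this in the RGB frame and 6-DoF poses.
        return "The gripper moves toward the block and closes around it."

def teacher_annotate(vlm, frame, obj_pose, ee_pose, success):
    # Privileged inputs: simulator RGB frame, object and end-effector
    # poses, and the binary success signal for the action chunk.
    d_T = vlm.describe(frame, obj_pose, ee_pose)
    c_T = "Positive" if success else "Negative"
    return Annotation(description=d_T, verdict=c_T)

ann = teacher_annotate(MockVLM(), frame=None, obj_pose=None,
                       ee_pose=None, success=True)
```

Running this over an offline dataset yields (latent, actions, d_T, c_T) tuples, which become the student's training targets.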

The student is a lightweight Large Language Model (LLM) from the Gemma family (1 B or 4 B parameters). It receives only the policy’s latent state zₜ and a chunk of planned actions aₜ:ₜ₊ₖ. Two learnable linear projectors, P_z and P_a, map these continuous vectors into the LLM’s embedding space, forming the multimodal input sequence fed to the student.
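The projection step amounts to two linear maps followed by concatenation into a token sequence. This is a minimal NumPy sketch with assumed dimensions (32-d latent, 4-d actions, 256-d embedding width, chunk length k = 8); the real projectors would be trained jointly with the student:

```python
import numpy as np

rng = np.random.default_rng(1)
embed_dim = 256   # assumed LLM embedding width (illustrative)

# Linear projectors P_z and P_a, shown here as plain matrices, mapping the
# continuous latent z_t and action chunk a_{t:t+k} into token embeddings.
P_z = rng.standard_normal((32, embed_dim))
P_a = rng.standard_normal((4, embed_dim))

def build_input_sequence(z_t, action_chunk):
    z_tok = (z_t @ P_z)[None, :]            # one "latent token"
    a_toks = action_chunk @ P_a             # k "action tokens"
    return np.concatenate([z_tok, a_toks])  # sequence fed to the LLM

z_t = rng.standard_normal(32)
a_chunk = rng.standard_normal((8, 4))       # k = 8 planned actions
seq = build_input_sequence(z_t, a_chunk)
```

Because the student consumes only these projected tokens (plus text), inference never touches image encoders or visual generation, which is the source of the reported speedup.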

