Computer-Using World Model

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the original arXiv paper.

Agents operating in complex software environments benefit from reasoning about the consequences of their actions, as even a single incorrect user interface (UI) operation can derail long, artifact-preserving workflows. This challenge is particularly acute for computer-using scenarios, where real execution does not support counterfactual exploration, making large-scale trial-and-error learning and planning impractical despite the environment being fully digital and deterministic. We introduce the Computer-Using World Model (CUWM), a world model for desktop software that predicts the next UI state given the current state and a candidate action. CUWM adopts a two-stage factorization of UI dynamics: it first predicts a textual description of agent-relevant state changes, and then realizes these changes visually to synthesize the next screenshot. CUWM is trained on offline UI transitions collected from agents interacting with real Microsoft Office applications, and further refined with a lightweight reinforcement learning stage that aligns textual transition predictions with the structural requirements of computer-using environments. We evaluate CUWM via test-time action search, where a frozen agent uses the world model to simulate and compare candidate actions before execution. Across a range of Office tasks, world-model-guided test-time scaling improves decision quality and execution robustness.


💡 Research Summary

The paper introduces the Computer‑Using World Model (CUWM), a novel world‑model architecture designed for desktop software environments where real execution is costly and counterfactual exploration is unavailable. CUWM predicts the next UI screenshot given the current screenshot and a candidate action by factorizing the prediction into two stages. In Stage 1, a vision‑language model (Qwen2.5‑VL) receives the current UI image and a natural‑language description of the action and outputs a concise textual transition Δt that describes the decision‑relevant changes (e.g., “column H becomes selected”, “encryption dialog appears”). This textual abstraction reduces the prediction space dramatically, focusing on what changes rather than how the whole screen looks. In Stage 2, a diffusion‑based conditional image‑editing model (Qwen‑Image‑Edit) takes the current screenshot together with Δt and synthesizes the next screenshot, preserving unchanged regions and rendering only the localized modifications described by Δt.
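The two-stage factorization can be sketched as a pair of model calls behind a single step function. The block below is a minimal structural sketch, not the paper's implementation: `predict_transition_text` and `render_next_screenshot` are hypothetical placeholders standing in for the Qwen2.5‑VL and Qwen‑Image‑Edit calls, and `UIState` is an illustrative container.

```python
from dataclasses import dataclass


@dataclass
class UIState:
    screenshot: bytes      # raw pixels of the current screen
    description: str = ""  # textual transition that produced this state


def predict_transition_text(screenshot: bytes, action: str) -> str:
    """Stage 1 (hypothetical stand-in for a Qwen2.5-VL call): map the
    current screenshot and a natural-language action to a concise
    textual delta describing only decision-relevant changes."""
    return f"after '{action}': <predicted UI delta>"


def render_next_screenshot(screenshot: bytes, delta_text: str) -> bytes:
    """Stage 2 (hypothetical stand-in for a Qwen-Image-Edit call):
    apply the textual delta visually, preserving unchanged regions."""
    return screenshot  # identity placeholder for the diffusion edit


def cuwm_step(state: UIState, action: str) -> UIState:
    """One world-model rollout step: text delta first, pixels second."""
    delta = predict_transition_text(state.screenshot, action)
    next_pixels = render_next_screenshot(state.screenshot, delta)
    return UIState(screenshot=next_pixels, description=delta)
```

Keeping the text and image stages separate means the hard reasoning (what changes) happens in a small textual space, while the image model only has to perform a localized edit.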

Training proceeds in two phases. First, supervised fine‑tuning uses the GUI‑360 dataset, which provides trajectories of (s_t, a_t, s_{t+1}) from agents interacting with Microsoft Office applications. An automated annotator (GPT‑5) generates ground‑truth textual descriptions ΔGT_t for each transition. The textual model is trained to predict ΔGT_t from (s_t, a_t), and the visual model is trained to reconstruct s_{t+1} from (s_t, Δt). This phase gives CUWM a faithful initialization of UI dynamics.
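The supervised phase effectively splits each logged transition into two training pairs, one per stage. A minimal sketch of that data preparation follows, where `annotate` is a hypothetical stand-in for the GPT‑5 annotator that produces the ground-truth delta ΔGT_t:

```python
def build_sft_examples(trajectory, annotate):
    """Turn transitions (s_t, a_t, s_{t+1}) into the two supervision
    signals used by CUWM's two stages:
      - text model:  (s_t, a_t)      -> delta_gt
      - image model: (s_t, delta_gt) -> s_{t+1}
    `annotate` is a placeholder for the automated textual annotator."""
    text_examples, image_examples = [], []
    for s_t, a_t, s_next in trajectory:
        delta_gt = annotate(s_t, a_t, s_next)
        text_examples.append({"input": (s_t, a_t), "target": delta_gt})
        image_examples.append({"input": (s_t, delta_gt), "target": s_next})
    return text_examples, image_examples
```

Because the annotator sees the realized next state s_{t+1}, its deltas describe what actually changed, giving both stages grounded targets.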

Because raw supervised loss does not guarantee that the generated text aligns with the structural aspects most important for planning (selection state, ribbon status, pane visibility), a lightweight reinforcement‑learning (RL) refinement is applied to the textual model. The model is treated as a policy that samples Δt; a reward combines an LLM‑as‑Judge score (evaluating correctness of UI structural elements) and a length penalty that encourages concise descriptions. Optimization uses Group Relative Policy Optimization (GRPO), encouraging the model to produce short yet accurate transition texts.
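The reward shaping and the group-relative normalization at the heart of GRPO can be illustrated in a few lines. This is a sketch under stated assumptions: the `target_len` and `alpha` values are illustrative, not taken from the paper, and `judge_score` stands in for the LLM-as-Judge evaluation of structural correctness.

```python
def transition_reward(judge_score: float, delta_text: str,
                      target_len: int = 40, alpha: float = 0.01) -> float:
    """Reward for one sampled transition text: an LLM-as-Judge score
    for UI structural correctness minus a penalty on excess length.
    target_len and alpha are illustrative hyperparameters."""
    n_tokens = len(delta_text.split())
    length_penalty = alpha * max(0, n_tokens - target_len)
    return judge_score - length_penalty


def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sample's reward by the
    mean and std of its sampling group, the core idea of GRPO. Samples
    above the group mean are reinforced; samples below are suppressed."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

GRPO avoids a learned value function: for each input, the policy samples a group of candidate deltas, and each sample is scored relative to its own group, so verbose but correct descriptions are outcompeted by concise correct ones.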

Evaluation is performed via test‑time action search. A frozen LLM‑based agent proposes a set of candidate actions from the current UI state. CUWM simulates the resulting next UI screenshot for each candidate; the agent then selects the action it deems best based on these simulated outcomes. No policy update occurs during inference, and the world model is used solely as a simulator, allowing additional computation at test time without risking real data corruption.
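The search loop itself is simple: simulate every candidate, score the imagined outcomes, commit to one real action. A minimal sketch, where `world_model` wraps CUWM's one-step prediction and `score` is a hypothetical stand-in for the frozen agent's preference over simulated next states:

```python
def select_action(state, candidates, world_model, score):
    """Test-time action search: simulate each candidate action with the
    world model and return the one whose predicted next state scores
    highest. No real execution and no policy update happens here."""
    best_action, best_score = None, float("-inf")
    for action in candidates:
        predicted_next = world_model(state, action)  # imagined outcome
        s = score(predicted_next)                    # agent's judgment
        if s > best_score:
            best_action, best_score = action, s
    return best_action
```

Since only the finally selected action touches the real application, extra compute spent on simulation scales decision quality without risking the artifact being edited.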

Experiments on Word, Excel, and PowerPoint tasks demonstrate that CUWM‑guided simulation substantially reduces error rates (by roughly 30% on average) and improves task success rates (10–15% gains) compared with a baseline that executes the first proposed action directly. The benefits are most pronounced in long‑horizon workflows where a single UI mistake can irrevocably damage artifacts. Qualitative examples show CUWM correctly handling localized changes such as column selection, dialog appearance, and ribbon highlighting, while preserving the rest of the screen.

The paper’s contributions are threefold: (1) a two‑stage, multimodal world‑model architecture that separates “what changes” from “how it appears” for desktop GUIs; (2) a training pipeline that combines automated textual annotation with structure‑aware RL refinement to produce concise, semantically accurate transition descriptions; (3) a test‑time simulation framework that improves the decision quality of existing agents without modifying their policies. Limitations include difficulty handling multi‑window interactions, complex drag‑and‑drop gestures, and non‑rectangular graphic elements, as well as inference latency introduced by diffusion‑based image synthesis. Suggested future directions include more sophisticated multimodal attention, richer UI element parsing, and lightweight image generation to enable real‑time deployment.

