SafePred: A Predictive Guardrail for Computer-Using Agents via World Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

With the widespread deployment of Computer-using Agents (CUAs) in complex real-world environments, long-term risks can lead to severe and irreversible consequences. Most existing guardrails for CUAs adopt a reactive approach, constraining agent behavior only within the current observation space. While these guardrails can prevent immediate short-term risks (e.g., clicking on a phishing link), they cannot proactively avoid long-term risks: seemingly reasonable actions can lead to high-risk consequences that emerge with a delay (e.g., cleaning logs makes future audits untraceable), which reactive guardrails cannot identify within the current observation space. To address these limitations, we propose a predictive guardrail approach, whose core idea is to align predicted future risks with current decisions. Based on this approach, we present SafePred, a predictive guardrail framework for CUAs that establishes a risk-to-decision loop to ensure safe agent behavior. SafePred supports two key abilities: (1) Short- and long-term risk prediction: using safety policies as the basis for risk prediction, SafePred leverages the prediction capability of the world model to generate semantic representations of both short-term and long-term risks, thereby identifying and pruning actions that lead to high-risk states; (2) Decision optimization: translating predicted risks into actionable safe decision guidance through step-level interventions and task-level re-planning. Extensive experiments show that SafePred significantly reduces high-risk behaviors, achieving over 97.6% safety performance and improving task utility by up to 21.4% compared with reactive baselines.


💡 Research Summary

The paper “SafePred: A Predictive Guardrail for Computer-Using Agents via World Models” addresses a critical limitation in the safety of autonomous Computer-Using Agents (CUAs). Current state-of-the-art guardrails are predominantly reactive; they evaluate actions based on the immediate, observable context and can prevent short-term risks like clicking malicious links. However, they fail to anticipate long-term, delayed risks where an action that seems benign initially (e.g., upgrading a system’s Python version to run a project) can lead to severe, irreversible consequences later (e.g., breaking OS-level dependencies). To bridge this gap, the authors propose a paradigm shift towards predictive guardrails, with the core principle of aligning predicted future risks with current decision-making.

The introduced framework, SafePred, operationalizes this concept through a three-stage pipeline that forms a continuous risk-to-decision loop. First, the Policy Integration module processes unstructured safety documents (e.g., organizational policies) into a structured, deduplicated, and goal-aligned set of rules. This provides a consistent and extensible foundation for risk evaluation. Second, the core Risk Prediction via World Model module leverages a Large Language Model (LLM) as a world model. Given the current state, a candidate action, task intent, policies, and recent history, the world model performs multi-scale prediction. It generates both a short-term prediction (the immediate observable state change) and a long-term prediction (a high-level, semantic description of the action’s impact on the overall task progression, such as causing irreversible deviation). Crucially, it then grounds these predictions in the integrated policy set to evaluate which policies would be violated, producing a risk score and explanatory signals. This approach avoids the unreliability of explicit multi-step state rollouts for long-term forecasting.
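The risk-prediction step described above can be sketched as a single world-model query that returns both prediction horizons plus a policy grounding. The names, fields, threshold, and prompt wording below are illustrative assumptions, not the authors' implementation:

```python
from dataclasses import dataclass

@dataclass
class RiskPrediction:
    short_term: str           # immediate observable state change
    long_term: str            # semantic impact on overall task progression
    violated_policies: list   # integrated policies the outcome would violate
    risk_score: float         # 0.0 (safe) .. 1.0 (high risk), illustrative scale

def predict_risk(llm, state, action, task_intent, policies, history):
    """Ask the LLM world model for short- and long-term outcome
    predictions, then ground them against the integrated policy set.
    `llm` is assumed to parse its reply into a RiskPrediction."""
    prompt = (
        f"Task intent: {task_intent}\n"
        f"Recent history: {history}\n"
        f"Current state: {state}\n"
        f"Candidate action: {action}\n"
        f"Safety policies: {policies}\n"
        "Predict (1) the immediate observable state change and (2) the "
        "long-term semantic impact on task progression. List any violated "
        "policies and give a risk score in [0, 1]."
    )
    return llm(prompt)
```

Note that the long-term output is a semantic description (e.g., "causes irreversible deviation"), not an explicit multi-step state rollout, matching the paper's argument that such rollouts are unreliable.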

Third, the Decision Optimization module translates these risk signals into actionable guidance for the CUA. If the risk is low, the action is passed for execution. If a high-risk action is predicted, the module does not merely reject it. Instead, it generates hierarchical feedback: step-level risk guidance detailing the violated policies and reasons, and task-level plan guidance suggesting modifications to the agent’s overall plan. This prompts the agent to re-reason and re-plan, actively guiding it towards safer alternatives rather than just filtering bad options.
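The routing logic of this stage can be sketched as a small dispatcher: low-risk actions pass through, high-risk ones come back with both levels of guidance. The threshold value and dictionary schema are assumptions for illustration only:

```python
RISK_THRESHOLD = 0.5  # illustrative cutoff, not specified in the paper

def optimize_decision(prediction, action):
    """Route a candidate action on predicted risk: execute when safe,
    otherwise return hierarchical feedback so the agent can re-plan
    instead of merely having the action rejected."""
    if prediction.risk_score < RISK_THRESHOLD:
        return {"decision": "execute", "action": action}
    return {
        "decision": "replan",
        # step-level risk guidance: violated policies and the reason
        "step_guidance": {
            "violated_policies": prediction.violated_policies,
            "reason": prediction.long_term,
        },
        # task-level plan guidance: prompt a revision of the overall plan
        "plan_guidance": ("Revise the plan to reach the task goal without "
                          "triggering the violated policies."),
    }
```

The key design point mirrored here is that the high-risk branch returns guidance rather than a bare rejection, which is what lets the agent actively search for a safer path.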

Extensive experimental evaluation on the OS-Harm and WASP benchmarks demonstrates the framework’s effectiveness. SafePred significantly reduces high-risk behaviors, achieving over 97.6% safety performance across benchmarks. Importantly, through its decision optimization that helps agents find safer successful paths, it also improves task completion utility by up to 21.4% compared to reactive baselines, showing that safety and efficiency can be synergistic. Furthermore, the authors distill knowledge from SafePred’s operation into a lightweight, fine-tuned model called SafePred-8B, which achieves safety performance comparable to much larger frontier models like DeepSeek-V3.2, demonstrating the practical deployability of the approach.

In summary, SafePred presents a novel, world model-based predictive guardrail framework that moves beyond reactive safety checks. By enabling agents to foresee and avoid long-term, latent risks through integrated risk prediction and optimized decision guidance, it offers a robust solution for ensuring the safe operation of CUAs in complex, real-world environments.

