From Off-Policy to On-Policy: Enhancing GUI Agents via Bi-level Expert-to-Policy Assimilation
Vision-language models are increasingly deployed as computer-use agents (CUAs) that operate desktops and browsers. Top-performing CUAs are framework-based systems that decompose planning and execution, while end-to-end screenshot-to-action policies are easier to deploy but lag behind on benchmarks such as OSWorld-Verified. GUI datasets like OSWorld pose two bottlenecks: they expose only a few hundred interactive, verifiable tasks and environments, and expert trajectories must be gathered by interacting with these environments, making such data hard to scale. We therefore ask how reinforcement learning from verifiable rewards (RLVR) can best exploit a small pool of existing expert trajectories to train end-to-end policies. Naively mixing these off-policy traces into on-policy RLVR is brittle: even after format conversion, expert trajectories exhibit structural mismatch and distribution shift from the learner. We propose BEPA (Bi-Level Expert-to-Policy Assimilation), which turns static expert traces into policy-aligned guidance via self-rolled reachable trajectories under the base policy (LEVEL-1) and a per-task, dynamically updated cache used in RLVR (LEVEL-2). On OSWorld-Verified, BEPA improves UI-TARS-1.5-7B success from 22.87% to 32.13% and raises a held-out split from 5.74% to 10.30%, with consistent gains on MMBench-GUI and Online-Mind2Web. Our code and data are available at: https://github.com/LEON-gittech/Verl_GUI.git
💡 Research Summary
The paper tackles the persistent performance gap between framework‑based GUI agents and end‑to‑end (E2E) screenshot‑to‑action policies. While framework agents achieve 30‑40% success on challenging benchmarks such as OSWorld‑Verified, E2E models like UI‑TARS‑1.5‑7B linger around 20‑24%. The bottleneck stems from two facts: (1) only a few hundred verifiable tasks exist, making large‑scale supervised data scarce, and (2) expert trajectories must be collected by running strong framework agents in the actual environments, which is costly and limits the amount of high‑quality data. The authors ask how to best exploit a small pool of expert traces within reinforcement learning from verifiable rewards (RLVR).
A naïve approach—mixing off‑policy expert traces directly into on‑policy RLVR—fails because of (a) structural mismatch (expert traces contain planner, executor, and grounding artifacts, often in API‑level action spaces) and (b) distributional shift (even after format conversion, the trajectories lie far from the current policy’s manifold). This mismatch leads to exploration collapse and unstable updates.
The proposed solution, BEPA (Bi‑Level Expert‑to‑Policy Assimilation), introduces a two‑stage pipeline that converts static, mismatched expert data into policy‑aligned guidance. Level‑1 (Self‑Rolled Execution) abstracts each expert trajectory into a concise natural‑language plan pₓ, appends it to the original instruction, and lets the base policy πθ generate a new rollout under this plan. Successful rollouts (verified by the deterministic verifier R) become “self‑rolled” trajectories τ_selfₓ that are intrinsically compatible with πθ, thereby bridging the structural gap and reducing covariate shift. Level‑2 (Off‑Policy Assimilation) builds a per‑task cache Eₓ initialized with the Level‑1 trajectories. During GRPO training, a batch of N on‑policy rollouts is collected; if any succeed, the cache is refreshed with one of those successes. If the entire batch fails, the cached off‑policy trajectory τ_offₓ is injected into the batch, guaranteeing at least one positive signal for the optimizer. The mixed batch T̂ₓ is then used to compute group‑wide advantages, and a clipped surrogate objective (CLIP) keeps updates within a trust region.
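The Level‑2 refresh‑or‑inject logic described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: `policy_rollout`, `verifier`, and the trajectory representation are placeholder names, and the cache is assumed to be seeded with Level‑1 self‑rolled successes.

```python
import random

def bepa_collect_batch(task, policy_rollout, verifier, cache, n_rollouts=8):
    """Sketch of BEPA Level-2 off-policy assimilation for one task.

    policy_rollout(task) -> one on-policy trajectory under the current policy.
    verifier(traj)       -> 1.0 on task success, 0.0 on failure (deterministic R).
    cache                -> per-task dict, seeded with Level-1 self-rolled
                            trajectories and refreshed during training.
    """
    batch = [policy_rollout(task) for _ in range(n_rollouts)]
    successes = [t for t in batch if verifier(t) > 0]
    if successes:
        # At least one on-policy success: refresh the cache so the stored
        # trajectory tracks the evolving policy (prevents distribution drift).
        cache[task] = random.choice(successes)
    elif task in cache:
        # Entire batch failed: inject the cached off-policy trajectory so the
        # group contains at least one positive reward signal.
        batch.append(cache[task])
    return batch
```

Injecting only on all-fail batches is the key design choice: the off‑policy trace is consulted exactly when on‑policy exploration yields no learning signal, and is otherwise replaced by fresher on‑policy successes.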
Integration with GRPO is seamless: the mixed objective J_BEPA extends the standard GRPO loss to handle the hybrid batch, preserving stability while allowing occasional off‑policy guidance. Empirically, BEPA yields substantial gains. On OSWorld‑Verified, UI‑TARS‑1.5‑7B's overall success rises from 22.87% to 32.13%, and on a strictly held‑out split from 5.74% to 10.30%. Consistent improvements are observed on MMBench‑GUI and Online‑Mind2Web. Ablation studies show that the Level‑1 self‑rolled seed dramatically reduces the log‑likelihood gap between expert and policy, while Level‑2's dynamic cache prevents distribution drift as the policy evolves.
In summary, BEPA addresses both structural and distributional mismatches by (1) re‑rolling expert plans under the current policy to generate reachable trajectories, and (2) maintaining a dynamically aligned off‑policy cache that is only consulted when on‑policy exploration fails. This design enables efficient use of a limited set of high‑quality expert demonstrations, leading to a notable leap in E2E GUI agent performance and narrowing the gap with sophisticated framework‑based systems. The work opens avenues for scaling expert‑guided RL in GUI domains, potentially combining with synthetic data generation or multi‑modal instruction handling in future research.