GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Rui Yang 1†, Qianhui Wu 2∗, Zhaoyang Wang 3†, Hanyang Chen 1, Ke Yang 1†, Hao Cheng 2, Huaxiu Yao 3, Baolin Peng 2, Huan Zhang 1, Jianfeng Gao 2, Tong Zhang 1
1 UIUC, 2 Microsoft, 3 UNC-Chapel Hill
https://gui-libra.github.io

Abstract

Open-source native GUI agents have made rapid progress in visual grounding and low-level action execution, yet they still lag behind closed-source systems on long-horizon navigation tasks that demand both high-level reasoning and precise actions. This gap stems from two limitations in the open-source ecosystem: a shortage of high-quality, action-aligned reasoning data, and the direct adoption of generic post-training pipelines that overlook the unique challenges of GUI agents. We identify two fundamental issues in these pipelines: (i) standard supervised fine-tuning (SFT) with long chain-of-thought (CoT) reasoning often hurts grounding accuracy, and (ii) step-wise RLVR-style training faces partial verifiability, where multiple actions can be correct at a given state but only a single demonstrated action is used for verification. This causes reward ambiguity and makes offline step-wise metrics weak predictors of online task success during RL training. In this work, we present GUI-Libra, a systematic study and tailored training recipe that addresses these challenges. First, to mitigate the scarcity of action-aligned reasoning data, we introduce a data construction and filtering pipeline and release a curated 81K GUI reasoning dataset. Second, to reconcile reasoning with grounding, we propose action-aware supervised fine-tuning that mixes reasoning-then-action and direct-action supervision, and reweights tokens to emphasize action and grounding.
Third, to stabilize RL under partial verifiability, we identify the overlooked importance of KL regularization and show, both theoretically and empirically, that a KL trust region is critical for improving offline-to-online predictability; we further introduce success-adaptive scaling to downweight unreliable negative gradients. Across diverse web and mobile benchmarks, GUI-Libra consistently improves both step-wise accuracy and end-to-end task completion, while strengthening the alignment between offline metrics and online performance. In particular, GUI-Libra-4B and GUI-Libra-8B improve over their base models by +15.6% and +12.2% on AndroidWorld, +4.0% and +8.7% on Online-Mind2Web, and +12.5% and +11.3% on WebArena-Lite-v2, respectively. Our results suggest that carefully designed post-training and data curation can unlock significantly stronger task-solving capabilities without costly online data collection. We release our dataset, code, and models to facilitate further research on data-efficient post-training for reasoning-capable GUI agents.

1 Introduction

Large vision–language models (VLMs) have become a central building block for graphical user interface (GUI) agents (Qin et al., 2025b; Xu et al., 2025c; Gou et al., 2025; Wang et al., 2025a; Bai et al., 2025a), enabling autonomous systems to interpret visual interfaces and output executable actions to complete complex tasks across digital platforms. Among these approaches, native GUI agents (Qin et al., 2025b) refer to a single end-to-end model that directly maps user instructions and observations to executable actions, without relying on external planners or separate grounding modules. Recent open-source native agents have achieved substantial progress in visual grounding and low-level action execution (Xu et al., 2025c; Wang et al., 2025a; Liu et al., 2025c; Wang et al., 2025d), significantly narrowing the gap with proprietary systems.
Despite these advances, native GUI agents remain less effective at long-horizon decision making, where agents must reason over extended observation–action sequences and adapt their behavior reliably to achieve user-specified goals. Advancing native GUI agents increasingly depends on effective post-training of GUI-centric VLMs, yet current approaches face two intertwined bottlenecks. The first is the scarcity of high-quality, action-aligned reasoning data. Existing GUI navigation datasets (Li et al., 2024; Zheng et al., 2024; Xu et al., 2025c) often lack explicit rationales, contain only short or weakly grounded reasoning traces, or include noisy action labels, providing limited supervision for learning robust and interpretable policies. The second bottleneck is the widespread use of generic post-training recipes that do not fully account for the unique properties of GUI agents. Most existing open-source pipelines rely on either supervised fine-tuning (SFT) on brief rationales (Wu et al., 2025b; Xu et al., 2025c) or reinforcement learning (RL) primarily targeting grounding accuracy (Luo et al., 2025; Lu et al., 2025; Zhou et al., 2025b; Yang et al., 2025b).

∗ Corresponding Author. † Work done during internship at Microsoft. Emails: ry21@illinois.edu, qianhuiwu@microsoft.com

Figure 1 Overview of GUI-Libra. Using only a subset of existing open-source GUI trajectories, we tackle key limitations of prior training pipelines through action-aligned reasoning data curation, action-aware SFT, and conservative RL, yielding consistent gains on online benchmarks.
In practice, these approaches expose a persistent tension between reasoning and grounding: incorporating chain-of-thought (CoT) often degrades grounding performance, leading many methods to suppress explicit reasoning rather than addressing the underlying trade-off. Meanwhile, motivated by the success of RL from verifiable rewards (RLVR) in domains such as mathematical reasoning (Shao et al., 2024; Yu et al., 2025), recent work (Hong et al., 2025; Yang et al., 2025c) has explored step-wise RL for GUI agents. However, these methods overlook a fundamental characteristic of GUI interaction, partial verifiability: at each step, multiple actions may correctly advance the task, yet offline supervision verifies only a single demonstrated action. As a result, alternative valid actions are ambiguously treated as failures, introducing biased gradients, destabilizing training, and weakening the connection between offline evaluation metrics and online task success.

To address these challenges, we propose GUI-Libra, a unified post-training framework designed to strengthen decision making in native GUI agents. GUI-Libra is driven by three insights. (i) High-quality rationales and careful data filtering are essential for data-efficient learning, especially because open-source GUI trajectories are often noisy and weakly annotated. (ii) During SFT, CoT tokens can dominate the training loss and interfere with grounding; effective learning therefore requires explicitly prioritizing action and grounding tokens, which directly determine execution. (iii) Under partially verifiable rewards, RL can become unstable without conservative constraints: unlike standard RLVR settings where dropping KL regularization often helps (Yu et al., 2025; Liu et al., 2025d; Zhou et al., 2025b; Yang et al., 2025b), GUI agents benefit from moderate, KL-regularized RL that mitigates reward ambiguity and distribution shift, improving robustness and offline–online alignment.

Guided by these insights, GUI-Libra integrates action-aware supervised fine-tuning (ASFT) with conservative reinforcement learning. To alleviate the scarcity of high-quality reasoning data, we develop a scalable construction and filtering pipeline and release a curated 81K GUI reasoning dataset with improved alignment between reasoning traces and executable actions. In ASFT, GUI-Libra trains on a mixture of reasoning-then-action and direct-action supervision, and applies action-aware token reweighting to emphasize action and grounding tokens, reducing the grounding degradation caused by long CoT traces. In RL, GUI-Libra optimizes policies with GRPO (Shao et al., 2024) under moderate KL regularization, and further introduces a success-adaptive negative gradient scaling strategy to reduce bias from ambiguously "negative" outcomes.

Table 1 Comparison of existing training recipes from different perspectives. The task type column indicates the training target of the released models: Û denotes GUI grounding; additional symbols (not reproduced here) denote GUI navigation (both online and offline) and offline step-wise action prediction.

Name                                  Task Type   Reasoning   SFT   RL   Open Weights   Open Data   Open Code
OS-Atlas (Wu et al., 2025b)               Û           ✗        ✓    ✗        ✓             ✓           ✗
AGUVIS (Xu et al., 2025c)                 Û         short      ✓    ✗        ✓             ✓           ✓
UGround (Gou et al., 2025)                Û           ✗        ✓    ✗        ✓             ✓           ✓
ScaleCUA (Liu et al., 2025c)              Û         long       ✓    ✗        ✓             ✓           ✓
OpenCUA (Wang et al., 2025d)              Û         long       ✓    ✗        ✓             ✓           ✓
UI-TARS (Qin et al., 2025b)               Û         short      ✓    ✓        ✓             ✗           ✗
GLM-4.1-V (Hong et al., 2025)             Û         long       ✓    ✓        ✓             ✗           ✗
Ferret-UI Lite (Yang et al., 2025c)       Û         long       ✓    ✓        ✗             ✗           ✗
UI-R1 (Lu et al., 2025)                   Û         short      ✗    ✓        ✓             ✓           ✓
GUI-R1 (Luo et al., 2025)                 Û         short      ✗    ✓        ✓             ✓           ✓
GTA1 (Yang et al., 2025b)                 Û           ✗        ✗    ✓        ✓             ✓           ✓
GUI-Libra (Ours)                          Û         long       ✓    ✓        ✓             ✓           ✓

Our pipeline has two practical advantages: (1) it derives all training data from existing open-source resources, showing that careful augmentation, filtering, and training method design can make modest open data competitive with closed-data systems; and (2) it avoids costly online environment interaction, making training scalable and accessible while strengthening the connection between offline metrics and online task success.

Extensive experiments across web and mobile benchmarks show that the GUI-Libra series (3B–8B) consistently improves step-wise accuracy on standard offline benchmarks and boosts online task completion on AndroidWorld (Rawles et al., 2025), WebArena-Lite-v2 (Liu et al., 2025c), and Online-Mind2Web (Xue et al., 2025). Notably, GUI-Libra-4B and GUI-Libra-8B improve over their base models by +15.6% and +12.2% on AndroidWorld, and +4.0% and +8.7% on Online-Mind2Web. Detailed ablations further confirm the roles of action-aware supervision and conservative regularization in mitigating grounding degradation, strengthening action prediction, and stabilizing learning under partially verifiable feedback. We also analyze the impact of data filtering and explicit reasoning, and study the trade-off between reasoning and grounding during RL training. We hope these findings and open-source resources will encourage future work on data-efficient and reliable post-training for native GUI agents.
Our main contributions are summarized as follows:

• We present GUI-Libra, a unified post-training framework for native GUI agents that tackles two key challenges: reasoning–grounding interference in SFT, addressed by action-aware SFT that emphasizes action and grounding tokens; and weak offline-to-online predictability in RL, addressed by conservative optimization that constrains policy drift and improves offline–online alignment.

• We develop a scalable data construction and filtering pipeline and release a high-quality open-source 81K GUI reasoning dataset with improved action alignment.

• We achieve consistent gains across representative offline and online web and mobile benchmarks, showing that smaller native VLMs trained on modest open-source data can match or even outperform much larger systems.

2 Related Work

2.1 Datasets for Training GUI Agents

Recent progress in GUI agents has been propelled by a diverse ecosystem of datasets that target both visual perception and task execution. For robust visual grounding and screen parsing, datasets such as SeeClick (Cheng et al., 2024b), UGround (Gou et al., 2025), GUIAct (Chen et al., 2025c), ScaleCUA (Liu et al., 2025c), and GUI-360 (Mu et al., 2025) provide large corpora of annotated screenshots and UI element supervision (Deka et al., 2017; Li et al., 2020b,a; Bai et al., 2021; Wu et al., 2023; Yang et al., 2025a; Zheng et al., 2025b; Wu et al., 2025b; Nayak et al., 2025; Luo et al., 2025). Moving beyond single-step grounding, several large-scale context-aware and trajectory-based datasets capture multi-step interactions in realistic environments, enabling models to learn how UI state evolves over time. Examples include AITW (Rawles et al., 2023), MM-Mind2Web (Zheng et al., 2024; Deng et al., 2023), AMEX (Chai et al., 2025), GUI Odyssey (Lu et al., 2024), and Aria-UI (Yang et al., 2024c). In addition, datasets such as AndroidControl (Li et al., 2024) and JEDI (Xie et al., 2025) enrich interaction trajectories with low-level action descriptions, helping bridge high-level intent with executable operations and improving the learnability of fine-grained GUI manipulation policies.

To train agents to understand not only what actions to take but also why, recent efforts introduce natural-language rationales that explicitly inject observation interpretation and planning into step-by-step decision making, e.g., AITZ (Zhang et al., 2024), AgentTrek (Xu et al., 2025a), OS-Genesis (Sun et al., 2024), Aguvis (Xu et al., 2025c), GUI-Net-1M (Zhang et al., 2025a), and WebSTAR (He et al., 2025). Despite their promise, such reasoning annotations are often short and noisy, limiting their effectiveness in reliably teaching long-horizon reasoning, error recovery, and strategy adaptation. AgentNet (Wang et al., 2025d) takes a step further by synthesizing more detailed reasoning traces that include reflective thoughts, enabling agents to detect mistakes and recover mid-trajectory. However, AgentNet primarily focuses on desktop environments, and high-quality reasoning-rich data for mobile and web scenarios remains scarce, leaving open challenges for training robust, general-purpose GUI agents across platforms.

2.2 VLM Post-training for GUI Agents

Recent advances in GUI agents have been largely driven by post-training VLMs to align natural-language instructions with actionable UI interactions. Many GUI grounding-oriented methods primarily rely on SFT with curated interaction or annotation data, including representative efforts such as SeeClick (Cheng et al., 2024b), OS-Atlas (Wu et al., 2025b), Aria-UI (Yang et al., 2024c), and JEDI (Xie et al., 2025).
Beyond text-based coordinate prediction, GUI-Actor (Wu et al., 2025a) applies an explicit attention mechanism to improve generalization to out-of-distribution screenshots. Instead of imitation-style learning, a growing body of work explores improving grounding accuracy and robustness via reinforcement learning, including UI-R1 (Lu et al., 2025), GUI-R1 (Luo et al., 2025), GUI-G1 (Zhou et al., 2025b), GUI-G2 (Tang et al., 2025), GTA1 (Yang et al., 2025b), and InfiGUI-G1 (Liu et al., 2025b). Hybrid pipelines that combine SFT+RL further push performance by leveraging high-quality demonstrations for initialization and RL for policy refinement, such as Phi-Ground (Zhang et al., 2025c) and UI-Ins (Chen et al., 2025b).

More recently, research increasingly targets unified native GUI models that jointly learn grounding, planning, and multi-step navigation in an end-to-end manner. Several works adopt SFT-only training on mixed trajectory data to obtain strong generalist computer-use models, including CogAgent (Hong et al., 2023), Aguvis (Xu et al., 2025c), ScaleCUA (Liu et al., 2025c), FARA (Awadallah et al., 2025), and OpenCUA (Wang et al., 2025d). To further equip agents with the ability to explore diverse strategies and improve long-horizon success via trial-and-error, other efforts incorporate RL-based post-training for better policy optimization and stability, such as DigiRL (Bai et al., 2024b), AutoGLM (Liu et al., 2024), UI-TARS (Qin et al., 2025b; Wang et al., 2025a), MAI-UI (Zhou et al., 2025a), UI-Venus (Gu et al., 2025), Ferret-UI-Lite (Yang et al., 2025c), and WebGym (Bai et al., 2026).

Despite these advances, three limitations remain. First, encouraging free-form reasoning during RL can hurt grounding accuracy (Zhou et al., 2025b; Yang et al., 2025b; Tang et al., 2025; Lu et al., 2025; Chen et al., 2025b), as models may prioritize high-level semantic logic over precise spatial execution. Second, online RL (Wang et al., 2025a; Zhou et al., 2025a; Bai et al., 2026) is expensive to scale, requiring costly environment interaction and robust infrastructure. Third, step-wise RLVR-style training (Yang et al., 2025c; Hong et al., 2025) in GUI settings faces partial verifiability, which makes rewards ambiguous and introduces noisy or biased learning signals. In this paper, we focus on systematically understanding VLM post-training for GUI agents, and contribute open-source training recipes together with a high-quality, openly released dataset, enabling reproducible development of GUI agents with enhanced reasoning capability.

3 Preliminaries

VLM-based GUI agents. We formulate GUI interaction as a goal-conditioned partially observable Markov decision process (POMDP) with a natural-language instruction space L, specified by (S, A, O, T, R, γ). Here, S denotes latent environment states (e.g., application/page context and UI layout), and A is an action space where each action consists of an operation and its arguments (e.g., click(x, y), type(text), scroll(direction)). State transitions follow T(s_{t+1} | s_t, a_t). At the beginning of each episode, an instruction ℓ ∈ L specifies the task goal. At each step, the agent receives a partial observation o_t ∈ O derived from the underlying state s_t, typically a screenshot. Due to partial observability, the agent conditions on the interaction history h_t = (o_0, a_0, ..., o_{t−1}, a_{t−1}) together with the current observation o_t, and follows a VLM-parameterized policy π_θ(a_t | ℓ, h_t, o_t).
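The interfaces implied by this formulation can be sketched in a few lines; the class and method names below are illustrative, not part of the paper. The key point is that the policy is conditioned on the instruction ℓ, the accumulated history h_t of past (observation, action) pairs, and the current observation o_t.

```python
# Minimal sketch (not the authors' code) of the history-conditioned policy call
# pi_theta(a_t | l, h_t, o_t). `EpisodeContext` and `observe_and_act` are
# hypothetical names used only for illustration.
from dataclasses import dataclass, field


@dataclass
class EpisodeContext:
    instruction: str                                 # l in the instruction space L
    history: list = field(default_factory=list)      # h_t = [(o_0, a_0), (o_1, a_1), ...]

    def observe_and_act(self, obs, policy):
        """Query the policy on (l, h_t, o_t), then extend the history with (o_t, a_t)."""
        action = policy(self.instruction, self.history, obs)
        self.history.append((obs, action))
        return action
```

In a real agent, `policy` would be the VLM forward pass and `obs` a screenshot; the sketch only fixes the conditioning structure.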
The goal-conditioned reward is r_t = R(s_t, a_t; ℓ), which is often sparse and success-only: r_t = 1 if the agent achieves the goal specified by ℓ (typically at termination), and r_t = 0 otherwise. The episode terminates upon success or after a maximum horizon T. The objective is to maximize the expected return,

max_{π_θ} E[ Σ_{t=0}^{T} γ^t r_t ].

When γ = 1, this objective is equivalent to maximizing expected task success.

High-level vs. low-level GUI tasks. We categorize GUI tasks by their temporal abstraction. A low-level task can be completed with a single atomic interaction, such as "type amazon.com in the address bar" or "click the confirm button." In contrast, a high-level task requires a multi-step interaction trajectory, such as "buy a machine learning textbook on Amazon," which induces a sequence of low-level actions across multiple screens. We view grounding as a special case of low-level decision making, where the agent localizes the target UI element by predicting its interaction coordinates from the instruction and the current observation (ℓ, o_t). In this paper, we mainly focus on high-level navigation tasks.

Post-training for GUI models. Supervised fine-tuning (SFT) is a standard approach for post-training GUI models. We assume a dataset of expert trajectories D = {τ_i}_{i=1}^{N}, where each trajectory is τ_i = {(ℓ_i, h_t^i, o_t^i, c_t^i, a_t^i)}_{t=0}^{T_i − 1}. At step t, the model conditions on the context x_t^i = (ℓ_i, h_t^i, o_t^i) and outputs a reasoning trace c_t^i followed by an executable action a_t^i. We concatenate reasoning and action into a single target sequence y_t^i = [c_t^i; a_t^i] and minimize the negative log-likelihood:

L_SFT(θ) = − E_{(x_t, y_t) ∼ D} [ log π_θ(y_t | x_t) ],    (1)

where the expectation is taken over all trajectories and time steps in D.

Beyond SFT, prior work (Luo et al., 2025; Lu et al., 2025; Zhou et al., 2025b; Yang et al., 2025b) adopts reinforcement learning from verifiable rewards (RLVR) to directly optimize step-wise action/coordinate correctness. Given step contexts x = (ℓ, h_t, o_t) ∼ D, we sample a group of G candidate actions {a_k}_{k=1}^{G} ∼ π_{θ_old}(· | x) and compute rewards r_k = R(x, a_k), where R is a weighted combination of rule-based matching scores on action type, value (text), and coordinates. GRPO then updates the policy using a group-relative variant of the policy gradient:

L_GRPO(θ) = − E_{x ∼ D, {a_k}_{k=1}^{G} ∼ π_{θ_old}(·|x)} [ (1/G) Σ_{k=1}^{G} min( ρ_k Â_k, clip(ρ_k, 1 − ϵ, 1 + ϵ) Â_k ) − β · KL( π_θ(· | x) ‖ π_ref(· | x) ) ],    (2)

where ρ_k = π_θ(a_k | x) / π_{θ_old}(a_k | x) and Â_k = (r_k − μ_r) / (σ_r + δ) is the group-normalized advantage. Here, μ_r and σ_r are the mean and standard deviation of {r_k}_{k=1}^{G}, and δ is a small constant for numerical stability. The coefficient β controls KL regularization toward a reference policy π_ref. In practice, recent RLVR-style work often removes the explicit KL term (i.e., sets β = 0) (Yu et al., 2025; Liu et al., 2025d; Zhou et al., 2025b; Yang et al., 2025b). While RLVR provides convenient automatic supervision, step-wise rewards can be ambiguous for high-level GUI tasks: in the same state, multiple distinct actions may be valid and still make progress, making rule-based matching an imperfect proxy for correctness. We analyze this discrepancy in Section 5.3.

4 Reasoning Data Curation for GUI Agents

To address the scarcity of high-quality reasoning data for GUI agents, we develop an automated pipeline to construct and filter a high-quality reasoning dataset, GUI-Libra-81K.

Table 2 Comparison of our GUI reasoning dataset with previous open-source datasets in web and mobile domains.
Token statistics are computed using the Qwen2.5-VL-3B-Instruct tokenizer.

Dataset                                              Avg Thought Tokens/Step   Total Steps   #Traj
MM-Mind2Web (Zheng et al., 2024)                               0                   8K           1K
AndroidControl (Li et al., 2024)                              11                  75K          14K
GUIAct (Chen et al., 2025c)                                    0                  17K         2.5K
AMEX (Chai et al., 2025)                                       0                  35K           3K
GUI-Net-1M (Zhang et al., 2025a)                              37                   4M           1M
ScaleCUA (Liu et al., 2025c)                                   0                 170K          19K
AGUVIS (Xu et al., 2025c) Stage 2 L2 (all sources)            56                 300K          35K
AGUVIS (Xu et al., 2025c) Stage 2 L3 (all sources)            85                 300K          35K
GUI-Libra-81K (Ours)                                         210                  81K           9K

Figure 2 Example data format in GUI-Libra-81K. Each sample includes the current visual observation (screenshot) and textual context (system prompt, user instruction, and interaction history/previous actions). The structured action shown in the figure is:
{"action_description": "Enter 'Washington DC' into the destination input field to specify the city for the event.", "action_type": "Write", "action_target": "Destination input field labeled 'Where do you want to hold your event?'", "value": "Washington DC", "point_2d": [240, 586]}
The model output is split into (1) a CoT reasoning trace and (2) a structured executable action (JSON), specifying the action type, a brief action description, the target element (if available), and action arguments such as text values or coordinates.

4.1 Data Curation and Filtering Pipeline

As shown in Table 2, existing open-source datasets in the web and mobile domains (e.g., the AGUVIS collection (Xu et al., 2025c)) typically provide only short rationales, often fewer than 100 thought tokens per step. Rather than collecting costly new data from online environments, we aim to fully leverage the large volume of existing web and mobile data by augmenting it with richer CoT reasoning and filtering out low-quality samples that would otherwise induce reasoning–action mismatch.

4.1.1 Data Sources

Although large-scale open-source datasets exist for GUI grounding (Gou et al., 2025; Wu et al., 2025b), trajectory-based GUI navigation data remain relatively scarce due to the high cost of collecting multi-step interaction traces. Following AGUVIS (Xu et al., 2025c), we therefore aggregate trajectory data from multiple public sources that cover both web and mobile domains, including GUI-Odyssey (Lu et al., 2024), AMEX (Chai et al., 2025), AndroidControl (Li et al., 2024), AitZ (Zhang et al., 2024), AitW (Rawles et al., 2023), GUIAct (Chen et al., 2025c), and MM-Mind2Web (Zheng et al., 2024). Compared with the original AGUVIS collection, we additionally include the Chinese subset from GUIAct to broaden multilingual coverage and increase website diversity. Overall, these datasets span diverse applications and websites, and include tasks with varying difficulty levels. We apply an initial cleaning stage to remove incomplete trajectories, extremely short or long traces (fewer than 3 steps or more than 50 steps), and steps containing compound actions that cannot be represented in our action space. After cleaning, we obtain 19K trajectories comprising 170K steps.

Figure 3 (a)(b) Data source distribution for SFT and RL. (c) Action type distribution of GUI-Libra-81K. (d) Comparison of step index distributions between our SFT and RL datasets.

4.1.2 Unified Structured Format

Figure 2 summarizes our unified data format. Each sample contains an input and an output. The input includes a system prompt that enumerates available actions, the user instruction, the interaction history (previous actions), and the current screenshot. The output contains (1) a reasoning trace enclosed by <think> ... </think> and (2) a structured action enclosed by <answer> ... </answer>. The structured action is represented as a JSON object with an action type and corresponding arguments, such as value for text entry and point_2d for click coordinates. We consider 13 common action types for web and mobile control: Click, Write, Terminate, Swipe, Scroll, NavigateHome, Answer, Wait, OpenAPP, NavigateBack, KeyboardPress, LongPress, and Select. The action optionally includes an action_target, a natural-language description of the UI element to interact with. In addition, we include an action_description that succinctly states the intended operation, which is appended to the interaction history for subsequent steps and captures a brief step-level rationale. Retaining action_target further supports downstream filtering by enabling consistency checks between the described target element and the coordinates from the original datasets. Details of the action space are provided in Appendix B.

4.1.3 Action-aligned Reasoning Augmentation

Most existing GUI trajectory datasets lack detailed reasoning traces or include only short rationales. AGUVIS (Xu et al., 2025c) augments trajectories by prompting GPT-4o with the instruction, previous actions, and the current action, then requesting a brief "thought" and a one-sentence action description. We find two factors that limit the quality of such generated reasoning. First, the prompt is not sufficiently informative. We extend it with GUI-specific guidelines that encourage structured reasoning (observation description, reflection, and planning), enforce format constraints, and add action-related requirements. Second, reasoning quality is sensitive to the choice of generator model. We compare reasoning traces produced by GPT-4o, o4-mini, and GPT-4.1 and observe substantial differences across models (Figure 13). Our structured output format also facilitates reliable parsing and downstream processing. Moreover, we do not force the generator to exactly follow the dataset action; instead, we treat the annotated action as a reference and allow the model to select a different action from the available set when it has sufficient justification. The full prompt template is provided in Appendix F. At each step, the generator produces the reasoning trace, action_description, action_type, action_target, and value, while we reuse the coordinates from the original dataset as point_2d. Because generation is conditioned on noisy original annotations, mismatches can still arise: for example, the dataset may contain incorrect coordinates, the generated action_target may not match the provided coordinates, or the model may choose a different action than the annotation. We therefore add a dedicated filtering stage to improve action–reasoning alignment and overall data quality.

4.1.4 Data Filtering for SFT

Human-collected and automatically labeled GUI trajectories are inevitably noisy (Yang et al., 2025b; Xu et al., 2025c), including incorrect action types and inaccurate coordinates. To improve data quality, we curate our SFT data using a two-step automatic filtering pipeline.

(1) Agreement filtering via action re-prediction. We run Qwen3-VL-8B-Instruct for 10 stochastic runs on each input and measure how often the predicted action matches the annotation (e.g., exact match on action type and coordinate proximity for Click-like actions). We discard steps with re-prediction accuracy below 0.3, which effectively removes uncertain or low-quality samples.

(2) Coordinate alignment via bounding-box verification. We leverage action_target to verify whether the original coordinate actually corresponds to the intended UI element, and to obtain bounding-box supervision for RL. Concretely, we prompt Qwen3-VL-32B-Instruct to predict a bounding box given the current screenshot and action_target, and keep a step only if the original point_2d falls inside the predicted box. This filtering removes coordinate errors and reduces reasoning–action mismatch, while also providing reliable bounding-box annotations, often missing from prior datasets (Xu et al., 2025c), for subsequent RL training.

SFT Dataset Statistics. After filtering, we obtain 81K SFT steps originating from 9K trajectories. Figure 3 (a) shows the source distribution: most data comes from mobile datasets (e.g., AndroidControl, GUI-Odyssey, AMEX), while only 14.3% comes from the web domain. This reflects the current ecosystem, where large-scale open mobile interaction data are more prevalent; scaling high-quality web trajectories remains an important direction. Figure 3 (c) shows the action-type distribution: Click accounts for around 60% of steps, followed by Write, Terminate, and Swipe, while LongPress and Select are rare. This imbalance makes rare actions difficult to learn from SFT alone, motivating a subsequent RL stage.
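The two filters above can be sketched as follows. Here `repredict_action` and `predict_bbox` are hypothetical stand-ins for the Qwen3-VL-8B and Qwen3-VL-32B calls, and the click-proximity tolerance `tol` is an illustrative assumption (the paper does not specify its value); the 10-run count and the 0.3 threshold follow the text.

```python
# Sketch of the two-step SFT filtering pipeline (illustrative, not the authors' code).

def agreement_filter(step, repredict_action, n_runs=10, min_acc=0.3, tol=14):
    """Keep a step only if stochastic re-predictions agree with the annotation
    often enough: exact action-type match, plus coordinate proximity for clicks."""
    hits = 0
    for _ in range(n_runs):
        pred = repredict_action(step)            # e.g. {'action_type': ..., 'point_2d': ...}
        if pred["action_type"] != step["action_type"]:
            continue
        if step["action_type"] == "Click":       # proximity check for click-like actions
            (px, py), (gx, gy) = pred["point_2d"], step["point_2d"]
            if abs(px - gx) > tol or abs(py - gy) > tol:
                continue
        hits += 1
    return hits / n_runs >= min_acc


def bbox_filter(step, predict_bbox):
    """Keep a step only if its annotated point_2d lies inside the bounding box
    predicted for the step's action_target."""
    x1, y1, x2, y2 = predict_bbox(step["screenshot"], step["action_target"])
    x, y = step["point_2d"]
    return x1 <= x <= x2 and y1 <= y <= y2
```

A step survives into the 81K SFT set only if both predicates return True; the predicted boxes are additionally retained as supervision for the RL stage.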
4.1.5 Data Filtering for RL

For RL, we prioritize a more balanced training set by reducing biases in both step index and domain. Specifically, we address two issues: (i) early-step bias, where many trajectories share similar initial screens and actions (e.g., mobile home screens or common web landing pages), and (ii) domain imbalance, where mobile trajectories dominate the pool. To mitigate these effects, we downsample early steps (small step indices) and further downsample mobile-domain trajectories, resulting in a 40K-step dataset for RL training. Figure 3(c) compares the step-index distributions, while Figure 3(a) compares the domain distributions. Overall, the RL subset is substantially more balanced than the SFT dataset.

5 GUI-Libra

In this section, we introduce GUI-Libra, a native GUI agent with enhanced reasoning capabilities. Using our curated dataset, we first conduct a systematic study of SFT with long CoT and its impact on grounding. We then analyze how step-wise RLVR-style training correlates with online performance in GUI navigation.

5.1 SFT with Long CoT Hurts GUI Grounding

Prior work (Lu et al., 2025; Luo et al., 2025) has observed that removing CoT reasoning can improve grounding performance in GUI agents. However, systematic evidence and analysis of this effect, especially under long CoT traces, remain limited, since most existing training data contain only short rationales. Using GUI-Libra-81K, we investigate how response length correlates with grounding performance on ScreenSpot-v2 (Wu et al., 2025b). Specifically, we fine-tune Qwen2.5-VL-3/7B-Instruct base models and prompt them to generate reasoning before grounding, following our structured response format. To induce diverse response lengths, we sample outputs with a temperature of 1.0. We group responses into 30-token bins and discard bins with fewer than 20 samples for statistical reliability.
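The binning procedure can be sketched as follows. This is a minimal illustration with our own helper name and toy inputs; real inputs would be tokenized response lengths paired with per-sample grounding correctness.

```python
from collections import defaultdict


def binned_accuracy(samples, bin_width=30, min_count=20):
    """Group (response_length, correct) pairs into fixed-width token bins and
    report mean grounding accuracy per bin, discarding sparse bins.

    `samples` is an iterable of (length_in_tokens, 0/1 correctness) pairs.
    Returns a dict mapping each surviving bin's start token count to its
    mean accuracy.
    """
    bins = defaultdict(list)
    for length, correct in samples:
        bins[length // bin_width].append(correct)
    return {
        b * bin_width: sum(v) / len(v)
        for b, v in sorted(bins.items())
        if len(v) >= min_count  # drop bins with too few samples
    }
```

With `bin_width=30` and `min_count=20` this mirrors the grouping used for Figure 4(a).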
As shown in Figure 4(a), grounding accuracy exhibits a clear negative correlation with response length for both base and CoT-SFT models. Longer responses consistently lead to worse grounding performance. Moreover, CoT-based SFT substantially widens the length distribution, producing many responses longer than 250 tokens, which are associated with particularly severe performance drops.

Figure 4 (a) Grounding accuracy on ScreenSpot-v2 versus response length for base models and CoT-SFT models, showing that overly long responses correlate with degraded grounding. (b) Average grounding accuracy under different SFT strategies, where excessively long reasoning traces lead to a substantial drop.

To pinpoint the source of this degradation, we compare three SFT variants on Qwen2.5-VL-3B-Instruct using GUI-Libra-81K: (i) SFT with CoT, which uses the full reasoning-then-action outputs; (ii) SFT without CoT, which removes reasoning and keeps only the structured action output; and (iii) Grounding-only, which predicts coordinates solely from the action-target description. Figure 4(b) shows that grounding-only SFT yields a modest gain, while SFT without CoT slightly degrades performance. In contrast, SFT with long CoT traces causes a substantial drop, indicating that the primary driver of grounding degradation is excessively long reasoning sequences.

5.2 Action-Aware Supervised Fine-Tuning

Our goal is to build native GUI agents that can both reason and act within a single model. Rather than discarding reasoning traces, we seek to preserve reasoning ability while mitigating the grounding degradation caused by long CoT sequences.
To this end, we propose action-aware supervised fine-tuning (ASFT), a unified training framework that combines mixed data supervision with token-level reweighting to balance reasoning, action prediction, and grounding.

Mixed reasoning and direct-action supervision. ASFT trains on a mixture of data with and without explicit reasoning traces. To construct the direct-action data, we remove the reasoning traces and keep only the structured action output between the action tags. Training on both data variants provides two complementary supervision modes: (i) reasoning-then-action and (ii) direct action prediction. Similar data mixtures were used for GUI models such as OpenCUA (Wang et al., 2025d), as well as for LLM agent training (Wang et al., 2025b; Zhang et al., 2025b), but their role in mitigating grounding degradation has not been clearly demonstrated. This dual-mode supervision serves two purposes. First, it increases the amount of action-centric learning signal, strengthening the action prediction that is essential for interactive agents. Second, it reduces reliance on verbose intermediate reasoning, alleviating the grounding degradation induced by long CoT traces. As a result, the model can flexibly produce either concise direct actions or reasoning-then-action outputs at inference time, improving both grounding accuracy and response efficiency.

Action-aware reweighting. In addition to mixed supervision, ASFT further assigns higher weights to action and grounding tokens. Although grounding is part of the action output in our formulation, the action sequence contains both semantic components (e.g., action description, action type, and value) and spatial components (coordinates), which can be weighted differently at the token level. Concretely, we treat the tokens inside the action block
as the action output, and further split them into action tokens (all tokens excluding the point_2d field) and grounding tokens (tokens associated with the point_2d field). Let c_t, a_t, and g_t denote the reasoning, action, and grounding tokens at step t, respectively, and let x_t = (ℓ, h_t, o_t) denote the conditioning context. We denote the mixed training set by D_mix, which contains both reasoning-then-action and direct-action samples; for direct-action samples we set c_t to an empty sequence. Under this unified representation, the ASFT objective is

$$\mathcal{L}_{\mathrm{ASFT}}(\theta) = -\,\mathbb{E}_{(x_t, c_t, a_t, g_t) \sim \mathcal{D}_{\mathrm{mix}}}\!\left[ \frac{\log \pi_\theta(c_t \mid x_t) + \alpha_a \log \pi_\theta(a_t \mid x_t, c_t) + \alpha_g \log \pi_\theta(g_t \mid x_t, c_t, a_t)}{|c_t| + \alpha_a |a_t| + \alpha_g |g_t|} \right], \tag{3}$$

where α_a and α_g control the relative importance of action and grounding tokens. By adjusting these coefficients, ASFT recovers several common training strategies as special cases: (i) α_a = α_g = 1 reduces to standard SFT; (ii) α_a = α_g ≫ 1 emphasizes action/grounding tokens and approximates CoT-free SFT; and (iii) α_g ≫ α_a with α_g ≫ 1 further reduces to grounding-only SFT. Overall, ASFT provides a flexible mechanism for balancing reasoning, action, and grounding during supervised fine-tuning of native GUI agents.

5.3 Reinforcement Learning from Partially Verifiable Rewards

Applying RLVR-style training to optimize step-wise action or coordinate correctness for GUI agents has been explored in prior work (Luo et al., 2025; Lu et al., 2025; Zhou et al., 2025b; Yang et al., 2025b). However, multi-step GUI navigation differs from standard RLVR settings in two key ways. (i) Errors accumulate and induce distribution shift: small step mistakes compound over time and change the distribution of states the agent visits.
(ii) Rewards are partially verifiable: at each step, multiple actions can correctly advance the task, yet offline supervision typically provides and verifies only a single demonstrated action. We show that both factors are crucial for understanding when offline step-wise metrics can reliably predict online task success.

Setup. For ease of analysis, we adopt a finite-horizon MDP that is consistent with our earlier goal-conditioned POMDP formulation. Specifically, we consider a goal-conditioned MDP M = (S, A, P, H), where the instruction ℓ is included in the state, and partial observability is handled by treating the agent's history (or belief state) as the effective state. A policy π induces a sequence of state visitation distributions {d_{π,t}}_{t=1}^{H}, where d_{π,t}(s) is the probability of visiting state s at step t. For each state s, let A*(s) ⊆ A denote the set of valid actions that can correctly advance the task. Offline supervision provides only a single demonstrated action ã(s) ∈ A*(s).

Definition 5.1 (Partially Verifiable Reward). For each state s, the offline dataset provides a single demonstrated action ã(s) ∈ A*(s). The step-wise reward induced by offline verification is r̃(s, a) ≜ 1{a = ã(s)}. We call r̃ partially verifiable if

$$\tilde r(s,a) = 1 \;\Rightarrow\; a \in \mathcal{A}^*(s) \quad \text{but} \quad \tilde r(s,a) = 0 \;\not\Rightarrow\; a \notin \mathcal{A}^*(s),$$

i.e., positive feedback is reliable, while negative feedback is ambiguous because A*(s) often contains valid actions beyond ã(s).

Offline vs. online metrics. Offline evaluation typically measures one-step action matching on a fixed state distribution d_µ, the marginal induced by an expert dataset D_µ. We define the offline score as

$$M_{\mathrm{off}}(\pi) \;\triangleq\; \mathbb{E}_{s \sim d_\mu}\,\pi(\tilde a(s) \mid s) \;=\; \mathbb{E}_{(s, \tilde a) \sim \mathcal{D}_\mu}\,\mathbb{E}_{a \sim \pi(\cdot \mid s)}\,\mathbb{1}\{a = \tilde a\}. \tag{4}$$

Online evaluation measures trajectory-level task success under closed-loop interaction.
We define the probability that π completes the task within horizon H as J(π) ≜ Pr_{τ∼π,P}[success(τ) = 1].

Two quantities controlling predictability. We characterize when M_off(π) is predictive of J(π) using two factors: (i) occupancy mismatch between the online state distribution induced by π and the offline distribution d_µ, and (ii) step-wise ambiguity due to partial verifiability. Formally, define the occupancy mismatch coefficient

$$C(\pi) \;\triangleq\; \max_{t \in [H]} \sup_{s:\, d_\mu(s) > 0} \frac{d_{\pi,t}(s)}{d_\mu(s)}, \tag{5}$$

and define the off-demo validity mass of π at state s as

$$\eta_\pi(s) \;\triangleq\; \pi\big(\mathcal{A}^*(s) \setminus \{\tilde a(s)\} \mid s\big), \qquad \bar\eta_\pi \;\triangleq\; \mathbb{E}_{s \sim d_\mu}[\eta_\pi(s)]. \tag{6}$$

Note that the true step-wise validity probability satisfies π(A*(s) | s) = π(ã(s) | s) + η_π(s), since A*(s) may contain valid actions beyond the single demonstrated action ã(s), which are not credited by offline matching.

Assumption 5.1. If an episode fails under policy π, then there exists at least one step t ≤ H such that a_t ∉ A*(s_t).

Theorem 5.1 (Offline-to-online bound under partial verifiability). Assume Assumption 5.1 and that, for all t ∈ [H], d_{π,t}(s) > 0 implies d_µ(s) > 0 (i.e., supp(d_{π,t}) ⊆ supp(d_µ)). This condition ensures the occupancy ratio C(π) is well-defined. Then the online success probability satisfies

$$J(\pi) \;\geq\; 1 - H \cdot C(\pi) \cdot \big(1 - M_{\mathrm{off}}(\pi) - \bar\eta_\pi\big). \tag{7}$$

In particular, if C(π) is uniformly bounded over a policy class and η̄_π is small or stable across policies, then M_off(π) becomes predictive of J(π) through the affine lower bound in Eq. 7.

Takeaway.
Theorem 5.1 shows that offline-to-online predictability is governed by two factors: (1) distribution shift, captured by C(π), and (2) non-identifiability under partial verifiability, captured by the unobserved off-demo validity mass η̄_π. Thus, offline one-step matching can be a poor proxy for online success when either the policy drifts to states outside the offline support or the probability mass over valid actions shifts from the demonstrated action to other valid alternatives, changing M_off(π) without reflecting true step validity. A detailed proof and discussion are deferred to Appendix E.

5.3.1 Why Is Standard RLVR Easier to Predict?

Corollary 5.1 (Fully verifiable, single-step RLVR). Suppose H = 1 and the reward is fully verifiable, i.e., A*(s) = {ã(s)} for all s (hence η_π(s) ≡ 0). More generally, for H = 1, Eq. 7 reduces to

$$J(\pi) \;\geq\; 1 - C(\pi)\big(1 - M_{\mathrm{off}}(\pi)\big),$$

indicating substantially tighter offline-to-online alignment.

Standard RLVR is often easier to analyze and predict because it is typically single-step (H = 1) and fully verifiable (η_π ≡ 0): one-step matching directly reflects true correctness, and there is no error accumulation over time. In contrast, for multi-step GUI agents, distribution shift across steps and partial verifiability jointly weaken the link between offline matching and online success. This motivates methods that explicitly address both state-distribution shift and reward ambiguity in RL for long-horizon GUI navigation.

5.3.2 KL Regularization Improves Predictability

While many RLVR pipelines omit KL regularization for efficiency (Yu et al., 2025; Liu et al., 2025d; Zhou et al., 2025b; Yang et al., 2025b), we find it is crucial in partially verifiable, multi-step GUI settings.
Intuitively, a KL trust region constrains policy drift, which in turn helps control the two quantities governing offline-to-online predictability in Theorem 5.1: the occupancy mismatch C(π) and the off-demo validity mass η̄_π.

KL-induced bounds for occupancy mismatch and off-demo validity mass (informal). Let π_ref be a reference policy (e.g., the SFT initialization trained on demonstrations) and assume a per-state KL constraint KL(π(·|s) ∥ π_ref(·|s)) ≤ ε for all s. Under this constraint, the state visitation distribution induced by π cannot drift too far from that of π_ref. In particular, if the offline distribution has a positive lower bound on its support, ρ ≜ inf_{s: d_µ(s)>0} d_µ(s) > 0, and d_{π_ref,t} ≪ d_µ for all t ∈ [H], then the occupancy mismatch is controlled as

$$C(\pi) \;\leq\; C(\pi_{\mathrm{ref}}) + \frac{H\sqrt{2\varepsilon}}{\rho}. \tag{8}$$

KL regularization also limits how much probability mass can move away from the demonstrated action. If the reference policy is demo-concentrated, i.e., π_ref(ã(s) | s) ≥ 1 − δ(s), then the off-demo validity mass satisfies

$$\bar\eta_\pi \;\leq\; \bar\delta + \sqrt{\varepsilon/2}, \qquad \bar\delta \;\triangleq\; \mathbb{E}_{s \sim d_\mu}[\delta(s)]. \tag{9}$$

Full statements and proofs are provided in Appendix E.0.2. The above bounds provide a principled explanation for why KL-regularized policy optimization improves predictability: a KL trust region simultaneously limits state-distribution shift and constrains how much probability mass can move away from the demonstrated action. As a result, KL-regularized RL keeps training in a regime where the offline matching score M_off(π) remains a more stable proxy for the online success rate J(π).
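As a minimal illustration of the mechanism (not our training code), the per-state KL constraint can be written for a discrete action distribution; here `beta` plays the role of the KL coefficient used in our RL stage, and the function names are our own.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same action set.

    Assumes q[i] > 0 wherever p[i] > 0 (absolute continuity).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_regularized_objective(reward_term, policy, reference, beta):
    """Policy objective with a KL trust-region penalty toward a reference
    policy (e.g., the SFT initialization):
        maximize  reward - beta * KL(pi || pi_ref).
    """
    return reward_term - beta * kl_divergence(policy, reference)
```

When the policy stays close to the reference, the penalty vanishes and the objective reduces to the plain reward; as the policy drifts, the penalty grows, which is exactly the drift control invoked in Eqs. (8) and (9).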
5.4 Success-Adaptive Negative Gradient Scaling

Under partial verifiability, the step-wise reward provides reliable positive feedback, while r̃(s, a) = 0 is ambiguous: it conflates truly invalid actions with valid-but-uncredited alternatives. Consequently, treating every non-match as equally negative can produce biased and overly aggressive updates, pushing the policy to overfit the demonstrator's particular choice. To address this issue, we propose success-adaptive negative gradient scaling (SNGS), which conservatively downweights gradients induced by ambiguous "negative" outcomes. Importantly, negative updates in policy-gradient methods such as GRPO remain useful for stabilizing training and avoiding premature collapse (Zhu et al., 2025). Therefore, rather than suppressing all negative gradients equally, SNGS rescales them using a state-conditioned reliability signal estimated from the GRPO sampling group.

Concretely, GRPO samples a group of G candidate actions for the same state and computes group-relative advantages. Let {(a_k, r̃_k)}_{k=1}^{G} denote a GRPO group at state s, where r̃_k = 1{a_k = ã(s)} ∈ {0, 1}. We define the empirical group success rate as p̂_g(s) ≜ (1/G) Σ_{k=1}^{G} r̃_k, which measures how concentrated the current policy is on the demonstrated action. We introduce a scaling factor λ_g(s) that rescales only the negative advantages:

$$\lambda_g(s) \;\triangleq\; \min\big(\lambda_0 + \kappa\, \hat p_g(s),\; 1\big). \tag{10}$$

Here λ_0 is an offset, and κ controls how λ_g varies with p̂_g. With κ > 0, λ_g increases with p̂_g: as the policy becomes more concentrated on ã(s), non-matching samples are more likely to be genuinely incorrect, so we downweight negative gradients less and gradually recover the standard GRPO update as λ_g → 1. With κ < 0, λ_g decreases with p̂_g, making updates more conservative for high-success groups.
In our experiments, we find κ > 0 works well in most settings and use it as the default. Let A_k denote the GRPO advantage for sample k. SNGS modifies only the negative advantages:

$$\tilde A_k \;\triangleq\; \begin{cases} A_k, & A_k \geq 0, \\ \lambda_g(s)\, A_k, & A_k < 0. \end{cases} \tag{11}$$

We then replace the advantage term in the GRPO objective (Eq. 2) with Ã_k. As a result, SNGS preserves positive learning signals corresponding to reliably verified matches, while attenuating updates driven by potentially ambiguous negatives. This reduces over-penalization of valid alternatives and leads to more robust policy optimization under partial verification.

Figure 5 Overall training framework of GUI-Libra: Stage 1 applies action-aware SFT with mixed supervision and token reweighting; Stage 2 performs KL-regularized GRPO with success-adaptive negative gradient scaling.
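Eqs. (10) and (11) amount to a one-line rescaling inside the GRPO update. The sketch below is illustrative, and the λ_0 and κ values are placeholder assumptions rather than our tuned settings.

```python
def sngs_advantages(rewards, advantages, lambda0=0.2, kappa=0.8):
    """Success-adaptive negative gradient scaling for one GRPO group.

    `rewards` are the binary step-wise verifications r_k for the G samples;
    `advantages` are their group-relative advantages A_k. Negative advantages
    are rescaled by lambda_g = min(lambda0 + kappa * p_hat, 1), where p_hat is
    the empirical group success rate; positive advantages are left unchanged.
    lambda0 and kappa are illustrative defaults.
    """
    p_hat = sum(rewards) / len(rewards)          # empirical group success rate
    lam = min(lambda0 + kappa * p_hat, 1.0)      # Eq. (10)
    return [a if a >= 0 else lam * a for a in advantages]  # Eq. (11)
```

When the group is fully successful, λ_g saturates at 1 and the standard GRPO update is recovered; for low-success groups, negative gradients are shrunk toward zero.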
5.5 Reward Function Implementation

Each rollout produces a structured prediction string y following our output structure: a reasoning block followed by an action block, where the action block contains a structured action a that can be parsed into a JSON object a = {action_type, action_description, value, point_2d}, with point_2d ∈ R² (or "none" when not applicable). We implement two automated verifiers: a format verifier and an accuracy verifier. The resulting step-wise reward is a weighted sum of the two:

$$\tilde r(s, a) = w_{\mathrm{fmt}}\, r_{\mathrm{fmt}} + (1 - w_{\mathrm{fmt}})\, r_{\mathrm{acc}}, \qquad w_{\mathrm{fmt}} \in [0, 1], \tag{12}$$

where r_fmt checks output validity and r_acc scores action correctness. We set w_fmt = 0.1 so that the reward mainly focuses on action correctness.

Format reward. The format reward r_fmt is 1 if the output contains valid reasoning and action tags and the action block can be parsed into the required JSON schema; otherwise r_fmt = 0.

Accuracy reward. The accuracy reward r_acc evaluates the semantic correctness of the predicted action: r_acc = r_act · r_val · r_g, where each component is computed as follows. (1) The action-type reward r_act checks whether action_type matches the demonstrated action type. (2) The value reward r_val compares the predicted value v with the demonstrated value v* using word-level F1, and sets r_val = 1 if F1(v, v*) > 0.5. (3) The grounding reward r_g evaluates point grounding by checking whether the predicted point u falls inside the demonstrated bounding box b*, i.e., r_g = 1{u ∈ b*}.

Together, these verifiers yield a step-wise signal: positive rewards indicate reliably correct predictions, whereas low rewards may arise from either incorrect actions or valid but uncredited alternatives. This design matches the partial-verifiability setting analyzed in Sec. 5.3.

Figure 6 Limitations of current offline benchmarks. (a) Symbolic action history not in natural language in MM-Mind2Web, (b) action type mismatch and (c) coordinate mismatch in AndroidControl.

5.6 Overall Training Framework for GUI-Libra

Figure 5 summarizes the overall training framework of GUI-Libra. Based on our augmented and filtered datasets, GUI-Libra consists of two stages:

• In the SFT stage, we apply ASFT to equip the base model with action-aligned reasoning and mitigate the grounding degradation caused by long CoT. ASFT mixes reasoning-then-action and direct-action supervision, and uses an action-aware reweighted objective that emphasizes action and grounding tokens.

• In the RL stage, we further optimize the policy with conservative GRPO under partially verifiable step-wise rewards. To stabilize learning and improve offline-to-online predictability, we adopt a conservative RL design with two components: (i) KL regularization to constrain distribution shift and the effect of ambiguous rewards, and (ii) success-adaptive negative gradient scaling to downweight unreliable negative updates caused by valid-but-uncredited alternatives.
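The step-wise reward used in Stage 2 (Sec. 5.5, Eq. 12) can be sketched compactly. The dictionary schema and helper names below are illustrative assumptions; the real verifier additionally parses the raw model output into this form, and value handling for non-Write actions is simplified here.

```python
def word_f1(pred, ref):
    """Word-level F1 between predicted and reference text values."""
    p, r = pred.split(), ref.split()
    if not p or not r:
        return float(p == r)  # both empty counts as a match
    common = sum(min(p.count(w), r.count(w)) for w in set(p))
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(r)
    return 2 * prec * rec / (prec + rec)


def step_reward(pred, demo, w_fmt=0.1):
    """Weighted sum of a format check and an accuracy check, following
    Eq. (12): r = w_fmt * r_fmt + (1 - w_fmt) * r_act * r_val * r_g."""
    r_fmt = 1.0 if pred.get("parsed", False) else 0.0
    r_act = float(pred["action_type"] == demo["action_type"])
    r_val = float(word_f1(pred.get("value", ""), demo.get("value", "")) > 0.5)
    x, y = pred["point_2d"]
    x0, y0, x1, y1 = demo["bbox"]                # bbox from the filtering stage
    r_g = float(x0 <= x <= x1 and y0 <= y <= y1)
    return w_fmt * r_fmt + (1 - w_fmt) * r_act * r_val * r_g
```

A fully matching prediction scores 1.0, while a well-formatted but wrong action collects only the small format component, mirroring the w_fmt = 0.1 setting.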
Overall, GUI-Libra promotes step-wise improvements that are behaviorally meaningful: gains in offline action matching are more likely to translate into better decisions along the policy's own trajectories. In addition, by leveraging partially verifiable offline feedback, our framework enables scalable optimization on large static datasets without requiring costly online interaction during training.

6 Experiments

In this section, we evaluate GUI-Libra on a diverse set of offline and online GUI navigation benchmarks. Beyond overall results, we also study the impact of our key design choices in both the SFT and RL stages, and examine when offline step-wise metrics can reliably predict online task success for GUI agents.

6.1 Experimental Setups

GUI-Libra Details. We train GUI-Libra models from Qwen2.5-VL-3B/7B-Instruct (Bai et al., 2025b) and Qwen3-VL-4B/8B-Instruct (Bai et al., 2025a). We use GUI-Libra-81K for SFT and a downsampled 40K subset for RL. For SFT, we use a learning rate of 1 × 10⁻⁵ with an effective batch size of 256, and set the ASFT weights to α_a = 2 and α_g = 4 by default. To ensure a fair comparison, we train baselines on GUI-Libra-81K for two epochs, while models trained on the mixed reasoning and direct-action data (double the size) are trained for one epoch. Notably, our SFT corpus is substantially smaller than those used in recent GUI models (Yang et al., 2025c; Liu et al., 2025c), and we do not include any direct grounding-only data (e.g., low-level instructions paired with coordinate supervision), focusing instead on GUI reasoning and multi-step navigation. For RL, we use a learning rate of 1 × 10⁻⁶, a rollout batch size of 256, a group size of 8, and a KL coefficient of 0.005 (7B) or 0.001 (others).
While SNGS can improve performance, it is sensitive to hyperparameters; therefore, for ablations unrelated to SNGS, we use KL-regularized GRPO to isolate the effects of the other components. Additional implementation details are provided in Appendix B.

Table 3  Step accuracy performance on AndroidControl-v2.

                                     High Level            Low Level
Model                              Pass@1   Pass@4       Pass@1   Pass@4

Proprietary Models with SeeAct-V Framework
GPT-4o + UGround-v1-7B              57.0     66.3         78.4     85.4
GPT-4.1 + UGround-v1-7B             57.5     63.3         78.4     83.2
GPT-5-mini + UGround-v1-7B          52.8     58.8         77.1     83.2
GPT-5 + UGround-v1-7B               61.3     69.4         86.2     90.0

Open-source Native Models
GUI-R1-3B                           40.0     54.0         55.8     71.9
GUI-R1-7B                           39.7     56.3         62.3     72.6
Aguvis-7B                           37.7     43.7         48.0     48.7
UI-TARS-1.5-7B                      45.2     63.1         48.5     70.9
GLM-4.1V-9B-Thinking                37.2     49.0         67.1     73.4
Qwen2.5-VL-32B                      49.0     66.6         78.4     85.4
Qwen2.5-VL-72B                      56.5     72.9         82.9     90.2
Qwen3-VL-32B                        58.8     69.6         83.9     86.9
Qwen2.5-VL-3B (Baseline)            36.4     50.8         71.1     79.2
GUI-Libra-3B (Ours)           57.3 (+20.9)  67.1 (+16.3)  85.9 (+14.8)  90.5 (+11.3)
Qwen2.5-VL-7B (Baseline)            46.5     58.5         67.8     81.7
GUI-Libra-7B (Ours)           59.3 (+12.8)  67.3 (+8.8)   85.2 (+17.4)  90.7 (+9.0)
Qwen3-VL-4B (Baseline)              49.3     63.3         78.9     82.4
GUI-Libra-4B (Ours)           62.3 (+13.0)  68.6 (+5.3)   86.4 (+7.5)   93.0 (+10.6)
Qwen3-VL-8B (Baseline)              54.8     66.1         77.6     83.2
GUI-Libra-8B (Ours)           64.3 (+9.5)   70.6 (+4.5)   88.9 (+11.3)  91.7 (+8.5)

Table 4  Step accuracy performance on Multimodal-Mind2Web-v2.

                               Cross-Task       Cross-Website     Cross-Domain      Average
Model                        Pass@1  Pass@4    Pass@1  Pass@4    Pass@1  Pass@4    Pass@1  Pass@4

Proprietary Models with SeeAct-V Framework
GPT-4o + UGround-v1-7B        35.7    38.9      33.9    37.6      39.1    42.2      36.2    39.6
GPT-4.1 + UGround-v1-7B       41.1    44.8      36.2    39.7      43.0    46.4      40.1    43.6
GPT-5-mini + UGround-v1-7B    44.2    48.0      40.4    44.2      45.8    48.1      43.5    46.7
GPT-5 + UGround-v1-7B         47.7    51.7      45.0    47.6      48.2    51.6      47.0    50.3

Open-source Native Models
GUI-R1-3B                     24.0    37.7      22.3    37.7      24.6    39.1      23.6    38.2
GUI-R1-7B                     37.0    50.1      34.1    46.3      39.6    50.5      36.9    49.0
Aguvis-7B                     37.7    48.0      31.7    41.5      36.9    45.1      35.4    44.9
UI-TARS-1.5-7B                37.2    48.0      31.3    42.9      35.6    47.2      34.7    46.0
GLM-4.1V-9B-Thinking          26.9    32.9      23.0    29.3      28.7    35.3      26.2    32.5
Qwen2.5-VL-32B                46.2    55.8      42.6    55.7      46.0    57.9      44.9    56.5
Qwen2.5-VL-72B                49.1    60.2      45.1    54.0      49.8    58.6      48.0    57.6
Qwen3-VL-32B                  48.8    57.5      44.3    55.2      49.6    58.9      47.6    57.2
Qwen2.5-VL-3B (Baseline)      24.4    29.0      18.6    24.6      27.1    31.2      23.4    28.3
GUI-Libra-3B (Ours)           42.7    50.8      40.6    48.4      44.8    51.9      42.7 (+19.3)  50.3 (+22.0)
Qwen2.5-VL-7B (Baseline)      31.6    45.6      30.4    42.1      35.6    48.0      32.5    45.2
GUI-Libra-7B (Ours)           46.3    52.8      45.5    52.2      47.6    55.7      46.5 (+14.0)  53.6 (+8.4)
Qwen3-VL-4B (Baseline)        42.9    52.2      38.8    50.0      42.0    51.7      41.2    51.3
GUI-Libra-4B (Ours)           50.8    56.3      48.4    53.2      50.8    57.5      50.0 (+8.8)   55.6 (+4.3)
Qwen3-VL-8B (Baseline)        44.7    54.1      41.0    51.0      45.6    53.4      43.8    52.8
GUI-Libra-8B (Ours)           51.2    55.3      47.9    53.6      52.4    56.7      50.5 (+6.7)   55.2 (+2.4)

Figure 7 Trajectory example of GUI-Libra-7B on AndroidWorld.

Evaluation Benchmarks. We evaluate models on both offline and online benchmarks. For offline evaluation, we follow UGround (Gou et al., 2025) but substantially refine the underlying datasets to improve annotation quality and realism. As illustrated in Figure 6, the original MM-Mind2Web (Zheng et al., 2024) uses symbolic action histories that do not reflect real-world usage, while AndroidControl (Li et al., 2024) contains roughly 20% errors in action types and coordinates. To address these issues, we enhance AndroidControl and MM-Mind2Web by correcting label errors and translating non-natural symbolic action histories, yielding AndroidControl-v2 and Multimodal-Mind2Web-v2 (MM-Mind2Web-v2), respectively. We report step success rate, which requires the predicted action type, textual value, and coordinates to be correct, and include both Pass@1 and Pass@4 step accuracy. For AndroidControl-v2, following UGround (Gou et al., 2025), we evaluate on 398 filtered samples with both high-level and low-level instructions.
For online evaluation, we use AndroidWorld (Rawles et al., 2025), WebArena-Lite-v2 (Liu et al., 2025c), and Online-Mind2Web (Xue et al., 2025), which assess agents in realistic interactive environments. Notably, Online-Mind2Web is evaluated on live websites, introducing additional real-world variability and complexity. We follow the official protocols and report task success rate as the primary metric, with a maximum of 20 steps for AndroidWorld, 15 for WebArena-Lite-v2, and 30 for Online-Mind2Web. Additional details are provided in Appendix C.

Baselines. We compare the GUI-Libra series against a diverse set of native GUI agents, including Qwen2.5-VL-3/7/32/72B (Bai et al., 2025b), Qwen3-VL-4/8/32B (Bai et al., 2025a), Aguvis-7B (Xu et al., 2025c), UI-TARS-1.5-7B (Qin et al., 2025b), GLM-4.1V-9B-Thinking (Hong et al., 2025), and GUI-R1-3/7B (Luo et al., 2025). We also evaluate proprietary models paired with a grounding module, following UGround (Gou et al., 2025), including GPT-4o, GPT-4.1, GPT-5-mini, and GPT-5, and include reported/reproduced results from ScaleCUA (Liu et al., 2025c) on the two online benchmarks. Because evaluation pipelines can substantially affect reported performance and are often unreleased by previous work, we make our best effort to evaluate all models under a unified and consistent protocol for fair comparison.

6.2 Performance on Offline and Online GUI Navigation Benchmarks

6.2.1 Offline Benchmarks

Tables 3 and 4 report step-wise accuracy on AndroidControl-v2 and MM-Mind2Web-v2, comparing GUI-Libra with open-source native GUI models and proprietary systems using the SeeAct-V (Zheng et al., 2024) framework.

Table 5  Performance on the online benchmark AndroidWorld in 20 steps. * denotes numbers reported by the original papers. Left: native models (single VLM). Right: agent frameworks (≥ 2 VLM modules).

(a) Native Models
Model                        Acc.
UI-TARS-1.5-7B               16.5
GLM-4.1V-9B-Thinking         18.3
Qwen2.5-VL-32B               29.6
Qwen2.5-VL-72B               32.2
Qwen3-VL-32B                 34.8
Qwen2.5-VL-3B (Baseline)      3.5
GUI-Libra-3B (Ours)          25.2
Qwen2.5-VL-7B (Baseline)      7.8
GUI-Libra-7B (Ours)          29.6
Qwen3-VL-4B (Baseline)       27.0
GUI-Libra-4B (Ours)          42.6
Qwen3-VL-8B (Baseline)       30.4
GUI-Libra-8B (Ours)          42.6

(b) Agent Frameworks (≥ 2 VLM Modules)
Model                   Additional Module    Acc.
Qwen2.5-VL-3B           Step-wise Summary     7.0
Qwen2.5-VL-7B           Step-wise Summary    15.7
Qwen3-VL-4B             Step-wise Summary    36.5
Qwen3-VL-8B             Step-wise Summary    39.1
ScaleCUA-3B*            Step-wise Summary    23.7
ScaleCUA-7B*            Step-wise Summary    27.2
ScaleCUA-32B*           Step-wise Summary    30.6
GLM-4.1V-9B-Thinking    UGround-v1-7B        20.9
GPT-4o                  UGround-v1-7B        42.6
GPT-4.1                 UGround-v1-7B        37.4
GPT-5-mini              UGround-v1-7B        40.9
GPT-5                   UGround-v1-7B        48.7

Overall, the GUI-Libra series achieves the best Pass@1 performance on both benchmarks, outperforming not only similarly sized models but also several substantially larger open-source and proprietary models. Importantly, GUI-Libra consistently improves over its corresponding base models. For example, GUI-Libra-3B improves Pass@1 over Qwen2.5-VL-3B by +20.9 and +14.8 on AndroidControl-v2 high-level and low-level tasks, respectively, and by +19.3 on the average Pass@1 of MM-Mind2Web-v2. We also observe a clear scaling trend for both the Qwen baselines and the GUI-Libra models, indicating that larger models have greater potential to achieve strong offline decision-making performance. In terms of Pass@4, large models (e.g., Qwen2.5-VL-72B and Qwen3-VL-32B) can be competitive, but they rely on substantially more parameters. In contrast, GUI-Libra is more parameter-efficient, consistently outperforming models of similar scale and even GPT-5.
For instance, GUI-Libra-3B improves Pass@4 over Qwen2.5-VL-3B by +16.3 and +19.3 on AndroidControl-v2 (high-level) and MM-Mind2Web-v2, respectively. We further find that gains on Qwen3-based backbones are relatively smaller than those on Qwen2.5-based ones, which we attribute to Qwen3's heavier post-training (especially RL for reasoning) that already strengthens planning and decision-making. Nevertheless, GUI-Libra still provides meaningful improvements: GUI-Libra-4/8B outperforms Qwen3-VL-4/8B by 13.0/9.5 points on Pass@1 of AndroidControl-v2 (high-level) and by 8.8/6.7 points on MM-Mind2Web-v2, demonstrating consistent benefits even with strong pretrained backbones.

6.2.2 Online Benchmarks

Tables 5, 6, and 7 report task success rates on AndroidWorld, WebArena-Lite-v2, and Online-Mind2Web, respectively. However, many prior works do not release their evaluation frameworks, which hinders reproducibility and can compromise fair comparison. Even open-source studies such as ScaleCUA (Liu et al., 2025c) report results from agent frameworks augmented with additional modules (e.g., step-wise summaries) on AndroidWorld as native agent performance. As shown in Table 5, step-wise summaries improve Qwen3-VL-4B and Qwen3-VL-8B by 9.5 and 8.7 points over their native counterparts, respectively. To support a fair comparison, we therefore explicitly report results for true native models and agent frameworks separately.

On AndroidWorld (Table 5), GUI-Libra substantially strengthens native GUI models across all scales. Relative to their corresponding baselines, GUI-Libra yields large and consistent gains: GUI-Libra-3B increases the success rate from 3.5 to 25.2 (+21.7) and GUI-Libra-8B from 30.4 to 42.6 (+12.2). Notably, GUI-Libra-4B/8B (42.6) surpass several much larger native models (e.g., Qwen2.5-VL-32/72B and Qwen3-VL-32B), and also match or outperform multi-module agent frameworks that add external step-wise summary modules. For example, Qwen3-VL-8B with step-wise summary reaches 39.1, whereas our native GUI-Libra-8B achieves 42.6. Moreover, GUI-Libra-4B/8B reaches performance comparable to strong proprietary systems such as GPT-4o (42.6) and GPT-5-mini (40.9) equipped with UGround, despite using a simpler single-VLM architecture. We further provide qualitative examples of GUI-Libra-7B successfully completing AndroidWorld tasks in Figure 7 and Appendix G.

Table 6 Performance comparison on WebArena-Lite-v2 within 15 steps. * denotes numbers reported in Liu et al. (2025c).

Model                     GitLab   MAP    Reddit   Shopping   ShoppingAdmin   Average
Native Models
Aguvis-72B*                  -      -       -         -             -            5.8
Qwen2.5-VL-72B*              -      -       -         -             -           15.6
InternVL3.5-241B-A28B*       -      -       -         -             -           11.7
UI-TARS-1.5-7B*              -      -       -         -             -           20.8
UI-TARS-72B-DPO*             -      -       -         -             -           23.4
ScaleCUA-3B                21.7    7.7    13.2      16.5          23.6          17.2
ScaleCUA-7B                28.3   15.4    27.6      18.8          30.7          23.9
ScaleCUA-32B               34.2   10.6    26.3      16.5          33.6          24.0
Qwen2.5-VL-3B               1.7    0.0     0.0       1.7           0.0           0.8
GUI-Libra-3B (Ours)        25.8    9.6    18.4      17.6          12.1          16.7
Qwen2.5-VL-7B               8.3    1.0     2.6       7.4           2.9           4.9
GUI-Libra-7B (Ours)        25.0   10.6    26.3      26.1          22.9          22.6
Qwen3-VL-4B                17.5    5.8    10.5      13.1          10.7          11.9
GUI-Libra-4B (Ours)        29.2   10.6    34.2      30.1          17.9          24.4
Qwen3-VL-8B                15.0    5.8    17.1      17.0          19.3          15.3
GUI-Libra-8B (Ours)        31.7   17.3    35.5      26.1          25.0          26.6
Agent Framework (GPT-4o as the Planner)
GPT-4o + UI-TARS-1.5-7B (Qin et al., 2025b)*    -   -   -   -   -   22.6
GPT-4o + UGround-V1-7B (Gou et al., 2025)*      -   -   -   -   -   23.2
GPT-4o + ScaleCUA-7B (Liu et al., 2025c)*       -   -   -   -   -   28.6
On WebArena-Lite-v2 (Table 6), a locally deployed web benchmark (rather than live websites), GUI-Libra shows strong generalization across diverse web tasks despite being trained on only 15K web-related samples, far fewer than the web corpora used by many existing GUI models. Even in this low-data regime, GUI-Libra delivers large gains over its base models: GUI-Libra-7B improves the average success rate from 4.9 to 22.6, and GUI-Libra-8B increases performance from 15.3 to 26.6. These results are competitive with strong proprietary systems such as GPT-4o equipped with UI-TARS and UGround. Moreover, GUI-Libra-8B outperforms large-scale models including ScaleCUA-32B, UI-TARS-72B, and Aguvis-72B, all trained on substantially larger web datasets.

On Online-Mind2Web (Table 7), which evaluates agents on live websites with real-world variability, we evaluate GUI-Libra using two independent judge models: o4-mini and WebJudge-7B (Xue et al., 2025). GUI-Libra consistently improves over its corresponding base models across all difficulty levels. In particular, GUI-Libra-8B increases the average overall score from 19.3 (Qwen3-VL-8B) to 28.0, achieving the best result among all evaluated native models, including those with substantially more parameters. Similarly, GUI-Libra-7B improves from 15.8 (Qwen2.5-VL-7B) to 25.5, and GUI-Libra-4B from 21.7 (Qwen3-VL-4B) to 25.7. Even at the 3B scale, GUI-Libra-3B achieves an average overall score of 21.3, a notable leap from 4.8 for Qwen2.5-VL-3B. Notably, while the ScaleCUA family performs competitively on locally deployed benchmarks, its performance is less competitive on live websites: ScaleCUA-7B and ScaleCUA-32B reach only 23.7 and 23.5 average overall, both of which are surpassed by GUI-Libra-4/7/8B.
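The online protocols above cap each episode at a fixed step budget and score only final task success. A schematic evaluation loop, with hypothetical `env`/`agent` interfaces (the method names are illustrative, not the benchmarks' actual APIs), looks like:

```python
def run_episode(env, agent, max_steps: int) -> bool:
    """Roll the agent until the episode terminates or the step budget is
    exhausted (20 for AndroidWorld, 15 for WebArena-Lite-v2, 30 for
    Online-Mind2Web); task success is judged on the final state."""
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.act(obs)
        obs, done = env.step(action)
        if done:
            break
    return env.task_successful()

def success_rate(env_factory, agent, tasks, max_steps: int) -> float:
    """Task success rate: fraction of tasks whose episode succeeds."""
    wins = sum(run_episode(env_factory(t), agent, max_steps) for t in tasks)
    return wins / len(tasks)
```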
Together, these results suggest that GUI-Libra not only closes the gap between smaller open-source models and larger agents, but also provides stronger robustness and generalization in realistic, dynamically changing web environments.

Overall, GUI-Libra delivers consistent gains on online mobile and web benchmarks, generalizing from locally deployed environments to live websites. It matches or surpasses larger models while remaining highly data-efficient, using relatively little training data, especially in the web domain.

Table 7 Performance comparison on Online-Mind2Web within 30 steps, using o4-mini and WebJudge-7B as judges.

                                     o4-mini                     WebJudge-7B                Avg.
Model                      Easy  Medium  Hard  Overall   Easy  Medium  Hard  Overall   Overall
Agent Framework
GPT-4o + UGround-v1-7B     38.8   14.0    6.5   18.7     45.0   27.3   19.5   30.0      24.3
GPT-4.1 + UGround-v1-7B    41.3   21.7    5.2   22.7     47.5   35.0   28.6   36.7      29.7
GPT-5 + UGround-v1-7B      40.0   24.5   14.3   26.0     45.0   31.5   24.7   33.3      29.7
Native Models
Qwen2.5-VL-32B             12.5    7.7    1.3    7.3     28.8   14.7   19.5   19.7      13.5
Qwen3-VL-32B               33.8   17.5    7.8   19.3     45.0   31.5   28.6   34.3      26.8
ScaleCUA-3B                30.0    4.9    2.6   11.0     37.5   18.2   11.7   21.7      16.3
ScaleCUA-7B                33.8   14.7    3.9   17.0     47.5   27.3   18.2   30.3      23.7
ScaleCUA-32B               31.3   14.7    6.5   17.0     43.8   29.4   16.9   30.0      23.5
Qwen2.5-VL-3B               3.8    0.0    1.3    1.3     16.3    7.7    1.3    8.3       4.8
GUI-Libra-3B (Ours)        28.8    9.8    5.2   13.7     47.5   21.0   24.7   29.0      21.3
Qwen2.5-VL-7B              22.5    7.7    0.0    9.7     36.3   18.9   13.0   22.0      15.8
GUI-Libra-7B (Ours)        36.3   15.4    2.6   17.7     47.5   30.8   23.4   33.3      25.5
Qwen3-VL-4B                33.8   10.5    6.5   15.7     43.8   23.8   18.2   27.7      21.7
GUI-Libra-4B (Ours)        36.3   18.2    6.5   20.0     45.0   30.1   19.5   31.3      25.7
Qwen3-VL-8B                23.8    9.8    0.0   11.0     43.8   23.1   19.5   27.7      19.3
GUI-Libra-8B (Ours)        31.3   17.5   10.4   19.3     42.5   37.8   28.6   36.7      28.0

6.3 Action-aware SFT and RL Mitigate Grounding Performance Degradation

Figure 8 analyzes how grounding performance varies with response length for the base model (Qwen2.5-VL-3B) and SFT variants trained with different strategies. We evaluate on ScreenSpot-v2, group outputs into 30-token length bins, discard bins with fewer than 20 samples, and report the average grounding accuracy within each bin. This setup enables a fine-grained analysis of how increasingly long CoT outputs affect grounding performance. As response length increases, both the base model and standard SFT exhibit a pronounced degradation in grounding accuracy, indicating that long-form reasoning interferes with precise action execution. In contrast, action-aware SFT substantially mitigates this degradation across all response lengths. By incorporating direct-action supervision, action-aware SFT supports both reasoning and no-reasoning modes. Weighted training objectives further stabilize performance under long responses. In particular, stronger weighting strategies preserve high grounding correctness even beyond 250 tokens, significantly outperforming both the base model and standard SFT.

Table 8 reports overall grounding accuracy and average response length across different models and inference modes. Models trained with mixed data can be flexibly prompted to run inference in either reasoning or no-reasoning mode, denoted Reason and No-Reason. Across both the 3B and 7B scales, mixed-data training and action-aware weighting consistently improve average grounding accuracy in both modes. For example, at the 7B scale, mixed-data SFT improves grounding accuracy in reasoning mode from 79.0% to 81.4%, while action-aware weighting further increases it to 83.4%. Despite these improvements, ASFT alone does not fully eliminate the gap between reasoning and no-reasoning modes.
For instance, ASFT-3B in reasoning mode still underperforms its no-reasoning counterpart by 4.8 points, and ASFT-7B exhibits a similar 3.4-point gap, suggesting that residual interference between reasoning and grounding remains. This gap is largely eliminated after our RL training. GUI-Libra models achieve comparable, and even superior, grounding accuracy in reasoning mode despite producing longer responses than in the no-reasoning mode. In particular, GUI-Libra-7B attains higher grounding accuracy in reasoning mode than in no-reasoning mode (89.3% vs. 88.5%), and GUI-Libra-3B achieves similar accuracy (83.4% vs. 83.2%), while generating substantially more tokens (176 vs. 124 for the 7B model and 206 vs. 59 for the 3B model). Notably, our RL stage does not use direct grounding supervision, unlike prior work (Luo et al., 2025; Lu et al., 2025); instead, it leverages step-wise data derived from high-level and multi-step tasks. These results demonstrate that RL further reshapes the policy to better align reasoning with grounding, fully mitigating grounding degradation under long CoT outputs and complementing the benefits of action-aware SFT.

Figure 8 Grounding accuracy under different response lengths (curves: base model, SFT, SFT + mixed data, ASFT (a=1, g=2), and ASFT (a=3, g=6)). Action-aware SFT strategies, including mixing direct-action data and weighted objectives, help preserve grounding accuracy under long CoT outputs.
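The ASFT weighting in Figure 8 upweights action and grounding tokens in the SFT objective. A minimal NumPy sketch of such a weighted cross-entropy is below; the token masks and the exact semantics of the (a, g) coefficients are illustrative assumptions about the implementation, not the paper's definition:

```python
import numpy as np

def action_aware_loss(logits, labels, action_mask, ground_mask, a=3.0, g=6.0):
    """Weighted cross-entropy: tokens in the action span are upweighted by
    `a`, coordinate (grounding) tokens by `g`, all other tokens keep weight 1.
    logits: (seq, vocab); labels and the 0/1 masks: (seq,)."""
    # numerically stable log-softmax over the vocabulary dimension
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    per_token = -log_probs[np.arange(len(labels)), labels]   # NLL per token
    weights = 1.0 + (a - 1.0) * action_mask + (g - 1.0) * ground_mask
    return float((weights * per_token).sum() / weights.sum())
```

With both masks zero this reduces to plain token-averaged cross-entropy, so the weighting only reshapes the relative emphasis between reasoning, action, and coordinate tokens.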
Table 8 Grounding accuracy and average response tokens across different models and inference modes. "Reason" and "No-Reason" indicate whether explicit reasoning is encouraged through prompting.

Model Name                Inference Mode   Average Tokens   Grounding Accuracy (%)
SFT-3B                    Reason               223.4              73.4
SFT-3B + Mixed Data       No-Reason             77.3              79.4
SFT-3B + Mixed Data       Reason               200.6              73.8
ASFT-3B                   No-Reason             67.7              81.0
ASFT-3B                   Reason               200.2              76.2
GUI-Libra-3B (ASFT+RL)    No-Reason             59.0              83.2
GUI-Libra-3B (ASFT+RL)    Reason               206.5              83.4
SFT-7B                    Reason               218.2              79.0
SFT-7B + Mixed Data       No-Reason             69.4              85.6
SFT-7B + Mixed Data       Reason               168.1              81.4
ASFT-7B                   No-Reason             76.3              86.8
ASFT-7B                   Reason               169.6              83.4
GUI-Libra-7B (ASFT+RL)    No-Reason            124.4              88.5
GUI-Libra-7B (ASFT+RL)    Reason               176.1              89.3

Figure 9 Comparison of training and evaluation metrics with and without KL regularization: (a) training reward, (b) policy entropy during training, (c) offline evaluation performance on AndroidControl-High, and (d) online evaluation performance on AndroidWorld.

6.4 On the Effectiveness of KL Regularization for RL

As analyzed in Section 5.3.2, KL regularization theoretically controls both distribution shift and reward ambiguity, making offline step-wise metrics more reliable predictors of online task completion. To validate this empirically, we visualize training and evaluation metrics on AndroidControl-High and AndroidWorld in Figure 9, which represent typical offline and online settings that share the same action space. We compare RL runs with and without KL regularization, using KL coefficients of 0.0 vs. 0.005 for 7B models and 0.0 vs. 0.001 for 3B models, with identical initialization. As shown in Figure 9(a), training reward curves largely overlap across different KL settings, indicating similar reward optimization behavior.
However, evaluation behavior differs markedly: in Figure 9(c) and (d), models trained without KL regularization exhibit noticeable performance degradation despite increasing training rewards, reflecting a form of reward hacking commonly observed in RLHF. Moreover, Figure 9(b) shows that removing KL regularization leads to a pronounced decrease in policy entropy, indicating premature policy collapse and overfitting. In contrast, a small KL penalty stabilizes policy entropy and yields more consistent offline and online performance.

To quantitatively examine how well offline metrics predict online performance, we plot the offline and online scores of all intermediate checkpoints from both 3B and 7B models in Figure 10. The results reveal an approximately linear relationship, supporting our theoretical analysis in Theorem 5.1. In Figure 10(b), we report both Pearson correlation, which measures linear dependence, and Spearman correlation, which captures rank consistency between offline and online performance. Although the overall Pearson correlation is moderate (r = 0.76), further analysis shows substantial differences across KL regularization settings. When separating checkpoints by KL strength, models trained with KL regularization (KL > 0) exhibit significantly stronger alignment between offline and online performance. As shown in Figure 10(b), KL-regularized models achieve Pearson and Spearman correlations of 0.89 (p < 10^-4) and 0.83 (p = 2×10^-4), respectively, indicating strong and statistically robust dependence. In contrast, models trained without KL regularization show weaker correlations (Pearson r = 0.63, p = 0.015; Spearman r = 0.53, p = 0.053), with the rank correlation failing to reach conventional significance levels. Overall, these results provide strong empirical evidence that KL regularization improves training stability and enhances the predictability of online task success from offline evaluations, complementing our theoretical analysis in Section 5.3.2.

Figure 10 (a) Correlation between offline and online performance. (b) Comparison of Pearson and Spearman correlations with and without KL regularization.

Figure 11 Ablation study of data filtering at the (a) SFT and (b) RL stages. Data filtering consistently improves both Pass@1 and Pass@4 performance across three benchmarks.

6.5 Ablations

In this section, we conduct a series of ablation studies to better understand the contribution of individual designs and parameters in our pipeline on GUI navigation tasks. Specifically, we examine the impact of our data filtering strategies, analyze the role of each component in ASFT, and study the impact of different KL coefficients during RL training.

Ablation of Data Filtering for SFT and RL. In Sections 4.1.4 and 4.1.5, we introduce data filtering pipelines for both the SFT and RL stages, where approximately half of the original dataset is retained. Specifically, for SFT, we remove low-quality and ambiguous samples, while for RL, we reduce domain imbalance and early-step bias. Figure 11 reports results using a 3B base model. Across most settings, filtering consistently improves performance, with especially large gains in Pass@4. For example, in SFT, filtering improves AndroidControl-High by +4.5 Pass@1 and +6.3 Pass@4, while in RL it yields an additional +0.5 Pass@1 and +3.7 Pass@4. These results highlight the importance of data quality in both SFT and RL: focusing on a smaller but cleaner and less biased dataset can generalize better than using a larger, noisier corpus.
Table 9 Ablation study on ASFT, KL regularization, and reasoning across benchmarks, with Qwen2.5-VL-3B as the base model.

                                        MM-Mind2Web-v2    AC-v2 (High)     AC-v2 (Low)    AndroidWorld
Model                                   Pass@1  Pass@4   Pass@1  Pass@4   Pass@1  Pass@4
Base Model                               23.4    28.3     36.4    50.8     71.1    79.2       3.5
SFT                                      28.5    36.9     45.7    59.8     73.1    83.2       5.2
SFT + Mixed Data                         30.2    42.0     45.5    64.8     72.6    85.4      11.3
ASFT                                     32.0    41.3     44.5    64.6     75.4    86.9      13.0
GUI-Libra w/o ASFT (KL reg=0.0)          40.9    45.1     50.5    54.8     78.6    81.9      17.4
GUI-Libra w/o ASFT (KL reg=0.001)        41.9    48.0     57.0    63.6     86.2    90.0      20.9
GUI-Libra (KL reg=0.0)                   43.8    49.2     49.8    58.0     87.2    91.0      21.7
GUI-Libra (KL reg=0.001)                 42.7    50.2     55.8    65.8     89.5    91.5      25.2
GUI-Libra (KL reg=0.01)                  43.4    50.6     51.5    64.8     85.9    92.0      21.7
GUI-Libra (KL reg=0.05)                  41.4    51.1     49.8    66.1     87.9    92.2      20.0
ASFT w/o CoT                             35.2    41.4     40.2    56.0     71.4    87.7       5.2
GUI-Libra (KL reg=0.001) w/o CoT         43.4    47.1     48.7    55.8     85.2    87.7      12.2
ASFT infer w/o CoT                       37.1    43.8     42.7    58.3     75.6    89.2       8.7
GUI-Libra (KL reg=0.001) infer w/o CoT   42.5    48.7     52.0    59.3     88.2    92.2      18.3

Ablations on ASFT and RL for Navigation Tasks. In Section 6.3, we analyzed the effects of action-aware SFT and RL on mitigating grounding degradation under long CoT outputs. Here, we further examine their impact on navigation performance. Table 9 reports results on both offline and online benchmarks using the Qwen2.5-VL-3B backbone. We observe an improving trend on most benchmarks and metrics as we progress from the base model to SFT, mixed-data SFT, and ASFT. In particular, incorporating mixed supervision and action-aware weighting improves Pass@1 on MM-Mind2Web-v2 from 23.4 to 32.0 and on AndroidControl-v2 (High) from 36.4 to 44.5, while the AndroidWorld success rate increases from 3.5 to 13.0. These results indicate that all components of ASFT benefit not only grounding accuracy, but also offline navigation and long-horizon online decision making. Beyond ASFT, RL brings further substantial gains.
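The KL-regularized GRPO updates used in these RL runs can be sketched as follows. Whether the penalty enters through reward shaping (as in this sketch) or as a separate loss term is an implementation choice, and the k3-style KL estimator below is an assumption, not necessarily the paper's exact formulation:

```python
import numpy as np

def kl_regularized_advantage(rewards, logp_policy, logp_ref, beta=0.001):
    """Group-normalized GRPO-style advantages with a per-sample KL penalty
    against a frozen reference policy. `logp_policy` / `logp_ref` are the
    (summed) log-probabilities of each sampled response under the current
    and reference policies; `beta` is the KL coefficient (e.g., 0.001 for
    3B models, 0.005 for 7B models in the runs above)."""
    rewards = np.asarray(rewards, dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # k3-style KL estimate: exp(r) - r - 1, with r = logp_ref - logp_policy;
    # non-negative, and zero when the two policies agree on the sample
    r = np.asarray(logp_ref, dtype=float) - np.asarray(logp_policy, dtype=float)
    kl = np.exp(r) - r - 1.0
    return adv - beta * kl
```

Setting beta = 0 recovers unregularized GRPO, which in Table 9 corresponds to the KL reg=0.0 rows.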
With RL training and moderate KL regularization (e.g., KL = 0.001), Pass@1 and Pass@4 on MM-Mind2Web-v2 improve by 10.7 and 8.9 points over ASFT, respectively, and AndroidWorld performance increases markedly from 13.0 to 25.2, highlighting the limitations of supervised fine-tuning alone and the importance of RL for generalization in dynamic environments.

Ablation of the KL Regularization Coefficient. We further observe that the KL coefficient plays an important role in balancing Pass@1 and Pass@4 performance. As shown in Table 9, increasing the KL coefficient generally improves Pass@4 performance, while Pass@1 may drop slightly. This trend is consistent with our observation that stronger KL regularization retains higher policy entropy. Within this trade-off, moderate regularization (e.g., KL = 0.001) yields strong and stable results across benchmarks, achieving the highest Pass@1 on AndroidControl-v2 and competitive Pass@1 and Pass@4 on MM-Mind2Web-v2, while substantially improving online performance on AndroidWorld. In contrast, overly large penalties (e.g., KL = 0.05) or removing KL regularization tend to degrade overall performance, reducing the AndroidWorld success rate to near 20.0. These results further validate that moderate KL regularization effectively balances distribution shift and reward ambiguity, whereas excessively large penalties lead to overly conservative policies.

Ablation of Reasoning in Model Training and Inference. We analyze the role of reasoning in both training and inference by systematically ablating CoT usage in ASFT and GUI-Libra, as summarized in Table 9. We first consider models trained without CoT and evaluated without CoT (denoted ASFT w/o CoT and GUI-Libra w/o CoT).
Table 10 Comparison between GUI-Libra w/ SNGS and w/o SNGS across benchmarks.

                          MM-Mind2Web-v2    AC-v2 (High)     AC-v2 (Low)    AndroidWorld   WebArena-Lite-v2
Model                     Pass@1  Pass@4   Pass@1  Pass@4   Pass@1  Pass@4
GUI-Libra-4B (w/o SNGS)    49.1    55.1     59.8    69.9     87.7    92.0       39.1            22.2
GUI-Libra-4B (w/ SNGS)     50.0    55.6     62.3    68.6     86.4    93.0       42.6            24.4

Compared to models trained and evaluated with CoT, performance degrades across most benchmarks, with the most pronounced drops on the online benchmark AndroidWorld: GUI-Libra's success rate decreases from 25.2 to 12.2, and ASFT drops from 13.0 to 5.2. The declines are substantially larger than those observed on offline benchmarks, highlighting the importance of CoT for generalization in dynamic online environments. Next, we remove CoT only at inference time (denoted ASFT infer w/o CoT and GUI-Libra infer w/o CoT), while using checkpoints trained with CoT. Since ASFT incorporates direct-action supervision, these models can be prompted to produce direct actions at inference. Under this setting, ASFT shows improved performance on several offline benchmarks, such as MM-Mind2Web-v2 and AndroidControl-v2 (Low), compared to inference with CoT, but still exhibits a notable drop on AndroidWorld (13.0 → 8.7). In contrast, GUI-Libra consistently degrades on both offline and online benchmarks (except AndroidControl-v2 (Low)) when CoT is removed at inference. This suggests that ASFT can benefit from direct-action inference on in-distribution tasks, whereas RL relies more strongly on the joint presence of reasoning traces and actions, leveraging their coupling for decision making. Importantly, across all ablations, training with CoT consistently yields better performance than removing CoT during training, even when inference is ultimately performed without CoT.
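The Reason and No-Reason inference modes in these ablations are toggled purely through prompting of a model trained on mixed data. A sketch of such a toggle is below; the exact prompt wording is an illustrative assumption, not the paper's actual template:

```python
def build_prompt(instruction: str, screenshot_tag: str, reason: bool) -> str:
    """Toggle between reasoning and no-reasoning inference modes via the
    instruction text alone; a model trained on mixed reasoning-then-action
    and direct-action data can follow whichever format is requested."""
    if reason:
        fmt = ("First write your reasoning after 'Reasoning:', then give "
               "the action after 'Action:'.")
    else:
        fmt = "Respond with the action only, after 'Action:'."
    return f"{screenshot_tag}\nTask: {instruction}\n{fmt}"
```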
Overall, these results highlight that explicit reasoning during both training and inference is important for effective GUI agents, especially for strong online generalization.

Ablation of SNGS. Table 10 examines the effect of SNGS on GUI-Libra-4B. Overall, enabling SNGS consistently improves online generalization, boosting performance on both AndroidWorld and WebArena-Lite-v2. For example, GUI-Libra-4B improves from 39.1 → 42.6 (+3.5) on AndroidWorld and from 22.2 → 24.4 (+2.2) on WebArena-Lite-v2. On offline benchmarks, SNGS yields smaller but generally positive gains in reasoning-demanding settings such as AndroidControl-v2 (High) and MM-Mind2Web-v2. These gains come with minor trade-offs on low-level metrics, i.e., AndroidControl-v2 (Low) Pass@1, suggesting that SNGS reduces overfitting to short-horizon action prediction and instead favors generalizable reasoning and more robust online behavior.

Table 11 Effect of mixing grounding data into RL training on ScreenSpot-v2 (SS-v2), ScreenSpot-Pro (SS-Pro), MM-Mind2Web-v2, and AndroidControl-v2 (AC-v2). We report Pass@1 on all benchmarks. ↑ indicates performance gains and ↓ indicates degradations relative to the corresponding GUI-Libra models.

Model                              SS-v2          SS-Pro         MM-Mind2Web-v2   AC-v2 (High)   AC-v2 (Low)
Qwen3-VL-4B                        91.7           52.8           41.2             49.3           78.9
GUI-Libra-4B                       92.3           54.3           49.1             59.8           87.7
GUI-Libra-4B + Mix Grounding 20k   94.6 ↑ (+2.3)  61.4 ↑ (+7.1)  43.9 ↓ (-5.2)    61.3 ↑ (+1.5)  84.9 ↓ (-2.8)
Qwen3-VL-8B                        92.1           52.7           43.8             54.8           77.6
GUI-Libra-8B                       90.7           54.1           50.3             65.6           88.7
GUI-Libra-8B + Mix Grounding 20k   94.8 ↑ (+4.1)  59.9 ↑ (+5.8)  49.5 ↓ (-0.8)    61.7 ↓ (-3.9)  86.4 ↓ (-2.3)

6.6 RL with Mixed Navigation and Grounding Data

Prior work (Yang et al., 2025b; Luo et al., 2025; Lu et al., 2025) shows that adding direct grounding supervision (element descriptions paired with coordinates) during RL can substantially improve visual localization.
To study how this supervision affects both grounding and reasoning, we take the grounding dataset from Yang et al. (2025b), downsample 20K examples, and mix it with our 40K navigation-focused RL dataset for joint training. We convert grounding samples into our unified action format by keeping only click actions, and apply the same reward computation as in our RL pipeline. Table 11 reports results on grounding benchmarks (ScreenSpot-v2 and ScreenSpot-Pro) and navigation benchmarks (MM-Mind2Web-v2 and AndroidControl-v2).

We observe a clear trade-off. Mixing grounding data consistently improves grounding accuracy on ScreenSpot-v2 and ScreenSpot-Pro by 2–7 points, indicating stronger visual localization. In contrast, performance on navigation benchmarks generally declines, suggesting weaker reasoning and decision making. Overall, these results reveal competing optimization pressures: adding direct grounding supervision strengthens spatial alignment, but can reduce performance on reasoning-intensive navigation tasks.

7 Conclusion

We introduce GUI-Libra, a unified framework for training reasoning-capable native GUI agents based on our curated GUI-Libra-81K dataset. The key takeaway is that competitive long-horizon navigation can be obtained by fully leveraging existing trajectory corpora with a carefully designed post-training pipeline: action-aware SFT preserves grounding performance under long reasoning traces, and conservative RL improves decision making from partially verifiable feedback by controlling policy drift. Across diverse mobile and web benchmarks, GUI-Libra achieves strong offline and online results with favorable data and parameter efficiency, without relying on expensive online interaction during training.
Beyond benchmark gains, we analyze the effects of key design choices in ASFT and RL, and show that GUI-Libra makes offline evaluation more reliable and more predictive of online task success, an important property for real-world deployment. We hope our findings and released resources will encourage further work on data-efficient and reliable learning frameworks for interactive GUI agents in real-world settings.

Limitations

Our work focuses on learning from existing open-source datasets. While this setting is meaningful and our results suggest that current trajectory corpora still have substantial untapped potential, we train on a relatively limited amount of data and do not explore how to extend the framework to fully online, interactive training. As more large-scale open-source GUI interaction data become available (He et al., 2025; Wang et al., 2025d; Zhang et al., 2025a), scaling our pipeline to incorporate broader and more diverse trajectories is a promising direction. In addition, fully online RL can be expensive and slow, and typically requires robust infrastructure and careful system design. We leave a systematic study of extending our framework to a fully online scheme as future work.

Acknowledgments

The authors thank Hao Bai, Chenlu Ye, and Xiao Yu for valuable discussions, and Boyu Gou and Yiheng Xu for guidance on benchmark and model evaluation.

References

Ahmed Awadallah, Yash Lara, Raghav Magazine, Hussein Mozannar, Akshay Nambi, Yash Pandya, Aravind Rajeswaran, Corby Rosset, Alexey Taymanov, Vibhav Vineet, Spencer Whitehead, and Andrew Zhao. Fara-7B: An efficient agentic model for computer use, 2025.

Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, and Blaise Aguera y Arcas. UIBert: Learning generic multimodal representations for UI understanding, 2021.

Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. DigiRL: Training in-the-wild device-control agents with autonomous reinforcement learning, 2024a. URL https://arxiv.org/abs/2406.11896.

Hao Bai, Yifei Zhou, Jiayi Pan, Mert Cemri, Alane Suhr, Sergey Levine, and Aviral Kumar. DigiRL: Training in-the-wild device-control agents with autonomous reinforcement learning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024b. URL https://openreview.net/forum?id=4XTvXMSZPO.

Hao Bai, Alexey Taymanov, Tong Zhang, Aviral Kumar, and Spencer Whitehead. WebGym: Scaling training environments for visual web agents with realistic tasks. arXiv preprint arXiv:2601.02439, 2026.

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan Liu, Dunjie Lu, Ruilin Luo, Chenxu Lv, Rui Men, Lingchen Meng, Xuancheng Ren, Xingzhang Ren, Sibo Song, Yuchong Sun, Jun Tang, Jianhong Tu, Jianqiang Wan, Peng Wang, Pengfei Wang, Qiuyue Wang, Yuxuan Wang, Tianbao Xie, Yiheng Xu, Haiyang Xu, Jin Xu, Zhibo Yang, Mingkun Yang, Jianxin Yang, An Yang, Bowen Yu, Fei Zhang, Hang Zhang, Xi Zhang, Bo Zheng, Humen Zhong, Jingren Zhou, Fan Zhou, Jing Zhou, Yuanzhi Zhu, and Ke Zhu. Qwen3-VL technical report, 2025a. URL https://arxiv.org/abs/2511.21631.

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025b.
Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Guozhi Wang, Dingyu Zhang, Shuai Ren, and Hongsheng Li. AMEX: Android multi-annotation expo dataset for mobile GUI agents. In Findings of the Association for Computational Linguistics: ACL 2025, 2025. URL https://aclanthology.org/2025.findings-acl.110/.

Hanyang Chen, Mark Zhao, Rui Yang, Qinwei Ma, Ke Yang, Jiarui Yao, Kangrui Wang, Hao Bai, Zhenhailong Wang, Rui Pan, Mengchao Zhang, Jose Barreiros, Aykut Onol, ChengXiang Zhai, Heng Ji, Manling Li, Huan Zhang, and Tong Zhang. ERA: Transforming VLMs into embodied agents via embodied prior learning and online reinforcement learning, 2025a. URL https://arxiv.org/abs/2510.12693.

Liangyu Chen, Hanzhang Zhou, Chenglin Cai, Jianan Zhang, Panrong Tong, Quyu Kong, Xu Zhang, Chen Liu, Yuqi Liu, Wenxuan Wang, Yue Wang, Qin Jin, and Steven Hoi. UI-Ins: Enhancing GUI grounding with multi-perspective instruction-as-reasoning, 2025b. URL https://arxiv.org/abs/2510.20286.

Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, and Maosong Sun. GUICourse: From general vision language model to versatile GUI agent. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025c. URL https://aclanthology.org/2025.acl-long.1065/.

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI grounding for advanced visual GUI agents, 2024a. URL https://arxiv.org/abs/2401.10935.

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024b.
URL https://aclanthology.org/2024.acl-long.505.

Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, 2017. URL https://doi.org/10.1145/3126594.3126651.

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=kiYqbO3wqw.

Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, and Kai-Wei Chang. OpenVLThinker: Complex vision-language reasoning via iterative SFT-RL cycles, 2025. URL https://arxiv.org/abs/2503.17352.

Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for LLM agent training, 2025. URL https://arxiv.org/abs/2505.10978.

Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pp. 10835–10866. PMLR, 2023.

Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=kxnoqaisCT.
Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, et al. UI-Venus technical report: Building high-performance UI agents with RFT, 2025. URL https://arxiv.org/abs/2508.10833.

Yifei He, Pranit Chawla, Yaser Souri, Subhojit Som, and Xia Song. Scalable data synthesis for computer use agents with step-level filtering. arXiv preprint arXiv:2512.10962, 2025.

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, and Jie Tang. CogAgent: A visual language model for GUI agents, 2023.

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. CogAgent: A visual language model for GUI agents, 2024. URL https://arxiv.org/abs/2312.08914.

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025.

Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. ScreenSpot-Pro: GUI grounding for professional high-resolution computer use. In Proceedings of the 33rd ACM International Conference on Multimedia, pp. 8778–8786, 2025.

Wei Li, William E Bishop, Alice Li, Christopher Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. On the effects of data scale on UI control agents. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=yUEBXN3cvX.

Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. Mapping natural language instructions to mobile UI action sequences. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020a. URL https://aclanthology.org/2020.acl-main.729/.

Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, and Zhiwei Guan. Widget captioning: Generating natural language description for mobile user interface elements. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020b. URL https://aclanthology.org/2020.emnlp-main.443/.

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. ShowUI: One vision-language-action model for GUI visual agent, 2024. URL https://arxiv.org/abs/2411.17465.

Jijia Liu, Feng Gao, Bingwen Wei, Xinlei Chen, Qingmin Liao, Yi Wu, Chao Yu, and Yu Wang. What can RL bring to VLA generalization? An empirical study, 2026. URL https://arxiv.org/abs/2505.19789.

Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, et al. AutoGLM: Autonomous foundation agents for GUIs. arXiv preprint arXiv:2411.00820, 2024.

Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Song XiXuan, Yifan Xu, Shudan Zhang, Hanyu Lai, Jiadai Sun, Xinyue Yang, et al. VisualAgentBench: Towards large multimodal models as visual foundation agents. In The Thirteenth International Conference on Learning Representations, 2025a. URL https://openreview.net/forum?id=2snKOc7TVp.
Yuhang Liu, Zeyu Liu, Shuanghe Zhu, Pengxiang Li, Congkai Xie, Jiasheng Wang, Xueyu Hu, Xiaotian Han, Jianbo Yuan, Xinyao Wang, et al. InfiGUI-G1: Advancing GUI grounding with adaptive exploration policy optimization. arXiv preprint arXiv:2508.05731, 2025b.

Zhaoyang Liu, Jingjing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, et al. ScaleCUA: Scaling open-source computer use agents with cross-platform data. arXiv preprint arXiv:2509.15221, 2025c. URL https://github.com/OpenGVLab/ScaleCUA. Preprint.

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025d.

Quanfeng Lu, Wenqi Shao, Zitao Liu, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Yu Qiao, and Ping Luo. GUI Odyssey: A comprehensive dataset for cross-app GUI navigation on mobile devices. arXiv preprint arXiv:2406.08451, 2024.

Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Guanjing Xiong, and Hongsheng Li. UI-R1: Enhancing action prediction of GUI agents by reinforcement learning. arXiv preprint arXiv:2503.21620, 2025.

Run Luo, Lu Wang, Wanwei He, and Xiaobo Xia. GUI-R1: A generalist R1-style vision-language action model for GUI agents. arXiv preprint arXiv:2504.10458, 2025.
Jian Mu, Chaoyun Zhang, Chiming Ni, Lu Wang, Bo Qiao, Kartik Mathur, Qianhui Wu, Yuhang Xie, Xiaojun Ma, Mengyu Zhou, Si Qin, Liqun Li, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang. GUI-360°: A comprehensive dataset and benchmark for computer-using agents, 2025. URL https://arxiv.org/abs/2511.04307.

Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A. Rodriguez, Montek Kalsi, Nicolas Chapados, M. Tamer Özsu, Aishwarya Agrawal, David Vazquez, Christopher Pal, Perouz Taslakian, Spandana Gella, and Sai Rajeswar. UI-Vision: A desktop-centric GUI benchmark for visual perception and interaction. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=5Rtj4mYH1C.

OpenAI. GPT-4o system card, 2024. URL https://arxiv.org/abs/2410.21276.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent Q: Advanced reasoning and learning for autonomous AI agents, 2024. URL https://arxiv.org/abs/2408.07199.
Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, Tianjie Zhang, Wei Xu, Jie Tang, and Yuxiao Dong. WebRL: Training LLM web agents via self-evolving online curriculum reinforcement learning, 2025. URL https://arxiv.org/abs/2411.02337.

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. UI-TARS: Pioneering automated GUI interaction with native agents, 2025a. URL https://arxiv.org/abs/2501.12326.

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. UI-TARS: Pioneering automated GUI interaction with native agents. arXiv preprint arXiv:2501.12326, 2025b.

Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the Wild: A large-scale dataset for Android device control. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, 2023.

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William E Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Kenji Toyama, Robert James Berry, Divya Tyamagundlu, Timothy P Lillicrap, and Oriana Riva. AndroidWorld: A dynamic benchmarking environment for autonomous agents. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=il5yUQsrjC.

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, et al. OS-Genesis: Automating GUI agent trajectory construction via reverse task synthesis. arXiv preprint arXiv:2412.19723, 2024.

Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, and Yueting Zhuang. GUI-G²: Gaussian reward modeling for GUI grounding, 2025. URL https://arxiv.org/abs/2507.15846.

Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. UI-TARS-2 technical report: Advancing GUI agent with multi-turn reinforcement learning, 2025a. URL https://arxiv.org/abs/2509.02544.

Jiaqi Wang, Kevin Qinghong Lin, James Cheng, and Mike Zheng Shou. Think or not? Selective reasoning via reinforcement learning for vision-language models. arXiv preprint arXiv:2505.16854, 2025b.

Kangrui Wang, Pingyue Zhang, Zihan Wang, Yaning Gao, Linjie Li, Qineng Wang, Hanyang Chen, Chi Wan, Yiping Lu, Zhengyuan Yang, Lijuan Wang, Ranjay Krishna, Jiajun Wu, Li Fei-Fei, Yejin Choi, and Manling Li. VAGEN: Reinforcing world model reasoning for multi-turn VLM agents, 2025c. URL https://arxiv.org/abs/2510.16907.

Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, et al. OpenCUA: Open foundations for computer-use agents. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025d. URL https://openreview.net/forum?id=6iRZvJiC9Q.

Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966, 2023.

Jason Wu, Siyan Wang, Siman Shen, Yi-Hao Peng, Jeffrey Nichols, and Jeffrey Bigham.
WebUI: A dataset for enhancing visual UI understanding with web semantics. ACM Conference on Human Factors in Computing Systems (CHI), 2023.

Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, et al. GUI-Actor: Coordinate-free visual grounding for GUI agents. arXiv preprint arXiv:2506.03143, 2025a.

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao. OS-Atlas: A foundation action model for generalist GUI agents, 2024. URL https://arxiv.org/abs/2410.23218.

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao. OS-ATLAS: Foundation action model for generalist GUI agents. In The Thirteenth International Conference on Learning Representations, 2025b. URL https://openreview.net/forum?id=n9PDaFNi8t.

Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, and Caiming Xiong. Scaling computer-use grounding via user interface decomposition and synthesis, 2025. URL https://arxiv.org/abs/2505.13227.

Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, and Tao Yu. AgentTrek: Agent trajectory synthesis via guiding replay with web tutorials. In The Thirteenth International Conference on Learning Representations, 2025a. URL https://openreview.net/forum?id=EEgYUccwsV.

Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, and Tao Yu. AgentTrek: Agent trajectory synthesis via guiding replay with web tutorials, 2025b. URL https://arxiv.org/abs/2412.09605.
Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, and Caiming Xiong. Aguvis: Unified pure vision agents for autonomous GUI interaction. In Forty-second International Conference on Machine Learning, 2025c. URL https://openreview.net/forum?id=PlihOwfx4r.

Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Song, Huan Sun, and Yu Su. An illusion of progress? Assessing the current state of web agents. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=6jZi4HSs6o.

Qi Yang, Weichen Bi, Haiyang Shen, Yaoqi Guo, and Yun Ma. PixelWeb: The first web GUI dataset with pixel-wise labels, 2025a. URL https://arxiv.org/abs/2504.16419.

Rui Yang, Ruomeng Ding, Yong Lin, Huan Zhang, and Tong Zhang. Regularizing hidden states enables learning generalizable reward model for LLMs. Advances in Neural Information Processing Systems, 37:62279–62309, 2024a.

Rui Yang, Xiaoman Pan, Feng Luo, Shuang Qiu, Han Zhong, Dong Yu, and Jianshu Chen. Rewards-in-Context: Multi-objective alignment of foundation models with dynamic preference adjustment. arXiv preprint arXiv:2402.10207, 2024b.

Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, Ran Xu, Liyuan Pan, Silvio Savarese, Caiming Xiong, and Junnan Li. GTA1: GUI test-time scaling agent, 2025b. URL https://arxiv.org/abs/2507.05791.

Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, and Junnan Li. Aria-UI: Visual grounding for GUI instructions. arXiv preprint arXiv:2412.16256, 2024c.
Zhen Yang, Zi-Yi Dou, Di Feng, Forrest Huang, Anh Nguyen, Keen You, Omar Attia, Yuhao Yang, Michael Feng, Haotian Zhang, Ram Ramrakhya, Chao Jia, Jeffrey Nichols, Alexander Toshev, Yinfei Yang, and Zhe Gan. Ferret-UI Lite: Lessons from building small on-device GUI agents, 2025c. URL https://arxiv.org/abs/2509.26539.

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.

Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, and Sergey Levine. Fine-tuning large vision-language models as decision-making agents via reinforcement learning, 2024. URL https://arxiv.org/abs/2405.10292.

Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen, Liang-Yan Gui, Yu-Xiong Wang, Huan Zhang, Heng Ji, and Daniel Kang. Visual backdoor attacks on MLLM embodied decision making via contrastive trigger learning. arXiv preprint arXiv:2510.27623, 2025.

Bofei Zhang, Zirui Shang, Zhi Gao, Wang Zhang, Rui Xie, Xiaojian Ma, Tao Yuan, Xinxiao Wu, Song-Chun Zhu, and Qing Li. TongUI: Internet-scale trajectories from multimodal web tutorials for generalized GUI agents. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025a.

Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. Android in the Zoo: Chain-of-action-thought for GUI agents. In Findings of the Association for Computational Linguistics: EMNLP 2024, 2024. URL https://aclanthology.org/2024.findings-emnlp.702/.

Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, et al. Agent learning via early experience. arXiv preprint arXiv:2510.08558, 2025b.
Miaosen Zhang, Ziqiang Xu, Jialiang Zhu, Qi Dai, Kai Qiu, Yifan Yang, Chong Luo, Tianyi Chen, Justin Wagle, Tim Franklin, et al. Phi-Ground tech report: Advancing perception in GUI grounding. arXiv preprint arXiv:2507.23779, 2025c.

Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. GPT-4V(ision) is a generalist web agent, if grounded. 2024. URL https://openreview.net/forum?id=piecKJ2DlB.

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025a.

Longtao Zheng, Zhiyuan Huang, Zhenghai Xue, Xinrun Wang, Bo An, and Shuicheng Yan. AgentStudio: A toolkit for building general virtual agents. In The Thirteenth International Conference on Learning Representations, 2025b. URL https://openreview.net/forum?id=axUf8BOjnH.

Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, and Steven Hoi. MAI-UI technical report: Real-world centric foundation GUI agents, 2025a. URL https://arxiv.org/abs/2512.22047.

Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, and Jun Xu. GUI-G1: Understanding R1-Zero-like training for visual grounding in GUI agents. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025b. URL https://openreview.net/forum?id=1XLjrmKZ4p.

Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, and Yu Meng. The surprising effectiveness of negative reinforcement in LLM reasoning. arXiv preprint arXiv:2506.01347, 2025.

A Additional Related Works

Reinforcement Learning from Verifiable Rewards (RLVR). To optimize LLMs, learning a reward model as the training signal is a common practice in reinforcement learning from human feedback (RLHF) (Ouyang et al., 2022; Wang et al., 2023; Yang et al., 2024b).
However, reward models are known to suffer from reward hacking and misalignment issues (Gao et al., 2023; Yang et al., 2024a). To address these limitations, recent work has shifted toward verifiable rewards, where supervision is derived from ground-truth verifiers, such as exact mathematical answer matching or code execution, rather than from a learned and potentially ambiguous reward model. Shao et al. (2024) introduce GRPO, which enables effective policy optimization from such verifiable signals and demonstrates the emergence of complex reasoning behaviors. Building on this framework, DAPO (Yu et al., 2025) and Dr. GRPO (Liu et al., 2025d) propose simple yet effective techniques, such as more aggressive clipping and dynamic sampling, to stabilize training and mitigate learning bias in GRPO. More recently, GSPO (Zheng et al., 2025a) leverages sequence-level importance sampling to further improve training stability, particularly for mixture-of-experts models. Despite their success, we find that directly applying RLVR recipes to step-wise agent training often leads to suboptimal policies, due to distribution shift and ambiguous intermediate rewards.

Post-training for VLM-based Agents. VLMs have demonstrated strong capabilities in visual perception and multimodal reasoning (OpenAI et al., 2024; Bai et al., 2025b,a; Zheng et al., 2024). However, deploying them as agents in visually grounded environments requires moving beyond static understanding toward robust, long-horizon decision-making. To bridge this gap, recent research adopts a two-stage SFT-then-RL paradigm (Chen et al., 2025a; Zhai et al., 2024; Zhan et al., 2025).
In the first stage, SFT equips VLMs with essential agentic skills, including visual grounding, structured reasoning, and action prediction, through curated datasets (Hong et al., 2024; Cheng et al., 2024a; Wu et al., 2024; Lin et al., 2024; Qin et al., 2025a; Xu et al., 2025b). Nevertheless, SFT is inherently limited by the coverage and diversity of demonstrations, and is therefore prone to compounding errors when encountering out-of-distribution states (Chen et al., 2025a; Liu et al., 2026; Deng et al., 2025). RL complements SFT by enabling agents to interact directly with environments. Through exploration, agents can learn from both successes and failures, gradually developing capabilities such as error recovery, self-correction, and long-horizon planning (Bai et al., 2024a; Qi et al., 2025; Putta et al., 2024; Feng et al., 2025; Wang et al., 2025c). Under this two-stage paradigm, SFT first establishes a stable foundation of core skills, after which RL enhances long-horizon decision-making via environment interaction and policy optimization.

B Implementation Details

Action Space. We model GUI interaction with a unified action space where each step outputs a structured tuple (action_type, action_target, value, point_2d). Here, action_type specifies the operation (e.g., Click, Write, Scroll), action_target describes the target UI element when applicable, and value provides additional arguments such as input text, scroll/swipe direction, key name, waiting time, or app name. For actions that require spatial grounding (e.g., Click, LongPress, and optionally Swipe or Write), point_2d records the screen coordinate [x, y]; otherwise it is set to None. This unified schema supports both web and mobile environments, including device-level controls (e.g., NavigateBack, NavigateHome, OpenApp) and a terminal action Terminate. See Table 12 for the full specification.
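To make the schema concrete, here is a minimal sketch of the action tuple as a Python dataclass. The class and its helper method are our own illustration of the schema described above, not code released with GUI-Libra.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class GUIAction:
    """Illustrative container for the unified (action_type, action_target,
    value, point_2d) tuple described in Appendix B."""
    action_type: str                # e.g. "Click", "Write", "Scroll"
    action_target: Optional[str]    # UI element description, or None
    value: Optional[str]            # input text, direction, key name, app name, ...
    point_2d: Optional[List[int]]   # [x, y] for grounded actions, else None

    def needs_grounding(self) -> bool:
        # Click and LongPress always require coordinates;
        # Swipe and Write require them only optionally.
        return self.action_type in {"Click", "LongPress"}

click = GUIAction("Click", "Search button", None, [512, 88])
assert click.needs_grounding()

scroll = GUIAction("Scroll", None, "down", None)
assert not scroll.needs_grounding()
```

A grounded action carries a concrete [x, y] point, while device-level actions such as NavigateHome leave point_2d as None.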
Following the base model's coordinate system, we use absolute pixel coordinates for Qwen2.5-VL-based models, and normalized coordinates in [0, 1000] for Qwen3-VL-based models.

SFT. We summarize our shared SFT and action-aware SFT (ASFT) implementation parameters in Table 13. We apply full-parameter tuning on Qwen2.5-VL and Qwen3-VL base models from 3B and 4B to 7B and 8B. We use a learning rate of 1 × 10⁻⁵ and an effective batch size of 256. We train SFT and ASFT models for either two epochs on GUI-Libra-81K or one epoch on the mix of reasoning and direct-action data (which doubles the data size). For action-aware SFT, we by default use α_a = 2 and α_g = 4, except for GUI-Libra-4B, where α_a = α_g = 1. Training uses 8 B200 GPUs for approximately 4 hours for Qwen3-VL-4B and 5.5 hours for Qwen3-VL-8B.

RL. We adopt GRPO (Shao et al., 2024) as our RL algorithm, implemented with the verl framework (https://github.com/verl-project/verl) and EasyR1 (https://github.com/hiyouga/EasyR1). GUI-Libra is initialized from the SFT/ASFT checkpoints and further optimized via online rollouts.

Table 12: Unified action space. Each action is a tuple (action_type, action_target, value, point_2d).
- Answer: target None; value answer text; point_2d [-100,-100]. Return the final answer to the user's question.
- Click: target element description; value None; point_2d [x,y]. Tap/click a specific UI element and provide its coordinates.
- Select: target element description; value option value; point_2d [-100,-100]. Select an item in a list or dropdown menu.
- LongPress: target element description; value None; point_2d [x,y]. Press-and-hold on a UI element (mobile only) and provide its coordinates.
- Write: target element description or None; value input text; point_2d [x,y] or [-100,-100]. Enter text into a specific input field; if point_2d is [-100,-100], type at the current focus.
- KeyboardPress: target None; value key name (e.g., enter); point_2d [-100,-100]. Press a specific key on the keyboard.
- Scroll: target None; value direction (up/down/left/right); point_2d [-100,-100]. Scroll a view/container in the specified direction.
- Swipe: target element description or None; value direction (up/down/left/right); point_2d [x,y] or [-100,-100]. Perform a swipe gesture on a touchscreen in the given direction; provide coordinates if applicable.
- Wait: target None; value seconds; point_2d [-100,-100]. Pause execution for a specified duration to allow UI updates.
- NavigateHome: target None; value None; point_2d [-100,-100]. Navigate to the device's home screen.
- NavigateBack: target None; value None; point_2d [-100,-100]. Press the system "Back" button.
- OpenApp: target None; value app name; point_2d [-100,-100]. Launch an app by its name (mobile only).
- Terminate: target None; value end-task message; point_2d [-100,-100]. Signal the end of the current task with a final message.

At each iteration, the model samples trajectories, computes step-wise rewards, and updates the policy using the GRPO objective. We train for 300 RL iterations with a learning rate of 1 × 10⁻⁶, global batch size 128, and rollout group size n = 8. We set the KL regularization coefficient to 0.001 by default and increase it to 0.005 for GUI-Libra-7B to improve training stability. For SNGS, we use model-specific (λ₀, κ): (0.9, 0.5) for 3B, (1.4, −0.5) for 7B, (0.5, 1.5) for 4B, and (0.5, 2.0) for 8B. Training is conducted on 8 NVIDIA B200 GPUs and takes approximately 16 hours for Qwen3-VL-4B and 20 hours for Qwen3-VL-8B.

Evaluation. For inference, we use vLLM as the serving backend. We set the temperature to 0.0 and top-p to 1.0, and allow up to 1024 completion tokens by default. The system prompt follows Appendix F. The available action list within the system prompt can be adjusted depending on the deployment environment.
For example, if a mobile environment does not support the OpenApp action, it can be removed from the action list; in this case, the model typically resorts to alternative strategies such as scrolling to access the app drawer. For models trained with mixed direct-action data, we optionally use an explicit instruction prompt to elicit direct action generation without intermediate reasoning. Specifically, we can append the following format instruction after the user instruction:

The response should be structured in the following format. Make sure the output between and is a valid JSON object. Regarding the key "point_2d", please provide the coordinates on the screen where the action is to be performed; if not applicable, use [-100, -100]:

{
  "action_description": "the description of the action to perform, summarized in one sentence",
  "action_type": "the type of action to perform. Please follow the system prompt for available actions.",
  "value": "the input text or direction ('up', 'down', 'left', 'right') for the 'scroll' action or the app name for the 'openapp' action; otherwise, use 'None'",
  "point_2d": [x, y]
}

Table 13: SFT configuration used in our experiments.
- Backbone: Qwen2.5-VL / Qwen3-VL
- Training Strategy: Full-parameter fine-tuning
- Epochs: 1
- Learning Rate: 1 × 10⁻⁵
- Scheduler: Cosine
- Warmup Ratio: 0.01
- Weight Decay: 0
- Per-device Batch Size: 4
- Gradient Accumulation Steps: 8
- Effective Batch Size: 256 (8 GPUs)
- Gradient Checkpointing: Enabled

C Benchmark Details

In our experiments, we adopt a diverse set of benchmarks for evaluation, mainly focusing on GUI navigation benchmarks that measure step-wise success or task-level completion. We also use grounding benchmarks to evaluate the correctness of grounding after long CoT generation. Details about these benchmarks are as follows.
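The two coordinate conventions noted in Appendix B (absolute pixels for Qwen2.5-VL-based models, coordinates normalized to [0, 1000] for Qwen3-VL-based models) can be converted with a small helper. This sketch and its function names are our own illustration, not part of the released code.

```python
# Convert between absolute pixel coordinates and [0, 1000]-normalized
# coordinates, given the screenshot resolution.

def to_normalized(x_px: float, y_px: float, width: int, height: int) -> list:
    """Absolute pixel -> [0, 1000]-normalized coordinates."""
    return [round(x_px / width * 1000), round(y_px / height * 1000)]

def to_pixels(x_norm: float, y_norm: float, width: int, height: int) -> list:
    """[0, 1000]-normalized -> absolute pixel coordinates."""
    return [round(x_norm / 1000 * width), round(y_norm / 1000 * height)]

# On a 1920x1080 screenshot, the screen center maps to [500, 500].
assert to_normalized(960, 540, 1920, 1080) == [500, 500]
assert to_pixels(500, 500, 1920, 1080) == [960, 540]
```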
C.1 Grounding Benchmarks

We adopt ScreenSpot-V2 (Cheng et al., 2024b; Wu et al., 2025b) and ScreenSpot-Pro (Li et al., 2025) to evaluate grounding accuracy. Each task provides a short instruction specifying the target element or intent, together with a screenshot from a digital interface (mobile, desktop, or web). The ground-truth target is given as a bounding box, and we measure success by whether the model's predicted click coordinate falls inside the box. ScreenSpot-V2 corrects labeling errors in the original ScreenSpot benchmark and contains 1,269 tasks, with most screenshots below 2560 × 1440 resolution. In contrast, ScreenSpot-Pro includes 1,555 tasks and features substantially higher-resolution screenshots (up to 5120 × 2880), resulting in denser visual content and more fine-grained targets. Overall, ScreenSpot-V2 reflects common UI settings, while ScreenSpot-Pro stresses precise grounding in high-resolution, information-rich interfaces.

C.2 Offline GUI Navigation Benchmarks

Multimodal-Mind2Web-v2. We build our benchmark on Multimodal-Mind2Web (MM-Mind2Web) (Zheng et al., 2024), the multimodal extension of Mind2Web (Deng et al., 2023), to evaluate offline web navigation on realistic user tasks. MM-Mind2Web aligns each step in a human demonstration with a webpage screenshot (and the corresponding HTML/DOM state), forming a golden multi-step trajectory conditioned on a high-level natural-language instruction (Deng et al., 2023; Zheng et al., 2024). The test split spans 100+ websites; all webpages along the golden trajectories are cached to support fully offline evaluation, and tasks are crowdsourced to reflect real user intents. As shown in Figure 6, MM-Mind2Web represents action history as symbolic records that are neither natural-language descriptions nor aligned with real user interaction.
To address this, we use Qwen3-VL-32B-Instruct to rewrite each action into a natural-language description and use these descriptions as the history context. The resulting dataset, Multimodal-Mind2Web-v2 (MM-Mind2Web-v2), contains three subsets, Cross-Task, Cross-Website, and Cross-Domain, with 1,328, 1,019, and 1,002 samples, respectively. We report step success rate as the primary metric, which requires both correct target grounding (element accuracy) and correct operation execution. Operation correctness is measured by an exact-match F1 score (F1 = 1) over the serialized action string "ActionType Value", where Value can be the typed text for Write actions or the app/website identifier for OpenApp actions.

Table 14: RL configuration for our experiments.
- RL Training: Framework verl; Distributed Training Backend FSDP; Inference Engine vLLM; Backbone Qwen2.5-VL, Qwen3-VL
- Rollout: Rollout Samples per Prompt 8; Rollout Batch Size 256; Sampling Strategy Top-p 0.98, Temperature 1.0; Max Prompt Length 8092; Max Response Length 1500
- Optimization: Training Iterations 300; Learning Rate 1 × 10⁻⁶; Optimizer AdamW (bf16); Global Batch Size 128; Micro Batch (Update) 4; Micro Batch (Experience) 8; Clip Ratio (ε) 0.2; KL Coefficient (β) 0.001 by default, 0.005 for GUI-Libra-7B
- Algorithm: Advantage Estimator GRPO, GRPO w/ SNGS; Reward Function r̃(s, a) = w_fmt · r_fmt + (1 − w_fmt) · r_acc, with w_fmt = 0.1

AndroidControl-v2. AndroidControl-v2 is based on AndroidControl (Li et al., 2024), an offline Android GUI navigation benchmark that pairs step-wise instructions with mobile screenshots and demonstrated actions. However, AndroidControl contains non-trivial annotation noise (about 20% errors in action types and/or coordinates), as illustrated in Figure 6.
To improve evaluation reliability, we use Qwen3-VL-32B-Instruct to filter mismatched samples by checking the consistency between each demonstrated action and its oracle low-level step instruction. We start from a sampled set of 500 examples from UGround (Gou et al., 2025) and obtain a cleaned subset of 398 examples after filtering. We evaluate under both high-level and low-level instructions and report step accuracy, where a step is counted as successful only if the predicted action type, textual value (when applicable), and target coordinates are all correct. Note that prior work may use inconsistent evaluation protocols for this benchmark. For example, OS-Atlas (Wu et al., 2025b) and GUI-R1 (Luo et al., 2025) treat grounding as a distance-to-target threshold, which can be misleading in practice because UI elements vary greatly in size (so a fixed threshold is not comparable across screens). Instead, we follow UGround's strategy (Gou et al., 2025): we use the accessibility tree to map the predicted coordinate to its nearest UI element, and then match that element against the ground-truth target.

C.3 Online GUI Navigation Benchmarks

AndroidWorld. We evaluate online mobile agent performance on AndroidWorld (Rawles et al., 2025), which runs interactive tasks in Android emulators and scores agents by whether they reach the correct final device states. AndroidWorld contains 116 tasks across 20 real-world apps and covers diverse multi-step workflows (e.g., search, form filling, and settings changes) under realistic UI dynamics. With the official Docker environment, we found that Task #82 (SimpleSmsReplyMostRecent) cannot be initialized, so we report results on the remaining 115 tasks. Evaluation uses the benchmark's rule-based completion checker with a maximum horizon of 20 steps.
To assess self-verification, we count a task as successful only when (i) the agent explicitly outputs a Terminate action and (ii) the environment state satisfies the completion rules. Our agent follows the SeeAct-V framework following UGround (Gou et al., 2025), but we remove the step-wise reflection and summary module. This design choice isolates the native capability of the underlying model, rather than relying on a hand-crafted control structure. For completeness, we also report baseline results with the summary modules, and show that our native model (without such modules) can surpass agents that depend on these additional components, highlighting the potential of our recipe to reduce human-designed scaffolding.

WebArena-Lite-v2. We evaluate online web agent performance on WebArena-Lite-v2 (Liu et al., 2025c), a locally deployed website environment upgraded from WebArena-Lite (Liu et al., 2025a), consisting of 154 tasks. To set up the website environment, we follow the instructions by Maxime Gasse (https://github.com/gasse/webarena-setup/tree/main/webarena) for a more stable Map setup, as we found the container for the Map website is brittle when following the official WebArena-Lite-v2 setup (https://github.com/OpenGVLab/ScaleCUA/tree/main/evaluation/WebArenaLiteV2). Building upon the official WebArena-Lite-v2 implementation, we further improve the robustness of action parsing and execution by: (i) parsing the "response" action as "answer" rather than treating it as illegal; (ii) appending an automated "terminate" action after "action" and "response"; (iii) supporting multi-line answer strings (separated by \n) instead of only the first line; (iv) clearing blank content before typing predicted messages. Moreover, we replace gpt-4o-2024-11-20 with gpt-5 for more accurate LLM-based fuzzy evaluation, as we observed false positives when using GPT-4o. Given the high variance in results, we report the average across four runs for all experiments, with the temperature set to 0.0 and top-p to 1.0.
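The parsing fixes (i)–(iv) above can be sketched roughly as follows. This is our own reconstruction of the described behavior, not the released evaluation code, and the dict-based action format is an assumption.

```python
# Illustrative sketch of the action-parsing robustness fixes (i)-(iv).

def normalize_action(action: dict) -> list:
    """Return the list of low-level actions executed for one prediction."""
    actions = []
    # (i) Parse a "response" action as "answer" instead of treating it as illegal.
    if action["type"] == "response":
        action = {**action, "type": "answer"}
    # (iii) Keep the full, possibly multi-line answer string; only trim the
    # surrounding whitespace, never truncate at the first line.
    if action["type"] == "answer":
        action = {**action, "value": action["value"].strip()}
    # (iv) Clear any leftover field content before typing the predicted text.
    if action["type"] == "write":
        actions.append({"type": "clear_field"})
    actions.append(action)
    # (ii) Append an automated "terminate" after the final answer.
    if action["type"] == "answer":
        actions.append({"type": "terminate"})
    return actions

out = normalize_action({"type": "response", "value": "line1\nline2\n"})
assert out[0] == {"type": "answer", "value": "line1\nline2"}
assert out[-1] == {"type": "terminate"}
```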
Online-Mind2Web. We also include the Online-Mind2Web benchmark (Xue et al., 2025), spanning 136 live real-world websites and covering 300 web agent tasks. We set the maximum interaction steps to 30 for each task. We adopt the proposed WebJudge method backed by either o4-mini (https://developers.openai.com/api/docs/models/o4-mini) or WebJudge-7B (https://huggingface.co/osunlp/WebJudge-7B). The LLM-based judge first identifies key points of the task, then selects task-relevant key screenshots from each step. Finally, the judge is provided with the task description, the agent's textual actions, the task-completion key points, and the selected key screenshots to make a binary outcome judgment indicating whether all key points are satisfied.

D Additional Results

This section presents additional results to further clarify our approach. We include an auxiliary study on grounding as a single-step verifiable case in Appendix D.1 to contrast with multi-step navigation under partial verifiability, a controlled comparison of reasoning-augmentation models in Appendix D.2, as well as complementary offline metrics (grounding accuracy, action-type accuracy, and operation F1) in Appendix D.4.

D.1 Grounding as a Single-step Verifiable Setting

Grounding provides a near-ideal single-step verifiable setting: each example typically refers to a specific UI element, and we can directly verify correctness by checking whether the predicted coordinate falls inside the annotated bounding box. Since the agent produces only a single action, this setting matches the assumptions in Corollary 5.3.1. To study this regime, we train Qwen3-VL-4B and Qwen3-VL-8B with GRPO on a 40K downsampled grounding dataset from GTA1 (Yang et al., 2025b). Results are shown in Figure 12. Unlike navigation, grounding does not exhibit an offline-online evaluation gap; therefore, we assess predictability using two grounding benchmarks instead: ScreenSpot-v2 and ScreenSpot-Pro.
ScreenSpot-v2 is closer to our training distribution (similar image resolution), while ScreenSpot-Pro contains substantially higher-resolution screenshots, making it a useful test of distribution shift. Figures 12 (a)-(b) show that performance on both benchmarks improves steadily and then plateaus after roughly 200 RL steps, without the significant reward-hacking-style drops observed in multi-step navigation. Moreover, the two benchmarks are strongly correlated: Figure 12 (c) shows a tight relationship between ScreenSpot-v2 and ScreenSpot-Pro scores, and Figure 12 (d) reports very high Pearson and Spearman correlations. Interestingly, removing KL regularization yields even higher correlation (Pearson ≈ 0.98), consistent with our analysis that in single-step, verifiable tasks, RLVR-style training can remain stable and highly predictable even without KL regularization.

Figure 12: RL for grounding exhibits stable improvements and strong cross-benchmark predictability. (a) ScreenSpot-V2 and (b) ScreenSpot-Pro performance over RL training. (c) Correlation between the two benchmark scores across checkpoints. (d) Pearson and Spearman correlations with and without KL regularization.

Figure 13: Performance comparison of different models for reasoning generation on MM-Mind2Web (Cross-Website) with the Qwen2.5-VL-3B base model; element accuracy with UGround is 31.60% (GPT-4o), 33.17% (o4-mini), and 35.33% (GPT-4.1).
All models are used to augment the same 30K web samples from AGUVIS and are fine-tuned on the same 3B base model. This controlled evaluation reveals substantial performance differences among generator models.

D.2 Comparing Different Models for Reasoning Augmentation

In our reasoning augmentation pipeline, we use GPT-4.1 to generate reasoning traces. This choice is motivated by preliminary experiments showing that GPT-4.1 can be prompted to produce richer, more informative rationales for VLM training. With the same prompt, GPT-4o often generates shorter reasoning, while reasoning models such as o4-mini typically provide limited visible traces and hide their true reasoning, resulting in similarly short rationales. In contrast, GPT-4.1 more reliably produces detailed reasoning that provides more useful thoughts and better aligns with the actions. We further quantify this effect with a controlled comparison. Using the same 30K web samples from AGUVIS, we generate reasoning traces with each model under an identical prompt and fine-tune the same Qwen2.5-VL-3B-Instruct base model on the resulting augmented data. To isolate the impact of reasoning quality, we extract action targets from the model outputs and use the same UGround-7B-v1 (Gou et al., 2025) model for coordinate prediction across all settings. Figure 13 reports results on the original MM-Mind2Web Cross-Website subset: GPT-4.1 yields the best performance, outperforming GPT-4o by +3.7 and o4-mini by +2.1. These results indicate that the choice of reasoning generator is an important factor in effective reasoning augmentation.

D.3 Comparing with a Uniform Negative Gradient Scaling Strategy

In our method, we use an adaptive negative gradient scaling method to enable adaptive scaling behaviors.
Table 15 compares our adaptive negative gradient scaling strategy with a uniform variant that applies a constant scaling factor λ_g across all states. Uniform scaling yields weaker overall performance: both λ_g = 0.75 and λ_g = 0.9 perform worse on MM-Mind2Web-v2, AC-v2 (High/Low), and AndroidWorld. These results suggest that a single global scaling factor cannot capture the heterogeneous difficulty and ambiguity across states and can lead to suboptimal optimization, whereas adaptive scaling provides the flexibility needed to stabilize training and improve decision making.

Table 15: Ablation study of the negative scaling strategy on the 3B base model. Each row reports Pass@1/Pass@4 for MM-Mind2Web-v2, AC-v2 (High), and AC-v2 (Low), and Pass@1 for AndroidWorld.
- GUI-Libra (λ_g = 0.75): 42.2/48.5, 54.5/64.1, 86.7/90.2, 19.1
- GUI-Libra (λ_g = 0.9): 42.7/48.1, 52.3/61.1, 87.4/90.7, 20.0
- GUI-Libra: 42.7/50.2, 55.8/65.8, 89.5/91.5, 25.2

D.4 Additional Metrics on Offline Benchmarks

In our main offline evaluations, we use step accuracy as the primary metric. To better understand where the gains come from, we also report decomposed metrics that separate grounding from action prediction. Specifically, we report (i) grounding accuracy (whether the predicted coordinate lies inside the target element) and action-type accuracy on AndroidControl-v2, and (ii) grounding accuracy and operation F1 on MM-Mind2Web-v2. Operation F1 is computed following the protocol in Appendix C.2.

AndroidControl-v2. Tables 16 and 17 show that GUI-Libra remains strong under these more fine-grained metrics. Notably, GUI-Libra-8B achieves the best Pass@1 grounding accuracy on both high-level and low-level tasks, reaching 76.3 and 95.5 Pass@1 in the high-level and low-level settings, respectively. It also outperforms GPT-5 + UGround-7B-v1 and larger open-weight models (e.g., 32B and 72B).
The overall trend for action-type accuracy is similar, but the gaps to strong baselines are smaller, suggesting that our improvements are driven primarily by better grounding rather than action-type prediction. This is expected: action types can often be inferred reliably from language and prior knowledge, whereas accurate grounding, especially for high-level, goal-directed tasks, remains the more challenging part of GUI tasks.

MM-Mind2Web-v2. Tables 18 and 19 report grounding accuracy and operation F1, respectively. For grounding, GUI-Libra-8B achieves the highest Pass@1 across all three subsets, with an average Pass@1 of 57.8, surpassing strong baselines including Qwen3-VL-32B, Qwen2.5-VL-72B, and GPT-5 with UGround. For Pass@4, Qwen2.5-VL-72B is 1.4 points higher than GUI-Libra-8B, indicating that web-domain evaluation can still benefit from additional model capacity and/or web-specific training data. We attribute this gap mainly to domain imbalance in our training set: roughly 85% of our SFT data comes from mobile, with only 15% from web. We expect that scaling high-quality web data would further improve Pass@4, consistent with our strong Pass@1/Pass@4 results on AndroidControl-v2. For operation F1, we report Best@N (the maximum over N ∈ {1, 4} samples), since F1 is a continuous score in [0, 1]. While larger models achieve strong Best@4 results, GUI-Libra-7B attains the best Best@1 performance (85.1), and GUI-Libra-3B remains competitive with 32B-72B baselines. Overall, these additional metrics show that GUI-Libra improves not only step accuracy but also its underlying components, providing a more complete view of the sources of improvement.

E Proofs for Theoretical Analysis

This section provides detailed proofs for the theoretical results in Sec. 5.3.
Table 16: Grounding accuracy on AndroidControl-v2. Each row reports High-Level Pass@1/Pass@4 and Low-Level Pass@1/Pass@4.

Proprietary models with SeeAct-V framework:
- GPT-4o + UGround-v1-7B: High 58.7/67.3, Low 83.4/88.3
- GPT-4.1 + UGround-v1-7B: High 60.5/64.6, Low 83.0/86.1
- GPT-5-mini + UGround-v1-7B: High 61.9/66.8, Low 83.9/87.9
- GPT-5 + UGround-v1-7B: High 74.4/82.1, Low 93.7/94.6

Open-source native models:
- GUI-R1-3B: High 53.4/66.4, Low 81.2/87.0
- GUI-R1-7B: High 58.3/72.2, Low 87.9/87.9
- Aguvis-7B: High 67.3/78.0, Low 85.7/87.0
- UI-TARS-1.5-7B: High 57.9/71.3, Low 78.5/88.8
- GLM-4.1V-9B-Thinking: High 44.0/55.6, Low 68.2/75.8
- Qwen2.5-VL-32B: High 49.0/66.6, Low 78.4/85.4
- Qwen2.5-VL-72B: High 56.5/72.9, Low 82.9/90.2
- Qwen3-VL-32B: High 70.9/79.4, Low 91.9/93.3
- Qwen2.5-VL-3B (Baseline): High 42.2/46.6, Low 78.0/80.7
- GUI-Libra-3B (Ours): High 68.2/77.1, Low 89.2/91.5
- Qwen2.5-VL-7B (Baseline): High 60.5/69.5, Low 82.5/88.3
- GUI-Libra-7B (Ours): High 71.8/78.0, Low 91.5/93.7
- Qwen3-VL-4B (Baseline): High 65.9/76.7, Low 90.6/93.3
- GUI-Libra-4B (Ours): High 74.0/81.6, Low 91.0/95.1
- Qwen3-VL-8B (Baseline): High 67.7/79.8, Low 93.3/95.1
- GUI-Libra-8B (Ours): High 76.2/81.6, Low 95.5/96.4

Table 17: Action-type accuracy on AndroidControl-v2.
Each row reports High-Level Pass@1/Pass@4 and Low-Level Pass@1/Pass@4.

Proprietary models with SeeAct-V framework:
- GPT-4o + UGround-v1-7B: High 73.9/82.7, Low 89.2/93.7
- GPT-4.1 + UGround-v1-7B: High 72.9/78.9, Low 89.5/93.0
- GPT-5-mini + UGround-v1-7B: High 70.9/77.4, Low 89.7/93.5
- GPT-5 + UGround-v1-7B: High 74.1/78.9, Low 88.4/91.5

Open-source native models:
- GUI-R1-3B: High 63.1/76.4, Low 71.1/90.2
- GUI-R1-7B: High 61.8/76.1, Low 76.1/89.2
- Aguvis-7B: High 58.0/59.8, Low 60.6/62.1
- UI-TARS-1.5-7B: High 69.4/87.4, Low 70.9/91.0
- GLM-4.1V-9B-Thinking: High 65.8/74.1, Low 89.2/92.5
- Qwen2.5-VL-32B: High 67.3/78.1, Low 85.7/89.7
- Qwen2.5-VL-72B: High 72.9/86.9, Low 89.5/96.2
- Qwen3-VL-32B: High 73.6/81.2, Low 90.2/92.2
- Qwen2.5-VL-3B (Baseline): High 53.0/77.4, Low 80.2/90.0
- GUI-Libra-3B (Ours): High 70.4/78.1, Low 89.5/93.7
- Qwen2.5-VL-7B (Baseline): High 62.8/77.4, Low 78.6/91.7
- GUI-Libra-7B (Ours): High 72.9/78.6, Low 88.9/91.2
- Qwen3-VL-4B (Baseline): High 63.6/78.4, Low 86.4/88.7
- GUI-Libra-4B (Ours): High 72.6/76.1, Low 92.5/95.0
- Qwen3-VL-8B (Baseline): High 68.8/77.6, Low 82.4/86.9
- GUI-Libra-8B (Ours): High 74.1/78.9, Low 93.7/95.5

Table 18: Grounding accuracy (%) on Multimodal-Mind2Web-v2.
Each row reports Pass@1/Pass@4 for the Cross-Task, Cross-Website, and Cross-Domain subsets and their average.

Proprietary models with SeeAct-V framework:
- GPT-4o + UGround-v1-7B: 42.0/44.0, 40.6/42.9, 44.9/47.2, avg 42.5/44.7
- GPT-4.1 + UGround-v1-7B: 47.4/49.3, 43.5/45.5, 48.7/51.4, avg 46.5/48.7
- GPT-5-mini + UGround-v1-7B: 52.5/53.1, 50.7/51.6, 53.6/54.3, avg 52.3/53.0
- GPT-5 + UGround-v1-7B: 55.3/56.6, 53.1/53.4, 55.8/57.3, avg 54.7/55.8

Open-source native models:
- GUI-R1-3B: 29.1/44.6, 27.2/45.5, 29.1/44.9, avg 28.5/45.0
- GUI-R1-7B: 43.8/56.8, 40.6/54.9, 45.7/56.3, avg 43.4/56.0
- Aguvis-7B: 40.5/53.5, 34.9/47.4, 38.8/48.8, avg 38.1/49.9
- UI-TARS-1.5-7B: 46.2/59.0, 42.2/57.6, 46.4/60.2, avg 45.0/58.9
- GLM-4.1V-9B-Thinking: 36.8/42.8, 34.2/39.7, 36.9/44.1, avg 36.0/42.2
- Qwen2.5-VL-32B: 53.0/61.8, 50.9/61.6, 52.6/62.9, avg 52.2/62.1
- Qwen2.5-VL-72B: 55.7/65.0, 53.7/61.5, 56.3/64.4, avg 55.2/63.6
- Qwen3-VL-32B: 55.9/62.4, 52.7/61.7, 56.7/63.9, avg 55.1/62.7
- Qwen2.5-VL-3B (Baseline): 37.9/40.1, 35.6/38.6, 38.8/43.2, avg 37.4/40.6
- GUI-Libra-3B (Ours): 50.1/57.8, 50.7/58.6, 51.6/58.5, avg 50.8/58.3
- Qwen2.5-VL-7B (Baseline): 44.1/54.0, 45.5/54.2, 46.2/57.4, avg 45.3/55.2
- GUI-Libra-7B (Ours): 53.5/59.8, 54.2/60.4, 53.6/60.7, avg 53.7/60.3
- Qwen3-VL-4B (Baseline): 49.9/59.6, 47.9/58.2, 49.9/57.5, avg 49.2/58.4
- GUI-Libra-4B (Ours): 57.2/61.7, 56.2/61.4, 56.7/62.1, avg 56.7/61.7
- Qwen3-VL-8B (Baseline): 52.6/60.0, 49.6/59.3, 52.4/59.3, avg 51.5/59.5
- GUI-Libra-8B (Ours): 58.6/62.6, 56.5/61.2, 58.2/62.5, avg 57.8/62.1

E.0.1 When does offline step-wise matching predict online success?

Theorem E.1 (Offline-to-online bound under partial verifiability; restatement of Theorem 5.1). Assume Assumption 5.1 and that, for all t ∈ [H], d_{π,t}(s) > 0 implies d_μ(s) > 0 (i.e., supp(d_{π,t}) ⊆ supp(d_μ)). This condition ensures the occupancy ratio C(π) is well-defined. Then the online success probability satisfies

  J(π) ≥ 1 − H · C(π) · (1 − M_off(π) − η̄_π).
(13)

In particular, if C(π) is uniformly bounded over a policy class and η̄_π is small or stable across policies, then M_off(π) becomes predictive of J(π) through the affine lower bound.

Table 19: Operation F1 score on Multimodal-Mind2Web-v2. Each row reports Best@1/Best@4 for the Cross-Task, Cross-Website, and Cross-Domain subsets and their average.

Proprietary models with SeeAct-V framework:
- GPT-4o + UGround-v1-7B: 70.1/78.9, 70.9/79.6, 67.7/75.7, avg 69.6/78.1
- GPT-4.1 + UGround-v1-7B: 72.7/81.6, 71.1/81.0, 71.1/79.4, avg 71.6/80.7
- GPT-5-mini + UGround-v1-7B: 80.7/88.8, 77.3/86.8, 77.9/85.3, avg 78.6/87.0
- GPT-5 + UGround-v1-7B: 79.7/89.2, 79.6/88.9, 79.1/88.2, avg 79.5/88.8

Open-source native models:
- GUI-R1-3B: 79.5/87.9, 75.4/85.3, 79.6/89.4, avg 78.1/87.5
- GUI-R1-7B: 78.9/89.6, 75.2/88.5, 79.8/90.7, avg 78.0/89.6
- Aguvis-7B: 84.4/91.1, 80.7/87.9, 83.3/91.5, avg 82.8/90.2
- UI-TARS-1.5-7B: 81.2/85.3, 78.5/81.6, 81.1/85.0, avg 80.3/84.0
- GLM-4.1V-9B-Thinking: 73.6/82.0, 69.2/80.0, 73.7/81.6, avg 72.2/81.2
- Qwen2.5-VL-32B: 84.2/92.6, 81.9/92.0, 82.6/93.0, avg 82.9/92.5
- Qwen2.5-VL-72B: 85.2/93.5, 83.3/91.5, 84.3/93.0, avg 84.3/92.7
- Qwen3-VL-32B: 83.9/91.6, 81.3/89.9, 83.7/92.2, avg 83.0/91.2
- Qwen2.5-VL-3B (Baseline): 59.3/86.5, 51.6/83.0, 64.7/88.3, avg 58.5/85.9
- GUI-Libra-3B (Ours): 84.6/88.0, 80.8/85.2, 86.4/89.2, avg 83.9/87.5
- Qwen2.5-VL-7B (Baseline): 68.1/90.9, 66.7/90.4, 73.3/91.6, avg 69.4/91.0
- GUI-Libra-7B (Ours): 85.2/88.1, 84.3/88.2, 85.9/89.4, avg 85.1/88.6
- Qwen3-VL-4B (Baseline): 81.6/88.9, 78.9/88.2, 79.5/89.9, avg 80.0/89.0
- GUI-Libra-4B (Ours): 85.0/88.9, 83.6/88.2, 82.8/88.1, avg 83.8/88.4
- Qwen3-VL-8B (Baseline): 83.2/91.2, 80.0/89.0, 82.7/90.2, avg 82.0/90.1
- GUI-Libra-8B (Ours): 85.5/88.1, 81.8/86.9, 84.4/87.0, avg 83.9/87.3

Proof. Let E_t denote the event that the action at step t is invalid: E_t ≜ {a_t ∉ A*(s_t)}. By Assumption 5.1, failure implies that at least one invalid step occurs, i.e., {failure} ⊆ ∪_{t=1}^H E_t. Therefore,

  1 − J(π) = P(failure) ≤ P(∪_{t=1}^H E_t) ≤ Σ_{t=1}^H P(E_t),

where the last inequality is the union bound. Fix any t ∈ [H]. Conditioning on s_t gives

  P(E_t) = E_{s∼d_{π,t}}[ P(a_t ∉ A*(s) | s_t = s) ] = E_{s∼d_{π,t}}[ 1 − π(A*(s) | s) ].

Define the nonnegative function f(s) ≜ 1 − π(A*(s) | s) ≥ 0. By the definition of C(π) and the support condition d_{π,t} ≪ d_μ,

  E_{s∼d_{π,t}}[f(s)] = E_{s∼d_μ}[ (d_{π,t}(s) / d_μ(s)) · f(s) ] ≤ C(π) · E_{s∼d_μ}[f(s)].

Next, since the offline dataset verifies only the demonstrated action ã(s) ∈ A*(s), we can decompose the true step-wise validity probability as

  π(A*(s) | s) = π(ã(s) | s) + π(A*(s) \ {ã(s)} | s) = π(ã(s) | s) + η_π(s).

Taking expectation over s ∼ d_μ yields E_{s∼d_μ}[π(A*(s) | s)] = M_off(π) + η̄_π, and hence

  E_{s∼d_μ}[f(s)] = 1 − E_{s∼d_μ}[π(A*(s) | s)] = 1 − M_off(π) − η̄_π.

Combining the above bounds, for each t we have P(E_t) ≤ C(π) · (1 − M_off(π) − η̄_π). Summing over t = 1, ..., H and using 1 − J(π) ≤ Σ_{t=1}^H P(E_t) gives

  1 − J(π) ≤ H · C(π) · (1 − M_off(π) − η̄_π),

which is equivalent to Eq. (13). This concludes the proof.

Discussion. Theorem E.1 (Theorem 5.1) clarifies why an offline one-step matching score can be an unreliable proxy for the online success probability J(π) in multi-step GUI navigation under partial verifiability. Offline evaluation is computed on a fixed dataset distribution d_μ and credits only the single demonstrated action ã(s). In contrast, online success depends on the policy-induced state distributions {d_{π,t}}_{t=1}^H and on choosing any valid action in A*(s) over a long horizon.
The theorem makes this mismatch explicit through two quantities: the occupancy mismatch $C(\pi)$ and the unobserved off-demo validity mass $\bar{\eta}_\pi$. As a result, offline-to-online predictability can break down in several ways.

(i) Distribution shift and error accumulation (large $C(\pi)$). Even if $\pi$ matches the demonstrator well on states drawn from $d_\mu$, small errors can compound over time and shift the online trajectory distribution away from the offline support. When $C(\pi)$ is large, $M_{\mathrm{off}}(\pi)$ provides limited information about the states that dominate online performance. In this regime, $M_{\mathrm{off}}(\pi)$ may improve while $J(\pi)$ stagnates (or decreases) because failures are driven by states that are rarely or never seen in the offline data.

(ii) Non-identifiability under partial verifiability (unstable $\bar{\eta}_\pi$). Offline matching measures only $\pi(\tilde{a}(s) \mid s)$, whereas the true one-step validity is
$$\pi(\mathcal{A}^*(s) \mid s) = \pi(\tilde{a}(s) \mid s) + \eta_\pi(s), \qquad \eta_\pi(s) = \pi\big(\mathcal{A}^*(s) \setminus \{\tilde{a}(s)\} \mid s\big).$$
Therefore, $M_{\mathrm{off}}(\pi)$ does not determine true step validity unless the off-demo validity mass $\eta_\pi(s)$ is negligible or approximately invariant across the policies being compared. Importantly, a larger $\bar{\eta}_\pi$ does not imply better predictability: the issue is that $\eta_\pi$ is unobserved under offline verification and can vary substantially across policies, so changes in $M_{\mathrm{off}}(\pi)$ may reflect reallocation of probability mass rather than genuine improvements in correctness.

Example E.1: Larger $\eta_\pi$ does not mean better predictability

Consider a single decision state $s$ (so $C(\pi) = 1$) with three actions: a demonstrated valid action $a_\star = \tilde{a}(s)$, an alternative valid action $a'$, and an invalid action $a_{\mathrm{bad}}$. Thus $\mathcal{A}^*(s) = \{a_\star, a'\}$ and choosing any action outside $\mathcal{A}^*(s)$ causes failure.
Offline matching credits only $a_\star$, so
$$M_{\mathrm{off}}(\pi) = \pi(a_\star \mid s), \qquad \eta_\pi(s) = \pi(a' \mid s), \qquad J(\pi) = \pi(a_\star \mid s) + \pi(a' \mid s) = 1 - \pi(a_{\mathrm{bad}} \mid s).$$
Compare two policies:
$$\pi_1(a_\star \mid s) = 0.2,\;\; \pi_1(a' \mid s) = 0.7,\;\; \pi_1(a_{\mathrm{bad}} \mid s) = 0.1 \;\;\Rightarrow\;\; M_{\mathrm{off}}(\pi_1) = 0.2,\;\; \eta_{\pi_1}(s) = 0.7,\;\; J(\pi_1) = 0.9,$$
$$\pi_2(a_\star \mid s) = 0.4,\;\; \pi_2(a' \mid s) = 0.1,\;\; \pi_2(a_{\mathrm{bad}} \mid s) = 0.5 \;\;\Rightarrow\;\; M_{\mathrm{off}}(\pi_2) = 0.4,\;\; \eta_{\pi_2}(s) = 0.1,\;\; J(\pi_2) = 0.5.$$
Here the offline score increases ($0.2 \to 0.4$), while the off-demo validity mass changes dramatically ($0.7 \to 0.1$) and the true success probability drops sharply ($0.9 \to 0.5$). This happens because $M_{\mathrm{off}}$ only tracks probability on the single demonstrated action $a_\star$; it cannot distinguish whether probability mass is reallocated from other valid alternatives $a'$ (uncredited) to invalid actions $a_{\mathrm{bad}}$.

(iii) Offline overfitting can reduce online robustness. Example E.1 also illustrates another failure mode: maximizing an offline demo-matching score can encourage demo-specific behavior. Because $M_{\mathrm{off}}(\pi)$ rewards only matching the single demonstrated action $\tilde{a}(s)$, a policy may increase $M_{\mathrm{off}}(\pi)$ by concentrating probability mass on $\tilde{a}(s)$ while reducing exploration of other valid alternatives. This overfitting reduces behavioral diversity and weakens recovery strategies that are essential for interactive agents. In long-horizon GUI navigation, the impact is amplified by error accumulation. A small early mistake can move the agent to states that are poorly covered by the offline data, where the policy has not learned robust correction behaviors. This increases state-distribution shift (larger $C(\pi)$) and can lower the overall success probability $J(\pi)$. Consequently, it is possible for $M_{\mathrm{off}}(\pi)$ to improve while $J(\pi)$ decreases.
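The arithmetic in Example E.1 can be checked mechanically; a minimal sketch with the action probabilities taken directly from the example:

```python
def example_scores(p_star, p_prime, p_bad):
    """Offline match score, off-demo validity mass, and true success for the
    three-action state of Example E.1 (a_star demo-valid, a_prime valid,
    a_bad invalid)."""
    assert abs(p_star + p_prime + p_bad - 1.0) < 1e-9  # must be a distribution
    m_off = p_star            # offline verification credits only a_star
    eta = p_prime             # valid but uncredited probability mass
    j = p_star + p_prime      # online success: any valid action suffices
    return m_off, eta, j

m1, e1, j1 = example_scores(0.2, 0.7, 0.1)   # policy pi_1
m2, e2, j2 = example_scores(0.4, 0.1, 0.5)   # policy pi_2

# Offline score improves while true success collapses:
# pi_1: M_off = 0.2, J ~ 0.9;  pi_2: M_off = 0.4, J = 0.5.
assert m2 > m1 and j2 < j1
```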
Taken together, these failure modes explain why offline one-step matching can be poorly predictive of online success in multi-step GUI navigation: offline evaluation measures demo-matching under $d_\mu$, whereas online success depends on long-horizon validity under $\{d_{\pi,t}\}$ and is confounded by unobserved (and potentially unstable) off-demo validity mass.

E.0.2 KL regularization improves predictability

Many RLVR pipelines drop the KL regularization term for efficiency (Yu et al., 2025; Liu et al., 2025d; Zhou et al., 2025b; Yang et al., 2025b). In our setting, however, KL regularization is important because step-wise offline matching is only partially verifiable: matches to the demonstrated action provide reliable positive signal, while non-matches are ambiguous. Below we connect KL regularization (to a reference policy) to the two quantities that govern offline-to-online predictability: the occupancy mismatch $C(\pi)$ and the off-demo validity mass $\bar{\eta}_\pi$.

Lemma E.1: KL regularization controls distribution shift

Let $\pi_{\mathrm{ref}}$ be a reference policy and assume a per-state KL constraint
$$\mathrm{KL}\big(\pi(\cdot \mid s) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid s)\big) \le \varepsilon, \quad \forall s \in \mathcal{S}.$$
Then Pinsker's inequality implies
$$\mathrm{TV}\big(\pi(\cdot \mid s), \pi_{\mathrm{ref}}(\cdot \mid s)\big) \le \sqrt{\varepsilon/2}, \quad \forall s \in \mathcal{S}.$$
Consequently, the induced state visitation distributions satisfy
$$\|d_{\pi,t} - d_{\pi_{\mathrm{ref}},t}\|_1 \le t\sqrt{2\varepsilon}, \quad \forall t \in [H].$$
Moreover, if $d_\mu(s) \ge \rho > 0$ whenever $d_\mu(s) > 0$ and $d_{\pi_{\mathrm{ref}},t} \ll d_\mu$, then
$$\sup_{s:\, d_\mu(s) > 0} \frac{d_{\pi,t}(s)}{d_\mu(s)} \le \sup_{s:\, d_\mu(s) > 0} \frac{d_{\pi_{\mathrm{ref}},t}(s)}{d_\mu(s)} + \frac{\|d_{\pi,t} - d_{\pi_{\mathrm{ref}},t}\|_1}{\rho} \le C(\pi_{\mathrm{ref}}) + \frac{t\sqrt{2\varepsilon}}{\rho}.$$
In particular, KL regularization bounds how much the occupancy mismatch $C(\pi)$ can grow relative to $\pi_{\mathrm{ref}}$.

Proof. Step 1: From KL to per-state TV.
By Pinsker's inequality, for every $s$,
$$\mathrm{TV}\big(\pi(\cdot \mid s), \pi_{\mathrm{ref}}(\cdot \mid s)\big) \le \sqrt{\tfrac{1}{2}\,\mathrm{KL}\big(\pi(\cdot \mid s) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid s)\big)} \le \sqrt{\varepsilon/2}.$$
Equivalently, for every measurable set $A \subseteq \mathcal{A}$,
$$\big|\pi(A \mid s) - \pi_{\mathrm{ref}}(A \mid s)\big| \le \mathrm{TV}\big(\pi(\cdot \mid s), \pi_{\mathrm{ref}}(\cdot \mid s)\big) \le \sqrt{\varepsilon/2}. \quad (14)$$

Step 2: One-step propagation bound. Let $P^\pi$ denote the (time-homogeneous) state-transition operator induced by $\pi$:
$$(P^\pi \nu)(s') \triangleq \sum_{s \in \mathcal{S}} \nu(s) \sum_{a \in \mathcal{A}} \pi(a \mid s)\, P(s' \mid s, a),$$
and similarly define $P^{\pi_{\mathrm{ref}}}$. For any distribution $\nu$ over states, consider $\|\nu P^\pi - \nu P^{\pi_{\mathrm{ref}}}\|_1$. For each $s'$, define
$$P^\pi(s' \mid s) \triangleq \sum_a \pi(a \mid s)\, P(s' \mid s, a), \qquad P^{\pi_{\mathrm{ref}}}(s' \mid s) \triangleq \sum_a \pi_{\mathrm{ref}}(a \mid s)\, P(s' \mid s, a).$$
Then
$$\|\nu P^\pi - \nu P^{\pi_{\mathrm{ref}}}\|_1 = \Big\|\sum_s \nu(s)\big(P^\pi(\cdot \mid s) - P^{\pi_{\mathrm{ref}}}(\cdot \mid s)\big)\Big\|_1 \le \sum_s \nu(s)\, \big\|P^\pi(\cdot \mid s) - P^{\pi_{\mathrm{ref}}}(\cdot \mid s)\big\|_1.$$
Moreover, for each fixed $s$,
$$\big\|P^\pi(\cdot \mid s) - P^{\pi_{\mathrm{ref}}}(\cdot \mid s)\big\|_1 = \Big\|\sum_a \big(\pi(a \mid s) - \pi_{\mathrm{ref}}(a \mid s)\big)\, P(\cdot \mid s, a)\Big\|_1 \le \sum_a \big|\pi(a \mid s) - \pi_{\mathrm{ref}}(a \mid s)\big| = \|\pi(\cdot \mid s) - \pi_{\mathrm{ref}}(\cdot \mid s)\|_1.$$
Using $\|\cdot\|_1 = 2\,\mathrm{TV}(\cdot, \cdot)$ for distributions,
$$\big\|P^\pi(\cdot \mid s) - P^{\pi_{\mathrm{ref}}}(\cdot \mid s)\big\|_1 \le 2\,\mathrm{TV}\big(\pi(\cdot \mid s), \pi_{\mathrm{ref}}(\cdot \mid s)\big) \le 2\sqrt{\varepsilon/2} = \sqrt{2\varepsilon}.$$
Therefore, for any $\nu$,
$$\|\nu P^\pi - \nu P^{\pi_{\mathrm{ref}}}\|_1 \le \sqrt{2\varepsilon}. \quad (15)$$

Step 3: Telescoping over $t$ steps. Let $d_{\pi,t}$ and $d_{\pi_{\mathrm{ref}},t}$ denote the state distributions at step $t$ under $\pi$ and $\pi_{\mathrm{ref}}$, respectively. Using the recursion $d_{\pi,t+1} = d_{\pi,t} P^\pi$ and $d_{\pi_{\mathrm{ref}},t+1} = d_{\pi_{\mathrm{ref}},t} P^{\pi_{\mathrm{ref}}}$, we have
$$\|d_{\pi,t+1} - d_{\pi_{\mathrm{ref}},t+1}\|_1 = \|d_{\pi,t} P^\pi - d_{\pi_{\mathrm{ref}},t} P^{\pi_{\mathrm{ref}}}\|_1 \le \underbrace{\|d_{\pi,t} P^\pi - d_{\pi,t} P^{\pi_{\mathrm{ref}}}\|_1}_{\le \sqrt{2\varepsilon}\ \text{by Eq. (15)}} + \underbrace{\|d_{\pi,t} P^{\pi_{\mathrm{ref}}} - d_{\pi_{\mathrm{ref}},t} P^{\pi_{\mathrm{ref}}}\|_1}_{\le \|d_{\pi,t} - d_{\pi_{\mathrm{ref}},t}\|_1}.$$
Thus,
$$\|d_{\pi,t+1} - d_{\pi_{\mathrm{ref}},t+1}\|_1 \le \|d_{\pi,t} - d_{\pi_{\mathrm{ref}},t}\|_1 + \sqrt{2\varepsilon}.$$
Iterating this inequality and using $\|d_{\pi,0} - d_{\pi_{\mathrm{ref}},0}\|_1 = 0$ (same initial distribution) yields
$$\|d_{\pi,t} - d_{\pi_{\mathrm{ref}},t}\|_1 \le t\sqrt{2\varepsilon}.$$

Step 4: Bounding occupancy mismatch relative to $d_\mu$. Assume $d_{\pi_{\mathrm{ref}},t} \ll d_\mu$ and $d_\mu(s) \ge \rho > 0$ whenever $d_\mu(s) > 0$. For any $s$ with $d_\mu(s) > 0$,
$$\frac{d_{\pi,t}(s)}{d_\mu(s)} = \frac{d_{\pi_{\mathrm{ref}},t}(s)}{d_\mu(s)} + \frac{d_{\pi,t}(s) - d_{\pi_{\mathrm{ref}},t}(s)}{d_\mu(s)} \le \frac{d_{\pi_{\mathrm{ref}},t}(s)}{d_\mu(s)} + \frac{|d_{\pi,t}(s) - d_{\pi_{\mathrm{ref}},t}(s)|}{\rho}.$$
Taking the supremum over $s$ with $d_\mu(s) > 0$ and using $\sup_s |x_s| \le \sum_s |x_s| = \|x\|_1$ gives
$$\sup_{s:\, d_\mu(s) > 0} \frac{d_{\pi,t}(s)}{d_\mu(s)} \le \sup_{s:\, d_\mu(s) > 0} \frac{d_{\pi_{\mathrm{ref}},t}(s)}{d_\mu(s)} + \frac{\|d_{\pi,t} - d_{\pi_{\mathrm{ref}},t}\|_1}{\rho} \le C(\pi_{\mathrm{ref}}) + \frac{t\sqrt{2\varepsilon}}{\rho},$$
which completes the proof.

Lemma E.2: KL limits off-demo validity mass under single-demo supervision

Fix a state $s$ and let $\tilde{a} = \tilde{a}(s)$ be the demonstrated action. Assume $\mathrm{KL}\big(\pi(\cdot \mid s) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid s)\big) \le \varepsilon$ and that the reference policy is demo-concentrated: $\pi_{\mathrm{ref}}(\tilde{a} \mid s) \ge 1 - \delta(s)$. Then
$$1 - \pi(\tilde{a} \mid s) \le \delta(s) + \sqrt{\varepsilon/2}, \qquad \eta_\pi(s) = \pi\big(\mathcal{A}^*(s) \setminus \{\tilde{a}\} \mid s\big) \le 1 - \pi(\tilde{a} \mid s),$$
and therefore
$$\eta_\pi(s) \le \delta(s) + \sqrt{\varepsilon/2}, \qquad \bar{\eta}_\pi \le \bar{\delta} + \sqrt{\varepsilon/2}, \qquad \bar{\delta} \triangleq \mathbb{E}_{s \sim d_\mu}[\delta(s)].$$

Proof. By Pinsker's inequality, $\mathrm{TV}\big(\pi(\cdot \mid s), \pi_{\mathrm{ref}}(\cdot \mid s)\big) \le \sqrt{\varepsilon/2}$. Let $E \triangleq \{a \neq \tilde{a}\}$. Using the standard event bound for total variation distance,
$$\pi(E \mid s) \le \pi_{\mathrm{ref}}(E \mid s) + \mathrm{TV}\big(\pi(\cdot \mid s), \pi_{\mathrm{ref}}(\cdot \mid s)\big) \le 1 - \pi_{\mathrm{ref}}(\tilde{a} \mid s) + \sqrt{\varepsilon/2} \le \delta(s) + \sqrt{\varepsilon/2}.$$
Since $\pi(E \mid s) = 1 - \pi(\tilde{a} \mid s)$, we obtain $1 - \pi(\tilde{a} \mid s) \le \delta(s) + \sqrt{\varepsilon/2}$. Next, because $\mathcal{A}^*(s) \setminus \{\tilde{a}\} \subseteq E$, we have
$$\eta_\pi(s) = \pi\big(\mathcal{A}^*(s) \setminus \{\tilde{a}\} \mid s\big) \le \pi(E \mid s) \le \delta(s) + \sqrt{\varepsilon/2}.$$
Taking expectation over $s \sim d_\mu$ yields
$$\bar{\eta}_\pi = \mathbb{E}_{s \sim d_\mu}[\eta_\pi(s)] \le \mathbb{E}_{s \sim d_\mu}[\delta(s)] + \sqrt{\varepsilon/2} = \bar{\delta} + \sqrt{\varepsilon/2},$$
which completes the proof.
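As a numerical sanity check of the two lemmas, the snippet below verifies Pinsker's inequality and the Lemma E.2 bound on a toy three-action state; the two distributions are arbitrary illustrative choices, not trained policies:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

def tv(p, q):
    """Total variation distance = half the L1 distance."""
    return 0.5 * float(np.abs(p - q).sum())

# Actions ordered as [demo action a~, other valid a', invalid a_bad].
pi_ref = np.array([0.90, 0.06, 0.04])  # demo-concentrated: delta(s) = 0.10
pi     = np.array([0.80, 0.15, 0.05])  # policy after a hypothetical RL update

eps = kl(pi, pi_ref)                   # per-state KL "trust region" radius

# Pinsker (Step 1 of Lemma E.1): TV <= sqrt(eps / 2).
assert tv(pi, pi_ref) <= np.sqrt(eps / 2.0)

# Lemma E.2: off-demo mass eta_pi(s) = pi(a') <= delta(s) + sqrt(eps / 2).
delta = 1.0 - pi_ref[0]
assert pi[1] <= delta + np.sqrt(eps / 2.0)

print(f"eps={eps:.4f}  TV={tv(pi, pi_ref):.4f}  pinsker_bound={np.sqrt(eps/2):.4f}")
```

With these numbers the KL radius is roughly 0.05, so the Pinsker bound (about 0.165) dominates the actual TV distance (0.10), and the off-demo mass 0.15 stays below $\delta(s) + \sqrt{\varepsilon/2} \approx 0.26$, as the lemma requires.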
Takeaway. The offline-to-online bound shows that predictability depends on (i) distribution shift along the policy's own trajectories (captured by $C(\pi)$) and (ii) the unobserved probability mass on valid-but-uncredited actions under single-demo verification (captured by $\bar{\eta}_\pi$). Lemma E.1 and Lemma E.2 explain why KL regularization helps in this setting: a KL trust region simultaneously limits state-distribution drift (controlling $C(\pi)$) and prevents the policy from moving too much probability mass away from the demonstrated action (controlling $\bar{\eta}_\pi$). As a result, conservative KL-regularized optimization keeps training in a regime where improvements in the offline matching score $M_{\mathrm{off}}(\pi)$ are more likely to reflect genuine improvements in online success $J(\pi)$.

F Prompt Templates

F.1 Prompt for Reasoning Augmentation

We use the following prompts with GPT-4.1 to generate reasoning traces, which form our initial reasoning dataset. We draw inspiration from AGUVIS (Xu et al., 2025c) and further refine and adapt the prompt through iterative trial-and-error during our data generation process.

Prompt Template for Web Data

Please generate detailed reasoning and explain your logic according to the UI screenshot, instruction, interaction history and the reference action.

## Instruction: {}
## Interaction History: {}
## Reference Action Description for Current Step: {}
## Reference Action Command for Current Step: {}

Note: If there exists a referenced element, it is highlighted by a small red hollow circle in the screenshot.

## Guidelines
• Carefully observe and interpret the screenshot to extract any relevant information that can inform your reasoning.
• Reason step by step to accomplish the instruction, aligning your thought process with the overall goal.
• Use the reference action as a guide, but do not simply repeat it. Instead, explain why you would take that action (if it is reasonable) without referring to it.
• Do not rely on the current reference action or the highlighted hollow circle as justification; they are outcomes of hindsight.
• Use a first-person perspective to express your reasoning, as if you are the one navigating and making decisions.
• If the question is primarily in a language other than English, please reason in English but provide the final answer in the corresponding language.
• The reference action may involve both clicking and writing simultaneously. In such cases, describe the action as: first click the target element, then type the desired text. You should specify the action type as Write and indicate the target element where the text should be entered in the JSON.

## Response Format:
Return exactly one reasoning + action pair for one step, with the following structured format:

Explain your reasoning in detail, step by step.
• State what you observe and extract useful information from the screenshot. Keep important information in the thought because you may need to refer to it later.
• Reflect on the instruction and interaction history to understand the context and current status.
• Decide what action to take next and why, following the guidelines. Do not summarize; show your full chain of thought.

{
  "action_description": "Describe the current step action with a clear instruction",
  "action_type": "The type of action to take, e.g., Click, Select, Write, KeyBoardPress, Scroll, Terminate or Answer",
  "action_target": "Provide a description of the element you want to operate on the screen. (If action_type == Scroll or Answer, this field should be 'None'.) It should include the element's identity, type (button, input field, dropdown menu, tab, etc.), text on it (if any), and location hints if needed to disambiguate. If you want to write text into an input field, specify the input field's identity and necessary information. If you already clicked the input field in the last step, you can leave this field as 'None'.",
  "value": "Specify the input or selected value. For Write, Select, and Answer actions, provide the exact text or option. For KeyboardPress, specify the key(s). For Scroll, indicate the direction as in the reference action command. For other action types, use 'None'."
}

Prompt Template for Mobile Data

Please generate detailed reasoning and explain your logic according to the UI screenshot, instruction, interaction history and the reference action.

## Instruction: {}
## Interaction History: {}
## Reference Action Description for Current Step (labeled by human): {}
## Reference Action Command for Current Step (labeled by human): {}

Note: If there exists a referenced element, it is highlighted by a small red hollow circle in the screenshot.

## Guidelines
• Carefully observe and interpret the screenshot to extract any relevant information that can inform your reasoning.
• Reason step by step to accomplish the instruction, aligning your thought process with the overall goal.
• Use the reference action as a guide, but do not simply repeat it. Instead, explain why you would take that action (if it is reasonable) without referring to it.
• Do not rely on the current reference action or the highlighted hollow circle as justification; they are outcomes of hindsight.
• Use a first-person perspective to express your reasoning, as if you are the one navigating and making decisions.
• If the question is primarily in a language other than English, please reason in English but provide the final answer in the corresponding language.
• The reference action may involve both clicking and writing simultaneously. In such cases, describe the action as: first click the target element, then type the desired text. You should specify the action type as Write and indicate the target element where the text should be entered in the JSON.

## Response Format:
Return exactly one reasoning + action pair for one step, with the following structured format:

Explain your reasoning in detail, step by step.
• State what you observe and extract useful information from the screenshot. Keep important information in the thought because you may need to refer to it later.
• Reflect on the instruction and interaction history to understand the context and current status.
• Decide what action to take next and why, following the guidelines. Do not summarize; show your full chain of thought.

{
  "action_description": "Describe the current step action with a clear instruction",
  "action_type": "The type of action to take, e.g., Click, Write, LongPress, Scroll, Swipe, NavigateHome, Wait, NavigateBack, Terminate, OpenAPP, or Answer. If the reference action is terminate and you need to answer the question from the user, you should specify the action type as Answer and provide the answer in the value field.",
  "action_target": "Provide a description of the element you want to operate on the screen. (If action_type == Scroll, Terminate, or Answer, this field should be 'None'.) It should include the element's identity, type (button, input field, dropdown menu, tab, etc.), text on it (if any), and location hints if needed to disambiguate. If you want to write text into an input field, specify the input field's identity and necessary information. If you already clicked the input field in the last step, you can leave this field as 'None'.",
  "value": "Specify the input or selected value. For Write, Select, and Answer actions, provide the exact text or option. For Scroll and Swipe, indicate the direction. For Wait, specify the wait time in seconds. For OpenAPP, specify the app name. For other action types, use 'None'."
}

F.2 SFT Data Example

Image: 20162/screenshot_1.png

System Prompt:

You are a GUI agent. You are given a task and a screenshot of the screen. You need to perform a series of actions to complete the task. You need to choose actions from the following list:
action_type: Answer, action_target: None, value: Answer text, point_2d: None ## Explanation: Return the final answer to the user's question
action_type: Click, action_target: Element description, value: None, point_2d: [x, y] ## Explanation: Tap or click a specific UI element and provide its coordinates
action_type: Write, action_target: Element description or None, value: Text to enter, point_2d: [x, y] or None ## Explanation: Enter text into a specific input field or at the current focus if coordinate is None
action_type: LongPress, action_target: Element description, value: None, point_2d: [x, y] ## Explanation: Press and hold on a specific UI element (mobile only) and provide its coordinates
action_type: Scroll, action_target: None, value: "up" | "down" | "left" | "right", point_2d: None ## Explanation: Scroll a view or container in the specified direction
action_type: Swipe, action_target: Optional position or None, value: "up" | "down" | "left" | "right", point_2d: [x, y] or None ## Explanation: Perform a swipe gesture on the screen in the given direction (mobile)
action_type: Wait, action_target: None, value: Number of seconds, point_2d: None ## Explanation: Pause execution to allow the UI to load or update
action_type: NavigateHome, action_target: None, value: None, point_2d: None ## Explanation: Navigate to the device's home screen
action_type: NavigateBack, action_target: None, value: None, point_2d: None ## Explanation: Press the system "Back" button
action_type: OpenApp, action_target: None, value: App name, point_2d: None ## Explanation: Launch an app by its name (mobile only)
action_type: Terminate, action_target: None, value: End-task message, point_2d: None ## Explanation: Signal the end of the current task with a final message

Human Instruction and History:

Please generate the next move according to the UI screenshot (origin image size 1080x2400), instruction and previous actions.

Instruction: Using the Blossom app, I would like to know the botanical name of the ihi plant in order to identify the type of plant it is.

Interaction History: Step 1 Action: Open the Blossom app to start searching for the botanical name of the ihi plant.

Response: The screenshot shows that the Blossom app is now open, and I am on the main search interface. The instruction is to find the botanical name of the "ihi" plant. There is a prominent search bar near the center of the screen labeled "Type a plant name," which is intended for entering the name of a plant to search for information. Since I need to search for the ihi plant and identify its botanical name, the next logical step is to activate the search bar so I can type "ihi" into it. This will allow me to perform a search for the plant and get the relevant botanical information.
{
  "action_description": "Tap on the search bar labeled 'Type a plant name' to start entering the plant's name.",
  "action_type": "Click",
  "action_target": "Search bar input field labeled 'Type a plant name' located below the text 'Find perfect Plants & Veggies'",
  "value": "None",
  "point_2d": [500, 397]
}

F.3 RL Data Example

Image:

System Prompt: identical to the system prompt shown in F.2 above.

Human Instruction and History:

Please generate the next move according to the UI screenshot, instruction and previous actions.

Instruction: In the ZArchive App, decompress the DCIM.7z zip file and save it to the Pocketbook folder.

Interaction History: Step 1 Action: Open the ZArchiver app to begin the decompression process as required by the instruction. Step 2 Action: Tap on the 'DCIM.7z' archive file to open the extraction options menu.

The response should be structured in the following format. Make sure the output between and is a valid JSON object. Regarding the key "point_2d", please provide the coordinates on the screen where the action is to be performed; if not applicable, use [-100, -100].

Your step-by-step thought process here...

{
  "action_description": "the description of the action to perform, summarized in one sentence",
  "action_type": "the type of action to perform. Please follow the system prompt for available actions.",
  "value": "the input text or direction ('up', 'down', 'left', 'right') for the 'scroll' action, if applicable; otherwise, use 'None'",
  "point_2d": [x, y]
}

The answer and the additional information used for reward computation are shown below. The gt_point_2d coordinates are normalized to the range [0, 1], while gt_bbox is further scaled by a factor of 1000.
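As an illustration of this coordinate convention, the following sketch (our reconstruction from the description above; the helper name and the exact matching rule used in training are assumptions, not the released code) checks whether a predicted pixel coordinate falls inside the scaled ground-truth box:

```python
def point_in_gt_bbox(point_2d, gt_bbox, image_width, image_height):
    """Return True if a predicted pixel point [x, y] lands inside gt_bbox.

    gt_bbox is [x1, y1, x2, y2] in normalized coordinates scaled by 1000,
    matching the ground-truth record format shown in this example.
    """
    x = point_2d[0] / image_width * 1000.0
    y = point_2d[1] / image_height * 1000.0
    x1, y1, x2, y2 = gt_bbox
    return x1 <= x <= x2 and y1 <= y <= y2

# A hypothetical prediction near the 'Extract...' menu entry on a
# 1080x2400 screenshot: (210, 2022) maps to roughly (194, 843) after
# scaling, which lies inside [36, 824, 268, 857].
print(point_in_gt_bbox([210, 2022], [36, 824, 268, 857], 1080, 2400))  # True
```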
{ ’gt_action’: ’Click’, ’gt_input_text’: ’None’, ’gt_point_2d’: [0.194, 0.843], ’gt_target’: "’Extract...’ option in the extraction context menu for DCIM.7z", ’image_height’: 2400, ’image_width’: 1080, ’gt_bbox’: [36, 824, 268, 857] } G Long-Ho rizon T rajecto ry Case Studies Figures 14 and 15 compare GUI-Libra-7B with its base model, Qwen2.5-VL-7B-Instruct, on AndroidW orld T ask 18 ( ExpenseDeleteMul tiple ). The task requires deleting three sp ecific exp enses in Pro Exp ense: Scho ol Supplies , R eligious , and Flight Tickets . As shown in Figure 14 , GUI-Libra successfully completes this long-horizon task by alternating b etw een iterative reasoning and grounded actions. In contrast, the base mo del requires more steps to disco ver how to delete a single item and then fails to remov e the second one, highligh ting its difficult y in sustaining multi-step progress and demonstrating the adv antage of GUI-Libra on long-horizon decision making. W e also include a web navigation case study in Figure 17 , where GUI-Libra-4B successfully completes a long-horizon W ebArena-Lite-v2 task, further illustrating strong generalization to multi-step web interactions. 53 GUI-Lib ra: T raining Native GUI Agents to Reason and Act with Action-a wa re Sup ervision and Pa rtially Verifiable RL Step 1: Reasoning: …Since the app is not present, my next logical step is to swipe up or sideways to access the app drawer or another home screen page whe re the "pro expense" app might be located. Action : Swipe up on the home screen t o open the app drawer and access all installed apps. Step 2: Reasoning : … Since the app drawer is not sho wing the "pro expense" app, my next logical step is to swipe up to reveal more apps, as the app may be located on a subsequent page. Action : Swipe up on t he app drawer to reveal more apps and look for the 'pro expense' app. Step 3: Reasoning: … I see the "Pro Expense" app icon in the third row, se cond colu mn from the left. 
Since the app is not yet open, my next logic al step is to tap on the "Pro Expense" app icon to launch it and proceed with the task of deleting the specified expenses. Action : Tap on the 'Pro Expense' app i con to open the app and access its contents. Step 4: Reasoning: … I need to access the details o r edit page for each expen se. On the Pro Expense app, expens es are typically managed by tapping on their names in the "Recen t" section. Therefore, my next step is to tap on the "School Supplies" expense … Action : Tap on the 'Schoo l Supplies' expe nse in the Rec ent section to view its detail s and look for an optio n to delete i t.. Step 5: Reasoning: … Tapping this icon shoul d bring up a confirmation dialog to delete the expense. This aligns with the instruction to delete "School Supplie s" first, and it is the logical n ext step before moving on to the other expenses... Action : Tap the trash can icon next to the 'School Supplies' expe nse to initiate the deletion process.. Step 6: Reasoning: … In the previous step, I in itiated the deletion pro cess for "School Supplies," and now the app is prompting me to confirm the deletion. To proceed w ith deleting the "Religio us" expense, I need to confirm this deletion. The lo gical next step is to tap the "CON FIRM" button. Action : Tap the 'CONFIRM' button in the 'Delete Items' dialog to confirm the deletion o f the selected item. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Step 7: Reasoning: …I see two expenses: "Re ligious and "Fligh t Tickets" both of w hich are visible and selectable. The interaction history shows that I have alre ady deleted the " School Supplies" expense, and no w I need to delete the "Rel igious" and "Fligh t Tickets" e xpenses. Action : Tap on the 'Religious' expe nse in the Re cent section to view its detail s and look for the delete op tion. Step 8: Reasoning : …To delete this expense, I need to access the delete o ption, which is typically represente d by a trash can icon. 
In the screensh ot, I see a trash can ic on to the righ t of the "Relig ious" expense c ard. Tapping this icon will likely bring up a confirmation dialo g to delete the e xpense Action : Tap the trash can icon next to the 'Rel igious' expense to initiate th e deletion process. Step 9 : Reasoning: … the previous step was to tap the trash can icon for this expense . Now, to complete the del etion, I need to con firm the action by tapping the "CONF IRM" button. This will finalize the deletion of the "Religious" expense, as required by the instruction… Action : Tap the 'CONFIRM' button in the 'Delete Items' dialog to confirm the deletion o f the 'Religious' expe nse. Step 10: Reasoning: … The Rec ent section now only shows "Fl ight Tickets" as a recent expense, which means the other two expenses have been removed. To ensure that the "F light Tickets" expense is also deleted, I need to tap on it to view its details and look for the delete option. ... Action : Tap on the 'Flight Tickets' expens e in the Recent section to view its detail s and look for the delete op tion. Step 12: Reasoning: … The instruction require s me to delete three specific expenses: "Schoo l Supplies," "Relig ious," and "Flight Tickets ." Since the "Flight Tickets" expense is the last one to be deleted, and th e dialog is prompting for conf irmation, the logical next step is to tap the "CONFIRM" button … Action : Tap the 'CONFIRM' button in the 'Delete Items' dialog to confirm the deletion o f the 'Flight Tickets' expens e. Step 11: Reasoning: … The interaction history shows that I have already tapped the trash can icon for the "Schoo l Supplies" and "Rel igious" expenses, and n ow I am at the "Flight Tickets " expense. The next logical step is to tap the trash can icon nex t to the "Fligh t Tickets" e xpense.. . Action : Tap the trash can ico n next to the 'Flight Tic kets' expense to initiate the dele tion process. Step 13: Reasoning: … The onl y remaining expense to delete is "Flight Tickets .”. 
To do this, I should look for a way to access the list of all expenses. On the left side of the screen, there is a menu icon (three horizontal lines) at the top left corner. Tapping this icon will likely open a navigation drawer or menu. ... Action: Tap the menu icon (three horizontal lines) at the top left corner to open the navigation drawer and access the list of all expenses.

Step 14: Reasoning: …Since I have already navigated to the navigation drawer and can see the list of options, my next logical step is to access the "Expense Logs" section. This will allow me to view the full list of all expenses, including the ones I just deleted, so I can verify that they are no longer present... Action: Tap on the 'Expense Logs' option in the navigation drawer to view the list of all expenses.

Step 15: Reasoning: …Based on the interaction history, I have already deleted these expenses in previous steps. Since there are no expense logs, it is clear that the deletion process has been completed and there are no further expenses to manage or delete. There are no additional actions required.... Action: The task is complete as there are no expense logs and the specified expenses have been deleted.

Figure 14 Trajectory Example of GUI-Libra-7B for Task 18 ExpenseDeleteMultiple: Delete the following expenses from pro expense: School Supplies, Religious, Flight Tickets.

Step 1: Reasoning: The user wants to delete expenses from Pro Expense, but there is no visible Pro Expense app icon on the home screen. I need to scroll to find the Pro Expense app. Action: {"action_type": "scroll", "direction": "down", "element": "Pro Expense app", "coordinates": [546, 1238]}

Step 2: Reasoning: The Pro Expense app is not visible in the current screenshot. I need to scroll down to find the Pro Expense app. Action: {"action_type": "scroll", "direction": "down", "element": "Pro Expense app", "coordinates": [546, 1238]}

Step 3: Reasoning: The Pro Expense app needs to be opened to manage the expenses. Action: {"action_type": "open_app", "app_name": "Pro Expense"}

Step 4: Reasoning: The goal is to delete the specified expenses, but currently, I can only see the recent expenses listed without any option to edit or delete them directly. I need to find a way to access the details of these expenses to proceed with deletion. Action: {"action_type": "long_press", "element": "School Supplies", "coordinates": [397, 2158]}

Step 5: Reasoning: The goal is to delete the specified expenses, and the "School Supplies" entry is visible in the expense detail section. Long pressing this entry will likely bring up options including deletion. Action: {"action_type": "long_press", "element": "School Supplies", "coordinates": [397, 2158]}

Step 6: Reasoning: The goal is to delete the specified expenses, and currently, the "School Supplies" entry is selected. The next logical step is to delete this entry. Action: {"action_type": "click", "element": "DELETE", "coordinates": [546, 1238]}

Step 7: Reasoning: I need to delete the "Religious" expense entry. Long pressing it will likely bring up options including deletion. Action: {"action_type": "long_press", "element": "Religious", "coordinates": [397, 1780]}

Step 8: Reasoning: I need to delete the "Religious" expense entry. The long press action has brought up options, including the option to delete the entry. Clicking the delete icon will remove the expense. Action: {"action_type": "click", "element": "DELETE", "coordinates": [978, 1664]}

Step 9: Reasoning: I need to delete the "Religious" expense item. The long press action has brought up the delete option, and I should confirm the deletion to proceed. Action: {"action_type": "click", "element": "CONFIRM", "coordinates": [546, 2230]}

Step 10: Reasoning: The goal is to delete the specified expenses, and the "School Supplies" expense has been successfully deleted. Now I need to check if the "Flight Tickets" expense has been deleted as well. Action: {"action_type": "long_press", "element": "Flight Tickets", "coordinates": [397, 1780]}

Step 11: Reasoning: The "DELETE" option for the "Flight Tickets" entry is visible, indicating that the next step is to confirm the deletion. Action: {"action_type": "click", "element": "DELETE", "coordinates": [978, 1664]}

Step 12: Reasoning: I need to delete the "Religious" expense item. Action: {"action_type": "long_press", "element": "Religious", "coordinates": [397, 1780]}

Step 13: Reasoning: The goal is to delete the specified expenses, and the confirmation dialog is displayed, indicating that the item will be deleted. The next logical step is to confirm the deletion to complete the task. Action: {"action_type": "click", "element": "CONFIRM", "coordinates": [546, 2230]}

Step 14: Reasoning: I need to find and delete the "Religious" expense entry. Since it was previously deleted, I should check if there are any more entries related to religious expenses. Action: {"action_type": "long_press", "element": "Religious", "coordinates": [397, 1780]}

Step 15: Reasoning: The goal is to delete the specified expenses, and the "School Supplies" expense has already been deleted. Now, I need to check if the "Religious" expense has been deleted. Action: {"action_type": "long_press", "element": "Religious", "coordinates": [397, 1780]}

Step 16: Reasoning: The goal is to delete the specified expenses, and the "School Supplies" entry has already been deleted. Now, I need to check if the "Religious" expense has been successfully deleted. Action: {"action_type": "long_press", "element": "Religious", "coordinates": [397, 1780]}

Step 17: Reasoning: The goal is to delete the specified expenses, and the "School Supplies" expense has already been deleted. Now, I need to check if the "Religious" expense has been deleted. Action: {"action_type": "long_press", "element": "Religious", "coordinates": [397, 1780]}

Step 18: Reasoning: The goal is to delete the specified expenses, and the "School Supplies" expense has already been deleted. Now, I need to delete the "Religious" expense. Action: {"action_type": "long_press", "element": "Religious", "coordinates": [397, 1780]}

Step 19: Reasoning: The goal is to delete the specified expenses, and the "School Supplies" expense has already been deleted. Now, I need to delete the "Religious" expense. Action: {"action_type": "long_press", "element": "Religious", "coordinates": [397, 1780]}

Step 20: Reasoning: The goal is to delete the specified expenses, and the "School Supplies" expense has already been deleted. Now, I need to delete the "Religious" expense. Action: {"action_type": "long_press", "element": "Religious", "coordinates": [397, 1780]}

Figure 15 Trajectory Example of the base model Qwen2.5-VL-7B-Instruct for Task 18 ExpenseDeleteMultiple: Delete the following expenses from pro expense: School Supplies, Religious, Flight Tickets.

Step 1: Reasoning: The user wants to follow specific people ("lahwaacz", "Koushik", "Vinta Chen") on GitLab. … I am on the "Projects" page in GitLab… I need to navigate to the "Profile" or "User" section where followers can be added... Clicking the profile/avatar icon usually opens a dropdown with options like "Profile," "Activity," and sometimes "Following" or "Followers.
" From there, I can find the "Following" section or a search fu nction to look up the users… Action : {" action_type ": "Click" , "value": "None" , "point_2d": [956, 17] } 1 Reasoning: …According to the in teraction history , I have just opened the dropdown menu… I see options like "Edit status…n o direct option t o follow use rs h ere. Typ icall y, to foll ow u sers on GitLab, you need to navigate to their profiles… The next logical st ep is to look for a way to access the user profiles or search for these users… I should look for a way to access the user prof iles or search functionality . Action : { " action_type ": "Click" , "value ": "None" , "point_2d": [835, 197] } 2 Reasoning: I see that I am currently on the "Edit Profile" page in GitLab's User Settings… my goal is to follow speci fic us ers: ' lahwaacz ', 'Koushik' , and 'Vinta C hen' . T o do this, I need to access the user profiles or search functionality to find these users… Using thi s search bar will allow me to quickly loca te the profiles of the u sers I want to follow. Therefore, my next step should b e to use the search bar to l ook up these usernames. Action : {" action_type ": "Click" , "value ": "None" , "point_2d": [193, 22] } 3 Reasoning: I see that the "Search GitLab" bar at the top of th e Gi tLab page is c urrent ly active… The instr uction is to follow specific use rs: ' lahwaacz ' , 'Kou shik' , and 'Vinta Chen' . To do this , I first need to search for these users. Since the search bar is a lready foc used, my next step should be to type the firs t username, ' lahwaacz ', i n t o t h e search bar so I can loca te their profil e and follo w them... Action : {" action_type ": "Write" , "value": " lahwaacz ", " p o i n t _ 2 d " : [ 3 2 5 , 2 6 ] } 4 5 Reasoning: I see that I have typed "lahwaacz " into the sear ch bar at the top of the GitLab page… The first result is "Jakub Klinkovsky " with the username " lahwaacz " displayed underneath. 
This matches exactly with the us er I need to follow according to the instruction. To proceed… I need to select this user … so I can a ccess their profil e page , where there will likely be a "Follow" button. Therefor e, my ne xt step is to click on the "Jakub Klinkovsky " entry … Action : " action_type ": "Click" , "value ": "None" , "point_2d": [145, 141] } Reasoning: I see that I am now on the profile page of Jakub Klink ovsky (@lahwaacz) on GitLab. The instruction is to follow [' lahwaacz ' , 'Koushik' , 'Vinta Chen ’]… I notice there is a blue "Foll ow" button in the upper right area of the pr ofile header … To f u lf i ll t h e f i r st pa r t o f t he i n st r u c ti o n (f o ll o w i n g @lahwaacz), I need t o click this "Follow" button… which is the requi red action before moving on to search fo r and follo w the other tw o users. Action : {" action_type ": "Click" , "value ": "None" , "point_2d": [950, 79] } 6 Reasoning: … currently on the p rofile page of Jakub Klink ovsky (@lahwaacz) on GitLab. The "Unfollow" button is visible at the top right, whi ch confirms that I have already followed this user . According to the instruction, m y next step is to follow "K oushik" and "Vinta Chen" as well. T o do this, I need to search for these users next… I need to click on the "Sear ch GitLab" bar at the top of the page so I can type in the next username to loo k up. Action : {" action_type ": "Click" , "value ": "None" , "point_2d": [187, 22] } 7 Reasoning: … I am on the GitL ab website, and the search bar at the top is currently active… According to the instruction, I n eed to follo w [' lahwaacz ' , 'Kou shik' , 'Vinta Chen'] on GitLab . I have already follo wed @lahwaacz (Jakub Klinkovsky ) … I need to search fo r "Koushik" next …. so my next st ep should be to type "Kou shik" into the search bar to … allow me to locat e their profile and proceed with following them. 
Action : {" action_type ": "Write" , "value": "Koushik" , "point_2d": [323, 25] } 8 Reasoning: … I have typed "Kou shik" into the search bar a t the top of the GitLab page… The first suggestion is "Kou shik Dutta" with the usernam e " kous h ". T h i s m a t c h e s t h e u s e r I n e e d t o f o l l o w n e x t according to the instruction. To proceed with following "K oushik, “ … next step is to click on the "Koushik Dutta" user entry in the search result s so I can access their prof ile and look for a "F ollow" button there . Action : {" action_type ": "Click" , "value ": "None" , "point_2d": [140, 140] } 9 Reasoning: … I am now on the profile p age of "Koushik Dutta" (@koush) on GitLab . The instruction is t o follow [' lahwaacz ' , 'K oushik' , 'Vinta Chen’]… I hav e already… navigated to K oushik Dutta's profile… there is a prominent blue "Follow" button in the upper right section of the pro file header … the next logical step is t o click the "Follow" button on this profile page. This will complete the action of f ollowing "Koushik" . Action : {" action_type ": "Click" , "value ": "None" , "point_2d": [950, 80] } 10 Reasoning: …I am currently on the p rofile page of "Koushik Dutt a" (@koush) on Git Lab. The "Unfollow" button is visible in the upp er right corner of the profile, which means this use r is already being followed. According to the instruction, I n eed to … I ha ve already follow ed ' lahwaacz ' and 'Koushik' . Now, I n eed to search f or 'Vinta Chen' a nd follow them. T o do this, I need to initiate a ne w search… page so I ca n type "Vinta Chen" into it. Action : {" action_type ": "Click" , "value ": "None" , "point_2d": [188, 22] } 11 Reasoning: I see that the "Search GitLab" bar at the top of th e Gi tLab page is c urrent ly ac tive and ready for inp ut… I ha ve al ready fol lowed " lahwaacz " and "Koushik" . The next user to follow is "Vint a Chen" . T o procee d, I need to se arch for "Vinta Chen" in the GitL ab search bar. 
Therefore , my next step is to type "V inta Chen" into th e search bar so I can locate their profile and follow them. Action : {" action_type ": "Write" , "value": "Vinta Chen" , "point_2d": [ 328, 25] } 12 Reasoning: … I have just searched for "V inta Chen" in the GitL ab search bar… The first result is "Vinta Chen" with the username "vin ta" . According to the inst ruction, I need to follow "Vinta Ch en" after following " lahwaacz " and "Ko ushik" . T o proceed, I need to navigate to Vinta Chen's pr ofile page so I can find and click the "Follow " button. … next step is to cl ick on the "Vinta Chen" user entry in the searc h results , which will t ake me t o their profile page . Action : {" action_type ": "Click" , "value ": "None" , "point_2d": [135, 140] } 13 Reasoning: … I am now on Vinta Chen's GitLab profile page . The "Follow" button is visible in the top right corner of the profile section… my goal is to follow the us ers " lahwaacz , " "Kou shik, " and "Vinta Chen. " I have already followed " lahwaacz " and "Koushik. " No w , I need to follow "Vinta Chen. " The "Follow" button is clearly displayed and actionable on this page . Clicking this button will com plete the requi remen t to foll ow al l th ree users as ins truc ted. Action : {" action_type ": "Click" , "value ": "None" , "point_2d": [950, 80] } 14 Reasoning: …I am on Vinta Chen 's GitLab profile page . The "Unfollow" button is visible at the top right, which mea ns that I am currently followi ng this user ... I h ave already follow ed 'lahwaacz ' and 'Koushik' , and no w I have just f ollowed 'Vinta Chen' . All three user s have been f ollowed as p er the instruction. The re are n o further actio ns require d to fulfill the instruction. Action : {" action_type ": " Answer" , "value": " All three users, 'lahwaacz ' , 'Koushik' , and 'Vinta Chen' , have been follow ed on GitLab. 
" , "point_2d": [ - 100, - 100] } 15 Figure 16 T ra jectory Example of GUI-Libra-4B for W ebArena-Lite-v2 T ask: F ol low [’lahwaacz’, ’Koushik’, ’Vinta Chen ’] on Gitlab. 56 GUI-Lib ra: T raining Native GUI Agents to Reason and Act with Action-a wa re Sup ervision and Pa rtially Verifiable RL Reaso ning: The t ask is t o f ol lo w thr ee sp ecific user s o n GitL ab: ' lahw aac z ' , 'K o ush ik' , and 'Vin t a Chen' . Curr ently , w e ar e on the P r oj ects pag e , whi ch is no t r elev ant t o f ol lo wi ng user s . T o pr oceed, w e need t o navi g a t e t o the pr ofi le or user sectio n wh er e f ol lo wi ng user s is po ss ib le… the lo gi c al fir s t s t ep is t o click o n the user av a t ar in the t op - righ t corner t o op en the dr op do wn menu, whi ch li k ely con t ains opti on s s uc h as "P r ofi le " or "Settin gs " wher e w e can fin d the op tio n t o f ol lo w user s . Action : click (x=957, y=2 1) 1 Reaso ning: Th e dr op do wn me nu under th e use r av a ta r has bee n succ es sfully opened , r ev eal ing op tion s such as "Se t s ta tus , " "Edit p r of ile , " "P r ef er ences , " and "Sign out. " T o pr ocee d wi th f ollo wi ng user s on Gi tL ab , I need to naviga te to th e user' s pr of ile page , w hich is t yp ic ally accessible via th e "Edit p r of ile " op tion . This wi ll allo w me to manag e f ollo wi ng sett ings . Action : click (x=849, y=1 67) 2 Reaso ning: Th e curr ent scr ee n s ho w s th e Use r S etting s page in GitL ab , sp ec ific ally th e Ed it P r ofile section. Th is is not th e cor r ect loc a tion to f ollo w user s , as f ollo wi ng user s is ty pic ally done via t he mai n n avig a tion or sear ch bar … th e ne xt logic al s te p is to naviga te b ack to th e mai n Git L ab inte r f ace whe r e w e c an sear ch f or and f ollo w user s . 
Th e sea r ch bar a t th e t op of th e scr ee n is th e mos t di r ect w ay to find user s lik e ' lahw aac z ' , 'K oushik' , and 'Vinta Chen ’ … Action : click (x=1 89 , y=2 4) 3 Reaso ning: Th e dr op do wn me nu under th e use r av a ta r has bee n e xpande d , sho w ing opt ions r ela te d to issue s and me r g e r eq ue s ts . Ho w ev er , non e of these op tion s r ela te to f ollo wi ng user s . T o f ollo w user s , w e lik ely need to naviga te to th e “User s ” section o r sear ch f or specific user s di r ectly . Since th e sea r ch bar a t th e t op is av ailable , ty ping th e use r name of one of th e user s (e .g ., ' lahw aac z ') into th e sear ch b ar mi gh t lea d us to th ei r p r of ile , whe r e w e c an f ollo w th em . Action : writ e(" lahw aac z ") 4 5 Reaso ning: Th e sear ch bar cu r r ently contains th e t erm " lahw aac z , " and th e dr op do wn sugge s ts a user named " J ak ub Klink o v sky " wi th t he username " lahw aac z . " T o pr ocee d wi th f ollo wi ng th is user , I should sele ct th is sugge s tion to naviga te to th ei r pr of ile page . Once th er e , I c an f ollo w th em . T his aligns with t he g oal of f ollo wi ng user s on GitL ab . Action : click (x=0 .1 5, y=0 .1 4) Reaso ning: Th e curr ent scr ee n s ho w s… w e need to naviga te b ack to th e mai n Git L ab inte r f ace .. Gi v en th a t w e pr eviou sly sele cte d “ J ak ub Klink o v sky , ” it s ee ms w e may hav e d evia te d fr om th e orig inal g oal. W e should r eturn to th e mai n GitL ab inte r f ace t o begin f ollo wi ng th e corr ect user s… Th e logic al ne xt s te p is to click on t he “User Se tt ings ” menu it em in th e sid ebar to collapse it, allo wi ng us to acce ss th e mai n interf ace m or e ea sily . Action : click (x=1 8, y=9 70 ) 6 Reaso ning: Th e curr ent scr ee n s ho w s th e "Edit P r of ile " page in GitL ab , which is un r ela te d to f ollo wi ng user s . 
Si nce t he task is to f ollo w specific user s (' lahw aac z ' , 'K oushik' , 'Vinta Che n'), w e need to naviga te b ack to th e ma in GitL ab inte r f ace whe r e user p r of ile s or sear ch funct ionality i s acce ssible . Th e logic al ne xt s te p is to click th e back arr o w icon loc a te d a t th e bo ttom - lef t cor ner of th e scr ee n to r eturn to th e ma in inte r f ace . Action : click (x=1 4, y=9 72 ) 7 Reaso ning: Th e g oal is to f ollo w user s on GitL ab , b ut w e ar e curr ently on the User Se tt ings page . T o pr ocee d, w e nee d to naviga te b ack to th e mai n Git L ab inte r f ace whe r e th e use r p r of ile s or sear ch funct ionality i s acce ssible . Si nce th e sid ebar has b ee n collapsed , clicki ng t he “Collapse sid ebar ” but ton mi gh t help us r eturn to th e mai n inte r f ace mor e efficiently . Action : click( 49 , 97 8) 8 Reaso ning: Th e curr ent scr ee n s ho w s th e "Edit P r of ile " page in GitL ab ' s user sett ings… I need to naviga te b ack to th e mai n Git L ab inte r f ace whe r e user p r of ile s or sear ch funct ionality i s acce ssible… Th e logic al ne xt s te p is to click on t he "Use r Se tt ings " br ead crumb link to naviga te b ack to th e mai n Git L ab inte r f ace , as t his will allo w acce ss to th e ma in nav ig a tion b ar or sear ch f unct ionality nee d ed to find and f ollo w use r s . Action : click (x=1 79 , y=6 8) 9 Reaso ning: Th e g oal is to f ollo w user s on GitL ab , b ut the cur r ent scr ee n is th e pr of ile ed iting page , w hich is unr ela te d to f ollo wi ng user s . T o pr ocee d, I need to naviga te back to th e ma in GitL ab inte r f ace whe r e th e user lis t or sear ch f unct ionality is acce ssible . Th e mos t logic al s te p is to click th e back ar r o w or use th e naviga tion menu t o r eturn to th e ma in inte r f ace . 
Action : click (x=1 5, y=9 70 ) 10 Reasoning: The g oal is… Cu rr ently , w e ar e on th e Use r Se tt ings page , which is not r ele v ant to f ollo wi ng user s . W e nee d to naviga te b ack to th e mai n Git L ab inte r f ace to acce ss th e user pr of ile s or sear ch bar t o find and f ollo w th ese user s . Since th e pr eviou s s te ps in v olv ed collapsing t he sid ebar and naviga tin g back to th e mai n inte r f ace , it' s logic al to assum e th a t w e need to r eturn to th e mai n Git L ab inte r f ace ag ain to pr ocee d wi th t he task . Action : click (x=1 78 , y=2 1) 11 Reasoning: The g oal is to f ollo w speci fic use r s on Gi tL ab: ' lahw aac z ' , 'K oushik' , and 'Vinta Che n' . Curr ently , w e ar e on the p r of ile sett ings page , which is not whe r e w e c an f ollo w user s . W e nee d to naviga te b ack to th e mai n Git L ab inte r f ace or th e user dir ect or y to find t hese user s and f ollo w th em . Si nce th e pr eviou s s te ps in v olv ed collapsing t he sid ebar and navig a tin g back to th e ma in inte r f ace , it' s logic al to assum e w e need to r eturn to th e mai n GitL ab inte r f ace ag ain to acce ss t he user sear ch o r pr of ile section. Action : click (x=1 10 , y=6 5) 12 Reaso ning: Th e g oal is … Curr ently , w e ar e on th e P r of ile Se tt ings page , w hich is not r ele v ant to f ollo wi ng user s . W e nee d to naviga te b ack to th e mai n Git L ab inte r f ace whe r e use r p r of ile s or sear ch fun ction ality i s accessible… Ho w ev er , th e curr ent scr ee n doe s not sho w an y d ir ec t w ay to se ar ch f or or f ollo w user s . Giv en th is conte xt, th e ne xt s te p should be to click on the "Sea r ch GitL ab " b ar a t th e top of th e scr ee n to initia te a sear ch f or th e user s w e w ant to f ollo w . Action : click (x=1 87 , y=2 1) 13 Reaso ning: Th e g oal is to f ollo w specific user s on Gi tL ab: ' lahw aac z ' , 'K oushik' , and 'Vinta Che n' . 
Th e curr ent scr ee n s ho w s th e "S ear ch GitL ab " b ar , which is th e app r op r ia te place to begin sear ching f or th ese user s . Since th e sea r ch bar i s alr ead y se le ct ed , th e ne xt logic al s te p is to ty pe th e fir s t username into th e sear ch b ar t o loc a te th em . Action : writ e(" lahw aac z ") 14 Reaso ning: Th e curr ent scr ee n s ho w s th e sear ch bar w ith " lahw aac z " ent er ed , and th e sear ch r esul t f or " J ak ub Klink o v sky " is visible . Si nce th e g oal is to f ollo w user s includ ing " lahw aac z , " and w e hav e al r ea d y se le ct ed " J ak ub Klink o v sky , " th e ne xt logic al s te p is to pr oce ed w ith f ollo w ing th is use r . Ho w ev er , since th e ta sk r eq uir es f ollo wi ng th r ee user s , w e nee d to ensur e w e f ollo w " lahw aac z " as w ell. Giv en th e curr ent s ta te , cl icki ng on the " J ak ub Klink o v sky " user wil l lik ely ta k e us to th ei r p r of ile page , w her e w e c an find a "F ollo w" but ton… Action : click( 14 8, 14 1) 15 Figure 17 T ra jectory Example of Qwen3-VL-4B-Instruct for W ebArena-Lite-v2 T ask: F ol low [’lahwaacz’, ’Koushik’, ’Vinta Chen ’] on Gitlab. 57
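The trajectories above print their actions in two JSON variants: the mobile episodes (Figure 15) key locations with "coordinates", while the web episodes (Figure 16) use "point_2d". As a minimal sketch of how such heterogeneous action strings could be normalized for side-by-side comparison, the helper below flattens both variants into one record. Only the field names ("action_type", "element", "value", "app_name", "coordinates", "point_2d") are taken from the figures; the unified output schema is our own illustrative choice, not an interface defined by the paper.

```python
import json

def normalize_action(raw: str) -> dict:
    """Parse one JSON-formatted action string, as printed in the
    trajectory figures, into a flat {type, target, x, y} record.

    Hypothetical helper: field names follow the figures' examples;
    the output schema is illustrative only.
    """
    a = json.loads(raw)
    # Mobile-style actions (Figure 15) carry "coordinates";
    # web-style actions (Figure 16) carry "point_2d".
    point = a.get("coordinates") or a.get("point_2d")
    # The acted-on object may appear as "element" (mobile),
    # "app_name" (open_app), or "value" (typed text / answer).
    target = a.get("element") or a.get("app_name") or a.get("value")
    return {
        "type": a["action_type"].lower(),
        "target": target,
        "x": point[0] if point else None,
        "y": point[1] if point else None,
    }
```

For instance, normalize_action('{"action_type": "click", "element": "DELETE", "coordinates": [978, 1664]}') yields {"type": "click", "target": "DELETE", "x": 978, "y": 1664}, and an open_app action with no coordinates maps x and y to None.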