Learning with Challenges: Adaptive Difficulty-Aware Data Generation for Mobile GUI Agent Training

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Large-scale, high-quality interaction trajectories are essential for advancing mobile Graphical User Interface (GUI) agents. While existing methods typically rely on labor-intensive human demonstrations or automated model exploration to generate GUI trajectories, they lack fine-grained control over task difficulty. This fundamentally restricts learning effectiveness due to the mismatch between the training difficulty and the agent’s capabilities. Inspired by how humans acquire skills through progressively challenging tasks, we propose MobileGen, a novel data generation framework that adaptively aligns training difficulty with the GUI agent’s capability frontier. Specifically, MobileGen explicitly decouples task difficulty into structural (e.g., trajectory length) and semantic (e.g., task goal) dimensions. It then iteratively evaluates the agent on a curated prior dataset to construct a systematic profile of its capability frontier across these two dimensions. With this profile, the probability distribution of task difficulty is adaptively computed, from which the target difficulty for the next round of training can be sampled. Guided by the sampled difficulty, a multi-agent controllable generator is finally used to synthesize high-quality interaction trajectories along with corresponding task instructions. Extensive experiments show that MobileGen consistently outperforms existing data generation methods by improving the average performance of GUI agents by 1.57 times across multiple challenging benchmarks. This highlights the importance of capability-aligned data generation for effective mobile GUI agent training.


💡 Research Summary

The paper addresses a critical bottleneck in training mobile Graphical User Interface (GUI) agents: the lack of fine‑grained control over the difficulty of generated interaction trajectories. Existing approaches either rely on costly human demonstrations or on model‑driven exploration that produces data without regard to the agent’s current capabilities, leading to a mismatch between training difficulty and what the agent can actually learn from. Inspired by human curriculum learning—progressively tackling more challenging tasks—the authors propose MobileGen, an adaptive data generation framework that aligns task difficulty with the agent’s capability frontier.
MobileGen first decouples trajectory difficulty into two orthogonal dimensions: (1) Structural difficulty, captured by Depth of Trajectory (DoT, the number of steps) and Breadth of Trajectory (BoT, the number of distinct apps visited); and (2) Semantic difficulty, captured by Interaction Control Difficulty (ICD) and Instruction Understanding Difficulty (IUD), each discretized into easy, medium, and hard. This decomposition enables precise, multi‑dimensional difficulty control that prior work lacks.
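The four-dimensional difficulty label can be pictured as a small record type. The sketch below is illustrative only: the class name, field names, and the `is_structurally_harder_than` comparison are assumptions, not definitions from the paper.

```python
from dataclasses import dataclass
from enum import Enum

class Level(Enum):
    """Discretized semantic difficulty, as described in the paper."""
    EASY = 0
    MEDIUM = 1
    HARD = 2

@dataclass(frozen=True)
class DifficultyTuple:
    dot: int     # Depth of Trajectory: number of steps
    bot: int     # Breadth of Trajectory: number of distinct apps visited
    icd: Level   # Interaction Control Difficulty
    iud: Level   # Instruction Understanding Difficulty

    def is_structurally_harder_than(self, other: "DifficultyTuple") -> bool:
        # One plausible partial order: harder means at least as deep AND as broad.
        return self.dot >= other.dot and self.bot >= other.bot

t = DifficultyTuple(dot=8, bot=2, icd=Level.MEDIUM, iud=Level.HARD)
```

Keeping the structural axes as integers and the semantic axes as discrete levels mirrors the paper's split between counting-based and label-based difficulty.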
The pipeline consists of three stages. In the Agent Capability Profiling stage, a "student" agent (the model to be trained) is evaluated on a curated prior dataset T_p that spans a wide range of structural and semantic complexities. Dual-level metrics are then computed: on the structural side, the maximum reliable DoT (C_d) and maximum reliable BoT (C_b), together with per-app vulnerability scores V_i; on the semantic side, interaction-control capability (C_int) and instruction-understanding capability (C_ins), derived by weighting successful actions with their difficulty labels. Pass@K evaluation and SoM-annotated screenshots are used to match predicted actions against the ground truth accurately.
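The summary does not spell out how "max reliable DoT" is aggregated from episode outcomes; a natural reading is the largest depth at which the pass@K success rate stays above a threshold. The helper below is a sketch under that assumption, with a made-up `results` format of (depth, passed) pairs.

```python
from collections import defaultdict

def max_reliable_depth(results, threshold=0.8):
    """Estimate C_d: the largest trajectory depth (DoT) at which the
    student agent still succeeds reliably on the prior dataset T_p.

    `results` is a list of (dot, passed) pairs, one per pass@K
    evaluation episode. The 0.8 threshold is an assumed value.
    """
    by_depth = defaultdict(list)
    for dot, passed in results:
        by_depth[dot].append(passed)
    reliable = [d for d, outcomes in by_depth.items()
                if sum(outcomes) / len(outcomes) >= threshold]
    return max(reliable) if reliable else 0

episodes = [(3, True), (3, True), (5, True), (5, True),
            (8, True), (8, False), (12, False), (12, False)]
print(max_reliable_depth(episodes))  # → 5: depths 3 and 5 pass fully, 8 only 50%
```

The same aggregation would apply to breadth (C_b), substituting the number of distinct apps for the step count.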
From these statistics, MobileGen defines an α-guided challenge point for each capability dimension, C* = C / (1 + α·η), where α > 0 controls overall aggressiveness and η is a dimension-specific scaling factor. As the agent improves, the measured capability C rises, and by adjusting α the system shifts the target difficulty upward accordingly. For each dimension, a discrete probability distribution centered on the challenge point is constructed, and a specific difficulty tuple (DoT, BoT, ICD, IUD) is sampled from it.
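The challenge-point formula is given in the summary, but the exact shape of the distribution around it is not; the sketch below assumes a simple exponential (Laplace-style) kernel over candidate difficulty values, with a `temperature` parameter that is our own addition.

```python
import math
import random

def challenge_point(capability: float, alpha: float, eta: float) -> float:
    """C* = C / (1 + alpha * eta), as stated in the paper summary."""
    return capability / (1.0 + alpha * eta)

def difficulty_distribution(candidates, c_star, temperature=1.0):
    """Discrete distribution over candidate difficulty values, peaked
    at the challenge point C*. The kernel choice is an assumption."""
    weights = [math.exp(-abs(c - c_star) / temperature) for c in candidates]
    total = sum(weights)
    return [w / total for w in weights]

# Example: measured max reliable depth C = 6, alpha = 0.5, eta = 0.4
c_star = challenge_point(6.0, alpha=0.5, eta=0.4)   # 6 / 1.2 = 5.0
candidates = list(range(1, 11))                     # candidate DoT values
probs = difficulty_distribution(candidates, c_star)
target_dot = random.choices(candidates, weights=probs)[0]
```

Sampling (rather than always picking C* itself) keeps some mass on easier and harder tasks, which matches the curriculum-style behavior the paper describes.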
The sampled difficulty parameters drive the Multi‑agent Controllable Generator (MCG). MCG comprises two cooperating agents: an Explorer, which generates a raw interaction sequence satisfying the structural constraints, and a Supervisor, which verifies and refines the sequence to meet the semantic constraints. The two agents operate in parallel, producing high‑quality trajectories at scale. An inverse‑synthesis step then reconstructs natural‑language instructions that correspond to the generated action sequences, ensuring a paired dataset of trajectories and task descriptions.
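The Explorer/Supervisor split can be illustrated with a deliberately simplified sketch: here both roles are plain functions rather than LLM-driven agents, the action vocabulary and app names are invented, and only the structural constraints (DoT and BoT) are checked.

```python
import random

def explorer(dot, bot, apps=("Mail", "Browser", "Calendar", "Notes")):
    """Explorer: produce a raw action sequence satisfying the structural
    constraints: exactly `dot` steps spanning `bot` distinct apps."""
    chosen = random.sample(apps, bot)
    steps = []
    for i in range(dot):
        app = chosen[i % bot]  # cycle through apps so every one is visited
        steps.append({"app": app, "action": f"tap_element_{i}"})
    return steps

def supervisor(steps, dot, bot):
    """Supervisor: verify the sequence against the sampled difficulty
    before it is kept and paired with an inverse-synthesized instruction."""
    ok_depth = len(steps) == dot
    ok_breadth = len({s["app"] for s in steps}) == bot
    return ok_depth and ok_breadth

traj = explorer(dot=6, bot=2)
assert supervisor(traj, dot=6, bot=2)
```

In the real MCG, the Supervisor would also check the semantic constraints (ICD, IUD) and trigger refinement rather than simple rejection; that feedback loop is omitted here.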
Extensive experiments on several public mobile GUI benchmarks (e.g., Android‑Suite, Kuaishou‑Tasks) and on custom high‑difficulty scenarios demonstrate that models trained with MobileGen‑generated data consistently outperform those trained on human demonstrations or on prior model‑exploration data. The average performance gain is 1.57×, with particularly large improvements (over 2×) on long‑horizon, multi‑app tasks. Moreover, the curriculum‑style difficulty adaptation accelerates convergence, reducing the number of training steps needed for a given accuracy by roughly 30 %.
The authors discuss several implications: (1) MobileGen introduces a curriculum learning paradigm for GUI agents, something previously explored mainly in language or robotics domains; (2) the dual‑dimensional difficulty model provides a template for fine‑grained data control in other high‑dimensional interaction spaces; (3) the framework is scalable, as the MCG can generate thousands of trajectories in parallel without human oversight. Future work is suggested in (a) refining difficulty metrics (e.g., incorporating UI complexity measures), (b) personalizing curricula to individual agents or user preferences, (c) extending the approach to other domains such as web automation or embodied robot manipulation, and (d) automating the selection of (\alpha) and (\eta) via meta‑learning.
In summary, MobileGen demonstrates that aligning training data difficulty with an agent’s evolving capabilities dramatically improves learning efficiency and final performance. By systematically profiling capabilities, dynamically shaping difficulty distributions, and leveraging a controllable multi‑agent generator, the paper sets a new standard for data‑centric training of mobile GUI agents, offering a cost‑effective path toward large‑scale, high‑quality interaction datasets.

