Curiosity Driven Knowledge Retrieval for Mobile Agents

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Mobile agents have made progress toward reliable smartphone automation, yet performance in complex applications remains limited by incomplete knowledge and weak generalization to unseen environments. We introduce a curiosity driven knowledge retrieval framework that formalizes uncertainty during execution as a curiosity score. When this score exceeds a threshold, the system retrieves external information from documentation, code repositories, and historical trajectories. Retrieved content is organized into structured AppCards, which encode functional semantics, parameter conventions, interface mappings, and interaction patterns. During execution, an enhanced agent selectively integrates relevant AppCards into its reasoning process, thereby compensating for knowledge blind spots and improving planning reliability. Evaluation on the AndroidWorld benchmark shows consistent improvements across backbones, with an average gain of six percentage points and a new state of the art success rate of 88.8% when combined with GPT-5. Analysis indicates that AppCards are particularly effective for multi step and cross application tasks, while improvements depend on the backbone model. Case studies further confirm that AppCards reduce ambiguity, shorten exploration, and support stable execution trajectories. Task trajectories are publicly available at https://lisalsj.github.io/Droidrun-appcard/.


💡 Research Summary

The paper introduces a curiosity‑driven knowledge retrieval (C‑KRR) framework designed to boost the reliability of mobile agents that automate smartphone tasks. Mobile agents, powered by large multimodal language models, translate natural‑language instructions into UI actions across installed apps. However, when confronted with complex or previously unseen applications, they often lack precise functional knowledge, leading to planning errors, incorrect API calls, and reduced task success.

To address this, the authors formalize a “curiosity score” that quantifies epistemic uncertainty during execution. Before each step, the agent predicts the next UI state as a prior token distribution P; after observing the actual next state, it computes a posterior distribution Q. Both distributions are reduced to their top‑K token probabilities plus a residual “OTHER” bucket, and a tail‑adjusted Jensen‑Shannon (JS*) divergence between them serves as the information‑gain estimate for that step. A discounted sum of these divergences across a task episode yields a cumulative uncertainty U(app). When U(app) exceeds a predefined threshold τ, a “curiosity gate” triggers external knowledge retrieval.
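The scoring step can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses the standard Jensen‑Shannon divergence over a shared top‑K support plus an OTHER bucket, and it does not reproduce the paper's exact tail adjustment. The function names, the base‑2 logarithm, the discount factor `gamma`, and the direction of discounting are all assumptions made for illustration.

```python
import math

def align(p, q, top_k):
    """Project two token->probability dicts onto a shared support.

    The support is the union of each distribution's top_k tokens; all
    remaining probability mass is folded into a final OTHER bucket.
    """
    top = lambda d: sorted(d, key=d.get, reverse=True)[:top_k]
    support = sorted(set(top(p)) | set(top(q)))

    def project(d):
        probs = [d.get(tok, 0.0) for tok in support]
        probs.append(max(0.0, 1.0 - sum(probs)))  # OTHER bucket
        return probs

    return project(p), project(q)

def js_divergence(p, q):
    """Standard Jensen-Shannon divergence in bits (range 0 to 1)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cumulative_uncertainty(step_divergences, gamma=0.9):
    """Discounted sum of per-step divergences (assumed form of U(app))."""
    return sum((gamma ** t) * d for t, d in enumerate(step_divergences))

def curiosity_gate(uncertainty, tau):
    """Open the gate (trigger retrieval) when U(app) exceeds tau."""
    return uncertainty > tau
```

In this sketch, a prior and posterior that agree exactly give a divergence of 0, while two distributions with no shared tokens give the maximum of 1 bit, so a few badly mispredicted steps are enough to push the discounted sum past a threshold near 1.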

The retrieval component queries three heterogeneous sources: (1) online documentation (developer manuals, API guides), (2) source‑code repositories (function definitions, version histories), and (3) historical execution trajectories. Retrieved text fragments are embedded, clustered, and organized into a structured knowledge artifact called an AppCard. An AppCard encodes four key facets of an application: functional semantics, parameter conventions, UI element mappings, and typical interaction patterns. Crucially, AppCards are version‑aware and modular; only the fragments relevant to the current step are injected into the language model’s prompt, preserving the context‑window budget and avoiding spurious associations.
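One way to picture an AppCard is as a versioned container of tagged fragments from which only step‑relevant pieces are selected. The sketch below is a hypothetical data structure, not the paper's schema: the class and field names, the facet identifiers, and the character budget are assumptions, and a simple word‑overlap score stands in for the embedding‑based organization described above.

```python
from dataclasses import dataclass, field

# Hypothetical identifiers for the four facets named in the summary.
FACETS = (
    "functional_semantics",
    "parameter_conventions",
    "ui_element_mappings",
    "interaction_patterns",
)

@dataclass
class Fragment:
    facet: str   # one of FACETS
    text: str    # the retrieved knowledge snippet
    source: str  # "docs", "code", or "trajectory"

@dataclass
class AppCard:
    app: str
    version: str
    fragments: list[Fragment] = field(default_factory=list)

    def select(self, step_description, max_chars=400):
        """Return fragments relevant to the current step, within a size budget.

        Relevance here is plain word overlap, a stand-in for embedding
        similarity; fragments with no overlap are never injected.
        """
        query = set(step_description.lower().split())
        scored = []
        for frag in self.fragments:
            overlap = len(query & set(frag.text.lower().split()))
            if overlap:
                scored.append((overlap, frag))
        scored.sort(key=lambda pair: pair[0], reverse=True)

        chosen, used = [], 0
        for _, frag in scored:
            if used + len(frag.text) > max_chars:
                continue  # skip fragments that would exceed the budget
            chosen.append(frag)
            used += len(frag.text)
        return chosen
```

Keeping the version on the card and filtering per step mirrors the two properties the summary emphasizes: version awareness and modular injection that protects the context‑window budget.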

The execution pipeline therefore proceeds as follows: (1) compute the curiosity score for the current state‑action pair; (2) if U(app) > τ, invoke the retrieval module; (3) construct or update the corresponding AppCard; (4) feed the relevant AppCard content into the planner’s prompt; (5) generate the next action; and repeat. This loop allows the agent to dynamically augment its parametric knowledge with up‑to‑date external information, effectively “learning on the fly.”
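The loop described above can be sketched as follows. This is an illustrative skeleton under stated assumptions, not the authors' pipeline: `plan`, `step`, and `retrieve` are hypothetical callables supplied by the caller, the card is reduced to a text string, and retrieval happens only once per episode for simplicity, whereas the summary also allows an existing AppCard to be updated.

```python
def run_episode(plan, step, retrieve, tau, gamma=0.9, max_steps=10):
    """Run one task episode with a curiosity-gated retrieval step.

    plan(card_text) -> next action, or None when the task is finished
    step(action)    -> divergence observed after executing the action
    retrieve()      -> AppCard content as text
    Returns a log of (action, card_was_available) pairs.
    """
    uncertainty = 0.0  # running value of U(app)
    card_text = ""     # empty until the curiosity gate opens
    log = []

    for t in range(max_steps):
        # Gate: fetch external knowledge once uncertainty exceeds tau.
        if uncertainty > tau and not card_text:
            card_text = retrieve()

        # Plan the next action, with card content if any was retrieved.
        action = plan(card_text)
        if action is None:
            break

        # Execute, then fold the observed divergence into U(app).
        divergence = step(action)
        uncertainty += (gamma ** t) * divergence
        log.append((action, bool(card_text)))

    return log
```

With a threshold of 1.0, no discounting, and a constant divergence of 0.6 per step, the first two actions in this sketch are planned without a card and the third is planned with one, since the accumulated uncertainty first exceeds the threshold after the second step.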

Empirical evaluation is conducted on the AndroidWorld benchmark, which comprises 116 programmatic tasks across 20 real‑world Android applications. The authors test several backbone models, including LLaMA‑2‑7B, GPT‑4, and GPT‑5, both with and without C‑KRR. Across backbones, the curiosity‑driven augmentation yields an average success‑rate increase of roughly six percentage points, although the size of the gain depends on the backbone. When combined with GPT‑5, the system achieves a new state‑of‑the‑art success rate of 88.8%. Ablation studies show that performance degrades both when the curiosity trigger is removed (i.e., retrieval always runs) and when retrieved knowledge is not structured into AppCards, indicating that uncertainty‑guided retrieval and modular knowledge representation each contribute to the result.

Qualitative case studies highlight three practical benefits. First, AppCards disambiguate vague UI labels (e.g., “Enter location”) by providing concrete parameter names, reducing mis‑parameterization. Second, they shorten exploration loops; in a “send email with attachment” scenario, the agent’s trajectory length drops by about 30 % because the relevant file‑picker workflow is supplied directly by the AppCard. Third, version‑aware AppCards maintain stable execution even after UI redesigns, as the agent can map new widget identifiers to the same functional semantics.

The authors discuss limitations and future directions. Current curiosity estimation relies solely on textual token distributions; extending it to multimodal signals (visual embeddings, speech) could capture richer uncertainties. Privacy‑preserving retrieval mechanisms are needed to safely query proprietary documentation. Adaptive thresholding for τ, possibly learned per‑application, may further refine when to invoke external knowledge. Finally, a continuous, automated pipeline for updating AppCards as apps evolve would make the system truly lifelong‑learning.

In summary, the C‑KRR framework demonstrates that integrating a lightweight, uncertainty‑driven trigger with structured external knowledge (AppCards) can substantially improve the planning reliability, generalization, and efficiency of mobile agents operating in complex, dynamic smartphone environments. This work paves the way for more robust, adaptable automation systems that can bridge the gap between static model knowledge and the ever‑changing landscape of mobile applications.

