The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Tool use enables large language models (LLMs) to access external information, invoke software systems, and act in digital environments beyond what can be solved from model parameters alone. Early research mainly studied whether a model could select and execute a correct single tool call. As agent systems evolve, however, the central problem has shifted from isolated invocation to multi-tool orchestration over long trajectories with intermediate state, execution feedback, changing environments, and practical constraints such as safety, cost, and verifiability. We comprehensively review recent progress in multi-tool LLM agents and analyze the state of the art in this rapidly developing area. First, we unify task formulations and distinguish single-call tool use from long-horizon orchestration. Then, we organize the literature around six core dimensions: inference-time planning and execution, training and trajectory construction, safety and control, efficiency under resource constraints, capability completeness in open environments, and benchmark design and evaluation. We further summarize representative applications in software engineering, enterprise workflows, graphical user interfaces, and mobile systems. Finally, we discuss major challenges and outline future directions for building reliable, scalable, and verifiable multi-tool agents.


💡 Research Summary

The paper provides a comprehensive survey of the rapid evolution of tool‑augmented large language model (LLM) agents, moving from the early focus on single‑tool calls to the current challenge of long‑horizon multi‑tool orchestration. The authors first formalize the problem space, distinguishing the simple “single‑call” setting—where a model selects at most one tool, issues a request, receives a response, and produces a final answer—from the more complex multi‑tool scenario in which the agent repeatedly selects actions from a large inventory of tools, receives feedback, updates an internal state, and decides when to terminate. The multi‑tool formulation is expressed as a sequential decision process with history hₜ, memory mₜ, and environment state sₜ, and the objective balances task success against a cost functional that can include latency, API fees, and safety risks.
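The sequential decision process described above can be sketched as a simple rollout loop. This is a minimal illustration of the formulation, not code from the paper: the `policy` and `environment` interfaces, the `Step` record, and the budget check are all hypothetical names standing in for hₜ (history), mₜ (memory), and the cost functional.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    tool: str          # tool name selected at step t
    request: dict      # arguments sent to the tool
    response: str      # feedback returned by the environment
    cost: float        # latency / API-fee / risk charge for this call

@dataclass
class AgentState:
    history: list = field(default_factory=list)   # h_t: past calls and feedback
    memory: dict = field(default_factory=dict)    # m_t: persistent internal state

def run_episode(policy, environment, budget: float, max_steps: int = 20):
    """Roll out the multi-tool decision process.

    `policy` maps (history, memory) to either a (tool, request) pair or
    None, which signals termination; `environment` executes a call and
    returns (response, cost). Both are illustrative interfaces only.
    """
    state = AgentState()
    spent = 0.0
    for _ in range(max_steps):
        action = policy(state.history, state.memory)
        if action is None:                  # agent decides to terminate
            break
        tool, request = action
        response, cost = environment(tool, request)
        spent += cost
        state.history.append(Step(tool, request, response, cost))
        if spent > budget:                  # cost functional exhausted
            break
    return state, spent
```

The objective the survey describes then becomes: maximize task success of the final answer subject to keeping `spent` (latency, fees, risk penalties) within budget.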

The survey is organized around six inter‑related dimensions that together capture the full research landscape:

  1. Inference‑time Planning and Execution – The authors review the shift from linear, “chain‑of‑thought” style planning (e.g., ReAct) to topology‑aware approaches that explicitly model tool dependencies as graphs or AND/OR trees. Notable systems include GAP (Graph‑based Agent Planning), ToolNet, StructuredAgent, and recent graph‑enhanced LLMs that enable parallel execution of independent sub‑tasks. Self‑reflection and self‑improvement mechanisms (Reflexion, SPIRAL, MetaAgent) are highlighted for their ability to detect and correct errors during execution, while memory‑augmented agents (MemGPT, LLMCompiler) provide long‑term state persistence.

  2. Training and Trajectory Construction – The paper categorizes methods into (a) training‑free approaches that rely on prompt engineering and tool schema exposure (Toolformer, MCP‑Zero), (b) synthetic trajectory generation pipelines that automatically produce multi‑step tool usage data (Seal‑Tools, BUTTON, APIGen), (c) supervised fine‑tuning with real API logs (Gorilla, Hammer, Chain‑of‑Abstraction), and (d) reinforcement learning frameworks that incorporate cost‑aware reward signals (Port‑Tool, ToolRL, DeepAgent). The authors stress that cost‑aware objectives must consider not only success metrics but also invocation fees, latency budgets, and risk penalties.

  3. Safety and Control – Two sub‑areas are examined: safety in parallel execution (AARM, SagaLLM, Atomix) which addresses state consistency, race conditions, and transactional guarantees; and safety in sequential chains (MINJA, Butterfly Effects, LATS) which mitigates prompt injection, privacy leaks, and malicious API usage. Many works embed verification loops that automatically roll back or retry upon detecting anomalous feedback.

  4. Efficiency under Resource Constraints – Strategies for reducing latency and invocation cost are surveyed, including pre‑execution prediction (SoT, LLMCompiler), asynchronous scheduling (MACI), and budget‑aware selection (AnyTool, FrugalGPT). Memory‑based caching (MemGPT) and dynamic tool‑call pruning further lower inference overhead, while cost‑regularized training encourages agents to prefer cheaper tool sequences.

  5. Capability Completeness in Open Environments – The authors discuss mechanisms for (i) recognizing when a required capability lies outside the current tool set (Fail‑TALMS, ToolHaystack), (ii) autonomously expanding the tool inventory through schema generation and code synthesis (LAT‑M, CREATOR, ToolMaker), and (iii) adapting to dynamic, open‑world settings where APIs, UI elements, or data sources evolve (Voyager, ExpeL, AppAgent). These capabilities are crucial for real‑world deployments in enterprise systems and mobile applications.

  6. Benchmark Design and Evaluation – A taxonomy of evaluation suites is presented, covering (a) topological complexity (NESTFUL, ToolHop, TaskBench), (b) temporal scale (Tool Decathlon, UltraHorizon, AgentLongBench), (c) dynamic environments (ToolSandbox, OSWorld, Windows Agent Arena), and (d) state persistence and self‑correction (OdysseyBench, MemAgentBench, CRITICTOOL). The authors argue that future benchmarks must move beyond single‑call correctness to holistic metrics that capture robustness, cost efficiency, and adaptability.
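To make the topology-aware planning idea from dimension 1 concrete: planners that model tool dependencies as a graph can schedule independent calls concurrently by grouping the DAG into execution waves. The sketch below is a hypothetical helper illustrating that scheduling step, not an API from GAP, LLMCompiler, or any other surveyed system.

```python
from collections import defaultdict

def parallel_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group tool calls into waves that can run concurrently.

    `dependencies` maps each tool call to the set of calls whose outputs
    it consumes. Calls within one wave share no dependency path, so a
    scheduler may dispatch them in parallel.
    """
    indegree = {node: len(deps) for node, deps in dependencies.items()}
    dependents = defaultdict(list)
    for node, deps in dependencies.items():
        for dep in deps:
            dependents[dep].append(node)

    wave = [n for n, d in indegree.items() if d == 0]   # calls with no inputs
    waves = []
    while wave:
        waves.append(sorted(wave))
        next_wave = []
        for node in wave:
            for child in dependents[node]:
                indegree[child] -= 1
                if indegree[child] == 0:    # all inputs now available
                    next_wave.append(child)
        wave = next_wave
    return waves
```

For example, two independent searches feeding one booking step yield two waves: the searches run in parallel, then the booking call runs once both results are available.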

The survey also covers representative applications: (i) software engineering tools such as CodeTool and HuggingGPT for code generation and debugging; (ii) enterprise workflow automation where agents query databases, generate reports, and orchestrate approvals; (iii) graphical user interface automation using graph‑based planners for web browsing and form filling; and (iv) mobile system control where agents interact with native apps via learned tool wrappers. In each domain, the need for multi‑tool orchestration, state management, and cost‑aware planning is emphasized.

Finally, the authors outline open challenges and future directions: (1) Reliability – developing formal verification and confidence scoring for tool outputs; (2) Scalability – efficient indexing and retrieval mechanisms for thousands of tools; (3) Auditable Execution – transparent logging and visualization of the decision‑making pipeline; (4) Multi‑Agent Collaboration – protocols for multiple LLM agents to share tool capabilities and coordinate tasks; and (5) Standardization – unified schemas and feedback formats for tools to promote interoperability. By mapping the field across these dimensions, the paper provides a roadmap for building LLM agents that are not only capable of invoking tools but can reliably orchestrate complex toolchains at scale, paving the way toward truly autonomous AI assistants.
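The standardization direction can be made concrete with a unified tool schema. The example below is loosely modeled on common function-calling formats; the field names, the `feedback` envelope, and the `validate_call` helper are illustrative assumptions, not a standard proposed by the survey.

```python
# A hypothetical unified tool schema: declared parameters plus a
# standardized feedback envelope, so agents can parse any tool's
# response the same way. Field names here are illustrative only.
WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Return the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
    "feedback": {
        "type": "object",
        "properties": {
            "status": {"type": "string", "enum": ["ok", "error"]},
            "payload": {"type": "string"},
        },
    },
}

def validate_call(schema: dict, arguments: dict) -> bool:
    """Minimal structural check: every required parameter is present and
    no unknown parameter is supplied. A real tool registry would run a
    full JSON Schema validator instead."""
    params = schema["parameters"]
    if not set(params.get("required", [])) <= set(arguments):
        return False
    return set(arguments) <= set(params["properties"])
```

A shared schema like this is what lets retrieval, validation, and logging work uniformly across thousands of tools, which ties the standardization challenge back to the scalability and auditability challenges above.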

