AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, limiting LLM agents’ adaptability to new or evolving toolsets. We present AutoTool, a framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories. We first construct a 200k dataset with explicit tool-selection rationales across 1,000+ tools and 100+ tasks spanning mathematics, science, code generation, and multimodal reasoning. Building on this data foundation, AutoTool employs a dual-phase optimization pipeline: (i) supervised and RL-based trajectory stabilization for coherent reasoning, and (ii) KL-regularized Plackett-Luce ranking to refine consistent multi-step tool selection. Across ten diverse benchmarks, we train two base models, Qwen3-8B and Qwen2.5-VL-7B, with AutoTool. With fewer parameters, AutoTool consistently outperforms advanced LLM agents and tool-integration methods, yielding average gains of 6.4% in math & science reasoning, 4.5% in search-based QA, 7.7% in code generation, and 6.9% in multimodal understanding. In addition, AutoTool exhibits stronger generalization by dynamically leveraging unseen tools from evolving toolsets during inference.


💡 Research Summary

AutoTool introduces a novel framework that equips large language model (LLM) agents with the ability to dynamically select and integrate external tools during reasoning, addressing a key limitation of prior agentic reinforcement learning approaches that assume a static, pre‑defined tool inventory. The authors first construct a massive 200k‑instance dataset that explicitly incorporates tool‑selection rationales. This dataset spans more than 1,000 distinct tools—including code interpreters, web search APIs, and image‑processing modules—and covers over 100 diverse tasks across mathematics, scientific reasoning, code generation, and multimodal understanding. Each data point contains a full chain‑of‑thought (CoT) trajectory, a natural‑language justification for why a particular tool is chosen at a given step, and the subsequent tool‑integration output, all curated through a pipeline that leverages expert models (DeepSeek‑R1) for rationale generation and LLM‑as‑a‑judge for quality filtering.
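To make the dataset description concrete, the sketch below shows what a single training instance might look like. The field names, task content, and check at the end are illustrative assumptions, not the paper's released schema; the key property from the text is that each trajectory carries an explicit tool‑selection rationale alongside the tool call and its output.

```python
# Hypothetical AutoTool training instance; all field names are assumptions
# made for illustration, not the paper's actual data format.
example = {
    "task": "math",
    "question": "Evaluate the integral of x * exp(x) from 0 to 1.",
    "trajectory": [
        {"type": "thought",
         "text": "This integral is best solved symbolically, not by search."},
        {"type": "tool_selection",
         "rationale": "A symbolic math engine gives an exact antiderivative, "
                      "so a code interpreter is preferable to a web search.",
         "tool": "code_interpreter"},
        {"type": "tool_call",
         "input": "import sympy as sp; x = sp.symbols('x'); "
                  "print(sp.integrate(x * sp.exp(x), (x, 0, 1)))"},
        {"type": "tool_output", "text": "1"},
        {"type": "answer", "text": "1"},
    ],
}

# A curation pipeline might run simple structural checks like this one
# before handing instances to an LLM-as-a-judge for quality filtering.
assert any(step["type"] == "tool_selection" and step["rationale"]
           for step in example["trajectory"])
```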

Training proceeds in two phases. Phase I stabilizes long‑form reasoning by first applying supervised fine‑tuning (SFT) and then reinforcement learning (PPO) to align the policy with coherent CoT generation and correct tool‑integration behavior. Phase II focuses exclusively on tool selection. The authors cast each tool‑selection step as a Plackett‑Luce (PL) ranking problem: every tool is embedded together with its metadata using the LLM’s internal embedding layer, producing a set of tool embeddings E_T. When the model reaches a selection step, it first generates a selection rationale s_i and predicts an embedding e′_i. The probability of choosing tool t_k is then proportional to exp(−γ‖e′_i − e_{t_k}‖²), effectively turning tool choice into a distance‑based softmax over the embedding space. A KL‑regularized cross‑entropy loss aligns the model’s predicted distribution with the PL ranking derived from the annotated rationales, encouraging the policy to prefer higher‑ranked tools while preserving the overall language modeling objective.
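The distance‑based softmax and the Plackett‑Luce likelihood described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions (toy embeddings, a hand‑picked γ), not the paper's implementation, and it omits the KL regularizer against the base language model.

```python
import numpy as np

def tool_selection_probs(e_pred, tool_embs, gamma=1.0):
    """Distance-based softmax over tool embeddings:
    p(t_k) proportional to exp(-gamma * ||e_pred - e_{t_k}||^2)."""
    d2 = np.sum((tool_embs - e_pred) ** 2, axis=1)  # squared distances
    logits = -gamma * d2
    logits -= logits.max()                          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def plackett_luce_nll(probs, ranking):
    """Negative log-likelihood of a full ranking under the Plackett-Luce
    model: at each position, renormalize over the tools not yet chosen."""
    remaining = list(range(len(probs)))
    nll = 0.0
    for k in ranking:
        mass = sum(probs[j] for j in remaining)
        nll -= np.log(probs[k] / mass)
        remaining.remove(k)
    return nll

# Toy example: the predicted embedding sits closest to tool 0,
# so tool 0 receives the highest selection probability.
e_pred = np.array([0.0, 0.0])
tool_embs = np.array([[0.1, 0.0],    # tool 0: nearby
                      [2.0, 2.0],    # tool 1: far
                      [0.0, 3.0]])   # tool 2: far
probs = tool_selection_probs(e_pred, tool_embs)
loss = plackett_luce_nll(probs, ranking=[0, 2, 1])
```

Minimizing the PL negative log‑likelihood pushes the predicted embedding toward higher‑ranked tools; the KL term in the paper would additionally keep the policy close to the pretrained language model.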

The framework is evaluated on ten benchmarks spanning four major domains: (1) math and science reasoning, (2) search‑based question answering, (3) code generation, and (4) multimodal understanding. Two backbone models—Qwen3‑8B and the multimodal Qwen2.5‑VL‑7B—are fine‑tuned with AutoTool. Despite having fewer parameters than many state‑of‑the‑art agents, AutoTool consistently outperforms them, delivering average absolute gains of 6.4% on math/science, 4.5% on search QA, 7.7% on code tasks, and 6.9% on multimodal tasks. Importantly, the authors conduct “unseen‑tool” experiments where tools that never appeared during training are introduced at inference time; AutoTool agents successfully select and invoke these novel tools, demonstrating robust generalization to evolving toolsets.

Ablation studies confirm that both phases are necessary: removing Phase I degrades reasoning coherence, while omitting the PL‑ranking loss reduces tool‑selection accuracy. The embedding‑anchored selection mechanism proves effective for handling large, dynamic inventories because new tools can be added simply by supplying their metadata embeddings without retraining the entire policy.
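The extensibility claim above follows directly from the embedding‑anchored design: since selection is a softmax over distances to metadata embeddings, registering a new tool amounts to appending one embedding row. The sketch below illustrates this with a stand‑in embedding function (a deterministic pseudo‑random hash of the description, purely for demonstration) rather than the LLM's actual embedding layer.

```python
import numpy as np

def embed_metadata(description, dim=8):
    # Stand-in for the LLM's internal embedding layer: seed a generator
    # from the description text so the vector is deterministic. This is
    # an illustrative assumption, not how the paper embeds metadata.
    seed = sum(ord(c) for c in description)
    return np.random.default_rng(seed).normal(size=dim)

# Inventory known at training time.
tool_embs = np.stack([embed_metadata(d) for d in
                      ["code interpreter", "web search API"]])

# An unseen tool is added at inference time by embedding its metadata
# and appending the row -- no retraining of the policy is required.
new_row = embed_metadata("image captioning module")
tool_embs = np.vstack([tool_embs, new_row])
```

The same distance‑based softmax then scores the new tool alongside the old ones the next time the agent reaches a selection step.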

The paper acknowledges limitations: the current approach assumes relatively well‑structured tool metadata and does not yet address complex authentication, rate‑limiting, or billing constraints that real‑world APIs often impose. Moreover, the PL ranking treats tool choices independently and may not capture inter‑tool dependencies; future work could integrate graph‑based dependency models or hierarchical selection strategies.

Overall, AutoTool offers a practical and scalable solution for building LLM agents that can operate in open, continuously evolving tool ecosystems, paving the way for more adaptable AI assistants in real‑world applications.

