Optimizing Agentic Language Model Inference via Speculative Tool Calls

Reading time: 5 minutes
...

📝 Original Info

  • Title: Optimizing Agentic Language Model Inference via Speculative Tool Calls
  • ArXiv ID: 2512.15834
  • Date: 2025-12-17
  • Authors: Daniel Nichols, Prajwal Singhania, Charles Jekel, Abhinav Bhatele, Harshitha Menon

📝 Abstract

Language models (LMs) are becoming increasingly dependent on external tools. LM-based agentic frameworks frequently interact with their environment via such tools to search files, run code, call APIs, etc. Further, modern reasoning-based LMs use tools such as web search and Python code execution to enhance their reasoning capabilities. While tools greatly improve the capabilities of LMs, they also introduce performance bottlenecks during the inference process. In this paper, we introduce novel systems optimizations to address such performance bottlenecks by speculating tool calls and forcing sequences to remain resident in the inference engine to minimize overheads. Our optimizations lead to throughput improvements of several hundred tokens per second when hosting inference for LM agents. We provide a theoretical analysis of our algorithms, offering insights into speculation configurations that will yield the best performance. Further, we recommend a new "tool cache" API endpoint to enable LM providers to easily adopt these optimizations.

💡 Deep Analysis

📄 Full Content

Optimizing Agentic Language Model Inference via Speculative Tool Calls

Daniel Nichols (Lawrence Livermore National Laboratory), Prajwal Singhania (University of Maryland), Charles Jekel (Lawrence Livermore National Laboratory), Abhinav Bhatele (University of Maryland), Harshitha Menon (Lawrence Livermore National Laboratory)

Abstract

Language models (LMs) are becoming increasingly dependent on external tools. LM-based agentic frameworks frequently interact with their environment via such tools to search files, run code, call APIs, etc. Further, modern reasoning-based LMs use tools such as web search and Python code execution to enhance their reasoning capabilities. While tools greatly improve the capabilities of LMs, they also introduce performance bottlenecks during the inference process. In this paper, we introduce novel systems optimizations to address such performance bottlenecks by speculating tool calls and forcing sequences to remain resident in the inference engine to minimize overheads. Our optimizations lead to throughput improvements of several hundred tokens per second when hosting inference for LM agents. We provide a theoretical analysis of our algorithms, offering insights into speculation configurations that will yield the best performance. Further, we recommend a new "tool cache" API endpoint to enable LM providers to easily adopt these optimizations.

1 Introduction

Tool and function calling have enabled language models (LMs) to become useful for tasks beyond just conversation by providing the ability to interact with external environments and collect further context [12–14, 20]. In particular, LM-based agentic tools and frameworks are often entirely reliant on external tools as they are designed to interact with the environment to solve some problem or accomplish a task. Popular agents such as software engineering agents (SWE agents) [9, 19] need to interact with source code files and the command line to execute actions. Although access to external tools yields much richer capabilities and enables LMs to solve long-horizon, real-world tasks, it also introduces several performance bottlenecks in the traditional inference pipeline. Instead of a single, contiguous generation, the model alternates between generation and tool invocation, often across many concurrent sessions, resulting in gaps due to waiting on tool completion.

[Figure 1: Throughput (tok/sec) vs. average tool latency (sec) for 32 async gpt-oss-120b agents, comparing our approach against vanilla serving. Our approach leads to up to 196 tokens/second (+196.4 tok/sec) improvement in the vLLM server when there are 32 gpt-oss-120b agents using it for inference.]

As tool-centric agents become more prevalent in code copilots, personal assistants, and autonomous workflows, the overhead of using these agents is increasingly dominated by the time spent waiting on tools. Each tool call forces generation to stop and return to the user, who handles running the tool and sending back its output. This interrupt-driven process introduces a strict sequential dependency, where the latencies of individual tool calls accumulate and can significantly increase total generation time. Additionally, the evicted sequences and tool output must be rescheduled back into the inference engine, causing further overhead.
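To make the interrupt-driven pattern above concrete, here is a minimal client-side sketch of a conventional agentic tool-call loop. The helpers `generate` and `run_tool` are hypothetical stand-ins for a request to the serving engine and for local tool execution; they are illustrative only, not APIs from the paper or from any specific framework.

```python
# Minimal sketch of a conventional (non-speculative) agentic tool-call loop.
# `generate` and `run_tool` are hypothetical stand-ins, not APIs from the paper.

def generate(messages: list[dict]) -> dict:
    """Ask the inference server to continue the conversation.

    Returns either {"type": "text", "content": ...} for a final answer or
    {"type": "tool_call", "name": ..., "arguments": ...} for a tool request.
    """
    raise NotImplementedError  # stand-in for an HTTP request to the engine


def run_tool(name: str, arguments: dict) -> str:
    """Execute the requested tool (file search, shell command, API call, ...)."""
    raise NotImplementedError  # stand-in for client-side tool execution


def agent_loop(task: str) -> str:
    messages = [{"role": "user", "content": task}]
    while True:
        step = generate(messages)          # sequence is scheduled in the engine
        if step["type"] == "text":
            return step["content"]         # done: no more tool calls

        # Generation stops here: the sequence is handed back to the client
        # (and typically evicted from the engine) while the tool runs, so
        # per-call tool latencies accumulate strictly serially.
        output = run_tool(step["name"], step["arguments"])

        # The tool output is appended and the sequence must be rescheduled,
        # paying eviction, recompute/prefill, and scheduling overheads again.
        messages.append({"role": "assistant", "tool_call": step})
        messages.append({"role": "tool", "name": step["name"], "content": output})
```

Every iteration of this loop pays a full client round trip plus the tool's own latency before any further tokens can be generated for that sequence.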
Even with optimizations such as prefix-caching [5, 7, 18, 22], there are still significant overheads to removing the sequences, processing the tool output, and rescheduling them for generation. These overheads are further exacerbated in multi-tenant settings where scheduling and prefix-caching optimizations become more challenging with many agents contending for resources, and many concurrent prompts evicting existing entries in the KV-cache. Removing the sequential dependence and reducing these overheads in tool-heavy workloads is critical as agentic LMs become more widespread and need to be more economically viable and efficient.

Breaking the strict sequential dependency introduced by tool calls and reducing eviction overheads is non-trivial from a systems perspective. Tool latencies can span orders of magnitude and depend on external services, filesystem I/O, or user-defined code, making them difficult to predict or bound. Long-running tools inevitably dominate end-to-end latency, whereas for short tools, the overheads of eviction and re-entry into the engine can outweigh any potential latency hiding strategies. Furthermore, decoupling tool execution from model progression introduces correctness challenges, and naively running tools early or out of order can lead to wasted computation with unused results or outputs that are inconsistent with the model's eventual decisions. Finally, existing inference optimizations such as speculative decoding [3, 8] and optimized KV-cache management [7] are implemented entirely within the inference engine
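As a rough illustration of the core idea described in the abstract, speculating tool calls while keeping the sequence resident in the engine, the sketch below executes tools that the client has registered as safe to speculate directly alongside the engine, so their latency overlaps with ongoing decoding, and falls back to the usual evict-and-return path for anything else. The `ToolRegistry` class, the `engine` object, and all of its methods are names invented for this sketch; this is one possible interpretation for illustration, not the paper's implementation or an existing serving-engine API.

```python
# Illustrative sketch of speculative, engine-resident tool handling.
# `ToolRegistry`, `engine`, and every engine method below are hypothetical;
# this is not the paper's implementation or an existing inference-engine API.
from typing import Awaitable, Callable


class ToolRegistry:
    """Tools the client has declared safe to run speculatively server-side."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[[dict], Awaitable[str]]] = {}

    def register(self, name: str, fn: Callable[[dict], Awaitable[str]]) -> None:
        self._tools[name] = fn

    def get(self, name: str):
        return self._tools.get(name)


async def serve_sequence(engine, seq_id: str, registry: ToolRegistry):
    """Drive one sequence, speculating registered tool calls without eviction."""
    while True:
        event = await engine.decode_until_stop(seq_id)      # hypothetical call

        if event.kind == "eos":
            return await engine.finalize(seq_id)            # normal completion

        assert event.kind == "tool_call"
        tool = registry.get(event.name)

        if tool is None:
            # Unregistered tool: fall back to the conventional path, i.e.
            # return the tool call to the client and release the sequence.
            return await engine.yield_to_client(seq_id, event)

        # Registered tool: run it here while the sequence stays resident.
        # Decoding of *other* sequences continues, so the tool's latency is
        # overlapped with useful work instead of idling and evicting the slot.
        output = await tool(event.arguments)

        # Append the tool output to the still-resident sequence and continue,
        # skipping the eviction, client round-trip, and rescheduling overheads.
        await engine.append_tool_output(seq_id, output)
```

The "tool cache" API endpoint recommended in the abstract would presumably play a role similar to `ToolRegistry.register` here, giving clients a standard way to expose speculable tools or cached tool results to the provider; that mapping is an assumption made for this sketch.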

Reference

This content is AI-processed based on open access ArXiv data.
