📝 Original Info
- Title: Optimizing Agentic Language Model Inference via Speculative Tool Calls
- ArXiv ID: 2512.15834
- Date: 2025-12-17
- Authors: Daniel Nichols, Prajwal Singhania, Charles Jekel, Abhinav Bhatele, Harshitha Menon
📝 Abstract
Language models (LMs) are becoming increasingly dependent on external tools. LM-based agentic frameworks frequently interact with their environment via such tools to search files, run code, call APIs, etc. Further, modern reasoning-based LMs use tools such as web search and Python code execution to enhance their reasoning capabilities. While tools greatly improve the capabilities of LMs, they also introduce performance bottlenecks during the inference process. In this paper, we introduce novel systems optimizations to address such performance bottlenecks by speculating tool calls and forcing sequences to remain resident in the inference engine to minimize overheads. Our optimizations lead to throughput improvements of several hundred tokens per second when hosting inference for LM agents. We provide a theoretical analysis of our algorithms to provide insights into speculation configurations that will yield the best performance. Further, we recommend a new "tool cache" API endpoint to enable LM providers to easily adopt these optimizations.
📄 Full Content
Optimizing Agentic Language Model Inference via Speculative Tool Calls
Daniel Nichols
Lawrence Livermore National Laboratory
Prajwal Singhania
University of Maryland
Charles Jekel
Lawrence Livermore National Laboratory
Abhinav Bhatele
University of Maryland
Harshitha Menon
Lawrence Livermore National Laboratory
Abstract
Language models (LMs) are becoming increasingly dependent on external tools. LM-based agentic frameworks frequently interact with their environment via such tools to search files, run code, call APIs, etc. Further, modern reasoning-based LMs use tools such as web search and Python code execution to enhance their reasoning capabilities. While tools greatly improve the capabilities of LMs, they also introduce performance bottlenecks during the inference process. In this paper, we introduce novel systems optimizations to address such performance bottlenecks by speculating tool calls and forcing sequences to remain resident in the inference engine to minimize overheads. Our optimizations lead to throughput improvements of several hundred tokens per second when hosting inference for LM agents. We provide a theoretical analysis of our algorithms to provide insights into speculation configurations that will yield the best performance. Further, we recommend a new "tool cache" API endpoint to enable LM providers to easily adopt these optimizations.
1 Introduction
Tool and function calling have enabled language models (LMs) to become useful for tasks beyond just conversation by providing the ability to interact with external environments and collect further context [12–14, 20]. In particular, LM-based agentic tools and frameworks are often entirely reliant on external tools, as they are designed to interact with the environment to solve a problem or accomplish a task. Popular agents such as software engineering agents (SWE agents) [9, 19] need to interact with source code files and the command line to execute actions. Although access to external tools yields much richer capabilities and enables LMs to solve long-horizon, real-world tasks, it also introduces several performance bottlenecks in the traditional inference pipeline. Instead of producing a single, contiguous generation, the model alternates between generation and tool invocation, often across many concurrent sessions, resulting in gaps while waiting on tool completion.
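For concreteness, the listing below sketches this vanilla agent loop against an OpenAI-compatible chat completions API such as the one vLLM serves. It is a minimal illustration, not the paper's implementation; `run_tool` is a hypothetical local dispatcher, and the model name and server URL are placeholders.

```python
# Vanilla agent loop: generation blocks on every tool call.
# Sketch against an OpenAI-compatible chat API (e.g., a local vLLM server);
# `run_tool` is a hypothetical dispatcher for the agent's local tools.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

def run_tool(name: str, arguments: dict) -> str:
    """Hypothetical local tool dispatcher (file search, shell, API call, ...)."""
    raise NotImplementedError

def agent_loop(messages: list, tools: list, model: str = "gpt-oss-120b") -> list:
    while True:
        # 1. The sequence occupies the inference engine only for this request ...
        resp = client.chat.completions.create(model=model, messages=messages, tools=tools)
        choice = resp.choices[0]
        messages.append(choice.message)
        if choice.finish_reason != "tool_calls":
            return messages  # final answer, no more tool use
        # 2. ... then generation stops and control returns to the client,
        #    which runs each requested tool sequentially.
        for call in choice.message.tool_calls:
            output = run_tool(call.function.name, json.loads(call.function.arguments))
            messages.append({"role": "tool", "tool_call_id": call.id, "content": output})
        # 3. The grown context is resubmitted and must be rescheduled
        #    (and re-prefilled, modulo prefix caching) before decoding resumes.
```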
Figure 1 (plot omitted): Throughput (tok/sec) vs. average tool latency (sec), comparing "Our Approach" against "Vanilla" under the title "Overview of Speculative Tool Use (32 Async gpt-oss-120b Agents)"; the annotation marks a +196.4 tok/sec gain. Caption: Our approach leads to up to 196 tokens/second improvement in the vLLM server when there are 32 gpt-oss-120b agents using it for inference.
As tool-centric agents become more prevalent in code copilots, personal assistants, and autonomous workflows, the overhead of using these agents is increasingly dominated by the time spent waiting on tools. Each tool call forces generation to stop and return to the user, who handles running the tool and sending back its output. This interrupt-driven process introduces a strict sequential dependency, where the latencies of individual tool calls accumulate and can significantly increase total generation time. Additionally, the evicted sequences and tool output must be rescheduled back into the inference engine, causing further overhead. Even with optimizations such as prefix caching [5, 7, 18, 22], there are still significant overheads to removing the sequences, processing the tool output, and rescheduling them for generation. These overheads are further exacerbated in multi-tenant settings, where scheduling and prefix-caching optimizations become more challenging with many agents contending for resources and many concurrent prompts evicting existing entries in the KV-cache.
Removing the sequential dependence and reducing these overheads in tool-heavy workloads is critical as agentic LMs become more widespread and need to be more economically viable and efficient.
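A rough back-of-the-envelope model (our illustration, not the paper's theoretical analysis) makes the effect concrete: every blocking tool call adds its latency plus a rescheduling overhead to the critical path, so client-observed throughput drops well below the engine's decode rate. All numbers in the sketch below are hypothetical.

```python
# Back-of-the-envelope model for why tool latency dominates observed throughput.
# Illustrative arithmetic only; not the paper's formal analysis.

def observed_throughput(n_tokens: float, decode_tok_per_s: float,
                        n_tool_calls: int, avg_tool_latency_s: float,
                        resched_overhead_s: float) -> float:
    """Tokens/sec seen by the client when every tool call blocks generation."""
    decode_time = n_tokens / decode_tok_per_s
    blocked_time = n_tool_calls * (avg_tool_latency_s + resched_overhead_s)
    return n_tokens / (decode_time + blocked_time)

# Example: 4,000 generated tokens at 60 tok/s of pure decode, 20 tool calls.
print(observed_throughput(4000, 60, 20, avg_tool_latency_s=0.0, resched_overhead_s=0.05))  # ~59 tok/s
print(observed_throughput(4000, 60, 20, avg_tool_latency_s=1.0, resched_overhead_s=0.05))  # ~46 tok/s
```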
Breaking the strict sequential dependency introduced by tool calls and reducing eviction overheads is non-trivial from a systems perspective. Tool latencies can span orders of magnitude and depend on external services, filesystem I/O, or user-defined code, making them difficult to predict or bound. Long-running tools inevitably dominate end-to-end latency, whereas for short tools, the overheads of eviction and re-entry into the engine can outweigh any potential latency-hiding strategies. Furthermore, decoupling tool execution from model progression introduces correctness challenges, and naively running tools early or out of order can lead to wasted computation with unused results or outputs that are inconsistent with the model's eventual decisions. Finally, existing inference optimizations such as speculative decoding [3, 8] and optimized KV-cache management [7] are implemented entirely within the inference engine
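To illustrate the correctness constraint on running tools early, the sketch below shows one conservative client-side policy (our illustration, not the paper's algorithm): a speculatively executed call is committed only if the model's eventual tool call matches it exactly; otherwise the speculative work is discarded.

```python
# Illustration of the correctness hazard when tools run speculatively.
# Toy client-side policy, not the paper's mechanism: commit a speculative
# result only if it matches the tool call the model actually emits.
import json
from concurrent.futures import Future, ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

def _key(name: str, args: dict) -> str:
    return name + "\x00" + json.dumps(args, sort_keys=True)

def speculate(predicted_name: str, predicted_args: dict, run_tool) -> tuple[str, Future]:
    """Start a predicted tool call early, keyed by its exact (name, args)."""
    return _key(predicted_name, predicted_args), _pool.submit(run_tool, predicted_name, predicted_args)

def resolve(actual_name: str, actual_args: dict, speculated: dict, run_tool) -> str:
    """Use the speculative result only on an exact match; otherwise discard it."""
    fut = speculated.pop(_key(actual_name, actual_args), None)
    if fut is not None:
        return fut.result()                     # hit: tool latency was (partially) hidden
    return run_tool(actual_name, actual_args)   # miss: the speculative work is wasted
```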