AgentServe: Algorithm-System Co-Design for Efficient Agentic AI Serving on a Consumer-Grade GPU
Large language models (LLMs) are increasingly deployed as AI agents that operate in short reasoning-action loops, interleaving model computation with external calls. Unlike traditional chat applications, these agentic workloads require inference serving systems to balance low latency, stable token emission, and throughput under concurrent request arrivals from multiple AI agents. Recent deployments highlight a shift toward running small language models (SLMs) locally on consumer-grade GPUs, driven by privacy, compliance, and cost constraints. When heterogeneous requests overlap on a single GPU, long prefills and short decodes contend for resources, creating head-of-line blocking that destabilizes interactive performance. By analyzing agent workloads, we observe that their execution naturally separates into cold prefills, which process long system prompts; resume prefills, which append tool outputs to cached contexts; and short decodes, which are latency-critical. This mix intensifies contention compared to conventional chatbot serving. We present AgentServe, a single-GPU serving system that ensures stable multi-agent execution under such conditions by isolating prefills from decodes, applying dynamic budgeting to resume prefills, and allocating GPU resources through pre-established CUDA Green Context slots with adaptive control. Evaluation results show that AgentServe significantly improves latency stability while sustaining competitive throughput, achieving up to 2.8x TTFT improvement and 2.7x TPOT improvement over state-of-the-art baselines across different settings.
💡 Research Summary
The paper addresses a pressing challenge in the deployment of small language models (SLMs) as AI agents on a single consumer‑grade GPU. Modern AI agents differ from traditional chatbots: they operate in short reasoning‑action loops, repeatedly interleaving model inference with external tool calls. Each agent session consists of three distinct phases: (1) a cold pre‑fill, which processes a long system prompt without any cached KV states; (2) resume pre‑fills, which append tool outputs or retrieved information to the existing KV cache; and (3) short decodes, which generate a few structured tokens (often just a function call or routing token). Because cold pre‑fills are compute‑intensive and can occupy the GPU’s streaming multiprocessors (SMs) for a long time, they block the latency‑critical short decodes of other agents, causing head‑of‑line (HoL) blocking. This manifests as spikes in time‑to‑first‑token (TTFT) for new requests and increased time‑per‑output‑token (TPOT) for ongoing streams.
Existing serving systems such as vLLM, SGLang, and DistServe mitigate the pre‑fill/decode imbalance primarily through pre‑fill‑decode (PD) disaggregation or chunked pre‑fill. PD disaggregation works well in multi‑GPU clusters where KV transfer overhead can be amortized, but on a single GPU it still incurs inter‑process coordination and memory copy costs, and it does not guarantee strict isolation of decodes. Chunked pre‑fill reduces HoL blocking only when decodes are long enough to absorb the chunking overhead; with agents, decodes are typically only a handful of tokens, so chunk boundaries repeatedly interrupt the token stream.
AgentServe is introduced as a co‑designed algorithm‑system solution for the single‑GPU, multi‑agent regime. Its core contributions are threefold:
- Phase‑aware request classification – incoming requests are immediately categorized as cold pre‑fill, resume pre‑fill, or short decode, enabling the scheduler to apply a different policy to each class.
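The classification rule can be sketched as follows. This is a minimal illustration, not the paper's code: the `Request` fields (`has_kv_cache`, `is_generation`) are assumed names for the signals a scheduler would plausibly inspect.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Phase(Enum):
    COLD_PREFILL = auto()    # long system prompt, no cached KV states yet
    RESUME_PREFILL = auto()  # appends tool output to an existing KV cache
    SHORT_DECODE = auto()    # latency-critical token generation

@dataclass
class Request:
    session_id: str
    prompt_tokens: int    # new tokens to process in this step
    has_kv_cache: bool    # KV states already resident for this session
    is_generation: bool   # session is currently emitting output tokens

def classify(req: Request) -> Phase:
    """Phase-aware classification (field names are illustrative assumptions)."""
    if req.is_generation:
        return Phase.SHORT_DECODE
    if req.has_kv_cache:
        return Phase.RESUME_PREFILL
    return Phase.COLD_PREFILL
```

In this sketch the decision needs only per-session metadata, which is why classification can happen instantly at arrival time.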
- TPOT‑driven scheduling – the scheduler reserves a minimum fraction of SMs for decodes (the "decode budget") to keep token emission in its most efficient region. The remaining SM capacity is offered to resume pre‑fills under a dynamic token budget that is continuously adjusted based on real‑time TPOT measurements. When decode latency rises, the pre‑fill budget shrinks, immediately freeing SMs for decodes. This adaptive mechanism keeps TPOT stable while still allowing pre‑fills to utilize idle resources.
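One way to realize such a feedback loop is an AIMD-style controller. The paper does not specify the exact control law; the multiplicative-decrease/additive-increase rule and all constants below are assumptions for illustration.

```python
def adjust_prefill_budget(budget: int, tpot_ms: float, slo_ms: float,
                          min_budget: int = 0, max_budget: int = 2048,
                          step: int = 256) -> int:
    """Adjust the resume-prefill token budget from a live TPOT measurement.

    Hypothetical control law: halve the budget when the measured TPOT
    exceeds the SLO (quickly returning SMs to decodes), otherwise probe
    upward additively to reclaim idle capacity for pre-fills.
    """
    if tpot_ms > slo_ms:
        return max(min_budget, budget // 2)
    return min(max_budget, budget + step)
```

Multiplicative decrease reacts fast when decodes are hurting, while additive increase recovers pre-fill throughput gradually once TPOT is back under the SLO.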
- CUDA Green Context slots – a set of pre‑allocated CUDA contexts (Green Contexts) separates the memory and execution state of pre‑fills and decodes within a single engine. Because each phase is bound to a distinct context, sharing the KV cache does not cause memory contention, and context switches are cheap because no large KV transfers are needed. Coordination between phases is performed via lightweight shared‑memory queues, avoiding heavyweight inter‑process communication.
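The slot structure and handoff can be modeled in a few lines. This sketch only models the bookkeeping: in the real system each slot would be backed by a pre-created CUDA Green Context holding its SM partition, and the 20% decode fraction is taken from the paper's profiling; the function and field names are my own.

```python
from dataclasses import dataclass
from queue import Queue

@dataclass
class SMSlot:
    name: str
    sm_count: int

def make_slots(total_sms: int, decode_fraction: float = 0.2) -> dict:
    """Pre-establish two SM slots at startup; sizes are fixed before any
    request arrives, so no context creation happens on the hot path."""
    decode_sms = max(1, round(total_sms * decode_fraction))
    return {"decode": SMSlot("decode", decode_sms),
            "prefill": SMSlot("prefill", total_sms - decode_sms)}

# Phases hand sessions to each other through lightweight queues: only a
# small session descriptor moves, while the KV cache stays resident on-GPU.
decode_ready: Queue = Queue()

def on_prefill_done(session_id: str) -> None:
    decode_ready.put(session_id)
```

Passing descriptors instead of KV tensors is what keeps the phase switch cheap relative to PD disaggregation, where KV states must cross a process or device boundary.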
The system architecture comprises three layers: an Application Layer (user‑agent‑tool interaction), an Orchestration Layer (request queue, token‑budget manager, SM partitioner), and an Execution Layer (CUDA streams, Green Context management, memory manager). The authors also provide a competitive‑ratio analysis that bounds the throughput loss of pre‑fills under a given decode SLO, showing that AgentServe’s dynamic budgeting stays within a provable fraction of the offline optimum.
Evaluation is performed on two SLMs—Qwen2.5‑7B and Qwen2.5‑3B—running on RTX 5090 and RTX A5000 GPUs. Experiments involve 2‑4 concurrent agents issuing mixed workloads of cold pre‑fills, resume pre‑fills, and short decodes. Compared with state‑of‑the‑art baselines (vLLM, SGLang, DistServe), AgentServe achieves:
- Up to 2.8× reduction in TTFT, meaning the first token appears much faster even when multiple agents start simultaneously.
- Up to 2.7× reduction in TPOT, with token‑generation latency staying below 30 ms and exhibiting minimal variance.
- Stable SM utilization: decodes saturate at roughly 20 % of SMs, after which additional SMs yield diminishing returns. AgentServe keeps decodes in this efficient region while reallocating surplus SMs to pre‑fills, achieving overall GPU utilization above 85 %.
The authors also present profiling data showing that decode throughput rises sharply with the first few SMs and plateaus, whereas pre‑fill throughput grows more linearly. This validates the intuition behind allocating a small, protected SM slice to decodes.
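The shapes of these two curves can be captured with a toy model. The saturating form and all constants here are illustrative assumptions, not fits to the paper's measurements; the point is only the qualitative contrast that motivates a small protected decode slice.

```python
def decode_throughput(sms: int, peak: float = 1000.0, half_sat: int = 8) -> float:
    """Saturating curve: rises sharply over the first few SMs, then plateaus
    near `peak` (tokens/s, illustrative units)."""
    return peak * sms / (sms + half_sat)

def prefill_throughput(sms: int, per_sm: float = 50.0) -> float:
    """Roughly linear in SM count: every extra SM keeps paying off."""
    return per_sm * sms
```

Under this model the marginal value of an SM given to decodes collapses past the knee, while an SM given to pre-fills retains constant value, which is exactly why reserving a small decode slice and handing the surplus to pre-fills maximizes total utilization.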
Discussion and Limitations – AgentServe is tailored to SLMs up to ~7 B parameters; larger models would exceed the memory capacity of consumer GPUs and would need additional techniques (e.g., model parallelism). The dynamic budgeting relies on accurate TPOT measurement; highly irregular external tool latencies could momentarily destabilize the budget, suggesting future work on more robust prediction models. Finally, extending the design to multi‑GPU clusters would require integrating AgentServe’s intra‑GPU isolation with existing PD‑disaggregation frameworks.
Conclusion – By jointly redesigning the scheduling algorithm and the low‑level GPU execution environment, AgentServe eliminates the pre‑fill‑decode contention that hampers multi‑agent serving on a single consumer GPU. The result is a system that delivers low‑latency, stable token streams while maintaining high throughput, making local AI agents viable for privacy‑sensitive, cost‑constrained, and edge‑computing scenarios.