LMetric: Simple is Better - Multiplication May Be All You Need for LLM Request Scheduling

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

High-quality LLM request scheduling must achieve two key objectives: routing each request to an instance whose KV cache (KV$) can accelerate its execution, and keeping the workload balanced across instances. Achieving both is challenging because pursuing one objective may compromise the other. Current approaches adopt various combinators (e.g., linear combinations) to compute a scheduling score from indicators for the two objectives; these combinators are complex in that they either require significant workload-specific hyperparameter tuning or model-hardware-aware simulator development, and can still lead to suboptimal performance. In this paper, we show that a simple multiplication of two carefully chosen indicators, one KV$-aware (new prefill tokens if routed to an instance) and one load-balancing-aware (the instance's current batch size), used as the scheduling score can simultaneously achieve both objectives well without any hyperparameter tuning. The key idea is that the multiplied score considers both objectives much as a linear combination does, with the nice property that the original hyperparameters cancel out during comparison, so no tuning is needed to find the best parameters. The two indicators are chosen based on our analysis of LLM characteristics, and our extensive experiments show that this simple approach can reduce TTFT by 92% and 52%, and TPOT by 21% and 20%, compared to vLLM-v1 and a production scheduler on real-world workloads covering chatbots, API calls, and coding agents. We also mathematically derive the conditions under which multiplication may fail, and find that such conditions are extremely rare in practice and can be detected (and mitigated) beforehand.


💡 Research Summary

The paper tackles the problem of routing large‑language‑model (LLM) inference requests across a cluster of serving instances. Two competing objectives must be satisfied: (1) exploiting the key‑value (KV) cache so that a request can reuse previously computed KV entries (thereby reducing the costly pre‑fill phase), and (2) keeping the workload balanced across instances to avoid queuing delays. Existing approaches combine separate indicators for these objectives using weighted sums, filter‑then‑select heuristics, or sophisticated simulators that predict latency based on model‑hardware characteristics. All of these methods either require workload‑specific hyper‑parameter tuning, are brittle under dynamic traffic, or demand costly simulator development and maintenance.

The authors propose a remarkably simple scoring function: multiply two carefully chosen per‑instance indicators. The KV‑aware indicator is the number of new pre‑fill tokens that would have to be computed if the request were routed to a given instance (so a smaller value means more of the KV cache can be reused). The load‑balancing indicator is the current batch size (the number of requests already queued or running on the instance). The scheduling score for an instance i is therefore

 Score_i = (new pre‑fill tokens)_i × (current batch size)_i,

and the request is routed to the instance with the smallest score, since both factors measure a cost (pre‑fill work to be done and load already present).

Because the two factors are multiplied, any positive scaling coefficients that a weighted‑sum formulation would attach to the indicators multiply every instance's score by the same constant and therefore cancel out when scores are compared across instances. Consequently, no hyper‑parameter tuning is needed, yet the ranking of instances mirrors that of a properly weighted linear combination.
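This cancellation can be made concrete in a few lines. The sketch below (in Python for illustration; the indicator values and any scaling coefficients are hypothetical) shows that routing by the product score picks the same instance regardless of how the two indicators are scaled:

```python
# Minimal sketch of product-score routing. alpha and beta play the role of
# the per-indicator weights a weighted-sum formulation would need; here
# they scale every instance's score equally and so cannot change the ranking.

def product_score(new_prefill_tokens, batch_size, alpha=1.0, beta=1.0):
    # Both factors are costs: fewer new prefill tokens means more KV-cache
    # reuse, and a smaller batch means a lighter load.
    return (alpha * new_prefill_tokens) * (beta * batch_size)

def route(instances, alpha=1.0, beta=1.0):
    # instances: list of (new_prefill_tokens, batch_size) tuples.
    # The instance with the smallest score wins.
    return min(range(len(instances)),
               key=lambda i: product_score(*instances[i], alpha, beta))

instances = [(1200, 4), (300, 9), (800, 2)]
# Scores are 4800, 2700, 1600, so instance 2 is chosen, and the choice is
# identical for any positive alpha and beta.
assert route(instances) == route(instances, alpha=0.01, beta=50.0)
```

This is exactly why no tuning is needed: the weights never influence which instance wins the comparison.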

The paper provides a mathematical analysis showing that the multiplication preserves the ordering of a weighted sum whenever the weights are positive—a condition naturally satisfied in practice. It also derives the rare pathological case where the product could mis‑rank instances (extreme asymmetry between the two factors) and proposes a lightweight two‑phase detection mechanism that falls back to a pure load‑balancing policy when such a case is detected.
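The summary does not spell out the paper's exact detection criterion, but the two-phase idea can be sketched as follows (Python for illustration; the asymmetry measure and threshold are assumptions, not the paper's formula): if one indicator is spread out across instances far more extremely than the other, skip the product and fall back to pure load balancing.

```python
def route_with_fallback(instances, asymmetry_threshold=100.0):
    # instances: list of (new_prefill_tokens, batch_size) tuples.
    tokens = [t for t, _ in instances]
    batches = [b for _, b in instances]

    # Relative spread of an indicator across instances (max/min, with a
    # guard against dividing by zero). This measure is illustrative.
    def spread(xs):
        return max(xs) / max(min(xs), 1)

    # Phase 1: detect extreme asymmetry between the two factors, the
    # pathological case in which the product could mis-rank instances.
    if spread(tokens) > asymmetry_threshold * spread(batches):
        # Phase 2: fall back to a pure load-balancing policy.
        return min(range(len(instances)), key=lambda i: batches[i])

    # Normal case: product score, lower is better.
    return min(range(len(instances)), key=lambda i: tokens[i] * batches[i])
```

Because the check only inspects indicators the scheduler already collects, it adds negligible overhead to each routing decision.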

To evaluate the approach, the authors built LMetric, a Rust‑based framework that implements a unified indicator factory and a domain‑specific language for defining scheduling policies in a few lines of code. LMetric can collect indicators such as queued batch size (Q‑BS), running batch size (R‑BS), and new pre‑fill tokens directly from the serving engine’s responses, enabling fair, apples‑to‑apples comparisons of different policies.
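LMetric itself is implemented in Rust with its own DSL, whose syntax this summary does not reproduce. As a language-neutral illustration of the "policy as a scoring function over collected indicators" idea, the following Python sketch uses hypothetical field names mirroring the indicators named above (Q‑BS, R‑BS, new pre‑fill tokens):

```python
from dataclasses import dataclass

@dataclass
class InstanceIndicators:
    # Per-instance indicators reported by the serving engine; the field
    # names are illustrative stand-ins for LMetric's indicators.
    queued_batch_size: int     # Q-BS
    running_batch_size: int    # R-BS
    new_prefill_tokens: int    # tokens not already covered by the KV cache

# A scheduling policy reduces to a scoring function over the indicators
# (lower is better). The multiplication policy needs only one line of logic.
def product_policy(ind: InstanceIndicators) -> float:
    batch = ind.queued_batch_size + ind.running_batch_size
    return ind.new_prefill_tokens * batch

# A baseline pure load-balancing policy for comparison.
def load_balancing_policy(ind: InstanceIndicators) -> float:
    return ind.queued_batch_size + ind.running_batch_size
```

Expressing every policy as a function over the same indicator set is what enables the apples-to-apples comparisons the authors describe.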

Experiments were conducted on a 16‑GPU H20 cluster (96 GB HBM per GPU) using the latest vLLM‑v1 serving engine. Two representative models were tested: a dense Qwen2‑7B and a mixture‑of‑experts Qwen3‑30B. Real‑world workloads covering chatbots, API‑calling agents, and coding assistants were used, sourced from Alibaba Cloud and open‑source traces. Compared against the state‑of‑the‑art vLLM scheduler, AI‑Dynamo, LLM‑D, and a proprietary scheduler used by Baichuan, the multiplication‑based policy achieved:

  • TTFT (time‑to‑first‑token) reductions of 92 % vs. vLLM‑v1 and 52 % vs. the production scheduler.
  • TPOT (time‑per‑output‑token) reductions of 21 % and 20 % respectively.
  • More uniform batch‑size distribution across instances, indicating better load balancing.

The results show that the simple product score matches or exceeds the performance of far more complex methods while eliminating the need for per‑workload tuning or simulator development. The authors also discuss the practical implications of PD‑colocated serving (prefill and decode on the same instance) versus PD‑disaggregated serving, noting that their method focuses on the former, which remains common in many deployments.

In summary, the paper demonstrates that a single multiplication of two well‑chosen indicators can simultaneously capture KV‑cache awareness and load‑balancing, delivering substantial latency improvements with negligible implementation overhead. The proposed LMetric framework further simplifies experimentation and adoption, making the approach attractive for large‑scale LLM service operators seeking both performance gains and operational simplicity.

