GORGO: Maximizing KV-Cache Reuse While Minimizing Network Latency in Cross-Region LLM Load Balancing
Distributing LLM inference across geographical regions can improve Time-to-First-Token (TTFT) by regionalizing service deployments. While existing multi-region load balancers save prefill computation by prioritizing Key–Value (KV) cache hit rate, they ignore cluster networking latency, a critical factor in routing decisions. We introduce GORGO, a method for minimizing TTFT by optimizing a total serving cost as a function of available compute, network latency, and prefix caching. Using extensive profiling on custom infrastructure, we analyze component-level latency bottlenecks and benchmark GORGO against three baselines: (1) naive least-load routing, which ignores prefix-cache overlap; (2) prefix-similarity routing, which always pushes requests to the replica with the highest cached-prefix overlap; and (3) a centralized HTTP proxy that runs the GORGO policy while tracking requests across all nodes. We demonstrate that GORGO reduces P99 TTFT through network-aware routing and improves average TTFT by preventing pathological cross-region forwarding. Additionally, we find that GORGO-proxy overcomes the synchronization overhead of previous methods and is 2.5× faster on median TTFT, demonstrating the value of a centralized router.
💡 Research Summary
The paper introduces GORGO, a cross‑region load‑balancing framework designed to minimize Time‑to‑First‑Token (TTFT) for large language model (LLM) inference deployed on geographically distributed GPU clusters. Existing multi‑region routers focus on maximizing KV‑cache reuse (i.e., reusing the key‑value cache generated during the pre‑fill phase for prompts that share a prefix) but treat network latency as a secondary concern. In practice, wide‑area round‑trip times (RTTs) between regions can be tens to hundreds of milliseconds, which can dominate TTFT even when cache hit rates are high.
GORGO addresses this by jointly modeling three real‑time signals for each incoming request: (1) Cache locality – an estimate of how much of the pre‑fill work can be saved based on prefix overlap, derived from a radix‑trie index that maps prompt prefixes to regions that have previously cached them; (2) Network latency – measured RTT between the local load balancer and each peer region; and (3) Admission/queue state – the current load on the local continuous‑batching scheduler, expressed as the number of tokens already running or waiting. These signals feed into a simple additive cost function:
Cost(region) = NetworkLatency(region)
             + tp · (Lp − Lhit(region))
             + q̂s · QueueWaitTime(region)
where tp is the per‑token pre‑fill time (empirically measured) and q̂s is a tunable weight reflecting the relative importance of queue delay. The model estimates the residual pre‑fill time as (Lp − Lhit)·tp, where Lp is the total pre‑fill length and Lhit is the length of the cached prefix in the candidate region. For each request, GORGO evaluates the cost for all candidate regions and selects the one with the lowest estimated TTFT. If the local region can admit the request immediately into its running batch, the request is served locally; otherwise, GORGO may forward it to a remote region if the combined network and cache benefit outweighs the local queue delay.
The system is implemented in two architectural styles. In the distributed version, each region runs a lightweight Go load balancer (loadbalancer.go) that maintains a local prefix index, mirrors the SGLang serving runtime’s queue metrics, and periodically exchanges summarized state (RTT, availability, cache locality estimates) with peers. In the centralized proxy version (GORGO‑proxy), a single HTTP proxy aggregates all peer metadata and performs the cost computation centrally, then forwards the request to the chosen region. Both designs expose per‑request telemetry (creation timestamp, hop sequence, admission time) for offline analysis.
Experiments were conducted on three real cloud regions (US West, Germany, Israel), each equipped with an 8‑GPU A100 node running the Mistral‑7B‑Instruct‑v0.3 model via the SGLang serving stack. Workloads combined the WildChat dialogue trace (which contains multi‑turn conversations with shared system prompts) and GuideLLM synthetic traffic, creating a mix of high‑overlap and low‑overlap prompts. Four request patterns were exercised: (1) Concurrent (fixed number of parallel requests), (2) Poisson (human‑like bursty arrivals), (3) Throughput‑driven (load exceeds server capacity to stress queueing), and (4) Sweep (gradually increasing request rate until latency degrades).
Four routing policies were compared: (a) least‑load routing (ignores cache), (b) prefix‑similarity routing (optimizes only cache overlap), (c) a standard centralized proxy that uses the same routing logic but without the network‑aware cost model, and (d) GORGO‑proxy, the proposed method.
Key findings:
- Tail latency reduction – By accounting for RTT, GORGO lowered the 99th‑percentile TTFT by roughly 30 % compared with the cache‑only policy, demonstrating that avoiding unnecessary cross‑region hops is critical for tail performance.
- Median latency improvement – The centralized GORGO‑proxy achieved a 2.5× speed‑up in median TTFT over the standard proxy, primarily because it eliminates per‑node synchronization overhead and makes globally optimal decisions.
- Throughput parity – All policies achieved similar overall request‑per‑second throughput, indicating that the latency gains do not come at the expense of capacity.
- Network traffic savings – GORGO reduced inter‑region traffic by about 15 % by keeping more requests local when the queue delay was acceptable.
- Load balancing fairness – GPU utilization across the three regions remained balanced, showing that the cost model naturally spreads load when network costs are comparable.
The authors discuss limitations: maintaining the prefix trie and exchanging summaries scales linearly with the number of regions and the diversity of prompts, potentially increasing memory and bandwidth overhead. RTT measurements can be noisy in volatile network conditions, which may degrade the accuracy of the cost estimate. Moreover, the system does not transfer actual KV‑cache blocks; if a request’s prefix only partially matches cached data, the remaining pre‑fill work still incurs latency.
Future work includes (i) integrating probabilistic models of prefix similarity to predict cache hit probability more accurately, (ii) extending the cost function to incorporate monetary data‑transfer costs in multi‑cloud deployments, and (iii) applying reinforcement‑learning techniques to auto‑tune the weighting parameters (tp, q̂s) based on observed traffic patterns.
In summary, GORGO provides the first practical approach that simultaneously optimizes KV‑cache reuse and inter‑region network latency for LLM inference. Its additive cost model, lightweight per‑region state exchange, and optional centralized proxy achieve substantial reductions in both median and tail TTFT while preserving overall throughput, making it a compelling candidate for production‑grade, globally distributed LLM services.