
FleetOpt: Analytical Fleet Provisioning for LLM Inference with Compress-and-Route as Implementation Mechanism

Huamin Chen (1), Xunzhuo Liu (1), Yuhan Liu (2), Junchen Jiang (3), Bowei He (4)*, Xue Liu (4)

(1) vLLM Semantic Router Project  (2) University of Chicago  (3) Tensormesh Inc / UChicago  (4) MBZUAI / McGill University

2026

Abstract

Modern LLM GPU fleets are provisioned for worst-case context lengths that the vast majority of requests never approach, wasting GPU capacity on idle KV-cache slots. We present FleetOpt, a framework that starts from first principles: given a workload's prompt-length CDF and a P99 TTFT target, derive the minimum-cost fleet analytically, then deploy it in practice. The analytical core models each pool as an M/G/c queue and derives that the minimum-cost fleet is a two-pool architecture, a short-context pool P_s and a long-context pool P_l, with an optimal boundary B*_short satisfying an equal-marginal-GPU-cost condition across both pools. The fundamental barrier to achieving B*_short is the cost cliff: a hard routing step where requests just above B*_short consume 8×–42× more GPU capacity than requests just below it (depending on the context-window ratio), creating a structural disincentive to lower the boundary. Compress-and-Route (C&R) is the implementation mechanism that resolves this barrier. Gateway-layer extractive compression trims borderline requests below B*_short before the engine ever sees them, converting the hard hardware boundary into a software parameter read from the workload CDF. The two components are unified in the FleetOpt offline planner: given a CDF and SLO, it returns the optimal (n*_s, n*_l, B*_short, γ*) in under 1 ms. On three production traces, the combined framework reduces total GPU cost by 6–82% versus a homogeneous fleet, with C&R contributing 1–44 percentage points beyond plain pool routing depending on workload archetype.
The analytical model is validated against a discrete-event simulator (inference-fleet-sim) with ≤3% error on predicted GPU utilization across all pools and workloads.

1 Introduction

Modern LLMs support context windows of 128K tokens or more, yet production traces reveal a persistent mismatch: in the Azure LLM Inference Trace, 80% of requests use fewer than 2K tokens [Patel et al., 2024, Zhang and Shen, 2024, Agrawal et al., 2024c]. A homogeneous fleet provisioned for worst-case context length allocates KV-cache capacity that goes almost entirely unused for the vast majority of requests. Pool routing [Chen et al., 2026b] addresses this by splitting the fleet into short- and long-context pools, cutting GPU cost by 16–38% in scenarios studied by prior work.

* Corresponding author: Bowei.He@mbzuai.ac.ae

The residual problem: the cost cliff. Even with pool routing, a structural inefficiency remains at the pool boundary B_short. The routing decision is binary: a request at L_total = B_short fills one of n^(s)_max short-pool slots; a request one token longer must enter the long pool, which offers only n^(l)_max = 16 slots per GPU, a throughput-capacity penalty of 8×–42× for a single token (Section 2.2). Requests in the borderline band (B_short, γB_short] are not genuinely long; they are RAG payloads with one extra paragraph, or multi-turn sessions whose history just crossed the threshold. For the workloads we study, 4.6–11.2% of all traffic falls in this band. Each borderline request is assigned a long-pool slot provisioned for 64K tokens while using at most 1/ρ of its KV budget.

Our approach: analytical model first, then implement. Prior work treats pool routing and Compress-and-Route (C&R) as independent operational interventions: deploy pool routing, then optionally retrofit C&R on the existing fleet. This framing leaves value on the table.
We take a different path. We start from the analytical optimum: given a workload CDF and a P99 TTFT target, what is the minimum-cost fleet? The M/G/c queueing model answers this precisely: a two-pool architecture with a specific boundary B*_short. But the cost cliff is a barrier to achieving B*_short, since lowering the boundary forces borderline requests into the expensive long pool. C&R is the implementation mechanism that resolves this barrier. By compressing borderline requests below B*_short at the gateway, C&R makes the analytically optimal boundary achievable in practice. The combined framework does not merely retrofit compression onto an existing fleet; it provisions the correct fleet from the start, using C&R to enforce the boundary the model prescribes.

Key results.

1. The minimum-cost fleet is analytically derived. Under the M/G/c model (with server count c = n_gpus × n_max KV slots), the optimal two-pool fleet satisfies an equal-marginal-GPU-cost condition across both pools. GPU counts come from Erlang-C inversion at each pool's arrival rate and service distribution (Section 3).

2. The cost cliff is a barrier to achieving B*_short without compression. At the optimal boundary, the borderline band contains 43–76% of above-threshold traffic for real workloads (4.6–11.2% of all traffic). Without compression, the operator must either raise B_short (losing savings) or over-provision the long pool (paying more) to absorb borderline load.

3. C&R converts the hard boundary into a software knob. Gateway-layer extractive compression trims borderline prompts in 2–7 ms. The compressed token budget is chosen so KV overflow is impossible by construction (Section 5). The effective routing boundary shifts from B_short to γB_short, where γ* comes from the planner sweep.

4. Combined savings: 6–82% vs. homogeneous.
On three production traces, the FleetOpt framework achieves 6.7–82.4% GPU cost reduction versus homogeneous deployment. C&R contributes an incremental 1.2 pp (Agent-heavy), 15.9 pp (LMSYS), and 43.7 pp (Azure) beyond plain pool routing (Section 7).

5. Analytical model validated by DES. inference-fleet-sim [Chen et al., 2026c] confirms analytical GPU utilization predictions match simulation within 3% across all workloads and pools, operating in the many-server regime where queueing delays are negligible.

Paper organization. Section 2 characterizes the cost cliff and workload archetypes. Section 3 develops the analytical fleet model. Section 4 derives optimal fleet sizing and the optimal boundary. Section 5 presents C&R as the implementation mechanism. Section 6 gives the FleetOpt offline planner. Section 7 evaluates on production traces. Sections 8 and 9 discuss related work and conclude.

2 Background: The Cost Cliff and Workload Archetypes

2.1 Pool Routing Basics

A request r is assigned a total token budget L_total = ⌈|r| / ĉ_k⌉ + r.max_output_tokens, where ĉ_k is a per-category bytes-per-token EMA estimate. Pool routing routes r to P_s if L_total ≤ B_short, else to P_l. The short pool is sized for C^(s)_max tokens (e.g., 8K) with n^(s)_max concurrent sequences per GPU; the long pool for C^(l)_max tokens (e.g., 64K) with n^(l)_max.

2.2 KV-Cache Memory and the Cost Cliff

For Llama-3-70B in fp16, the KV cache grows at 320 KB per token across 80 layers. A long-pool slot sized for 64K tokens requires 64,000 × 320 KB ≈ 20.5 GB of GPU memory; the same hardware configured for 8K can host ρ = 8× as many concurrent sequences, delivering ρ = 8× the throughput. The GPU savings formula for pool routing is α(1 − 1/ρ) [Chen et al., 2026b], where α is the fraction of traffic routed to P_s.
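The arithmetic above can be checked in a few lines of Python. This is a minimal sketch with illustrative names; the decimal KB-to-GB conversion mirrors the text's ≈20.5 GB figure.

```python
KV_PER_TOKEN_KB = 320  # Llama-3-70B fp16, 80 layers

def kv_gb_per_slot(context_tokens: int) -> float:
    """KV-cache memory (GB, decimal) reserved per slot for a context window."""
    return context_tokens * KV_PER_TOKEN_KB / 1e6  # KB -> GB

def pool_routing_savings(alpha: float, rho: float) -> float:
    """GPU savings formula alpha * (1 - 1/rho) from Chen et al. [2026b]."""
    return alpha * (1 - 1 / rho)

# A 64K-token slot vs. an 8K-token slot on the same GPU:
long_slot = kv_gb_per_slot(64_000)   # ~20.5 GB
short_slot = kv_gb_per_slot(8_000)   # ~2.6 GB
rho = long_slot / short_slot         # the 8x cliff ratio

print(f"long slot = {long_slot:.1f} GB, cliff = {rho:.0f}x")
print(f"savings at alpha=0.90: {pool_routing_savings(0.90, rho):.1%}")
```

At α = 0.90 and ρ = 8 the formula gives roughly 79% GPU savings, consistent with the high-α, moderate-cliff regime described later.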
The cost cliff is the discontinuity at B_short: a request at B_short + 1 tokens must enter the long pool, which hosts only n^(l)_max = 16 concurrent slots per GPU (sized for 64K tokens), versus n^(s)_max short-pool slots. The cliff ratio ρ = n^(s)_max / n^(l)_max depends on the short-pool context window: ρ = 8× at B_short = 8,192, ρ = 16× at B_short = 4,096, and ρ = 42× at B_short = 1,536 (Table 2). A borderline request at 1.1 × B_short uses only 1/ρ of the long-pool KV budget it was allocated. This is not a flaw of pool routing; it is the unavoidable consequence of provisioning slots for worst-case context length. The only escape is to keep requests below B_short.

Table 1: The cost cliff: throughput capacity consumed per request at and around B_short = 8,192 tokens for Llama-3-70B on A100-80GB (n^(s)_max = 128, n^(l)_max = 16, KV/token = 320 KB, long pool sized for 64K tokens ≈ 20.0 GB per slot). The cliff ratio is ρ = n^(s)_max / n^(l)_max = 8×.

L_total (tokens) | Pool | Slots/GPU | KV utilised           | Cost ratio
8,192            | P_s  | 128       | 100% (2.5 GB / slot)  | 1.0×
8,193            | P_l  | 16        | 12.5% of 20.0 GB      | 8.0×
12,000           | P_l  | 16        | 18.3% of 20.0 GB      | 8.0×
65,536           | P_l  | 16        | 100% (20.0 GB / slot) | 8.0×

2.3 The Borderline Band

The borderline band (B_short, γB_short] contains requests whose prompts exceed B_short only slightly and could be compressed below B_short at bandwidth ratio γ. Let α = F(B_short) (the fraction of requests already in P_s) and β = F(γB_short) − F(B_short) (the borderline fraction). Table 2 shows β across workloads.

Table 2: Borderline fraction β = F(γB_short) − F(B_short) at representative thresholds. α = F(B_short); cliff = n^(s)_max / n^(l)_max for C^(l)_max = 65,536; Agent-heavy is a projected workload (Section 7.1).

Workload           | B_short | α     | γ   | β     | Cliff ρ | Archetype
Azure (2023)       | 4,096   | 0.898 | 1.5 | 0.078 | 16×     | I/II
LMSYS (multi-turn) | 1,536   | 0.909 | 1.5 | 0.046 | 42×     | I/II
Agent-heavy        | 8,192   | 0.740 | 1.5 | 0.112 | 8×      | II

2.4 Workload CDF Archetypes

Three qualitatively different workload shapes determine how the cost cliff manifests and which remediation is appropriate.

Archetype I (Concentrated-below). The CDF has a sharp knee below B_short: F(B_short) ≥ 0.90 and the density f(B_short) is high. The borderline fraction β is moderate in absolute terms (4.6–7.8%), but the fraction of above-threshold traffic that is borderline is large (51–76%). Pool routing already captures most savings; C&R provides meaningful additional savings because the cliff ratio ρ is large (16×–42×).

Archetype II (Dispersed). The CDF spreads across two or more decades of token counts. Significant borderline traffic exists (β = 7–12%), and C&R provides meaningful incremental savings. The agent-heavy trace (RAG + tool-use + code) is the primary representative.

Archetype III (Concentrated-above). The mass of the distribution lies above B_short (e.g., code-agent tasks at 10–50K tokens). The borderline fraction is negligible; the dominant lever is raising B_short before any compression.

3 Analytical Fleet Model

3.1 Queueing Model

We model each pool as an M/G/c queue. Let λ be the total fleet arrival rate. After routing:

λ_s = α′λ, where α′ = α + βp_c    (1)
λ_l = (1 − α′)λ    (2)

Here α = F(B_short), β = F(γB_short) − F(B_short) is the borderline fraction, and p_c ∈ [0, 1] is the compressibility rate (the fraction of borderline requests successfully compressed).

Service time model. GPU iteration latency under continuous batching is

t_iter = W + H · n_slots,    (3)

where W = 8 ms (baseline compute for Llama-3-70B/A100) and H = 0.65 ms/slot (per-slot memory-bandwidth cost) are hardware constants calibrated to the A100-80GB.
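Eq. (3) can be evaluated directly for the two pool configurations used throughout the paper (128 vs. 16 slots per GPU); a minimal sketch with the calibrated constants:

```python
W_MS = 8.0   # baseline compute per iteration (Llama-3-70B / A100), ms
H_MS = 0.65  # per-slot memory-bandwidth cost, ms/slot

def iteration_latency_ms(n_slots: int) -> float:
    """Eq. (3): t_iter = W + H * n_slots under continuous batching."""
    return W_MS + H_MS * n_slots

short_iter = iteration_latency_ms(128)  # about 91 ms with a full short-pool batch
long_iter = iteration_latency_ms(16)    # about 18 ms with a full long-pool batch
```

The fuller short-pool batch pays a longer iteration, but since all slots advance together it still delivers far higher aggregate throughput per GPU.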
Under continuous batching, all n_max slots advance in lockstep each iteration. A request with L_in input tokens and L_out output tokens occupies a slot for

E[S] = (⌈L_in / C_chunk⌉ + L_out) · t_iter    (4)

wall-clock seconds (chunk size C_chunk = 512). The GPU-level throughput is µ_gpu = n_max / E[S] requests per second; the squared coefficient of variation is C²_s = Var[S] / (E[S])², estimated by Monte Carlo sampling from the pool's request distribution.

M/G/c tail-wait approximation. We model a pool with n GPUs as an M/G/c queue with c = n · n_max total KV slots as servers, each with per-slot service rate µ = 1/E[S]. The offered load is ϱ = λ_p / (cµ) < 1. The Erlang-C probability (the probability that a new request must wait for a slot) is

C(c, ϱ) = [(cϱ)^c / (c!(1 − ϱ))] / [Σ_{k=0}^{c−1} (cϱ)^k / k! + (cϱ)^c / (c!(1 − ϱ))].    (5)

The Kimura [Kimura, 1994] M/G/c approximation gives the P99 queue waiting time:

W_99(c, µ, C²_s) = ln(C(c, ϱ) / 0.01) · (1 + C²_s) / (2(cµ − λ_p)).    (6)

In the many-server regime (c ≫ 1), C(c, ϱ) ≈ 0 and W_99 ≈ 0: slots are almost always available and queueing delays are negligible. Fleet sizing is then dominated by the utilization cap ρ_max (Section 4.1).

3.2 TTFT Decomposition

Time-to-First-Token decomposes as

TTFT = W_queue + T_prefill + T_first-decode,    (7)

where W_queue is the queueing delay before a slot is available, T_prefill = ⌈L_in / C_chunk⌉ · t_iter is the physical prefill time (wall-clock, independent of batch size since all n_max slots run in parallel), and T_first-decode = t_iter is one decode step. The SLO constraint is

W_99(c, µ, C²_s) ≤ T_slo − T^(99)_prefill − t_iter,    (8)

where T^(99)_prefill is the P99 prefill time computed from the pool's request length distribution.

3.3 Cost Model

GPU cost per unit time is c_s per short-pool GPU and c_l per long-pool GPU.
Let ϕ = c_l / c_s be the GPU cost ratio. Total annualized cost for a fleet of n_s and n_l GPUs is

C(n_s, n_l) = c_s n_s + c_l n_l.    (9)

The provisioning problem is

min_{n_s, n_l ∈ Z_{>0}} C(n_s, n_l)  s.t. Eq. (8) holds for both pools.    (10)

4 Optimal Fleet Sizing

4.1 Per-Pool Sizing

Because the two pools are independent M/G/c queues with fixed arrival rates λ_s and λ_l, problem (10) separates into two independent minimizations. For each pool, the minimum GPU count is

n* = min{ c ∈ Z_{>0} : W_99(c, µ, C²_s) ≤ T_slo,eff },    (11)

where T_slo,eff is the SLO budget after subtracting P99 prefill time per Eq. (8), and we additionally enforce n* ≥ ⌈λ_p / (ρ_max µ)⌉ for a utilization cap ρ_max = 0.85 to ensure analytical stability.

4.2 The Cost Cliff Prevents Achieving B*_short

Differentiating total cost C with respect to B_short and setting it to zero (treating n* as a smooth function of λ_p) gives the first-order condition for the optimal boundary:

c_s ∂n*_s/∂λ_s = c_l ∂n*_l/∂λ_l.    (12)

The λf(B_short) factor cancels from both sides, leaving the condition: the marginal cost of routing one additional request to the short pool must equal the marginal saving of removing one request from the long pool.

Proposition 1 (Optimal boundary: equal marginal GPU cost). Under the M/G/c cost model with ρ_max-constrained sizing, the provisioning-optimal B*_short satisfies Eq. (12): equal marginal GPU cost per unit traffic in both pools. For a homogeneous fleet (c_s = c_l, same GPU type), this holds when both pools operate at the same utilization level.

The problem: at any B_short, the borderline band (B_short, γB_short] contains 43–76% of above-threshold traffic for real workloads.
Without a mechanism to redirect borderline requests to P_s, the long pool must absorb this traffic at ρ = 8–42× higher GPU cost per request, requiring significantly more GPUs than the analytical optimum. C&R resolves this by making the analytically optimal boundary achievable in practice.

4.3 Optimal Compression Bandwidth γ*

With C&R active, borderline requests at bandwidth γ are redirected to P_s. The resulting α′ = α + βp_c shifts λ_s upward and λ_l downward. The first-order condition for the optimal γ is

c_s ∂n*_s/∂λ_s · p_c · λf(γB_short) = c_l ∂n*_l/∂λ_l · p_c · λf(γB_short),    (13)

which again reduces to the equal-marginal-cost condition Eq. (12). In practice γ* is found by a discrete sweep over γ ∈ {1.0, 1.1, ..., 2.0}, because n* is integer-valued and the long-pool service rate must be recalibrated for the post-compression distribution at each γ. The sweep is fast (< 1 ms).

For workloads where most above-threshold traffic is borderline (Archetype I/II), γ* tends toward large values (2.0), as compressing more traffic into P_s shrinks the expensive long pool. For Archetype II with a dispersed above-threshold distribution, γ* reflects the balance between short-pool overheads and long-pool savings.

Theorem 2 (Co-design is never worse than retrofit). Let C_retro be the cost of a fleet sized for pool routing at γ = 1 with C&R later deployed, and C_co be the cost of a fleet co-designed with C&R at the same γ. Then C_co ≤ C_retro.

Proof sketch. The co-designed fleet solves the same minimization problem as the retrofitted fleet but with the additional freedom to reduce n*_l knowing that λ_l will be lower. Hence the feasible set is larger and the minimum cost is weakly lower. □

5 Compress-and-Route as Implementation Mechanism

C&R is the gateway-layer component that makes Proposition 1's optimal boundary achievable in practice.
Rather than being a separate system, it is the implementation of the optimal fleet boundary derived in Section 4.

5.1 The Virtual Pool

C&R shifts the effective routing boundary from B_short to γB_short by compressing borderline requests at the gateway. A request r with B_short < L_total ≤ γB_short is intercepted; its prompt is compressed to a token budget T_c = B_short − L_out and re-routed to P_s. From the engine's perspective, P_s appears to have a higher effective C_max: the virtual pool capacity is γB_short without any hardware change. The GPU savings gain is

Δα = βp_c,  α′ = α + βp_c,    (14)

and the additional GPU savings beyond pool routing are Δα(1 − 1/ρ) = βp_c(1 − 1/ρ).

5.2 Extractive Compression Pipeline

The compressor is a pure classical-NLP extractive pipeline requiring no LLM inference. Given a borderline prompt x and token budget T_c = B_short − L_out, it:

1. Splits x into sentences with Unicode-aware heuristics.
2. Scores each sentence by a composite of TextRank (w = 0.20) [Mihalcea and Tarau, 2004], Position (w = 0.40), TF-IDF (w = 0.35) [Li et al., 2023a], and Novelty (w = 0.05).
3. Greedily selects sentences in score order, always retaining the first 3 and last 2 (primacy/recency invariant).
4. Stops when the cumulative token count reaches T_c.

Hard OOM guarantee. The budget T_c = B_short − L_out is set by construction, so no compressed request can overflow P_s's KV cache:

T_c + L_out = B_short.    (15)

Content-type safety gate. Compression is applied only to content categories where structural extraction is semantically safe: RAG and prose. Code is excluded. The category signal reuses the per-request EMA estimate from the base router at zero additional overhead.

Compression latency. Measured on an Intel Xeon Platinum 8568Y+, end-to-end compression takes 2–7 ms per borderline request. Weighted across all traffic, the mean overhead is ≤ 0.58 ms per request.
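The selection logic of steps 1–4 can be sketched as follows. The whitespace token count and the pre-computed scores argument are simplifying assumptions for illustration; the production pipeline uses the real tokenizer and the TextRank/Position/TF-IDF/Novelty composite described above.

```python
def compress_extractive(sentences, scores, budget_tokens,
                        keep_head=3, keep_tail=2):
    """Greedy budgeted sentence selection with the primacy/recency invariant.

    `sentences` are pre-split (step 1); `scores` are the composite scores
    from step 2. Token counts use a whitespace approximation (a stand-in
    for the real tokenizer). Returns kept sentences in document order.
    """
    n = len(sentences)
    tokens = [len(s.split()) for s in sentences]

    # Invariant: always retain the first keep_head and last keep_tail sentences.
    forced = set(range(min(keep_head, n))) | set(range(max(0, n - keep_tail), n))
    selected = set(forced)
    used = sum(tokens[i] for i in selected)

    # Step 3: greedily add remaining sentences in descending score order,
    # step 4: skipping any that would exceed the budget T_c.
    for i in sorted(set(range(n)) - forced, key=lambda i: -scores[i]):
        if used + tokens[i] <= budget_tokens:
            selected.add(i)
            used += tokens[i]

    return [sentences[i] for i in sorted(selected)]
```

The caller passes budget_tokens = B_short − L_out, which is what makes the hard OOM guarantee of Eq. (15) hold by construction.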
Fidelity. A fidelity study on 300 LMSYS-Chat-1M prompts from the Agent-heavy borderline band (B_short = 8,192, γ = 1.5, band 8K–12K tokens) gives compressibility p_c = 1.00 for prose/RAG content, BERTScore F1 = 0.884, ROUGE-L recall = 0.856, and TF-IDF cosine similarity = 0.981 at a mean 15.4% token reduction (Appendix C). Code is excluded from compression.

6 The FleetOpt Offline Planner

Algorithm 1 gives the complete FleetOpt planner. It takes as input the workload CDF F, arrival rate λ, SLO T_slo, GPU profile (W, H, n^(s)_max, n^(l)_max, C^(s)_max, C^(l)_max), and cost ratio ϕ. It returns the optimal (n*_s, n*_l, B*_short, γ*).

Algorithm 1 FleetOpt Offline Planner
Require: CDF F, λ, T_slo, GPU profile, ϕ, candidate threshold set B (hardware-feasible B_short values)
Ensure: (n*_s, n*_l, B*_short, γ*)
 1: for B ∈ B do                                          ▷ outer sweep over candidate boundaries
 2:   for γ ∈ {1.0, 1.1, ..., 2.0} do
 3:     α′ ← F(B) + [F(γB) − F(B)] · p_c
 4:     λ_s ← α′λ; λ_l ← (1 − α′)λ
 5:     Calibrate µ_s, C²_s,s from F restricted to [1, B]      ▷ short pool
 6:     Calibrate µ_l, C²_s,l from F restricted to (γB, ∞)     ▷ post-compression long pool
 7:     n_s ← Erlang-C inversion (Eq. (11)) for (λ_s, µ_s, C²_s,s)
 8:     n_l ← Erlang-C inversion (Eq. (11)) for (λ_l, µ_l, C²_s,l)
 9:     cost[B, γ] ← c_s n_s + c_l n_l
10:   end for
11: end for
12: (B*, γ*) ← argmin_{B ∈ B, γ} cost[B, γ]
    return (n*_s[B*, γ*], n*_l[B*, γ*], B*, γ*)

Candidate set B. In practice B_short is hardware-constrained: it determines n^(s)_max = n^calib_max × C^calib / B_short, which must be a positive integer. The search is therefore over the finite set of CDF breakpoints that yield valid n^(s)_max values (typically 5–15 candidates per workload). The total sweep, all B ∈ B times all γ, completes in under 1 ms.
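The planner loop, together with the Erlang-C inversion it calls, can be sketched in Python. This is a sketch under stated assumptions, not the paper's implementation: the `calibrate` callback and uniform cost defaults are illustrative stand-ins for the Monte Carlo calibration and per-pool GPU prices, and the Erlang-C value is computed with a standard numerically stable recurrence rather than Eq. (5)'s factorial form.

```python
import math

def erlang_c(c: int, rho: float) -> float:
    """Eq. (5): P(wait) for c servers at offered load rho < 1, via the
    Erlang-B recurrence (avoids overflowing factorials for large c)."""
    a = c * rho  # offered load in Erlangs
    b = 1.0
    for k in range(1, c + 1):
        b = a * b / (k + a * b)        # Erlang-B recurrence
    return b / (1 - rho + rho * b)     # convert Erlang-B to Erlang-C

def w99(c, mu, lam, cs2):
    """Eq. (6): Kimura P99 wait, ln(C/0.01) * (1 + cs2) / (2(c*mu - lam))."""
    p_wait = erlang_c(c, lam / (c * mu))
    if p_wait <= 0.01:                 # many-server regime: no P99 wait
        return 0.0
    return math.log(p_wait / 0.01) * (1 + cs2) / (2 * (c * mu - lam))

def min_gpus(lam, mu_slot, cs2, n_max, t_slo_eff, rho_max=0.85):
    """Eq. (11) with the utilization floor n >= ceil(lam / (rho_max*n_max*mu))."""
    n = max(1, math.ceil(lam / (rho_max * n_max * mu_slot)))
    while w99(n * n_max, mu_slot, lam, cs2) > t_slo_eff:
        n += 1
    return n

def plan(F, lam, thresholds, gammas, p_c, calibrate, cost_s=1.0, cost_l=1.0):
    """Algorithm 1: sweep (B, gamma). `calibrate(pool, bound)` returns
    (mu_slot, cs2, n_max, t_slo_eff) for the pool serving that length range
    (a stand-in for the Monte Carlo calibration of steps 5-6)."""
    best = None
    for B in thresholds:
        for g in gammas:
            a_prime = F(B) + (F(g * B) - F(B)) * p_c   # step 3
            ls, ll = a_prime * lam, (1 - a_prime) * lam
            ns = min_gpus(ls, *calibrate("short", B))        # step 7
            nl = min_gpus(ll, *calibrate("long", g * B))     # step 8
            cost = cost_s * ns + cost_l * nl                 # step 9
            if best is None or cost < best[0]:
                best = (cost, ns, nl, B, g)
    return best
```

Because `erlang_c` vanishes quickly once c is in the hundreds, `min_gpus` usually returns the utilization-floor value immediately, matching the many-server observation in Section 3.1.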
Proposition 1 provides the first-order condition that characterises B*_short analytically; the sweep finds the integer-optimal solution consistent with hardware granularity.

Critical: µ_l recalibration. Step 6 recalibrates the long-pool service rate from the post-compression request distribution (requests with L_total > γB*_short), not the full above-threshold distribution. Compressing borderline requests out of the long pool hardens the remaining distribution: longer mean token length, lower µ_l. Skipping this recalibration systematically overestimates the savings from larger γ. With correct recalibration, the planner finds γ* = 2.0 for Archetype I/II workloads (Azure, LMSYS), where the long pool shrinks substantially, and γ* = 1.5 for the dispersed Agent-heavy workload, where gains plateau.

Erlang-C inversion. The minimum server count c satisfying Eq. (8) is found by binary search over the interval [⌈a/ρ_max⌉, 10⌈a⌉], where a = λ_p / µ_gpu, with utilization cap ρ_max = 0.85.

7 Evaluation

7.1 Setup

Workload traces. We evaluate on two public traces and one synthetic trace built from published workload statistics.

Azure LLM Inference Trace 2023 [Patel et al., 2024] contains 28,185 requests (8,819 coding, 19,366 conversational). Mean L_total = 1,588 tokens; p90 = 4,242; p99 = 7,445. We use B_short = 4,096, where α = 0.898 and β = 0.078, giving a 16× cliff and a meaningful two-pool split (Archetype I/II).

LMSYS-Chat-1M (multi-turn) [Zheng et al., 2024] uses accumulated context at each turn. We use B_short = 1,536, where α = 0.909 and β = 0.046, giving a 42× cliff (Archetype I/II).

Agent-heavy is a synthetic trace derived from published statistics for SWE-bench [Jimenez et al., 2024] (40%), BFCL [Yan et al., 2024] (25%), and RAG pipelines [Lewis et al., 2020] (35%). Mean L_total = 6,511 tokens; p50 = 4,096; p90 = 16,384; p99 = 32,768. At B_short = 8,192, α = 0.740, β = 0.112 (Archetype II, the primary scenario for C&R benefit). The SWE-bench component consists largely of code; code requests are excluded from compression (Section 5.2). The effective compression success rate for Agent-heavy borderline traffic is p_c = 0.75, reflecting that approximately 25% of borderline requests are code-category and cannot be compressed.

Simulation parameters. W = 8 ms, H = 0.65 ms/slot, C_chunk = 512, calibrated to Llama-3-70B on A100-80GB (8-GPU tensor-parallel node). The long pool is sized for C^(l)_max = 65,536 tokens, giving n^(l)_max = 16 concurrent slots per GPU and KV memory ≈ 20.0 GB per slot (320 KB/token). Short-pool n^(s)_max depends on B_short: 256 at 4K, 682 at 1.5K, 128 at 8K. GPU cost is $2.21/GPU-hr. Fleet arrival rate λ = 1,000 req/s; P99 TTFT target T_slo = 500 ms unless stated otherwise. DES validation uses inference-fleet-sim [Chen et al., 2026c].

Baselines.

1. Homogeneous: a single pool sized for 64K context.
2. Pool routing (PR): two pools at workload-specific B_short (Table 2), γ = 1.0 (no compression).
3. PR + C&R (retrofit): C&R at γ = 1.5 deployed on the PR fleet, co-sized for the compressed λ_s.
4. FleetOpt (co-design): fleet co-designed by Algorithm 1 at optimal γ*.

7.2 Fleet GPU Savings vs. Homogeneous

Table 3 shows fleet GPU counts and GPU savings versus the homogeneous baseline. The FleetOpt fleet is the outcome of the complete framework: optimal analytical sizing plus C&R.

Table 3: Fleet GPU counts and annualized cost at λ = 1,000 req/s, ρ_max = 0.85, homogeneous A100-80GB sized for 64K context. Cost at $2.21/GPU-hr × 8,760 hr/yr. γ* from Algorithm 1. p_c = 1.0 for Azure and LMSYS (prose/RAG borderline traffic); p_c = 0.75 for Agent-heavy (25% of borderline traffic is code, excluded from compression). See Table 2 for B_short, α, β.

Workload                      | Method              | n_s | n_l   | Total | Ann. cost (K$) | Savings
Azure (B_short = 4,096)       | Homogeneous         | —   | —     | 284   | 5,498          | —
                              | Pool routing (PR)   | 43  | 131   | 174   | 3,369          | 38.7%
                              | PR + C&R (γ = 1.5)  | 47  | 45    | 92    | 1,781          | 67.6%
                              | FleetOpt (γ* = 2.0) | 48  | 2     | 50    | 968            | 82.4%
LMSYS (B_short = 1,536)       | Homogeneous         | —   | —     | 139   | 2,691          | —
                              | Pool routing (PR)   | 7   | 74    | 81    | 1,568          | 41.7%
                              | PR + C&R (γ = 1.5)  | 7   | 65    | 72    | 1,394          | 48.2%
                              | FleetOpt (γ* = 2.0) | 7   | 52    | 59    | 1,142          | 57.6%
Agent-heavy (B_short = 8,192) | Homogeneous         | —   | —     | 2,397 | 46,405         | —
                              | Pool routing (PR)   | 229 | 2,037 | 2,266 | 43,869         | 5.5%
                              | PR + C&R (γ = 1.5)  | 255 | 1,981 | 2,236 | 43,288         | 6.7%
                              | FleetOpt (γ* = 1.5) | 255 | 1,981 | 2,236 | 43,288         | 6.7%

Savings decomposition. The numbers in Table 3 follow a clear pattern. Pool routing alone saves 5.5–41.7%, with the biggest gains when α is high and the cliff ratio ρ is large. C&R with γ = 1.5 adds 29 percentage points for Azure (because β = 7.8% moves 76% of above-threshold traffic into the efficient short pool, and ρ = 16× makes each redirected request highly valuable) but only 1.2 pp for Agent-heavy (because 26% of all traffic remains above γB_short, requiring a large long pool regardless). FleetOpt at γ* = 2.0 for Azure nearly eliminates the long pool (2 GPUs), achieving 82.4% savings.

When does C&R add value? The incremental GPU saving from adding C&R to pool routing is Δα(1 − 1/ρ) = βp_c(1 − n^(l)_max / n^(s)_max). This is large when the borderline fraction β is significant and the cliff ratio ρ is large. For Agent-heavy, β = 11.2% is moderate, but ρ = 8× is the smallest cliff and p_c = 0.75 (code exclusion) limits the per-request gain. For Azure, β = 7.8% paired with ρ = 16× yields large incremental gains.

Why co-design equals retrofit for Agent-heavy. Table 3 shows that for Agent-heavy, the PR + C&R retrofit at γ = 1.5 and FleetOpt at γ* = 1.5 produce identical fleets.
This is expected: when the planner's optimal γ* matches the retrofit's γ, the two approaches arrive at the same (n_s, n_l) by construction. Theorem 2 is satisfied (co-design ≤ retrofit), but the inequality is tight. The co-design advantage is largest when γ* exceeds the retrofit's γ, which occurs when most above-threshold traffic is borderline (Archetype I/II, Azure and LMSYS). For Agent-heavy (Archetype II with dispersed above-threshold traffic), γ* = 1.5 is already the practical limit: compressing further leaves a harder long-pool distribution and yields no net saving.

7.3 Compression Pipeline Latency

The compressor is applied only to borderline requests, so the mean overhead per request is β × t_compress. Table 4 shows the latency profile.

Table 4: End-to-end compressor latency (ms), measured on an Intel Xeon Platinum 8568Y+ single core. "Overhead/req" is the mean added latency across all requests, weighted by β.

Workload    | B_short | β     | p50    | p95    | p99    | Overhead/req
Azure       | 4,096   | 0.078 | 1.8 ms | 4.2 ms | 6.5 ms | < 0.2 ms
LMSYS       | 1,536   | 0.046 | 1.2 ms | 3.1 ms | 5.2 ms | < 0.1 ms
Agent-heavy | 8,192   | 0.112 | 3.4 ms | 6.2 ms | 7.8 ms | 0.39 ms

Even in the worst case (Agent-heavy), the 0.39 ms average overhead is invisible against a 500 ms TTFT budget.

7.4 Analytical Model Validation

We validate the FleetOpt analytical model against inference-fleet-sim [Chen et al., 2026c], an open-source discrete-event simulator for heterogeneous LLM GPU fleets. The simulator drives Poisson arrivals from the empirical CDF with λ = 1,000 req/s and records the fraction of slot-time that KV-cache slots are busy (GPU utilization ρ̂). Table 5 compares the analytical utilization ρ_ana = λ_p / (n · µ_gpu) against the DES-measured ρ̂ for the pool-routing (γ = 1) fleet from Table 3.

Table 5: Analytical vs. DES GPU utilization ρ at λ = 1,000 req/s, pool-routing fleet (γ = 1). "Error" = (ρ_ana − ρ̂) / ρ̂. DES uses 30,000 requests per pool.

Workload    | Pool  | n GPUs | ρ_ana | ρ̂ (DES) | Error
Azure       | Short | 43     | 0.848 | 0.865   | −2.1%
            | Long  | 131    | 0.845 | 0.847   | −0.1%
LMSYS       | Short | 7      | 0.771 | 0.792   | −2.7%
            | Long  | 74     | 0.845 | 0.853   | −1.0%
Agent-heavy | Short | 229    | 0.848 | 0.868   | −2.2%
            | Long  | 2,037  | 0.850 | 0.850   | −0.1%

The analytical model predicts GPU utilization within 3% of the DES across all pools and workloads. Both analytical and DES values sit near ρ_max = 0.85, confirming that the ρ_max-constrained sizing hits its target. The analytical model is slightly optimistic: actual utilization runs 0.1–2.7% above prediction in all cases. In practice this means the provisioned fleet is just barely busier than expected; adding 1–3 GPUs per pool eliminates even this small gap.

These fleets operate in the many-server regime: total KV slots c = n · n_max range from 112 to 32,592 across our configurations. At that scale the Erlang-C probability C(c, ϱ) ≪ 1, slot-wait times are negligible, and fleet sizing is dominated entirely by the utilization cap ρ_max. In this regime the full M/G/c + Kimura apparatus reduces to the simpler bound n* ≈ ⌈λ_p / (ρ_max µ)⌉; the queueing machinery becomes load-bearing only for smaller, lightly provisioned fleets (few GPUs, high ϱ).

P99 TTFT. Because queue waits are negligible, TTFT is dominated by prefill time. Analytical P99 TTFT for the PR fleet: Azure short 20 ms / long 80 ms; LMSYS short 11 ms / long 48 ms; Agent-heavy short 47 ms / long 220 ms, all comfortably within the 500 ms SLO. The FleetOpt planner enforces the SLO constraint (Eq. (8)) directly; in the many-server regime this constraint is non-binding and the 500 ms SLO is met with a large margin across all configurations.

7.5 Arrival-Rate Sensitivity

Table 6 shows how fleet sizes scale with arrival rate for the Agent-heavy workload.
The savings from po ol routing (5.4–5.5%) and FleetOpt (6.2–6.8%) are stable across a 20 × range of ar- riv al rates, confirming that th e prop ortional GPU savings from the t wo-po ol arc hitecture scale linearly with load. T able 6: Fleet size and savings vs. arriv al rate λ (Agen t-heavy , B short = 8 , 192 , γ ∗ from Algo- rithm 1 ). λ (req/s) Homo PR FleetOpt ( γ ∗ ) PR saving FleetOpt saving 100 240 227 225 ( γ ∗ = 1 . 2 ) 5.4% 6.2% 200 480 454 448 ( γ ∗ = 1 . 5 ) 5.4% 6.7% 500 1,199 1,134 1,119 ( γ ∗ = 1 . 5 ) 5.4% 6.7% 1,000 2,397 2,266 2,236 ( γ ∗ = 1 . 5 ) 5.5% 6.7% 2,000 4,794 4,531 4,470 ( γ ∗ = 1 . 5 ) 5.5% 6.8% 8 Related W ork P o ol routing and length-sp ecialized serving. Chen et al. [ 2026b ] establishes p o ol routing and the GPU savings formula α (1 − 1 /ρ ) . Y uan et al. [ 2025 ] indep enden tly v alidates length- sp ecialized partitioning. She et al. [ 2026 ] and Agraw al et al. [ 2024a ] address length heterogeneity at the engine lay er; this pap er addresses the upstream provisioning decision. Prompt compression as routing lev er. Chen et al. [ 2026a ] first uses prompt compression as an op erational fleet routing lever, studying C&R as a retrofit on a fixed fleet. Here w e treat C&R as the implementation me chanism for the analytically-deriv ed optimal fleet b ound- ary . The t wo are complemen tary: the analytical mo del prescrib es B ∗ short , and C&R makes it ac hiev able without hardware changes. LLM capacit y planning. Romero et al. [ 2021 ] and Gujarati et al. [ 2020 ] study model- v ariant selection and predictiv e scaling, resp ectiv ely . P atel et al. [ 2024 ] analyzes prefill/deco de disaggregation but do es not optimize fleet size jointly with routing. FleetOpt is the first system to jointly derive the optimal p o ol b oundary and GPU counts from a workload CDF. Queueing mo dels for serving. The M/G/ c queue and Erlang-C formula are classical [ Harc hol- Balter , 2013 ]. Shahrad et al. 
[2020] applies queueing to serverless cold-start; Li et al. [2023b] to model placement. FleetOpt applies M/G/c to the joint pool-routing-plus-compression provisioning problem, with the Kimura [Kimura, 1994] correction for service-time variance.

Fleet-level LLM simulation. Vidur [Agrawal et al., 2024b] and similar tools simulate single inference-engine instances. inference-fleet-sim [Chen et al., 2026c] provides fleet-level DES for heterogeneous multi-pool deployments and is used here to validate the FleetOpt analytical model.

9 Conclusion

Provisioning LLM GPU fleets for worst-case context lengths is expensive and largely avoidable. FleetOpt tackles this by starting from first principles: derive the minimum-cost fleet analytically, then build it. The analytical core (an M/G/c queueing model with c = n · n_max total KV slots, Erlang-C inversion) shows that the optimal fleet is a two-pool architecture whose boundary B*_short satisfies an equal marginal GPU cost condition. The fundamental barrier to achieving B*_short in practice is the cost cliff: borderline requests at the optimal boundary pay ρ = 8–42× the throughput cost of requests just below it. Compress-and-Route (C&R) resolves this barrier by compressing borderline prompts at the gateway, converting a hardware constraint into a software parameter.

The combined framework saves 6–82% in GPU cost versus homogeneous deployment, depending on the workload archetype and cliff ratio. Pool routing alone saves 5.5–41.7%; C&R adds 1.2–43.7 percentage points beyond pool routing. The largest gains arise when β is significant and the cliff ratio ρ is large: Azure at B_short = 4,096 achieves 82.4% savings with ρ = 16× and γ* = 2.0. For Agent-heavy at B_short = 8,192 with ρ = 8×, gains are more modest (6.7%) because 26% of traffic remains above γ B_short.
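The Erlang-C inversion at the analytical core (Appendix A gives the log-space form) can be sketched compactly. The sketch below substitutes the standard Erlang-B recursion for the paper's log-space factorial sum, which avoids overflow the same way; the 0.01 cap on the wait probability is an illustrative choice, not the paper's SLO constraint.

```python
import math

def erlang_c(c, a):
    """P(wait) for an M/M/c system with c slots and offered load a = lam/mu.
    Computed via the numerically stable Erlang-B recursion (no factorials)."""
    if a >= c:
        return 1.0
    b = 1.0
    for k in range(1, c + 1):
        b = a * b / (k + a * b)          # Erlang-B recursion
    return c * b / (c - a * (1.0 - b))   # Erlang-B -> Erlang-C

def min_slots(lam, mu, rho_max=0.85, wait_cap=0.01):
    """Smallest slot count c meeting both the utilization cap rho_max
    and a cap on the slot-wait probability C(c, rho)."""
    a = lam / mu
    c = max(1, math.ceil(a / rho_max))   # start at the utilization-cap floor
    while erlang_c(c, a) > wait_cap:     # C(c, a) is decreasing in c
        c += 1
    return c
```

Because C(c, a) is monotone in c, the linear scan above could be replaced by the binary search the paper describes without changing the result.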
How much C&R co-design adds beyond pool routing depends on three factors: the cliff ratio ρ, the borderline fraction β, and the compressibility p_c. None of these can be read off a GPU spec sheet; they require a calibrated workload CDF and recalibration of μ_l for the post-compression distribution. inference-fleet-sim [Chen et al., 2026c] validates this within 3% utilization error. Skipping the recalibration leads to over-optimistic savings estimates and under-provisioned fleets. The FleetOpt planner outputs the optimal (n*_s, n*_l, B*_short, γ*) in under 1 ms, making it practical to re-run whenever the workload CDF shifts.

References

Amey Agrawal, Nikhil Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve. In Proc. OSDI, 2024a.

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee, and Alexey Tumanov. Vidur: A large-scale simulation framework for LLM inference, 2024b.

Amey Agrawal, Haoran Qiu, Junda Chen, Íñigo Goiri, Chaojie Zhang, Rayyan Shahid, Ramachandran Ramjee, Alexey Tumanov, and Esha Choukse. No request left behind: Tackling heterogeneity in long-context LLM inference with Medha, 2024c.

Huamin Chen, Xunzhuo Liu, Junchen Jiang, Bowei He, and Xue Liu. Compress-and-Route: Routing-layer prompt compression against the long-context cost cliff in LLM inference fleets. arXiv preprint, 2026a. Manuscript under review.

Huamin Chen, Xunzhuo Liu, Junchen Jiang, Bowei He, and Xue Liu. Token-budget-aware pool routing for cost-efficient LLM inference. arXiv preprint, 2026b. Manuscript under review.

Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, and Xue Liu. inference-fleet-sim: A queueing-theory-grounded fleet capacity planner for LLM inference, 2026c.
[Replace 2603.XXXXX with the assigned arXiv ID.]

Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. Serving DNNs like Clockwork: Performance predictability from the bottom up. In Proc. OSDI, 2020.

Mor Harchol-Balter. Performance Modeling and Design of Computer Systems: Queueing Theory in Action. Cambridge University Press, 2013.

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In Proc. ICLR, 2024.

Toshikazu Kimura. Two-moment approximations for the mean waiting time in the M/G/c queue. J. Oper. Res. Soc. Japan, 37(3):238–256, 1994.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proc. NeurIPS, 2020.

Yucheng Li, Bo Dong, Chenghua Lin, and Frank Guerin. Compressing context to enhance inference efficiency of large language models. In Proc. EMNLP, 2023a.

Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. AlpaServe: Statistical multiplexing with model parallelism for deep learning serving. In Proc. OSDI, 2023b.

Rada Mihalcea and Paul Tarau. TextRank: Bringing order into texts. In Proc. EMNLP, 2004.

Pratyush Patel, Esha Choukse, Chaojie Zhang, Íñigo Goiri, Brijesh Warrier, Nithish Mahalingam, and Ricardo Bianchini. Splitwise: Efficient generative LLM inference using phase splitting. In Proc. ISCA, 2024.

Francisco Romero, Qian Li, Neeraja J. Yadwadkar, and Christos Kozyrakis. INFaaS: Automated model-less inference serving. In Proc. USENIX ATC, 2021.
Mohammad Shahrad, Rodrigo Fonseca, Íñigo Goiri, Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark Russinovich, and Ricardo Bianchini. Serverless in the wild: Characterizing and optimizing the serverless workload at a large cloud provider. In Proc. USENIX ATC, 2020.

Jianshu She, Zonghang Li, Hongchao Du, Shangyu Wu, Wenhao Zheng, Eric Xing, Zhengzhong Liu, Huaxiu Yao, Jason Xue, and Qirong Ho. LAPS: A length-aware-prefill LLM serving system, 2026.

Fanjia Yan, Huanzhi Mao, Charlie Cheng-Jie Ji, Tianjun Zhang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. BFCL: Berkeley function-calling leaderboard. https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html, 2024.

Yitao Yuan, Chenqi Zhao, Bohan Zhao, Zane Cao, Yongchao He, and Wenfei Wu. CascadeInfer: Low-latency and load-balanced LLM serving via length-aware scheduling, 2025.

Zeyu Zhang and Haiying Shen. PecSched: Preemptive and efficient cluster scheduling for LLM inference, 2024.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, and Hao Zhang. LMSYS-Chat-1M: A large-scale real-world LLM conversation dataset. In Proc. ICLR, 2024.

A Erlang-C Inversion: Algorithm and Numerical Stability

The minimum feasible GPU count n* is found by binary search over c using the numerically stable recursive Erlang-C form:

\[
C(c, \varrho) = \frac{1}{1 + (1 - \varrho)\, c! \sum_{k=0}^{c-1} (c\varrho)^{k-c} / k!},
\tag{16}
\]

computed in log-space to avoid overflow at large c. The utilization cap ρ_max = 0.85 ensures we always search above the point where the Kimura approximation loses accuracy and the Erlang-C formula diverges as ϱ → 1.

B Proof of Proposition 1

Total cost as a function of B_short:

\[
C(B_{\mathrm{short}}) = c_s\, n_s^*\big(F(B_{\mathrm{short}})\,\lambda\big) + c_l\, n_l^*\big((1 - F(B_{\mathrm{short}}))\,\lambda\big).
\]
Differentiating with respect to B_short and applying the chain rule:

\[
\frac{dC}{dB_{\mathrm{short}}} = \lambda f(B_{\mathrm{short}}) \left[ c_s \frac{\partial n_s^*}{\partial \lambda_s} - c_l \frac{\partial n_l^*}{\partial \lambda_l} \right].
\tag{17}
\]

Since λ f(B_short) > 0 for any interior boundary, the first-order condition dC/dB_short = 0 requires the bracketed term to be zero:

\[
c_s \frac{\partial n_s^*}{\partial \lambda_s} = c_l \frac{\partial n_l^*}{\partial \lambda_l}.
\tag{18}
\]

This is an equal marginal GPU cost condition: the marginal GPU cost of routing one additional request-per-second to the short pool equals the marginal GPU saving from removing one request-per-second from the long pool. For a homogeneous fleet (c_s = c_l, same GPU type), the condition simplifies to ∂n*_s/∂λ_s = ∂n*_l/∂λ_l. Under the ρ_max-constrained sizing regime (which dominates in practice, as shown in Section 7.4), n* ≈ ⌈λ_p / (ρ_max μ)⌉, so ∂n*/∂λ_p ≈ 1/(ρ_max μ). The first-order condition then requires μ_s = μ_l, i.e., both pools have the same service rate. This is generally not achievable for a given B_short because the short and long pools serve different request-length distributions; in practice B*_short is found numerically by the sweep in Algorithm 1. □

C Compression Fidelity

Table 7 summarizes fidelity measurements on 300 borderline prompts drawn from LMSYS-Chat-1M at the Agent-heavy configuration (B_short = 8,192, γ = 1.5, borderline band 8,192–12,288 tokens). For the Azure and LMSYS evaluation configurations, which use smaller B_short values (4,096 and 1,536, respectively), the borderline band is narrower and prompts are shorter, so fidelity is at least as good as reported here. All measurements used the compressor described in Section 5.2.

Table 7: Compression fidelity on 300 LMSYS-Chat-1M borderline prompts. p_c: fraction successfully compressed within budget; BERTScore F1: semantic similarity (RoBERTa-large); ROUGE-L R: longest-common-subsequence recall; TF-IDF cos: token-overlap cosine similarity.
Metric                  Mean    p10    p50    p90
p_c (compressibility)   1.00     —      —      —
BERTScore F1            0.884  0.831  0.891  0.934
ROUGE-L recall          0.856  0.783  0.861  0.921
TF-IDF cosine           0.981  0.963  0.984  0.996
Token reduction         15.4%   6.1%  14.2%  26.3%
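The token-overlap and token-reduction rows of Table 7 are straightforward to reproduce in outline. A minimal pure-Python sketch follows; whitespace tokenization and unweighted counts are simplifications, since the paper's exact tokenizer and TF-IDF weighting are not specified here.

```python
import math
from collections import Counter

def overlap_cosine(orig_tokens, comp_tokens):
    """Cosine similarity between token-count vectors. The paper's TF-IDF
    cosine additionally weights counts by inverse document frequency;
    raw counts are used in this sketch."""
    a, b = Counter(orig_tokens), Counter(comp_tokens)
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def token_reduction(orig_tokens, comp_tokens):
    """Fraction of tokens removed by compression."""
    return 1.0 - len(comp_tokens) / len(orig_tokens)

# Hypothetical example: a compressor that drops low-salience stopwords.
orig = "the quick brown fox jumps over the lazy dog".split()
comp = "quick brown fox jumps over lazy dog".split()
```

For borderline prompts, these two numbers summarize the trade the compressor makes: how much context is removed (token reduction) versus how much lexical content survives (cosine similarity).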
