BOute: Cost-Efficient LLM Serving with Heterogeneous LLMs and GPUs via Multi-Objective Bayesian Optimization
The rapid growth of large language model (LLM) deployments has made cost-efficient serving systems essential. Recent efforts to improve system cost-efficiency take two main perspectives: (i) an algorithmic perspective that exploits heterogeneous model capabilities, routing simpler queries to lower-cost models and complex queries to higher-cost models (i.e., heterogeneous query routing); and (ii) a systems perspective that uses heterogeneous GPU resources as cost-effective alternatives to homogeneous high-end GPUs (i.e., heterogeneous model deployment). However, algorithm-system co-design for cost-efficient LLM serving requires sophisticated management: (i) determining optimal query routing strategies under latency and quality requirements, (ii) configuring model deployment across heterogeneous GPUs with appropriate resource allocation and parallelism strategies, and (iii) co-optimizing routing and deployment decisions to maximize overall system performance. To address these challenges, we present BOute, a quality-aware scheduling system that jointly exploits heterogeneous model and GPU capabilities for cost-efficient LLM serving. BOute employs a multi-objective Bayesian optimization (MOBO) framework to co-optimize the routing strategy and model deployment, maximizing the cost-efficiency of the serving system while guaranteeing response quality. Evaluation results demonstrate that BOute outperforms state-of-the-art LLM serving systems by up to 157% (59% on average) under identical cost budgets and quality requirements, or reduces serving costs by 15%-61% (38% on average) while meeting the same performance targets.
💡 Research Summary
BOute addresses the pressing need for cost-efficient large language model (LLM) serving by jointly exploiting heterogeneity in both models and GPU hardware. The paper first observes that modern LLM deployments face two complementary sources of heterogeneity: (i) model heterogeneity, where smaller, cheaper models can handle easy queries while larger, more capable models are required for difficult queries; and (ii) GPU heterogeneity, where a mix of high-end accelerators (e.g., NVIDIA H100) and more economical devices (e.g., RTX 5090) offers a broader performance-cost spectrum. Existing works treat these two dimensions in isolation, but the authors argue that routing decisions affect the load distribution across models, which in turn influences the optimal GPU allocation, and vice versa. This bidirectional dependency creates a circular optimization problem that cannot be solved by optimizing routing or deployment alone.
To tackle this, BOute formulates a constrained multi-objective optimization problem. Decision variables include routing thresholds τ that determine the proportion of queries sent to each model, and a set of deployment parameters: which GPU type(s) each model runs on, how many GPUs of each type are allocated, and the degrees of data, tensor, and pipeline parallelism applied. The optimization minimizes the 95th-percentile latency (P95) subject to two constraints: a response-quality requirement (e.g., accuracy on GSM8K) and a given monetary budget. The authors adopt a Multi-Objective Bayesian Optimization (MOBO) framework: a Gaussian Process surrogate models the black-box relationship between the decision variables and these performance metrics, and an acquisition function based on Expected Improvement guides the selection of new candidate configurations. By iteratively updating the surrogate, BOute efficiently explores the configuration space and converges to a Pareto front of latency-quality-cost trade-offs.
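The mechanics of the Bayesian optimization loop can be illustrated with a minimal sketch over a single decision variable, the routing threshold τ. Everything below is an assumption for illustration: the toy latency curve, the RBF kernel lengthscale, and the one-dimensional, single-objective setup all stand in for BOute's actual multi-variable, multi-objective formulation.

```python
import math

import numpy as np

def measure_p95_latency(tau: float) -> float:
    """Stand-in for the black-box system measurement BOute would take.

    tau is the fraction of queries routed to the small model. This
    U-shaped toy curve (minimum near tau = 0.3) is an assumed model,
    not data from the paper.
    """
    return 17.0 + 40.0 * (tau - 0.3) ** 2

def rbf_kernel(a, b, length=0.2):
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_tr, y_tr, x_te, jitter=1e-8):
    """Gaussian Process posterior mean and std under an RBF kernel."""
    K = rbf_kernel(x_tr, x_tr) + jitter * np.eye(len(x_tr))
    Ks = rbf_kernel(x_tr, x_te)
    mu = Ks.T @ np.linalg.solve(K, y_tr)
    cov = rbf_kernel(x_te, x_te) - Ks.T @ np.linalg.solve(K, Ks)
    return mu, np.sqrt(np.maximum(np.diag(cov), 1e-12))

def expected_improvement(mu, std, best_y):
    """EI for minimization: expected drop below the best value so far."""
    z = (best_y - mu) / std
    cdf = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z])
    pdf = np.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best_y - mu) * cdf + std * pdf

x = np.array([0.05, 0.5, 0.95])             # initial routing thresholds
y = np.array([measure_p95_latency(t) for t in x])
cand = np.linspace(0.0, 1.0, 101)           # candidate thresholds

for _ in range(10):
    ys = (y - y.mean()) / y.std()           # standardize for a zero-mean GP
    mu, std = gp_posterior(x, ys, cand)
    nxt = cand[int(np.argmax(expected_improvement(mu, std, ys.min())))]
    if np.any(np.isclose(x, nxt)):          # re-picked a sampled point: stop
        break
    x = np.append(x, nxt)
    y = np.append(y, measure_p95_latency(nxt))

best = x[int(np.argmin(y))]
print(f"best tau ~ {best:.2f}, P95 ~ {y.min():.2f}s")
```

The real system replaces the toy latency function with actual deployments, optimizes latency, quality, and cost jointly, and searches a mixed discrete-continuous space of routing and deployment parameters, but the surrogate-fit, acquire, measure, update cycle is the same.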
The experimental evaluation uses two Llama 3.1 variants (70B and 8B) and compares three hardware setups under comparable cost: (i) a homogeneous cluster of 12 H100 GPUs, (ii) a heterogeneous cluster of 6 RTX 5090 plus 10 H100 GPUs, and (iii) a baseline homogeneous deployment of a single model. The authors first demonstrate that naive routing with uniform GPU allocation leads to a bottleneck on the large model, increasing latency. Adjusting GPU allocation proportionally to the load (4 GPUs for the 8B model, 8 GPUs for the 70B model) reduces P95 latency by 20%. Introducing heterogeneous GPUs further improves performance: routing 30% of queries to the 8B model on RTX 5090 and 70% to the 70B model on H100 yields a 33% latency reduction (P95 = 17.1 s) and a quality score of 91.2, surpassing the 90-point requirement.
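The proportional-allocation step can be sketched as a simple largest-remainder split of the GPU pool. The demand figures below are illustrative assumptions chosen so that a 1:2 demand ratio reproduces the 4/8 split mentioned above; the paper's actual allocator and measured demands may differ.

```python
def allocate_gpus(demands: dict, total_gpus: int) -> dict:
    """Split a fixed GPU pool proportionally to per-model demand,
    using largest-remainder rounding so counts sum to total_gpus."""
    total = sum(demands.values())
    raw = {m: total_gpus * d / total for m, d in demands.items()}
    alloc = {m: int(r) for m, r in raw.items()}
    leftover = total_gpus - sum(alloc.values())
    # hand out the remaining GPUs to the largest fractional remainders
    for m in sorted(raw, key=lambda m: raw[m] - alloc[m], reverse=True)[:leftover]:
        alloc[m] += 1
    return alloc

# Assumed demand weights (routed load x relative per-query GPU time),
# picked so the 70B model demands twice the capacity of the 8B model.
demand = {"llama-8b": 1.0, "llama-70b": 2.0}
print(allocate_gpus(demand, total_gpus=12))  # {'llama-8b': 4, 'llama-70b': 8}
```

The demand weights would in practice come from profiling: queries routed to each model multiplied by that model's per-query GPU cost on its assigned hardware.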
BOute’s MOBO automatically discovers such configurations. Across a range of latency and quality constraints, the system achieves up to 2.6× lower latency (1.6× on average) or up to 1.9× higher throughput while staying within the same budget. In cost-reduction mode, BOute cuts serving expenses by 15%-61% (38% on average) without sacrificing latency or quality targets. Additional benchmarks show that the small model runs ~1.5× faster on RTX 5090 than on H100, whereas the large model sees a ~2× speedup on H100 over RTX 5090, confirming that matching each model to the right GPU type is essential for cost-efficiency.
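The model-GPU matching principle reduces to picking, for each model, the GPU type with the best measured throughput. The absolute numbers below are assumptions; only their ratios loosely follow the summary's observations (~1.5× for the small model on RTX 5090, ~2× for the large model on H100).

```python
# Hypothetical per-(model, GPU) throughput profile in queries/s per GPU.
# Absolute values are assumed for illustration, not taken from the paper.
throughput = {
    ("llama-8b",  "rtx5090"): 12.0,
    ("llama-8b",  "h100"):     8.0,
    ("llama-70b", "rtx5090"):  0.9,
    ("llama-70b", "h100"):     1.8,
}

def match_gpu(model: str) -> str:
    """Pick the GPU type that maximizes throughput for this model."""
    return max((g for m, g in throughput if m == model),
               key=lambda g: throughput[(model, g)])

print(match_gpu("llama-8b"))   # rtx5090
print(match_gpu("llama-70b"))  # h100
```

A production matcher would also weigh GPU price and memory capacity (a 70B model may not even fit on a consumer card without parallelism), which is exactly the kind of coupling BOute's joint search handles.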
The paper contributes (1) a systematic workload characterization that quantifies the synergy between model routing and heterogeneous deployment, (2) a formal problem formulation and a scalable MOBO solution that yields Pareto‑optimal configurations, and (3) a prototype implementation with extensive experiments validating significant gains over state‑of‑the‑art serving systems. The authors suggest future extensions such as handling more diverse model families, dynamic workload fluctuations, and online adaptation to real‑time pricing changes. BOute thus demonstrates that co‑optimizing routing and deployment via multi‑objective Bayesian methods is a practical and powerful approach to achieving cost‑efficient, high‑quality LLM serving in heterogeneous cloud environments.