Federate the Router: Learning Language Model Routers with Sparse and Decentralized Evaluations
Large language models (LLMs) are increasingly accessed as remotely hosted services by edge and enterprise clients that cannot run frontier models locally. Since models vary widely in capability and price, routing queries to models that balance quality and inference cost is essential. Existing router approaches assume access to centralized query-model evaluation data. However, these data are often fragmented across clients, such as end users and organizations, and are privacy-sensitive, which makes centralizing data infeasible. Additionally, per-client router training is ineffective, since local evaluation data are limited and cover only a restricted query distribution and a biased subset of model evaluations. We introduce the first federated framework for LLM routing, enabling clients to learn a shared routing policy from local offline query-model evaluation data. Our framework supports both a parametric multilayer perceptron (MLP) router and a nonparametric K-means router under heterogeneous client query distributions and non-uniform model coverage. Across two benchmarks, federated collaboration improves the accuracy-cost frontier over client-local routers, both via increased effective model coverage and better query generalization. Our theoretical results also validate that federated training reduces routing suboptimality.
💡 Research Summary
The paper tackles the practical problem of routing queries to the most appropriate large language model (LLM) when the evaluation data needed to train such routers are fragmented, sparse, and privacy‑sensitive across many clients. Traditional router designs assume a centralized dataset where every query has been evaluated on every model, an assumption that is unrealistic in real‑world deployments where edge devices or enterprise systems can only afford to query a small, client‑specific subset of the available model pool. To address this, the authors propose the first federated learning (FL) framework for LLM routing, enabling multiple clients to collaboratively learn a shared routing policy while keeping raw queries and model outcomes local.
Two families of routers are adapted to the federated setting: a parametric multi‑layer perceptron (MLP) router and a non‑parametric K‑Means router. The MLP router learns a shared embedding trunk and model‑specific linear heads that predict expected accuracy and inference cost for any model given a query embedding. Training follows the standard FedAvg algorithm: each client performs several local SGD steps on its private dataset, then the server aggregates the updates using a weighted average based on client data size. The resulting global parameters define a router that can be queried with any trade‑off parameter λ to select the model maximizing the estimated utility (accuracy minus λ·cost).
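The training loop described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the parameter layout, function names, and example numbers are all assumptions; only the two ideas it demonstrates (FedAvg's data-size-weighted averaging and λ-utility model selection) come from the text.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Server step: average client parameter vectors, weighted by local data size."""
    weights = np.array(client_sizes, dtype=float)
    weights /= weights.sum()                     # normalize to sum to 1
    stacked = np.stack(client_params)            # (num_clients, num_params)
    return (weights[:, None] * stacked).sum(axis=0)

def route(pred_accuracy, pred_cost, lam):
    """Pick the model maximizing estimated utility = accuracy - lam * cost."""
    utility = np.asarray(pred_accuracy) - lam * np.asarray(pred_cost)
    return int(np.argmax(utility))

# Three clients with different amounts of local data (toy parameter vectors).
params = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
sizes = [100, 300, 600]
global_params = fedavg(params, sizes)            # -> [4.0, 5.0]

# Route one query given per-model predictions from the (hypothetical) global router.
best = route(pred_accuracy=[0.9, 0.7, 0.6], pred_cost=[1.0, 0.2, 0.05], lam=0.5)
```

Note how λ acts purely at inference time: one trained router serves any accuracy-cost trade-off, since changing λ only re-ranks the predicted utilities.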
The K‑Means router avoids learning altogether. It partitions the query embedding space into K Voronoi cells, computes per‑model average accuracy and cost within each cell locally, and sends these statistics to the server. The server merges statistics across clients to obtain global cell‑wise estimates. At inference time, a new query is assigned to its nearest cluster center and the stored averages are used to compute utility. This approach is especially robust when evaluation data are extremely sparse because it relies on simple averaging rather than gradient‑based learning.
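The server-side merge for the K-Means router reduces to summing sufficient statistics. The sketch below assumes each client reports, per (cell, model) pair, a count plus accuracy and cost sums; the data layout and names are illustrative, not the paper's protocol.

```python
def merge_cell_stats(client_stats):
    """Merge per-client statistics into global per-cell, per-model averages.

    client_stats: list of dicts mapping (cell, model) -> (count, acc_sum, cost_sum).
    Returns a dict mapping (cell, model) -> (mean_accuracy, mean_cost).
    """
    totals = {}
    for stats in client_stats:
        for key, (n, acc_sum, cost_sum) in stats.items():
            tn, ta, tc = totals.get(key, (0, 0.0, 0.0))
            totals[key] = (tn + n, ta + acc_sum, tc + cost_sum)
    return {k: (a / n, c / n) for k, (n, a, c) in totals.items()}

# Two clients report statistics for cell 0 and model "m1".
client_a = {(0, "m1"): (10, 8.0, 2.0)}   # 10 queries, 8 correct, total cost 2.0
client_b = {(0, "m1"): (30, 18.0, 9.0)}  # 30 queries, 18 correct, total cost 9.0
merged = merge_cell_stats([client_a, client_b])
# merged[(0, "m1")] == (26/40, 11/40) == (0.65, 0.275)
```

Because only aggregate counts and sums leave the client, raw queries and per-query outcomes stay local, and the merge is exact: the global average equals what a centralized computation over the pooled data would produce.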
The authors provide theoretical guarantees. For the MLP router they prove convergence of the federated optimization under standard smoothness and bounded‑gradient assumptions, and they show that the global model’s routing sub‑optimality is lower than that of any individual client’s local model, even with heterogeneous data distributions. For the K‑Means router they derive a statistical bound showing that the estimation error decays as O(1/√n) with the number of samples per cell, confirming that aggregating across clients effectively increases sample size and reduces variance.
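The variance-reduction argument behind the K-Means bound can be written out in one line; the notation here is ours, under the standard assumption of i.i.d. per-cell outcomes with bounded variance σ², not the paper's exact statement.

```latex
% Pooling n = \sum_i n_i samples from all clients into cell c for model m:
\hat{a}_{c,m} = \frac{1}{n} \sum_{j=1}^{n} a_j,
\qquad
\mathbb{E}\big[(\hat{a}_{c,m} - a_{c,m})^2\big] \le \frac{\sigma^2}{n}
\;\Longrightarrow\;
\big|\hat{a}_{c,m} - a_{c,m}\big| = O\!\left(\tfrac{1}{\sqrt{n}}\right)
\;\text{with high probability.}
```

Federation enters only through n: since the merged estimate pools samples from every participating client, n per cell grows with the client population, shrinking the error relative to any single client's local estimate.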
Empirical evaluation is conducted on two benchmarks: (1) a realistic enterprise log dataset containing millions of queries and ten publicly available LLMs, and (2) a public multi‑task benchmark covering diverse domains and fifteen models. Simulated client populations range from 10 to 50, each client evaluating only 1–3 models per query, creating an extreme sparsity scenario. Results indicate that federated routers consistently outperform routers trained on a single client’s data. The federated MLP router improves the accuracy‑cost frontier by roughly 9 % on average, while the federated K‑Means router yields a 7–8 % gain. Importantly, effective model coverage—i.e., the proportion of models that can be selected for a given client—rises from about 30 % in the local setting to over 55 % after federated training. In high‑heterogeneity regimes the authors introduce an adaptive personalization scheme that interpolates between the global router and a client‑specific fine‑tuned router, mitigating cases where the global policy misaligns with a client’s unique query distribution.
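The adaptive personalization scheme mentioned above amounts to interpolating between the global and locally fine-tuned routers. The sketch below is a minimal rendering of that idea; the mixing weight α and how it is chosen (e.g., by validation on held-out local queries) are assumptions, not details taken from the paper.

```python
import numpy as np

def personalize(global_params, local_params, alpha):
    """Interpolate router parameters: alpha = 0 -> pure global, alpha = 1 -> pure local."""
    g = np.asarray(global_params, dtype=float)
    l = np.asarray(local_params, dtype=float)
    return (1.0 - alpha) * g + alpha * l

# Toy parameter vectors; alpha would be tuned per client in practice.
mixed = personalize([4.0, 5.0], [2.0, 1.0], alpha=0.25)
# 0.75 * [4, 5] + 0.25 * [2, 1] == [3.5, 4.0]
```

Clients whose query distribution matches the population can keep α near 0 and inherit the global router's coverage, while outlier clients raise α to stay close to their own fine-tuned policy.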
The paper’s contributions are threefold: (1) formulation of LLM routing as a federated learning problem that respects privacy and handles sparse, imbalanced evaluation data; (2) concrete algorithms for both parametric and non‑parametric routers that can be deployed with evolving model pools and client participation; (3) theoretical analysis and extensive experiments demonstrating that federated collaboration reduces routing sub‑optimality, improves generalization to out‑of‑distribution queries, and expands model coverage without exposing private data.
Limitations discussed include the assumption of a static model pool (dynamic addition/removal of models would require re‑clustering or incremental learning), and the need to integrate stronger privacy guarantees such as differential privacy or secure aggregation. Future work may also explore multi‑objective routing that incorporates latency, regulatory constraints, or user‑specific utility functions.
In summary, this work shows that federated learning can be effectively applied to the problem of LLM query routing, providing a practical solution for organizations that must balance performance, cost, and privacy when leveraging multiple remote language models.