Designing Service Systems from Textual Evidence
Designing service systems requires selecting among alternative configurations – choosing the best chatbot variant, the optimal routing policy, or the most effective quality control procedure. In many service systems, the primary evidence of performance quality is textual – customer support transcripts, complaint narratives, compliance review reports – rather than the scalar measurements assumed by classical optimization methods. Large language models (LLMs) can read such textual evidence and produce standardized quality scores, but these automated judges exhibit systematic biases that vary across alternatives and evaluation instances. Human expert review remains accurate but costly. We study how to identify the best service configuration with high confidence while minimizing expensive human audits, given that automated evaluation is cheap but biased. We formalize this as a sequential decision problem where a biased proxy score is observed for every evaluation, and a verified outcome can be acquired selectively at additional cost. We prove that LLM-only selection fails under arm-dependent bias, and that naive selective-audit estimators can be asymptotically biased. We develop an estimator combining proxy scores with inverse-propensity-weighted residuals and construct anytime-valid confidence sequences. Our algorithm, PP-LUCB, jointly decides which alternatives to evaluate and whether to request human audits, concentrating reviews where the LLM judge is least reliable. We prove correctness and establish instance-dependent cost bounds showing near-optimal efficiency. On a customer support ticket classification task, our algorithm correctly identifies the best model in 40/40 trials while achieving 90% audit cost reduction.
💡 Research Summary
The paper tackles a practical problem that arises when the performance evidence of a service system is primarily textual—such as call‑center transcripts, customer‑support tickets, or compliance review reports. Large language models (LLMs) can read these texts and output a cheap proxy score, but the scores are systematically biased in ways that depend on both the configuration (the “arm”) and the specific instance. Because the bias can differ across alternatives, simply collecting more proxy scores cannot guarantee identification of the truly best configuration; the authors prove that an LLM‑only strategy fails under arm‑dependent bias (Theorem 3.5, Part 1).
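This failure mode is easy to see in a small simulation (a hypothetical sketch, not the paper's experiment: the arm qualities, judge biases, and noise levels below are invented for illustration). When the judge over-scores the worse arm, proxy means rank the arms wrongly, and collecting more proxy scores only makes the wrong ranking more confident:

```python
import random

random.seed(3)
n = 50_000
# Hypothetical two-arm setup: arm 0 is truly better,
# but the LLM judge over-scores arm 1 (arm-dependent bias).
true_means = [0.60, 0.50]
judge_bias = [0.00, 0.20]

proxy_means = []
for arm in range(2):
    total = 0.0
    for _ in range(n):
        y = random.gauss(true_means[arm], 0.2)          # verified outcome
        total += y + judge_bias[arm] + random.gauss(0.0, 0.3)  # proxy score
    proxy_means.append(total / n)

# Proxy means concentrate near 0.60 and 0.70, so LLM-only selection
# picks arm 1 even though arm 0 is truly better.
best_by_proxy = max(range(2), key=lambda a: proxy_means[a])
print(best_by_proxy)
```

No amount of extra cheap evaluation fixes this, because the bias shifts the proxy means themselves rather than adding noise that averages out.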
To overcome this limitation, the authors introduce selective human audits. A human expert can provide an unbiased, high‑cost outcome for any evaluated instance, but audits are expensive and therefore must be used sparingly. The key statistical challenge is that audits are chosen adaptively after observing the proxy score, which creates a selection bias: naïve estimators that ignore the audit decision rule remain asymptotically biased even with infinitely many audits (Theorem 3.5, Part 2).
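The selection-bias point can also be made concrete with a simulation (illustrative numbers only; the audit rule and distributions are invented). Auditing only instances whose proxy score looks bad drags the audited-only average well below the true mean:

```python
import random

random.seed(0)
N = 200_000
true_mean = 0.5

audited_outcomes = []
for _ in range(N):
    y = random.gauss(true_mean, 0.2)    # verified outcome
    s = y + random.gauss(0.0, 0.3)      # noisy proxy score
    if s < 0.3:                         # audit only low-scoring instances
        audited_outcomes.append(y)

# Conditioning on the proxy selects instances with low outcomes, so the
# audited-only mean sits far below true_mean; more audits only tighten
# the estimate around this wrong value.
naive = sum(audited_outcomes) / len(audited_outcomes)
print(round(naive, 3))
```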
The solution consists of two parts. First, the authors develop an unbiased estimator that combines the cheap proxy mean with an inverse‑propensity‑weighted (IPW) correction for the residuals between the proxy and the audited outcome. The IPW term compensates for the fact that audits are more likely on uncertain or extreme proxy scores, restoring unbiasedness under any adaptive audit policy. Second, they construct time‑uniform confidence sequences for the corrected means, which remain valid under adaptive sampling, adaptive auditing, and optional stopping.
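Assuming the audit probability π(s) is known and bounded away from zero, the corrected point estimate has the form "proxy mean plus IPW residual correction". A minimal sketch with invented numbers (not the paper's exact estimator or audit rule):

```python
import random

random.seed(1)
n = 200_000
true_mean, judge_bias = 0.5, 0.15   # proxy over-scores by a constant

proxy_sum, correction_sum, audits = 0.0, 0.0, 0
for _ in range(n):
    y = random.gauss(true_mean, 0.2)             # verified outcome (costly)
    s = y + judge_bias + random.gauss(0.0, 0.3)  # biased proxy score (cheap)
    pi = 0.9 if s < 0.3 else 0.05                # adaptive audit probability
    proxy_sum += s
    if random.random() < pi:                     # selective audit
        audits += 1
        correction_sum += (y - s) / pi           # IPW-weighted residual

# mean(s) + mean(A * (y - s) / pi) is unbiased for E[y]: the 1/pi weights
# undo the over-representation of instances the audit rule favors.
estimate = proxy_sum / n + correction_sum / n
print(round(estimate, 3), audits)
```

Even though most audits land on low-scoring instances, the weighted correction recovers the true mean while auditing only a fraction of the evaluations.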
Building on these estimators, the authors propose PP‑LUCB (Prediction‑Powered Lower and Upper Confidence Bound), an algorithm that extends the classic LUCB best‑arm identification method. In each round PP‑LUCB (i) selects the two arms with the highest upper confidence bounds, (ii) draws additional cheap proxy evaluations for them, and (iii) decides whether to request a human audit for the current instance. The audit decision follows a Neyman‑allocation style rule: audits are concentrated on arms and instances where the LLM’s bias (as measured by the residual variance) is largest, thereby minimizing total audit cost while controlling bias.
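The loop can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it uses a fixed audit probability in place of the Neyman-allocation rule, a plain Hoeffding-style radius in place of anytime-valid confidence sequences, and invented arm qualities and biases:

```python
import math
import random

random.seed(2)

TRUE = [0.45, 0.65, 0.55]   # true mean quality per arm (invented)
BIAS = [0.20, -0.10, 0.05]  # arm-dependent LLM-judge bias (invented)

def evaluate(arm):
    y = random.gauss(TRUE[arm], 0.2)            # verified outcome
    s = y + BIAS[arm] + random.gauss(0.0, 0.2)  # biased proxy score
    return y, s

class ArmStats:
    def __init__(self, arm, audit_pi=0.3):
        self.arm, self.audit_pi = arm, audit_pi
        self.n, self.proxy_sum, self.corr_sum = 0, 0.0, 0.0

    def pull(self):
        y, s = evaluate(self.arm)
        self.n += 1
        self.proxy_sum += s
        if random.random() < self.audit_pi:     # selective audit
            self.corr_sum += (y - s) / self.audit_pi

    def mean(self):  # bias-corrected estimate (proxy mean + IPW correction)
        return (self.proxy_sum + self.corr_sum) / self.n

    def radius(self, delta=0.1):
        # Illustrative fixed-confidence width, not a time-uniform sequence.
        return math.sqrt(2.0 * math.log(1.0 / delta) / self.n)

arms = [ArmStats(a) for a in range(len(TRUE))]
for arm in arms:              # warm start
    for _ in range(30):
        arm.pull()

for _ in range(5000):
    means = [arm.mean() for arm in arms]
    leader = max(range(len(arms)), key=lambda i: means[i])
    ucb = [means[i] + arms[i].radius() for i in range(len(arms))]
    challenger = max((i for i in range(len(arms)) if i != leader),
                     key=lambda i: ucb[i])
    # Stop once the leader's lower bound clears the best challenger's UCB.
    if means[leader] - arms[leader].radius() > ucb[challenger]:
        break
    arms[leader].pull()
    arms[challenger].pull()

print("selected arm:", leader)
```

The leader/challenger sampling pattern is the standard LUCB structure; what changes here relative to classic LUCB is only that each pull yields a cheap proxy plus an occasional audited residual, and the confidence intervals are built on the bias-corrected means.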
Theoretical contributions include: (a) a proof of δ‑correctness, guaranteeing that the algorithm stops and returns the true best arm with probability at least 1 − δ; (b) instance‑dependent upper bounds on total cost that show near‑optimality relative to an oracle that knows the bias structure; (c) an asymptotically optimal tracking variant (PP‑Track‑and‑Audit) that matches information‑theoretic lower bounds; and (d) extensions that handle delayed audit feedback while preserving time‑uniform inference.
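In conventional best-arm-identification notation (a sketch of the standard form of such guarantees, not the paper's exact statements), the correctness guarantee in (a) and the cost quantity bounded in (b) read:

```latex
% \hat{a}_\tau: arm returned at the stopping time \tau;  a^*: the unique best arm
\mathbb{P}\bigl(\hat{a}_\tau = a^*\bigr) \;\ge\; 1 - \delta,
\qquad
\mathbb{E}[\mathrm{Cost}] \;=\; \sum_{a} \Bigl( c_{\mathrm{proxy}}\,\mathbb{E}[N_a] \;+\; c_{\mathrm{audit}}\,\mathbb{E}[M_a] \Bigr),
```

where \(N_a\) counts cheap proxy evaluations of arm \(a\), \(M_a\) counts human audits of arm \(a\), and \(c_{\mathrm{proxy}} \ll c_{\mathrm{audit}}\), so the audit terms dominate the budget.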
Empirical evaluation covers both synthetic experiments (where bias patterns are controlled to verify coverage and cost‑reduction claims) and real‑world case studies using live LLM APIs. In a customer‑support ticket classification task, PP‑LUCB identified the best model in all 40 trials while cutting audit expenditure by roughly 90%. In a more complex queue‑design scenario involving routing policies, prompting strategies, and model selection, the algorithm achieved high design accuracy with similar cost savings. Additional experiments confirm that delayed audits increase decision latency by at most the maximum audit delay, without affecting monetary cost or correctness.
Overall, the paper presents a rigorous statistical framework for integrating cheap, biased LLM evaluations with expensive, accurate human audits. By correcting for selection bias through IPW and guiding audits toward the most informative regions, PP‑LUCB enables cost‑effective, high‑confidence selection of the optimal service configuration in settings where textual evidence dominates. This work opens a new avenue for AI‑human collaborative decision making in operations, healthcare, compliance, and any domain where performance must be judged from unstructured text.