Optimal Bayesian Stopping for Efficient Inference of Consistent LLM Answers

Notice: This research summary and analysis were generated automatically using AI. For absolute accuracy, please refer to the original arXiv source.

A simple strategy for improving LLM accuracy, especially on math and reasoning problems, is to sample multiple responses and submit the answer reached most consistently. In this paper we leverage Bayesian prior information to reduce sampling costs, stopping once sufficient consistency is reached. Because the exact posterior is computationally intractable, we introduce an efficient “L-aggregated” stopping policy that tracks only the counts of the L−1 most frequent answers. Theoretically, we prove that L = 3 is all you need: this coarse approximation suffices for asymptotic optimality and strictly dominates prior-free baselines, while admitting fast posterior computation. Empirically, the policy identifies the most consistent (i.e., mode) LLM answer using fewer samples, and can achieve similar answer accuracy while cutting the number of LLM calls (i.e., saving on LLM inference costs) by up to 50%.


💡 Research Summary

The paper tackles the problem of improving large language model (LLM) answer reliability by leveraging answer consistency, a technique known as Self‑Consistency (SC). While SC simply samples a fixed number of reasoning paths and selects the most frequent final answer, this approach can be wasteful because easy queries often need far fewer samples than hard ones. The authors propose an optimal adaptive sampling framework called Adaptive Self‑Consistency (ASC) that incorporates an informative Bayesian prior over the unknown answer distribution.
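As background, fixed-budget SC amounts to a simple majority vote over independent samples. A minimal sketch, with a random stub standing in for the actual LLM call (the stub and its answer set are illustrative, not from the paper), might look like:

```python
import random
from collections import Counter

def self_consistency(sample_answer, n_samples=16):
    """Fixed-budget Self-Consistency: draw n_samples answers and
    return the most frequent one (majority vote)."""
    answers = [sample_answer() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Stub standing in for an LLM call: answers "42" 60% of the time.
rng = random.Random(0)
stub = lambda: rng.choices(["42", "41", "7"], weights=[0.6, 0.3, 0.1])[0]
print(self_consistency(stub))
```

The wastefulness is visible here: `n_samples` is spent in full even when the first few samples already agree, which is exactly what adaptive stopping avoids.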

In the Bayesian formulation, each LLM query induces a categorical distribution π = (p₁,…,p_K) over K possible answers, ordered so that p₁ > p₂ ≥ … ≥ p_K. Sampling the model repeatedly yields a sequence of answers drawn i.i.d. from π. The goal is to stop as soon as the posterior probability that the most frequent observed answer is indeed the true mode exceeds a confidence threshold 1 − δ. The exact posterior P(H₁ | Cₙ) can be written as a ratio of sums over all injective mappings from observed answer labels to the latent answer space. Unfortunately, enumerating these mappings incurs O(K!) complexity, which is infeasible for realistic K (tens or hundreds of distinct answers).
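To make the O(K!) cost concrete, here is one plausible instantiation of the exact posterior for a known, sorted prior π, marginalising uniformly over all injective maps from observed labels to latent slots (a sketch of the described computation, not the paper's implementation):

```python
from itertools import permutations

def exact_posterior(counts, prior_probs):
    """Exact posterior that the most frequent observed answer is the true
    mode, summing over all injective label-to-slot assignments.
    counts      : observed counts per distinct answer, counts[0] largest.
    prior_probs : sorted categorical probabilities p1 > p2 >= ... >= pK.
    Cost is O(K!) -- only feasible for small K."""
    K, m = len(prior_probs), len(counts)
    num = den = 0.0
    # sigma maps observed label j to latent slot sigma[j]; unobserved
    # slots contribute p**0 = 1 to the likelihood and can be skipped.
    for sigma in permutations(range(K), m):
        lik = 1.0
        for j, c in enumerate(counts):
            lik *= prior_probs[sigma[j]] ** c
        den += lik
        if sigma[0] == 0:  # H1: the top observed answer occupies slot 1
            num += lik
    return num / den

print(exact_posterior([5, 2, 1], [0.6, 0.3, 0.1]))
```

Even at K = 20 this loop would already visit more than 10^18 assignments, which is why the aggregation below is needed.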

To overcome this bottleneck, the authors introduce an “L‑aggregated” posterior approximation. Instead of tracking the full count‑of‑counts vector Cₙ (which records how many answers appear v₁ times, v₂ times, etc.), they retain only the top L − 1 most frequent answers explicitly and collapse the rest into a single “other” bucket. The resulting state Cₙᴸ contains at most L entries, and the posterior can be computed in O(K·L) time. Crucially, they prove that for any L ≥ 3 the approximation is asymptotically optimal: as the confidence level approaches 1 (δ → 0), the expected stopping time under the L‑aggregated posterior matches that of the exact posterior. Moreover, even with a coarse L = 2 (only the most frequent answer tracked) the method strictly dominates prior‑free ASC when the prior is known exactly. When the prior is uncertain or drawn from a finite set of candidate priors, L = 2 may lose efficiency, but L ≥ 3 remains robust.
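The aggregation step itself is cheap to state in code. A minimal sketch of building the L-aggregated state from a stream of sampled answers (helper name and interface are illustrative):

```python
from collections import Counter

def aggregate_state(answers, L=3):
    """Collapse observed answer counts into the L-aggregated state:
    the counts of the top L-1 answers plus one bucket for the rest."""
    counts = sorted(Counter(answers).values(), reverse=True)
    top = counts[:L - 1]
    other = sum(counts[L - 1:])
    return top + [other]

answers = ["a", "a", "a", "b", "b", "c", "d"]
print(aggregate_state(answers, L=3))  # [3, 2, 2]
```

With L = 3 the state reduces to (count of the leader, count of the runner-up, everything else), which is the summary the theory shows is sufficient.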

The theoretical analysis proceeds by formulating the stopping problem as a sequential hypothesis test. The authors derive closed‑form stopping thresholds for the binary case (K = 2) and show how the L‑aggregated posterior yields tractable update rules for general K. They also discuss how to learn the prior distribution from a small historical dataset of LLM queries, enabling the method to adapt to different domains (math, logic, code generation).
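For intuition about the binary case, assume the prior π = (p, 1 − p) with p > 1/2 is known and the two label-to-slot assignments are a priori equally likely (an illustrative assumption; the paper's exact setup may differ). Then the posterior depends only on the count margin a − b, and the stopping rule reduces to a margin threshold:

```python
import math

def binary_posterior(a, b, p):
    """Posterior that the leading answer (count a >= b) is the true mode,
    for known prior (p, 1-p) with p > 1/2 and a uniform prior over the
    two label-to-slot assignments."""
    # p^a (1-p)^b / (p^a (1-p)^b + (1-p)^a p^b) = 1 / (1 + r^(a-b))
    r = (1 - p) / p
    return 1.0 / (1.0 + r ** (a - b))

def required_margin(p, delta):
    """Smallest margin a - b at which binary_posterior >= 1 - delta."""
    r = (1 - p) / p
    # 1/(1 + r^m) >= 1 - delta  <=>  r^m <= delta / (1 - delta)
    return math.ceil(math.log(delta / (1 - delta)) / math.log(r))

print(binary_posterior(6, 2, 0.7), required_margin(0.7, 0.05))
```

Under this simplification the policy never needs the raw counts, only their difference, which is the kind of closed-form threshold the binary analysis delivers.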

Empirically, the paper evaluates the approach on several real‑world benchmarks, including the MATH dataset, logical reasoning tasks, and code synthesis problems. Across all settings, L‑aggregated ASC with L = 3 reduces the number of LLM calls by 30–50% compared to the fixed‑budget SC baseline, while maintaining or slightly improving final answer accuracy. The savings are most pronounced on “easy” queries, where a sharply peaked prior permits early termination after only a few consistent samples. On harder, “flat‑prior” queries the method still outperforms prior‑free baselines, because the three‑bucket summary (most frequent, second most frequent, and all others) captures enough information to distinguish the true mode with fewer samples. Table 1 in the paper shows that for confidence levels 1 − δ = 0.8, 0.9, 0.95, the L = 3 policy achieves stopping times virtually identical to the full‑posterior (L = K) policy, while its computation time is 5–10× lower.

In summary, the contribution of the paper is threefold: (1) a principled Bayesian formulation of adaptive self‑consistency that explicitly incorporates prior knowledge about answer distributions; (2) an efficient L‑aggregated posterior approximation that reduces computational complexity from factorial to linear in K while preserving asymptotic optimality for L ≥ 3; and (3) extensive theoretical and empirical validation demonstrating substantial inference‑cost reductions without sacrificing accuracy. The result is a practical, theoretically grounded method for “when to stop” in LLM inference, opening avenues for cost‑effective deployment of large models in latency‑sensitive applications. Future work may explore richer priors, multi‑turn dialogue settings, and integration with other inference‑budgeting techniques.

