On the Power of (Approximate) Reward Models for Inference-Time Scaling
Inference-time scaling has recently emerged as a powerful paradigm for improving the reasoning capability of large language models. Among various approaches, Sequential Monte Carlo (SMC) has become a particularly important framework, enabling iterative generation, evaluation, rejection, and resampling of intermediate reasoning trajectories. A central component in this process is the reward model, which evaluates partial solutions and guides the allocation of computation during inference. However, in practice, true reward models are never available. All deployed systems rely on approximate reward models, raising a fundamental question: Why and when do approximate reward models suffice for effective inference-time scaling? In this work, we provide a theoretical answer. We identify the Bellman error of the approximate reward model as the key quantity governing the effectiveness of SMC-based inference-time scaling. For a reasoning process of length $T$, we show that if the Bellman error of the approximate reward model is bounded by $O(1/T)$, then combining this reward model with SMC reduces the computational complexity of reasoning from exponential in $T$ to polynomial in $T$. This yields an exponential improvement in inference efficiency despite using only approximate rewards.
💡 Research Summary
The paper tackles a fundamental question in modern large‑language‑model (LLM) reasoning: why does inference‑time scaling work when the reward model used to evaluate partial solutions is only an approximation? The authors focus on the Sequential Monte Carlo (SMC) framework, which iteratively generates, evaluates, rejects, and resamples reasoning trajectories. In practice, true reward functions are unavailable; instead, systems rely on learned or heuristic approximations.
To formalize the problem, the authors view inference‑time reasoning as sampling from a reward‑tilted posterior distribution $\tilde\pi$ over full reasoning trajectories. The prior is the pretrained model $\pi_{\text{ref}}$, and a task‑specific utility $\phi$ (e.g., human preference or verifiable reward) defines the tilt. At each intermediate step $t$, a value function $\hat V(s_{0:t})$ supplied by the approximate reward model evaluates the quality of the partial prefix. The discrepancy between $\hat V$ and the optimal value function $V^\ast$ is measured by the Bellman error $\epsilon_t = \|\mathcal{T}\hat V_t - \hat V_t\|_\infty$, where $\mathcal{T}$ is the Bellman operator.
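Written out, one standard formalization of these objects looks as follows (a sketch of the usual soft/twisted-SMC setup; the paper's exact normalization and conditioning conventions may differ):

```latex
% Reward-tilted posterior over full trajectories: prior times exponentiated utility
\tilde\pi(s_{0:T}) \;\propto\; \pi_{\text{ref}}(s_{0:T})\,\exp\!\bigl(\phi(s_{0:T})\bigr)

% Soft Bellman backup; the optimal twist V^* is its fixed point,
% with the terminal condition pinning V^* to the utility
(\mathcal{T}V)(s_{0:t}) \;=\; \log \mathbb{E}_{\,s_{t+1}\sim \pi_{\text{ref}}(\cdot \mid s_{0:t})}\!\bigl[\exp V(s_{0:t+1})\bigr],
\qquad V^\ast(s_{0:T}) = \phi(s_{0:T})
```

Under this convention, the Bellman error $\epsilon_t = \|\mathcal{T}\hat V_t - \hat V_t\|_\infty$ measures how far $\hat V$ is from being self-consistent under the backup, rather than its pointwise distance to $V^\ast$.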
The central theoretical contribution is a bound that links this Bellman error to the computational complexity of SMC. The authors prove that if the Bellman error is uniformly bounded by $\epsilon = O(1/T)$ for a reasoning horizon of length $T$, then SMC can achieve a total variation (TV) distance $\delta$ from the target distribution using only a polynomial number of particles and a total runtime polynomial in $(T, 1/\delta)$. In contrast, without any reward guidance, an information‑theoretic lower bound shows that the required complexity remains exponential in $T$. Thus, a modest accuracy requirement on the reward model—specifically, an error that shrinks inversely with the reasoning depth—suffices to convert an exponential‑time inference problem into a polynomial‑time one.
Algorithmically, the paper studies two variants. The first, single‑particle guided SMC (SP‑gSMC), uses the approximate reward to bias the selection of a single “guided” particle while the rest are sampled uniformly. Theorem 4.3 shows that SP‑gSMC alone cannot achieve arbitrary TV accuracy unless the reward model is exact. To overcome this limitation, the authors augment SP‑gSMC with a Metropolis–Hastings (MH) correction step. The corrected algorithm retains the computational simplicity of guided SMC but gains a geometric contraction in TV distance on a high‑probability event; consequently, only $\mathcal{O}(\log(1/\delta))$ MH steps are needed to reach the desired accuracy, matching the mixing behavior observed in prior work.
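As a toy illustration of the generate–reweight–resample loop with an MH correction on top (this is not the paper's SP‑gSMC: the binary-token setting, the uniform reference model, the hand-crafted value `v_hat`, and the flat-proposal approximation in the acceptance ratio are all invented here for concreteness):

```python
import math
import random

random.seed(0)

T = 8   # reasoning horizon (toy: trajectories are T binary tokens)
N = 64  # number of particles

def phi(traj):
    # terminal utility: reward each "1" token
    return sum(traj)

def v_hat(traj):
    # approximate value (reward-to-go): reward so far plus an
    # optimistic per-step estimate -- a stand-in for a learned twist
    return sum(traj) + 0.5 * (T - len(traj))

def smc_sample():
    """Guided SMC: propose from the uniform reference model,
    reweight by the change in the approximate value, resample."""
    particles = [[] for _ in range(N)]
    for _ in range(T):
        # propose one token per particle from pi_ref (uniform over {0,1})
        for p in particles:
            p.append(random.randint(0, 1))
        # twist weights: exp(V(s_{0:t+1}) - V(s_{0:t}))
        w = [math.exp(v_hat(p) - v_hat(p[:-1])) for p in particles]
        # multinomial resampling proportional to the weights
        particles = [list(random.choices(particles, weights=w)[0])
                     for _ in range(N)]
    return random.choice(particles)

def mh_correct(x, n_steps=20):
    """MH correction with independent SMC proposals; the SMC proposal
    density is crudely treated as flat, so the acceptance ratio
    reduces to exp(phi(y) - phi(x)) under the uniform reference."""
    for _ in range(n_steps):
        y = smc_sample()
        if random.random() < min(1.0, math.exp(phi(y) - phi(x))):
            x = y
    return x

sample = mh_correct(smc_sample())
print(len(sample), phi(sample))
```

The real algorithm accounts for the proposal density in the acceptance ratio; the flat approximation above is only to keep the sketch short.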
The paper also discusses the optimal “twist” function $V^\ast$, which solves a Bellman equation and represents the exact reward‑to‑go. Since computing $V^\ast$ is intractable, practical systems learn an approximation $\hat V$ via methods such as Contrastive Twist Learning (CTL). The authors argue that if the learning objective explicitly minimizes the Bellman residual, the resulting $\hat V$ can satisfy the $O(1/T)$ error condition, thereby inheriting the theoretical guarantees.
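To make the Bellman-residual criterion concrete, here is a toy sketch in the same invented binary-token setting (the linear twist parameterization and the grid search are illustration only; CTL itself is a contrastive objective and is not reproduced here). Under a uniform reference model the soft Bellman backup has a closed form, so we can fit a one-parameter twist by directly minimizing its worst-case residual:

```python
import math

T = 8  # horizon (toy binary-token setting)

def v_theta(traj, theta):
    # linear twist: reward observed so far + theta per remaining step
    return sum(traj) + theta * (T - len(traj))

def bellman_residual(theta):
    """Worst-case |T V - V| over prefixes; under the uniform reference
    the soft backup is the log of a two-point average over children."""
    worst = 0.0
    for t in range(T):
        # the residual is identical for every prefix of length t here,
        # so one representative prefix (all zeros) suffices
        prefix = [0] * t
        backup = math.log(0.5 * (math.exp(v_theta(prefix + [0], theta))
                                 + math.exp(v_theta(prefix + [1], theta))))
        worst = max(worst, abs(backup - v_theta(prefix, theta)))
    return worst

# grid search for the theta that minimizes the residual; the exact
# fixed point here is theta = log((1 + e) / 2) ~ 0.620
thetas = [i / 1000 for i in range(1500)]
best = min(thetas, key=bellman_residual)
print(best, bellman_residual(best))
```

In this toy model the residual-minimizing twist recovers the optimal value exactly; the paper's point is that in general, driving this residual below $O(1/T)$ is what the learning objective needs to guarantee.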
Although the work is primarily theoretical, Table 1 summarizes the time‑complexity trade‑offs under the key assumptions (Assumptions 3.1 and 3.2). The results provide concrete guidance for designing reward models in LLM inference pipelines: ensure that the Bellman error decays at least as fast as $1/T$, allocate a polynomial number of particles (e.g., on the order of $L T^3$ for some constant $L$), and optionally apply a lightweight MH correction to guarantee arbitrary TV precision.
In summary, the paper establishes that the quality of an approximate reward model can be succinctly captured by its Bellman error, and that a modest (O(1/T)) bound is sufficient to unlock exponential gains in inference efficiency when using SMC‑based inference‑time scaling. This bridges the gap between empirical observations—where imperfect reward models still yield large performance boosts—and rigorous theory, offering a clear target for future reward‑model learning and for the deployment of scalable, high‑quality LLM reasoning systems.