MarkovScale: Towards Optimal Sequential Scaling at Inference Time
Sequential scaling is a prominent inference-time scaling paradigm, yet its performance improvements are typically modest and not well understood, largely due to the prevalence of heuristic, non-principled approaches that obscure clear optimality bounds. To address this, we propose a principled framework that models sequential scaling as a two-state Markov process. This approach reveals the underlying properties of sequential scaling and yields closed-form solutions for essential aspects, such as the specific conditions under which accuracy is improved and the theoretical upper, neutral, and lower performance bounds. Leveraging this formulation, we develop MarkovScale, a practical system that applies these optimality criteria to achieve a theoretically grounded balance between accuracy and efficiency. Comprehensive experiments across 3 backbone LLMs, 5 benchmarks, and over 20 configurations show that MarkovScale consistently outperforms state-of-the-art parallel and sequential scaling methods, representing a significant step toward optimal and resource-efficient inference in LLMs. The source code will be released upon acceptance at https://open-upon-acceptance.
💡 Research Summary
The paper tackles the problem of inference‑time scaling for large language models (LLMs), focusing on the sequential scaling paradigm in which a model iteratively refines its answer. While parallel scaling methods (e.g., Best‑of‑N, Self‑Consistency) have shown strong gains, sequential approaches have been hampered by modest improvements and occasional performance degradation, largely because they rely on heuristic stopping criteria. To provide a principled foundation, the authors model the sequential process as a discrete‑time two‑state Markov chain. Each iteration’s output is classified as either correct (C) or wrong (W). Transition probabilities a = P(W|C) and b = P(C|W) capture the likelihood of a correct answer becoming wrong or a wrong answer being corrected in the next step. The initial zero‑shot correctness probability p₀ = P(X₀ = C) is also incorporated.
By diagonalizing the transition matrix, they derive a closed‑form expression for the probability of correctness after i iterations: pᵢ = L + λⁱ(p₀ − L), where L = b/(a + b) is the asymptotic accuracy (theoretical upper bound) and λ = 1 − a − b (|λ| < 1 for convergence). This simple formula enables analytic prediction of how a given model will behave under unlimited sequential refinement.
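The closed form above is easy to evaluate directly. A minimal sketch (not the paper's code; the parameter values below are purely illustrative):

```python
def p_i(i: int, p0: float, a: float, b: float) -> float:
    """Closed-form probability of a correct answer after i refinement steps.

    p0 = P(X0 = C): zero-shot correctness probability.
    a  = P(W|C): chance a correct answer becomes wrong in the next step.
    b  = P(C|W): chance a wrong answer is corrected in the next step.
    """
    L = b / (a + b)        # asymptotic accuracy, the theoretical upper bound
    lam = 1.0 - a - b      # convergence rate; |lam| < 1 guarantees convergence
    return L + lam**i * (p0 - L)

# Example: a model that corrects wrong answers (b = 0.3) more often than it
# breaks correct ones (a = 0.1), so L = 0.75 and accuracy rises from p0 = 0.5.
print(p_i(0, 0.5, 0.1, 0.3))   # recovers p0 at i = 0
print(p_i(10, 0.5, 0.1, 0.3))  # approaches L = 0.75
```

As i grows, λⁱ → 0 and pᵢ converges geometrically to L regardless of p₀, which is what makes L the asymptotic accuracy.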
To decide whether sequential scaling is beneficial, the authors introduce a “benefit function” gᵢ = pᵢ − p₀ + σ, where σ is a robustness margin that absorbs verifier noise, non‑Markovian effects, decoding stochasticity, and estimation errors. The sign of gᵢ partitions the space into three regimes: (1) Beneficial scaling when p₀ < L + σ, (2) Detrimental scaling when p₀ > L + σ, and (3) Neutral scaling when p₀ ≈ L + σ. This yields Theorem III.1, a closed‑form criterion that tells a practitioner, before any runtime cost, whether a particular question‑model pair is likely to improve with more iterations.
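The three-regime criterion amounts to comparing p₀ against L + σ. A hedged sketch of that decision rule (the default σ and the neutrality tolerance `tol` are illustrative choices, not values from the paper):

```python
def scaling_regime(p0: float, a: float, b: float,
                   sigma: float = 0.02, tol: float = 1e-3) -> str:
    """Classify a question-model pair by comparing p0 to L + sigma.

    sigma is the robustness margin; tol defines the 'approximately equal'
    band for the neutral regime (both values here are illustrative).
    """
    L = b / (a + b)  # asymptotic accuracy of the refinement process
    threshold = L + sigma
    if abs(p0 - threshold) <= tol:
        return "neutral"       # p0 ~ L + sigma: scaling neither helps nor hurts
    if p0 < threshold:
        return "beneficial"    # headroom below L: more iterations should help
    return "detrimental"       # p0 already above L + sigma: iterating can hurt

print(scaling_regime(0.5, 0.1, 0.3))  # well below L = 0.75
print(scaling_regime(0.9, 0.1, 0.3))  # above L + sigma
```

Since the check only needs p₀, a, and b, it can be evaluated before spending any refinement tokens, which is the practical point of Theorem III.1.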
Beyond the binary decision, the paper formulates an optimal stopping problem: given a target confidence τ (e.g., 0.9), find the smallest iteration i* such that pᵢ ≥ τ while minimizing token consumption. Substituting the closed‑form pᵢ into the inequality yields λⁱ ≤ (L − τ)/(L − p₀) (for p₀ < τ < L, dividing by the negative quantity p₀ − L flips the inequality), and hence i* = ⌈log((L − τ)/(L − p₀)) / log λ⌉ when 0 < λ < 1.
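The optimal stopping index implied by the closed form can be computed in a few lines. A minimal sketch under the stated assumptions (p₀ < τ < L and 0 < λ < 1; function name and error handling are mine, not the paper's):

```python
import math

def optimal_iterations(p0: float, a: float, b: float, tau: float) -> int:
    """Smallest i with p_i >= tau, solved from p_i = L + lam**i * (p0 - L).

    Assumes 0 < lam < 1 and tau < L; raises if tau is unreachable.
    """
    L = b / (a + b)      # asymptotic accuracy
    lam = 1.0 - a - b    # convergence rate
    if tau >= L:
        raise ValueError("unreachable target: tau >= asymptotic accuracy L")
    if p0 >= tau:
        return 0         # zero-shot answer already meets the confidence target
    # p_i >= tau  <=>  lam**i <= (L - tau) / (L - p0); logs flip the sign again
    return math.ceil(math.log((L - tau) / (L - p0)) / math.log(lam))

# Example: p0 = 0.5, a = 0.1, b = 0.3 (so L = 0.75, lam = 0.6), target tau = 0.7.
print(optimal_iterations(0.5, 0.1, 0.3, 0.7))
```

For the example parameters, p₃ ≈ 0.696 falls just short of τ = 0.7 while p₄ ≈ 0.718 clears it, so the formula stops at the fourth iteration rather than spending further tokens.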