Conformal Thinking: Risk Control for Reasoning on a Compute Budget
Reasoning Large Language Models (LLMs) enable test-time scaling, with dataset-level accuracy improving as the token budget increases, motivating adaptive reasoning – spending tokens when they improve reliability and stopping early when additional computation is unlikely to help. However, setting the token budget, as well as the threshold for adaptive reasoning, is a practical challenge that entails a fundamental risk-accuracy trade-off. We reframe the budget-setting problem as risk control, limiting the error rate while minimizing compute. Our framework introduces an upper threshold that stops reasoning when the model is confident (risking incorrect output) and a novel parametric lower threshold that preemptively stops unsolvable instances (risking premature stoppage). Given a target risk and a validation set, we use distribution-free risk control to optimally specify these stopping mechanisms. For scenarios with multiple budget-controlling criteria, we incorporate an efficiency loss to select the most computationally efficient exiting mechanism. Empirical results across diverse reasoning tasks and models demonstrate the effectiveness of our risk-control approach, showing computational-efficiency gains from the lower threshold and ensemble stopping mechanisms while adhering to the user-specified risk target.
💡 Research Summary
The paper tackles the practical problem of allocating compute resources for large language model (LLM) reasoning at inference time. While it is well‑known that giving a model more reasoning tokens generally improves dataset‑level accuracy—a phenomenon called test‑time scaling—real‑world deployments need a way to decide how many tokens to spend on each instance and when to stop. Existing adaptive‑thinking methods address only the latter by setting a confidence threshold: once the model’s confidence exceeds a fixed value, reasoning halts. However, choosing that threshold is non‑intuitive, dataset‑specific, and still leaves many hard examples consuming the full token budget without ever reaching the threshold.
The authors reframe the budget‑setting problem as risk control. “Risk” is defined as the expected loss (error rate) of the final answer, and users specify a tolerable risk level ε (e.g., 5 %). The goal is to automatically select stopping criteria that guarantee the empirical risk on a validation set does not exceed ε while minimizing compute. To achieve this, they introduce two complementary stopping mechanisms:
- Upper‑threshold (λ⁺) – the classic “stop when confident” rule. The model emits a scalar signal sₜ (e.g., entropy‑based confidence) at each reasoning step t. When a transformed version ŝₜ ≥ λ⁺, reasoning stops and the current answer is returned. This controls false‑positive risk (stopping on an incorrect answer).
- Lower‑threshold (λ⁻(t; c)) – a novel “stop when not making progress” rule. It is a parametric, time‑dependent function defined as a sigmoid of the remaining token budget: λ⁻(t; c) = σ(c·(ωₜ − B)/2), where ωₜ is the number of tokens used so far, B is the total allowed budget, and c shapes the curve. When ŝₜ < λ⁻, the model is deemed to be stagnating, and reasoning halts early, delegating the instance to an external expert or simply returning a “cannot solve” signal. This controls false‑negative risk (prematurely giving up on a solvable problem).
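The two stopping rules above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the step-level interface, and the choice of returning string decisions are assumptions for readability; only the threshold formulas follow the definitions above.

```python
import math


def lower_threshold(tokens_used: float, budget: float, c: float) -> float:
    """Sigmoid lower threshold λ⁻(t; c) = σ(c · (ω_t − B) / 2).

    Near zero early on (rarely fires), rising toward 0.5 as the
    token budget B is exhausted, so stagnating instances are cut
    off more aggressively the closer they get to the budget.
    """
    return 1.0 / (1.0 + math.exp(-c * (tokens_used - budget) / 2.0))


def stopping_decision(s_hat: float, tokens_used: float, budget: float,
                      lam_upper: float, c: float) -> str:
    """Apply both rules at one reasoning step.

    s_hat is the transformed confidence signal ŝₜ.  Returns
    'answer' (upper threshold fired), 'give_up' (lower threshold
    fired: delegate or abstain), or 'continue'.
    """
    if s_hat >= lam_upper:
        return "answer"      # confident: emit the current answer
    if s_hat < lower_threshold(tokens_used, budget, c):
        return "give_up"     # stagnating: stop spending tokens
    return "continue"
```

Note how the lower threshold is time-dependent: with the same confidence signal, an instance early in its budget keeps reasoning, while one near the budget is cut off.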
Four loss functions are defined to quantify the trade‑off between correctness and efficiency:
- False‑positive loss (ℓ_upper^FP) – 1 if the upper threshold fires but the answer is wrong, 0 otherwise.
- False‑negative loss (ℓ_lower^FN) – proportional to the remaining steps after the lower threshold fires, weighted by whether a correct answer would appear later.
- Upper‑threshold efficiency loss (ℓ_upper^eff) – the fraction of tokens spent after the first correct answer appears (regret).
- Lower‑threshold efficiency loss (ℓ_lower^eff) – the proportion of the budget spent on an instance that has never produced a correct answer up to step t.
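One plausible reading of these four losses, written out for a single validation trace. The exact normalizations are assumptions (the paper's definitions may weight steps differently); `first_correct_step` is the step at which a correct answer first appears in the trace, or `None` if it never does.

```python
def fp_loss(stopped_by_upper: bool, answer_correct: bool) -> float:
    """ℓ_upper^FP: 1 if the upper threshold fired on a wrong answer."""
    return 1.0 if stopped_by_upper and not answer_correct else 0.0


def fn_loss(stop_step: int, first_correct_step, total_steps: int) -> float:
    """ℓ_lower^FN: fraction of remaining steps forfeited when the lower
    threshold gives up before a correct answer would have appeared."""
    if first_correct_step is None or stop_step >= first_correct_step:
        return 0.0  # nothing was lost by giving up
    return (total_steps - stop_step) / total_steps


def upper_eff_loss(stop_step: int, first_correct_step, total_steps: int) -> float:
    """ℓ_upper^eff: fraction of steps spent after the first correct
    answer appeared (regret from stopping too late)."""
    if first_correct_step is None or stop_step <= first_correct_step:
        return 0.0
    return (stop_step - first_correct_step) / total_steps


def lower_eff_loss(stop_step: int, first_correct_step, budget_steps: int) -> float:
    """ℓ_lower^eff: share of the budget spent on an instance that has
    not produced a correct answer up to the stopping step."""
    if first_correct_step is not None and first_correct_step <= stop_step:
        return 0.0
    return stop_step / budget_steps
```

The first two losses are correctness risks (controlled against ε); the last two are efficiency regrets used to rank mechanisms that all satisfy the risk constraint.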
These losses allow the authors to formulate a distribution‑free risk control problem: given a validation set V, a candidate set of uncertainty signals S, and discrete grids for λ⁺ and c, find the pair (λ⁺, c) that satisfies the risk constraint E[ℓ] ≤ ε on V while minimizing the expected compute.
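The selection step can be sketched as a grid search over candidate threshold pairs. This simplified version compares raw empirical risk to ε; a proper distribution-free guarantee (as in Learn-Then-Test-style calibration) would replace that comparison with a concentration-corrected hypothesis test, which is omitted here. The `evaluate` callable interface is a hypothetical stand-in for running the stopping rule over the validation set.

```python
def select_thresholds(evaluate, grid, epsilon):
    """Pick the (lam_upper, c) pair whose empirical risk on the
    validation set stays at or below epsilon while using the
    fewest tokens on average.

    evaluate: callable (lam_upper, c) -> (mean_risk, mean_tokens),
              assumed to run the stopping rule over the validation set V.
    grid:     iterable of candidate (lam_upper, c) pairs.
    Returns the best feasible pair, or None if no pair meets epsilon.
    """
    best, best_tokens = None, float("inf")
    for lam_upper, c in grid:
        risk, tokens = evaluate(lam_upper, c)
        if risk <= epsilon and tokens < best_tokens:
            best, best_tokens = (lam_upper, c), tokens
    return best
```

Among all configurations meeting the risk target, the efficiency losses above are what `mean_tokens` summarizes: the search keeps correctness guaranteed and spends the tie-breaking entirely on compute.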