HyPER: Bridging Exploration and Exploitation for Scalable LLM Reasoning with Hypothesis Path Expansion and Reduction
Scaling test-time compute with multi-path chain-of-thought improves reasoning accuracy, but its effectiveness depends critically on the exploration-exploitation trade-off. Existing approaches address this trade-off in rigid ways: tree-structured search hard-codes exploration through brittle expansion rules that interfere with post-trained reasoning, while parallel reasoning over-explores redundant hypothesis paths and relies on weak answer selection. Motivated by the observation that the optimal balance is phase-dependent and that correct and incorrect reasoning paths often diverge only at late stages, we reformulate test-time scaling as a dynamic expand-reduce control problem over a pool of hypotheses. We propose HyPER, a training-free online control policy for multi-path decoding in mixture-of-experts models that reallocates computation under a fixed budget using lightweight path statistics. HyPER consists of an online controller that transitions from exploration to exploitation as the hypothesis pool evolves, a token-level refinement mechanism that enables efficient generation-time exploitation without full-path resampling, and a length- and confidence-aware aggregation strategy for reliable answer-time exploitation. Experiments on four mixture-of-experts language models across diverse reasoning benchmarks show that HyPER consistently achieves a superior accuracy-compute trade-off, improving accuracy by 8 to 10 percent while reducing token usage by 25 to 40 percent.
💡 Research Summary
The paper introduces HyPER, a training‑free, online control framework for scaling test‑time computation in large language models (LLMs) that use multi‑path chain‑of‑thought (CoT) reasoning. Existing test‑time scaling methods fall into two rigid paradigms. Tree‑based search explicitly expands intermediate reasoning steps according to pre‑defined branching schedules, which interferes with the continuous reasoning flow of post‑trained models and wastes compute on unnecessary branches. Parallel reasoning methods such as Self‑Consistency or Best‑of‑N generate many complete CoT paths and aggregate at the end; they preserve native generation but over‑explore redundant paths and suffer from an “existence‑selection gap” where a correct answer may be present but is outvoted by noisy, high‑frequency incorrect paths.
HyPER reframes test‑time scaling as a dynamic “expand‑reduce” control problem over a pool of hypothesis paths. It observes three empirical facts: (1) the utility of exploration is phase‑dependent—early in decoding, widening the path pool improves coverage, but later the same widening yields diminishing returns; (2) correct and incorrect paths often share long prefixes and diverge only near the tail, with tail‑token confidence distinguishing them; (3) even when a correct path exists, naive majority voting can select an incorrect answer because noisy paths dominate the vote. Based on these, HyPER continuously monitors lightweight, training‑free statistics: mean token confidence (C̄ₜ), mean token entropy (Hₜ), top‑1 consensus ratio (βₜ), and a diversity score (Dₜ) that combines distribution‑level divergence and edit‑distance measures.
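These statistics can be sketched in plain Python. The exact definitions below are illustrative assumptions: in particular, `diversity` implements only the edit‑distance half of Dₜ, whereas the paper also mixes in a distribution‑level divergence term, and the precise normalizations may differ from the authors'.

```python
import math
from collections import Counter
from itertools import combinations

def mean_confidence(top1_probs):
    """Mean top-1 token probability along one path (C-bar_t)."""
    return sum(top1_probs) / len(top1_probs)

def mean_entropy(token_dists):
    """Mean Shannon entropy of the per-step next-token distributions (H_t)."""
    def h(p):
        return -sum(q * math.log(q) for q in p if q > 0)
    return sum(h(p) for p in token_dists) / len(token_dists)

def consensus_ratio(answers):
    """Top-1 consensus ratio beta_t: share of paths on the modal answer."""
    return Counter(answers).most_common(1)[0][1] / len(answers)

def edit_distance(a, b):
    """Levenshtein distance between two token sequences (iterative DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def diversity(paths):
    """Mean pairwise normalized edit distance over the paths' token
    sequences (the edit-distance component of D_t only)."""
    pairs = list(combinations(paths, 2))
    return sum(edit_distance(a, b) / max(len(a), len(b)) for a, b in pairs) / len(pairs)
```

All four signals are computable from quantities the decoder already produces (token probabilities and partial answers), which is what makes the monitoring training-free and cheap.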
Every T decoding steps, an online controller evaluates these signals and selects one of four actions for all surviving paths: NONE (continue standard decoding), SINGLE‑TOKEN (token‑level refinement using MoE routing diversity), MULTI‑TOKEN (aggregate several tokens at once to expand the pool efficiently), or BRANCH (create new sub‑paths when diversity collapses). The controller adapts the per‑path expansion factor rₜ ≈ ⌈W/|Sₜ|⌉ to keep the total number of active paths near a target width W, thereby reallocating compute from exploration to exploitation as the decoding state evolves.
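A minimal sketch of this control loop follows. The threshold values, their names, and the priority order among the four actions are assumptions for illustration; the paper's actual decision rule may combine the signals differently. Only the expansion-factor formula rₜ = ⌈W/|Sₜ|⌉ is taken directly from the text above.

```python
import math

def choose_action(conf, entropy, consensus, diversity,
                  conf_lo=0.6, ent_hi=1.0, cons_hi=0.8, div_lo=0.2):
    """Map pool statistics to one of the four controller actions.
    Thresholds and priority order are hypothetical, not the paper's."""
    if diversity < div_lo:
        return "BRANCH"        # pool has collapsed: re-open exploration
    if conf < conf_lo:
        return "SINGLE_TOKEN"  # low-confidence tail: local token refinement
    if entropy > ent_hi and consensus < cons_hi:
        return "MULTI_TOKEN"   # uncertain but diverse: widen pool cheaply
    return "NONE"              # continue standard decoding

def expansion_factor(num_paths, target_width):
    """r_t = ceil(W / |S_t|): keeps the active pool near width W."""
    return math.ceil(target_width / num_paths)
```

With this shape, the controller is stateless between checkpoints: every T steps it reads the four scalars, picks an action, and sizes the expansion so the total path count stays near the budgeted width W.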
The SINGLE‑TOKEN primitive leverages the inherent expert diversity of Mixture‑of‑Experts (MoE) models: for a given token, multiple expert proposals are generated, aggregated, and fed back into the decoder, allowing local correction of low‑confidence tail regions without resampling entire paths. This addresses the “late‑stage exploitation” need identified in observation 2.
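The aggregation idea can be sketched as follows, assuming each routed expert exposes a next-token distribution. This is a deliberate simplification: the paper's mechanism operates inside MoE routing and feeds the aggregate back into the decoder, whereas the sketch only shows the mixing-and-argmax step.

```python
def refine_token(expert_dists, weights=None):
    """Mix per-expert next-token distributions (uniformly by default)
    and return (best_token_id, mixed_distribution). Illustrative only:
    the paper's primitive is embedded in the MoE decoding loop."""
    n = len(expert_dists)
    vocab = len(expert_dists[0])
    if weights is None:
        weights = [1.0 / n] * n
    mixed = [sum(w * d[v] for w, d in zip(weights, expert_dists))
             for v in range(vocab)]
    best = max(range(vocab), key=lambda v: mixed[v])
    return best, mixed
```

The point of the primitive is cost: correcting one low-confidence tail token touches a handful of expert proposals, rather than resampling an entire path.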
At answer time, HyPER applies a length‑aware, confidence‑weighted voting scheme. Because longer paths tend to accumulate lower average confidence, the voting rule discounts overly long, low‑confidence candidates, mitigating the existence‑selection gap illustrated in Figure 3. Each path’s vote is weighted by its global average token confidence, and a bias toward shorter, higher‑confidence paths improves final answer selection.
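One plausible instantiation of such a voting rule is sketched below. The exponential length discount and the `length_penalty` knob are assumptions; the summary specifies only that votes are weighted by global average token confidence with a bias toward shorter, higher-confidence paths.

```python
import math
from collections import defaultdict

def select_answer(paths, length_penalty=0.001):
    """Length-aware, confidence-weighted voting over candidate answers.
    `paths` is a list of (answer, per-token confidences). Each path votes
    with weight = mean confidence * exp(-length_penalty * length); the
    exp discount is a hypothetical form, not the paper's exact rule."""
    scores = defaultdict(float)
    for answer, confs in paths:
        mean_conf = sum(confs) / len(confs)
        scores[answer] += mean_conf * math.exp(-length_penalty * len(confs))
    return max(scores, key=scores.get)
```

Under this rule a single short, high-confidence path can outweigh several long, low-confidence paths that agree on a wrong answer, which is exactly the existence-selection failure mode that plain majority voting exhibits.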
Experiments span four MoE LLMs (including Switch‑Transformer, GLaM, and other sparsely‑gated models) and a suite of reasoning benchmarks (AIME, HMMT, MATH, etc.). Under comparable token budgets, HyPER consistently outperforms prior baselines—Self‑Consistency, DeepConf, Thread, and others—by 8–10 percentage points in accuracy while reducing token consumption by 25–40%. Ablation studies confirm that (i) the online controller is essential, as a static schedule yields only a 3–4% accuracy gain; (ii) token‑level refinement contributes most of the compute savings; and (iii) confidence‑aware voting significantly closes the existence‑selection gap.
In summary, HyPER provides a unified, training‑free mechanism that (1) dynamically balances exploration and exploitation via an online controller, (2) exploits MoE routing for efficient token‑level refinement without full‑path resampling, and (3) employs a length‑ and confidence‑aware aggregation strategy to improve answer selection. The framework works with off‑the‑shelf post‑trained LLMs, requires no additional fine‑tuning, and demonstrates that adaptive, phase‑aware control can substantially improve the accuracy‑compute trade‑off in test‑time reasoning. Future work may extend HyPER to non‑MoE architectures and more complex multi‑step problem domains.