More Bang for the Buck: Process Reward Modeling with Entropy-Driven Uncertainty
We introduce the Entropy-Driven Uncertainty Process Reward Model (EDU-PRM), a novel entropy-driven training framework for process reward modeling that enables dynamic, uncertainty-aligned segmentation of complex reasoning steps, eliminating the need for costly manual step annotations. Unlike previous Process Reward Models (PRMs) that rely on static partitioning and human labeling, EDU-PRM automatically anchors step boundaries at tokens with high predictive entropy, effectively capturing intrinsic logical transitions and facilitating efficient exploration of diverse reasoning paths. On the ProcessBench benchmark, EDU-PRM outperforms strong public PRM baselines such as Math-Shepherd PRM and Omega PRM, and achieves results comparable to SOTA models while using only 1.5% of the training data. Furthermore, by leveraging our proposed EDU sampling strategy, we observe an accuracy boost from 64.7% to 67.3% on generative reasoning tasks, together with a 32% reduction in token usage. These findings underscore the potential of EDU-PRM as a scalable and annotation-efficient paradigm for process supervision in mathematical reasoning, paving the way for more efficient and robust approaches to complex mathematical problem solving.
💡 Research Summary
The paper introduces the Entropy‑Driven Uncertainty Process Reward Model (EDU‑PRM), a novel framework for training process reward models that eliminates the need for costly human‑annotated step labels. Traditional Process Reward Models (PRMs) rely on static partitioning heuristics (e.g., blank lines, punctuation) or extensive human/LLM labeling to define intermediate reasoning steps. This creates two major problems: (1) the high expense and scalability limits of acquiring step‑level annotations, and (2) the “cheating” phenomenon where high step scores do not guarantee a correct final answer, undermining the reliability of stepwise supervision.
EDU‑PRM addresses these issues by using token‑level predictive entropy as an active signal to locate “uncertainty anchors.” At each decoding step t, the model computes the entropy Hₜ of the softmax distribution over the vocabulary. When Hₜ exceeds a predefined threshold τ, the token is treated as a logical transition point. The model then branches into two continuations using the top‑2 logits, while subsequent tokens are generated greedily until the next anchor is encountered. This process yields a binary tree of reasoning fragments, each bounded by high‑entropy anchors.
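The anchor-and-branch mechanism described above can be sketched in a few lines. This is a minimal illustration with hypothetical helper names (`token_entropy`, `is_uncertainty_anchor`, `top2_branches` are not from the paper); in practice the distribution comes from an LLM's softmax output at each decoding step, and branching happens inside the decoding loop.

```python
import math

def token_entropy(probs):
    """Shannon entropy H_t (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def is_uncertainty_anchor(probs, tau=1.0):
    """A token position is an uncertainty anchor when H_t exceeds tau."""
    return token_entropy(probs) > tau

def top2_branches(probs):
    """Indices of the two highest-probability tokens; at an anchor the
    model forks one continuation per index, then decodes greedily
    until the next anchor is reached."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]

# A flat distribution is high-entropy (branch); a peaked one is not (continue greedily).
flat = [0.3, 0.3, 0.2, 0.2]
peaked = [0.97, 0.01, 0.01, 0.01]
print(is_uncertainty_anchor(flat, tau=1.0))    # branches on the top-2 tokens
print(is_uncertainty_anchor(peaked, tau=1.0))  # single greedy continuation
```

Repeating this fork-then-greedy-decode cycle at every anchor is what produces the binary tree of reasoning fragments.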
After the tree is built, each fragment is assigned a binary correctness label (0 for incorrect, 1 for correct) based solely on the validity of the final solution, using Monte‑Carlo Estimation (MCE). No intermediate human or LLM judgments are required; the only supervision signal is whether the overall answer is right. The resulting (question, fragment, label) triples are used to train a classifier‑style PRM with a standard cross‑entropy loss.
The authors implement EDU‑PRM at two model sizes, Qwen2.5‑72B‑Base and Qwen2.5‑7B‑Base. They construct a training corpus from the MATH dataset: 7,500 problems are sampled, each with up to 100 candidate solutions generated via EDU‑sampling (entropy threshold = 1.0). This yields roughly 1.42 million training instances with a balanced distribution of hard (52%) and soft (48%) labels.
Evaluation is performed on two fronts. First, the authors directly assess PRM judgment accuracy on the ProcessBench benchmark, which measures the ability to predict whether a given solution is correct. EDU‑PRM‑72B achieves 88.4% accuracy on the MATH test set, surpassing the strong baseline Qwen2.5‑Math‑PRM‑72B (87.8%). It matches or exceeds other baselines on GSM8K (94.2%) and OlympiadBench (77.2%). The 7B version performs slightly lower, reflecting capacity constraints.
Second, the authors test PRMs as selectors in a Best‑of‑N (BoN) setting. For each query, 128 candidate solutions are generated by a Qwen2‑7B‑Instruct model. Greedy‑EDU‑PRM (both 7B and 72B) selects the most promising answer based on its step‑wise scores. Compared with traditional high‑temperature (HT) sampling, EDU‑sampling reduces token consumption by 32% while raising accuracy from 64.7% to 67.3% on generative reasoning tasks. The "Pruning‑EDU" variant further cuts computation by discarding low‑scoring branches early, concentrating resources on promising trajectories.
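The selection step can be sketched as below. One assumption is flagged explicitly: the summary does not say how step-wise scores are aggregated into a single solution score, so this sketch uses the minimum over steps, a common PRM aggregation choice (a chain is only as strong as its weakest step); the paper may aggregate differently.

```python
def solution_score(step_scores):
    """Aggregate step-wise PRM scores into one solution score.
    ASSUMPTION: min-aggregation; the paper's rule is not specified here."""
    return min(step_scores)

def best_of_n(candidates):
    """Best-of-N selection: return the candidate answer whose aggregated
    PRM score is highest. `candidates` maps answers to step-score lists."""
    return max(candidates, key=lambda ans: solution_score(candidates[ans]))

# Hypothetical candidates: one weak step sinks an otherwise strong chain.
candidates = {
    "x = 4": [0.90, 0.80, 0.85],   # uniformly solid steps
    "x = 7": [0.95, 0.30, 0.90],   # one low-scoring step
}
print(best_of_n(candidates))       # -> x = 4
```

The "Pruning‑EDU" variant corresponds to applying such a score check early, dropping branches whose running score falls below a cutoff before they consume further tokens.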
Key insights include: (1) Entropy serves as an effective, model‑intrinsic cue for logical step boundaries, outperforming static heuristics; (2) Monte‑Carlo fragment‑level rewards align stepwise evaluation with final answer correctness, mitigating the cheating problem; (3) The approach is annotation‑efficient, requiring only final‑answer labels, yet scales to large model sizes and datasets; (4) During inference, EDU‑sampling provides a principled alternative to high‑temperature sampling, delivering comparable or better accuracy with substantially lower token budgets.
The paper positions EDU‑PRM as a scalable, cost‑effective paradigm for process supervision in mathematical reasoning. By leveraging uncertainty as a dynamic segmentation tool and avoiding manual step annotations, it opens avenues for broader application to other domains such as code generation, scientific reasoning, or multimodal tasks. Future work may explore adaptive entropy thresholds, richer uncertainty metrics, and extensions to non‑mathematical reasoning contexts.