Discovering Process-Outcome Credit in Multi-Step LLM Reasoning

Notice: This research summary and analysis were automatically generated using AI. For authoritative details, please refer to the original arXiv source.

Reinforcement Learning (RL) is a potent paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs), yet standard outcome-based approaches often suffer from reward sparsity and inefficient credit assignment. In this paper, we propose a framework that provides continuous reward signals through a Step-wise Marginal Information Gain (MIG) mechanism, which quantifies the intrinsic value of each reasoning step against a Monotonic Historical Watermark, effectively filtering out training noise. To disentangle credit distribution, we implement a Decoupled Masking Strategy, applying process-oriented rewards specifically to the chain-of-thought (CoT) and outcome-oriented rewards to the full completion. Additionally, we incorporate a Dual-Gated SFT objective to stabilize training with high-quality structural and factual signals. Extensive experiments on textual and multi-modal benchmarks (e.g., MATH, Super-CLEVR) demonstrate that our approach consistently outperforms baselines such as GRPO in both sample efficiency and final accuracy. Furthermore, our model exhibits superior out-of-distribution robustness, with promising zero-shot transfer to unseen and challenging reasoning tasks.


💡 Research Summary

The paper tackles the long‑standing issue of reward sparsity and credit assignment in reinforcement‑learning (RL) approaches for large language model (LLM) reasoning. Instead of relying solely on a binary terminal reward, the authors introduce a Step‑wise Marginal Information Gain (MIG) mechanism that provides dense, intrinsic feedback for each reasoning step. MIG is computed by measuring the log‑likelihood of the ground‑truth answer conditioned on the current reasoning prefix and comparing it to a monotonic historical watermark that records the best likelihood achieved so far. Only when a step yields a genuine increase in this likelihood does it receive a positive reward, ensuring that logical breakthroughs are rewarded regardless of their position and preventing “pump‑and‑dump” reward hacking.
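The watermark mechanism described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name `mig_rewards` and the `baseline` argument (the answer log-likelihood before any reasoning step) are assumptions made for the example.

```python
def mig_rewards(step_loglikes, baseline):
    """Dense per-step rewards from a monotonic historical watermark.

    step_loglikes[i] is log p(answer | prompt, steps[0..i]), i.e. the
    likelihood of the ground-truth answer given the reasoning prefix.
    A step earns a positive reward only when it pushes the likelihood
    above the best value seen so far, so a later breakthrough is still
    rewarded while "pump-and-dump" oscillations earn nothing.
    """
    watermark = baseline  # best likelihood achieved so far; never decreases
    rewards = []
    for ll in step_loglikes:
        gain = max(0.0, ll - watermark)  # reward genuine improvement only
        rewards.append(gain)
        watermark = max(watermark, ll)   # monotonic update
    return rewards
```

For example, a trajectory whose answer log-likelihoods go -5, -4, -6, -3 (baseline -6) rewards steps 1, 2, and 4 but gives the regressive third step zero, rather than a negative signal that could penalize exploration.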

To keep process exploration and outcome correctness from conflicting, the training objective is split into three components: (1) L_MIG, which applies the MIG‑derived advantage to tokens inside a Chain‑of‑Thought mask, encouraging diverse and semantically meaningful reasoning paths; (2) L_Outcome, a conventional GRPO‑style loss that combines a binary correctness signal with a format‑compliance reward, applied to the entire sequence via a complementary mask; and (3) L_Gated‑SFT, a self‑supervised distillation term gated by both structural validity and answer correctness, guaranteeing that only high‑quality trajectories are used for supervised fine‑tuning. This decoupled masking strategy cleanly separates process‑oriented and result‑oriented learning signals.
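The three-way split can be summarized in a small sketch. Function and parameter names here (`decoupled_loss`, the weights `w_*`) are illustrative assumptions; the per-token loss values would come from the respective RL and SFT objectives in the actual training loop.

```python
def decoupled_loss(mig_losses, outcome_losses, sft_losses,
                   cot_mask, structural_ok, answer_ok,
                   w_mig=1.0, w_out=1.0, w_sft=1.0):
    """Combine the three objectives under the decoupled masking strategy.

    cot_mask: 1 for tokens inside the chain-of-thought, 0 elsewhere.
    The MIG term touches only CoT tokens; the GRPO-style outcome term
    is applied over the full completion; the SFT term is gated by BOTH
    structural validity and answer correctness (the "dual gate").
    """
    l_mig = sum(l * m for l, m in zip(mig_losses, cot_mask))  # process signal
    l_out = sum(outcome_losses)                               # outcome signal
    gate = 1.0 if (structural_ok and answer_ok) else 0.0      # dual-gated SFT
    l_sft = gate * sum(sft_losses)
    return w_mig * l_mig + w_out * l_out + w_sft * l_sft
```

Because the masks route each signal to its own token span and the SFT gate zeroes out any trajectory that is malformed or incorrect, a noisy process reward can never overwrite the outcome objective, and only verified trajectories are distilled.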

The framework is modality‑agnostic: the same MIG computation can be applied to multimodal tasks, enabling the method to scale from pure text benchmarks (GSM8K, MATH, TAL‑SCQ5K) to vision‑language datasets (CMM‑Math, ChartQA, Super‑CLEVR). Experiments across eight training datasets and six out‑of‑distribution benchmarks demonstrate consistent improvements in sample efficiency and final accuracy over strong baselines such as GRPO, DAPO, and GSPO. The authors also address answer‑variant issues in mathematics by extending the likelihood calculation to a set of semantically equivalent solutions, further enhancing robustness.
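The answer-variant extension can be sketched as below. Taking the maximum over variants is an assumption of this example (one natural aggregation); `loglike_fn` stands in for the model-specific scoring call described earlier.

```python
def variant_log_likelihood(loglike_fn, reasoning_prefix, answer_variants):
    """Score a reasoning prefix against every semantically equivalent
    form of the ground-truth answer (e.g. "1/2" vs "0.5") and keep the
    best score, so a correct derivation is not penalized merely for
    leading to an unconventional surface form of the answer.
    """
    return max(loglike_fn(reasoning_prefix, ans) for ans in answer_variants)
```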

Overall, the work presents a principled, dense reward signal derived from the model’s own probability distribution, a dual‑gated SFT scheme for high‑fidelity data distillation, and a decoupled optimization that together enable LLMs to explore richer reasoning trajectories while staying anchored to correct outcomes. This contributes a significant step toward autonomous, scalable, and trustworthy reasoning in large language models.

