From Absolute to Relative: Rethinking Reward Shaping in Group-Based Reinforcement Learning

Notice: This research summary and analysis were automatically generated using AI technology. For accuracy, please refer to the original arXiv source.

Reinforcement learning has become a cornerstone for enhancing the reasoning capabilities of Large Language Models, where group-based approaches such as GRPO have emerged as efficient paradigms that optimize policies by leveraging intra-group performance differences. However, these methods typically rely on absolute numerical rewards, introducing intrinsic limitations. In verifiable tasks, identical group evaluations often result in sparse supervision, while in open-ended scenarios, the score range instability of reward models undermines advantage estimation based on group means. To address these limitations, we propose Reinforcement Learning with Relative Rewards (RLRR), a framework that shifts reward shaping from absolute scoring to relative ranking. Complementing this framework, we introduce the Ranking Reward Model, a listwise preference model tailored for group-based optimization to directly generate relative rankings. By transforming raw evaluations into robust relative signals, RLRR effectively mitigates signal sparsity and reward instability. Experimental results demonstrate that RLRR yields consistent performance improvements over standard group-based baselines across reasoning benchmarks and open-ended generation tasks.


💡 Research Summary

This paper investigates a fundamental limitation of recent group‑based reinforcement learning (RL) methods for large language models (LLMs), exemplified by Group Relative Policy Optimization (GRPO). GRPO generates a set of responses for each prompt, evaluates each response with an absolute scalar reward, and computes a group‑wise advantage by subtracting the group mean (optionally normalized by the group standard deviation). While this design is memory‑efficient and works well with Proximal Policy Optimization (PPO)‑style clipping, it suffers from two critical problems.
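The group-wise advantage described above can be sketched in a few lines. This is a minimal illustration of the mean-subtraction (and optional standard-deviation normalization) step, not the paper's implementation; the function name and the `eps` stabilizer are assumptions.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: subtract the group mean and (optionally)
    normalize by the group standard deviation. `eps` guards against
    division by zero when all rewards in the group are identical."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# A mixed group yields informative, zero-mean advantages:
adv = group_advantages([1.0, 0.0, 1.0, 0.0])
```

Note that the advantages always sum to zero within a group, so the signal comes entirely from intra-group differences.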

First, in tasks that can be verified with rule‑based checkers (e.g., mathematics or coding), the reward is binary (correct/incorrect). As training progresses, most sampled groups become homogeneous—either all correct or all incorrect. Consequently, the intra‑group reward variance collapses, the advantage becomes zero, and the proportion of “effective samples” (groups with mixed outcomes) drops below 40 % in later stages. This sparsity dramatically reduces data efficiency.
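The sparsity effect is easy to see concretely: with binary rewards, a group contributes a nonzero advantage only when its outcomes are mixed. The sketch below (hypothetical helper names, not from the paper) counts the fraction of such "effective" groups.

```python
import numpy as np

def is_effective(rewards):
    """A group is 'effective' only if its binary outcomes are mixed;
    a homogeneous group (all correct or all incorrect) has zero
    variance, so every advantage in it is zero."""
    r = np.asarray(rewards, dtype=float)
    return r.std() > 0

# Two homogeneous groups and one mixed group:
groups = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 1, 1]]
frac = sum(is_effective(g) for g in groups) / len(groups)
```

As the policy improves, more groups become all-correct, driving this fraction down and wasting sampled rollouts.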

Second, in open‑ended generation tasks, scalar reward models (SRMs) produce unbounded scores. The advantage computation, which relies on the mean and standard deviation of these scores, becomes highly sensitive to the scale of the SRM outputs. Large fluctuations in the score range destabilize updates and can cause erratic policy changes.

To address both issues, the authors propose Reinforcement Learning with Relative Rewards (RLRR), a framework that replaces absolute scoring with intra‑group ranking. The central idea is to derive a ranking for each response within a group and use that ranking as the primary learning signal. Two concrete reward‑shaping functions are introduced.
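One way to picture the shift from absolute scores to relative rankings is the following sketch. It derives a rank from raw scores purely for illustration (the paper's Ranking Reward Model produces the ranking directly), and the mapping of ranks to evenly spaced rewards in [-1, 1] is an assumption, not the paper's formula.

```python
import numpy as np

def rank_rewards(scores):
    """Map raw scores to rank-based rewards evenly spaced in [-1, 1],
    with the best-ranked response receiving +1. Ties are not handled
    in this sketch."""
    s = np.asarray(scores, dtype=float)
    order = s.argsort()                  # indices from worst to best
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(s))     # 0 = worst, n-1 = best
    return 2 * ranks / (len(s) - 1) - 1

# Wildly different score scales produce the same bounded signal:
r = rank_rewards([3.2, -15.0, 100.0, 0.1])
```

Because only the ordering matters, the resulting signal is invariant to rescaling of the reward model's outputs, which is exactly the instability the framework targets.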

  1. Hybrid Relative Reward (HRR) – for verifiable tasks. The binary rule‑based score \(s_{\text{rule}}\) is retained, and a rank‑dependent correction, derived from each response's position within its group, is added on top of it.

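The summary truncates before giving the HRR formula. As a sketch only, under stated assumptions (the coefficient \(\alpha\), the linear rank term, and the symbols \(\mathrm{rank}_i\) and \(G\) are not from the source), one plausible instantiation of "binary score plus rank-dependent correction" is:

```latex
% Illustrative form only; the exact HRR formula is truncated in the
% summary above. alpha scales the rank correction, rank_i is the
% intra-group rank of response y_i (1 = best), G is the group size.
r_i \;=\; s_{\text{rule}}(y_i)
      \;+\; \alpha \left( 1 - \frac{2\,(\mathrm{rank}_i - 1)}{G - 1} \right)
```

Here the correction ranges over \([-\alpha, \alpha]\), so it breaks ties within homogeneous groups without overriding the rule-based signal when \(\alpha < 1\).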