On the optimization dynamics of RLVR: Gradient gap and step size thresholds
Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has found significant empirical success. However, a principled understanding of why it works is lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a new quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below $100\%$. Importantly, our theory applies to any policy-gradient algorithm and so characterizes the dynamics of popular approaches such as REINFORCE and GRPO. We validate these predictions through controlled bandit simulations and language model experiments on post-training Qwen2.5-Math-7B with GRPO.
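The convergent regime the abstract describes can be illustrated in the simplest RLVR-like setting: a softmax bandit with a binary (verifiable) reward, trained with vanilla REINFORCE at a step size small enough to sit below the threshold. This is a minimal sketch, not the paper's experimental setup; the arm counts, learning rate, and batch size here are illustrative choices.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_bandit(n_arms=5, good_arm=0, lr=0.5, batch=64, steps=200, seed=0):
    """Vanilla REINFORCE on a K-armed bandit with a 0/1 reward.

    Reward is 1 iff the sampled arm equals `good_arm`, mirroring RLVR's
    binary success/failure feedback. Returns the success probability
    over the course of training.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_arms)
    history = []
    for _ in range(steps):
        p = softmax(theta)
        arms = rng.choice(n_arms, size=batch, p=p)
        rewards = (arms == good_arm).astype(float)
        # Monte Carlo score-function gradient: E[R * grad_theta log pi(a)],
        # where grad_theta log softmax(theta)[a] = e_a - p.
        grad = np.zeros(n_arms)
        for a, r in zip(arms, rewards):
            score = -p.copy()
            score[a] += 1.0
            grad += r * score
        theta += lr * grad / batch
        history.append(p[good_arm])
    return history

hist = reinforce_bandit()
print(f"success prob: start={hist[0]:.2f}, end={hist[-1]:.2f}")
```

With this (sub-threshold) learning rate the success probability climbs steadily from chance toward 1; the paper's collapse phenomenon concerns step sizes above the critical threshold, which this toy run deliberately stays under.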
💡 Research Summary
This paper provides a rigorous theoretical analysis of Reinforcement Learning with Verifiable Rewards (RLVR), a paradigm that fine‑tunes large language models (LLMs) using only binary success/failure feedback. While RLVR has demonstrated impressive empirical gains, a solid understanding of its optimization dynamics has been missing. The authors fill this gap by introducing the concept of a Gradient Gap, which captures the direction in parameter space that moves probability mass from low‑reward (incorrect) responses to high‑reward (correct) responses.
Core definitions
- The response space is split into a "good" set $O^+$ (reward $=1$) and a "bad" set $O^-$ (reward $=0$).
- Conditional policies $\pi^+_\theta$ and $\pi^-_\theta$ describe the distribution of responses inside each set.
- The expected score-function gradients inside the two regions are $g^+(\theta)=\mathbb{E}_{\pi^+_\theta}[\nabla_\theta \log \pi_\theta(y)]$ and $g^-(\theta)=\mathbb{E}_{\pi^-_\theta}[\nabla_\theta \log \pi_\theta(y)]$; their difference $g^+(\theta)-g^-(\theta)$ is the Gradient Gap.
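These conditional gradients connect directly to the ordinary policy gradient: writing $p$ for the success rate $J(\theta)=\Pr(O^+)$, one can verify that $\nabla J(\theta) = p(1-p)\,\big(g^+(\theta)-g^-(\theta)\big)$, since the unconditional expected score function is zero. The numerical check below confirms this identity on a toy softmax policy over four responses; the specific logits and the good/bad split are made up for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Softmax policy over 4 responses; responses {0, 1} form the good set O^+.
theta = np.array([0.3, -0.2, 0.1, 0.5])
good = np.array([True, True, False, False])

p = softmax(theta)
p_succ = p[good].sum()                   # success rate J(theta)

# Score function of each response a: grad_theta log pi(a) = e_a - p.
scores = np.eye(len(theta)) - p          # row a holds the score of response a

# Conditional expected score functions inside each region.
g_plus = (p[good] @ scores[good]) / p_succ
g_minus = (p[~good] @ scores[~good]) / (1 - p_succ)
gap = g_plus - g_minus                   # the Gradient Gap direction

# Exact policy gradient of J(theta): E[R * grad log pi].
grad_J = p[good] @ scores[good]

# Identity: grad J = p(1-p) * (g^+ - g^-).
print(np.allclose(grad_J, p_succ * (1 - p_succ) * gap))  # True
```

The factorization shows why the Gradient Gap is the natural ascent direction, and why its coefficient $p(1-p)$ (which vanishes as the success rate approaches 0 or 1) ties the effective step size to the current success rate.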