On the optimization dynamics of RLVR: Gradient gap and step size thresholds
Reinforcement Learning with Verifiable Rewards (RLVR), which uses simple binary feedback to post-train large language models, has found significant empirical success. However, a principled understanding of why it works is lacking. This paper builds a theoretical foundation for RLVR by analyzing its training process at both the full-response (trajectory) and token levels. Central to our analysis is a new quantity called the Gradient Gap, which formalizes the direction of improvement from low-reward to high-reward regions of the response space. We prove that convergence critically depends on aligning the update direction with this Gradient Gap. Moreover, we derive a sharp step-size threshold based on the magnitude of the Gradient Gap: below it, learning converges, whereas above it, performance collapses. Our theory further predicts how the critical step size must scale with response length and the success rate, thereby explaining why practical heuristics such as length normalization improve stability and showing that, with a fixed learning rate, the success rate can stagnate strictly below $100\%$. Importantly, our theory applies to any policy-gradient algorithm and so characterizes the dynamics of popular approaches such as REINFORCE and GRPO. We validate these predictions through controlled bandit simulations and language model experiments on post-training Qwen2.5-Math-7B with GRPO.
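The convergent regime the abstract describes can be illustrated in the simplest RLVR-like setting: a softmax bandit with a binary (verifiable) reward, trained with vanilla REINFORCE at a step size small enough to sit below the threshold. This is a minimal sketch, not the paper's experimental setup; the arm counts, learning rate, and batch size here are illustrative choices.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_bandit(n_arms=5, good_arm=0, lr=0.5, batch=64, steps=200, seed=0):
    """Vanilla REINFORCE on a K-armed bandit with a 0/1 reward.

    Reward is 1 iff the sampled arm equals `good_arm`, mirroring RLVR's
    binary success/failure feedback. Returns the success probability
    over the course of training.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_arms)
    history = []
    for _ in range(steps):
        p = softmax(theta)
        arms = rng.choice(n_arms, size=batch, p=p)
        rewards = (arms == good_arm).astype(float)
        # Monte Carlo score-function gradient: E[R * grad_theta log pi(a)],
        # where grad_theta log softmax(theta)[a] = e_a - p.
        grad = np.zeros(n_arms)
        for a, r in zip(arms, rewards):
            score = -p.copy()
            score[a] += 1.0
            grad += r * score
        theta += lr * grad / batch
        history.append(p[good_arm])
    return history

hist = reinforce_bandit()
print(f"success prob: start={hist[0]:.2f}, end={hist[-1]:.2f}")
```

With this (sub-threshold) learning rate the success probability climbs steadily from chance toward 1; the paper's collapse phenomenon concerns step sizes above the critical threshold, which this toy run deliberately stays under.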
💡 Research Summary
This paper provides a rigorous theoretical analysis of Reinforcement Learning with Verifiable Rewards (RLVR), a paradigm that fine‑tunes large language models (LLMs) using only binary success/failure feedback. While RLVR has demonstrated impressive empirical gains, a solid understanding of its optimization dynamics has been missing. The authors fill this gap by introducing the concept of a Gradient Gap, which captures the direction in parameter space that moves probability mass from low‑reward (incorrect) responses to high‑reward (correct) responses.
Core definitions
- The response space is split into a "good" set $O^+$ (reward $=1$) and a "bad" set $O^-$ (reward $=0$).
- Conditional policies $\pi^+_\theta$ and $\pi^-_\theta$ describe the distribution of responses inside each set.
- The expected score-function gradients inside the two regions are $g^+(\theta)=\mathbb{E}_{\pi^+_\theta}[\nabla_\theta \log \pi_\theta(y)]$ and $g^-(\theta)=\mathbb{E}_{\pi^-_\theta}[\nabla_\theta \log \pi_\theta(y)]$; their difference $g^+(\theta)-g^-(\theta)$ is the Gradient Gap.
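These conditional gradients connect directly to the ordinary policy gradient: writing $p$ for the success rate $J(\theta)=\Pr(O^+)$, one can verify that $\nabla J(\theta) = p(1-p)\,\big(g^+(\theta)-g^-(\theta)\big)$, since the unconditional expected score function is zero. The numerical check below confirms this identity on a toy softmax policy over four responses; the specific logits and the good/bad split are made up for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Softmax policy over 4 responses; responses {0, 1} form the good set O^+.
theta = np.array([0.3, -0.2, 0.1, 0.5])
good = np.array([True, True, False, False])

p = softmax(theta)
p_succ = p[good].sum()                   # success rate J(theta)

# Score function of each response a: grad_theta log pi(a) = e_a - p.
scores = np.eye(len(theta)) - p          # row a holds the score of response a

# Conditional expected score functions inside each region.
g_plus = (p[good] @ scores[good]) / p_succ
g_minus = (p[~good] @ scores[~good]) / (1 - p_succ)
gap = g_plus - g_minus                   # the Gradient Gap direction

# Exact policy gradient of J(theta): E[R * grad log pi].
grad_J = p[good] @ scores[good]

# Identity: grad J = p(1-p) * (g^+ - g^-).
print(np.allclose(grad_J, p_succ * (1 - p_succ) * gap))  # True
```

The factorization shows why the Gradient Gap is the natural ascent direction, and why its coefficient $p(1-p)$ (which vanishes as the success rate approaches 0 or 1) ties the effective step size to the current success rate.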