Expected Return Causes Outcome-Level Mode Collapse in Reinforcement Learning and How to Fix It with Inverse Probability Scaling

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Many reinforcement learning (RL) problems admit multiple terminal solutions of comparable quality, where the goal is not to identify a single optimum but to represent a diverse set of high-quality outcomes. Nevertheless, policies trained by standard expected return maximization routinely collapse onto a small subset of outcomes, a phenomenon commonly attributed to insufficient exploration or weak regularization. We show that this explanation is incomplete: outcome-level mode collapse is a structural consequence of the expected-return objective itself. Under idealized learning dynamics, the log-probability ratio between any two outcomes evolves linearly in their reward difference, implying exponential ratio divergence and inevitable collapse independent of the exploration strategy, entropy regularization, or optimization algorithm. We identify the source of this pathology as the probability multiplier inside the expectation and propose a minimal correction: inverse probability scaling, which removes outcome-frequency amplification from the learning signal, fundamentally changes the learning dynamics, and provably yields reward-proportional terminal distributions, preventing collapse in multimodal settings. We instantiate this principle in Group Relative Policy Optimization (GRPO) as a drop-in modification, IPS-GRPO, requiring no auxiliary models or architectural changes. Across reasoning and molecular generation tasks, IPS-GRPO consistently reduces outcome-level mode collapse while matching or exceeding baseline performance, suggesting that correcting the objective rather than adding exploration heuristics is key to reliable multimodal policy optimization.


💡 Research Summary

The paper tackles a pervasive issue in reinforcement learning (RL) when multiple terminal solutions of comparable quality exist: outcome‑level mode collapse, where a policy converges to a narrow subset of high‑reward outcomes despite the presence of many equally good alternatives. While prior work attributes this phenomenon to insufficient exploration, weak regularization, or sub‑optimal hyper‑parameters, the authors demonstrate that the root cause lies in the expected‑return objective itself.
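The frequency amplification at the heart of this argument can be seen directly in how the expected-return objective distributes its learning signal: each outcome contributes in proportion to how often the current policy already produces it. The following is a minimal numerical sketch of this effect (a hypothetical two-outcome bandit of our own construction, not an example from the paper), contrasting the per-outcome signal under the standard objective with an inverse-probability-scaled one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-outcome bandit: outcomes A and B have EQUAL reward,
# but the current policy already favors A.
p = np.array([0.9, 0.1])   # current outcome probabilities p(o)
r = np.array([1.0, 1.0])   # equal-quality outcomes

n = 200_000
s = rng.choice(2, size=n, p=p)  # outcomes sampled from the policy

# Per-outcome learning signal under expected return: each sample of
# outcome k contributes weight r(k), so outcome k's average signal
# is p(k) * r(k) -- amplified by its current frequency.
std_signal = np.array([np.mean(np.where(s == k, r[s], 0.0)) for k in range(2)])

# Inverse probability scaling: weight each sample by r(o) / p(o),
# cancelling the frequency multiplier; the average signal becomes r(k).
ips_signal = np.array([np.mean(np.where(s == k, r[s] / p[s], 0.0)) for k in range(2)])

print(std_signal)  # ~ [0.9, 0.1]: A's head start is reinforced
print(ips_signal)  # ~ [1.0, 1.0]: equal outcomes get equal signal
```

Under the standard signal, the already-frequent outcome receives nine times the reinforcement of its equally good alternative, which is exactly the self-reinforcing loop that drives collapse; after inverse scaling, equal-reward outcomes receive equal pressure.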

Theoretical Insight
The authors formalize the problem as an episodic bandit where each trajectory deterministically yields a terminal outcome o with reward r(o). The standard objective is J(θ)=∑ₒ pθ(o) r(o), where pθ(o) is the policy‑induced probability of outcome o. Using a softmax parameterization (logits z) and continuous‑time gradient flow, they derive the dynamics of the log‑probability ratio between any two outcomes i and j:

d/dt log (pθ(i) / pθ(j)) ∝ r(i) − r(j),

i.e., the log-probability ratio grows linearly in time at a rate set by the reward difference. Whenever r(i) > r(j), the ratio pθ(i)/pθ(j) therefore diverges exponentially, and probability mass concentrates on the single highest-reward outcome, no matter how small the reward gap and regardless of exploration strategy or entropy regularization. The collapse is structural: it follows from the objective, not from the optimizer.
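These dynamics can be illustrated with a toy simulation (our own sketch under simplifying assumptions, not code from the paper): a three-outcome bandit with nearly equal rewards, a softmax policy over per-outcome logits, and plain gradient ascent. Maximizing expected return collapses onto the single best outcome, while inverse probability scaling, which in this setting amounts to ascending the reward-weighted log-likelihood ∑ₒ r(o) log pθ(o), converges to a reward-proportional distribution:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical 3-outcome bandit with nearly equal rewards.
r = np.array([1.00, 0.99, 0.98])
lr, steps = 0.5, 100_000

# (a) Expected return J = sum_o p(o) r(o).
#     Gradient w.r.t. logit k for a softmax policy: p(k) * (r(k) - p.r).
z = np.zeros(3)
for _ in range(steps):
    p = softmax(z)
    z += lr * p * (r - p @ r)
p_er = softmax(z)

# (b) Inverse probability scaling: dividing the learning signal by p(o)
#     yields the objective sum_o r(o) * log p(o), whose gradient w.r.t.
#     logit k is r(k) - p(k) * sum(r), with optimum p(o) = r(o) / sum(r).
z = np.zeros(3)
for _ in range(steps):
    p = softmax(z)
    z += lr * (r - p * r.sum())
p_ips = softmax(z)

print(p_er)   # nearly all mass on the single best outcome
print(p_ips)  # ~ r / r.sum(): reward-proportional, no collapse
```

Even though the three rewards differ by only 1–2%, variant (a) drives the log-probability gaps apart without bound, while variant (b) settles at the reward-proportional fixed point, matching the paper's claim that the fix lies in the objective rather than in exploration heuristics.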

