On the Plasticity and Stability for Post-Training Large Language Models
Training stability remains a critical bottleneck for Group Relative Policy Optimization (GRPO), often manifesting as a trade-off between reasoning plasticity and general capability retention. We identify the root cause as a geometric conflict between the plasticity and stability gradients, which leads to destructive interference. Crucially, we argue that deterministic projection methods are suboptimal for GRPO because they overlook the intrinsic stochasticity of group-based gradient estimates. To address this, we propose Probabilistic Conflict Resolution (PCR), a Bayesian framework that models gradients as random variables. PCR dynamically arbitrates conflicts via an uncertainty-aware "soft projection" mechanism, optimizing the signal-to-noise ratio. Extensive experiments demonstrate that PCR significantly smooths the training trajectory and achieves superior performance across a range of reasoning tasks.
💡 Research Summary
The paper tackles a long‑standing stability problem in Group Relative Policy Optimization (GRPO), a reinforcement‑learning‑based fine‑tuning method for large language models (LLMs). GRPO simultaneously optimizes two competing objectives: a "plasticity" loss that pushes the policy toward higher task‑specific reward, and a "stability" loss (a KL penalty) that keeps the policy close to its original distribution. The authors first decompose the GRPO objective into these two terms and derive the corresponding gradients, gₚₗₐ (plasticity) and gₛₜₐ (stability). Empirical analysis on an 8‑billion‑parameter model (DeepSeek‑R1‑Distill‑Llama‑8B) shows that, especially in the middle‑to‑deep MLP layers, the cosine similarity between gₚₗₐ and gₛₜₐ is persistently negative: the two gradients point in opposing directions and partially cancel each other. This destructive interference explains the extreme sensitivity of training to the KL coefficient β and the frequent oscillations observed in practice.
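The conflict diagnostic above amounts to measuring the cosine similarity between the two flattened gradient estimates, layer by layer. A minimal sketch (the toy vectors are illustrative, not values from the paper):

```python
import numpy as np

def cosine_similarity(g_pla: np.ndarray, g_sta: np.ndarray) -> float:
    """Cosine similarity between two flattened gradient vectors.
    A persistently negative value signals a plasticity/stability conflict."""
    denom = np.linalg.norm(g_pla) * np.linalg.norm(g_sta)
    return float(np.dot(g_pla, g_sta) / denom) if denom > 0 else 0.0

# Hypothetical per-layer gradients: roughly opposing directions,
# so the similarity comes out negative (destructive interference).
g_pla = np.array([1.0, -2.0, 0.5])
g_sta = np.array([-0.8, 1.5, -0.2])
print(cosine_similarity(g_pla, g_sta))
```

In practice this would be computed per layer from the two loss terms' gradients, which is how the paper localizes the conflict to the middle-to-deep MLP layers.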
Existing gradient‑conflict resolution methods such as PCGrad treat gradients as deterministic vectors and apply a hard geometric projection. However, in GRPO the gradients are Monte‑Carlo estimates derived from a small group of sampled queries, so they are noisy. Blindly projecting noisy gradients can discard useful learning signals or enforce incorrect constraints.
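The hard projection the authors argue against can be sketched generically (this is the standard PCGrad-style rule, not code from the paper): when the two gradients conflict, the component of one along the other is removed entirely.

```python
import numpy as np

def pcgrad_project(g_pla: np.ndarray, g_sta: np.ndarray) -> np.ndarray:
    """Hard (deterministic) projection in the PCGrad style: if the
    gradients conflict (negative dot product), subtract from g_pla its
    full component along g_sta, discarding the conflicting direction."""
    dot = float(np.dot(g_pla, g_sta))
    if dot < 0:
        return g_pla - dot / float(np.dot(g_sta, g_sta)) * g_sta
    return g_pla  # no conflict: gradient passes through unchanged
```

Because the projection is all-or-nothing, a conflict that is partly an artifact of Monte-Carlo noise still triggers full removal of the component, which is exactly the failure mode the paper highlights.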
To address this, the authors propose Probabilistic Conflict Resolution (PCR). They model each gradient as a multivariate Gaussian random variable, justified by the Central Limit Theorem applied to the averaging over query groups. The mean of each Gaussian is approximated by the observed gradient, while the covariance is simplified to an isotropic scalar variance estimated from intra‑group gradient variance. Assuming conditional independence between the two gradient estimates, they derive a Bayesian arbitration rule that computes a soft‑projection weight based on the signal‑to‑noise ratio of the two distributions. The resulting update rule is a closed‑form expression that scales the conflicting component rather than removing it entirely, thereby minimizing the mean‑squared error between the true gradient and the estimated update.
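The paper's exact closed-form arbitration rule is not reproduced in this summary, but the idea of an SNR-weighted "soft projection" can be illustrated as follows. The confidence weight `dot**2 / (dot**2 + var_pla + var_sta)` is a hypothetical stand-in for the authors' derived expression; it only captures the qualitative behavior (noisy estimates get a gentler projection):

```python
import numpy as np

def soft_project(g_pla: np.ndarray, g_sta: np.ndarray,
                 var_pla: float, var_sta: float) -> np.ndarray:
    """Uncertainty-aware soft projection (illustrative sketch, not the
    paper's closed form). The conflicting component of g_pla along
    g_sta is scaled by a confidence weight in [0, 1] derived from a
    signal-to-noise ratio, rather than removed outright."""
    dot = float(np.dot(g_pla, g_sta))
    if dot >= 0:
        return g_pla  # no conflict: leave the gradient untouched
    # Hypothetical SNR-based confidence: a large measured conflict
    # relative to the estimated gradient variance -> strong projection;
    # high variance -> the projection is scaled down.
    snr = dot**2 / (dot**2 + var_pla + var_sta)
    proj = dot / float(np.dot(g_sta, g_sta)) * g_sta
    return g_pla - snr * proj
```

At zero estimated variance this reduces to the hard projection, and as the variance grows the update falls back toward the raw gradient, matching the bias-variance trade-off described above.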
Because applying PCR to every parameter would be costly, the authors implement a hybrid scheme: PCR is applied only to the MLP layers (the primary knowledge‑storage components), while the attention layers continue to use the standard GRPO update. This design preserves the plasticity needed for reasoning while stabilizing the core knowledge.
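The hybrid scheme reduces to routing updates by parameter name. A sketch, assuming a hypothetical name filter `is_mlp_param` and a generic `resolve` callback standing in for PCR (actual module names depend on the model architecture):

```python
def is_mlp_param(name: str) -> bool:
    # Hypothetical filter; real parameter names vary by model family.
    return "mlp" in name

def hybrid_update(named_grads_pla: dict, named_grads_sta: dict, resolve) -> dict:
    """Apply the conflict-resolution rule `resolve` only to MLP-layer
    gradients; attention (and all other) parameters keep the plain
    combined GRPO gradient."""
    updates = {}
    for name, g_pla in named_grads_pla.items():
        g_sta = named_grads_sta[name]
        if is_mlp_param(name):
            updates[name] = resolve(g_pla, g_sta)   # PCR path
        else:
            updates[name] = g_pla + g_sta           # standard GRPO path
    return updates
```

Since the per-parameter cost of PCR is paid only on the MLP subset, this routing is also what keeps the overhead negligible at larger scales, as the 70B experiment below reports.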
Theoretical analysis shows that PCR is the optimal estimator in the gradient space under the assumed Gaussian model, achieving the best bias‑variance trade‑off. Empirically, PCR dramatically smooths training curves, reduces oscillations, and expands the Pareto frontier between reasoning performance (AIME Pass@1, MMLU) and language fluency (WikiText‑2 perplexity). Across a range of KL coefficients, PCR consistently outperforms vanilla GRPO by 2–4% absolute gain on reasoning metrics and yields lower perplexity, while also being less sensitive to hyper‑parameter tuning. A scalability test on a 70‑billion‑parameter model confirms that limiting PCR to MLP layers incurs negligible memory or compute overhead.
In conclusion, the paper identifies the geometric conflict between plasticity and stability gradients as the root cause of GRPO instability, and introduces a principled probabilistic framework that resolves this conflict via uncertainty‑aware soft projection. PCR offers a practical, theoretically grounded solution that improves stability and performance of post‑training LLMs, and opens avenues for further work on richer uncertainty modeling and extensions to other RL‑based fine‑tuning algorithms.