ADORA: Training Reasoning Models with Dynamic Advantage Estimation on Reinforcement Learning

Notice: This research summary and analysis were generated automatically using AI technology. For authoritative details, please refer to the original arXiv paper.

Reinforcement learning has become a cornerstone technique for developing reasoning models for complex tasks, from mathematical problem-solving to visual reasoning. The optimization of these models typically relies on policy-gradient methods, whose efficacy hinges on accurate estimation of an advantage function. Prevailing methods, however, employ static advantage estimation, which leads to inefficient credit assignment by neglecting how the utility of training samples changes over time. This limitation results in suboptimal policy updates, manifesting as slower convergence and increased learning instability, as models fail to adapt to evolving sample utilities. To address this problem, we introduce **ADORA** (**A**dvantage **D**ynamics via **O**nline **R**ollout **A**daptation), a novel framework for policy optimization. ADORA dynamically adjusts the weighting of the advantage function by adaptively categorizing training data into temporarily advantageous and temporarily disadvantageous samples, based on their evolving utility during online model rollouts. This data-differentiation strategy allows ADORA to be integrated seamlessly into existing policy-optimization algorithms without significant architectural modifications, enabling the policy to prioritize learning from more informative experiences and thereby achieve more efficient updates. Extensive evaluations across diverse model families and data scales demonstrate that ADORA is a robust and efficient framework: it significantly enhances long-form reasoning on both geometric and mathematical tasks, consistently achieving notable performance gains without requiring sensitive hyperparameter tuning.


💡 Research Summary

The paper introduces ADORA (Advantage Dynamics via Online Rollout Adaptation), a novel framework that dynamically adjusts advantage estimation during reinforcement‑learning (RL) fine‑tuning of reasoning models. Traditional policy‑gradient methods such as PPO and GRPO compute a per‑sample advantage once (often using normalized rewards) and treat it as static throughout training. This static treatment ignores the fact that a sample’s learning utility changes as the policy improves, leading to inefficient credit assignment, slower convergence, and instability—especially in long chain‑of‑thought (CoT) tasks where shallow, high‑reward shortcuts can dominate learning signals.

ADORA addresses this by classifying each training example on the fly into Temporarily Advantageous Samples (TAS) or Temporarily Disadvantageous Samples (TDS) based on two criteria derived from live rollouts:

  1. Length Advantage – a sample is considered length‑advantageous if the longest successful rollout length exceeds the average length of failed rollouts. Longer successful trajectories are assumed to reflect deeper reasoning.
  2. Difficulty Advantage – a sample is difficulty‑advantageous if its success rate lies between 0 and a predefined threshold τ (e.g., 0.6). This ensures that the model focuses on samples that are still challenging for its current competence.
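
The two criteria above can be sketched in a few lines of Python. The function name, the rollout record shape, and the choice to require both criteria jointly are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch of ADORA's per-sample TAS/TDS classification.

def is_temporarily_advantageous(rollouts, tau=0.6):
    """Classify a training sample as TAS (True) or TDS (False) from its
    live rollouts, each a dict with "length" and "success" keys."""
    successes = [r for r in rollouts if r["success"]]
    failures = [r for r in rollouts if not r["success"]]

    # Difficulty advantage: success rate strictly between 0 and tau,
    # i.e. the sample is solvable but still challenging.
    success_rate = len(successes) / len(rollouts)
    difficulty_ok = 0 < success_rate < tau

    # Length advantage: the longest successful rollout is longer than
    # the average failed rollout (needs at least one of each to compare).
    if successes and failures:
        max_success_len = max(r["length"] for r in successes)
        avg_failure_len = sum(r["length"] for r in failures) / len(failures)
        length_ok = max_success_len > avg_failure_len
    else:
        length_ok = False

    return difficulty_ok and length_ok
```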

For TAS, ADORA retains the original advantage (weight w = 1). For TDS, it applies an attenuation factor λ_att ∈ (0, 1) (e.g., 0.5), scaling the advantage down: Ã = w·A with w = λ_att. Because the weight is applied at the sample level, the unbiased nature of the policy gradient is preserved.
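
The re-weighting step itself is a one-liner; `adora_reweight` and its list-based interface are hypothetical names for illustration:

```python
# Minimal sketch of the sample-level re-weighting Ã = w·A, where w = 1
# for TAS and w = λ_att (lambda_att) for TDS.

def adora_reweight(advantages, is_tas, lambda_att=0.5):
    """Keep TAS advantages unchanged; attenuate TDS advantages."""
    return [a if tas else lambda_att * a
            for a, tas in zip(advantages, is_tas)]
```

In a GRPO-style pipeline this would run once per batch, after the group-normalized advantages are computed and before the policy-gradient loss is formed.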

The framework further distinguishes between Visual Language Models (VLMs) and Large Language Models (LLMs). VLMs, being weaker reasoners early in training, tend to over‑fit to short, easy rollouts; ADORA therefore heavily attenuates TDS to suppress noisy signals. LLMs already possess stronger reasoning; during RL they benefit more from emphasizing TAS to break performance plateaus.
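
One plausible way to encode this modality-dependent strategy is a small weight table; the numeric values below are illustrative assumptions only, not values reported in the paper:

```python
# Hypothetical encoding of the VLM/LLM weighting strategy described above.

def adora_weights(model_family):
    """Return hypothetical (w_tas, w_tds) weights per model family."""
    if model_family == "vlm":
        # VLMs are weaker early reasoners: attenuate TDS heavily to
        # suppress noise from short, easy rollouts.
        return 1.0, 0.2
    # LLMs reason more strongly: keep TAS at full weight and attenuate
    # TDS moderately, emphasizing TAS relative to TDS.
    return 1.0, 0.5
```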

Extensive experiments cover a wide spectrum of model families (Qwen, Llama‑3, Mistral, DeepSeek, InternVL) and architectures (dense, MoE) across both mathematical and geometric reasoning benchmarks. Key results include:

  • On the Qwen‑7B base model for math tasks, ADORA improves average performance by 3.4 percentage points over vanilla GRPO.
  • For VLMs, using fewer than 2k training samples, Qwen2.5‑VL‑7B reaches 73.5% accuracy on MathVista with ADORA, a 4–5 point absolute gain over GRPO.
  • Ablation studies show that the attenuation factor λ_att is robust between 0.5–0.8, and that combining Length and Difficulty advantages yields the largest gains compared to using either alone.
  • ADORA integrates seamlessly with PPO, GRPO, and other policy‑optimization algorithms without architectural changes, confirming its plug‑and‑play nature.

The authors also compare ADORA to Generalized Advantage Estimation (GAE) and demonstrate that dynamic re‑weighting consistently outperforms static estimators, even when GAE is employed.

In summary, ADORA provides a simple yet powerful mechanism to capture the evolving utility of training samples during RL‑based reasoning model fine‑tuning. By dynamically scaling advantage signals based on live rollout statistics, it improves the signal‑to‑noise ratio of policy gradients, accelerates convergence, and stabilizes training across diverse model sizes and modalities. The work opens avenues for further research into automated threshold selection, multi‑modal sample interaction modeling, and large‑scale deployment optimizations.

