Simple Policy Gradients for Reasoning with Diffusion Language Models


Diffusion large language models (dLLMs), which offer a promising alternative to traditional autoregressive LLMs, have recently shown strong results in pretraining. However, due to their lack of tractable sequence-level likelihoods, they have yet to benefit from modern LLM post-training techniques such as reinforcement learning (RL), limiting their real-world applicability. Existing attempts at dLLM post-training rely on heuristic approximations or lower bounds of the true likelihood. In this work, we propose Amortized Group Relative Policy Optimization (AGRPO), a policy gradient algorithm that leverages the multi-step Markovian nature of dLLM generation, optimizing individual denoising steps rather than entire sequences. We demonstrate AGRPO’s effectiveness on different math and reasoning tasks, achieving +9.9% absolute gain on GSM8K, +4.6% on MATH-500, +59.4% on Countdown, and +69.7% on Sudoku over the base LLaDA model, improving upon comparable dLLM RL methods such as diffu-GRPO. Furthermore, we analyze how post-training gains persist across different inference configurations, revealing that models trained with AGRPO can sample 4x faster with minimal performance sacrifices.


💡 Research Summary

The paper tackles a central obstacle in applying reinforcement‑learning‑based fine‑tuning to diffusion language models (dLLMs). Unlike autoregressive (AR) LLMs, dLLMs generate text by iteratively unmasking tokens, which makes the exact sequence‑level likelihood intractable. Existing post‑training methods therefore replace the true likelihood with ELBO‑type lower bounds, introducing bias into policy‑gradient updates and limiting the effectiveness of RL‑VR (reinforcement learning with verifiable rewards) for reasoning tasks.

The authors propose Amortized Group Relative Policy Optimization (AGRPO), a novel policy‑gradient algorithm that aligns directly with the multi‑step Markov decision process (MDP) inherent to dLLM generation. In this MDP, each state is a partially masked sequence, each action consists of the set of tokens unmasked at a given step, and the reward is given only after the full sequence is unmasked (i.e., after solving the problem). Because the model explicitly provides the probability of each unmasking action, the exact action likelihood can be used in the gradient, eliminating the need for ELBO approximations.
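Because the denoiser factorizes over positions at a single step, the exact log-likelihood of an unmasking action is just a sum of per-position token log-probabilities. The sketch below is purely illustrative: `denoiser_probs`, the toy vocabulary, and the uniform distribution stand in for a real dLLM forward pass, and are not the paper's model.

```python
import math

# Toy stand-in for a dLLM denoiser: given the current partially masked
# sequence, return a categorical distribution over the vocabulary at every
# masked position. Here the "model" is a fixed table; in practice this is
# a transformer forward pass conditioned on `state`.
MASK = "<mask>"
VOCAB = ["2", "3", "5", "7"]

def denoiser_probs(state):
    """Return {position: {token: prob}} for each masked position."""
    probs = {}
    for i, tok in enumerate(state):
        if tok == MASK:
            # Dummy uniform distribution over the toy vocabulary.
            probs[i] = {v: 1.0 / len(VOCAB) for v in VOCAB}
    return probs

def action_log_prob(state, action):
    """Exact log-probability of the unmasking action {position: token}.

    The single-step denoiser factorizes over positions, so the action
    likelihood is a product of per-position token probabilities and its
    log is a plain sum -- no ELBO approximation is needed.
    """
    probs = denoiser_probs(state)
    return sum(math.log(probs[i][tok]) for i, tok in action.items())

state = ["7", MASK, MASK, "3"]
action = {1: "2", 2: "5"}            # unmask two tokens at this step
lp = action_log_prob(state, action)  # log(0.25) + log(0.25)
```

This per-step quantity is exactly what a policy-gradient update can differentiate, which is why the MDP view removes the need for sequence-level likelihood bounds.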

Key technical contributions include:

  1. Multi‑step MDP formulation – By treating each denoising step as a separate decision, the authors obtain exact per‑step likelihoods, enabling unbiased REINFORCE‑style or PPO‑style updates.
  2. Time‑step amortization – Computing the full PPO surrogate over all m denoising steps would require m forward passes, which is prohibitive for large transformers. AGRPO samples a single timestep t uniformly from {1,…,m} for each trajectory, computes the importance‑sampling ratio ρₜ = π_θ(oₜ|·)/π_old(oₜ|·) and the group‑normalized advantage, and uses this as an unbiased estimator of the full sum. By drawing k ≪ m timesteps per batch, memory usage is dramatically reduced while preserving unbiasedness.
  3. KL‑regularization without sequence‑level approximations – The KL term is also estimated at the sampled timestep, using an unbiased Monte‑Carlo estimator (Schulman, 2020). This keeps the policy close to a reference distribution and stabilizes training.
  4. Variance‑reduction tricks – The authors introduce two practical techniques: (a) baseline subtraction using the mean reward across a group of rollouts, and (b) stratified timestep sampling to decorrelate samples. Both reduce gradient variance and improve convergence.
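Putting contributions 1–3 together, a single AGRPO-style surrogate over a group of rollouts can be sketched as below. The function names, the dictionary-based trajectory interface, and the hyperparameter values are assumptions for illustration, not the paper's implementation; the clipping follows the standard PPO surrogate, and the KL term uses the single-sample estimator attributed to Schulman (2020), with the ratio evaluated at the sampled step.

```python
import math
import random

def agrpo_loss(trajectories, rewards, logp_new, logp_old, logp_ref,
               clip_eps=0.2, kl_coef=0.01, eps=1e-8):
    """One AGRPO-style surrogate over a group of rollouts (sketch).

    For each trajectory a single denoising timestep t is drawn uniformly
    and the exact per-step action likelihood is evaluated there; each
    sampled term is an unbiased estimate of the sum over all m steps.
    The logp_* callables map (trajectory_id, t) to the log-probability of
    the unmasking action at step t under the current, behavior, and
    reference policies; this interface is a placeholder.
    """
    # Group-normalized advantages: (r - mean) / (std + eps).
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    advs = [(r - mean) / (std + eps) for r in rewards]

    loss = 0.0
    for traj, adv in zip(trajectories, advs):
        t = random.randrange(traj["num_steps"])    # amortization: one t
        lp_new = logp_new(traj["id"], t)
        lp_old = logp_old(traj["id"], t)
        rho = math.exp(lp_new - lp_old)            # importance ratio
        clipped = max(min(rho, 1.0 + clip_eps), 1.0 - clip_eps)
        surrogate = min(rho * adv, clipped * adv)  # PPO-style clipping
        # Single-sample KL estimate ((r - 1) - log r), with
        # r = pi_ref / pi_theta evaluated at the sampled step:
        r = math.exp(logp_ref(traj["id"], t) - lp_new)
        kl = (r - 1.0) - math.log(r)
        loss += -(surrogate - kl_coef * kl)
    return loss / len(trajectories)
```

In this sketch, sampling one timestep per trajectory keeps the cost of an update at a single forward pass per rollout rather than one pass per denoising step.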

The method is evaluated on four reasoning benchmarks: GSM8K, MATH‑500, Countdown, and Sudoku. Compared with the prior dLLM RL method diffu‑GRPO and ELBO‑based baselines, AGRPO yields large absolute gains (+9.9% on GSM8K, +4.6% on MATH‑500, +59.4% on Countdown, +69.7% on Sudoku). Importantly, models fine‑tuned with AGRPO retain high accuracy even when the number of sampling steps is reduced by a factor of four, demonstrating that the training procedure induces robustness to different inference configurations.

The paper also discusses practical implementation details, such as efficient retrieval of partially masked states, memory‑efficient batching, and the use of group‑normalized advantages to avoid scaling issues on easy or hard problems.
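The group-normalized advantage mentioned above is simple to state: rewards within a rollout group are shifted by the group mean and scaled by the group standard deviation. A minimal sketch follows; the ε term and the function name are illustrative choices, not the paper's exact formulation.

```python
def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one rollout group (illustrative sketch)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# A prompt every rollout solves (or none does) gets zero advantages, so
# trivially easy or hopeless problems contribute no gradient signal:
group_advantages([1.0, 1.0, 1.0, 1.0])   # -> [0.0, 0.0, 0.0, 0.0]
# A mixed group yields signed, roughly unit-scale advantages:
group_advantages([1.0, 0.0, 1.0, 0.0])
```

This is how the normalization avoids the scaling issues the paper notes on very easy or very hard problems: uniform-reward groups are silenced, and mixed groups are rescaled to a comparable magnitude.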

In conclusion, AGRPO provides a theoretically sound, unbiased, and computationally feasible way to apply RL‑VR to diffusion language models. By reframing generation as a multi‑step MDP and amortizing the gradient computation across timesteps, the authors bridge the gap between the strong pre‑training capabilities of dLLMs and the powerful post‑training fine‑tuning techniques that have propelled AR LLMs. This work opens the door for diffusion‑based models to achieve parity, or even superiority, on complex reasoning tasks, and suggests future directions such as step‑wise rewards, multimodal diffusion models, and more sophisticated unmasking strategies.

