LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved remarkable success in improving autoregressive models, especially in domains that demand correctness, such as mathematical reasoning and code generation. However, directly applying this paradigm to Diffusion Large Language Models (dLLMs) is fundamentally hindered by the intractability of exact likelihood computation, which forces existing methods to rely on high-variance approximations. To bridge this gap, we propose Likelihood-Free Policy Optimization (LFPO), a native framework that maps the concept of vector-field flow matching to the discrete token space. Specifically, LFPO formulates alignment as geometric velocity rectification, which directly optimizes denoising logits via contrastive updates. This design bypasses the errors inherent in likelihood approximation, yielding precise gradient estimates. Furthermore, LFPO enforces consistency by predicting final solutions from intermediate steps, effectively straightening the probability flow to enable high-quality generation with significantly fewer iterations. Extensive experiments demonstrate that LFPO not only outperforms state-of-the-art baselines on code and reasoning benchmarks but also accelerates inference by approximately 20% through reduced diffusion steps.
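The abstract's "contrastive updates on denoising logits" can be pictured with a toy sketch. The function name, the simple ±lr rule, and the shapes below are illustrative assumptions for intuition, not the paper's exact objective: the idea shown is only that reward-winning tokens have their denoising logits raised and reward-losing tokens lowered, without ever computing a sequence likelihood.

```python
import numpy as np

def lfpo_contrastive_update(logits, pos_tokens, neg_tokens, lr=0.1):
    """Hypothetical sketch of a likelihood-free contrastive logit update.

    logits:     (seq_len, vocab) denoising logits at some diffusion step
    pos_tokens: token ids from a completion that earned a high reward
    neg_tokens: token ids from a completion that earned a low reward

    Raises logits of preferred tokens and lowers those of dispreferred
    tokens position by position; no likelihood is evaluated anywhere.
    """
    updated = logits.copy()
    for i, (p, n) in enumerate(zip(pos_tokens, neg_tokens)):
        updated[i, p] += lr  # push toward the reward-winning token
        updated[i, n] -= lr  # push away from the reward-losing token
    return updated
```

In a real training loop this update would be expressed as a differentiable loss over the model's logits rather than an in-place edit, but the sign structure (up for winners, down for losers) is the contrastive core.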


💡 Research Summary

The paper tackles a fundamental obstacle in aligning diffusion‑based large language models (dLLMs): the intractability of exact likelihood computation, which forces existing reinforcement‑learning‑with‑verifiable‑rewards (RLVR) methods to rely on high‑variance approximations via ODE/SDE discretization. To overcome this, the authors introduce Likelihood‑Free Policy Optimization (LFPO), a framework that maps the continuous flow‑matching (FM) paradigm onto the discrete token space of masked diffusion models.
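To make the "continuous flow matching on the discrete token space" idea concrete, here is a minimal sketch under assumed conventions: the linear interpolation path and its constant velocity are the standard flow-matching choices, not necessarily the paper's exact construction. Each token is a one-hot vertex of the probability simplex, and denoising moves a point from the uniform (fully masked) distribution toward that vertex.

```python
import numpy as np

V = 8  # toy vocabulary size

def one_hot(token, vocab=V):
    """Vertex of the probability simplex for a given token id."""
    e = np.zeros(vocab)
    e[token] = 1.0
    return e

def interpolant(token, t, vocab=V):
    """Linear path on the simplex from the uniform distribution
    (t=0, fully masked) to the token's vertex (t=1, fully denoised)."""
    uniform = np.full(vocab, 1.0 / vocab)
    return (1.0 - t) * uniform + t * one_hot(token, vocab)

def target_velocity(token, vocab=V):
    """Constant velocity of the linear path: d/dt of the interpolant.
    Its entries sum to zero, so the flow stays on the simplex."""
    return one_hot(token, vocab) - np.full(vocab, 1.0 / vocab)
```

A flow-matching objective would regress the model's predicted velocity onto `target_velocity`; a perfectly straight (rectified) flow of this kind can be integrated in a single step, which is the intuition behind generating with fewer iterations.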

In LFPO, each vocabulary token is viewed as a vertex on a probability simplex, while the

