Boosting Maximum Entropy Reinforcement Learning via One-Step Flow Matching
Diffusion policies are expressive yet incur high inference latency. Flow Matching (FM) enables one-step generation, but integrating it into Maximum Entropy Reinforcement Learning (MaxEnt RL) is challenging: the optimal policy is an intractable energy-based distribution, and the efficient log-likelihood estimation needed to balance exploration and exploitation suffers from severe discretization bias. We propose Flow-based Log-likelihood-Aware Maximum Entropy RL (FLAME), a principled framework that addresses these challenges. First, we derive a Q-Reweighted FM objective that bypasses partition-function estimation via importance reweighting. Second, we design a decoupled entropy estimator that rigorously corrects this bias, enabling efficient exploration and bringing the policy closer to the optimal MaxEnt policy. Third, we integrate the MeanFlow formulation to achieve expressive and efficient one-step control. Empirical results on MuJoCo show that FLAME outperforms Gaussian baselines and matches multi-step diffusion policies at significantly lower inference cost. Code is available at https://github.com/lzqw/FLAME.
💡 Research Summary
The paper introduces FLAME, a novel framework that unifies flow‑matching (FM) generative policies with maximum‑entropy reinforcement learning (MaxEnt RL) in a single‑step, low‑latency setting. Traditional diffusion‑based policies achieve expressive, multimodal action distributions but require dozens of neural‑network evaluations per action (high NFE), making them unsuitable for high‑frequency control. FM can generate actions in one ODE integration step, yet its standard formulation assumes access to target samples, which MaxEnt RL does not provide because the optimal policy is an energy‑based model (EBM) with an intractable partition function. Moreover, MaxEnt RL needs accurate entropy terms (log‑likelihoods) for policy evaluation; naïve one‑step likelihood approximations introduce severe discretization bias that destabilizes value learning.
FLAME tackles these issues through three core contributions. First, it derives a Q‑Reweighted Flow Matching (QRFM) objective that eliminates the partition function by importance reweighting. By introducing a strictly positive weighting function g(a, s) and exploiting the invariance of the FM loss under such reweighting, the authors construct a g that cancels both the unknown normalizer Z(s) and the marginal density p_t(a|s). Using a reverse‑sampling trick, they draw an intermediate action a_t from a tractable proposal h_t(·|s) and then sample a terminal action a_1 from the analytically known reverse conditional distribution ϕ_{1|t}(a_1|a_t). The resulting loss is proportional to exp(Q(s,a_1)/α) · ‖u_θ(a_t,t,s) − (a_1 − a_0)‖², allowing the policy to be trained directly from soft Q‑values without ever sampling from the intractable MaxEnt target.
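The reweighted objective can be sketched as follows. This is a toy numpy illustration, not the paper's implementation: `u_theta`, `q_fn`, and the standard-normal proposal for a_1 are stand-ins (the paper draws a_1 via the reverse conditional ϕ_{1|t}); the self-normalization of the weights is an added stabilization assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def qrfm_loss(u_theta, q_fn, s, alpha=0.2, batch=256, dim=2):
    """Toy Q-Reweighted Flow Matching loss (illustrative sketch).

    Draws base noise a0 ~ N(0, I), a terminal action a1 from a tractable
    proposal (here a standard-normal stand-in), and t ~ U(0, 1); forms
    a_t on the linear path and weights the FM regression error by
    exp(Q(s, a1)/alpha), so the intractable MaxEnt target distribution
    is never sampled directly.
    """
    a0 = rng.standard_normal((batch, dim))
    a1 = rng.standard_normal((batch, dim))   # stand-in for reverse sampling of a_1
    t = rng.uniform(size=(batch, 1))
    a_t = (1.0 - t) * a0 + t * a1            # linear (OT) interpolant
    target = a1 - a0                         # conditional velocity of the linear path
    w = np.exp(q_fn(s, a1) / alpha)          # soft-Q importance weights
    w = w / w.mean()                         # self-normalize for numerical stability
    err = np.sum((u_theta(a_t, t, s) - target) ** 2, axis=1)
    return float(np.mean(w * err))
```

With a constant Q the weights are uniform and the loss reduces to the plain conditional FM regression; larger Q(s, a_1) up-weights the corresponding velocity targets.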
Second, the paper addresses the bias in log‑likelihood estimation for continuous flows. It proposes two complementary estimators. FLAME‑R employs an augmented‑ODE formulation that integrates the negative divergence of the learned velocity field exactly, yielding an unbiased log‑density estimate suitable for policy evaluation. FLAME‑M adopts a decoupled multi‑step estimator: during training it computes accurate entropy using multi‑step integration, while at deployment it retains the one‑step inference advantage of FM. This separation preserves the exploration benefits of the entropy bonus without incurring the computational overhead of multi‑step integration at test time.
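The augmented-ODE idea behind FLAME‑R can be illustrated with the instantaneous change-of-variables formula. This is a minimal Euler-integration sketch with hypothetical helper names (`velocity`, `divergence`, `base_logp`), not the paper's estimator; in practice the divergence of a neural velocity field would be computed with automatic differentiation or a Hutchinson trace estimator.

```python
import numpy as np

def logp_via_augmented_ode(a1, velocity, divergence, base_logp, steps=1000):
    """Sketch of an augmented-ODE log-likelihood estimate.

    Integrates the flow backward from the terminal action a1 to a base
    sample a0 with Euler steps, accumulating the divergence of the
    velocity field along the trajectory:
        log p_1(a1) = log p_0(a0) - \int_0^1 div u(a_t, t) dt.
    """
    a = np.array(a1, dtype=float)
    div_integral = 0.0
    dt = 1.0 / steps
    for i in range(steps, 0, -1):
        t = i * dt
        a = a - dt * velocity(a, t)            # backward Euler step toward a0
        div_integral += dt * divergence(a, t)  # accumulate the divergence integral
    return base_logp(a) - div_integral
```

A quick sanity check: for the linear field u(a, t) = a with a standard-normal base, the time-1 marginal is N(0, e²I), and the integrated estimate matches that density closely.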
Third, FLAME incorporates the MeanFlow framework, which learns an average velocity field along flow trajectories to reduce curvature of the marginal paths. By training on the average velocity rather than the instantaneous field, MeanFlow enables high‑fidelity action generation with a single ODE step (NFE = 1), effectively bridging the gap between expressive generative models and real‑time control constraints.
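One-step sampling with an average-velocity network can be sketched as below. The name `mean_velocity` and its signature are illustrative assumptions: it stands for a trained MeanFlow-style network that approximates the average velocity over an interval [r, t], so that a single evaluation (NFE = 1) maps base noise to an action.

```python
import numpy as np

def one_step_action(mean_velocity, s, dim=2, seed=0):
    """Illustrative one-step sampler with an average-velocity field.

    mean_velocity(a, r, t, s) is assumed to approximate the average
    velocity of the flow over [r, t]; with r = 0 and t = 1, the update
    a1 = a0 + (t - r) * mean_velocity(...) needs exactly one network
    evaluation, in contrast to many-step ODE integration of the
    instantaneous field.
    """
    rng = np.random.default_rng(seed)
    a0 = rng.standard_normal(dim)                     # a0 ~ N(0, I)
    return a0 + 1.0 * mean_velocity(a0, 0.0, 1.0, s)  # (t - r) = 1
```

For the linear path, the exact average velocity from a0 to a1 is (a1 − a0)/(t − r), so a perfectly learned field reproduces a1 in one step.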
The methodology is detailed as follows: (1) Soft Q‑functions are updated via the standard soft Bellman backup, including the entropy term α log π_θ(a|s). (2) The QRFM (or its MeanFlow variant QRMF) loss is minimized using samples (t, a_t, a_1) generated by the reverse‑sampling procedure, with importance weights exp(Q(s,a_1)/α). (3) Policy evaluation uses either the exact augmented‑ODE log‑likelihood (FLAME‑R) or the multi‑step estimator (FLAME‑M). The overall algorithm alternates between Q‑updates and policy updates, preserving the MaxEnt policy‑iteration loop while leveraging expressive flow‑based policies.
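Step (1) of the loop above, the soft Bellman backup, can be written in a few lines; this is a plain-Python sketch of the standard soft target, with the entropy term supplied by whichever log-likelihood estimator (FLAME‑R or FLAME‑M) is in use.

```python
def soft_bellman_target(r, q_next, logp_next, alpha=0.2, gamma=0.99, done=0.0):
    """Soft Bellman backup target (sketch of step (1)):

        y = r + gamma * (1 - done) * (Q(s', a') - alpha * log pi(a'|s'))

    The -alpha * log pi term is the entropy bonus; logp_next would come
    from the augmented-ODE or multi-step estimator in FLAME.
    """
    return r + gamma * (1.0 - done) * (q_next - alpha * logp_next)
```

The Q-function is then regressed toward y, while the policy update minimizes the QRFM (or QRMF) loss on reweighted samples.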
Empirical evaluation on a suite of MuJoCo continuous‑control benchmarks (e.g., Hopper, Walker2d, HalfCheetah) demonstrates that both FLAME‑R and FLAME‑M outperform the Gaussian‑policy baseline Soft Actor‑Critic (SAC) in terms of average return, and achieve performance comparable to multi‑step diffusion policies while using a single ODE integration step. Notably, in environments that require multimodal action distributions, FLAME maintains diverse exploration without mode collapse, and the entropy estimates closely match the target coefficient α, confirming the effectiveness of the bias‑corrected estimators.
In summary, FLAME provides a principled solution to integrate flow‑matching generative models into MaxEnt RL: it removes the intractable partition function via Q‑reweighting, supplies unbiased or bias‑corrected entropy estimates, and leverages MeanFlow to achieve one‑step, low‑latency control. Limitations include reliance on a Gaussian base distribution and linear optimal‑transport coupling; extending the framework to non‑linear couplings or high‑dimensional action spaces remains an open research direction. The codebase is released publicly, facilitating further exploration of flow‑based RL in real‑time robotic applications.