Flow Policy Gradients for Robot Control
Likelihood-based policy gradient methods are the dominant approach for training robot control policies from rewards. These methods rely on differentiable action likelihoods, which constrain policy outputs to simple distributions like Gaussians. In this work, we show how flow matching policy gradients – a recent framework that bypasses likelihood computation – can be made effective for training and fine-tuning more expressive policies in challenging robot control settings. We introduce an improved objective that enables success in legged locomotion, humanoid motion tracking, and manipulation tasks, as well as robust sim-to-real transfer on two humanoid robots. We then present ablations and analysis on training dynamics. Results show how policies can exploit the flow representation for exploration when training from scratch, as well as improved fine-tuning robustness over baselines.
💡 Research Summary
The paper tackles a fundamental limitation of current reinforcement‑learning (RL) methods for robot control: the reliance on simple, analytically tractable action distributions such as diagonal Gaussians. While normalizing‑flow (NF) policies can represent far richer, multimodal distributions, computing their likelihoods requires costly volume‑change calculations that become prohibitive in high‑dimensional, continuous‑control settings. To bypass likelihood computation, the authors build on the recently introduced Flow Matching Policy Optimization (FPO) framework, which replaces the likelihood ratio in PPO with an exponential of the difference between conditional flow‑matching (CFM) losses evaluated under the old and new policies.
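As a rough numerical illustration of this ratio (not the authors' code: the linear interpolation path and all function names here are assumptions), the per-sample quantities can be sketched as:

```python
import numpy as np

def cfm_loss(v_pred, action, eps):
    """Per-sample conditional flow matching loss, assuming a linear path
    x_tau = (1 - tau) * eps + tau * action whose target velocity is
    (action - eps). `v_pred` is the policy's predicted velocity at x_tau."""
    target = action - eps
    return np.sum((v_pred - target) ** 2)

def fpo_ratio(loss_new, loss_old):
    """FPO replaces PPO's likelihood ratio with the exponential of the
    CFM-loss difference between the old and new policies: a lower loss
    under the new policy yields a ratio above 1."""
    return np.exp(loss_old - loss_new)
```

If the new policy predicts the target velocity exactly, its CFM loss is zero and the ratio exceeds 1 whenever the old policy's loss was positive.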
Although FPO works on synthetic benchmarks, the authors find it unstable when applied to realistic robotics tasks involving high‑dimensional action spaces, joint limits, torque constraints, and sparse reward signals. They therefore propose an enhanced algorithm, FPO++, that introduces two key modifications: (1) per‑sample ratio clipping and (2) an asymmetric trust‑region objective (ASPO).
In standard FPO, multiple Monte‑Carlo (τ, ε) pairs are drawn for each action, the CFM loss differences are averaged, and a single ratio is clipped. This means that either all samples for a given action are clipped or none, which can be overly coarse when multiple gradient steps are taken. FPO++ computes a separate ratio for each (τ, ε) pair, applies clipping independently, and thus provides a finer‑grained trust region that better respects the stochastic nature of the flow field.
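The difference can be pictured with a minimal sketch (scalar advantage, illustrative function names; the paper's exact averaging and clipping details may differ):

```python
import numpy as np

def ppo_clip_term(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for a ratio (works elementwise)."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantage)

def fpo_objective(loss_diffs, advantage, clip_eps=0.2):
    """Vanilla FPO: average the per-sample CFM-loss differences
    (old minus new) over all (tau, eps) draws first, then form and clip
    a single ratio per action -- all samples clipped or none."""
    ratio = np.exp(np.mean(loss_diffs))
    return ppo_clip_term(ratio, advantage, clip_eps)

def fpo_pp_objective(loss_diffs, advantage, clip_eps=0.2):
    """FPO++: one ratio per (tau, eps) sample, each clipped independently,
    then averaged -- a finer-grained trust region."""
    ratios = np.exp(loss_diffs)
    return np.mean(ppo_clip_term(ratios, advantage, clip_eps))
```

With loss differences of opposite signs that cancel on average, the vanilla objective sees a ratio of exactly 1 and clips nothing, while the per-sample version clips each deviating sample individually.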
For the trust region, the authors observe that PPO’s symmetric clipping treats positive‑advantage and negative‑advantage samples identically, zeroing gradients once the ratio exceeds the clipping bound. This can cause aggressive reductions in action likelihood for negative‑advantage samples, harming exploration and entropy. ASPO replaces the PPO clipping for negative‑advantage samples with the Simple Policy Optimization (SPO) objective, which penalizes deviations from the trust region but still yields a gradient that pulls the ratio back toward 1. Positive‑advantage samples continue to use standard PPO clipping. This asymmetry discourages large increases in the CFM loss (i.e., large KL divergence) for actions that performed poorly, while still encouraging loss reduction for good actions.
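One way to picture the asymmetry (a sketch only: the exact SPO penalty is not reproduced here, and a quadratic pull-back term stands in for it):

```python
def aspo_objective(ratio, advantage, clip_eps=0.2):
    """Illustrative asymmetric surrogate. Positive-advantage samples use
    PPO's clipped term, whose gradient vanishes outside the trust region.
    Negative-advantage samples instead get a penalty that keeps pulling
    the ratio back toward 1 (a quadratic stand-in for the SPO-style
    objective described in the text, not the paper's exact formula)."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    if advantage >= 0:
        # PPO clipping: no incentive to push the ratio beyond 1 + clip_eps.
        return min(ratio * advantage, clipped * advantage)
    # Negative advantage: surrogate plus a penalty that grows outside the
    # trust region, so large likelihood reductions are discouraged.
    excess = max(abs(ratio - 1.0) - clip_eps, 0.0)
    return ratio * advantage - abs(advantage) * excess ** 2 / clip_eps
```

Under this stand-in, driving the ratio far below the trust region for a bad action eventually lowers the objective, so maximizing it pulls the ratio back toward 1 rather than collapsing the action's likelihood.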
A further practical contribution is “zero‑sampling”. During training, policies explore by sampling initial noise ε ∼ N(0, I) and integrating the learned flow field. At test time and for evaluation, the authors set ε = 0, effectively turning the stochastic flow policy into a deterministic one. Empirically, zero‑sampling improves success rates on manipulation benchmarks and yields more stable gait execution on real robots.
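The sampling procedure can be sketched with simple Euler integration (an assumption for illustration; the paper's actual integrator, step count, and network interface may differ):

```python
import numpy as np

def sample_action(velocity_field, obs, eps, num_steps=10):
    """Integrate a learned flow field from initial noise `eps` to an
    action using Euler steps over tau in [0, 1]. `velocity_field` is a
    placeholder for the policy network."""
    x = np.array(eps, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        tau = i * dt
        x = x + dt * velocity_field(obs, x, tau)
    return x

# Training-time exploration: start from Gaussian noise.
#   action = sample_action(policy, obs, np.random.randn(action_dim))
# Zero-sampling at deployment: start from eps = 0, so the same observation
# always yields the same action.
#   action = sample_action(policy, obs, np.zeros(action_dim))
```

Because the integration itself is deterministic, fixing ε = 0 removes the only source of randomness, which matches the stable, repeatable behavior reported at evaluation time.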
The experimental suite covers three domains: (i) locomotion on four robots (two quadrupeds: Unitree Go2 and Boston Dynamics Spot; two humanoids: Unitree H1 and G1) using the IsaacLab velocity‑conditioned environments, (ii) sim‑to‑real transfer on two humanoid platforms (Booster T1 for velocity‑conditioned walking and Unitree G1 for whole‑body motion tracking using the LAFAN dataset), and (iii) fine‑tuning of image‑based flow policies for manipulation tasks (box cleanup, tray lift, etc.).
Key findings include:
- Stability: Across all four locomotion tasks, FPO++ consistently learns high‑return policies, whereas vanilla FPO frequently collapses or gets stuck in local minima despite extensive hyper‑parameter sweeps.
- Sim‑to‑Real Success: Deployed policies on real hardware exhibit stable gaits, accurate motion tracking over long sequences, and robustness to external pushes, confirming that the flow representation transfers well when combined with domain randomization.
- Fine‑Tuning Performance: Starting from a pretrained image‑flow policy, FPO++ achieves >80 % success on manipulation benchmarks, outperforming two DPPO variants (fixed‑noise and learned‑noise) by a large margin. Zero‑sampling further boosts success rates compared to random‑sampling evaluation.
- Ablations: Removing per‑sample clipping or the asymmetric SPO component degrades performance, confirming that both design choices contribute to the observed robustness.
Overall, the paper demonstrates that flow‑based policies, when trained with the proposed FPO++ algorithm, can handle the complexities of real‑world robot control (high‑dimensional continuous actions, strict physical constraints, and the need for reliable sim‑to‑real transfer) while preserving the expressive power of normalizing flows. The work opens the door to richer stochastic policies for exploration, more effective fine‑tuning of pretrained models, and broader adoption of flow architectures in RL for robotics. Future directions include scaling to higher‑DoF systems, integrating multimodal sensor inputs, and combining offline data with online RL in a hybrid learning pipeline.