Know Your Step: Faster and Better Alignment for Flow Matching Models via Step-aware Advantages
Recent advances in flow matching models, particularly with reinforcement learning (RL), have significantly enhanced human preference alignment in few-step text-to-image generators. However, existing RL-based approaches for flow matching models typically rely on numerous denoising steps while suffering from sparse and imprecise reward signals that often lead to suboptimal alignment. To address these limitations, we propose Temperature-Annealed Few-step Sampling with Group Relative Policy Optimization (TAFS-GRPO), a novel framework for training flow matching text-to-image models into efficient few-step generators that are well aligned with human preferences. Our method iteratively injects adaptive temporal noise into the results of one-step samples. By repeatedly annealing the model's sampled outputs, it introduces stochasticity into the sampling process while preserving the semantic integrity of each generated image. Moreover, its step-aware advantage integration mechanism combines with GRPO to avoid requiring a differentiable reward function and to provide dense, step-specific rewards for stable policy optimization. Extensive experiments demonstrate that TAFS-GRPO achieves strong performance in few-step text-to-image generation and significantly improves the alignment of generated images with human preferences. The code and models of this work will be made available to facilitate further research.
💡 Research Summary
The paper introduces TAFS‑GRPO (Temperature‑Annealed Few‑step Sampling with Group Relative Policy Optimization), a novel framework for aligning flow‑matching text‑to‑image models with human preferences while drastically reducing the number of denoising steps required at inference time. Flow‑matching models, which solve an ordinary differential equation (ODE) to generate images, typically need 20‑40 steps to produce high‑quality results. This creates two major problems for reinforcement‑learning (RL) based alignment: (1) the sequential nature of the generation makes training and online RL extremely costly, and (2) reward signals are sparse because meaningful feedback can only be computed after the full trajectory is completed.
TAFS‑GRPO tackles both issues with two complementary components. First, "temperature‑annealed few‑step sampling" replaces the conventional multi‑step denoising pipeline with a series of one‑step sampling operations interleaved with adaptive Gaussian noise injection. Starting from the initial noise x_T, a one‑step flow model prediction yields a preliminary image x_1^0. The total diffusion interval T is divided into N equal sub‑intervals of length τ = T/N. At each sub‑step a noise term ε_{T−kτ} is added, and the flow model is queried again to produce the next intermediate image x_{k+1}^0 = x_k^0 + ε_{T−kτ} + (T−kτ)·v_θ(x_k^0 + ε_{T−kτ}, T−kτ). This process introduces the stochasticity required for policy‑gradient methods while preserving semantic content at every intermediate step, thereby turning a previously "terminal‑only" reward into a dense, step‑wise signal.
Second, the "step‑aware advantage integration" module couples this dense sampling with Group Relative Policy Optimization (GRPO). In standard GRPO, a group of G samples generated by the old policy π_{θ_old} is evaluated with a single scalar reward, and the same reward is assigned to all timesteps, leading to uniform credit assignment. TAFS‑GRPO instead evaluates each intermediate image with an arbitrary (possibly non‑differentiable) reward function R (e.g., human preference scores, LLM‑based aesthetic estimators). The per‑step rewards r_k = R(x_k^0) are normalized within the group to produce advantages Â_k = (r_k − mean(r)) / std(r). These advantages then enter the standard clipped GRPO surrogate, J_GRPO = E[min(ρ_k Â_k, clip(ρ_k, 1−ε, 1+ε) Â_k)], where ρ_k = π_θ / π_{θ_old} is the per‑step policy ratio.
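The two pieces of the module — group-relative normalization and the clipped surrogate — can be sketched as follows. This is an illustrative numpy sketch of the standard GRPO/PPO-style computation, not the paper's code; the epsilon added to the standard deviation and the clip range of 0.2 are assumed conventional defaults.

```python
import numpy as np

def step_aware_advantages(rewards):
    """Group-relative normalization: A_hat_k = (r_k - mean(r)) / std(r)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)   # epsilon avoids division by zero

def grpo_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped GRPO surrogate: E[min(rho*A, clip(rho, 1-eps, 1+eps)*A)]."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))  # rho_k
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()
```

Note that because the reward function is only ever evaluated on decoded images and compared within the group, nothing here requires R to be differentiable, which is the point of combining the dense sampling with GRPO.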