Trajectory Consistency for One-Step Generation on Euler Mean Flows
We propose \emph{Euler Mean Flows (EMF)}, a flow-based generative framework for one-step and few-step generation that enforces long-range trajectory consistency with minimal sampling cost. The key idea of EMF is to replace the trajectory consistency constraint, which is difficult to supervise and optimize over long time scales, with a principled linear surrogate that enables direct data supervision for long-horizon flow-map compositions. We derive this approximation from the semigroup formulation of flow-based models and show that, under mild regularity assumptions, it faithfully approximates the original consistency objective while being substantially easier to optimize. This formulation leads to a unified, JVP-free training framework that supports both $u$-prediction and $x_1$-prediction variants, avoiding explicit Jacobian computations and significantly reducing memory and computational overhead. Experiments on image synthesis, particle-based geometry generation, and functional generation demonstrate improved optimization stability and sample quality under fixed sampling budgets, together with approximately $50\%$ reductions in training time and memory consumption compared to existing one-step methods for image generation.
💡 Research Summary
This paper introduces Euler Mean Flows (EMF), a novel framework for one‑step and few‑step generative modeling that enforces long‑range trajectory consistency without the heavy computational burden of Jacobian‑vector products (JVPs). The authors begin by highlighting the central difficulty in one‑step generation: ensuring that flow maps across different time intervals satisfy the semigroup property ϕₜ→ᵣ = ϕₛ→ᵣ ∘ ϕₜ→ₛ. Existing approaches either compose short‑range transitions, which leads to error accumulation, or derive consistency losses from continuous equations (e.g., MeanFlow) but require explicit JVP computations that are memory‑intensive and unstable under mixed‑precision training.
To overcome these limitations, EMF linearizes the semigroup constraint locally. Defining the mean velocity field uₜ→ᵣ through ϕₜ→ᵣ(x) = x + (r − t) uₜ→ᵣ(x) and selecting a small step Δt, the authors apply a first‑order Taylor expansion to approximate the short‑range flow ϕₜ→ₜ₊Δₜ. This yields a relation that expresses the long‑range mean velocity uₜ→ᵣ in terms of the instantaneous velocity uₜ→ₜ and the mean velocity at the shifted time t + Δt. Crucially, the instantaneous velocity can be supervised directly from data using the conditional velocity uₜ(x | x₁) = (x₁ − x)/(1 − t), which is readily computable from paired samples (x₀, x₁).
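The linearization step can be sketched as follows. This is a reconstruction from the description above, not the paper's exact derivation; the paper's constants and notation may differ:

```latex
\[
\phi_{t\to r} = \phi_{s\to r}\circ\phi_{t\to s}, \qquad s = t+\Delta t .
\]
A first-order (Euler) expansion of the short-range flow,
\[
\phi_{t\to s}(x) \;\approx\; x + \Delta t\, v_t(x),
\]
combined with the mean-velocity parameterization
$\phi_{t\to r}(x) = x + (r-t)\,u_{t\to r}(x)$, gives
\[
(r-t)\,u_{t\to r}(x) \;\approx\; \Delta t\, v_t(x)
  \;+\; (r-t-\Delta t)\,u_{s\to r}\!\bigl(x+\Delta t\, v_t(x)\bigr),
\]
\]
which expresses the long-range mean velocity through the directly
supervisable instantaneous velocity $v_t$ and a shifted mean velocity,
and makes the positive clamp $(r-t-\Delta t)_+$ appear naturally as the
weight on the shifted term.
\]
```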
The resulting loss, denoted L_E(θ), combines three components: (1) a data‑driven term matching the model’s long‑range velocity u_θ^{t→r}(x) to the conditional instantaneous velocity, (2) a correction term that accounts for the linearized evolution over Δt, and (3) a stop‑gradient operator that blocks gradients through the correction term, thereby eliminating the need for JVPs. When the horizon r equals the current time t, the loss reduces exactly to the standard Flow Matching objective, showing that EMF naturally unifies u‑prediction (predicting the instantaneous velocity) and x₁‑prediction (directly supervising the terminal state).
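The structure of the regression target can be illustrated with a minimal numpy sketch. This is an assumption-laden reconstruction from the summary above, not the paper's implementation: the function name `emf_target`, the argument layout of `u_model`, and the exact weighting are hypothetical, and the stop-gradient is modeled by computing the target outside any autodiff graph.

```python
import numpy as np

def emf_target(u_model, x, t, r, v_cond, dt=0.05):
    """Hypothetical JVP-free EMF regression target (sketch; assumes r > t).

    u_model(x, s, r): frozen copy of the network predicting the mean
        velocity u_{s->r}(x); no gradients flow through it (stop-gradient).
    v_cond: conditional instantaneous velocity u_t(x | x1), supervised
        directly from paired data samples.
    """
    w = max(r - t - dt, 0.0)       # positive clamp (r - t - dt)_+
    x_next = x + dt * v_cond       # one Euler step along the conditional path
    # Linearized semigroup relation, solved for the long-range mean velocity:
    # (r - t) u_{t->r} ~ dt * v_t + (r - t - dt)_+ * u_{t+dt -> r}(x_next)
    return (dt * v_cond + w * u_model(x_next, t + dt, r)) / (r - t)
```

Note that when r = t + Δt the clamp vanishes and the target reduces exactly to `v_cond`, recovering the Flow Matching limit described above.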
Theoretical contributions include: (i) Theorem 4.1, which formalizes the non‑existence of conditional flow maps, justifying the need for the proposed surrogate; (ii) Theorem 4.2, a standard local linear approximation result for smooth maps; and (iii) Theorem 4.3, which proves that under mild smoothness assumptions the EMF loss approximates the original semigroup consistency objective up to o(Δt) error.
Empirically, EMF is evaluated on three diverse domains: (a) image synthesis (FFHQ, CIFAR‑10), (b) particle‑based 3D geometry generation, and (c) functional generation (neural signed distance functions). Across all tasks, EMF achieves comparable or better sample quality than state‑of‑the‑art one‑step methods while reducing training memory consumption by roughly 45–55% and cutting training time by about 50%. The x₁‑prediction variant shows faster convergence because it directly supervises the final data point, whereas the u‑prediction variant retains flexibility for arbitrary step counts. Ablation studies confirm that a modest Δt (≈0.05–0.1) balances approximation accuracy and numerical stability, and that the positive clamp (r − t − Δt)₊ is essential for preventing degenerate gradients.
In summary, EMF offers a principled, JVP‑free approach to enforce trajectory consistency in one‑step and few‑step generative models. By leveraging a linearized semigroup formulation and conditional instantaneous velocities, it provides direct long‑range supervision without sacrificing computational efficiency. The paper opens avenues for extending the method to non‑Euclidean data (e.g., point clouds, graphs), adaptive step‑size schedules, and higher‑order approximations, potentially broadening the impact of efficient, consistent generative modeling.