RMFlow: Refined Mean Flow by a Noise-Injection Step for Multimodal Generation

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

Mean flow (MeanFlow) enables efficient, high-fidelity image generation, yet its single-function evaluation (1-NFE) generation often cannot yield compelling results. We address this issue by introducing RMFlow, an efficient multimodal generative model that integrates a coarse 1-NFE MeanFlow transport with a subsequent tailored noise-injection refinement step. RMFlow approximates the average velocity of the flow path using a neural network trained with a new loss function that balances minimizing the Wasserstein distance between probability paths and maximizing sample likelihood. RMFlow achieves near state-of-the-art results on text-to-image, context-to-molecule, and time-series generation using only 1-NFE, at a computational cost comparable to the baseline MeanFlows.

💡 Research Summary

RMFlow (Refined Mean Flow by a Noise‑Injection Step for Multimodal Generation) tackles the well‑known degradation of quality that occurs when MeanFlow—a recent 1‑NFE (single function evaluation) generative approach—is forced to operate with only a single evaluation step. Classical MeanFlow transports a prior sample directly to the target distribution by learning an average velocity field ˆu₀,₁(x₀;θ) and applying it once (x₁ = x₀ + ˆu₀,₁). While this yields impressive results with multiple NFEs, collapsing to 1‑NFE typically produces blurry images, invalid molecular graphs, or poor density estimates.

RMFlow augments this coarse transport with a carefully designed noise‑injection refinement. After the 1‑NFE MeanFlow step, an additional Gaussian perturbation is added:
x̂_tgt = x₀ + ˆu₀,₁(x₀;θ) + √(σ_min² – σ²)·ε₂, ε₂ ∼ N(0, I).
This step mirrors the decoder of a Variational Auto‑Encoder and enables a principled likelihood maximization. The conditional distribution of the final sample given the prior becomes a Gaussian with mean x₀ + ˆu₀,₁ and variance (σ_min² – σ²)I, allowing the authors to introduce a negative log‑likelihood loss L_NLL that directly penalizes the squared distance between the noisy target (x_data + σ_min ε) and the generated mean.

The training objective combines three components:

L_CMFM – the original Conditional Mean Flow Matching loss, which bounds the Wasserstein‑2 distance between the learned and true probability flows.
λ₁ L_NLL – the likelihood term that drives KL divergence reduction, formally proven in Theorem 4.1 to provide a lower bound on the expected log‑likelihood of the target distribution.
**λ₂ E

RMFlow: Refined Mean Flow by a Noise-Injection Step for Multimodal Generation

💡 Research Summary

Comments & Academic Discussion

Leave a Comment