On the Design of One-step Diffusion via Shortcutting Flow Paths
Recent advances in few-step diffusion models have demonstrated their efficiency and effectiveness by shortcutting the probabilistic paths of diffusion models, especially in training one-step diffusion models from scratch (\emph{a.k.a.} shortcut models). However, their theoretical derivation and practical implementation are often closely coupled, which obscures the design space. To address this, we propose a common design framework for representative shortcut models. This framework provides theoretical justification for their validity and disentangles concrete component-level choices, thereby enabling systematic identification of improvements. With our proposed improvements, the resulting one-step model achieves a new state-of-the-art FID50k of 2.85 on ImageNet-256x256 under the classifier-free guidance setting with one step generation, and further reaches FID50k of 2.53 with 2x training steps. Remarkably, the model requires no pre-training, distillation, or curriculum learning. We believe our work lowers the barrier to component-level innovation in shortcut models and facilitates principled exploration of their design space.
💡 Research Summary
This paper tackles the problem of accelerating diffusion models by “shortcutting” the probabilistic flow that normally requires dozens or hundreds of neural function evaluations (NFEs). The authors observe that recent one‑step diffusion models trained from scratch—such as Consistency Training (CT), Inductive Moment Matching (IMM), Shortcut Diffusion (SCD), sCT, and MeanFlow—share a common underlying principle: they approximate a two‑step flow map \(X_{s,r}\circ X_{t,s}\) with a single learned flow map \(X_{\theta}^{t,r}\). However, prior work keeps theory and implementation tightly coupled, making it difficult to see the design space and to experiment with component‑level changes.
Core Contributions
- Unified Design Framework – The paper formalizes the flow map \(X_{t,r}\) as the solution of the probability‑flow ODE \(\dot{x}_t = v_t(x_t)\). It then defines a generic training objective: minimize the expected distance between the one‑step prediction \(X_{\theta}^{t,r}(x_t)\) and the stop‑gradient two‑step target \(\operatorname{sg}(X_{s,r}\circ X_{t,s}(x_t))\). The distance can be LPIPS, squared‑L2, or a grouped kernel (MMD). This formulation subsumes all existing shortcut models.
- Decomposition of Design Choices – The authors split a shortcut model into four orthogonal modules:
- Time Sampler (how \(r \le s \le t\) are drawn). They compare non‑uniform curricula (CT), log‑scale uniform sampling (SCD), fixed gaps (IMM), and the continuous‑time limit where \(s \to t\).
- Network Parameterization (predict the instantaneous velocity \(v_\theta\) or the average velocity \(u_\theta\)). The former uses a DDIM‑style solver; the latter directly integrates via a closed‑form expression.
- Flow‑Map Solver (DDIM approximation vs. analytical average‑velocity integration).
- Loss Metric (LPIPS for perceptual quality, L2 for analytic simplicity, MMD for diversity in IMM).
Extensive ablations show how each choice impacts convergence speed, stability, and final FID.
- Three Practical Improvements –
- Plug‑in Velocity with Classifier‑Free Guidance: The model learns to correct a guided velocity \(\gamma v_t\) during training, reducing sensitivity to the guidance scale.
- Gradual Time Sampler: Starts with coarse time intervals and progressively refines them, which stabilizes training, especially for continuous‑time models where \(\Delta t\) becomes infinitesimal.
- Variational Adaptive Loss Weighting: Dynamically scales the loss over time, damping large early‑stage gradients and allowing finer adjustments later.
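To make the unified objective concrete, here is a minimal NumPy sketch of the shortcut training step under the average‑velocity parameterization. Everything in it is an illustrative assumption rather than the paper's implementation: the toy 1‑D rectified‑flow path, the linear `u_theta` model, the fixed time gaps `(t, s, r)`, and the finite‑difference gradient (a real implementation would use autodiff and a neural network).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D path x_t = (1 - t) * x0 + t * eps: clean data at t = 0,
# Gaussian noise at t = 1 (hypothetical setup for illustration).
def sample_path(batch, t):
    x0 = rng.normal(loc=2.0, scale=0.5, size=batch)  # "data" samples
    eps = rng.normal(size=batch)                     # noise samples
    return (1.0 - t) * x0 + t * eps

# Average-velocity model u_theta(x, t, r): a linear toy stand-in for a
# neural network, with weights for [x, t - r, bias].
theta = rng.normal(scale=0.1, size=3)

def u_theta(params, x, t, r):
    return params[0] * x + params[1] * (t - r) + params[2]

def flow_map(params, x, t, r):
    # Under the average-velocity parameterization the flow-map solver is
    # closed-form: X_theta^{t,r}(x) = x + (r - t) * u_theta(x, t, r).
    # (With an instantaneous-velocity network v_theta one would instead
    # take a DDIM/Euler-style step x + (r - t) * v_theta(x, t).)
    return x + (r - t) * u_theta(params, x, t, r)

def shortcut_loss(params, frozen, x_t, t, s, r):
    # Two-step target sg(X_{s,r} o X_{t,s}): evaluated with frozen
    # parameters, so no gradient flows through it (the stop-gradient).
    x_s = flow_map(frozen, x_t, t, s)
    target = flow_map(frozen, x_s, s, r)
    pred = flow_map(params, x_t, t, r)    # one-step prediction
    return np.mean((pred - target) ** 2)  # squared-L2 metric choice

# One iteration with fixed gaps r <= s <= t and finite-difference
# gradients on the one-step prediction only.
t, s, r = 0.9, 0.5, 0.0
x_t = sample_path(256, t)
frozen = theta.copy()
base = shortcut_loss(theta, frozen, x_t, t, s, r)
grad = np.zeros_like(theta)
for i in range(theta.size):
    bumped = theta.copy()
    bumped[i] += 1e-4
    grad[i] = (shortcut_loss(bumped, frozen, x_t, t, s, r) - base) / 1e-4
theta -= 0.1 * grad  # gradient step
```

The design-space knobs from the decomposition map directly onto this sketch: the time sampler is the rule for drawing `(t, s, r)`, the parameterization and solver live in `flow_map`, and the loss metric is the squared‑L2 term inside `shortcut_loss` (an adaptive weight on that term would implement a damping scheme in the spirit of the paper's adaptive loss weighting).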
Experimental Validation
Using the proposed framework and improvements, the authors train a continuous‑time shortcut model on ImageNet‑256×256 without any pre‑training, teacher distillation, or curriculum learning. The model achieves a one‑step FID50k of 2.85, and when trained for twice as many steps the FID improves to 2.53. These numbers surpass prior shortcut models (CT ≈ 3.2, SCD ≈ 3.0) and approach the performance of multi‑step diffusion models that require many NFEs. The paper also reports faster convergence, stable training curves, and ablations confirming that each of the three improvements contributes positively.
Limitations & Future Work
The study focuses on image synthesis; extending the framework to text‑to‑image, video, or 3D generation remains open. Multi‑class or more complex conditional guidance may require additional modifications to the plug‑in velocity scheme. Finally, the authors note that model size and hardware efficiency were not the primary focus, suggesting future research on lightweight architectures and hardware‑aware implementations.
Overall, the paper delivers a clear theoretical grounding for shortcut diffusion, disentangles the design space into reusable components, and demonstrates that careful component‑level engineering can yield state‑of‑the‑art one‑step diffusion without the heavy overhead of teacher‑based distillation.