Diversity-Preserved Distribution Matching Distillation for Fast Visual Synthesis
Distribution matching distillation (DMD) aligns a few-step generator with its multi-step teacher to enable high-quality generation at low inference cost. However, DMD tends to suffer from mode collapse, as its reverse-KL formulation inherently encourages mode-seeking behavior. Existing remedies typically rely on perceptual or adversarial regularization, incurring substantial computational overhead and training instability. In this work, we propose a role-separated distillation framework that explicitly disentangles the roles of the distilled steps: the first step is dedicated to preserving sample diversity via a target-prediction (e.g., v-prediction) objective, while subsequent steps focus on quality refinement under the standard DMD loss, with gradients from the DMD objective blocked at the first step. We term this approach Diversity-Preserved DMD (DP-DMD). Despite its simplicity – no perceptual backbone, no discriminator, no auxiliary networks, and no additional ground-truth images – it preserves sample diversity while maintaining visual quality on par with state-of-the-art methods in extensive text-to-image experiments.
💡 Research Summary
The paper addresses a critical limitation of Distribution Matching Distillation (DMD), a recent technique for accelerating diffusion‑based generative models. While DMD aligns the output distribution of a few‑step student model with that of a high‑quality multi‑step teacher, its reverse‑KL objective inherently encourages mode‑seeking behavior, leading to severe sample‑diversity collapse. Existing remedies add perceptual losses or adversarial discriminators, but these introduce substantial computational overhead and training instability, especially for large text‑to‑image models.
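To make the mode-seeking claim concrete, the DMD gradient can be sketched in the standard form from the DMD literature (the notation here – s_real and s_fake for the real and fake score networks, G_θ for the student generator – is assumed, not taken from this paper):

```latex
\nabla_\theta \, \mathcal{L}_{\mathrm{DMD}}
  \;=\; \nabla_\theta \, D_{\mathrm{KL}}\!\left(p_\theta \,\Vert\, p_{\mathrm{data}}\right)
  \;\approx\; \mathbb{E}_{z,\,t}\!\left[
      \big(s_{\mathrm{fake}}(x_t, t) - s_{\mathrm{real}}(x_t, t)\big)\,
      \frac{\partial G_\theta(z)}{\partial \theta}
  \right]
```

Because the expectation is taken over the student's own samples, regions of the data distribution that the student never visits contribute nothing to the loss, so collapsing mass onto a few high-density modes is a valid minimizer – exactly the mode-seeking behavior the paper targets.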
The authors propose Diversity‑Preserved DMD (DP‑DMD), a role‑separated distillation framework that explicitly assigns different objectives to different distilled steps. The first denoising step of the student is supervised with a target‑prediction (v‑prediction) loss derived from a teacher‑generated intermediate latent at a predefined noise level K. This loss is a simple L2 flow‑matching term that forces the student’s initial velocity prediction to match the teacher’s true flow, thereby preserving the global structure and diversity encoded in the early, high‑noise stage. Crucially, the output of this first step is detached (stop‑gradient) before proceeding to later steps, preventing the reverse‑KL gradients from overwriting the diversity‑preserving signal.
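As a minimal sketch of this first-step objective (a numpy toy, not the paper's code; the linear rectified-flow sign convention and the tensor shapes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))     # stand-in for the teacher-generated intermediate latent
noise = rng.standard_normal((4, 8))  # sampled Gaussian noise

def v_target(x0, noise):
    # Under a linear flow path x_t = (1 - t) * x0 + t * noise, the
    # ground-truth velocity is dx_t/dt = noise - x0; the sign convention
    # varies across codebases and is an assumption here.
    return noise - x0

def diversity_loss(v_pred, x0, noise):
    # Simple L2 flow-matching term supervising the student's first step.
    return float(np.mean((v_pred - v_target(x0, noise)) ** 2))
```

A perfect first-step prediction drives this loss to zero; in the actual method, the student's velocity output at the high-noise stage plays the role of `v_pred`.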
All subsequent steps (N‑1 of them) are trained solely with the standard DMD loss, which continues to refine fine‑grained visual details without affecting the diversity established earlier. The overall training objective is L = L_DMD + λ·L_Div, where λ balances quality refinement and diversity preservation. No perceptual backbone, discriminator, auxiliary networks, or extra ground‑truth images are required; everything operates in latent space, keeping the pipeline memory‑efficient and stable.
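A toy training-step sketch of the role separation (placeholders throughout: numpy has no autograd, so `stop_gradient` is a copy marking where `.detach()` would sit, and both loss terms are stand-ins for the real objectives):

```python
import numpy as np

LAMBDA = 1.0  # balances quality refinement (DMD) against diversity preservation

def stop_gradient(x):
    # Stand-in for .detach(): marks the point where DMD gradients are
    # blocked from flowing back into the first step.
    return np.array(x, copy=True)

def dmd_loss(latents):
    # Placeholder for the reverse-KL DMD objective on the N-1 later steps
    # (the real term requires the real and fake score networks).
    return float(np.mean(latents ** 2))

def div_loss(first_step_out):
    # Placeholder for the first-step v-prediction (flow-matching) term.
    return float(np.mean((first_step_out - 1.0) ** 2))

def total_loss(first_step_out, refine):
    # L = L_DMD + lambda * L_Div, with the later steps fed a detached
    # copy of the first step's output.
    refined = refine(stop_gradient(first_step_out))
    return dmd_loss(refined) + LAMBDA * div_loss(first_step_out)
```

Here `refine` stands for the remaining N-1 student steps; only `div_loss` would backpropagate into the first step.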
Experiments use two state‑of‑the‑art text‑to‑image backbones—flow‑based SD3.5‑Medium and diffusion‑based SDXL—at 1024×1024 resolution. Training is performed on the DiffusionDB dataset for 6,000 iterations with a modest batch size on eight A800 GPUs. Evaluation employs DINOv3‑ViT‑Large and CLIP‑ViT‑Large embeddings to quantify diversity (via pairwise cosine similarity) and standard metrics (FID, IS, CLIP‑Score) to quantify quality.
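The diversity metric described above can be sketched as follows (the exact convention is an assumption: one minus the mean pairwise cosine similarity of per-prompt sample embeddings, so higher means more diverse):

```python
import numpy as np

def pairwise_cosine_diversity(embeddings):
    # embeddings: (n_samples, dim) feature vectors, e.g. DINOv3 or CLIP
    # embeddings of images generated from the same prompt.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = e @ e.T                           # pairwise cosine similarities
    n = len(e)
    off_diag = sim[~np.eye(n, dtype=bool)]  # drop the self-similarities
    # Assumed convention: diversity = 1 - mean pairwise similarity.
    return float(1.0 - off_diag.mean())
```

Identical samples score 0 under this convention, mutually orthogonal embeddings score 1; a mode-collapsed model pushes the score toward 0.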
Results show that DP‑DMD dramatically improves diversity compared to vanilla DMD when both are limited to 4 NFEs (≈4 inference steps). Diversity scores increase by roughly 15–20% while quality metrics remain on par with or slightly better than DMD. Ablation studies confirm that the v‑prediction loss on the first step and the gradient stop are essential; removing either causes the model to revert to mode collapse. Moreover, because DP‑DMD eliminates perceptual and adversarial components, it reduces GPU memory consumption by 30–40% and yields a more stable training process.
In summary, DP‑DMD offers a simple yet powerful solution to the mode‑collapse problem of distribution‑matching distillation. By leveraging the natural stage‑wise behavior of diffusion processes—early steps governing global layout and diversity, later steps refining details—it achieves fast, high‑quality image synthesis without extra computational baggage. The approach is readily applicable to other flow‑based generative tasks such as video or 3D generation, and opens avenues for further research on multi‑stage role separation and alternative distribution‑matching objectives.