Dataset Distillation as Pushforward Optimal Quantization


Dataset distillation aims to find a synthetic training set such that training on the synthetic data achieves similar performance to training on real data, with orders of magnitude less computational requirements. Existing methods can be broadly categorized as either bi-level optimization problems that have neural network training heuristics as the lower level problem, or disentangled methods that bypass the bi-level optimization by matching distributions of data. The latter method has the major advantages of speed and scalability in terms of size of both training and distilled datasets. We demonstrate that when equipped with an encoder-decoder structure, the empirically successful disentangled methods can be reformulated as an optimal quantization problem, where a finite set of points is found to approximate the underlying probability measure by minimizing the expected projection distance. In particular, we link existing disentangled dataset distillation methods to the classical optimal quantization and Wasserstein barycenter problems, demonstrating consistency of distilled datasets for diffusion-based generative priors. We propose Dataset Distillation by Optimal Quantization, based on clustering in a latent space. Compared to the previous SOTA method D⁴M, we achieve better performance and inter-model generalization on the ImageNet-1K dataset with trivial additional computation, and SOTA performance in higher image-per-class settings. Using the distilled noise initializations in a stronger diffusion transformer model, we obtain SOTA distillation performance on ImageNet-1K and its subsets, outperforming diffusion guidance methods.


💡 Research Summary

The paper “Dataset Distillation as Pushforward Optimal Quantization” re‑examines dataset distillation (DD) from a theoretical perspective and proposes a new, theoretically grounded method that improves over recent state‑of‑the‑art (SOTA) disentangled approaches. Traditional DD is framed as a bi‑level optimization problem: an outer loop searches for a synthetic dataset S while an inner loop trains a neural network on S. This formulation is computationally prohibitive for large‑scale data such as ImageNet‑1K or ImageNet‑21K because the inner training must be back‑propagated through many gradient steps, leading to memory and time costs that grow with the size of the original dataset.

In contrast, recent “disentangled” methods avoid the inner optimization by matching statistics of the data distribution (e.g., batch‑norm moments, feature distributions) or by using generative priors (GANs, diffusion models). Although these methods scale much better, they lack a rigorous justification for why a small set of synthetic points can faithfully approximate the full data distribution.
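The statistics-matching idea above can be illustrated with a minimal sketch: synthetic samples are optimized so that the first and second moments of their features match those of the real data (as in batch-norm-statistics matching). The function name and toy feature vectors here are illustrative, not the implementation used by any specific method:

```python
import numpy as np

def moment_matching_loss(real_feats, syn_feats):
    """Distribution-matching objective on toy feature vectors:
    squared difference of per-dimension means and variances.
    In an actual pipeline the features would come from a network's
    intermediate layers, and syn_feats would be updated by gradient
    descent on this loss."""
    mu_r, mu_s = real_feats.mean(0), syn_feats.mean(0)
    var_r, var_s = real_feats.var(0), syn_feats.var(0)
    return ((mu_r - mu_s) ** 2).sum() + ((var_r - var_s) ** 2).sum()
```

The loss vanishes exactly when the synthetic moments equal the real ones, which is why such methods need no inner training loop.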

The authors bridge this gap by showing that disentangled DD can be cast as an optimal quantization problem in a low‑dimensional latent space. An encoder‑decoder pair (E, D) maps high‑dimensional images into a d‑dimensional latent space, where the pushforward of the data distribution µ ∈ P₂(ℝᵈ) lives. The goal is to find K representative points x = {x₁, …, x_K} that minimize the quadratic distortion

 G_{K,µ}(x) = E_{X∼µ}[ min_{1≤k≤K} ‖X − x_k‖² ].
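Over an empirical latent sample, minimizing this distortion is exactly the k-means objective, so the optimization reduces to clustering in latent space. A minimal numpy sketch using Lloyd iterations (the helper name and toy setup are illustrative, not the paper's exact pipeline):

```python
import numpy as np

def quantize(latents, K, iters=50, seed=0):
    """Lloyd's algorithm (k-means) on latent vectors: pick K centroids
    that approximately minimize the empirical quadratic distortion
    G_{K,mu}(x) = E[min_k ||X - x_k||^2] over the latent samples."""
    rng = np.random.default_rng(seed)
    centroids = latents[rng.choice(len(latents), K, replace=False)].copy()
    for _ in range(iters):
        # assignment step: nearest centroid for each latent point
        d2 = ((latents[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        # update step: each centroid becomes its cluster mean
        for k in range(K):
            members = latents[assign == k]
            if len(members):
                centroids[k] = members.mean(0)
    # final empirical distortion under the updated centroids
    d2 = ((latents[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, d2.min(1).mean()
```

Roughly, passing the K centroids back through the decoder D would then yield the synthetic images, which is what makes the encoder‑decoder structure essential to the reformulation.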

