Label-Consistent Dataset Distillation with Detector-Guided Refinement
Dataset distillation (DD) aims to generate a compact yet informative dataset that achieves performance comparable to the original, thereby reducing storage and computational demands. Although diffusion models have made significant progress in dataset distillation, the generated surrogate datasets often contain samples with label inconsistencies or insufficient structural detail, leading to suboptimal downstream performance. To address these issues, we propose a detector-guided dataset distillation framework that explicitly leverages a pre-trained detector to identify and refine anomalous synthetic samples, ensuring label consistency and improving image quality. Specifically, a detector model trained on the original dataset identifies anomalous images that exhibit label mismatches or low classification confidence. For each defective image, multiple candidates are generated with a pre-trained diffusion model conditioned on the corresponding image prototype and label. The optimal candidate is then selected by jointly considering the detector's confidence score and its dissimilarity to existing qualified synthetic samples, ensuring both label accuracy and intra-class diversity. Experimental results demonstrate that our method synthesizes high-quality representative images with richer detail, achieving state-of-the-art performance on the validation set.
💡 Research Summary
The paper tackles two persistent problems in dataset distillation (DD): label noise and insufficient structural detail in the synthetic data generated by recent diffusion‑based methods. While diffusion models have enabled high‑resolution image synthesis, they still produce a non‑trivial fraction of mislabeled or low‑confidence samples, which degrades downstream model performance. To remedy this, the authors propose a detector‑guided DD framework that explicitly detects and refines anomalous synthetic images, ensuring label consistency and improving intra‑class diversity.
The framework consists of two main modules. First, a prototype‑guided image synthesis stage uses a pre‑trained encoder to extract latent features from the original dataset, clusters them per class with K‑means, and treats the cluster centroids as class prototypes. These prototypes, together with CLIP‑derived text embeddings of class labels, condition a Latent Diffusion Model (LDM) to generate an initial set of synthetic images. Because generation is conditioned on a concrete prototype rather than pure random noise, the same prototype can be re‑used to produce multiple variants, which is essential for later refinement.
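The prototype step above can be sketched with a minimal K-means over per-class encoder features. This is an illustrative reimplementation, not the paper's code: the function name `class_prototypes`, the iteration count, and the assumption that `features` is an `(N, D)` array of latent embeddings for a single class are all ours.

```python
import numpy as np

def class_prototypes(features: np.ndarray, ipc: int,
                     iters: int = 20, seed: int = 0) -> np.ndarray:
    """Return `ipc` K-means centroids (class prototypes) for one class.

    `features`: (N, D) latent embeddings of that class's real images.
    Each centroid later conditions the diffusion model for one synthetic image.
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids from distinct random samples of the class.
    centroids = features[rng.choice(len(features), size=ipc, replace=False)].copy()
    for _ in range(iters):
        # Assign every feature vector to its nearest centroid.
        dists = np.linalg.norm(features[:, None] - centroids[None], axis=-1)
        assign = dists.argmin(axis=1)
        # Recompute each centroid as its cluster mean (keep old if empty).
        for k in range(ipc):
            members = features[assign == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return centroids
```

Because a prototype is a fixed vector (not fresh noise), re-running the generator on the same centroid yields comparable but non-identical variants, which is exactly what the refinement stage needs.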
Second, an anomaly detection and refinement stage employs a classifier (the “detector”) that has been trained on the original data using CutMix augmentation to make it robust to mixed‑label inputs. The detector evaluates each synthetic image, producing a predicted label and a softmax confidence score. An image is flagged as defective if its predicted label differs from the intended class or if its confidence falls below a threshold β. For each defective sample, the system re‑conditions the LDM on the same prototype and label to generate a set of candidate images. These candidates are ranked by detector confidence; the top‑k candidates are examined for diversity by measuring feature‑space distance (e.g., cosine distance) to the already accepted normal samples. The candidate that maximizes this distance is selected, thereby preserving label correctness while encouraging intra‑class variety.
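The two decision rules in this stage (defect flagging, then confidence-ranked, diversity-maximizing selection) can be written down compactly. The sketch below is a plausible reading of the text, with our own names (`is_defective`, `select_candidate`) and cosine distance as the feature-space metric the paper mentions as an example:

```python
import numpy as np

def is_defective(probs: np.ndarray, target: int, beta: float = 0.8) -> bool:
    """Flag a synthetic image: wrong predicted label OR confidence below beta."""
    return probs.argmax() != target or probs.max() < beta

def select_candidate(cand_feats: np.ndarray, cand_conf: np.ndarray,
                     accepted_feats: np.ndarray, top_k: int = 3) -> int:
    """Among regenerated candidates, return the index of the chosen one.

    Rank by detector confidence, keep the top-k, then pick the candidate
    whose mean cosine distance to already-accepted samples is largest.
    """
    order = np.argsort(cand_conf)[::-1][:top_k]  # most confident first

    def mean_cos_dist(a: np.ndarray, pool: np.ndarray) -> float:
        a = a / np.linalg.norm(a)
        pool = pool / np.linalg.norm(pool, axis=1, keepdims=True)
        return float((1.0 - pool @ a).mean())

    scores = [mean_cos_dist(cand_feats[i], accepted_feats) for i in order]
    return int(order[int(np.argmax(scores))])
```

Note that the confidence filter runs first, so the diversity criterion can never rescue a low-confidence (likely mislabeled) candidate; it only breaks ties among already-trustworthy ones.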
Algorithm 1 formalizes the entire pipeline: (1) extract features and compute class prototypes; (2) synthesize initial images via the LDM; (3) run the detector to separate normal from anomalous samples; (4) for each anomaly, generate multiple candidates, filter by confidence, and pick the most diverse one; (5) aggregate the refined set as the final distilled dataset.
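Abstracting the heavy components (encoder, K-means, diffusion model, detector) as injected callables, the five steps of the pipeline reduce to a short driver loop. This is a structural sketch of Algorithm 1 as summarized above, not the authors' interface; every parameter name here is illustrative:

```python
def distill(classes, encode, kmeans, generate, detect, select,
            beta: float = 0.8, n_cand: int = 8):
    """Skeleton of the distillation pipeline (Algorithm 1, paraphrased).

    encode(c)      -> latent features of class c's real images
    kmeans(feats)  -> list of class prototypes                 # step (1)
    generate(p, c) -> one synthetic image from prototype p     # step (2)
    detect(img)    -> (predicted_label, confidence)            # step (3)
    select(cands, accepted) -> chosen refined image            # step (4)
    """
    distilled = []
    for c in classes:
        prototypes = kmeans(encode(c))
        accepted = []
        for p in prototypes:
            img = generate(p, c)
            label, conf = detect(img)
            if label == c and conf >= beta:      # normal sample: keep as-is
                accepted.append(img)
                continue
            # Defective: regenerate candidates from the SAME prototype,
            # then apply the confidence + diversity selection rule.
            cands = [generate(p, c) for _ in range(n_cand)]
            accepted.append(select(cands, accepted))
        distilled.extend(accepted)               # step (5): aggregate
    return distilled
```

The key design point visible here is that refinement reuses the prototype, so the extra cost is only `n_cand` forward passes of the generator plus cheap detector inference per defective sample.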
Empirical evaluation on CIFAR‑10, CIFAR‑100, and ImageNet‑1K demonstrates substantial gains. Compared with prior diffusion‑based distillation methods such as D4M and Stable Diffusion, the proposed approach reduces label error rates from roughly 12 % to under 2 % and improves validation accuracy by 2–5 % absolute across various images‑per‑class (IPC) settings. Ablation studies show that both the detector‑based filtering and the diversity‑aware candidate selection contribute meaningfully to performance. The additional computational cost of generating multiple candidates is modest because the same prototype is reused, and detector inference is lightweight.
In summary, the paper introduces a practical, largely automated method for producing high‑quality distilled datasets. By integrating a pre‑trained detector for anomaly detection with prototype‑conditioned diffusion generation and a diversity‑aware selection strategy, it simultaneously addresses label consistency and structural richness—two critical bottlenecks in modern DD. This work opens avenues for more reliable DD in resource‑constrained environments, continual learning, and privacy‑preserving scenarios.