Rectified Decoupled Dataset Distillation: A Closer Look for Fair and Comprehensive Evaluation
Dataset distillation aims to generate compact synthetic datasets that enable models trained on them to achieve performance comparable to those trained on full real datasets, while substantially reducing storage and computational costs. Early bi-level optimization methods (e.g., MTT) have shown promising results on small-scale datasets, but their scalability is limited by high computational overhead. To address this limitation, recent decoupled dataset distillation methods (e.g., SRe$^2$L) separate the teacher model pre-training from the synthetic data generation process. These methods also introduce random data augmentation and epoch-wise soft labels during the post-evaluation phase to improve performance and generalization. However, existing decoupled distillation methods suffer from inconsistent post-evaluation protocols, which hinders progress in the field. In this work, we propose Rectified Decoupled Dataset Distillation (RD$^3$), and systematically investigate how different post-evaluation settings affect test accuracy. We further examine whether the reported performance differences across existing methods reflect true methodological advances or stem from discrepancies in evaluation procedures. Our analysis reveals that much of the performance variation can be attributed to inconsistent evaluation rather than differences in the intrinsic quality of the synthetic data. In addition, we identify general strategies that improve the effectiveness of distilled datasets across settings. By establishing a standardized benchmark and rigorous evaluation protocol, RD$^3$ provides a foundation for fair and reproducible comparisons in future dataset distillation research.
💡 Research Summary
Dataset distillation aims to replace large training corpora with a tiny set of synthetic examples that preserve most of the original information. Early bi‑level approaches (e.g., MTT, TESLA) achieved impressive results on small benchmarks such as CIFAR‑10/100, but their nested optimization made them prohibitively expensive for large‑scale datasets like ImageNet‑1K. Recent “decoupled” methods—SRe²L, CD‑A, DW, G‑VBSM, EDC, Minimax, D⁴M, RDED, FocusDD, DPS, etc.—address this scalability issue by pre‑training a teacher model (classifier or diffusion generator) and then generating synthetic data without further teacher updates. To compensate for the reduced supervision, these works typically employ epoch‑wise soft labels and various data‑augmentation pipelines during the post‑evaluation phase.
The authors of this paper identify a critical obstacle: inconsistent post‑evaluation protocols. Different papers use different batch sizes, training epochs, learning‑rate schedules, augmentation strengths, and soft‑label generation strategies. Because dataset distillation is highly sensitive to these hyper‑parameters, reported performance gaps (sometimes exceeding 27 % relative improvement) are difficult to attribute to the core distillation algorithm itself.
To resolve this, the authors propose Rectified Decoupled Dataset Distillation (RD³), a unified benchmark that standardizes every aspect of the post‑evaluation stage. The key components of RD³ are:
- Standardized post‑evaluation settings – a single pre‑trained ResNet‑18 teacher, KL‑divergence based soft labels, batch size = 256, training epochs = 400, and a fixed augmentation pipeline (random crop, color jitter, etc.).
- Comprehensive dataset coverage – six image datasets (CIFAR‑10/100, TinyImageNet, ImageNette, ImageWoof, ImageNet‑1K) evaluated across a wide range of IPC values (1–100).
- Cross‑architecture generalization – evaluation models include ResNet‑18/50/101, EfficientNet, MobileNet, Swin‑Transformer‑T, and ViT‑B, allowing assessment of how well the synthetic data transfers to unseen architectures.
- Additional metrics – beyond top‑1 accuracy, the benchmark records training time, memory consumption, labeling cost, and energy usage.
- Open‑source reproducibility – all code, hyper‑parameters, and trained synthetic sets are released on GitHub.
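The soft-label component of this standardized protocol, KL divergence between teacher and student predictions, can be sketched in a few lines. This is a minimal numpy illustration; the function names, temperature parameter, and batch-mean reduction are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_soft_label_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student), averaged over the batch.

    The teacher's softened output distribution acts as the soft label;
    the student is trained to match it.
    """
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    eps = 1e-12  # numerical safety for the logs
    kl = (t * (np.log(t + eps) - np.log(s + eps))).sum(axis=-1)
    return float(kl.mean())

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
print(kl_soft_label_loss(logits, logits))       # identical logits -> 0.0
print(kl_soft_label_loss(logits * 2.0, logits) > 0)  # mismatch -> positive loss
```

Because softmax is shift-invariant, adding a constant to all logits leaves the loss unchanged; only changes in the relative logit gaps are penalized.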
Using RD³, the authors re‑implemented and re‑evaluated ten representative decoupled methods. Under the unified protocol, the previously reported 27 % performance spread collapses to roughly 6–7 % absolute difference. This demonstrates that most of the claimed gains stem from implementation tricks (larger batch, longer training, stronger augmentations, more sophisticated soft‑label schemes) rather than genuine improvements in the synthetic data itself.
The paper also uncovers several simple yet powerful factors that consistently affect performance:
- Batch size: increasing from 64 to 256 yields a 2–3 % boost in accuracy across methods.
- Training epochs: extending training from 300 to 400 epochs lets all methods converge fully without overfitting, providing a stable comparison point.
- Epoch‑wise soft labels: applying KL‑based soft targets from a fixed teacher each epoch improves large‑IPC regimes by 1–2 %.
- Ensemble teachers: aggregating predictions from multiple pre‑trained models reduces label noise and improves generalization (as seen in G‑VBSM and EDC).
- Initialization: starting optimization‑based synthesis from real images (instead of random noise) enhances diversity and speeds up convergence.
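The ensemble-teacher factor above amounts to aggregating the softened predictions of several pre-trained teachers before using them as soft labels. The sketch below averages teacher probability distributions in numpy; the exact aggregation scheme is an assumption for illustration, since G‑VBSM and EDC each have their own details:

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ensemble_soft_labels(teacher_logits_list, temperature=4.0):
    """Average the softened predictions of several teachers.

    Averaging probabilities (rather than raw logits) keeps the result a
    valid distribution and smooths out individual-teacher label noise.
    """
    probs = [softmax(l / temperature) for l in teacher_logits_list]
    return np.mean(probs, axis=0)

rng = np.random.default_rng(1)
teachers = [rng.normal(size=(2, 5)) for _ in range(3)]
labels = ensemble_soft_labels(teachers)
print(labels.shape)                           # (2, 5)
print(np.allclose(labels.sum(axis=-1), 1.0))  # True: still a distribution
```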
The authors further propose practical recommendations: use real‑data initialization for optimization‑based methods, match full feature distributions rather than only batch‑norm statistics, and combine diffusion‑based generation with textual prompts to handle out‑of‑distribution classes.
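One way to read the "full feature distribution" recommendation is to match more than the per-channel mean and variance that batch-norm statistics capture. The numpy sketch below contrasts BN-style moment matching with a simple kernel-based distance (the biased RBF-kernel MMD estimator); both loss functions here are hypothetical illustrations, not the paper's objectives:

```python
import numpy as np

def bn_stat_loss(real_feats, syn_feats):
    """Match only per-channel mean and variance (BN-style statistics)."""
    mu_gap = real_feats.mean(axis=0) - syn_feats.mean(axis=0)
    var_gap = real_feats.var(axis=0) - syn_feats.var(axis=0)
    return float((mu_gap ** 2).sum() + (var_gap ** 2).sum())

def mmd_rbf_loss(x, y, gamma=1.0):
    """Squared MMD with an RBF kernel: sensitive to the whole distribution."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return float(k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean())

# Synthetic features with the *same* mean/variance as the real ones but a
# different shape (rescaled uniform vs. Gaussian): BN-stat matching is blind
# to the difference, while the kernel distance remains clearly positive.
rng = np.random.default_rng(2)
real = rng.normal(size=(64, 4))
syn = rng.uniform(-1.0, 1.0, size=(64, 4))
syn = (syn - syn.mean(0)) / syn.std(0) * real.std(0) + real.mean(0)
print(bn_stat_loss(real, syn) < 1e-6)  # True: first two moments agree
print(mmd_rbf_loss(real, syn) > 1e-3)  # True: kernel distance stays positive
```

Note that the biased MMD estimator used here is positive even for identical distributions, so in practice an unbiased (off-diagonal) estimator is the more rigorous choice; the sketch only illustrates why moment matching alone can be blind to distributional differences.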
In conclusion, RD³ establishes a fair, reproducible, and extensible evaluation framework for decoupled dataset distillation. By stripping away confounding implementation variables, it enables the community to focus on genuine algorithmic advances. Future work can build upon this benchmark to explore new synthetic data generation techniques, efficiency‑oriented objectives, and broader downstream tasks, confident that reported improvements reflect true methodological progress rather than hidden evaluation tricks.