On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm

Peng Sun, Bei Shi, Daiwei Yu, Tao Lin

arXiv | BibTeX

This is an official PyTorch implementation of the paper On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm (Preprint 2023).

Abstract

<p align="center"> <img src="./assets/framework.png" width=100% height=100% class="center"> </p>

Contemporary machine learning requires training large neural networks on massive datasets and thus faces the challenge of high computational demands. Dataset distillation, a recently emerging strategy, aims to compress real-world datasets for efficient training. However, this line of research currently struggles with large-scale and high-resolution datasets, which hinders its practicality and feasibility. To this end, we re-examine the existing dataset distillation methods and identify three properties required for large-scale real-world applications, namely realism, diversity, and efficiency. As a remedy, we propose RDED, a novel computationally efficient yet effective data distillation paradigm that enables both diversity and realism of the distilled data.

Usage

Requirements

torchvision==0.16.0
torch==2.1.0
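
Before running any experiment, it can help to confirm that the installed packages match the pins above. The following is a minimal sketch, not part of the repository: the helper names `base_version` and `check_requirements` are hypothetical, and the check assumes that a local build suffix such as `+cu121` should be ignored when comparing versions.

```python
from importlib import metadata

# Pins copied from the Requirements list above.
REQUIRED = {"torch": "2.1.0", "torchvision": "0.16.0"}


def base_version(version: str) -> str:
    """Drop a local build suffix such as '+cu121' from a version string."""
    return version.split("+", 1)[0]


def check_requirements(required=REQUIRED, get_version=metadata.version):
    """Return a list of problems; an empty list means every pin matches.

    get_version is injectable so the check can be exercised without
    torch or torchvision being installed.
    """
    problems = []
    for name, wanted in required.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed (need {wanted})")
            continue
        if base_version(found) != wanted:
            problems.append(f"{name}=={found} found, need {wanted}")
    return problems


if __name__ == "__main__":
    for line in check_requirements() or ["All pinned requirements are satisfied."]:
        print(line)
```

Other versions may also work; the sketch only reports whether the environment matches the versions listed here.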

How to Run

The main entry point for a single experiment is main.py. To make experiments easier to run, we provide scripts for the bulk experiments in the paper. For example, to run RDED to condense ImageNet-1K into a small dataset with $\texttt{IPC} = 10$ using ResNet-18, run the following command:

bash ./scripts/imagenet-1k_10ipc_resnet-18_to_resnet-18_cr5.sh
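
To give a sense of scale for this setting, the sketch below works out how many images the distilled set contains, reading IPC as images per class. The helper `distilled_size` is hypothetical, and the training-set size of 1,281,167 images is the commonly cited figure for ImageNet-1K, not a number taken from this repository.

```python
def distilled_size(num_classes: int, ipc: int) -> int:
    """Total number of distilled images for a given images-per-class budget."""
    return num_classes * ipc


NUM_CLASSES = 1000            # ImageNet-1K has 1,000 classes
TRAIN_IMAGES = 1_281_167      # assumed size of the ImageNet-1K training set

size = distilled_size(NUM_CLASSES, ipc=10)
fraction = size / TRAIN_IMAGES
print(f"{size} distilled images, {fraction:.2%} of the training set")
# prints "10000 distilled images, 0.78% of the training set"
```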

Pre-trained Models

Following SRe$^2$L, we adapt official Torchvision code to train the observer models from scratch. All our pre-trained observer models listed below are available at link.

| Dataset | Backbone | Top-1 Accuracy (%) | Input Size |
| --- | --- | --- | --- |
| CIFAR10 | ResNet18 (modified) | 93.86 | 32 $\times$ 32 |
| CIFAR10 | Conv3 | 82.24 | 32 $\times$ 32 |
| CIFAR100 | ResNet18 (modified) | 72.27 | 32 $\times$ 32 |
| CIFAR100 | Conv3 | 61.27 | 32 $\times$ 32 |
| Tiny-ImageNet | ResNet18 (modified) | 61.98 | 64 $\times$ 64 |
| Tiny-ImageNet | Conv4 | 49.73 | 64 $\times$ 64 |
| ImageNet-Nette | ResNet18 | 90.00 | 224 $\times$ 224 |
| ImageNet-Nette | Conv5 | 89.60 | 128 $\times$ 224 |
| ImageNet-Woof | ResNet18 | 75.00 | 224 $\times$ 224 |
| ImageNet-Woof | Conv5 | 67.40 | 128 $\times$ 128 |
| ImageNet-10 | ResNet18 | 87.40 | 224 $\times$ 224 |
| ImageNet-10 | Conv5 | 85.4 | 128 $\times$ 128 |
| ImageNet-100 | ResNet18 | 83.40 | 224 $\times$ 224 |
| ImageNet-100 | Conv6 | 72.82 | 128 $\times$ 128 |
| ImageNet-1k | Conv4 | 43.6 | 64 $\times$ 64 |

Storage Format for Raw Datasets

All raw datasets, including ImageNet-1K and CIFAR10, store their training and validation splits in the following layout, with one directory per class, so that a single standard dataset class can read them uniformly:

/path/to/dataset/
├── 00000/
│   ├── image1.jpg
│   ├── image2.jpg
│   ├── image3.jpg
│   ├── image4.jpg
│   └── image5.jpg
├── 00001/
│   ├── image1.jpg
│   ├── image2.jpg
│   ├── image3.jpg
│   ├── image4.jpg
│   └── image5.jpg
├── 00002/
│   ├── image1.jpg
│   ├── image2.jpg
│   ├── image3.jpg
│   ├── image4.jpg
│   └── image5.jpg

This layout keeps every dataset compatible with the unified dataset class and simplifies data handling.
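
To illustrate how such a layout can be read, here is a minimal standard-library sketch that indexes a dataset root into (image path, class index) pairs. It is not the repository's dataset class: the function `index_class_folders` and the extension list are assumptions made for illustration. Class indices follow the sorted order of the directory names, which is also the convention of `torchvision.datasets.ImageFolder`, a class that reads this kind of one-directory-per-class layout.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match the actual files.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def index_class_folders(root):
    """Return (class_names, samples) for a root/<class_dir>/<image> layout.

    samples is a list of (image_path, class_index) pairs. Class indices
    follow the sorted order of the class directory names, so "00000"
    maps to 0, "00001" to 1, and so on.
    """
    root = Path(root)
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    class_to_idx = {name: idx for idx, name in enumerate(class_names)}

    samples = []
    for name in class_names:
        for path in sorted((root / name).iterdir()):
            # Skip anything that is not an image file.
            if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
                samples.append((path, class_to_idx[name]))
    return class_names, samples
```

Calling `index_class_folders("/path/to/dataset/")` on the tree shown above would yield the class names `00000`, `00001`, `00002` and fifteen samples, five per class.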

Bibliography

If you find this repository helpful for your project, please consider citing our work:

@InProceedings{sun2024diversity,
  title={On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm},
  author={Sun, Peng and Shi, Bei and Yu, Daiwei and Lin, Tao},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}

Reference

Our code has referred to previous work: