Efficient-Dataset-Condensation

Official PyTorch implementation of "Dataset Condensation via Efficient Synthetic-Data Parameterization", published at ICML'22

[Figure: condensed image samples]

Abstract

The great success of machine learning with massive amounts of data comes at a price of huge computation costs and storage for training and tuning. Recent studies on dataset condensation attempt to reduce the dependence on such massive data by synthesizing a compact training dataset. However, the existing approaches have fundamental limitations in optimization due to the limited representability of synthetic datasets without considering any data regularity characteristics. To this end, we propose a novel condensation framework that generates multiple synthetic data with a limited storage budget via efficient parameterization considering data regularity. We further analyze the shortcomings of the existing gradient matching-based condensation methods and develop an effective optimization technique for improving the condensation of training data information. We propose a unified algorithm that drastically improves the quality of condensed data against the current state-of-the-art on CIFAR-10, ImageNet, and Speech Commands.
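
As a rough illustration of the multi-formation idea ("generates multiple synthetic data ... via efficient parameterization"), below is a minimal PyTorch sketch that decodes each stored synthetic image into several training images by splitting it into patches and upsampling them. This is only a sketch under assumptions (bilinear resizing, a square factor); it is not the repository's exact implementation.

```python
import torch
import torch.nn.functional as F

def decode_multiform(x, factor):
    """Decode condensed images into factor**2 training images each.

    x: (N, C, H, W) synthetic images stored within the budget.
    factor: multi-formation factor f; each stored image is split into
    f x f non-overlapping patches, and every patch is upsampled back
    to (H, W). Bilinear resizing is an assumption here.
    """
    _, c, h, w = x.shape
    ph, pw = h // factor, w // factor
    # split into factor x factor patches: (N, C, f, f, ph, pw)
    patches = x.unfold(2, ph, ph).unfold(3, pw, pw)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, c, ph, pw)
    # upsample every patch to the full resolution
    return F.interpolate(patches, size=(h, w), mode='bilinear', align_corners=False)

# Example: 10 stored 32x32 images with factor 2 decode into 40 training images.
imgs = torch.randn(10, 3, 32, 32)
print(decode_multiform(imgs, 2).shape)  # torch.Size([40, 3, 32, 32])
```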

Basic results (Data Drive)

Top-1 test accuracies with ConvNet-3 (ResNetAP-10 for ImageNet)

| Method | CIFAR-10 | SVHN | MNIST | FashionMNIST |
|--------|----------|------|-------|--------------|
| IDC-I  | 36.7 | 46.7 | 88.9 | 70.7 |
| IDC    | 50.6 | 68.5 | 94.2 | 81.0 |

| Method | CIFAR-10 | CIFAR-100 | SVHN | MNIST | FashionMNIST | ImageNet-10 | ImageNet-100 |
|--------|----------|-----------|------|-------|--------------|-------------|--------------|
| IDC-I  | 58.3 | 36.6 | 77.0 | 98.0 | 85.3 | 61.4 | 29.2 |
| IDC    | 67.5 | 45.1 | 87.5 | 98.4 | 86.0 | 72.8 | 46.7 |

| Method | CIFAR-100 | ImageNet-10 | ImageNet-100 |
|--------|-----------|-------------|--------------|
| IDC-I  | 41.5 | 65.5 | 34.5 |
| IDC    | 49.0 | 76.6 | 53.7 |

| Method | CIFAR-10 | SVHN | MNIST | FashionMNIST |
|--------|----------|------|-------|--------------|
| IDC-I  | 69.5 | 87.9 | 98.8 | 89.1 |
| IDC    | 74.5 | 90.1 | 99.1 | 86.2 |

Requirements

Updates

Test Condensed Data

Download data

You can download condensed data evaluated in our paper from Here.

Training neural networks on condensed data

Then run the following command:

python test.py -d [dataset] -n [network] -f [factor] --ipc [image/class] --repeat [#repetition]
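
For instance, evaluating 10 images/class CIFAR-10 data with ConvNet-3 might look like the command below; the dataset/network identifiers and the factor value are illustrative, so check test.py for the exact argument choices:

python test.py -d cifar10 -n convnet -f 2 --ipc 10 --repeat 3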

With 10 images/class condensed data, the top-1 test accuracies of ConvNet-3 (ResNetAP-10 for ImageNet) are approximately:

| Method | CIFAR-10 | CIFAR-100 | SVHN | MNIST | FashionMNIST | ImageNet-10 | ImageNet-100 |
|--------|----------|-----------|------|-------|--------------|-------------|--------------|
| IDC-I  | 58.3 | 36.6 | 77.0 | 98.0 | 85.3 | 61.4 | 29.2 |
| IDC    | 67.5 | 45.1 | 87.5 | 98.4 | 86.0 | 72.8 | 46.7 |

You can also evaluate data condensed or selected by other methods by setting -s [dsa, kip, random, herding].
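
For example, a run evaluating a herding coreset baseline instead of our condensed data might look like this (the argument values are illustrative):

python test.py -d cifar10 -n convnet -s herding --ipc 10 --repeat 3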

Optimize Condensed Data

To reproduce our condensed data (except for ImageNet-100), simply run

python condense.py --reproduce -d [dataset] -f [factor] --ipc [image/class]
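
For instance, condensing CIFAR-10 with a multi-formation factor of 2 and 10 images per class might look like this (the dataset name is illustrative):

python condense.py --reproduce -d cifar10 -f 2 --ipc 10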

Faster optimization

  1. Utilizing pretrained networks

    • To train the pretrained networks (used in the condensation stage), run
    python pretrain.py -d imagenet --nclass 100 -n resnet_ap --pt_from [pretrain epochs] --seed [seed]
    
    • In our ImageNet-100 experiments, we used --pt_from 5 and trained networks with 10 random seeds.
    • For ImageNet-10, --pt_from 10 works well.
  2. Multi-processing

    • We partition the classes and run condensation with multiple processes (condense_mp.py).
    • --nclass_sub sets the number of classes per partition, and --phase selects the partition index (see the sketch below).
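
The sketch below shows how --nclass_sub and --phase could map to class indices; the exact indexing inside condense_mp.py may differ.

```python
# Illustrative mapping from (--nclass_sub, --phase) to class indices.
nclass, nclass_sub = 100, 20  # e.g., ImageNet-100 split into 5 partitions
for phase in range(nclass // nclass_sub):
    start = phase * nclass_sub
    print(f"--phase {phase}: classes {start}..{start + nclass_sub - 1}")
```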

To sum up, after saving the pretrained models, run the command below once for each phase value (0-4):

python condense_mp.py --reproduce -d imagenet --nclass 100 --pt_from 5 -f [factor] --ipc [image/class] --nclass_sub 20 --phase [0,1,2,3,4]

Train Networks on Original Training Set

python train.py -d [dataset] -n [network]
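
For example (the dataset and network names are illustrative):

python train.py -d cifar10 -n convnet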

Citation

@inproceedings{kimICML22,
  title = {Dataset Condensation via Efficient Synthetic-Data Parameterization},
  author = {Kim, Jang-Hyun and Kim, Jinuk and Oh, Seong Joon and Yun, Sangdoo and Song, Hwanjun and Jeong, Joonhyun and Ha, Jung-Woo and Song, Hyun Oh},
  booktitle = {International Conference on Machine Learning (ICML)},
  year = {2022}
}