Sequential Subset Matching for Dataset Distillation
This repository contains the code for training expert trajectories and distilling synthetic data from our paper Sequential Subset Matching for Dataset Distillation (NeurIPS 2023). We provide the Sequential Subset Matching (SeqMatch) method on top of both the MTT and IDC models.
Getting Started
Download
git clone https://github.com/shqii1j/seqmatch.git
cd seqmatch
Requirements for SeqMatch in the MTT model
If you have an RTX 30XX GPU (or newer), run
conda env create -f requirements_11_3.yaml
If you have an RTX 20XX GPU (or older), run
conda env create -f requirements_10_2.yaml
You can then activate your conda environment with
conda activate distillation
Requirements for SeqMatch in the IDC model
If you have already created the distillation environment, activate it with
conda activate distillation
If not, install the pytorch and efficientnet_pytorch packages.
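A minimal installation sketch, assuming a pip-based setup (the package versions are not pinned here and should be chosen to match your CUDA toolkit):
# assumed package set; pin versions as needed for your GPU/CUDA setup
pip install torch torchvision efficientnet_pytorch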
Sequential Subset Matching in MTT
There is an example .sh file showing how to use our code. The following commands will generate 3 subsets to distill CIFAR-10 down to 50 images per class:
cd seqmatch-mtt
bash run.sh
Using buffer.py, you can generate expert trajectories for the first subset. The following command will train 100 ConvNet models on CIFAR-10 with ZCA whitening for 50 epochs each:
python buffer.py --dataset=CIFAR10 --model=ConvNet --train_epochs=50 --num_experts=100 --zca --buffer_path={path_to_buffer_storage} --data_path={path_to_dataset}
Using distill_eval_new.py, you can generate the first subset from the buffers. The following command will distill the first subset for CIFAR-10 down to just 1 image per class:
python distill_eval_new.py --dataset=CIFAR10 --ipc=1 --syn_steps=30 --expert_epochs=2 --max_start_epoch=20 --zca --lr_img=100 --lr_lr=1e-05 --lr_teacher=0.01 --buffer_path={path_to_buffer_storage} --data_path={path_to_dataset} --run_name={path to the task} --name={path to the subset}
For the subsequent subsets, you need to use the --pre_names and --reparam_syn flags. The following commands will generate expert trajectories based on the previous subsets and then distill the new subset for CIFAR-10 (a sketch of the full pipeline follows these commands):
python buffer.py --dataset=CIFAR10 --model=ConvNet --train_epochs=20 --num_experts=100 --zca --image_path=logged_files --data_path={path_to_dataset} --run_name={path to the task} --pre_names={paths to the previous subsets} --reparam_syn
python distill_eval_new.py --dataset=CIFAR10 --ipc=1 --syn_steps=30 --expert_epochs=2 --zca --image_path=logged_files --data_path={path_to_dataset} --buffer_path=./logged_files/CIFAR10/{path to the task}/{paths to the last subset}/buffer --intervals=0-20 --lr_img=100 --lr_lr=1e-05 --lr_teacher=0.01 --run_name={path to the task} --pre_names={paths to the previous subsets} --name={path to the new subset} --reparam_syn
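To make the ordering concrete, here is a hedged sketch of the full three-subset pipeline. The run and subset names (cifar10_task, subset_1, ...) are hypothetical, the comma-separated format for --pre_names is an assumption, and all flags are taken from the commands above:
# hedged sketch: three-subset SeqMatch pipeline on CIFAR-10; names below are hypothetical
RUN=cifar10_task
DATA={path_to_dataset}
BUF={path_to_buffer_storage}
# Subset 1: experts trained on the real data, then the first distilled subset
python buffer.py --dataset=CIFAR10 --model=ConvNet --train_epochs=50 --num_experts=100 --zca --buffer_path=$BUF --data_path=$DATA
python distill_eval_new.py --dataset=CIFAR10 --ipc=1 --syn_steps=30 --expert_epochs=2 --max_start_epoch=20 --zca --lr_img=100 --lr_lr=1e-05 --lr_teacher=0.01 --buffer_path=$BUF --data_path=$DATA --run_name=$RUN --name=subset_1
# Subsets 2 and 3: experts resume from the previously distilled subsets, then a new subset is distilled
PRE=subset_1
LAST=subset_1
for i in 2 3; do
  python buffer.py --dataset=CIFAR10 --model=ConvNet --train_epochs=20 --num_experts=100 --zca --image_path=logged_files --data_path=$DATA --run_name=$RUN --pre_names=$PRE --reparam_syn
  python distill_eval_new.py --dataset=CIFAR10 --ipc=1 --syn_steps=30 --expert_epochs=2 --zca --image_path=logged_files --data_path=$DATA --buffer_path=./logged_files/CIFAR10/$RUN/$LAST/buffer --intervals=0-20 --lr_img=100 --lr_lr=1e-05 --lr_teacher=0.01 --run_name=$RUN --pre_names=$PRE --name=subset_$i --reparam_syn
  LAST=subset_$i
  PRE=$PRE,subset_$i   # assumption: multiple previous subsets are listed comma-separated
done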
Please find a full list of hyper-parameters in our paper (https://arxiv.org/abs/2311.01570).
ImageNet
When generating expert trajectories with buffer.py or distilling the dataset with distill_eval_new.py for ImageNet, you must designate a named subset of ImageNet with the --subset flag.
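For example, a buffer run on an ImageNet subset might look like the following; the subset name imagenette and the other settings here are assumptions carried over from the MTT codebase rather than values prescribed in our paper:
# hypothetical example; subset name and hyper-parameters are assumptions
python buffer.py --dataset=ImageNet --subset=imagenette --model=ConvNet --train_epochs=50 --num_experts=100 --buffer_path={path_to_buffer_storage} --data_path={path_to_dataset}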
Sequential Subset Matching in IDC
There is an example .sh file showing how to use our code. The following command will generate 2 subsets to distill CIFAR-10 down to 50 images per class (25 images per class in each subset):
cd seqmatch-idc
python condense_new.py --reproduce -d cifar10 -f 2 --ipcs=[25,25] --inner_loop=[50,100] --niters=[2000,4000] --lrs_img_ori=[5e-3,5e-3] --it_log=100 --it_eval=100 --seed=2023 --fix_iter=50
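The list-valued flags provide one entry per subset. The commented restatement below reflects our reading of these flags and should be treated as an assumption rather than authoritative documentation:
# assumed interpretation of the per-subset list flags (one entry per subset):
#   --ipcs=[25,25]            images per class contributed by subset 1 and subset 2
#   --inner_loop=[50,100]     inner-loop steps used when distilling each subset
#   --niters=[2000,4000]      outer distillation iterations for each subset
#   --lrs_img_ori=[5e-3,5e-3] learning rate for the synthetic images of each subset
python condense_new.py --reproduce -d cifar10 -f 2 --ipcs=[25,25] --inner_loop=[50,100] --niters=[2000,4000] --lrs_img_ori=[5e-3,5e-3] --it_log=100 --it_eval=100 --seed=2023 --fix_iter=50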
You can get a more robust result (the test is repeated 5 times) via the following command:
python test.py -d cifar10 -n convnet -f 2 --reproduce --ipcs=[5,5] --repeat=5 --seed=2023 --data_path={path to the results} --test_paths={paths to the subsets}
Reference
If you find our code useful for your research, please cite our paper:
@inproceedings{
du2023sequential,
title={Sequential Subset Matching for Dataset Distillation},
author={Jiawei Du and Qin Shi and Joey Tianyi Zhou},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
year={2023}
}