Multisize Dataset Condensation

[Paper] | [BibTeX]

Official PyTorch implementation of "Multisize Dataset Condensation", published at ICLR'24 (Oral)

<img src="images/teaser.png" alt="Alt text" style="width: 640px; height: 320px; margin-right: 10px;">

Abstract: While dataset condensation effectively enhances training efficiency, its application in on-device scenarios brings unique challenges. 1) Due to the fluctuating computational resources of these devices, there's a demand for a flexible dataset size that diverges from a predefined size. 2) The limited computational power on devices often prevents additional condensation operations. These two challenges connect to the "subset degradation problem" in traditional dataset condensation: a subset from a larger condensed dataset is often unrepresentative compared to directly condensing the whole dataset to that smaller size. In this paper, we propose Multisize Dataset Condensation (MDC) by compressing N condensation processes into a single condensation process to obtain datasets with multiple sizes. Specifically, we introduce an "adaptive subset loss" on top of the basic condensation loss to mitigate the "subset degradation problem". Our MDC method offers several benefits: 1) No additional condensation process is required; 2) reduced storage requirement by reusing condensed images. Experiments validate our findings on networks including ConvNet, ResNet and DenseNet, and datasets including SVHN, CIFAR-10, CIFAR-100 and ImageNet. For example, we achieved 5.22%-6.40% average accuracy gains on condensing CIFAR-10 to ten images per class.

Code

Key files:

Key functions (reg_ipcx.py):

| Paper Function | Function Name |
| --- | --- |
| Feature Distance Calculation | def feat_loss_for_ipc_reg(): |
| Feature Distance Comparison | def select_reg_ipc(): |
| MLS Freezing Judgement | def get_freeze_ipc(): |
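
To make the three steps in the table more concrete, here is a minimal sketch of how a calculate / compare / freeze loop could look. It is not the code in reg_ipcx.py: the function names `mean_feature_distance`, `pick_subset_size`, and `num_images_to_freeze` are hypothetical, and the specific choices (squared L2 distance between mean feature vectors, selection by largest relative reduction, freezing when the selected size grows) are assumptions made only for illustration.

```python
from typing import Dict, List, Sequence


def mean_feature_distance(subset_feats: Sequence[Sequence[float]],
                          real_feats: Sequence[Sequence[float]]) -> float:
    """Squared L2 distance between the mean feature vectors of two sets (assumed metric)."""
    dim = len(real_feats[0])
    mean_s = [sum(f[d] for f in subset_feats) / len(subset_feats) for d in range(dim)]
    mean_r = [sum(f[d] for f in real_feats) / len(real_feats) for d in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mean_s, mean_r))


def pick_subset_size(prev_dist: Dict[int, float], curr_dist: Dict[int, float]) -> int:
    """Return the subset size whose distance shrank the most, relative to its previous value."""
    rates = {
        size: (prev_dist[size] - curr_dist[size]) / prev_dist[size]
        for size in curr_dist
        if prev_dist.get(size, 0.0) > 0.0
    }
    if not rates:
        raise ValueError("no subset size has a positive previous distance")
    return max(rates, key=rates.get)


def num_images_to_freeze(prev_size: int, curr_size: int) -> int:
    """Assumed rule: if the selected size grew, freeze the first prev_size images."""
    return prev_size if curr_size > prev_size else 0


# Toy example: distances for subset sizes 1 and 2 at two consecutive checks.
selected = pick_subset_size({1: 10.0, 2: 8.0}, {1: 9.0, 2: 4.0})
frozen = num_images_to_freeze(prev_size=1, curr_size=selected)
```

In this toy example size 2 has the larger relative reduction, so it is selected and the single image of the previously selected size would be frozen under the assumed rule.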

Basic Usage

Installation

Download repo:

git clone https://github.com/he-y/Multisize-Dataset-Condensation MDC
cd MDC

Create the PyTorch environment:

conda env create -f environment.yaml
conda activate mdc

Condensing

MDC Condense:

python condense_reg.py --reproduce -d [DATASET] -f [FACTOR] --ipc [IPC] --adaptive_reg True

# Example on CIFAR-10, IPC10
python condense_reg.py --reproduce -d cifar10 -f 2 --ipc 10 --adaptive_reg True

Condensation can also be run in parallel across classes. (Appendix B.6 of the paper shows that accuracy remains stable under this parallel setup.)

python condense_reg_mp.py  --reproduce -d [DATASET] -f [FACTOR] --ipc [IPC] --adaptive_reg True --nclass_sub [NUM_SUB_CLASS] --phase [PHASE_ID]

# Example on CIFAR-10, IPC10: two jobs separately condense classes 1-5 and 6-10
python condense_reg_mp.py --reproduce -d cifar10 -f 2 --ipc 10 --adaptive_reg True --nclass_sub 5 --phase 0 &
python condense_reg_mp.py --reproduce -d cifar10 -f 2 --ipc 10 --adaptive_reg True --nclass_sub 5 --phase 1 &
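
As a rough illustration of how `--nclass_sub` and `--phase` could divide the classes among jobs, the sketch below maps a phase ID to a block of consecutive class indices. The helper `classes_for_phase` is hypothetical and not taken from condense_reg_mp.py. It assumes zero-based class indices and contiguous blocks, so "class 1-5" in the example above corresponds to indices 0-4.

```python
from typing import List


def classes_for_phase(phase: int, nclass_sub: int, nclass: int) -> List[int]:
    """Return the class indices handled by one job (assumed contiguous split)."""
    if phase < 0 or nclass_sub <= 0:
        raise ValueError("phase must be >= 0 and nclass_sub must be > 0")
    start = phase * nclass_sub
    if start >= nclass:
        raise ValueError("phase is out of range for this number of classes")
    # The last job may get fewer classes when nclass is not a multiple of nclass_sub.
    return list(range(start, min(start + nclass_sub, nclass)))


# CIFAR-10 split into two jobs of five classes each.
first_job = classes_for_phase(phase=0, nclass_sub=5, nclass=10)
second_job = classes_for_phase(phase=1, nclass_sub=5, nclass=10)
```

Under these assumptions, the number of jobs to launch is the class count divided by `nclass_sub`, rounded up.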

Testing

To evaluate a condensed dataset, run:

python test.py --reproduce -d [DATASET] -f [FACTOR] --ipc [IPC] --test_type [CHOICES] --test_data_dir [PATH_TO_CONDENSED_DATA_DIR] --ipcy [IPCY]

# Example of evaluating the performance of IPC5 from CIFAR-10, IPC10 (repeating 3 times).
python test.py --reproduce -d cifar10 -f 2 --ipc 10 --test_type cx_cy --test_data_dir ./path_to_ipc10_data --ipcy 5 --repeat 3

| Test Type | Explanation |
| --- | --- |
| other | (default) evaluate the condensed dataset |
| cx_cy | choose IPC[Y] images from the total IPC images<br>e.g., choose IPC5 from IPC10 |
| baseline_b | concatenate all IPC[1] images to form an IPC[N] dataset |
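
The `cx_cy` and `baseline_b` test types can be pictured with the following sketch. It is only an illustration, not the logic in test.py: the helpers `select_cx_cy` and `concat_baseline_b` are hypothetical, the condensed data is modelled as a plain dict from class label to a list of images, and taking the first Y images per class is an assumption.

```python
from typing import Dict, List, Sequence

Dataset = Dict[int, List[str]]  # class label -> list of (placeholder) images


def select_cx_cy(condensed: Dataset, ipcy: int) -> Dataset:
    """Take the first `ipcy` images of every class from a larger condensed set."""
    subset: Dataset = {}
    for label, images in condensed.items():
        if ipcy > len(images):
            raise ValueError(f"class {label} has only {len(images)} images")
        subset[label] = images[:ipcy]
    return subset


def concat_baseline_b(ipc1_sets: Sequence[Dataset]) -> Dataset:
    """Concatenate N independently condensed IPC1 datasets into one IPC[N] dataset."""
    merged: Dataset = {}
    for dataset in ipc1_sets:
        for label, images in dataset.items():
            merged.setdefault(label, []).extend(images)
    return merged


# Toy data: two classes with three images per class.
ipc3 = {0: ["a0", "a1", "a2"], 1: ["b0", "b1", "b2"]}
ipc2_from_ipc3 = select_cx_cy(ipc3, ipcy=2)

# Two separate IPC1 datasets merged into an IPC2 dataset.
merged_ipc2 = concat_baseline_b([{0: ["x0"], 1: ["y0"]}, {0: ["x1"], 1: ["y1"]}])
```

The difference between the two is where the images come from: `cx_cy` reuses images from a single condensation run, while `baseline_b` needs one IPC1 run for every image per class.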

Table Results (Google Drive)

The condensed data used in our experiments can be downloaded from Google Drive, including:

| No. | Content | Datasets | Methods |
| --- | --- | --- | --- |
| Table 1 | Baseline Comparison | SVHN<br>CIFAR-10<br>CIFAR-100<br>ImageNet-10 | Baseline A<br>Baseline B<br>Baseline C<br>MDC |
| Table 2 | SOTA Comparison | CIFAR-10<br>CIFAR-100 | DC (ICLR'21)<br>DSA (ICML'21)<br>MTT (CVPR'22)<br>IDC (ICML'22)<br>DREAM (ICCV'23)<br>MDC (Ours) |
| Table 3 | Ablation study on three components: Calculate, Compare, Freeze | CIFAR-10 | MDC |
| Table 4 | Cross Architecture Performance | CIFAR-10 | Baseline A<br>Baseline B<br>Baseline C<br>MDC<br>(ResNet, DenseNet) |
| Table 5 | Evaluation Metric Comparison on three metrics:<br>Gradient Distance, Feature Distance, Accuracy Difference | CIFAR-10 | MDC |
| Table 6 | Effects of different condensation runs | CIFAR-10 | MDC |
| Appendix | | | |
| Table 7 | Feature Distance | (Skipping) | |
| Table 8 | MDC on DREAM | CIFAR-10 | IDC (ICML'22)<br>DREAM (ICCV'23) |
| Table 9 | Primary Result with Std. | See Table 1<br>(logs xxx.txt) | |
| Table 10 | Details of condensation run (58.37), e.g., per 100 step performance | CIFAR-10 | MDC |
| Table 11 | Details of condensation run (59.55), e.g., per 100 step performance | CIFAR-10 | MDC |
| Table 12 | Class-wise MDC | CIFAR-10 | MDC |

Related Repos

Our code is mainly built on the following papers and repos:

Citation

@inproceedings{he2024multisize,
  title={Multisize Dataset Condensation},
  author={He, Yang and Xiao, Lingao and Zhou, Joey Tianyi and Tsang, Ivor},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}