Multisize Dataset Condensation
Official PyTorch implementation of "Multisize Dataset Condensation", published at ICLR'24 (Oral)
<img src="images/teaser.png" alt="Alt text" style="width: 640px; height: 320px; margin-right: 10px;">
Abstract
While dataset condensation effectively enhances training efficiency, its application in on-device scenarios brings unique challenges. 1) Due to the fluctuating computational resources of these devices, there's a demand for a flexible dataset size that diverges from a predefined size. 2) The limited computational power on devices often prevents additional condensation operations. These two challenges connect to the "subset degradation problem" in traditional dataset condensation: a subset from a larger condensed dataset is often unrepresentative compared to directly condensing the whole dataset to that smaller size. In this paper, we propose Multisize Dataset Condensation (MDC) by compressing N condensation processes into a single condensation process to obtain datasets with multiple sizes. Specifically, we introduce an "adaptive subset loss" on top of the basic condensation loss to mitigate the "subset degradation problem". Our MDC method offers several benefits: 1) No additional condensation process is required; 2) reduced storage requirement by reusing condensed images. Experiments validate our findings on networks including ConvNet, ResNet and DenseNet, and datasets including SVHN, CIFAR-10, CIFAR-100 and ImageNet. For example, we achieved 5.22%-6.40% average accuracy gains on condensing CIFAR-10 to ten images per class.
Code
Key files:
condense_reg.py: main file of the condensation process.
reg_ipcx.py: helper class (class Regularizer) and functions to maintain and update the most learnable subset (MLS).
Key functions (reg_ipcx.py):
Paper Function | Function Name |
---|---|
Feature Distance Calculation | def feat_loss_for_ipc_reg(): |
Feature Distance Comparison | def select_reg_ipc(): |
MLS Freezing Judgement | def get_freeze_ipc(): |
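To make the three steps in the table concrete, here is a toy, dependency-free sketch of how a calculate / compare / freeze loop could fit together. It is not the code in reg_ipcx.py: the function names, the mean-feature squared-L2 distance, the "largest relative reduction" selection rule, and the "freeze the previous MLS when the MLS grows" rule are all simplifying assumptions made for illustration.

```python
from typing import List, Optional, Sequence

Vector = Sequence[float]


def mean_vector(feats: Sequence[Vector]) -> List[float]:
    """Element-wise mean of a non-empty list of feature vectors."""
    n, dim = len(feats), len(feats[0])
    return [sum(f[d] for f in feats) / n for d in range(dim)]


def feature_distance(subset_feats: Sequence[Vector], real_feats: Sequence[Vector]) -> float:
    """Assumed metric: squared L2 distance between mean features."""
    a, b = mean_vector(subset_feats), mean_vector(real_feats)
    return sum((x - y) ** 2 for x, y in zip(a, b))


def subset_distances(syn_feats: Sequence[Vector], real_feats: Sequence[Vector]) -> List[float]:
    """Step 1 (calculate): distance for each nested subset of size 1..N-1,
    where the subset of size k is assumed to be the first k synthetic images."""
    return [feature_distance(syn_feats[:k], real_feats) for k in range(1, len(syn_feats))]


def select_most_learnable(prev: Sequence[float], curr: Sequence[float]) -> int:
    """Step 2 (compare): return the 1-based subset size whose distance
    dropped the most, relative to its previous value (assumed rule)."""
    rates = [(p - c) / p if p > 0 else 0.0 for p, c in zip(prev, curr)]
    return max(range(len(rates)), key=rates.__getitem__) + 1


def frozen_count(prev_mls: Optional[int], curr_mls: int) -> int:
    """Step 3 (freeze): assumed rule, freeze the images of the previous MLS
    only when the new MLS is larger; otherwise freeze nothing."""
    if prev_mls is not None and curr_mls > prev_mls:
        return prev_mls
    return 0
```

In this sketch the selected subset size would decide which images receive the extra subset loss, and `frozen_count` would decide how many leading images stop being updated. The actual signatures and criteria live in `feat_loss_for_ipc_reg`, `select_reg_ipc`, and `get_freeze_ipc`.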
Basic Usage
Installation
Download repo:
git clone https://github.com/he-y/Multisize-Dataset-Condensation MDC
cd MDC
Create pytorch environment:
conda env create -f environment.yaml
conda activate mdc
Condensing
MDC Condense:
python condense_reg.py --reproduce -d [DATASET] -f [FACTOR] --ipc [IPC] --adaptive_reg True
# Example on CIFAR-10, IPC10
python condense_reg.py --reproduce -d cifar10 -f 2 --ipc 10 --adaptive_reg True
Parallel condensation over different classes is also implemented. (Appendix B.6 shows that accuracy remains stable under this parallel setup.)
python condense_reg_mp.py --reproduce -d [DATASET] -f [FACTOR] --ipc [IPC] --adaptive_reg True --nclass_sub [NUM_SUB_CLASS] --phase [PHASE_ID]
# Example on CIFAR-10, IPC10: two jobs separately condense classes 1-5 and 6-10
python condense_reg_mp.py --reproduce -d cifar10 -f 2 --ipc 10 --adaptive_reg True --nclass_sub 5 --phase 0 &
python condense_reg_mp.py --reproduce -d cifar10 -f 2 --ipc 10 --adaptive_reg True --nclass_sub 5 --phase 1 &
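When the number of class groups grows, typing one command per phase becomes tedious. The helper below is a small sketch that generates the per-phase commands shown above. It assumes that phases are 0-indexed and that the number of phases equals the class count divided by `--nclass_sub`, rounded up, which matches the CIFAR-10 example but is not confirmed for other settings. `build_phase_commands` and `launch` are hypothetical names, not part of the repo.

```python
import math
import subprocess
from typing import List


def build_phase_commands(dataset: str, factor: int, ipc: int,
                         nclass: int, nclass_sub: int) -> List[List[str]]:
    """Build one condense_reg_mp.py command per class group.

    Assumption: phase IDs run from 0 to ceil(nclass / nclass_sub) - 1.
    """
    n_phases = math.ceil(nclass / nclass_sub)
    commands = []
    for phase in range(n_phases):
        commands.append([
            "python", "condense_reg_mp.py", "--reproduce",
            "-d", dataset, "-f", str(factor), "--ipc", str(ipc),
            "--adaptive_reg", "True",
            "--nclass_sub", str(nclass_sub), "--phase", str(phase),
        ])
    return commands


def launch(commands: List[List[str]]) -> List[int]:
    """Start all jobs concurrently and return their exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]
```

For the example above, `launch(build_phase_commands("cifar10", 2, 10, nclass=10, nclass_sub=5))` would start the same two jobs as the shell commands, run from the repo root.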
Testing
To evaluate a condensed dataset, run:
python test.py --reproduce -d [DATASET] -f [FACTOR] --ipc [IPC] --test_type [CHOICES] --test_data_dir [PATH_TO_CONDENSED_DATA_DIR] --ipcy [IPCY]
# Example of evaluating the performance of IPC5 from CIFAR-10, IPC10 (repeating 3 times).
python test.py --reproduce -d cifar10 -f 2 --ipc 10 --test_type cx_cy --test_data_dir ./path_to_ipc10_data --ipcy 5 --repeat 3
Test Types | Explanation |
---|---|
other | (default) evaluate the condensed dataset |
cx_cy | choose IPC[Y] images from the total IPC images<br>e.g., choose IPC5 from IPC10 |
baseline_b | concatenate all IPC[1] images to form an IPC[N] dataset |
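The two non-default test types can be pictured as simple operations on per-class image lists. The sketch below is illustrative only and does not reproduce test.py: it assumes the condensed data has already been loaded into a `{label: [images]}` mapping, that `cx_cy` keeps the first Y images of each class, and that `baseline_b` joins independently condensed IPC1 sets in order. Both function names are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence

ImagesByClass = Dict[Hashable, List[object]]


def select_cx_cy(images_by_class: ImagesByClass, ipcy: int) -> ImagesByClass:
    """cx_cy: keep ipcy images per class out of a larger condensed set.

    Assumption: the subset is the first ipcy images of each class.
    """
    subset = {}
    for label, images in images_by_class.items():
        if ipcy > len(images):
            raise ValueError(f"class {label!r} has only {len(images)} images")
        subset[label] = list(images[:ipcy])
    return subset


def concat_baseline_b(ipc1_sets: Sequence[ImagesByClass]) -> ImagesByClass:
    """baseline_b: concatenate N separately condensed IPC1 sets
    into a single IPC-N set, class by class."""
    merged: ImagesByClass = {}
    for ipc1 in ipc1_sets:
        for label, images in ipc1.items():
            merged.setdefault(label, []).extend(images)
    return merged
```

The contrast between the two is the point of the comparison: `cx_cy` reuses images from one condensation run, while `baseline_b` needs N separate IPC1 runs to build an IPC[N] dataset.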
Table Results (Google Drive)
The condensed data used in our experiments can be downloaded from Google Drive, including:
No. | Content | Datasets | Methods |
---|---|---|---|
Table 1 | Baseline Comparison | SVHN<br>CIFAR-10<br>CIFAR-100<br>ImageNet-10 | Baseline A<br>Baseline B<br>Baseline C<br>MDC |
Table 2 | SOTA Comparison | CIFAR-10<br>CIFAR-100 | DC (ICLR'21)<br>DSA (ICML'21)<br>MTT (CVPR'22)<br>IDC (ICML'22)<br>DREAM (ICCV'23)<br>MDC (Ours) |
Table 3 | Ablation study on three components: Calculate, Compare, Freeze | CIFAR-10 | MDC |
Table 4 | Cross Architecture Performance | CIFAR-10 | Baseline A<br>Baseline B<br>Baseline C<br>MDC<br>(ResNet, DenseNet) |
Table 5 | Evaluation Metric Comparison on three metrics: <br>Gradient Distance, Feature Distance, Accuracy Difference | CIFAR-10 | MDC |
Table 6 | Effects of different condensation runs | CIFAR-10 | MDC |
Appendix | | | |
Table 7 | Feature Distance | (Skipping) | |
Table 8 | MDC on DREAM | CIFAR-10 | IDC (ICML'22)<br>DREAM (ICCV'23) |
Table 9 | Primary Result with Std. | See Table 1<br>(logs xxx.txt ) | |
Table 10 | Details of condensation run (58.37), e.g., per 100 step performance | CIFAR-10 | MDC |
Table 11 | Details of condensation run (59.55), e.g., per 100 step performance | CIFAR-10 | MDC |
Table 12 | Class-wise MDC | CIFAR-10 | MDC |
Related Repos
Our code is mainly built on the following papers and repos:
- Dataset Condensation via Efficient Synthetic-Data Parameterization: [Paper], [Code]
- DREAM: Efficient Dataset Distillation by Representative Matching: [Paper], [Code]
- Dataset Condensation with Gradient Matching: [Paper], [Code]
Citation
@inproceedings{he2024multisize,
title={Multisize Dataset Condensation},
author={He, Yang and Xiao, Lingao and Zhou, Joey Tianyi and Tsang, Ivor},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024}
}