You Only Condense Once (YOCO)

[Paper] [BibTeX]

<img src="assets/teaser.png" alt="YOCO teaser" width="600" height="400">

On top of one condensed dataset, YOCO produces smaller condensed datasets with two embarrassingly simple dataset pruning rules, Low LBPE Score and Balanced Construction. YOCO offers two key advantages: 1) it can flexibly resize the dataset to fit varying computational constraints, and 2) it eliminates the need for extra condensation processes, which can be computationally prohibitive.
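To make the two rules concrete, here is a minimal sketch (not the repo's actual implementation; `scores` and `labels` are assumed to be NumPy arrays, with a lower LBPE score meaning the image is kept first):

```python
import numpy as np

def yoco_prune(scores, labels, target_size):
    """Shrink a condensed dataset to `target_size` images.

    Rule 1 (Low LBPE Score): within each class, keep the images
    with the lowest LBPE scores.
    Rule 2 (Balanced Construction): keep the same number of images
    from every class.
    """
    classes = np.unique(labels)
    per_class = target_size // len(classes)   # equal share per class
    keep = []
    for c in classes:
        idx = np.where(labels == c)[0]
        idx = idx[np.argsort(scores[idx])]    # lowest score first
        keep.extend(idx[:per_class])
    return np.array(keep)

# e.g., resize an IPC-10 CIFAR-10 condensed set (100 images) down to IPC 1:
# keep_idx = yoco_prune(scores, labels, target_size=10)
```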

Getting Started

First, clone our repo:

git clone https://github.com/he-y/you-only-condense-once.git
cd you-only-condense-once

Second, create the conda environment. The code has been tested with PyTorch 1.11.0 and Python 3.9.15.

# create conda environment
conda create -n yoco python=3.9
conda activate yoco

Third, install the required dependencies:

pip install -r requirements.txt

Our code is mainly based on two repositories.

Main Files of the Repo

Module 1: Condensed Dataset Preparation (Google Drive File)

The condensed datasets used in our experiments can be downloaded from Google Drive. The downloaded datasets should follow the file structure below:

YOCO
- raid
  - condensed_img
    - dream
    - idc
    - ...

`condense_key` in the table below specifies which condensation method produced the condensed datasets to be evaluated. Our experimental results are mainly reported on IDC, so the default setting is `condense_key = idc`.

| `condense_key` | Description |
| --- | --- |
| `idc` | Dataset Condensation via Efficient Synthetic-Data Parameterization (IDC) |
| `dream` | Efficient Dataset Distillation by Representative Matching (DREAM) |
| `mtt` | Dataset Distillation by Matching Training Trajectories (MTT) |
| `dsa` | Dataset Condensation with Differentiable Siamese Augmentation (DSA) |
| `kip` | Dataset Distillation with Infinitely Wide Convolutional Networks (KIP) |

If you want to run the condensation yourself:

python condense.py --reproduce_condense -d [dataset] -f [factor] --ipc [images per class]
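For example (all argument values here are only illustrative):

python condense.py --reproduce_condense -d cifar10 -f 2 --ipc 10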
Condensed datasets from other methods can be obtained from the following repos:

- [IDC (Dataset Condensation via Efficient Synthetic-Data Parameterization)](https://github.com/snu-mllab/Efficient-Dataset-Condensation)
- [DREAM (Efficient Dataset Distillation by Representative Matching)](https://github.com/lyq312318224/DREAM)
- [MTT (Dataset Distillation by Matching Training Trajectories)](https://github.com/GeorgeCazenavette/mtt-distillation)
- [DSA (Dataset Condensation with Differentiable Siamese Augmentation)](https://github.com/VICO-UoE/DatasetCondensation)
- [KIP (Dataset Distillation with Infinitely Wide Convolutional Networks)](https://github.com/google-research/google-research/tree/master/kip)

Module 2: Pruning the Condensed Datasets via Three Steps (Google Drive File)

Step 1: Generate the training dynamics from the condensed dataset (or you can directly download our generated training dynamics here):

python get_training_dynamics.py --dataset [dataset] --ipc [IPCF] --condense_key [condensation method]
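Roughly speaking, the "training dynamics" are per-epoch records of how the model treats each condensed image while training on it. A minimal sketch of the idea (illustrative only; the script above stores richer information, and the loader is assumed to also yield each image's index):

```python
def record_dynamics(model, loader, optimizer, criterion, epochs):
    """Train on the condensed set and log, per epoch, whether each
    image is classified correctly -- one simple training dynamic."""
    history = []  # history[epoch][i] == 1 if image i was correct
    for _ in range(epochs):
        epoch_correct = {}
        for x, y, idx in loader:  # loader assumed to yield indices too
            out = model(x)
            loss = criterion(out, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            pred = out.argmax(dim=1)
            for i, ok in zip(idx.tolist(), (pred == y).tolist()):
                epoch_correct[i] = int(ok)
        history.append(epoch_correct)
    return history
```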

Step 2: Generate the score file for each image according to the training dynamics:

python generate_importance_score.py --dataset [dataset] --ipc [IPCF] --condense_key [condensation method]
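As one concrete example of turning dynamics into scores: the classic forgetting score counts how often an image flips from correctly to incorrectly classified between consecutive epochs. A sketch, assuming the `history` format from the snippet above:

```python
def forgetting_score(history, num_images):
    """Count correct -> incorrect transitions per image across epochs."""
    scores = [0] * num_images
    for prev, curr in zip(history, history[1:]):
        for i in range(num_images):
            if prev.get(i, 0) == 1 and curr.get(i, 0) == 0:
                scores[i] += 1  # image i was "forgotten" this epoch
    return scores
```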

Step 3: Evaluate the performance using different dataset pruning metrics:

python test.py -d [dataset] --ipc [IPCF] --slct_ipc [IPCT] --pruning_key [pruning method] --condense_key [condensation method]
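For example, to prune an IDC-condensed CIFAR-10 dataset from IPC 10 down to IPC 1 with YOCO (the dataset and IPC values here are only illustrative):

python test.py -d cifar10 --ipc 10 --slct_ipc 1 --pruning_key yoco --condense_key idc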

`pruning_key` selects the dataset pruning method; the options include:

| `pruning_key` | Description | Prefers hard/easy? | Balanced? |
| --- | --- | --- | --- |
| `random` | Random Selection | N/A | no |
| `ssp` | Self-Supervised Prototype | hard | no |
| `entropy` | Entropy | hard | no |
| `accumulated_margin` | Area Under the Margin | hard | no |
| `forgetting` | Forgetting score | hard | no |
| `el2n` | EL2N score | hard | no |
| `ccs` | Coverage-centric Coreset Selection | easy | no |
| `yoco` | Our method | easy | yes |

To alter the components of each metric, append the following suffixes to `pruning_key`:

| Suffix | Explanation |
| --- | --- |
| `_easy` / `_hard` | Whether to select easy / hard samples |
| `_balance` / `_imbalance` | Whether to enforce a balanced / imbalanced class distribution |

For example, the default `forgetting` metric is equivalent to `forgetting_hard_imbalance`: it prefers hard samples and does not balance the class distribution.
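For instance, to instead evaluate forgetting with easy samples and a balanced class distribution (dataset arguments again illustrative):

python test.py -d cifar10 --ipc 10 --slct_ipc 1 --pruning_key forgetting_easy_balance --condense_key idc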

Reproducing the Tables

For ease of reproducing the experimental results, we provide a bash script for each table. The scripts can be found in scripts/table[x].sh. The training dynamics and scores used in our experiments can be downloaded from Google Drive. Note: the training dynamics contain large files (e.g., idc/cifar100 is ~6GB).
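For example, assuming the tables are numbered as in the paper:

bash scripts/table1.sh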

The downloaded files should follow the file structure below:

YOCO
- raid
  - reproduce_*
      - dynamics_and_scores
        - idc
        - dream
        - ...
  - condensed_img (download from Module 1)
    - idc
    - dream
    - ...

Citation

@inproceedings{
    heyoco2023,
    title={You Only Condense Once: Two Rules for Pruning Condensed Datasets},
    author={Yang He and Lingao Xiao and Joey Tianyi Zhou},
    booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
    year={2023},
    url={https://openreview.net/forum?id=AlTyimRsLf}
}