Distill Gold From Massive Ores (BiLP)

This is the official implementation (preview version) of the ECCV'24 paper: Distill Gold from Massive Ores: Bi-Level Data Pruning towards Efficient Dataset Distillation (BiLP). This work systematically studies data redundancy in dataset distillation. We propose multiple effective pruning criteria, and we hope our observations, analysis, and empirical results can provide deeper insight into the internal mechanisms of dataset distillation and neural network training.

<p align="center"><img src='method.png' width=400></p>

Updates:

Usage

Our method is neatly packaged as a plug-and-play module that requires only a single line of code in most algorithms. Taking DC/DSA/DM as an example, to use our data pruning plugin:

1. Clone our repository:

```bash
git clone https://github.com/silicx/GoldFromOres.git
cd GoldFromOres
```
2. Download/clone the code of the distillation algorithms into the folder and set up the environment, e.g.:

```bash
git clone https://github.com/VICO-UoE/DatasetCondensation.git
```
3. Add this key line to apply data pruning:

```python
images_all, labels_all, indices_class = drop_samples(
    images_all, labels_all, indices_class,
    args.dataset, args.drop_criterion, drop_ratio=args.drop_ratio)
```

along with these small modifications:

```python
...
from drop_utils import drop_samples
...
parser.add_argument('--drop_criterion', type=str)
parser.add_argument('--drop_ratio', type=float)
...
```
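
For reference, the snippet below is a minimal sketch of what such a per-class pruning step could look like for the `Random` criterion. It is only an illustration under our assumptions about the DC/DSA/DM data structures (`images_all`/`labels_all` tensors plus a per-class `indices_class` list); the actual implementation lives in `drop_utils.drop_samples`, and the simplified helper name `drop_samples_random` here is hypothetical.

```python
import torch

def drop_samples_random(images_all, labels_all, indices_class, drop_ratio=0.0, seed=0):
    """Illustrative random pruning: keep a (1 - drop_ratio) fraction of each class."""
    generator = torch.Generator().manual_seed(seed)
    keep = []
    for cls_indices in indices_class:
        cls_indices = torch.as_tensor(cls_indices)
        # Keep at least one sample per class so downstream sampling never fails.
        n_keep = max(1, int(round(len(cls_indices) * (1.0 - drop_ratio))))
        perm = torch.randperm(len(cls_indices), generator=generator)
        keep.append(cls_indices[perm[:n_keep]])
    keep = torch.cat(keep).to(labels_all.device)

    images_all, labels_all = images_all[keep], labels_all[keep]
    # Rebuild indices_class so it indexes into the pruned tensors.
    indices_class = [[] for _ in range(len(indices_class))]
    for i, lab in enumerate(labels_all.tolist()):
        indices_class[lab].append(i)
    return images_all, labels_all, indices_class
```

Rebuilding `indices_class` after pruning keeps the per-class sampling used by DC/DSA/DM consistent with the smaller tensors.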

Here, `drop_ratio` is a scalar in [0.0, 1.0], and `drop_criterion` can be chosen from:

| `drop_criterion` | Description |
| --- | --- |
| `Random` | Randomly drop samples |
| `LossConverge_large` | Drop samples with large loss value after convergence |
| `LossConverge_small` | Drop samples with small loss value after convergence |
| `LossInit_large` | Drop samples with large loss value in the initial epochs |
| `LossInit_small` | Drop samples with small loss value in the initial epochs |
| `MonteCarlo_large` | Drop samples with large utility, estimated by a Monte-Carlo algorithm |
| `MonteCarlo_small` | Drop samples with small utility, estimated by a Monte-Carlo algorithm |

Our empirical analysis shows that dropping with the criteria `LossConverge_large`, `LossInit_large`, and `MonteCarlo_small` yields better performance.
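
To illustrate how the loss-based criteria can be scored, the sketch below records per-sample losses under a proxy model (trained for a few epochs for `LossInit_*`, or to convergence for `LossConverge_*`) and drops the largest or smallest fraction. The helper names `per_sample_loss` and `keep_indices_by_loss`, the global (rather than per-class) ranking, and the batching details are our own assumptions, not the exact code shipped in `drop_utils`.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def per_sample_loss(model, images_all, labels_all, batch_size=256, device="cpu"):
    """Record the cross-entropy loss of every sample under a given proxy model."""
    model = model.to(device).eval()
    losses = []
    for start in range(0, len(images_all), batch_size):
        x = images_all[start:start + batch_size].to(device)
        y = labels_all[start:start + batch_size].to(device)
        logits = model(x)
        losses.append(F.cross_entropy(logits, y, reduction="none").cpu())
    return torch.cat(losses)

def keep_indices_by_loss(losses, drop_ratio, drop_large=True):
    """Drop the drop_ratio fraction of samples with the largest (or smallest) loss."""
    n_drop = int(round(len(losses) * drop_ratio))
    order = torch.argsort(losses, descending=drop_large)  # samples to drop come first
    return order[n_drop:]
```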

4. Run the command with the specified pruning criterion and ratio, e.g.:

```bash
python -m DatasetCondensation.main --dataset CIFAR10 --model ConvNet --ipc 1 --drop_criterion LossInit_large --drop_ratio 0.99
python -m DatasetCondensation.main_DM --dataset CIFAR10 --model ConvNet --ipc 1 --dsa_strategy color_crop_cutout_flip_scale_rotate --init real --lr_img 1 --drop_criterion LossInit_large --drop_ratio 0.99
```

Note that random pruning on some multi-stage distillation algorithms (e.g., MTT-based ones) needs some adaptation to avoid using different samples in the two stages; one simple way to keep the subset consistent is sketched below.
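
A minimal sketch, assuming the kept indices are computed once, cached to disk, and reloaded by every stage (e.g., by both the expert-training and distillation scripts of an MTT-style pipeline). The helper name `load_or_create_keep_indices` and the caching scheme are our own illustration, not part of this repository.

```python
import os
import torch

def load_or_create_keep_indices(cache_path, num_samples, drop_ratio, seed=0):
    """Reuse one random pruning mask across all stages of a multi-stage method."""
    if os.path.exists(cache_path):
        # A later stage reloads exactly the subset chosen by the first stage.
        return torch.load(cache_path)
    generator = torch.Generator().manual_seed(seed)
    n_keep = int(round(num_samples * (1.0 - drop_ratio)))
    keep = torch.randperm(num_samples, generator=generator)[:n_keep]
    torch.save(keep, cache_path)
    return keep
```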

Reference

If you find our work useful and inspiring, do not hesitate to cite:

@article{xu2023distill,
  title={Distill Gold from Massive Ores: Efficient Dataset Distillation via Critical Samples Selection},
  author={Xu, Yue and Li, Yong-Lu and Cui, Kaitong and Wang, Ziyu and Lu, Cewu and Tai, Yu-Wing and Tang, Chi-Keung},
  journal={arXiv preprint arXiv:2305.18381},
  year={2023}
}

Acknowledgement

As our method is applied across various dataset distillation algorithms, we sincerely thank all our colleagues for their dedicated contributions to the open-source community, including but not limited to:

... as well as the Awesome project.