
Distill Gold From Massive Ores (BiLP)

This is the official implementation (preview version) of the ECCV'24 paper: Distill Gold from Massive Ores: Bi-Level Data Pruning towards Efficient Dataset Distillation (BiLP). This work systematically studies data redundancy in dataset distillation. We propose multiple effective criteria for pruning, and we hope our observations, analysis, and empirical results provide deeper insight into the internal mechanisms of dataset distillation and neural network training.

Updates:

[2024-07] BiLP is accepted to ECCV'24. The updated code and paper will come soon!
[2023-11] Release the preview code and paper.

Usage

Our method is packaged as a plug-and-play module that takes a single line of code to integrate into most algorithms. Taking DC/DSA/DM as an example, the data pruning plugin is used as follows:

Clone our repository.

git clone https://github.com/silicx/GoldFromOres.git
cd GoldFromOres

Download or clone the code of the distillation algorithm into this folder and set up its environment, e.g.

git clone https://github.com/VICO-UoE/DatasetCondensation.git

Add this key line to apply data pruning:

images_all, labels_all, indices_class = drop_samples(
    images_all, labels_all, indices_class, 
    args.dataset, args.drop_criterion, drop_ratio=args.drop_ratio)

along with these small modifications:

...
from drop_utils import drop_samples
...
parser.add_argument('--drop_criterion', type=str)
parser.add_argument('--drop_ratio', type=float)
...
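
As an optional refinement of the two arguments above, the parser can reject an unknown criterion or an out-of-range ratio before any training starts. The following is a minimal sketch, not the repository's code: the `DROP_CRITERIA` list, the `ratio_in_unit_interval` helper, and both default values are assumptions made for illustration, with the criterion names taken from the table in this README.

```python
import argparse

# Criterion names copied from the table in this README.
DROP_CRITERIA = [
    "Random",
    "LossConverge_large", "LossConverge_small",
    "LossInit_large", "LossInit_small",
    "MonteCarlo_large", "MonteCarlo_small",
]


def ratio_in_unit_interval(text):
    """Parse a float and reject values outside [0.0, 1.0] (hypothetical helper)."""
    value = float(text)
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError(
            f"drop_ratio must be in [0.0, 1.0], got {value}")
    return value


def build_parser():
    parser = argparse.ArgumentParser()
    # The defaults below are assumptions for this sketch only.
    parser.add_argument("--drop_criterion", type=str,
                        default="Random", choices=DROP_CRITERIA)
    parser.add_argument("--drop_ratio", type=ratio_in_unit_interval, default=0.0)
    return parser


args = build_parser().parse_args(
    ["--drop_criterion", "LossInit_large", "--drop_ratio", "0.99"])
print(args.drop_criterion, args.drop_ratio)
```

With this variant, a misspelled criterion or a ratio such as 1.5 makes argparse exit with a usage error instead of reaching `drop_samples`.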

Here, drop_ratio is a scalar in [0.0, 1.0]. drop_criterion can be chosen among:

Drop_criterion	Description
Random	Randomly drop samples
LossConverge_large	Drop samples with large loss value after convergence
LossConverge_small	Drop samples with small loss value after convergence
LossInit_large	Drop samples with large loss value in the initial epochs
LossInit_small	Drop samples with small loss value in the initial epochs
MonteCarlo_large	Drop samples with large utility, estimated by Monte-Carlo alg.
MonteCarlo_small	Drop samples with small utility, estimated by Monte-Carlo alg.

Our empirical analysis shows that dropping with the criteria LossConverge_large, LossInit_large, and MonteCarlo_small yields better performance.
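
To make the `_large` and `_small` suffixes concrete, here is a small illustration of score-based pruning. It is a sketch under stated assumptions and not the implementation in `drop_utils`: the function name `keep_indices_by_score`, the toy scores, and the choice to prune each class separately are all hypothetical. The scores stand in for whatever quantity a criterion uses (a loss value or a utility estimate).

```python
import numpy as np


def keep_indices_by_score(scores, labels, drop_ratio, drop_large=True):
    """Return sorted indices of the samples kept after per-class pruning.

    scores: hypothetical per-sample values (e.g. a loss), one per sample.
    labels: integer class label per sample.
    drop_large=True mimics the "*_large" criteria, False the "*_small" ones.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    kept = []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        n_drop = int(round(drop_ratio * len(cls_idx)))
        # Order this class's samples by ascending score.
        order = cls_idx[np.argsort(scores[cls_idx], kind="stable")]
        if drop_large:
            kept.append(order[: len(order) - n_drop])  # discard the largest
        else:
            kept.append(order[n_drop:])                # discard the smallest
    return np.sort(np.concatenate(kept))


# Toy data: eight samples, two classes, made-up scores.
scores = [0.1, 0.9, 0.5, 0.3, 0.2, 0.8, 0.7, 0.4]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(keep_indices_by_score(scores, labels, drop_ratio=0.5, drop_large=True))
```

On this toy input, dropping the large-score half keeps the two lowest-scoring samples of each class; setting `drop_large=False` keeps the two highest-scoring ones instead.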

Run the command with the desired pruning criterion and ratio, e.g.

python -m DatasetCondensation.main --dataset CIFAR10 --model ConvNet --ipc 1 --drop_criterion LossInit_large --drop_ratio 0.99
python -m DatasetCondensation.main_DM --dataset CIFAR10 --model ConvNet --ipc 1 --dsa_strategy color_crop_cutout_flip_scale_rotate --init real --lr_img 1 --drop_criterion LossInit_large --drop_ratio 0.99

Note that random pruning on some multi-stage distillation algorithms (e.g. MTT-based ones) needs some adaptation so that the two stages do not use different samples.
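
One possible adaptation, sketched below, is to derive the random subset from a fixed seed so that every stage recomputes the same kept indices. This is only an assumption about how the issue could be handled, not the approach taken in this repository; the helper `random_keep_indices` and the seed value are hypothetical.

```python
import numpy as np


def random_keep_indices(num_samples, drop_ratio, seed=0):
    """Return a reproducible, sorted subset of indices to keep (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_keep = num_samples - int(round(drop_ratio * num_samples))
    return np.sort(rng.permutation(num_samples)[:n_keep])


# Both stages call the helper with the same seed and therefore see the same subset.
stage1_keep = random_keep_indices(1000, drop_ratio=0.9, seed=42)
stage2_keep = random_keep_indices(1000, drop_ratio=0.9, seed=42)
assert np.array_equal(stage1_keep, stage2_keep)
print(len(stage1_keep))
```

An alternative with the same effect would be to save the kept indices to disk in the first stage and load them in the second.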

Reference

If you find our work useful and inspiring, do not hesitate to cite:

@article{xu2023distill,
  title={Distill Gold from Massive Ores: Efficient Dataset Distillation via Critical Samples Selection},
  author={Xu, Yue and Li, Yong-Lu and Cui, Kaitong and Wang, Ziyu and Lu, Cewu and Tai, Yu-Wing and Tang, Chi-Keung},
  journal={arXiv preprint arXiv:2305.18381},
  year={2023}
}

Acknowledgement

As our method is applied across various dataset distillation algorithms, we sincerely thank all the colleagues for their dedicated contributions to the open-source community, including but not limited to:

DC/DSA/DM: Dataset Condensation with Gradient Matching, Dataset Condensation with Differentiable Siamese Augmentation, Dataset Condensation with Distribution Matching, Github
MTT: Dataset Distillation by Matching Training Trajectories, Github
CAFE: Learning to Condense Dataset by Aligning Features, Github
FRePo: Dataset Distillation using Neural Feature Regression, Github
RFAD: Efficient Dataset Distillation using Random Feature Approximation, Github
IDC: Dataset Condensation via Efficient Synthetic-Data Parameterization, Github
IDM: Improved Distribution Matching for Dataset Condensation, Github
LinBa: Remember the Past: Distilling Datasets into Addressable Memories for Neural Networks, Github
HaBa: Dataset Distillation via Factorization, Github

... as well as the Awesome project.