# Soft Label Pruning for Large-scale Dataset Distillation (<ins>LPLD</ins>)
[Paper | BibTex | Google Drive]
Official Implementation for "Are Large-scale Soft Labels Necessary for Large-scale Dataset Distillation?", published at NeurIPS'24.
<div align=left> <img style="width:100%" src="https://github.com/ArmandXiao/Public-Large-Files/blob/0ae81e632661c8507c8d377c0a14080439a1b25e/NeurIPS24_LPLD_animation.gif"> </div>

> Images from left to right are from IPC20 LPLD datasets: cock (left), bald eagle, volcano, trailer truck (right).

Abstract: In ImageNet-condensation, the storage for auxiliary soft labels exceeds that of the condensed dataset by over 30 times. However, are large-scale soft labels necessary for large-scale dataset distillation? In this paper, we first discover that the high within-class similarity in condensed datasets necessitates the use of large-scale soft labels. This high within-class similarity can be attributed to the fact that previous methods use samples from different classes to construct a single batch for batch normalization (BN) matching. To reduce the within-class similarity, we introduce class-wise supervision during the image synthesizing process by batching the samples within classes, instead of across classes. As a result, we can increase within-class diversity and reduce the size of required soft labels. A key benefit of improved image diversity is that soft label compression can be achieved through simple random pruning, eliminating the need for complex rule-based strategies. Experiments validate our discoveries. For example, when condensing ImageNet-1K to 200 images per class, our approach compresses the required soft labels from 113 GB to 2.8 GB (40x compression) with a 2.6% performance gain.
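For intuition, the "simple random pruning" of stored soft labels mentioned above can be sketched in a few lines. This is an illustration only, not the repo's implementation; the per-epoch file layout, function name, and arguments are hypothetical:

```python
import random

def randomly_prune_soft_labels(label_files, keep_ratio=1 / 40, seed=0):
    """Illustrative sketch: keep a random subset of saved soft-label files.

    `label_files` is assumed to be a list of paths to FKD-style per-epoch
    soft-label / mix-configuration files; with keep_ratio=1/40 this mimics
    the 40x label compression reported below.
    """
    rng = random.Random(seed)
    n_keep = max(1, int(len(label_files) * keep_ratio))
    return sorted(rng.sample(label_files, n_keep))
```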
## Installation
Download the repo:

```sh
git clone https://github.com/he-y/soft-label-pruning-for-dataset-distillation.git LPLD
cd LPLD
```

Create the PyTorch environment:

```sh
conda env create -f environment.yml
conda activate lpld
```
### Download all datasets and labels
#### Method 1: Automatic Downloading
```sh
# sh download.sh [true|false]
sh download.sh false
```

The `true|false` argument controls whether to download only the 40x compressed labels (`true`) or all labels (`false`); the default is `false`, i.e., download all labels.
#### Method 2: Manual Downloading
Download manually from Google Drive, and place the downloaded files in the following structure:

```
.
├── README.md
├── recover
│   └── model_with_class_bn
│       └── [put Models-with-Class-BN here]
│   └── validate_result
│       └── [put Distilled-Dataset here]
├── relabel_and_validate
│   └── syn_label_LPLD
│       └── [put Labels here]
```
You will find the following after downloading:

#### Model with Class BN
Dataset | Model with Class BN | Size |
---|---|---|
ImageNet-1K | ResNet18 | 50.41 MB |
Tiny-ImageNet | ResNet18 | 81.30 MB |
ImageNet-21K | ResNet18 | 445.87 MB |
#### Distilled Image Dataset
Dataset | Setting | Dataset Size |
---|---|---|
ImageNet-1K | IPC10<br>IPC20<br>IPC50<br>IPC100<br>IPC200 | 0.15 GB<br>0.30 GB<br>0.75 GB<br>1.49 GB<br>2.98 GB |
Tiny-ImageNet | IPC50<br>IPC100 | 21 MB<br>40 MB |
ImageNet-21K | IPC10<br>IPC20 | 3 GB<br>5 GB |
#### Previous Soft Labels vs Ours
Dataset | Setting | Previous<br>Label Size | Previous<br>Model Acc. | Ours<br>Label Size | Ours<br>Model Acc. |
---|---|---|---|---|---|
ImageNet-1K | IPC10<br>IPC20<br>IPC50<br>IPC100<br>IPC200 | 5.67 GB<br>11.33 GB<br>28.33 GB<br>56.66 GB<br>113.33 GB | 20.1%<br>33.6%<br>46.8%<br>52.8%<br>57.0% | 0.14 GB (40x)<br>0.29 GB (40x)<br>0.71 GB (40x)<br>1.43 GB (40x)<br>2.85 GB (40x) | 20.2%<br>33.0%<br>46.7%<br>54.0%<br>59.6% |
Tiny-ImageNet | IPC50<br>IPC100 | 449 MB<br>898 MB | 41.1%<br>49.7% | 11 MB (40x)<br>22 MB (40x) | 38.4%<br>46.1% |
ImageNet-21K | IPC10<br>IPC20 | 643 GB<br>1286 GB | 18.5%<br>20.5% | 16 GB (40x)<br>32 GB (40x) | 21.3%<br>29.4% |
- Full labels for ImageNet-21K are too large to upload; nevertheless, we provide the 40x pruned labels.
- Labels for other compression ratios are provided in Google Drive; alternatively, refer to README: Usage to generate the labels.
## Necessary Modification for PyTorch

Modify the PyTorch source code `torch.utils.data._utils.fetch._MapDatasetFetcher` to support multi-processing loading of soft-label data and mix configurations:
```python
class _MapDatasetFetcher(_BaseDatasetFetcher):
    def fetch(self, possibly_batched_index):
        if hasattr(self.dataset, "mode") and self.dataset.mode == 'fkd_load':
            if hasattr(self.dataset, "G_VBSM") and self.dataset.G_VBSM:
                pass  # G_VBSM: uses self-decoding in the training script
            elif hasattr(self.dataset, "use_batch") and self.dataset.use_batch:
                mix_index, mix_lam, mix_bbox, soft_label = self.dataset.load_batch_config_by_batch_idx(possibly_batched_index[0])
            else:
                mix_index, mix_lam, mix_bbox, soft_label = self.dataset.load_batch_config(possibly_batched_index[0])

        if self.auto_collation:
            if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
                data = self.dataset.__getitems__(possibly_batched_index)
            else:
                data = [self.dataset[idx] for idx in possibly_batched_index]
        else:
            data = self.dataset[possibly_batched_index]

        if hasattr(self.dataset, "mode") and self.dataset.mode == 'fkd_load':
            # NOTE: mix_index, mix_lam, mix_bbox can be None
            mix_index_cpu = mix_index.cpu() if mix_index is not None else None
            return self.collate_fn(data), mix_index_cpu, mix_lam, mix_bbox, soft_label.cpu()
        else:
            return self.collate_fn(data)
```
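To locate the file that needs this change, you can print the installed module's path (a quick check; assumes a standard pip/conda PyTorch installation):

```python
# Prints the path of torch/utils/data/_utils/fetch.py,
# the file that defines _MapDatasetFetcher.
import torch.utils.data._utils.fetch as fetch_module

print(fetch_module.__file__)
```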
## Reproduce Results for 40x Compression Ratio

To reproduce the main table results for the 40x compression ratio, run the following:

```sh
cd relabel_and_validate
bash scripts/reproduce/main_table_in1k.sh
bash scripts/reproduce/main_table_tiny.sh
bash scripts/reproduce/main_table_in21k.sh
```
NOTE: the validation directory (`val_dir`) in the config files (`relabel_and_validate/cfg/reproduce/CONFIG_FILE`) should be changed to the correct path on your device.
## Reproduce Results for Other Compression Ratios

Please refer to README: Usage for details, including the three modules.
## Table Results (Google Drive)
No. | Content | Datasets |
---|---|---|
Table 1 | Dataset Analysis | ImageNet-1K |
Table 2 | (a) SOTA Comparison<br>(b) Large Networks | Tiny ImageNet |
Table 3 | SOTA Comparison | ImageNet-1K |
Table 4 | Ablation Study | ImageNet-1K |
Table 5 | (a) Pruning Metrics<br>(b) Calibration | ImageNet-1K |
Table 6 | (a) Large Pruning Ratio<br>(b) ResNet-50 Result<br>(c) Cross Architecture Result | ImageNet-1K |
Table 7 | SOTA Comparison | ImageNet-21K |
Table 8 | Adaptation to Optimization-free Method (i.e., RDED) | ImageNet-1K |
Table 9 | Comparison to G-VBSM | ImageNet-1K |
Appendix | | |
Table 10-18 | Configurations | - |
Table 19 | Detailed Ablation | ImageNet-1K |
Table 20 | Large IPCs (i.e., IPC300 and IPC400) | ImageNet-1K |
Table 23 | Comparison to FKD | ImageNet-1K |
## Related Repos
Our code is mainly related to the following papers and repos:
- Squeeze, Recover and Relabel: Dataset Condensation at ImageNet Scale From A New Perspective
- ImageNet-21K Pretraining for the Masses
## Citation
```bibtex
@inproceedings{xiao2024lpld,
  title={Are Large-scale Soft Labels Necessary for Large-scale Dataset Distillation?},
  author={Lingao Xiao and Yang He},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}
```