Soft Label Pruning for Large-scale Dataset Distillation (<ins>LPLD</ins>)

[Paper | BibTeX | Google Drive]


Official Implementation for "Are Large-scale Soft Labels Necessary for Large-scale Dataset Distillation?", published at NeurIPS'24.

Lingao Xiao, Yang He

Abstract: In ImageNet-condensation, the storage for auxiliary soft labels exceeds that of the condensed dataset by over 30 times. However, are large-scale soft labels necessary for large-scale dataset distillation? In this paper, we first discover that the high within-class similarity in condensed datasets necessitates the use of large-scale soft labels. This high within-class similarity can be attributed to the fact that previous methods use samples from different classes to construct a single batch for batch normalization (BN) matching. To reduce the within-class similarity, we introduce class-wise supervision during the image synthesizing process by batching the samples within classes, instead of across classes. As a result, we can increase within-class diversity and reduce the size of required soft labels. A key benefit of improved image diversity is that soft label compression can be achieved through simple random pruning, eliminating the need for complex rule-based strategies. Experiments validate our discoveries. For example, when condensing ImageNet-1K to 200 images per class, our approach compresses the required soft labels from 113 GB to 2.8 GB (40x compression) with a 2.6% performance gain.

<div align=left> <img style="width:100%" src="https://github.com/ArmandXiao/Public-Large-Files/blob/0ae81e632661c8507c8d377c0a14080439a1b25e/NeurIPS24_LPLD_animation.gif"> </div>

> Images from left to right are from IPC20 LPLD datasets: cock (left), bald eagle, volcano, trailer truck (right).
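
As a rough illustration of the batching difference described in the abstract, the sketch below (not the repository's code; `images` and `labels` are assumed to be tensors holding the synthetic set) contrasts cross-class batching, used by previous methods for BN matching, with class-wise batching:

import torch

def cross_class_batches(images, labels, batch_size):
    # Previous methods: each batch mixes samples from many classes,
    # so BN-statistic matching pushes same-class images toward each other.
    perm = torch.randperm(len(images))
    for i in range(0, len(images), batch_size):
        idx = perm[i:i + batch_size]
        yield images[idx], labels[idx]

def class_wise_batches(images, labels, batch_size):
    # Class-wise supervision: every batch is drawn from a single class,
    # which increases within-class diversity of the synthesized images.
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        for i in range(0, len(idx), batch_size):
            sel = idx[i:i + batch_size]
            yield images[sel], labels[sel]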

Installation

Download the repo:

git clone https://github.com/he-y/soft-label-pruning-for-dataset-distillation.git LPLD
cd LPLD

Create the PyTorch environment:

conda env create -f environment.yml
conda activate lpld

Download all datasets and labels

Method 1: Automatic Downloading

# sh download.sh [true|false]
sh download.sh false

Method 2: Manual Downloading

Download manually from Google Drive, and place downloaded files in the following structure:

.
├── README.md
├── recover
│   ├── model_with_class_bn
│   │   └── [put Models-with-Class-BN here]
│   └── validate_result
│       └── [put Distilled-Dataset here]
└── relabel_and_validate
    └── syn_label_LPLD
        └── [put Labels here]

You will find the following after downloading:

Model with Class BN

| Dataset | Model with Class BN | Size |
|---|---|---|
| ImageNet-1K | ResNet18 | 50.41 MB |
| Tiny-ImageNet | ResNet18 | 81.30 MB |
| ImageNet-21K | ResNet18 | 445.87 MB |

Distilled Image Dataset

| Dataset | Setting | Dataset Size |
|---|---|---|
| ImageNet-1K | IPC10<br>IPC20<br>IPC50<br>IPC100<br>IPC200 | 0.15 GB<br>0.30 GB<br>0.75 GB<br>1.49 GB<br>2.98 GB |
| Tiny-ImageNet | IPC50<br>IPC100 | 21 MB<br>40 MB |
| ImageNet-21K | IPC10<br>IPC20 | 3 GB<br>5 GB |

Previous Soft Labels vs Ours

| Dataset | Setting | Previous<br>Label Size | Previous<br>Model Acc. | Ours<br>Label Size | Ours<br>Model Acc. |
|---|---|---|---|---|---|
| ImageNet-1K | IPC10<br>IPC20<br>IPC50<br>IPC100<br>IPC200 | 5.67 GB<br>11.33 GB<br>28.33 GB<br>56.66 GB<br>113.33 GB | 20.1%<br>33.6%<br>46.8%<br>52.8%<br>57.0% | 0.14 GB (40x)<br>0.29 GB (40x)<br>0.71 GB (40x)<br>1.43 GB (40x)<br>2.85 GB (40x) | 20.2%<br>33.0%<br>46.7%<br>54.0%<br>59.6% |
| Tiny-ImageNet | IPC50<br>IPC100 | 449 MB<br>898 MB | 41.1%<br>49.7% | 11 MB (40x)<br>22 MB (40x) | 38.4%<br>46.1% |
| ImageNet-21K | IPC10<br>IPC20 | 643 GB<br>1286 GB | 18.5%<br>20.5% | 16 GB (40x)<br>32 GB (40x) | 21.3%<br>29.4% |

Necessary Modification for PyTorch

Modify the PyTorch source class torch.utils.data._utils.fetch._MapDatasetFetcher to support multi-process loading of soft labels and mix configurations.
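
A quick way to find the file to edit (a small check, assuming a standard PyTorch installation):

# Print the path of torch/utils/data/_utils/fetch.py,
# the file that defines _MapDatasetFetcher.fetch.
import torch.utils.data._utils.fetch as fetch_module
print(fetch_module.__file__)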

class _MapDatasetFetcher(_BaseDatasetFetcher):
    def fetch(self, possibly_batched_index):
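        # 'fkd_load' mode: soft labels and mix-augmentation configs
        # (mix_index, mix_lam, mix_bbox) were generated offline and are
        # loaded here alongside the batch instead of computed on the fly.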
        if hasattr(self.dataset, "mode") and self.dataset.mode == 'fkd_load':
            if hasattr(self.dataset, "G_VBSM") and self.dataset.G_VBSM:
                pass # G_VBSM: uses self-decoding in the training script
            elif hasattr(self.dataset, "use_batch") and self.dataset.use_batch:
                mix_index, mix_lam, mix_bbox, soft_label = self.dataset.load_batch_config_by_batch_idx(possibly_batched_index[0])
            else:
                mix_index, mix_lam, mix_bbox, soft_label = self.dataset.load_batch_config(possibly_batched_index[0])

        if self.auto_collation:
            if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
                data = self.dataset.__getitems__(possibly_batched_index)
            else:
                data = [self.dataset[idx] for idx in possibly_batched_index]
        else:
            data = self.dataset[possibly_batched_index]

        if hasattr(self.dataset, "mode") and self.dataset.mode == 'fkd_load':
            # NOTE: mix_index, mix_lam, mix_bbox can be None
            mix_index_cpu = mix_index.cpu() if mix_index is not None else None
            return self.collate_fn(data), mix_index_cpu, mix_lam, mix_bbox, soft_label.cpu()
        else:
            return self.collate_fn(data)
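
With this modification, a DataLoader whose dataset is in fkd_load mode returns (batch, mix_index, mix_lam, mix_bbox, soft_label) on every iteration; datasets in other modes keep the default fetch behavior.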

Reproduce Results for 40x Compression Ratio

To reproduce the [Table] results for the 40x compression ratio, run the following:

cd relabel_and_validate
bash scripts/reproduce/main_table_in1k.sh
bash scripts/reproduce/main_table_tiny.sh
bash scripts/reproduce/main_table_in21k.sh

NOTE: the validation directory (val_dir) in the config files (relabel_and_validate/cfg/reproduce/CONFIG_FILE) should be changed to the correct path on your device.

Reproduce Results for Other Compression Ratios

Please refer to README: Usage for details on the three modules.

Table Results (Google Drive)

| No. | Content | Datasets |
|---|---|---|
| Table 1 | Dataset Analysis | ImageNet-1K |
| Table 2 | (a) SOTA Comparison<br>(b) Large Networks | Tiny ImageNet |
| Table 3 | SOTA Comparison | ImageNet-1K |
| Table 4 | Ablation Study | ImageNet-1K |
| Table 5 | (a) Pruning Metrics<br>(b) Calibration | ImageNet-1K |
| Table 6 | (a) Large Pruning Ratio<br>(b) ResNet-50 Result<br>(c) Cross Architecture Result | ImageNet-1K |
| Table 7 | SOTA Comparison | ImageNet-21K |
| Table 8 | Adaptation to Optimization-free Method (i.e., RDED) | ImageNet-1K |
| Table 9 | Comparison to G-VBSM | ImageNet-1K |
| **Appendix** | | |
| Table 10-18 | Configurations | - |
| Table 19 | Detailed Ablation | ImageNet-1K |
| Table 20 | Large IPCs (i.e., IPC300 and IPC400) | ImageNet-1K |
| Table 23 | Comparison to FKD | ImageNet-1K |

Related Repos

Our code is mainly related to the following papers and repos:

Citation

@inproceedings{xiao2024lpld,
  title={Are Large-scale Soft Labels Necessary for Large-scale Dataset Distillation?},
  author={Lingao Xiao and Yang He},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}