Hierarchical Self-supervised Augmented Knowledge Distillation

Installation

Requirements

Ubuntu 18.04 LTS

Python 3.8 (Anaconda is recommended)

CUDA 11.1

PyTorch 1.6.0

NCCL for CUDA 11.1

Perform Offline KD experiments on CIFAR-100 dataset

Dataset

CIFAR-100: download

unzip to the ./data folder
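
If you prefer not to unpack the archive by hand, torchvision can also download and extract CIFAR-100 into ./data; a small optional snippet (the training scripts do their own data loading, so this is only a convenience):

# Optional: let torchvision download and extract CIFAR-100 under ./data.
from torchvision import datasets, transforms

train_set = datasets.CIFAR100(root='./data', train=True, download=True,
                              transform=transforms.ToTensor())
test_set = datasets.CIFAR100(root='./data', train=False, download=True,
                             transform=transforms.ToTensor())
print(len(train_set), len(test_set))  # 50000 10000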

Training baselines

python train_baseline_cifar.py --arch wrn_16_2 --data ./data/  --gpu 0

More commands for training various architectures can be found in train_baseline_cifar.sh

Training teacher networks

(1) Use pre-trained backbone and train all auxiliary classifiers.

The pre-trained backbone weights follow the .pth files released by the CRD and SSKD repositories.

You should download them from Google Drive before training a teacher network that needs a pre-trained backbone.

python train_teacher_cifar.py \
    --arch wrn_40_2_aux \
    --milestones 30 60 90 --epochs 100 \
    --checkpoint-dir ./checkpoint \
    --data ./data  \
    --gpu 2 --manual 0 \
    --pretrained-backbone ./pretrained_backbones/wrn_40_2.pth \
    --freezed

More commands for training various teacher networks with frozen backbones can be found in train_teacher_freezed.sh

The pre-trained teacher networks can be downloaded from Google Drive.
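
The --freezed flag presumably keeps the pre-trained backbone fixed while only the auxiliary classifiers are optimized. A minimal sketch of that setup, using a toy stand-in model (the module names here are illustrative, not the repository's actual classes):

# Illustrative sketch of the frozen-backbone setup: the backbone weights stay fixed
# and only the auxiliary classifier heads receive gradients. ToyAuxModel is a
# hypothetical stand-in, not the repository's actual *_aux architecture.
import torch
import torch.nn as nn

class ToyAuxModel(nn.Module):
    def __init__(self, num_classes=100, num_rotations=4):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # an auxiliary head over the joint (class x rotation) label space
        self.aux_head = nn.Linear(16, num_classes * num_rotations)

    def forward(self, x):
        return self.aux_head(self.backbone(x))

model = ToyAuxModel()
# state = torch.load('./pretrained_backbones/wrn_40_2.pth')  # real backbone weights
# model.backbone.load_state_dict(state)

for p in model.backbone.parameters():
    p.requires_grad = False  # freeze the backbone, as --freezed presumably does

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1, momentum=0.9, weight_decay=5e-4)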

(2) Train the backbone and all auxiliary classifiers jointly from scratch. In this case, we no longer need a pre-trained teacher backbone.

According to our empirical study, this can lead to better accuracy for the teacher backbone.

python train_teacher_cifar.py \
    --arch wrn_40_2_aux \
    --checkpoint-dir ./checkpoint \
    --data ./data \
    --gpu 2 --manual 1

The pre-trained teacher networks can be downloaded from Google Drive.

To differentiate (1) and (2), we use --manual 0 to indicate case (1) and --manual 1 to indicate case (2).

Training student networks

(1) Train baselines of student networks

python train_baseline_cifar.py --arch wrn_16_2 --data ./data/  --gpu 0

More commands for training baselines of various student architectures can be found in train_baseline_cifar.sh

(2) Train student networks with a pre-trained teacher network

Note that the specific teacher network should be pre-trained before training the student networks.

python train_student_cifar.py \
    --tarch wrn_40_2_aux \
    --arch wrn_16_2_aux \
    --tcheckpoint ./checkpoint/train_teacher_cifar_arch_wrn_40_2_aux_dataset_cifar100_seed0/wrn_40_2_aux.pth.tar \
    --checkpoint-dir ./checkpoint \
    --data ./data \
    --gpu 0 --manual 0

More commands for training various teacher-student pairs can be found in train_student_cifar.sh
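
For intuition, the distillation objective matches teacher and student distributions over the joint class × rotation label space produced by the auxiliary classifiers, in addition to the ordinary cross-entropy loss. A simplified single-stage sketch under those assumptions (the real script applies this at several hierarchical stages; the temperature and weighting below are illustrative):

# Illustrative single-stage sketch of a self-supervision-augmented KD loss:
# KL divergence between teacher and student distributions over the joint
# (class x rotation) label space, plus cross-entropy on the original task.
import torch
import torch.nn.functional as F

def hsakd_style_loss(student_joint_logits,   # [B, C*K] from a student aux classifier
                     teacher_joint_logits,   # [B, C*K] from the matching teacher classifier
                     student_task_logits,    # [B, C] ordinary classification logits
                     labels,                 # [B] ground-truth class labels
                     T=4.0, alpha=1.0):
    ce = F.cross_entropy(student_task_logits, labels)
    kd = F.kl_div(F.log_softmax(student_joint_logits / T, dim=1),
                  F.softmax(teacher_joint_logits / T, dim=1),
                  reduction='batchmean') * (T * T)
    return ce + alpha * kd

# toy usage: C=100 classes, K=4 rotation transforms
B, C, K = 8, 100, 4
loss = hsakd_style_loss(torch.randn(B, C * K), torch.randn(B, C * K),
                        torch.randn(B, C), torch.randint(0, C, (B,)))
print(loss.item())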

Results of the same architecture style between teacher and student networks

| Teacher <br> Student | WRN-40-2 <br> WRN-16-2 | WRN-40-2 <br> WRN-40-1 | ResNet-56 <br> ResNet-20 | ResNet32x4 <br> ResNet8x4 |
|---|---|---|---|---|
| Teacher <br> Teacher* | 76.45 <br> 80.70 | 76.45 <br> 80.70 | 73.44 <br> 77.20 | 79.63 <br> 83.73 |
| Student | 73.57±0.23 | 71.95±0.59 | 69.62±0.26 | 72.95±0.24 |
| HSAKD | 77.20±0.17 | 77.00±0.21 | 72.58±0.33 | 77.26±0.14 |
| HSAKD* | 78.67±0.20 | 78.12±0.25 | 73.73±0.10 | 77.69±0.05 |

Results of different architecture styles between teacher and student networks

| Teacher <br> Student | VGG13 <br> MobileNetV2 | ResNet50 <br> MobileNetV2 | WRN-40-2 <br> ShuffleNetV1 | ResNet32x4 <br> ShuffleNetV2 |
|---|---|---|---|---|
| Teacher <br> Teacher* | 74.64 <br> 78.48 | 76.34 <br> 83.85 | 76.45 <br> 80.70 | 79.63 <br> 83.73 |
| Student | 73.51±0.26 | 73.51±0.26 | 71.74±0.35 | 72.96±0.33 |
| HSAKD | 77.45±0.21 | 78.79±0.11 | 78.51±0.20 | 79.93±0.11 |
| HSAKD* | 79.27±0.12 | 79.43±0.24 | 80.11±0.32 | 80.86±0.15 |

Training student networks under few-shot scenario

python train_student_few_shot.py \
    --tarch resnet56_aux \
    --arch resnet20_aux \
    --tcheckpoint ./checkpoint/train_teacher_cifar_arch_resnet56_aux_dataset_cifar100_seed0/resnet56_aux.pth.tar \
    --checkpoint-dir ./checkpoint \
    --data ./data/ \
    --few-ratio 0.25 \
    --gpu 2 --manual 0

--few-ratio: the percentage of training samples used for training

| Percentage | 25% | 50% | 75% | 100% |
|---|---|---|---|---|
| Student | 68.50±0.24 | 72.18±0.41 | 73.26±0.11 | 73.73±0.10 |
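
A minimal sketch of how such a subset could be drawn; class-balanced random sampling is an assumption here, and the script's actual sampling strategy may differ:

# Illustrative: keep a fixed fraction of training samples, balanced per class.
# This is an assumption about what --few-ratio does, not the script's actual code.
import random
from collections import defaultdict
from torch.utils.data import Subset
from torchvision import datasets, transforms

def few_shot_subset(dataset, ratio, seed=0):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, target in enumerate(dataset.targets):
        by_class[target].append(idx)
    keep = []
    for indices in by_class.values():
        rng.shuffle(indices)
        keep.extend(indices[:max(1, int(len(indices) * ratio))])
    return Subset(dataset, keep)

train_set = datasets.CIFAR100('./data', train=True, download=True,
                              transform=transforms.ToTensor())
subset = few_shot_subset(train_set, ratio=0.25)
print(len(subset))  # roughly 25% of the 50000 training samples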

Perform transfer experiments on STL-10 and TinyImageNet datasets

Dataset

STL-10: download

unzip to the ./data folder

TinyImageNet: download

unzip to the ./data folder

Prepare the TinyImageNet validation dataset as follows

cd data
python preprocess_tinyimagenet.py
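
preprocess_tinyimagenet.py presumably moves the flat val/images files into per-class subfolders using val_annotations.txt, so the validation split can be read like the training split. A minimal sketch under that assumption (run from inside the data folder, like the script above):

# Illustrative sketch: sort TinyImageNet validation images into class subfolders
# based on val_annotations.txt. This approximates what preprocess_tinyimagenet.py
# presumably does; paths follow the standard tiny-imagenet-200 layout.
import os
import shutil

val_dir = './tiny-imagenet-200/val'
with open(os.path.join(val_dir, 'val_annotations.txt')) as f:
    for line in f:
        filename, class_id = line.split('\t')[:2]
        class_dir = os.path.join(val_dir, class_id)
        os.makedirs(class_dir, exist_ok=True)
        src = os.path.join(val_dir, 'images', filename)
        if os.path.exists(src):
            shutil.move(src, os.path.join(class_dir, filename))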

Linear classification on STL-10

python eval_rep.py \
    --arch mobilenetV2 \
    --dataset STL-10 \
    --data ./data/  \
    --s-path ./checkpoint/train_student_cifar_tarch_vgg13_bn_aux_arch_mobilenetV2_aux_dataset_cifar100_seed0/mobilenetV2_aux.pth.tar
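
eval_rep.py presumably follows the standard linear-evaluation protocol: the distilled representation is frozen and only a linear classifier is trained on the target dataset. A minimal sketch under that assumption, with a toy encoder standing in for the distilled backbone:

# Illustrative linear-evaluation sketch: freeze a feature extractor and fit a
# linear classifier on top. The toy encoder stands in for the distilled backbone;
# eval_rep.py's actual interface may differ.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())  # stand-in backbone
for p in encoder.parameters():
    p.requires_grad = False  # the representation is kept fixed
encoder.eval()

linear = nn.Linear(32, 10)   # STL-10 has 10 classes
optimizer = torch.optim.SGD(linear.parameters(), lr=0.1, momentum=0.9)

images = torch.randn(8, 3, 96, 96)       # dummy STL-10-sized batch
labels = torch.randint(0, 10, (8,))
with torch.no_grad():
    feats = encoder(images)
loss = nn.functional.cross_entropy(linear(feats), labels)
loss.backward()
optimizer.step()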

Linear classification on TinyImageNet

python eval_rep.py \
    --arch mobilenetV2 \
    --dataset TinyImageNet \
    --data ./data/tiny-imagenet-200/  \
    --s-path ./checkpoint/train_student_cifar_tarch_vgg13_bn_aux_arch_mobilenetV2_aux_dataset_cifar100_seed0/mobilenetV2_aux.pth.tar

| Transferred Dataset | CIFAR-100 → STL-10 | CIFAR-100 → TinyImageNet |
|---|---|---|
| Student | 74.66 | 42.57 |

Perform Offline KD experiments on ImageNet dataset

Dataset preparation

$ ln -s PATH_TO_YOUR_IMAGENET ./data/

Folder of ImageNet Dataset:

data/ImageNet
├── train
├── val
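
With that layout in place, both splits can be read with torchvision's ImageFolder; a quick sanity check (purely illustrative, the training scripts handle data loading themselves):

# Quick sanity check that the symlinked ImageNet folder has the expected layout.
from torchvision import datasets, transforms

transform = transforms.Compose([transforms.Resize(256),
                                transforms.CenterCrop(224),
                                transforms.ToTensor()])
train_set = datasets.ImageFolder('./data/ImageNet/train', transform=transform)
val_set = datasets.ImageFolder('./data/ImageNet/val', transform=transform)
print(len(train_set.classes), len(train_set), len(val_set))  # 1000 classes expected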

Training teacher networks

(1) Use pre-trained backbone and train all auxiliary classifiers.

The pre-trained backbone weights of ResNet-34 follow resnet34-333f7ec4.pth, downloaded from the official PyTorch model zoo.
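
The weights can be fetched directly with torch.hub; the URL below is the official torchvision location for resnet34-333f7ec4.pth, and saving the file under ./pretrained_backbones/ matches the path used in the command below:

# Download the official torchvision ResNet-34 weights and store them where the
# teacher-training command expects them.
import os
import torch

os.makedirs('./pretrained_backbones', exist_ok=True)
url = 'https://download.pytorch.org/models/resnet34-333f7ec4.pth'
state_dict = torch.hub.load_state_dict_from_url(url, map_location='cpu')
torch.save(state_dict, './pretrained_backbones/resnet34-333f7ec4.pth')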

python train_teacher_imagenet.py \
    --dist-url 'tcp://127.0.0.1:55515' \
    --data ./data/ImageNet/ \
    --dist-backend 'nccl' \
    --multiprocessing-distributed \
    --checkpoint-dir ./checkpoint/ \
    --pretrained-backbone ./pretrained_backbones/resnet34-333f7ec4.pth \
    --freezed \
    --gpu 0,1,2,3,4,5,6,7 \
    --world-size 1 --rank 0 --manual_seed 0

(2) Train the backbone and all auxiliary classifiers jointly from scratch. In this case, we no longer need a pre-trained teacher backbone.

According to our empirical study, this can lead to better accuracy for the teacher backbone.

python train_teacher_imagenet.py \
    --dist-url 'tcp://127.0.0.1:2222' \
    --data ./data/ImageNet/ \
    --dist-backend 'nccl' \
    --multiprocessing-distributed \
    --checkpoint-dir ./checkpoint/ \
    --gpu 0,1,2,3,4,5,6,7 \
    --world-size 1 --rank 0 --manual_seed 1

Training student networks

(1) Using the teacher network trained with a frozen backbone

python train_student_imagenet.py \
    --data ./data/ImageNet/ \
    --arch resnet18_imagenet_aux \
    --tarch resnet34_imagenet_aux \
    --tcheckpoint ./checkpoint/train_teacher_imagenet_arch_resnet34_aux_dataset_imagenet_seed0/resnet34_imagenet_aux_best.pth.tar \
    --dist-url 'tcp://127.0.0.1:2222' \
    --dist-backend 'nccl' \
    --multiprocessing-distributed \
    --gpu-id 0,1,2,3,4,5,6,7 \
    --world-size 1 --rank 0 --manual_seed 0

(2) Using the teacher network trained jointly from scratch

python train_student_imagenet.py \
    --data ./data/ImageNet/ \
    --arch resnet18_imagenet_aux \
    --tarch resnet34_imagenet_aux \
    --tcheckpoint ./checkpoint/train_teacher_imagenet_arch_resnet34_aux_dataset_imagenet_seed1/resnet34_imagenet_aux_best.pth.tar \
    --dist-url 'tcp://127.0.0.1:2222' \
    --dist-backend 'nccl' \
    --multiprocessing-distributed \
    --gpu-id 0,1,2,3,4,5,6,7 \
    --world-size 1 --rank 0 --manual_seed 1

Results on the teacher-student pair of ResNet-34 and ResNet-18

| Accuracy | Teacher | Teacher* | Student | HSAKD | HSAKD* |
|---|---|---|---|---|---|
| Top-1 | 73.31 | 75.48 | 69.75 | 72.16 | 72.39 |
| Top-5 | 91.42 | 92.67 | 89.07 | 90.85 | 91.00 |
| Pretrained Models | resnet34_0 | resnet34_1 | resnet18 | resnet18_0 | resnet18_1 |

Perform Online Mutual KD experiments on CIFAR-100 dataset

Online Mutual KD trains two identical networks to teach each other. More commands for training various student architectures can be found in train_online_kd_cifar.sh
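
For intuition, each peer is trained with its own cross-entropy loss plus a KL term toward the other peer's softened predictions. A simplified sketch of one mutual-distillation step (plain logits only; the full online HSSAKD method additionally distills the self-supervision augmented distributions from the auxiliary classifiers):

# Illustrative sketch of one mutual-distillation step between two identical peers:
# each network minimizes cross-entropy plus KL toward the other's (detached)
# softened outputs. Temperature and weighting are illustrative.
import torch
import torch.nn.functional as F

def mutual_kd_losses(logits_a, logits_b, labels, T=3.0, alpha=1.0):
    soft_a = F.softmax(logits_a.detach() / T, dim=1)
    soft_b = F.softmax(logits_b.detach() / T, dim=1)
    loss_a = F.cross_entropy(logits_a, labels) + alpha * (T * T) * F.kl_div(
        F.log_softmax(logits_a / T, dim=1), soft_b, reduction='batchmean')
    loss_b = F.cross_entropy(logits_b, labels) + alpha * (T * T) * F.kl_div(
        F.log_softmax(logits_b / T, dim=1), soft_a, reduction='batchmean')
    return loss_a, loss_b

# toy usage with 100 classes
labels = torch.randint(0, 100, (8,))
la, lb = mutual_kd_losses(torch.randn(8, 100), torch.randn(8, 100), labels)
print(la.item(), lb.item())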

| Network | Baseline | HSSAKD (Online) |
|---|---|---|
| WRN-40-2 | 76.44±0.20 | 82.58±0.21 |
| WRN-40-1 | 71.95±0.59 | 76.67±0.41 |
| ResNet-56 | 73.00±0.17 | 78.16±0.56 |
| ResNet-32x4 | 79.56±0.23 | 84.91±0.19 |
| VGG-13 | 75.35±0.21 | 80.44±0.05 |
| MobileNetV2 | 73.51±0.26 | 78.85±0.13 |
| ShuffleNetV1 | 71.74±0.35 | 78.34±0.03 |
| ShuffleNetV2 | 72.96±0.33 | 79.98±0.12 |

Perform Online Mutual KD experiments on ImageNet dataset

python train_online_kd_imagenet.py \
    --data ./data/ImageNet/ \
    --arch resnet18_resnet18 \
    --dist-url 'tcp://127.0.0.1:2222' \
    --dist-backend 'nccl' \
    --multiprocessing-distributed \
    --gpu-id 0,1,2,3,4,5,6,7 \
    --world-size 1 --rank 0

| Network | Baseline | HSSAKD (Online) |
|---|---|---|
| ResNet-18 | 69.75 | 71.49 |

Citation

@inproceedings{yang2021hsakd,
  title={Hierarchical Self-supervised Augmented Knowledge Distillation},
  author={Yang, Chuanguang and An, Zhulin and Cai, Linhang and Xu, Yongjun},
  booktitle={Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI)},
  pages={1217--1223},
  year={2021}
}

@article{yang2022hssakd,
  author={Yang, Chuanguang and An, Zhulin and Cai, Linhang and Xu, Yongjun},
  journal={IEEE Transactions on Neural Networks and Learning Systems}, 
  title={Knowledge Distillation Using Hierarchical Self-Supervision Augmented Distribution}, 
  year={2022},
  pages={1-15},
  doi={10.1109/TNNLS.2022.3186807}
}