TinyMIM

😎 Introduction

This repository is the official implementation of our paper TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models (CVPR 2023).

[arxiv] [code]

Sucheng Ren, Fangyun Wei, Zheng Zhang, Han Hu

Small models are critical for real-world applications, yet they benefit only marginally, or not at all, from MIM pre-training. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distillation targets, losses, inputs, network regularization, and sequential distillation, revealing that: 1) distilling token relations is more effective than CLS token-based and feature-based distillation; 2) using an intermediate layer of the teacher network as the target performs better than using the last layer when the depth of the student does not match that of the teacher; 3) weak regularization is preferred.
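To make finding 1 concrete, here is a minimal sketch of token-relation distillation. It is not the repository's implementation: it assumes the relation is an attention-style map, softmax(Q K^T / sqrt(d)), and that teacher and student maps are matched with a KL divergence. All function names are hypothetical, and plain Python lists stand in for tensors.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def relation(a, b):
    """Token-to-token relation map: row-wise softmax of a @ b^T / sqrt(d).

    a and b are lists of N token vectors with the same dimension d.
    The result is N x N regardless of d.
    """
    d = len(a[0])
    return [
        softmax([sum(x * y for x, y in zip(ai, bj)) / math.sqrt(d) for bj in b])
        for ai in a
    ]

def kl_div(p, q, eps=1e-12):
    """KL(p || q) averaged over tokens (rows)."""
    total = 0.0
    for pi, qi in zip(p, q):
        total += sum(x * math.log((x + eps) / (y + eps)) for x, y in zip(pi, qi))
    return total / len(p)

def relation_distill_loss(teacher_q, teacher_k, student_q, student_k):
    """Match the student's relation map to the teacher's (assumed formulation)."""
    return kl_div(relation(teacher_q, teacher_k), relation(student_q, student_k))

# Toy example: 3 tokens, teacher width 4, student width 2 (made-up values).
teacher_q = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
teacher_k = teacher_q
student_q = [[0.5, 0.1], [0.2, 0.4], [0.3, 0.3]]
student_k = student_q

loss = relation_distill_loss(teacher_q, teacher_k, student_q, student_k)
print(f"relation distillation loss: {loss:.4f}")
```

Because each relation map is N x N (tokens by tokens), the loss is defined even when the teacher and student use different hidden dimensions, so no extra projection layer is needed to compare them.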

method

News

🛠 Installation

This repo is built on top of MAE.

🚀 Pretraining

We pretrain TinyMIM on 32 V100 GPUs with an overall batch size of 4096, identical to the setting used in MAE.

python -m torch.distributed.launch \
--nnodes 4 --node_rank $noderank \
--nproc_per_node 8 --master_addr $ip --master_port $port \
main_pretrain.py \
    --batch_size 128 \
    --model tinymim_vit_base_patch16 \
    --epochs 300 \
    --warmup_epochs 15 \
    --blr 1.5e-4 --weight_decay 0.05 \
    --teacher_path /path/to/teacher_ckpt \
    --teacher_model mae_vit_large \
    --data_path /path/to/imagenet 
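The flags in the command above determine the overall batch size: 4 nodes x 8 processes per node x 128 images per process. The short sketch below checks that arithmetic and derives the absolute learning rate, assuming this repo keeps MAE's linear scaling convention (lr = blr x batch size / 256). The helper names are hypothetical and not part of the codebase.

```python
def effective_batch_size(nnodes, nproc_per_node, batch_size):
    """Total images per optimizer step across all GPUs."""
    return nnodes * nproc_per_node * batch_size

def absolute_lr(blr, eff_batch_size, base=256):
    """Linear scaling rule as used in MAE (assumed to carry over here)."""
    return blr * eff_batch_size / base

eff = effective_batch_size(nnodes=4, nproc_per_node=8, batch_size=128)
lr = absolute_lr(blr=1.5e-4, eff_batch_size=eff)

print(eff)  # 4096, matching the overall batch size stated above
print(lr)   # about 2.4e-3
```

When running on fewer GPUs, the per-GPU `--batch_size` (or the number of nodes) would need to change so that the product stays at 4096 if the goal is to match the stated setting.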

Fine-tuning on ImageNet-1K (Classification)

python -m torch.distributed.launch --nproc_per_node=8 main_finetune.py \
    --batch_size 128 \
    --model vit_base_patch16 \
    --finetune ./output_dir/checkpoint-299.pth \
    --epochs 100 \
    --output_dir ./out_finetune/ \
    --blr 5e-4 --layer_decay 0.65 \
    --weight_decay 0.05 --drop_path 0.2 --reprob 0.25 --mixup 0.8 --cutmix 1.0 \
    --dist_eval --data_path /path/to/imagenet
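The `--layer_decay 0.65` flag applies layer-wise learning rate decay, so earlier layers are updated with smaller learning rates than later ones. The sketch below illustrates the idea under the assumption that the scheme matches MAE's: layer index 0 covers the embeddings, indices 1 to 12 cover the Transformer blocks of ViT-B, and the last index covers the head. The function name is hypothetical.

```python
def layer_lr_scales(num_blocks, layer_decay):
    """Per-layer multipliers on the peak learning rate.

    Index 0: embeddings, 1..num_blocks: Transformer blocks, last: head.
    The head keeps the full rate; each step toward the input multiplies
    the rate by layer_decay once more.
    """
    num_layers = num_blocks + 1
    return [layer_decay ** (num_layers - i) for i in range(num_layers + 1)]

# ViT-B has 12 Transformer blocks.
scales = layer_lr_scales(num_blocks=12, layer_decay=0.65)

# Peak rate under the assumed linear scaling rule: blr * batch / 256,
# with 8 GPUs x 128 images per GPU from the command above.
peak_lr = 5e-4 * (8 * 128) / 256

for i, s in enumerate(scales):
    print(f"layer {i:2d}: scale {s:.4f}, lr {peak_lr * s:.2e}")
```

With a decay of 0.65 across 12 blocks, the embeddings end up training at well under 1% of the peak rate, which keeps the pretrained low-level features largely intact during fine-tuning.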

Fine-tuning on ADE20K (Semantic Segmentation)

Please refer to Segmentation/README.md.

Checkpoint

The models pretrained and fine-tuned on ImageNet-1K are available at

[Google Drive]

Comparison

Performance comparison on ImageNet-1K classification and ADE20K Semantic Segmentation.

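| Method | Model Size | Top-1 | mIoU |
|---|---|---|---|
| MAE | ViT-T | 71.6 | 37.6 |
| TinyMIM | ViT-T | 75.8 | 44.0 |
| TinyMIM* | ViT-T | 79.6 | 45.0 |
| MAE | ViT-S | 80.6 | 42.8 |
| TinyMIM | ViT-S | 83.0 | 48.4 |
| MAE | ViT-B | 83.6 | 48.1 |
| TinyMIM | ViT-B | 85.0 | 52.2 |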

Generalization comparison on out-of-domain datasets (ImageNet-A/R/C).

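| Method | Model Size | ImageNet-1K | ImageNet-Adversarial $\uparrow$ | ImageNet-Rendition $\uparrow$ | ImageNet-Corruption $\downarrow$ |
|---|---|---|---|---|---|
| MAE | ViT-T | 71.6 | 7.0 | 36.5 | 55.2 |
| TinyMIM | ViT-T | 75.8 | 11.0 | 39.8 | 50.1 |
| MAE | ViT-S | 80.6 | 20.1 | 45.6 | 40.6 |
| TinyMIM | ViT-S | 83.0 | 27.5 | 48.8 | 35.8 |
| MAE | ViT-B | 83.6 | 33.6 | 50.0 | 37.8 |
| TinyMIM | ViT-B | 85.0 | 43.0 | 54.6 | 32.7 |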

✍ Citation

If you have any questions, feel free to contact Sucheng Ren :)

@InProceedings{Ren_2023_CVPR,
    author    = {Ren, Sucheng and Wei, Fangyun and Zhang, Zheng and Hu, Han},
    title     = {TinyMIM: An Empirical Study of Distilling MIM Pre-Trained Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {3687-3697}
}