Vision Transformer

PyTorch reimplementation of Google's repository for the ViT model, released with the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby.

The paper shows that Transformers applied directly to image patches and pre-trained on large datasets perform very well on image recognition tasks.

fig1

The Vision Transformer achieves state-of-the-art results on image recognition tasks using a standard Transformer encoder and fixed-size patches. To perform classification, the authors use the standard approach of adding an extra learnable "classification token" to the sequence.
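To make the "fixed-size patches plus classification token" idea concrete, here is a minimal sketch of the resulting sequence length. The function name `vit_sequence_length` is hypothetical and not part of this repo, and the sketch assumes square images split into non-overlapping square patches.

```python
def vit_sequence_length(image_size: int, patch_size: int) -> int:
    """Number of tokens fed to the encoder: one per patch, plus one class token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    # patches_per_side ** 2 patch tokens, plus 1 learnable classification token
    return patches_per_side * patches_per_side + 1


print(vit_sequence_length(224, 16))  # 197
print(vit_sequence_length(384, 16))  # 577
```

Under these assumptions, moving from 224x224 to 384x384 inputs roughly triples the sequence length, which is consistent with the longer training times reported for the higher resolution.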

fig2

Usage

1. Download Pre-trained model (Google's Official Checkpoint)

# imagenet21k pre-train
wget https://storage.googleapis.com/vit_models/imagenet21k/{MODEL_NAME}.npz

# imagenet21k pre-train + imagenet2012 fine-tuning
wget https://storage.googleapis.com/vit_models/imagenet21k+imagenet2012/{MODEL_NAME}.npz
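If you script the download, the two URL patterns above can be filled in programmatically. The sketch below only builds the URL string and does not download anything; `checkpoint_url` is a hypothetical helper, not part of this repo, and the model name passed in is assumed to match a checkpoint that actually exists in the bucket.

```python
BASE_URL = "https://storage.googleapis.com/vit_models"


def checkpoint_url(model_name: str, fine_tuned: bool = False) -> str:
    """Build the checkpoint URL for a model name such as "ViT-B_16"."""
    # fine_tuned=True selects the imagenet21k + imagenet2012 fine-tuned weights
    upstream = "imagenet21k+imagenet2012" if fine_tuned else "imagenet21k"
    return f"{BASE_URL}/{upstream}/{model_name}.npz"


print(checkpoint_url("ViT-B_16"))
print(checkpoint_url("ViT-B_16", fine_tuned=True))
```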

2. Train Model

python3 train.py --name cifar10-100_500 --dataset cifar10 --model_type ViT-B_16 --pretrained_dir checkpoint/ViT-B_16.npz

CIFAR-10 and CIFAR-100 are downloaded automatically before training. To use a different dataset, you need to customize data_utils.py.

The default batch size is 512. If GPU memory is insufficient, you can still train by increasing --gradient_accumulation_steps.
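Gradient accumulation works because averaging the gradients of equally sized micro-batches gives the same gradient as one large batch. The sketch below checks this on a toy one-parameter model in plain Python. It illustrates the general technique only; the function names are hypothetical, and how train.py splits the batch internally is an assumption not confirmed here.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, accumulation_steps):
    """Split the batch into equal micro-batches and accumulate their gradients."""
    if len(xs) % accumulation_steps != 0:
        raise ValueError("batch size must be divisible by accumulation_steps")
    micro = len(xs) // accumulation_steps
    total = 0.0
    for step in range(accumulation_steps):
        lo, hi = step * micro, (step + 1) * micro
        # Dividing by accumulation_steps mirrors scaling the loss before backward()
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
full = grad_mse(0.5, xs, ys)
split = accumulated_grad(0.5, xs, ys, accumulation_steps=4)
print(abs(full - split) < 1e-9)
```

Each micro-batch needs only a fraction of the activation memory, at the cost of more forward and backward passes per optimizer step.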

You can also use Automatic Mixed Precision (Amp) to reduce memory usage and train faster:

python3 train.py --name cifar10-100_500 --dataset cifar10 --model_type ViT-B_16 --pretrained_dir checkpoint/ViT-B_16.npz --fp16 --fp16_opt_level O2

Results

To verify that the converted model weights are correct, we compare our results with the authors' reported results. We trained with mixed precision, with --fp16_opt_level set to O2.

imagenet-21k

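| model | dataset | resolution | acc (official) | acc (this repo) | time |
|---|---|---|---|---|---|
| ViT-B_16 | CIFAR-10 | 224x224 | - | 0.9908 | 3h 13m |
| ViT-B_16 | CIFAR-10 | 384x384 | 0.9903 | 0.9906 | 12h 25m |
| ViT-B_16 | CIFAR-100 | 224x224 | - | 0.923 | 3h 9m |
| ViT-B_16 | CIFAR-100 | 384x384 | 0.9264 | 0.9228 | 12h 31m |
| R50-ViT-B_16 | CIFAR-10 | 224x224 | - | 0.9892 | 4h 23m |
| R50-ViT-B_16 | CIFAR-10 | 384x384 | 0.99 | 0.9904 | 15h 40m |
| R50-ViT-B_16 | CIFAR-100 | 224x224 | - | 0.9231 | 4h 18m |
| R50-ViT-B_16 | CIFAR-100 | 384x384 | 0.9231 | 0.9197 | 15h 53m |
| ViT-L_32 | CIFAR-10 | 224x224 | - | 0.9903 | 2h 11m |
| ViT-L_32 | CIFAR-100 | 224x224 | - | 0.9276 | 2h 9m |
| ViT-H_14 | CIFAR-100 | 224x224 | - | WIP | |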

imagenet-21k + imagenet2012

| model | dataset | resolution | acc |
|---|---|---|---|
| ViT-B_16-224 | CIFAR-10 | 224x224 | 0.99 |
| ViT-B_16-224 | CIFAR-100 | 224x224 | 0.9245 |
| ViT-L_32 | CIFAR-10 | 224x224 | 0.9903 |
| ViT-L_32 | CIFAR-100 | 224x224 | 0.9285 |

shorter train

| upstream | model | dataset | total_steps / warmup_steps | acc (official) | acc (this repo) |
|---|---|---|---|---|---|
| imagenet21k | ViT-B_16 | CIFAR-10 | 500/100 | 0.9859 | 0.9859 |
| imagenet21k | ViT-B_16 | CIFAR-10 | 1000/100 | 0.9886 | 0.9878 |
| imagenet21k | ViT-B_16 | CIFAR-100 | 500/100 | 0.8917 | 0.9072 |
| imagenet21k | ViT-B_16 | CIFAR-100 | 1000/100 | 0.9115 | 0.9216 |
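The total_steps / warmup_steps values describe the length of fine-tuning and of the learning-rate warmup. As a rough illustration, the sketch below computes a learning-rate multiplier for a linear warmup followed by cosine decay. This particular schedule shape is an assumption, since this section does not state which schedule was used for these runs, and `warmup_cosine_multiplier` is a hypothetical name.

```python
import math


def warmup_cosine_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """Factor in [0, 1] to multiply the base learning rate by at a given step."""
    if step < warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, from 0 to 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half cosine wave: 1 at the end of warmup, 0 at total_steps
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 50, 100, 300, 500):
    print(step, round(warmup_cosine_multiplier(step, 100, 500), 4))
```

A function of this shape could be passed to a scheduler that accepts a per-step multiplier, such as PyTorch's `torch.optim.lr_scheduler.LambdaLR`.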

Visualization

ViT consists of a standard Transformer encoder, and each encoder block is built from a self-attention module and an MLP module. The attention map for an input image can be visualized from the attention scores of the self-attention layers.

Visualization code can be found at visualize_attention_map.
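One common way to turn per-layer attention scores into a single map is attention rollout: add the identity to each layer's attention matrix to account for the residual connection, renormalize the rows, and multiply the layers together. The plain-Python sketch below shows the idea on a toy input. It is an illustration under stated assumptions, not necessarily what the visualization code in this repo does; the function names are hypothetical, and each input matrix is assumed to be already averaged over heads.

```python
def attention_rollout(layers):
    """Combine per-layer attention matrices (tokens x tokens) into one matrix."""
    n = len(layers[0])
    joint = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        # Add the identity for the residual connection, then renormalize each row
        aug = [[attn[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
        aug = [[v / sum(row) for v in row] for row in aug]
        # Multiply into the running product so attention traces back to input tokens
        joint = [[sum(aug[i][k] * joint[k][j] for k in range(n)) for j in range(n)]
                 for i in range(n)]
    return joint


def class_token_map(joint, grid_size):
    """Reshape the class token's attention over patches into a square grid."""
    patch_scores = joint[0][1:]  # row 0 is the class token; drop its self-attention
    return [patch_scores[r * grid_size:(r + 1) * grid_size] for r in range(grid_size)]


# Toy input: 1 class token + 4 patches (a 2x2 grid), two layers of uniform attention
uniform = [[0.2] * 5 for _ in range(5)]
rolled = attention_rollout([uniform, uniform])
heatmap = class_token_map(rolled, grid_size=2)
print(heatmap)
```

In practice the resulting grid would be upsampled to the image resolution and overlaid on the input image.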

fig3

Reference

Citations

@article{dosovitskiy2020,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and  Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={arXiv preprint arXiv:2010.11929},
  year={2020}
}