Visformer

Introduction

This is a PyTorch implementation of the Visformer models. The project is based on the training code of DeiT and the tools in timm.

Usage

Clone the repository:

git clone https://github.com/danczs/Visformer.git

Install PyTorch, timm and einops:

pip install -r requirements.txt
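
A quick way to confirm that the environment is ready is to import the packages that main.py relies on (a minimal check; it only assumes the dependencies above installed cleanly):

# Sanity check for the dependencies used by the training code.
import torch
import timm
import einops

print('torch ', torch.__version__)
print('timm  ', timm.__version__)
print('einops', einops.__version__)
print('CUDA available:', torch.cuda.is_available())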

Data Preparation

The expected layout of the ImageNet data:

/path/to/imagenet/
  train/
    class1/
      img1.jpeg
    class2/
      img2.jpeg
  val/
    class1/
      img1.jpeg
    class2/
      img2.jpeg
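
This is the standard ImageFolder layout, so the directories can be sanity-checked with torchvision before launching a long run (an illustrative sketch, assuming torchvision is installed; the training script builds its own datasets and transforms):

# Check that train/ and val/ are readable as class-per-subdirectory datasets.
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder('/path/to/imagenet/train', transform=transform)
val_set = datasets.ImageFolder('/path/to/imagenet/val', transform=transform)
print(len(train_set.classes), 'classes,', len(train_set), 'train images,', len(val_set), 'val images')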

Network Training

Visformer_small

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model visformer_small --batch-size 64 --data-path /path/to/imagenet --output_dir /path/to/save

Visformer_tiny

python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --model visformer_tiny --batch-size 256 --drop-path 0.03 --data-path /path/to/imagenet --output_dir /path/to/save

Visformer V2 models

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model swin_visformer_small_v2 --batch-size 64 --data-path /path/to/imagenet --output_dir /path/to/save --amp --qk-scale-factor=-0.5
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --model swin_visformer_tiny_v2 --batch-size 256 --drop-path 0.03 --data-path /path/to/imagenet --output_dir /path/to/save --amp --qk-scale-factor=-0.5
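
For a quick forward pass without the training script, a model can also be built directly from the repository (a minimal sketch; it assumes the constructors in models.py share the names used by the --model flags above, and the weights are randomly initialized unless a checkpoint is loaded):

# Build a Visformer model and run a dummy forward pass.
# Run from the repository root so that models.py is importable.
import torch
from models import visformer_small  # constructor name assumed to match the --model flag

model = visformer_small()
model.eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)  # expected: torch.Size([1, 1000]) for ImageNet-1k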

The model performance:

| model | top-1 (%) | FLOPs (G) | parameters (M) |
|---|---|---|---|
| Visformer_tiny | 78.6 | 1.3 | 10.3 |
| Visformer_tiny_V2 | 79.6 | 1.3 | 9.4 |
| Visformer_small | 82.2 | 4.9 | 40.2 |
| Visformer_small_V2 | 83.0 | 4.3 | 23.6 |
| Visformer_medium_V2 | 83.6 | 8.5 | 44.5 |

Pre-trained models:

| model | weights | log | top-1 (%) |
|---|---|---|---|
| Visformer_small (original) | github | github | 82.21 |
| Visformer_small (+ Swin for downstream tasks) | github | github | 82.34 |
| Visformer_small_v2 (+ Swin for downstream tasks) | github | github | 83.00 |
| Visformer_medium_v2 (+ Swin for downstream tasks) | github | github | 83.62 |

(In some logs, the model is only evaluated during the last 50 epochs to save training time.)

More information about Visformer V2.

Object Detection on COCO

The standard self-attention is not efficient for high-resolution inputs, so we simply replace it with Swin attention for object detection. Therefore, Swin Transformer is our direct baseline.

Mask R-CNN

| Backbone | sched | box mAP | mask mAP | params (M) | FLOPs (G) | FPS |
|---|---|---|---|---|---|---|
| Swin-T | 1x | 42.6 | 39.3 | 48 | 267 | 14.8 |
| Visformer-S | 1x | 43.0 | 39.6 | 60 | 275 | 13.1 |
| VisformerV2-S | 1x | 44.8 | 40.7 | 43 | 262 | 15.2 |
| Swin-T | 3x + MS | 46.0 | 41.6 | 48 | 367 | 14.8 |
| VisformerV2-S | 3x + MS | 47.8 | 42.5 | 43 | 262 | 15.2 |

Cascade Mask R-CNN

| Backbone | sched | box mAP | mask mAP | params (M) | FLOPs (G) | FPS |
|---|---|---|---|---|---|---|
| Swin-T | 1x + MS | 48.1 | 41.7 | 86 | 745 | 9.5 |
| VisformerV2-S | 1x + MS | 49.3 | 42.3 | 81 | 740 | 9.6 |
| Swin-T | 3x + MS | 50.5 | 43.7 | 86 | 745 | 9.5 |
| VisformerV2-S | 3x + MS | 51.6 | 44.1 | 81 | 740 | 9.6 |

This repo only contains the key files for object detection ('./ObjectDetection'); the full detection project is available in Swin-Visformer-Object-Detection.

Pre-trained Model

Because of our institution's policy, we cannot release the pre-trained models directly. Thankfully, @hzhang57 and @developer0hye provide Visformer_small and Visformer_tiny models trained by themselves.

Automatic Mixed Precision (amp)

In the original version of Visformer, amp training can produce NaN values. We found that the overflow comes from the attention map:

scale = head_dim ** -0.5
attn = (q @ k.transpose(-2, -1)) * scale

To avoid overflow, we pre-normalize q and k, so that 'attn' is overall normalized by 'head_dim' instead of 'head_dim ** 0.5':

scale = head_dim ** -0.5
attn = (q * scale) @ (k.transpose(-2, -1) * scale)
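
The difference is easy to check in isolation: in the original ordering, the matmul result q @ k^T is materialized before the scale is applied, and under fp16 that intermediate can exceed the representable range (about 65504). A standalone sketch, not the repository's attention module, with magnitudes exaggerated for illustration:

# Compare the intermediate magnitudes of the two formulations (computed in
# float32 here) against the range fp16 would have to represent under amp.
import torch

torch.manual_seed(0)
head_dim = 64
fp16_max = torch.finfo(torch.float16).max  # 65504

# Deliberately large activations to provoke the overflow case.
q = torch.randn(8, 196, head_dim) * 60
k = torch.randn(8, 196, head_dim) * 60
scale = head_dim ** -0.5

logits = q @ k.transpose(-2, -1)                        # intermediate of the original version
attn_post = logits * scale                              # scale applied after the matmul
attn_pre = (q * scale) @ (k.transpose(-2, -1) * scale)  # scale folded into q and k (normalizes by head_dim)

print('q @ k^T max:         ', logits.abs().max().item(), '(fp16 max:', fp16_max, ')')
print('post-scaled attn max:', attn_post.abs().max().item())
print('pre-scaled attn max: ', attn_pre.abs().max().item())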

Amp training:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model visformer_small --batch-size 64 --data-path /path/to/imagenet --output_dir /path/to/save --amp --qk-scale-factor=-0.5
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --model visformer_tiny --batch-size 256 --drop-path 0.03 --data-path /path/to/imagenet --output_dir /path/to/save --amp --qk-scale-factor=-0.5

This change won't degrade the training performance.

Using amp for the original pre-trained models:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model visformer_small --batch-size 64 --data-path /path/to/imagenet --output_dir /path/to/save --eval --resume /path/to/weights --amp

Citing

@inproceedings{chen2021visformer,
  title={Visformer: The vision-friendly transformer},
  author={Chen, Zhengsu and Xie, Lingxi and Niu, Jianwei and Liu, Xuefeng and Wei, Longhui and Tian, Qi},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={589--598},
  year={2021}
}