CSWin-Transformer, CVPR 2022

This repo is the official implementation of "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows".

Introduction

CSWin Transformer (CSWin stands for Cross-Shaped Window), introduced on arXiv, is a new general-purpose backbone for computer vision. It is a hierarchical Transformer that replaces traditional full attention with our newly proposed cross-shaped window self-attention. This mechanism computes self-attention in horizontal and vertical stripes in parallel; together the stripes form a cross-shaped window, and each stripe is obtained by splitting the input feature into stripes of equal width. With CSWin, we can achieve global attention at a limited computation cost.
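To make the shape of the attention region concrete, here is a minimal pure-Python sketch that lists which positions a single token can attend to when its horizontal and vertical stripes are combined. It is an illustration only, not the repository's implementation: the function name `cross_shaped_window` and its parameters are hypothetical, and it models only the stripe geometry, not the attention computation itself.

```python
def cross_shaped_window(height, width, stripe_width, row, col):
    """Return the set of (r, c) positions visible to the token at (row, col).

    Hypothetical helper: it only models the stripe geometry described above.
    """
    # Horizontal stripe: the band of `stripe_width` rows that contains `row`,
    # spanning every column of the feature map.
    top = (row // stripe_width) * stripe_width
    horizontal = {
        (r, c)
        for r in range(top, min(top + stripe_width, height))
        for c in range(width)
    }
    # Vertical stripe: the band of `stripe_width` columns that contains `col`,
    # spanning every row of the feature map.
    left = (col // stripe_width) * stripe_width
    vertical = {
        (r, c)
        for r in range(height)
        for c in range(left, min(left + stripe_width, width))
    }
    # The union of both stripes is the cross-shaped window.
    return horizontal | vertical


window = cross_shaped_window(height=8, width=8, stripe_width=2, row=3, col=5)
print(len(window), "of", 8 * 8, "positions are inside the cross-shaped window")
```

On an 8x8 feature map with stripes of width 2, the window reaches every row and every column of the map while covering fewer than half of its positions, which is the intuition behind global attention at limited cost.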

CSWin Transformer achieves strong performance on ImageNet classification (87.5% top-1 accuracy on val with only 97G FLOPs) and ADE20K semantic segmentation (55.7 mIoU on val), surpassing previous models by a large margin.

Main Results on ImageNet

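| model | pretrain | resolution | acc@1 | #params | FLOPs | 22K model | 1K model |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CSWin-T | ImageNet-1K | 224x224 | 82.8 | 23M | 4.3G | - | model |
| CSWin-S | ImageNet-1K | 224x224 | 83.6 | 35M | 6.9G | - | model |
| CSWin-B | ImageNet-1K | 224x224 | 84.2 | 78M | 15.0G | - | model |
| CSWin-B | ImageNet-1K | 384x384 | 85.5 | 78M | 47.0G | - | model |
| CSWin-L | ImageNet-22K | 224x224 | 86.5 | 173M | 31.5G | model | model |
| CSWin-L | ImageNet-22K | 384x384 | 87.5 | 173M | 96.8G | - | model |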

Main Results on Downstream Tasks

COCO Object Detection

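| backbone | method | pretrain | lr schd | box mAP | mask mAP | #params | FLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CSWin-T | Mask R-CNN | ImageNet-1K | 3x | 49.0 | 43.6 | 42M | 279G |
| CSWin-S | Mask R-CNN | ImageNet-1K | 3x | 50.0 | 44.5 | 54M | 342G |
| CSWin-B | Mask R-CNN | ImageNet-1K | 3x | 50.8 | 44.9 | 97M | 526G |
| CSWin-T | Cascade Mask R-CNN | ImageNet-1K | 3x | 52.5 | 45.3 | 80M | 757G |
| CSWin-S | Cascade Mask R-CNN | ImageNet-1K | 3x | 53.7 | 46.4 | 92M | 820G |
| CSWin-B | Cascade Mask R-CNN | ImageNet-1K | 3x | 53.9 | 46.4 | 135M | 1004G |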

ADE20K Semantic Segmentation (val)

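| backbone | method | pretrain | crop size | lr schd | mIoU | mIoU (ms+flip) | #params | FLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CSWin-T | Semantic FPN | ImageNet-1K | 512x512 | 80K | 48.2 | - | 26M | 202G |
| CSWin-S | Semantic FPN | ImageNet-1K | 512x512 | 80K | 49.2 | - | 39M | 271G |
| CSWin-B | Semantic FPN | ImageNet-1K | 512x512 | 80K | 49.9 | - | 81M | 464G |
| CSWin-T | UPerNet | ImageNet-1K | 512x512 | 160K | 49.3 | 50.7 | 60M | 959G |
| CSWin-S | UPerNet | ImageNet-1K | 512x512 | 160K | 50.4 | 51.5 | 65M | 1027G |
| CSWin-B | UPerNet | ImageNet-1K | 512x512 | 160K | 51.1 | 52.2 | 109M | 1222G |
| CSWin-B | UPerNet | ImageNet-22K | 640x640 | 160K | 51.8 | 52.6 | 109M | 1941G |
| CSWin-L | UPerNet | ImageNet-22K | 640x640 | 160K | 53.4 | 55.7 | 208M | 2745G |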

Pretrained models and code can be found at segmentation.

Requirements

The code requires timm==0.3.4, pytorch>=1.4, opencv, ... To install the requirements, run:

bash install_req.sh
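After installation, a quick check of the pinned versions can catch mismatches early. The sketch below uses only the Python standard library; the helper names are hypothetical, and the distribution names `timm` and `torch` are assumptions about how the packages are registered with pip (OpenCV is left out because it ships under several distribution names).

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn a string such as '1.4.0+cu101' into the tuple (1, 4, 0)."""
    numbers = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        numbers.append(int(match.group()))
    return tuple(numbers)


def check(package, minimum=None, exact=None):
    """Return a one-line report for `package` (hypothetical helper)."""
    try:
        found = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package}: not installed"
    version = parse_version(found)
    if exact is not None and version != parse_version(exact):
        return f"{package}: found {found}, expected =={exact}"
    if minimum is not None and version < parse_version(minimum):
        return f"{package}: found {found}, expected >={minimum}"
    return f"{package}: {found} OK"


print(check("timm", exact="0.3.4"))
print(check("torch", minimum="1.4"))
```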

Apex for mixed precision training is used for finetuning. To install apex, run:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Data preparation: arrange ImageNet in the following folder structure. You can extract ImageNet with this script.

│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......
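Before starting a long training run, it can help to confirm that the dataset matches this layout. The following is a minimal sketch using only the standard library; the function names and the `imagenet` root path are hypothetical, and the check that train/ and val/ hold the same class folders is an assumption drawn from the tree above rather than a documented requirement.

```python
from pathlib import Path


def summarize_split(split_dir):
    """Return {class_folder_name: number_of_JPEG_files} for one split."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir() if f.suffix.upper() == ".JPEG"
            )
    return counts


def check_imagenet_layout(root):
    """Return a list of human-readable problems; an empty list means OK."""
    root = Path(root)
    problems = []
    classes = {}
    for split in ("train", "val"):
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing directory: {split_dir}")
            continue
        counts = summarize_split(split_dir)
        classes[split] = set(counts)
        for name, count in counts.items():
            if count == 0:
                problems.append(f"no .JPEG files in {split_dir / name}")
    # Assumption: both splits should contain the same class folders.
    if len(classes) == 2 and classes["train"] != classes["val"]:
        problems.append("train/ and val/ contain different class folders")
    return problems


for problem in check_imagenet_layout("imagenet"):
    print(problem)
```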

Train

Train the three lite variants: CSWin-Tiny, CSWin-Small and CSWin-Base:

bash train.sh 8 --data <data path> --model CSWin_64_12211_tiny_224 -b 256 --lr 2e-3 --weight-decay .05 --amp --img-size 224 --warmup-epochs 20 --model-ema-decay 0.99984 --drop-path 0.2
bash train.sh 8 --data <data path> --model CSWin_64_24322_small_224 -b 256 --lr 2e-3 --weight-decay .05 --amp --img-size 224 --warmup-epochs 20 --model-ema-decay 0.99984 --drop-path 0.4
bash train.sh 8 --data <data path> --model CSWin_96_24322_base_224 -b 128 --lr 1e-3 --weight-decay .1 --amp --img-size 224 --warmup-epochs 20 --model-ema-decay 0.99992 --drop-path 0.5
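The model names passed to `--model` follow a visible pattern of five underscore-separated fields. The sketch below splits a name into those fields so that a script can, for example, confirm that the trailing resolution agrees with `--img-size`, as it does in every command shown in this README. The field labels (`width`, `depth_code`) are guesses at what the numbers mean and are not confirmed here; the helper itself is hypothetical.

```python
def parse_model_name(name):
    """Split a name such as 'CSWin_64_12211_tiny_224' into labeled fields.

    The labels are assumptions: the README does not define what the
    second and third fields encode.
    """
    family, width, depth_code, size, resolution = name.split("_")
    return {
        "family": family,
        "width": int(width),        # assumed: base channel width
        "depth_code": depth_code,   # assumed: per-stage depth code, left unparsed
        "size": size,
        "resolution": int(resolution),
    }


def resolution_matches(name, img_size):
    """True when the name's trailing resolution equals the --img-size value."""
    return parse_model_name(name)["resolution"] == img_size


for model in (
    "CSWin_64_12211_tiny_224",
    "CSWin_64_24322_small_224",
    "CSWin_96_24322_base_224",
):
    print(model, "->", parse_model_name(model), resolution_matches(model, 224))
```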

To train CSWin on images with 384x384 resolution, use '--img-size 384'.

If GPU memory is insufficient, use '-b 128 --lr 1e-3 --model-ema-decay 0.99992' or enable checkpointing with '--use-chk'.
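The reduced-memory settings follow a simple pattern relative to the CSWin-Tiny and CSWin-Small commands: halving the batch size halves the learning rate and also halves the gap between the EMA decay and 1. The sketch below generalizes that observation to other batch sizes. The proportional rule is an inference from the numbers in this README, not a documented recommendation, and `rescale_for_batch` is a hypothetical helper.

```python
def rescale_for_batch(batch, lr, ema_decay, new_batch):
    """Scale lr and EMA decay in proportion to a change in batch size.

    Assumption: both lr and (1 - ema_decay) scale linearly with the batch
    size, which matches the two settings quoted in this README.
    """
    ratio = new_batch / batch
    new_lr = lr * ratio
    new_decay = 1.0 - (1.0 - ema_decay) * ratio
    return new_lr, new_decay


# Starting point: the CSWin-Tiny / CSWin-Small settings.
lr, decay = rescale_for_batch(batch=256, lr=2e-3, ema_decay=0.99984, new_batch=128)
print(f"-b 128 --lr {lr:g} --model-ema-decay {decay:.5f}")
```

With these inputs the helper reproduces the '-b 128 --lr 1e-3 --model-ema-decay 0.99992' values quoted above, which are also the values used in the CSWin-Base training command.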

Finetune

Finetune CSWin-Base with 384x384 resolution:

bash finetune.sh 8 --data <data path> --model CSWin_96_24322_base_384 -b 32 --lr 5e-6 --min-lr 5e-7 --weight-decay 1e-8 --amp --img-size 384 --warmup-epochs 0 --model-ema-decay 0.9998 --finetune <pretrained 224 model> --epochs 20 --mixup 0.1 --cooldown-epochs 10 --drop-path 0.7 --ema-finetune --lr-scale 1 --cutmix 0.1

Finetune ImageNet-22K pretrained CSWin-Large with 224x224 resolution:

bash finetune.sh 8 --data <data path> --model CSWin_144_24322_large_224 -b 64 --lr 2.5e-4 --min-lr 5e-7 --weight-decay 1e-8 --amp --img-size 224 --warmup-epochs 0 --model-ema-decay 0.9996 --finetune <22k-pretrained model> --epochs 30 --mixup 0.01 --cooldown-epochs 10 --interpolation bicubic  --lr-scale 0.05 --drop-path 0.2 --cutmix 0.3 --use-chk --fine-22k --ema-finetune

If GPU memory is insufficient, enable checkpointing with '--use-chk'.

Cite CSWin Transformer

@misc{dong2021cswin,
      title={CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows},
      author={Xiaoyi Dong and Jianmin Bao and Dongdong Chen and Weiming Zhang and Nenghai Yu and Lu Yuan and Dong Chen and Baining Guo},
      year={2021},
      eprint={2107.00652},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

This repository is built using the timm library and the DeiT repository.

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using CSWin Transformer, please submit a GitHub issue.

For other communications related to CSWin Transformer, please contact Jianmin Bao (jianbao@microsoft.com) or Dong Chen (doch@microsoft.com).