<h1 align="left">[ECCV 2022] VSA: Learning Varied-Size Window Attention in Vision Transformers<a href="https://arxiv.org/abs/2204.08446"><img src="https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg" ></a></h1> <p align="center"> <a href="#Updates">Updates</a> | <a href="#introduction">Introduction</a> | <a href="#statement">Statement</a> | </p>

Current applications

Classification: Please see <a href="https://github.com/ViTAE-Transformer/ViTAE-VSA/tree/main/Image-Classification">ViTAE-VSA for Image Classification</a> for usage details;

Object Detection: Please see <a href="https://github.com/ViTAE-Transformer/ViTAE-VSA/tree/main/Object-Detection">ViTAE-VSA for Object Detection</a> for usage details;

Semantic Segmentation: Will be released in the next few days;

Other ViTAE applications

ViTAE & ViTAEv2: Please see <a href="https://github.com/ViTAE-Transformer/ViTAE-Transformer">ViTAE-Transformer for Image Classification, Object Detection, and Semantic Segmentation</a>;

Matting: Please see <a href="https://github.com/ViTAE-Transformer/ViTAE-Transformer-Matting">ViTAE-Transformer for matting</a>;

Remote Sensing: Please see <a href="https://github.com/ViTAE-Transformer/ViTAE-Transformer-Remote-Sensing">ViTAE-Transformer for Remote Sensing</a>;

Updates

19/09/2022

09/07/2022

19/04/2022

Introduction

<p align="left">This repository contains the code, models, test results for the paper <a href="https://arxiv.org/pdf/2204.08446.pdf">VSA: Learning Varied-Size Window Attention in Vision Transformers</a>. We design a novel varied-size window attention module which learns adaptive window configurations from data. By adopting VSA in each head independently, the model can capture long-range dependencies and rich context information from diverse windows. VSA can replace the window attention in SOTA methods and faciliate the learning on various vision tasks including classification, detection and segmentation. <figure> <img src="figs/illustration.jpg"> <figcaption align = "center"><b>Fig.1 - The comparison of the current design (hand-crafted windows) and VSA.</b></figcaption> </figure> <figure> <img src="figs/architecture.png"> <figcaption align = "center"><b>Fig.2 - The architecture of VSA .</b></figcaption> </figure>

Usage

If you are only interested in the VSA attention module, please refer to this file in the classification code or the VSAWindowAttention class in the object detection code.
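
For a quick shape check, the toy module from the sketch above can be run on a random tensor. Again, this is only an illustration; its interface does not necessarily match the repository's VSAWindowAttention class.

```python
# Smoke test for the toy sketch above (not the repository's VSAWindowAttention).
import torch

attn = ToyVariedSizeWindowAttention(dim=64, num_heads=4, window_size=7)
x = torch.randn(2, 64, 28, 28)   # spatial size must be a multiple of the window size
print(attn(x).shape)             # expected: torch.Size([2, 64, 28, 28])
```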

Classification Results

ViTAEv2* denotes the variant that uses window attention in all stages, which requires much less memory and computation.

Main Results on ImageNet-1K with pretrained models

| name | resolution | acc@1 | acc@5 | acc@RealTop-1 | Pretrained |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-T | 224x224 | 81.2 | \ | \ | \ |
| Swin-T+VSA | 224x224 | 82.24 | 95.8 | \ | Coming Soon |
| ViTAEv2*-S | 224x224 | 82.2 | 96.1 | 87.5 | \ |
| ViTAEv2-S | 224x224 | 82.6 | 96.2 | 87.6 | weights&logs |
| ViTAEv2*-S+VSA | 224x224 | 82.7 | 96.3 | 87.7 | weights&logs |
| Swin-S | 224x224 | 83.0 | \ | \ | \ |
| Swin-S+VSA | 224x224 | 83.6 | 96.6 | \ | Coming Soon |
| ViTAEv2*-48M+VSA | 224x224 | 83.9 | 96.6 | \ | weights&logs |

Models with ImageNet-22K pretraining

| name | resolution | acc@1 | acc@5 | acc@RealTop-1 | Pretrained |
| :---: | :---: | :---: | :---: | :---: | :---: |
| ViTAEv2*-48M+VSA | 224x224 | 84.9 | 97.4 | \ | Coming Soon |
| ViTAEv2*-B+VSA | 224x224 | 86.2 | 97.9 | 90.0 | Coming Soon |

Object Detection Results

ViTAEv2* denotes the variant that uses window attention in all stages, which requires much less memory and computation.

Mask R-CNN

| Backbone | Pretrain | Lr Schd | box mAP | mask mAP | #params | config | log | model |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ViTAEv2*-S | ImageNet-1K | 1x | 43.5 | 39.4 | 37M | \ | \ | \ |
| ViTAEv2-S | ImageNet-1K | 1x | 46.3 | 41.8 | 37M | config | github | Coming Soon |
| ViTAEv2*-S+VSA | ImageNet-1K | 1x | 45.9 | 41.4 | 37M | config | github | Coming Soon |
| ViTAEv2*-S | ImageNet-1K | 3x | 44.7 | 40.0 | 39M | \ | \ | \ |
| ViTAEv2-S | ImageNet-1K | 3x | 47.8 | 42.6 | 37M | config | github | Coming Soon |
| ViTAEv2*-S+VSA | ImageNet-1K | 3x | 48.1 | 42.9 | 39M | config | github | Coming Soon |
| ViTAEv2*-48M+VSA | ImageNet-1K | 3x | 49.9 | 44.2 | 69M | config | github | Coming Soon |

Cascade Mask R-CNN

| Backbone | Pretrain | Lr Schd | box mAP | mask mAP | #params | config | log | model |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ViTAEv2*-S | ImageNet-1K | 1x | 47.3 | 40.6 | 77M | \ | \ | \ |
| ViTAEv2-S | ImageNet-1K | 1x | 50.6 | 43.6 | 75M | config | github | Coming Soon |
| ViTAEv2*-S+VSA | ImageNet-1K | 1x | 49.8 | 43.0 | 77M | config | github | Coming Soon |
| ViTAEv2*-S | ImageNet-1K | 3x | 48.0 | 41.3 | 77M | \ | \ | \ |
| ViTAEv2-S | ImageNet-1K | 3x | 51.4 | 44.5 | 75M | config | github | Coming Soon |
| ViTAEv2*-S+VSA | ImageNet-1K | 3x | 51.9 | 44.8 | 77M | config | github | Coming Soon |
| ViTAEv2*-48M+VSA | ImageNet-1K | 3x | 52.9 | 45.6 | 108M | config | github | Coming Soon |
<!-- | Swin-T | ImageNet-1K | 3x | 50.2 | 43.5 | 86M | [config](configs/swin/cascade_mask_rcnn_swin_tiny_patch4_window7_mstrain_480-800_giou_4conv1f_adamw_3x_coco.py) | [github](https://github.com/SwinTransformer/storage/releases/download/v1.0.3/moby_cascade_mask_rcnn_swin_tiny_patch4_window7_3x.log.json)/[baidu](https://pan.baidu.com/s/1zEFXHYjEiXUCWF1U7HR5Zg) | [github](https://github.com/SwinTransformer/storage/releases/download/v1.0.3/moby_cascade_mask_rcnn_swin_tiny_patch4_window7_3x.pth)/[baidu](https://pan.baidu.com/s/1FMmW0GOpT4MKsKUrkJRgeg) | -->

Semantic Segmentation Results for Cityscapes

ViTAEv2* denotes the variant that uses window attention in all stages.

UperNet

512x1024 resolution for training and testing

| Backbone | Pretrain | Lr Schd | mIoU | mIoU* | #params | config | log | model |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-T | ImageNet-1k | 40k | 78.9 | 79.9 | \ | \ | \ | \ |
| Swin-T+VSA | ImageNet-1k | 40k | 80.8 | 81.7 | \ | \ | \ | \ |
| ViTAEv2*-S | ImageNet-1k | 40k | 80.1 | 80.9 | \ | \ | \ | \ |
| ViTAEv2*-S+VSA | ImageNet-1k | 40k | 81.4 | 82.3 | \ | \ | \ | \ |
| Swin-T | ImageNet-1k | 80k | 79.3 | 80.2 | \ | \ | \ | \ |
| Swin-T+VSA | ImageNet-1k | 80k | 81.6 | 82.4 | \ | \ | \ | \ |
| ViTAEv2*-S | ImageNet-1k | 80k | 80.8 | 81.0 | \ | \ | \ | \ |
| ViTAEv2*-S+VSA | ImageNet-1k | 80k | 82.2 | 83.0 | \ | \ | \ | \ |

769x769 resolution for training and testing

| Backbone | Pretrain | Lr Schd | mIoU | ms mIoU | #params | config | log | model |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-T | ImageNet-1k | 40k | 79.3 | 80.1 | \ | \ | \ | \ |
| Swin-T+VSA | ImageNet-1k | 40k | 81.0 | 81.9 | \ | \ | \ | \ |
| ViTAEv2*-S | ImageNet-1k | 40k | 79.6 | 80.6 | \ | \ | \ | \ |
| ViTAEv2*-S+VSA | ImageNet-1k | 40k | 81.7 | 82.5 | \ | \ | \ | \ |
| Swin-T | ImageNet-1k | 80k | 79.6 | 80.1 | \ | \ | \ | \ |
| Swin-T+VSA | ImageNet-1k | 80k | 81.6 | 82.5 | \ | \ | \ | \ |

Please refer to our paper for more experimental results.

Statement

This project is for research purposes only. For any other questions, please contact qmzhangzz at hotmail.com or yufei.xu at outlook.com.

The code base is borrowed from T2T, ViTAEv2 and Swin.

Citing VSA and ViTAE

@article{zhang2022vsa,
  title={VSA: Learning Varied-Size Window Attention in Vision Transformers},
  author={Zhang, Qiming and Xu, Yufei and Zhang, Jing and Tao, Dacheng},
  journal={arXiv preprint arXiv:2204.08446},
  year={2022}
}
@article{zhang2022vitaev2,
  title={ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond},
  author={Zhang, Qiming and Xu, Yufei and Zhang, Jing and Tao, Dacheng},
  journal={arXiv preprint arXiv:2202.10108},
  year={2022}
}
@article{xu2021vitae,
  title={Vitae: Vision transformer advanced by exploring intrinsic inductive bias},
  author={Xu, Yufei and Zhang, Qiming and Zhang, Jing and Tao, Dacheng},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  year={2021}
}