ScalableViT

This repository contains the code for the paper "ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer".

It currently includes code and models for the following tasks:

Image Classification

Object Detection

Semantic Segmentation

Introduction

ScalableViT (Scalable Vision Transformer) includes two attention mechanisms: Scalable Self-Attention (SSA) and Interactive Window-based Self-Attention (IWSA). SSA leverages two scaling factors to relax the dimensions of the query, key, and value matrices. IWSA establishes interaction between non-overlapping regions by re-merging independent value tokens and aggregating spatial information from adjacent windows. By stacking SSA and IWSA blocks alternately, ScalableViT-S achieves 83.1% top-1 accuracy on ImageNet-1K.
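To illustrate the SSA idea, here is a minimal NumPy sketch. It assumes average pooling over groups of `r_n` tokens as the spatial reduction of keys and values, which shrinks the attention map from O(N²) to O(N·N/r_n); the function and weight names are illustrative, not the authors' implementation (the paper's channel-wise scaling factor is omitted for brevity).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scalable_self_attention(x, wq, wk, wv, wo, r_n=4):
    """Sketch of Scalable Self-Attention (SSA).

    x: (N, C) token matrix; wq/wk/wv/wo: (C, C) projection weights.
    Keys and values are built from spatially reduced tokens, so the
    attention map is (N, N/r_n) rather than (N, N).
    """
    n, c = x.shape
    # spatial reduction: average-pool groups of r_n tokens (one possible choice)
    x_red = x.reshape(n // r_n, r_n, c).mean(axis=1)  # (N/r_n, C)
    q = x @ wq                                        # (N, C)
    k = x_red @ wk                                    # (N/r_n, C)
    v = x_red @ wv                                    # (N/r_n, C)
    attn = softmax(q @ k.T / np.sqrt(c))              # (N, N/r_n)
    return attn @ v @ wo                              # (N, C)
```

Because queries keep full resolution while keys and values are reduced, the output still has one token per input position, so SSA blocks can be stacked interchangeably with window-based ones.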

Architecture

Main results

Image Classification on ImageNet

| Model | #Param. (M) | FLOPs (G) | Top-1 Acc. (%) |
| --- | --- | --- | --- |
| ScalableViT-S | 32.4 | 4.2 | 83.1 |
| ScalableViT-B | 81.9 | 8.6 | 84.1 |
| ScalableViT-L | 104.9 | 14.7 | 84.4 |

Object Detection on COCO

RetinaNet

| Backbone | Pretrain | Lr Schd | #Param. (M) | FLOPs (G) | bbox mAP |
| --- | --- | --- | --- | --- | --- |
| ScalableViT-S | ImageNet-1K | 1x | 36.4 | 238 | 45.2 |
| ScalableViT-S | ImageNet-1K | 3x | 36.4 | 238 | 47.8 |
| ScalableViT-B | ImageNet-1K | 1x | 85.6 | 330 | 45.8 |
| ScalableViT-B | ImageNet-1K | 3x | 85.6 | 330 | 48.0 |
| ScalableViT-L | ImageNet-1K | 1x | 112.6 | 457 | 46.8 |

Mask R-CNN

| Backbone | Pretrain | Lr Schd | #Param. (M) | FLOPs (G) | bbox mAP | mask mAP |
| --- | --- | --- | --- | --- | --- | --- |
| ScalableViT-S | ImageNet-1K | 1x | 46.3 | 256 | 45.8 | 41.7 |
| ScalableViT-S | ImageNet-1K | 3x | 46.3 | 256 | 48.7 | 43.6 |
| ScalableViT-B | ImageNet-1K | 1x | 94.9 | 349 | 46.6 | 42.1 |
| ScalableViT-B | ImageNet-1K | 3x | 94.9 | 349 | 48.9 | 43.6 |
| ScalableViT-L | ImageNet-1K | 1x | 121.4 | 477 | 47.6 | 42.9 |

Semantic Segmentation on ADE20K

Semantic FPN

| Backbone | Method | Crop Size | Lr Schd | #Param. (M) | FLOPs (G) | mIoU |
| --- | --- | --- | --- | --- | --- | --- |
| ScalableViT-S | Semantic FPN | 512x512 | 80K | 30.4 | 174 | 44.9 |
| ScalableViT-B | Semantic FPN | 512x512 | 80K | 79.0 | 270 | 48.4 |
| ScalableViT-L | Semantic FPN | 512x512 | 80K | 105.5 | 402 | 49.4 |

UperNet

| Backbone | Method | Crop Size | Lr Schd | #Param. (M) | FLOPs (G) | mIoU | mIoU (ms+flip) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ScalableViT-S | UperNet | 512x512 | 160K | 56.5 | 931 | 48.5 | 49.4 |
| ScalableViT-B | UperNet | 512x512 | 160K | 107.0 | 1029 | 49.5 | 50.4 |
| ScalableViT-L | UperNet | 512x512 | 160K | 135.5 | 1162 | 49.7 | 50.7 |

Citation

@article{ScalableViT,
  title={ScalableViT: Rethinking the context-oriented generalization of vision transformer},
  author={Yang, Rui and Ma, Hailong and Wu, Jie and Tang, Yansong and Xiao, Xuefeng and Zheng, Min and Li, Xiu},
  journal={arXiv preprint arXiv:2203.10790},
  year={2022}
}