# ScalableViT
This repository contains the code for the paper "ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer".
It currently includes code and models for the following tasks:
- ✅ Image Classification
- ❌ Object Detection
- ❌ Semantic Segmentation
## Introduction
ScalableViT (Scalable Vision Transformer) includes two self-attention mechanisms: Scalable Self-Attention (SSA) and Interactive Window-based Self-Attention (IWSA). SSA leverages two scaling factors to release the dimensions of the $query$, $key$, and $value$ matrices. IWSA establishes interaction between non-overlapping regions by re-merging independent $value$ tokens and aggregating spatial information from adjacent windows. By stacking SSA and IWSA blocks alternately, ScalableViT-S achieves 83.1% Top-1 accuracy on ImageNet-1K. Minimal sketches of both mechanisms follow below.
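The SSA idea is easiest to see in code. Below is a minimal PyTorch sketch, assuming the spatial reduction of $key$/$value$ is done with a strided convolution; the class name, the `spatial_ratio`/`channel_ratio` arguments, and the choice of reduction op are illustrative assumptions, not the exact layers used in this repository.

```python
import torch
import torch.nn as nn

class ScalableSelfAttention(nn.Module):
    """Sketch of SSA: a spatial factor (r_n) shrinks the key/value map,
    a channel factor (r_c) rescales the channel dim inside the attention."""
    def __init__(self, dim, num_heads=8, spatial_ratio=4, channel_ratio=1.25):
        super().__init__()
        self.num_heads = num_heads
        inner_dim = int(dim * channel_ratio)           # r_c: scaled channel dim
        assert inner_dim % num_heads == 0
        self.scale = (inner_dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, inner_dim)
        # r_n: keys/values are computed on a spatially reduced feature map
        self.reduce = nn.Conv2d(dim, dim, kernel_size=spatial_ratio,
                                stride=spatial_ratio)
        self.kv = nn.Linear(dim, inner_dim * 2)
        self.proj = nn.Linear(inner_dim, dim)          # back to the input dim

    def forward(self, x, H, W):
        B, N, C = x.shape                              # tokens: (B, H*W, C)
        q = self.q(x).reshape(B, N, self.num_heads, -1).transpose(1, 2)
        x_ = self.reduce(x.transpose(1, 2).reshape(B, C, H, W))
        x_ = x_.flatten(2).transpose(1, 2)             # (B, N / r_n**2, C)
        k, v = self.kv(x_).chunk(2, dim=-1)
        k = k.reshape(B, -1, self.num_heads,
                      k.shape[-1] // self.num_heads).transpose(1, 2)
        v = v.reshape(B, -1, self.num_heads,
                      v.shape[-1] // self.num_heads).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)).mul(self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)                          # (B, N, C)
```

For example, with `dim=64`, `channel_ratio=1.25`, and `num_heads=8`, attention runs with an 80-channel inner dimension while the key/value map is reduced 4x per side, so the attention matrix is 16x smaller than in vanilla global self-attention.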
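The IWSA sketch below carries the same caveats: non-overlapping window attention plus a depth-wise convolution over the re-merged $value$ map, standing in for the local interactive module that lets adjacent windows exchange information. The `window_size` default, the 3x3 kernel, and the assumption that `H` and `W` divide evenly by the window size are all illustrative.

```python
import torch
import torch.nn as nn

class InteractiveWindowAttention(nn.Module):
    """Sketch of IWSA: window attention + a local interactive branch."""
    def __init__(self, dim, num_heads=8, window_size=7):
        super().__init__()
        self.num_heads = num_heads
        self.window_size = window_size
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        # Local interactive module stand-in: a depth-wise conv over the
        # re-merged value map aggregates information across window borders.
        self.lim = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        B, N, C = x.shape                              # assumes H % ws == W % ws == 0
        ws = self.window_size
        qkv = self.qkv(x).reshape(B, N, 3, C).permute(2, 0, 1, 3)
        q, k, v = qkv[0], qkv[1], qkv[2]

        def window(t):
            # Partition (B, H*W, C) into non-overlapping ws x ws windows,
            # then split heads: (B * num_windows, heads, ws*ws, head_dim).
            t = t.reshape(B, H // ws, ws, W // ws, ws, C)
            t = t.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
            return t.reshape(-1, ws * ws, self.num_heads,
                             C // self.num_heads).transpose(1, 2)

        qw, kw, vw = window(q), window(k), window(v)
        attn = (qw @ kw.transpose(-2, -1)).mul(self.scale).softmax(dim=-1)
        out = (attn @ vw).transpose(1, 2).reshape(-1, ws * ws, C)

        # Reverse the window partition back to (B, H*W, C).
        out = out.reshape(B, H // ws, W // ws, ws, ws, C)
        out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)

        # Re-merge the independent value tokens into a 2-D map and add the
        # LIM branch so adjacent windows interact.
        v_map = v.transpose(1, 2).reshape(B, C, H, W)
        out = out + self.lim(v_map).flatten(2).transpose(1, 2)
        return self.proj(out)
```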
## Main Results

### Image Classification on ImageNet-1K
| Model | #Param. (M) | FLOPs (G) | Top-1 Acc. (%) |
| --- | --- | --- | --- |
| ScalableViT-S | 32.4 | 4.2 | 83.1 |
| ScalableViT-B | 81.9 | 8.6 | 84.1 |
| ScalableViT-L | 104.9 | 14.7 | 84.4 |
### Object Detection on COCO

#### RetinaNet
| Backbone | Pretrain | Lr Schd | #Param. (M) | FLOPs (G) | bbox mAP |
| --- | --- | --- | --- | --- | --- |
| ScalableViT-S | ImageNet-1K | 1x | 36.4 | 238 | 45.2 |
| ScalableViT-S | ImageNet-1K | 3x | 36.4 | 238 | 47.8 |
| ScalableViT-B | ImageNet-1K | 1x | 85.6 | 330 | 45.8 |
| ScalableViT-B | ImageNet-1K | 3x | 85.6 | 330 | 48.0 |
| ScalableViT-L | ImageNet-1K | 1x | 112.6 | 457 | 46.8 |
#### Mask R-CNN
| Backbone | Pretrain | Lr Schd | #Param. (M) | FLOPs (G) | bbox mAP | mask mAP |
| --- | --- | --- | --- | --- | --- | --- |
| ScalableViT-S | ImageNet-1K | 1x | 46.3 | 256 | 45.8 | 41.7 |
| ScalableViT-S | ImageNet-1K | 3x | 46.3 | 256 | 48.7 | 43.6 |
| ScalableViT-B | ImageNet-1K | 1x | 94.9 | 349 | 46.6 | 42.1 |
| ScalableViT-B | ImageNet-1K | 3x | 94.9 | 349 | 48.9 | 43.6 |
| ScalableViT-L | ImageNet-1K | 1x | 121.4 | 477 | 47.6 | 42.9 |
### Semantic Segmentation on ADE20K

#### Semantic FPN
| Backbone | Method | Crop Size | Lr Schd | #Param. (M) | FLOPs (G) | mIoU |
| --- | --- | --- | --- | --- | --- | --- |
| ScalableViT-S | Semantic FPN | 512x512 | 80K | 30.4 | 174 | 44.9 |
| ScalableViT-B | Semantic FPN | 512x512 | 80K | 79.0 | 270 | 48.4 |
| ScalableViT-L | Semantic FPN | 512x512 | 80K | 105.5 | 402 | 49.4 |
#### UperNet
| Backbone | Method | Crop Size | Lr Schd | #Param. (M) | FLOPs (G) | mIoU | mIoU (ms+flip) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ScalableViT-S | UperNet | 512x512 | 160K | 56.5 | 931 | 48.5 | 49.4 |
| ScalableViT-B | UperNet | 512x512 | 160K | 107.0 | 1029 | 49.5 | 50.4 |
| ScalableViT-L | UperNet | 512x512 | 160K | 135.5 | 1162 | 49.7 | 50.7 |
## Citation
```bibtex
@article{ScalableViT,
  title={ScalableViT: Rethinking the context-oriented generalization of vision transformer},
  author={Yang, Rui and Ma, Hailong and Wu, Jie and Tang, Yansong and Xiao, Xuefeng and Zheng, Min and Li, Xiu},
  journal={arXiv preprint arXiv:2203.10790},
  year={2022}
}
```