MPViT: Multi-Path Vision Transformer for Dense Prediction

This repository includes the official implementation and model weights for MPViT.

[arXiv] [BibTeX]

MPViT: Multi-Path Vision Transformer for Dense Prediction<br> :classical_building: :school: Youngwan Lee, :classical_building: Jonghee Kim, :school: Jeff Willette, :school: Sung Ju Hwang <br> ETRI :classical_building:, KAIST :school: <br>

News

🎉 MPViT has been accepted to CVPR 2022.

Abstract

We explore multi-scale patch embedding and multi-path structure, constructing the Multi-Path Vision Transformer (MPViT). MPViT embeds features of the same size (i.e., sequence length) with patches of different scales simultaneously by using overlapping convolutional patch embedding. Tokens of different scales are then independently fed into the Transformer encoders via multiple paths and the resulting features are aggregated, enabling both fine and coarse feature representations at the same feature level. Thanks to the diverse and multi-scale feature representations, our MPViTs scaling from Tiny(5M) to Base(73M) consistently achieve superior performance over state-of-the-art Vision Transformers on ImageNet classification, object detection, instance segmentation, and semantic segmentation. These extensive results demonstrate that MPViT can serve as a versatile backbone network for various vision tasks.

<div align="center"> <img src="https://dl.dropbox.com/s/qsp5scrd9okl3pw/mpvit_plot1.png" width="850px" /> </div> <div align="center"> <img src="https://dl.dropbox.com/s/dsaqd0cc9ryzqim/mpvit_plot2.png" width="850px" /> </div>
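
To make the multi-scale embedding and multi-path aggregation concrete, here is a small, illustrative PyTorch sketch of the idea described above. This is **not** the official implementation: every module, name, and hyperparameter below is simplified for exposition; see the code in this repository for the real architecture.

```python
import torch
import torch.nn as nn

class MultiPathBlockSketch(nn.Module):
    """Illustrative sketch of MPViT's multi-scale patch embedding + multi-path idea.
    Simplified for exposition; not the official MPViT implementation."""

    def __init__(self, in_ch=64, dim=64, kernel_sizes=(3, 5, 7), depth=1, heads=4):
        super().__init__()
        # Overlapping convolutional patch embeddings: different kernel sizes
        # (patch scales) but the same stride/padding scheme, so every path
        # produces tokens with the same sequence length.
        self.embeds = nn.ModuleList([
            nn.Conv2d(in_ch, dim, kernel_size=k, stride=2, padding=k // 2)
            for k in kernel_sizes
        ])
        # One independent Transformer encoder per path.
        self.encoders = nn.ModuleList([
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim),
                num_layers=depth,
            )
            for _ in kernel_sizes
        ])
        # Aggregate the fine and coarse tokens coming from all paths.
        self.aggregate = nn.Linear(dim * len(kernel_sizes), dim)

    def forward(self, x):                                   # x: (B, C, H, W)
        outs = []
        for embed, encoder in zip(self.embeds, self.encoders):
            tokens = embed(x).flatten(2).permute(2, 0, 1)   # (N, B, dim), same N per path
            outs.append(encoder(tokens))
        return self.aggregate(torch.cat(outs, dim=-1))      # (N, B, dim)

feats = MultiPathBlockSketch()(torch.randn(1, 64, 56, 56))
print(feats.shape)  # torch.Size([784, 1, 64])
```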

Main results on ImageNet-1K

:rocket: All of these models are trained on ImageNet-1K with the same training recipe as DeiT and CoaT.

| model | resolution | acc@1 | #params | FLOPs | weight |
| --- | --- | --- | --- | --- | --- |
| MPViT-T | 224x224 | 78.2 | 5.8M | 1.6G | weight |
| MPViT-XS | 224x224 | 80.9 | 10.5M | 2.9G | weight |
| MPViT-S | 224x224 | 83.0 | 22.8M | 4.7G | weight |
| MPViT-B | 224x224 | 84.3 | 74.8M | 16.4G | weight |
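
As a usage sketch, a released classification weight could be loaded roughly as follows. The import path `mpvit`, the factory name `mpvit_small`, and the file names are assumptions for illustration; check this repository's model definitions for the exact entry points.

```python
import torch
from PIL import Image
from torchvision import transforms

# Assumption: the repo exposes per-size constructors such as mpvit_small;
# check the model definition file for the actual names.
from mpvit import mpvit_small

model = mpvit_small()
ckpt = torch.load("mpvit_small.pth", map_location="cpu")   # downloaded weight file
model.load_state_dict(ckpt.get("model", ckpt))             # weights may be nested under "model"
model.eval()

# Standard 224x224 ImageNet evaluation preprocessing (matching the table above).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    top5 = model(img).softmax(dim=-1).topk(5)
print(top5.indices, top5.values)
```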

Main results on COCO object detection

:rocket: All models are trained using ImageNet-1K pretrained weights.

:sunny: MS denotes the same multi-scale training augmentation as in Swin-Transformer, which in turn follows the MS augmentation used in DETR and Sparse-RCNN. Therefore, we also follow the official implementations of DETR and Sparse-RCNN, which are likewise based on Detectron2.

Please refer to detectron2/ for the details.

| Backbone | Method | lr Schd | box mAP | mask mAP | #params | FLOPs | weight |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MPViT-T | RetinaNet | 1x | 41.8 | - | 17M | 196G | <a href="https://dl.dropbox.com/s/0pep3jnx3zvt1zc/retinanet_mpvit_tiny_1x.pth">model</a> \| <a href="https://dl.dropbox.com/s/5fpuicgbk2i2sp2/retinanet_mpvit_tiny_1x_metrics.json">metrics</a> |
| MPViT-XS | RetinaNet | 1x | 43.8 | - | 20M | 211G | <a href="https://dl.dropbox.com/s/4oh8h8wag6yhrir/retinanet_mpvit_xs_1x.pth">model</a> \| <a href="https://dl.dropbox.com/s/2jm7b0uj5wfa45f/retinanet_mpvit_xs_1x_metrics.json">metrics</a> |
| MPViT-S | RetinaNet | 1x | 45.7 | - | 32M | 248G | <a href="https://dl.dropbox.com/s/cbcvz3y6t9hun6l/retinanet_mpvit_small_1x.pth">model</a> \| <a href="https://dl.dropbox.com/s/d9zyltgy4o6eb28/retinanet_mpvit_small_1x_metrics.json">metrics</a> |
| MPViT-B | RetinaNet | 1x | 47.0 | - | 85M | 482G | <a href="https://dl.dropbox.com/s/hznu2ljqbh0fr1z/retinanet_mpvit_base_1x.pth">model</a> \| <a href="https://dl.dropbox.com/s/kettv7sk5ett9qz/retinanet_mpvit_base_1x_metrics.json">metrics</a> |
| MPViT-T | RetinaNet | MS+3x | 44.4 | - | 17M | 196G | <a href="https://dl.dropbox.com/s/o66ht73g1shpwhn/retinanet_mpvit_tiny_ms_3x.pth">model</a> \| <a href="https://dl.dropbox.com/s/4slpgagl49vl37h/retinanet_mpvit_tiny_ms_3x_metrics.json">metrics</a> |
| MPViT-XS | RetinaNet | MS+3x | 46.1 | - | 20M | 211G | <a href="https://dl.dropbox.com/s/8kxauovyyaq8x5b/retinanet_mpvit_xs_ms_3x.pth">model</a> \| <a href="https://dl.dropbox.com/s/2n9pmm8nbb1ikry/retinanet_mpvit_xs_ms_3x_metrics.json">metrics</a> |
| MPViT-S | RetinaNet | MS+3x | 47.6 | - | 32M | 248G | <a href="https://dl.dropbox.com/s/gh00mdtqxoic64e/retinanet_mpvit_small_ms_3x.pth">model</a> \| <a href="https://dl.dropbox.com/s/zkmblogkjk9t347/retinanet_mpvit_small_ms_3x_metrics.json">metrics</a> |
| MPViT-B | RetinaNet | MS+3x | 48.3 | - | 85M | 482G | <a href="https://dl.dropbox.com/s/z7scimhn6dy06kh/retinanet_mpvit_base_ms_3x.pth">model</a> \| <a href="https://dl.dropbox.com/s/d5n3ujikitnghvo/retinanet_mpvit_base_ms_3x_metrics.json">metrics</a> |
| MPViT-T | Mask R-CNN | 1x | 42.2 | 39.0 | 28M | 216G | <a href="https://dl.dropbox.com/s/pxregez7a3hdqzl/mask_rcnn_mpvit_tiny_1x.pth">model</a> \| <a href="https://dl.dropbox.com/s/juczvf6jlx131pn/mask_rcnn_mpvit_tiny_1x_metrics.json">metrics</a> |
| MPViT-XS | Mask R-CNN | 1x | 44.2 | 40.4 | 30M | 231G | <a href="https://dl.dropbox.com/s/os9vk9co87ppg1y/mask_rcnn_mpvit_xs_1x.pth">model</a> \| <a href="https://dl.dropbox.com/s/4rhc3gzuhrp7b0a/mask_rcnn_mpvit_xs_1x_metrics.json">metrics</a> |
| MPViT-S | Mask R-CNN | 1x | 46.4 | 42.4 | 43M | 268G | <a href="https://dl.dropbox.com/s/ucfwkf65qqklcqn/mask_rcnn_mpvit_small_1x.pth">model</a> \| <a href="https://dl.dropbox.com/s/9lyuwyc509q69o9/mask_rcnn_mpvit_small_1x_metrics.json">metrics</a> |
| MPViT-B | Mask R-CNN | 1x | 48.2 | 43.5 | 95M | 503G | <a href="https://dl.dropbox.com/s/m7p17jp5qaf41lm/mask_rcnn_mpvit_base_1x.pth">model</a> \| <a href="https://dl.dropbox.com/s/v639wuwa08729mn/mask_rcnn_mpvit_base_1x_metrics.json">metrics</a> |
| MPViT-T | Mask R-CNN | MS+3x | 44.8 | 41.0 | 28M | 216G | <a href="https://dl.dropbox.com/s/2wu26zurp5u5057/mask_rcnn_mpvit_tiny_ms_3x.pth">model</a> \| <a href="https://dl.dropbox.com/s/6fz98386gix3nif/mask_rcnn_mpvit_tiny_ms_3x_metrics.json">metrics</a> |
| MPViT-XS | Mask R-CNN | MS+3x | 46.6 | 42.3 | 30M | 231G | <a href="https://dl.dropbox.com/s/yw85vk53kcdi9ed/mask_rcnn_mpvit_xs_ms_3x.pth">model</a> \| <a href="https://dl.dropbox.com/s/3prmnkynixtmw4f/mask_rcnn_mpvit_xs_ms_3x_metrics.json">metrics</a> |
| MPViT-S | Mask R-CNN | MS+3x | 48.4 | 43.9 | 43M | 268G | <a href="https://dl.dropbox.com/s/b0fohmjmggahnny/mask_rcnn_mpvit_small_ms_3x.pth">model</a> \| <a href="https://dl.dropbox.com/s/fcfpo2qcfzydsyc/mask_rcnn_mpvit_small_ms_3x_metrics.json">metrics</a> |
| MPViT-B | Mask R-CNN | MS+3x | 49.5 | 44.5 | 95M | 503G | <a href="https://dl.dropbox.com/s/9apn9ywk5ujk01s/mask_rcnn_mpvit_base_ms_3x.pth">model</a> \| <a href="https://dl.dropbox.com/s/jcdh98hir236e9x/mask_rcnn_mpvit_base_ms_3x_metrics.json">metrics</a> |
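
The detection weights above are regular PyTorch `.pth` files saved through Detectron2. A quick way to sanity-check a download, independent of the training code, is to load it on CPU and inspect its keys and parameter count; the file name below is just an example.

```python
import torch

# Example file name; substitute whichever checkpoint you downloaded.
ckpt = torch.load("mask_rcnn_mpvit_small_ms_3x.pth", map_location="cpu")
print(list(ckpt.keys()))                 # usually "model" plus trainer state

state = ckpt.get("model", ckpt)          # fall back to a raw state dict
tensors = {k: v for k, v in state.items() if torch.is_tensor(v)}
total = sum(v.numel() for v in tensors.values())
print(f"{len(tensors)} tensors, {total / 1e6:.1f}M parameters")

# Backbone-only breakdown (Detectron2 prefixes backbone weights with "backbone.").
backbone = sum(v.numel() for k, v in tensors.items() if k.startswith("backbone"))
print(f"backbone: {backbone / 1e6:.1f}M parameters")
```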

Deformable-DETR

All models are trained using the same training recipe.

Please refer to deformable_detr/ for the details.

| backbone | box mAP | epochs | link |
| --- | --- | --- | --- |
| ResNet-50 | 44.5 | 50 | - |
| CoaT-lite S | 47.0 | 50 | link |
| CoaT-S | 48.4 | 50 | link |
| MPViT-S | 49.0 | 50 | link |

Main results on ADE20K Semantic segmentation

All models are trained using ImageNet-1K pretrained weights.

Please refer to semantic_segmentation/ for the details.

| Backbone | Method | Crop Size | Lr Schd | mIoU | #params | FLOPs | weight |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MPViT-S | UperNet | 512x512 | 160K | 48.3 | 52M | 943G | weight |
| MPViT-B | UperNet | 512x512 | 160K | 50.3 | 105M | 1185G | weight |

Getting Started

:raised_hand: We use pytorch==1.7.0, torchvision==0.8.1, and cuda==10.1 on NVIDIA V100 GPUs. If you use a different CUDA version, you may obtain slightly different accuracies, but the differences are negligible.
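
For a quick check that your local setup roughly matches the versions above (this is only a convenience snippet, not part of the official code):

```python
import torch
import torchvision

print("pytorch     :", torch.__version__)        # tested with 1.7.0
print("torchvision :", torchvision.__version__)  # tested with 0.8.1
print("cuda        :", torch.version.cuda)       # tested with 10.1
print("gpu         :", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
```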

Acknowledgement

This repository is built using the timm library and the DeiT, CoaT, Detectron2, and mmsegmentation repositories.

This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2020-0-00004, Development of Previsional Intelligence based on Long-term Visual Memory Network and No. 2014-3-00123, Development of High Performance Visual BigData Discovery Platform for Large-Scale Realtime Data Analysis).

License

Please refer to MPViT LSA.

<a name="CitingMPViT"></a>Citing MPViT

    @inproceedings{lee2022mpvit,
      title={MPViT: Multi-Path Vision Transformer for Dense Prediction},
      author={Youngwan Lee and Jonghee Kim and Jeffrey Willette and Sung Ju Hwang},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
      year={2022}
    }