# MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features [[arXiv](https://arxiv.org/abs/2209.15159)]

This repository contains MobileViTv3's source code for training and evaluation. It uses the CVNets library and is inspired by MobileViT (paper, code).

## Installation and Training Models

We recommend using Python 3.8+ and PyTorch (version >= 1.8.0) in a conda environment. For setting up the Python environment with conda, see here.
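
A minimal sketch of such an environment (the environment name is hypothetical, and the per-variant yml files referenced below pin the exact training dependencies):

```bash
# Illustrative setup only; the provided yml files are what was used for training.
conda create -n mobilevitv3 python=3.8
conda activate mobilevitv3
pip install "torch>=1.8.0" torchvision
```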

### MobileViTv3-S, XS, XXS

Download MobileViTv1 and replace its files with the ones provided in the MobileViTv3-v1 folder. The conda environment used for training is environment_cvnet.yml. Then install according to the instructions in the downloaded repository. For training and evaluation, use the training-and-evaluation readme provided in the downloaded repository; a sketch of the file overlay is shown below.
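
For reference, the overlay-and-install flow might look like this; directory names are hypothetical, and the final install step should follow the downloaded repository's own readme:

```bash
# Hypothetical layout: ml-cvnets/ holds the downloaded MobileViTv1 (CVNets) code;
# MobileViTv3-v1/ is the folder provided in this repository.
cp -r MobileViTv3-v1/* ml-cvnets/

# Recreate the training environment from the provided spec.
conda env create -f environment_cvnet.yml

# Install the patched code base per its own instructions
# (an editable install is shown as one common option).
cd ml-cvnets
pip install -e .
```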

### MobileViTv3-1.0, 0.75, 0.5

Download MobileViTv2 and replace its files with the ones provided in the MobileViTv3-v2 folder. The conda environment used for training is environment_mbvt2.yml. Then install according to the instructions in the downloaded repository. For training and evaluation, use the training-and-evaluation readme provided in the downloaded repository; see the analogous sketch below.
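
The v2-based flow mirrors the one above, assuming the MobileViTv2 code was downloaded into a hypothetical ml-cvnets-v2/ directory:

```bash
# Overlay the MobileViTv3-v2 files onto the downloaded MobileViTv2 code base.
cp -r MobileViTv3-v2/* ml-cvnets-v2/
conda env create -f environment_mbvt2.yml
cd ml-cvnets-v2
pip install -e .
```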

## Trained models

Download the trained MobileViTv3 models from here. The checkpoint_ema_best.pt file inside each model folder is used to generate the reported accuracy. Low-latency models are built by reducing the number of MobileViTv3 blocks in 'layer4' from 4 to 2; please refer to the paper for more details. Note that for segmentation and detection, only the backbone architecture parameters are listed. A quick checkpoint sanity check is sketched below.
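
As referenced above, the snippet below loads a downloaded checkpoint and counts its parameters, which should roughly match the tables that follow. The folder name is taken from the classification table, and the unwrap key is an assumption about the checkpoint layout:

```bash
python - <<'EOF'
import torch

# Illustrative path; adjust to where the model folder was downloaded.
ckpt = torch.load("mobilevitv3_S_e300_7930/checkpoint_ema_best.pt", map_location="cpu")
state = ckpt.get("model_state_dict", ckpt)  # unwrap if nested (assumption); plain state_dicts pass through
n = sum(t.numel() for t in state.values() if torch.is_tensor(t))
print(f"{n / 1e6:.2f} M parameters")  # expect roughly 5.8 M for MobileViTv3-S
EOF
```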

### Classification

ImageNet-1K:

| Model name | Accuracy (%) | Parameters (Million) | FLOPs (Million) | Folder name |
| --- | --- | --- | --- | --- |
| MobileViTv3-S | 79.3 | 5.8 | 1841 | mobilevitv3_S_e300_7930 |
| MobileViTv3-XS | 76.7 | 2.5 | 927 | mobilevitv3_XS_e300_7671 |
| MobileViTv3-XXS | 70.98 | 1.2 | 289 | mobilevitv3_XXS_e300_7098 |
| MobileViTv3-1.0 | 78.64 | 5.1 | 1876 | mobilevitv3_1_0_0 |
| MobileViTv3-0.75 | 76.55 | 3.0 | 1064 | mobilevitv3_0_7_5 |
| MobileViTv3-0.5 | 72.33 | 1.4 | 481 | mobilevitv3_0_5_0 |

ImageNet-1K using low-latency models:

| Model name | Accuracy (%) | Parameters (Million) | FLOPs (Million) | Folder name |
| --- | --- | --- | --- | --- |
| MobileViTv3-S-L2 | 79.06 | 5.2 | 1651 | mobilevitv3_S_L2_e300_7906 |
| MobileViTv3-XS-L2 | 76.10 | 2.3 | 853 | mobilevitv3_XS_L2_e300_7610 |
| MobileViTv3-XXS-L2 | 70.23 | 1.1 | 256 | mobilevitv3_XXS_L2_e300_7023 |

### Segmentation

PASCAL VOC 2012:

| Model name | mIoU (%) | Parameters (Million) | Folder name |
| --- | --- | --- | --- |
| MobileViTv3-S | 79.59 | 7.2 | mobilevitv3_S_voc_e50_7959 |
| MobileViTv3-XS | 78.77 | 3.3 | mobilevitv3_XS_voc_e50_7877 |
| MobileViTv3-XXS | 74.04 | 2.0 | mobilevitv3_XXS_voc_e50_7404 |
| MobileViTv3-1.0 | 80.04 | 13.6 | mobilevitv3_voc_1_0_0 |
| MobileViTv3-0.5 | 76.48 | 6.3 | mobilevitv3_voc_0_5_0 |

ADE20K:

| Model name | mIoU (%) | Parameters (Million) | Folder name |
| --- | --- | --- | --- |
| MobileViTv3-1.0 | 39.13 | 13.6 | mobilevitv3_ade20k_1_0_0 |
| MobileViTv3-0.75 | 36.43 | 9.7 | mobilevitv3_ade20k_0_7_5 |
| MobileViTv3-0.5 | 33.57 | 6.4 | mobilevitv3_ade20k_0_5_0 |

### Detection

MS-COCO:

| Model name | mAP (%) | Parameters (Million) | Folder name |
| --- | --- | --- | --- |
| MobileViTv3-S | 27.3 | 5.5 | mobilevitv3_S_coco_e200_2730 |
| MobileViTv3-XS | 25.6 | 2.7 | mobilevitv3_XS_coco_e200_2560 |
| MobileViTv3-XXS | 19.3 | 1.5 | mobilevitv3_XXS_coco_e200_1930 |
| MobileViTv3-1.0 | 27.0 | 5.8 | mobilevitv3_coco_1_0_0 |
| MobileViTv3-0.75 | 25.0 | 3.7 | mobilevitv3_coco_0_7_5 |
| MobileViTv3-0.5 | 21.8 | 2.0 | mobilevitv3_coco_0_5_0 |

## Citation

If you find this repository useful, please consider giving it a star :star: and citing it :mega::

```bibtex
@misc{wadekar2022mobilevitv3,
  title  = {MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features},
  author = {Wadekar, Shakti N. and Chaurasia, Abhishek},
  doi    = {10.48550/ARXIV.2209.15159},
  year   = {2022}
}
```