Temporally Efficient Vision Transformer for Video Instance Segmentation (CVPR 2022, Oral)

by Shusheng Yang<sup>1,3</sup>, Xinggang Wang<sup>1 :email:</sup>, Yu Li<sup>4</sup>, Yuxin Fang<sup>1</sup>, Jiemin Fang<sup>1,2</sup>, Wenyu Liu<sup>1</sup>, Xun Zhao<sup>3</sup>, Ying Shan<sup>3</sup>.

<sup>1</sup> School of EIC, HUST, <sup>2</sup> AIA, HUST, <sup>3</sup> ARC Lab, Tencent PCG, <sup>4</sup> IDEA.

(<sup>:email:</sup>) corresponding author.

<img src="resources/gif/0b97736357.gif" width="33%"/><img src="resources/gif/00f88c4f0a.gif" width="33%"/><img src="resources/gif/2e21c7e59b.gif" width="33%"/> <img src="resources/gif/4b1a561480.gif" width="33%"/><img src="resources/gif/49fcb27427.gif" width="33%"/><img src="resources/gif/91eb6cb6dc.gif" width="33%"/> <br/>

<br/> <div align="center"> <img width="100%" alt="Overall Arch" src="resources/tevit.png"> </div>

Models and Main Results

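| Name | AP | AP@50 | AP@75 | AR@1 | AR@10 | Params | model | submission |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TeViT_MsgShifT | 46.3 | 70.6 | 50.9 | 45.2 | 54.3 | 161.83 M | link | link |
| TeViT_MsgShifT_MST | 46.9 | 70.1 | 52.9 | 45.0 | 53.4 | 161.83 M | link | link |

| Name | AP | AP@50 | AP@75 | AR@1 | AR@10 | Params | model | submission |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TeViT_R50 | 42.1 | 67.8 | 44.8 | 41.3 | 49.9 | 172.3 M | link | link |
| TeViT_Swin-L_MST | 56.8 | 80.6 | 63.1 | 52.0 | 63.3 | 343.86 M | link | link |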

Installation

Prerequisites

Prepare

git clone https://github.com/hustvl/TeViT.git
conda create --name tevit python=3.7.7
conda activate tevit
pip install git+https://github.com/youtubevos/cocoapi.git#"egg=pycocotools&subdirectory=PythonAPI"
# Required versions: torch==1.9.0, torchvision==0.10.0, mmcv==1.4.8
pip install -r requirements.txt
python setup.py develop

Arrange the dataset under data/youtubevis as follows:

TeViT
├── data
│   ├── youtubevis
│   │   ├── train
│   │   │   ├── 003234408d
│   │   │   ├── ...
│   │   ├── val
│   │   │   ├── ...
│   │   ├── annotations
│   │   │   ├── train.json
│   │   │   ├── valid.json
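
The layout above can be checked before running training or inference. The following is a minimal sketch, not part of the repository: the helper name `missing_entries` and the `EXPECTED` list are hypothetical and only mirror the entries shown in the tree.

```python
from pathlib import Path

# Relative paths copied from the directory tree above (assumed to be complete).
EXPECTED = [
    "train",
    "val",
    "annotations/train.json",
    "annotations/valid.json",
]


def missing_entries(root="data/youtubevis"):
    """Return the expected entries that are absent under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries()
    if missing:
        print("Missing under data/youtubevis:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```

Run it from the TeViT root directory so that the default relative path resolves correctly.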

Inference

python tools/test_vis.py configs/tevit/tevit_msgshift.py $PATH_TO_CHECKPOINT

After inference, the predicted results are stored in results.json. Submit this file to the evaluation server to get the final performance.
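
Evaluation servers often expect a compressed upload rather than a raw JSON file. The sketch below is an illustration under two assumptions that the README does not confirm: that results.json holds a JSON list of predictions, and that the server accepts a zip archive with results.json at its root. The helper name `pack_results` and the output name results.zip are hypothetical.

```python
import json
import zipfile
from pathlib import Path


def pack_results(results_path="results.json", zip_path="results.zip"):
    """Check that the results file parses as a JSON list, then zip it.

    Returns the number of predictions found in the file.
    """
    results_path = Path(results_path)
    with results_path.open() as f:
        predictions = json.load(f)
    # Assumption: predictions are stored as a top-level JSON list.
    if not isinstance(predictions, list):
        raise ValueError("expected a JSON list of predictions")

    # Store the file at the archive root under its own name.
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(results_path, arcname=results_path.name)
    return len(predictions)
```

Calling `pack_results()` in the directory that contains results.json writes results.zip next to it. Check the server's submission instructions for the exact format it requires.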

Training

./tools/dist_train.sh configs/tevit/tevit_msgshift.py 8 --no-validate --cfg-options load_from=$PATH_TO_PRETRAINED_WEIGHT
./tools/dist_train.sh configs/tevit/tevit_msgshift_mstrain.py 8 --no-validate --cfg-options load_from=$PATH_TO_PRETRAINED_WEIGHT

Acknowledgement :heart:

This code is mainly based on mmdetection and QueryInst. Thanks for their awesome work and great contributions to the computer vision community!

Citation

If you find our paper and code useful in your research, please consider giving the repository a star :star: and citing the paper :pencil: :

@inproceedings{yang2022tevit,
  title={Temporally Efficient Vision Transformer for Video Instance Segmentation},
  author={Yang, Shusheng and Wang, Xinggang and Li, Yu and Fang, Yuxin and Fang, Jiemin and Liu, Wenyu and Zhao, Xun and Shan, Ying},
  booktitle =   {Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR)},
  year      =   {2022}
}