# TokShift-Transformer
This is the official implementation of the paper "Token Shift Transformer for Video Classification". We achieve SOTA performance of 80.40% top-1 accuracy on the Kinetics-400 validation set. Paper link
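For context, the core TokShift operator is a zero-parameter, zero-FLOP temporal shift of part of the [class]-token channels. Below is a minimal PyTorch sketch of that idea; the `(B, T, N, C)` layout, the `fold_div` divisor, and zero-padding at clip boundaries are illustrative assumptions, not the repository's exact implementation.

```python
import torch

def token_shift(x: torch.Tensor, fold_div: int = 4) -> torch.Tensor:
    """Temporally shift a slice of [class]-token channels across frames (sketch)."""
    B, T, N, C = x.shape                    # batch, frames, tokens, channels
    fold = C // fold_div
    cls_tok = x[:, :, 0, :]                 # (B, T, C): the [class] token of each frame
    shifted = cls_tok.clone()
    shifted[:, 1:, :fold] = cls_tok[:, :-1, :fold]                   # shift one slice forward in time
    shifted[:, :-1, fold:2 * fold] = cls_tok[:, 1:, fold:2 * fold]   # shift another slice backward
    shifted[:, 0, :fold] = 0                # zero-pad the temporal boundaries
    shifted[:, -1, fold:2 * fold] = 0
    out = x.clone()
    out[:, :, 0, :] = shifted               # patch tokens pass through unchanged
    return out
```

For a ViT-B/16 clip of 8 frames at 224x224, `x` would be shaped `(B, 8, 197, 768)`; the shift touches only the [class] token, so patch tokens and FLOPs are unchanged.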
## Updates
July 11, 2021
- Release this V1 version (the version used in the paper) to the public.
- We are preparing a V2 version with the following modifications, to be released within 1 week:
  - Decode video mp4 files directly during training/evaluation (a decoding sketch follows these updates)
  - Adopt the standardized timm code base
  - Performance improves further over the paper version (+0.5% on average)
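As a preview of the mp4-decoding change, here is a hedged sketch of direct decoding with uniform frame sampling; `torchvision.io.read_video` is one possible backend, not necessarily the one V2 uses.

```python
import torch
from torchvision.io import read_video

def sample_clip(path: str, num_frames: int = 8) -> torch.Tensor:
    """Decode an mp4 and uniformly sample num_frames frames (illustrative sketch)."""
    frames, _, _ = read_video(path, pts_unit="sec")          # (T, H, W, C) uint8
    idx = torch.linspace(0, frames.shape[0] - 1, num_frames).long()
    return frames[idx].permute(0, 3, 1, 2).float() / 255.0   # (num_frames, C, H, W)
```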
April 22, 2021
- Add Train/Test guidelines and data preparation
April 16, 2021
- Publish TokShift Transformer for video content understanding
## Model Zoo and Baselines
| architecture | backbone | pretrain | resolution & frames | GFLOPs x views | top-1 (%) | config |
| --- | --- | --- | --- | --- | --- | --- |
| ViT (Video) | Base16 | ImgNet21k | 224 & 8 | 134.7 x 30 | 76.02 link | k400_vit_8x32_224.yml |
| TokShift | Base16 | ImgNet21k | 224 & 8 | 134.7 x 30 | 77.28 link | k400_tokshift_div4_8x32_base_224.yml |
| TokShift (MR) | Base16 | ImgNet21k | 256 & 8 | 175.8 x 30 | 77.68 link | k400_tokshift_div4_8x32_base_256.yml |
| TokShift (HR) | Base16 | ImgNet21k | 384 & 8 | 394.7 x 30 | 78.14 link | k400_tokshift_div4_8x32_base_384.yml |
| TokShift | Base16 | ImgNet21k | 224 & 16 | 268.5 x 30 | 78.18 link | k400_tokshift_div4_16x32_base_224.yml |
| TokShift-Large (HR) | Large16 | ImgNet21k | 384 & 8 | 1397.6 x 30 | 79.83 link | k400_tokshift_div4_8x32_large_384.yml |
| TokShift-Large (HR) | Large16 | ImgNet21k | 384 & 12 | 2096.4 x 30 | 80.40 link | k400_tokshift_div4_12x32_large_384.yml |
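The "GFLOPs x views" column counts test-time views whose logits are averaged; 30 views on Kinetics typically means something like 10 temporal clips x 3 spatial crops. A hedged sketch of that view averaging, with `model` and `views` as hypothetical placeholders:

```python
import torch

@torch.no_grad()
def multi_view_top1(model, views):
    """Average logits over all clip/crop views, then take the top-1 class."""
    logits = torch.stack([model(v) for v in views])  # (num_views, B, num_classes)
    return logits.mean(dim=0).argmax(dim=-1)         # (B,) predicted labels
```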
Below is the training log; we use 3-view evaluation (instead of 30 views) during validation to save time.
<div align="center"> <img src="demo/trnlog.PNG" width="800px"/> </div>

## Installation
- PyTorch >= 1.7, torchvision
- tensorboardX
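A quick sanity check of the requirements above (a hedged helper, not part of this repository):

```python
import torch
import torchvision
import tensorboardX  # provides SummaryWriter for logging training curves

# Parse "1.7.1" / "1.13.1+cu117" style version strings and check the floor
major, minor = (int(v) for v in torch.__version__.split("+")[0].split(".")[:2])
assert (major, minor) >= (1, 7), "TokShift expects PyTorch >= 1.7"
print("torch", torch.__version__, "| torchvision", torchvision.__version__,
      "| CUDA available:", torch.cuda.is_available())
```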
## Quick Start
### Train
- Download ImageNet-21k pretrained weights from Base16 and Large16.
- Prepare the Kinetics-400 dataset organized in the following structure (annotation lists: trainValTest):
```
k400
|_ frames331_train
| |_ [category name 0]
| | |_ [video name 0]
| | | |_ img_00001.jpg
| | | |_ img_00002.jpg
| | | |_ ...
| | |
| | |_ [video name 1]
| | | |_ img_00001.jpg
| | | |_ img_00002.jpg
| | | |_ ...
| | |_ ...
| |
| |_ [category name 1]
| | |_ [video name 0]
| | | |_ img_00001.jpg
| | | |_ img_00002.jpg
| | | |_ ...
| | |
| | |_ [video name 1]
| | | |_ img_00001.jpg
| | | |_ img_00002.jpg
| | | |_ ...
| | |_ ...
| |_ ...
|
|_ frames331_val
| |_ [category name 0]
| | |_ [video name 0]
| | | |_ img_00001.jpg
| | | |_ img_00002.jpg
| | | |_ ...
| | |
| | |_ [video name 1]
| | | |_ img_00001.jpg
| | | |_ img_00002.jpg
| | | |_ ...
| | |_ ...
| |
| |_ [category name 1]
| | |_ [video name 0]
| | | |_ img_00001.jpg
| | | |_ img_00002.jpg
| | | |_ ...
| | |
| | |_ [video name 1]
| | | |_ img_00001.jpg
| | | |_ img_00002.jpg
| | | |_ ...
| | |_ ...
| |_ ...
|
|_ trainValTest
   |_ train.txt
   |_ val.txt
```
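Before training, it can help to sanity-check the layout above. This hedged sketch assumes the `k400` root and `frames331_*` folders shown and is not part of the repository:

```python
import os

ROOT = "k400"  # assumed dataset root; adjust to your path

for split in ("frames331_train", "frames331_val"):
    split_dir = os.path.join(ROOT, split)
    categories = sorted(os.listdir(split_dir))
    n_videos = sum(len(os.listdir(os.path.join(split_dir, cat))) for cat in categories)
    print(f"{split}: {len(categories)} categories, {n_videos} videos")

# Spot-check that one video actually contains extracted img_XXXXX.jpg frames
cat = categories[0]
vid = sorted(os.listdir(os.path.join(split_dir, cat)))[0]
frames = os.listdir(os.path.join(split_dir, cat, vid))
assert any(f.startswith("img_") and f.endswith(".jpg") for f in frames), vid
```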
- Use the training script (train.sh) to train on Kinetics-400 (note: despite the .sh suffix, it is a Python wrapper):
```python
#!/usr/bin/env python
import os

cmd = "python -u main_ddp_shift_v3.py \
      --multiprocessing-distributed --world-size 1 --rank 0 \
      --dist-url tcp://127.0.0.1:23677 \
      --tune_from pretrain/ViT-L_16_Img21.npz \
      --cfg config/custom/kinetics400/k400_tokshift_div4_12x32_large_384.yml"
os.system(cmd)
```
### Test
Use the test script (test.sh) to evaluate on Kinetics-400:
```python
#!/usr/bin/env python
import os

cmd = "python -u main_ddp_shift_v3.py \
      --multiprocessing-distributed --world-size 1 --rank 0 \
      --dist-url tcp://127.0.0.1:23677 \
      --evaluate \
      --resume model_zoo/ViT-B_16_k400_dense_cls400_segs8x32_e18_lr0.1_B21_VAL224/best_vit_B8x32x224_k400.pth \
      --cfg config/custom/kinetics400/k400_vit_8x32_224.yml"
os.system(cmd)
```
## Contributors
VideoNet is written and maintained by Dr. Hao Zhang and Dr. Yanbin Hao.
## Citing
If you find TokShift-xfmr useful in your research, please use the following BibTeX entry for citation.
```bibtex
@inproceedings{tokshift2021,
  title={Token Shift Transformer for Video Classification},
  author={Hao Zhang and Yanbin Hao and Chong-Wah Ngo},
  booktitle={ACM Multimedia},
  year={2021},
}
```
## Acknowledgement
Thanks to the following GitHub projects: