


This is the official implementation of the paper "Token Shift Transformer for Video Classification". We achieve state-of-the-art (SOTA) performance of 80.40% top-1 accuracy on the Kinetics-400 validation set. Paper link

<div align="center"> <img src="demo/tokshift.PNG" width="800px"/> </div>


July 11, 2021

  1. Video mp4 files are now decoded directly during training and evaluation.
  2. The code now follows the standardized timm code base.
  3. Performance improves further over the numbers reported in the paper (+0.5 on average).

April 22, 2021

April 16, 2021

Model Zoo and Baselines

| architecture | backbone | pretrain | Res & Frames | GFLOPs x views | top1 | config |
| --- | --- | --- | --- | --- | --- | --- |
| ViT (Video) | Base16 | ImgNet21k | 224 & 8 | 134.7 x 30 | 76.02 link | k400_vit_8x32_224.yml |
| TokShift | Base16 | ImgNet21k | 224 & 8 | 134.7 x 30 | 77.28 link | k400_tokshift_div4_8x32_base_224.yml |
| TokShift (MR) | Base16 | ImgNet21k | 256 & 8 | 175.8 x 30 | 77.68 link | k400_tokshift_div4_8x32_base_256.yml |
| TokShift (HR) | Base16 | ImgNet21k | 384 & 8 | 394.7 x 30 | 78.14 link | k400_tokshift_div4_8x32_base_384.yml |
| TokShift | Base16 | ImgNet21k | 224 & 16 | 268.5 x 30 | 78.18 link | k400_tokshift_div4_16x32_base_224.yml |
| TokShift-Large (HR) | Large16 | ImgNet21k | 384 & 8 | 1397.6 x 30 | 79.83 link | k400_tokshift_div4_8x32_large_384.yml |
| TokShift-Large (HR) | Large16 | ImgNet21k | 384 & 12 | 2096.4 x 30 | 80.40 link | k400_tokshift_div4_12x32_large_384.yml |
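
To make the "GFLOPs x views" column easier to compare, the sketch below multiplies the per-view cost by the number of views for each row of the table above. It assumes the column follows the usual convention (cost of one view times the number of views evaluated per video); the dictionary keys and the `total_gflops` helper are illustrative names, not part of this repository.

```python
# Per-view GFLOPs and view counts copied from the model zoo table.
# Keys are illustrative labels: "<architecture>, <resolution> & <frames>".
MODELS = {
    "ViT (Video), 224 & 8": (134.7, 30),
    "TokShift, 224 & 8": (134.7, 30),
    "TokShift (MR), 256 & 8": (175.8, 30),
    "TokShift (HR), 384 & 8": (394.7, 30),
    "TokShift, 224 & 16": (268.5, 30),
    "TokShift-Large (HR), 384 & 8": (1397.6, 30),
    "TokShift-Large (HR), 384 & 12": (2096.4, 30),
}


def total_gflops(gflops_per_view: float, views: int) -> float:
    """Total inference cost for one video: per-view cost times number of views."""
    return gflops_per_view * views


if __name__ == "__main__":
    for name, (gflops, views) in MODELS.items():
        print(f"{name}: {total_gflops(gflops, views):.1f} GFLOPs per video")
```

Under this reading, the strongest model (384 & 12) costs roughly 15 times as much per video as the 224 & 8 base model.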

Below is the training log. To save time, validation during training uses 3-view evaluation instead of 30 views.

<div align="center"> <img src="demo/trnlog.PNG" width="800px"/> </div>
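
Multi-view evaluation combines the scores of several views of the same video into one prediction, so using 3 views instead of 30 simply averages over fewer entries. The following is a minimal sketch of that idea, assuming per-view class probabilities are averaged; this README does not state whether the repository averages probabilities or raw logits, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one view's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_from_views(view_logits: Sequence[Sequence[float]]) -> int:
    """Average per-view class probabilities and return the top-1 class index."""
    if not view_logits:
        raise ValueError("need at least one view")
    probs = [softmax(v) for v in view_logits]
    num_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]
    return max(range(num_classes), key=mean.__getitem__)
```

Passing 3 or 30 rows to `predict_from_views` uses the same code path; only the cost of producing the rows changes.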


Quick Start


  1. Download the ImageNet-22k pretrained weights for Base16 and Large16.
  2. Prepare the Kinetics-400 dataset in the following structure, including the trainValTest lists:
|_ frames331_train
|  |_ [category name 0]
|  |  |_ [video name 0]
|  |  |  |_ img_00001.jpg
|  |  |  |_ img_00002.jpg
|  |  |  |_ ...
|  |  |
|  |  |_ [video name 1]
|  |  |  |_ img_00001.jpg
|  |  |  |_ img_00002.jpg
|  |  |  |_ ...
|  |  |_ ...
|  |
|  |_ [category name 1]
|  |  |_ [video name 0]
|  |  |  |_ img_00001.jpg
|  |  |  |_ img_00002.jpg
|  |  |  |_ ...
|  |  |
|  |  |_ [video name 1]
|  |  |  |_ img_00001.jpg
|  |  |  |_ img_00002.jpg
|  |  |  |_ ...
|  |  |_ ...
|  |_ ...
|_ frames331_val
|  |_ [category name 0]
|  |  |_ [video name 0]
|  |  |  |_ img_00001.jpg
|  |  |  |_ img_00002.jpg
|  |  |  |_ ...
|  |  |
|  |  |_ [video name 1]
|  |  |  |_ img_00001.jpg
|  |  |  |_ img_00002.jpg
|  |  |  |_ ...
|  |  |_ ...
|  |
|  |_ [category name 1]
|  |  |_ [video name 0]
|  |  |  |_ img_00001.jpg
|  |  |  |_ img_00002.jpg
|  |  |  |_ ...
|  |  |
|  |  |_ [video name 1]
|  |  |  |_ img_00001.jpg
|  |  |  |_ img_00002.jpg
|  |  |  |_ ...
|  |  |_ ...
|  |_ ...
|_ trainValTest
   |_ train.txt
   |_ val.txt
  3. Use the training script (train.sh) to train on Kinetics-400:
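#!/usr/bin/env python
import os

# Fine-tune from the ImageNet-21k ViT-L/16 weights on Kinetics-400.
# The address after tcp:// is left blank here; supply the host:port for your setup.
cmd = "python -u main_ddp_shift_v3.py \
        --multiprocessing-distributed --world-size 1 --rank 0 \
        --dist-ur tcp:// \
        --tune_from pretrain/ViT-L_16_Img21.npz \
        --cfg config/custom/kinetics400/k400_tokshift_div4_12x32_large_384.yml"

# Run the assembled command.
os.system(cmd)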


  4. Use the test script (test.sh) to evaluate on Kinetics-400:

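#!/usr/bin/env python
import os

# Evaluate a trained checkpoint on Kinetics-400.
# The address after tcp:// is left blank here; supply the host:port for your setup.
cmd = "python -u main_ddp_shift_v3.py \
        --multiprocessing-distributed --world-size 1 --rank 0 \
        --dist-ur tcp:// \
        --evaluate \
        --resume model_zoo/ViT-B_16_k400_dense_cls400_segs8x32_e18_lr0.1_B21_VAL224/best_vit_B8x32x224_k400.pth \
        --cfg config/custom/kinetics400/k400_vit_8x32_224.yml"

# Run the assembled command.
os.system(cmd)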
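
Before launching training, it can help to confirm that the extracted frames match the directory layout from step 2. The sketch below walks one split directory (`<split>/<category>/<video>/img_*.jpg`) and counts frames per video. The helper names `count_frames` and `empty_videos` are hypothetical and not part of this repository, and the format of `train.txt` and `val.txt` is not described here, so the lists are not checked.

```python
from pathlib import Path
from typing import Dict, List


def count_frames(split_dir: Path) -> Dict[str, int]:
    """Map '<category>/<video>' to its number of img_*.jpg frames under split_dir."""
    counts: Dict[str, int] = {}
    for category in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        for video in sorted(p for p in category.iterdir() if p.is_dir()):
            key = f"{category.name}/{video.name}"
            counts[key] = len(list(video.glob("img_*.jpg")))
    return counts


def empty_videos(split_dir: Path) -> List[str]:
    """Return the videos that contain no extracted frames."""
    return [name for name, n in count_frames(split_dir).items() if n == 0]


if __name__ == "__main__":
    # Assumed location: run from the dataset root that contains frames331_train.
    split = Path("frames331_train")
    if split.is_dir():
        counts = count_frames(split)
        print(f"{len(counts)} videos, {len(empty_videos(split))} without frames")
```

The same call works for `frames331_val`, since both splits share the category/video/frame layout.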


VideoNet is written and maintained by Dr. Hao Zhang and Dr. Yanbin Hao.


If you find TokShift-xfmr useful in your research, please cite it using the following BibTeX entry.

@article{tokshift2021,
  title={Token Shift Transformer for Video Classification},
  author={Hao Zhang and Yanbin Hao and Chong-Wah Ngo},
  journal={ACM Multimedia 2021},
}


Thanks to the following GitHub projects: