Efficient Video Transformers with Spatial-Temporal Token Selection

Official PyTorch implementation of STTS, from the following paper:

Efficient Video Transformers with Spatial-Temporal Token Selection, ECCV 2022.

Junke Wang<sup>*</sup>, Xitong Yang<sup>*</sup>, Hengduo Li, Li Liu, Zuxuan Wu, Yu-Gang Jiang.

Fudan University, University of Maryland, BirenTech Research


<p align="center"> <img src="./imgs/teaser.png" width=100% height=100% class="center"> </p>

We present STTS, a token selection framework that dynamically selects a few informative tokens in both temporal and spatial dimensions conditioned on input video samples.
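To illustrate the core idea, here is a minimal sketch of score-based token selection: a lightweight scorer assigns an importance score to each token, and only the top-k tokens are kept. The module and parameter names below are hypothetical, and plain `topk` is used for simplicity; the paper trains the selector end-to-end with a differentiable Top-K operator.

```python
import torch
import torch.nn as nn

class TokenSelector(nn.Module):
    """Minimal sketch of score-based token selection (hypothetical names,
    not the official STTS implementation)."""

    def __init__(self, dim, keep_ratio=0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.scorer = nn.Linear(dim, 1)  # predicts one importance score per token

    def forward(self, x):
        # x: (batch, num_tokens, dim)
        B, N, D = x.shape
        k = max(1, int(N * self.keep_ratio))
        scores = self.scorer(x).squeeze(-1)        # (B, N) token scores
        idx = scores.topk(k, dim=1).indices        # keep the k highest-scoring tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, D)  # (B, k, D) gather indices
        return torch.gather(x, 1, idx)

x = torch.randn(2, 16, 64)                 # e.g. 16 temporal or spatial tokens
selector = TokenSelector(dim=64, keep_ratio=0.5)
out = selector(x)
print(out.shape)  # torch.Size([2, 8, 64])
```

The same mechanism can be applied along the temporal dimension (scoring frames) or the spatial dimension (scoring patches within a frame), which is what the T/S superscripts and subscripts in the model names below refer to.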

Model Zoo

MViT with STTS on Kinetics-400

| name | acc@1 | FLOPs | model |
| :--- | :---: | :---: | :---: |
| MViT-T<sup>0</sup><sub>0.9</sub>-S<sup>4</sup><sub>0.9</sub> | 78.1 | 56.4 | model |
| MViT-T<sup>0</sup><sub>0.8</sub>-S<sup>4</sup><sub>0.9</sub> | 77.9 | 47.2 | model |
| MViT-T<sup>0</sup><sub>0.6</sub>-S<sup>4</sup><sub>0.9</sub> | 77.5 | 38.1 | model |
| MViT-T<sup>0</sup><sub>0.5</sub>-S<sup>4</sup><sub>0.7</sub> | 76.6 | 23.3 | model |
| MViT-T<sup>0</sup><sub>0.4</sub>-S<sup>4</sup><sub>0.6</sub> | 75.6 | 12.1 | model |

VideoSwin with STTS on Kinetics-400

| name | acc@1 | FLOPs | model |
| :--- | :---: | :---: | :---: |
| VideoSwin-T<sup>0</sup><sub>0.9</sub> | 81.9 | 252.5 | model |
| VideoSwin-T<sup>0</sup><sub>0.8</sub> | 81.6 | 223.4 | model |
| VideoSwin-T<sup>0</sup><sub>0.6</sub> | 81.4 | 181.4 | model |
| VideoSwin-T<sup>0</sup><sub>0.5</sub> | 81.1 | 121.6 | model |
| VideoSwin-T<sup>0</sup><sub>0.4</sub> | 80.7 | 91.4 | model |

Installation

Please check MViT and VideoSwin for installation instructions and data preparation.

Training and Evaluation

MViT

For both training and evaluation with MViT as the backbone, use:

```shell
cd MViT
python tools/run_net.py --cfg path_to_your_config
```

For example, to evaluate MViT-T<sup>0</sup><sub>0.6</sub>-S<sup>4</sup><sub>0.9</sub>, run:

```shell
python tools/run_net.py --cfg configs/Kinetics/t0_0.6_s4_0.9.yaml
```

VideoSwin

For training, use:

```shell
cd VideoSwin
bash tools/dist_train.sh path_to_your_config $NUM_GPUS --checkpoint path_to_your_checkpoint --validate --test-last
```

For evaluation, use:

```shell
bash tools/dist_test.sh path_to_your_config path_to_your_checkpoint $NUM_GPUS --eval top_k_accuracy
```

For example, to evaluate VideoSwin-T<sup>0</sup><sub>0.9</sub> on a single node with 8 GPUs, run:

```shell
cd VideoSwin
bash tools/dist_test.sh configs/Kinetics/t0_0.875.py ./checkpoints/t0_0.875.pth 8 --eval top_k_accuracy
```

License

This project is released under the MIT license. Please see the LICENSE file for more information.

Citation

If you find this repository helpful, please consider citing:

```
@inproceedings{wang2022efficient,
  title={Efficient video transformers with spatial-temporal token selection},
  author={Wang, Junke and Yang, Xitong and Li, Hengduo and Liu, Li and Wu, Zuxuan and Jiang, Yu-Gang},
  booktitle={ECCV},
  year={2022}
}
```