
Efficient Video Transformers with Spatial-Temporal Token Selection

Official PyTorch implementation of STTS, from the following paper:

Efficient Video Transformers with Spatial-Temporal Token Selection, ECCV 2022.

Junke Wang*, Xitong Yang*, Hengduo Li, Li Liu, Zuxuan Wu, Yu-Gang Jiang.

Fudan University, University of Maryland, BirenTech Research

We present STTS, a token selection framework that dynamically selects a few informative tokens in both temporal and spatial dimensions conditioned on input video samples.
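To make the idea of score-based token selection concrete, here is a minimal sketch in plain Python. It is not the code from this repository: it assumes each token (for example, each frame along the temporal dimension) already has an importance score, and it simply keeps the top-scoring fraction. The function name `select_top_k`, the `keep_ratio` parameter, the rounding rule, and the example scores are all hypothetical. It does not reproduce the learned scorer or the training procedure described in the paper.

```python
import math
from typing import List, Sequence


def select_top_k(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return the indices of the highest-scoring tokens, in their original order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Assumption: round the number of kept tokens up and keep at least one.
    k = max(1, math.ceil(len(scores) * keep_ratio))
    # Rank token indices by descending score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Sort the kept indices so the temporal (or spatial) order is preserved.
    return sorted(ranked[:k])


# Hypothetical per-frame importance scores for an 8-frame clip.
frame_scores = [0.10, 0.80, 0.30, 0.95, 0.05, 0.60, 0.20, 0.70]
kept_frames = select_top_k(frame_scores, keep_ratio=0.5)
print(kept_frames)  # → [1, 3, 5, 7]
```

The same helper could be applied to per-patch scores within a frame to illustrate selection along the spatial dimension.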

Model Zoo

MViT with STTS on Kinetics-400

name	acc@1 (%)	FLOPs (G)	model
MViT-T<sup>0</sup><sub>0.9</sub>-S<sup>4</sup><sub>0.9</sub>	78.1	56.4	model
MViT-T<sup>0</sup><sub>0.8</sub>-S<sup>4</sup><sub>0.9</sub>	77.9	47.2	model
MViT-T<sup>0</sup><sub>0.6</sub>-S<sup>4</sup><sub>0.9</sub>	77.5	38.1	model
MViT-T<sup>0</sup><sub>0.5</sub>-S<sup>4</sup><sub>0.7</sub>	76.6	23.3	model
MViT-T<sup>0</sup><sub>0.4</sub>-S<sup>4</sup><sub>0.6</sub>	75.6	12.1	model

VideoSwin with STTS on Kinetics-400

name	acc@1 (%)	FLOPs (G)	model
VideoSwin-T<sup>0</sup><sub>0.9</sub>	81.9	252.5	model
VideoSwin-T<sup>0</sup><sub>0.8</sub>	81.6	223.4	model
VideoSwin-T<sup>0</sup><sub>0.6</sub>	81.4	181.4	model
VideoSwin-T<sup>0</sup><sub>0.5</sub>	81.1	121.6	model
VideoSwin-T<sup>0</sup><sub>0.4</sub>	80.7	91.4	model

Installation

Please check MViT and VideoSwin for installation instructions and data preparation.

Training and Evaluation

MViT

For both training and evaluation with MViT as the backbone, run:

cd MViT

python tools/run_net.py --cfg path_to_your_config

For example, to evaluate MViT-T<sup>0</sup><sub>0.6</sub>-S<sup>4</sup><sub>0.9</sub>, run:

python tools/run_net.py --cfg configs/Kinetics/t0_0.6_s4_0.9.yaml
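The config file names appear to mirror the model names in the Model Zoo. The sketch below shows one way to read them, under the assumption that `t`/`s` stand for temporal/spatial selection, the integer is the index of the layer where selection is applied, and the decimal is the ratio of tokens kept. This reading is an assumption, not something the repository states here, and `parse_config_name` is a hypothetical helper that is not part of the codebase.

```python
import re
from typing import Dict, Tuple

# Matches segments such as "t0_0.6" or "s4_0.9": kind, layer index, keep ratio.
_STAGE = re.compile(r"([ts])(\d+)_(\d+(?:\.\d+)?)")
_KIND = {"t": "temporal", "s": "spatial"}


def parse_config_name(path: str) -> Dict[str, Tuple[int, float]]:
    """Map a config path to {selection kind: (layer index, keep ratio)}."""
    # Only the file name is inspected, so directory names cannot match.
    filename = path.rsplit("/", 1)[-1]
    return {
        _KIND[kind]: (int(layer), float(ratio))
        for kind, layer, ratio in _STAGE.findall(filename)
    }


print(parse_config_name("configs/Kinetics/t0_0.6_s4_0.9.yaml"))
# → {'temporal': (0, 0.6), 'spatial': (4, 0.9)}
```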

VideoSwin

For training, run:

cd VideoSwin

bash tools/dist_train.sh path_to_your_config $NUM_GPUS --checkpoint path_to_your_checkpoint --validate --test-last

For evaluation, run:

bash tools/dist_test.sh path_to_your_config path_to_your_checkpoint $NUM_GPUS --eval top_k_accuracy

For example, to evaluate VideoSwin-T<sup>0</sup><sub>0.9</sub> on a single node with 8 GPUs, run:

cd VideoSwin

bash tools/dist_test.sh configs/Kinetics/t0_0.875.py ./checkpoints/t0_0.875.pth 8 --eval top_k_accuracy

License

This project is released under the MIT license. Please see the LICENSE file for more information.

Citation

If you find this repository helpful, please consider citing:

@inproceedings{wang2021efficient,
  title={Efficient video transformers with spatial-temporal token selection},
  author={Wang, Junke and Yang, Xitong and Li, Hengduo and Li, Liu and Wu, Zuxuan and Jiang, Yu-Gang},
  booktitle={ECCV},
  year={2022}
}