In Defense of Image Pre-Training for Spatiotemporal Recognition
[NEW!] 2022/7/8 - Our paper has been accepted by ECCV 2022.
2022/5/5 - We have released the code and models.
Overview
This is a PyTorch/GPU implementation of the paper In Defense of Image Pre-Training for Spatiotemporal Recognition.
<div align="center"> <img src="./imgs/method.png" width = "800" alt="Architecture" align=center /> <br> <div style="color:orange; border-bottom: 2px solid #d9d9d9; display: inline-block; color: #999; padding: 10px;"> Overview of Image Pre-Training & Spatiotemporal Fine-Tuning. </div> </div>

- The Image Pre-Training code is located in Image_Pre_Training, which is based on the timm repo.
- The Spatiotemporal Fine-Tuning code is a modification of mmaction2. Installation and preparation follow that repo.
- You can find the proposed STS 3D convolution in STS_Conv; a rough sketch of the general idea is given below.
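For readers who just want the general flavor, below is a minimal, illustrative sketch of a factorized spatiotemporal convolution (a spatial 2D conv followed by a temporal 1D conv). This is not the repo's STS Conv itself; the class name, arguments, and exact factorization are our own illustration, so please refer to STS_Conv for the actual implementation.

```python
# Illustrative only: a factorized spatiotemporal convolution (spatial 1xkxk conv
# followed by a temporal kx1x1 conv). This is NOT the repo's STS Conv; see
# STS_Conv for the actual implementation.
import torch
import torch.nn as nn

class FactorizedSpatioTemporalConv(nn.Module):
    def __init__(self, in_channels, out_channels, spatial_kernel=3, temporal_kernel=3):
        super().__init__()
        # Spatial convolution: 1 x k x k kernel applied to each frame of (T, H, W).
        self.spatial = nn.Conv3d(
            in_channels, out_channels,
            kernel_size=(1, spatial_kernel, spatial_kernel),
            padding=(0, spatial_kernel // 2, spatial_kernel // 2),
            bias=False,
        )
        # Temporal convolution: k x 1 x 1 kernel mixing information across frames.
        self.temporal = nn.Conv3d(
            out_channels, out_channels,
            kernel_size=(temporal_kernel, 1, 1),
            padding=(temporal_kernel // 2, 0, 0),
            bias=False,
        )

    def forward(self, x):  # x: (N, C, T, H, W)
        return self.temporal(self.spatial(x))

# Example: a batch of 2 clips, 64 channels, 8 frames, 56x56 spatial resolution.
x = torch.randn(2, 64, 8, 56, 56)
y = FactorizedSpatioTemporalConv(64, 128)(x)
print(y.shape)  # torch.Size([2, 128, 8, 56, 56])
```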
Prerequisites
The code is built with the following libraries:
- python 3.8.5 or higher
- PyTorch 1.10.0+cu113
- torchvision 0.11.1+cu113
- opencv-python 4.4.0
- mmcv 1.4.6
- mmaction 0.20.0+
Video Dataset Preparation
We mainly focus on two widely-used video classification benchmarks: Kinetics-400 and Something-Something V2.
Some notes before preparing the two datasets:
- We decode the videos online to reduce the cost of storage. In our experiments, the CPU bottleneck only appears when more than 8 input frames are used.
- The frame resolution of Kinetics-400 we used has a short side of 320. The number of train / validation videos in our experiments is 240,436 / 19,796. We also provide the train/val lists.

We provide our annotation and data structure below for easy installation.
- Generate the annotation.
The annotation usually includes train.txt and val.txt. The format of the *.txt files is like:
video_1 label_1
video_2 label_2
video_3 label_3
...
video_N label_N
The pre-processed dataset is organized with the following structure:
datasets
|_ Kinetics400
|  |_ videos
|  |  |_ video_0
|  |  |_ video_1
|  |  |_ ...
|  |  |_ video_N
|  |_ train.txt
|  |_ val.txt
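For convenience, below is a minimal sketch of how such an annotation file could be generated. It assumes a hypothetical layout where videos are grouped into one subfolder per class under datasets/Kinetics400/videos; the real Kinetics-400 and Something-Something V2 labels should be taken from the official annotation files.

```python
# Minimal sketch (assumption: videos are grouped into one subfolder per class).
# The real Kinetics-400 / Something-Something V2 labels should come from the
# official annotation files; this only illustrates the "video_path label" format.
import os

def write_annotation(video_root, out_file):
    # Map each class subfolder name to an integer label.
    classes = sorted(d for d in os.listdir(video_root)
                     if os.path.isdir(os.path.join(video_root, d)))
    label_map = {name: idx for idx, name in enumerate(classes)}
    with open(out_file, "w") as f:
        for name, idx in label_map.items():
            class_dir = os.path.join(video_root, name)
            for video in sorted(os.listdir(class_dir)):
                # One "relative_video_path label" pair per line.
                f.write(f"{os.path.join(name, video)} {idx}\n")

write_annotation("datasets/Kinetics400/videos", "datasets/Kinetics400/train.txt")
```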
Model ZOO
Here we provide the video dataset lists and pre-trained weights in this OneDrive or GoogleDrive.
ImageNet-1k
We provide ImageNet-1k pre-trained weights for five video models. All models are trained for 300 epochs. Please follow the scripts we provide to evaluate them or fine-tune them on video datasets.
Models/Configs | Resolution | Top-1 | Checkpoints |
---|---|---|---|
ir-CSN50 | 224 * 224 | 78.8% | ckpt |
R2plus1d34 | 224 * 224 | 79.6% | ckpt |
SlowFast50-4x16 | 224 * 224 | 79.9% | ckpt |
SlowFast50-8x8 | 224 * 224 | 79.1% | ckpt |
Slowonly50 | 224 * 224 | 79.9% | ckpt |
X3D-S | 224 * 224 | 74.8% | ckpt |
Kinetics-400
Here we provide the 50-epoch fine-tuning configs and checkpoints. We also include some 100-epoch checkpoints, which give better performance at comparable computation.
Models/Configs | Resolution | Frames * Crops * Clips | 50-epoch Top-1 | 100-epoch Top-1 | Checkpoints folder |
---|---|---|---|---|---|
ir-CSN50 | 256 * 256 | 32 * 3 * 10 | 76.8% | 76.7% | ckpt |
R2plus1d34 | 256 * 256 | 8 * 3 * 10 | 76.2% | Over training budget | ckpt |
SlowFast50-4x16 | 256 * 256 | 32 * 3 * 10 | 76.2% | 76.9% | ckpt |
SlowFast50-8x8 | 256 * 256 | 32 * 3 * 10 | 77.2% | 77.9% | ckpt |
Slowonly50 | 256 * 256 | 8 * 3 * 10 | 75.7% | Over training budget | ckpt |
X3D-S | 192 * 192 | 13 * 3 * 10 | 72.5% | 73.9% | ckpt |
Something-Something V2
Models/Configs | Resolution | Frames * Crops * Clips | Top-1 | Checkpoints |
---|---|---|---|---|
ir-CSN50 | 256 * 256 | 8 * 3 * 1 | 61.4% | ckpt |
R2plus1d34 | 256 * 256 | 8 * 3 * 1 | 63.0% | ckpt |
SlowFast50-4x16 | 256 * 256 | 32 * 3 * 1 | 57.2% | ckpt |
Slowonly50 | 256 * 256 | 8 * 3 * 1 | 62.7% | ckpt |
X3D-S | 256 * 256 | 8 * 3 * 1 | 58.3% | ckpt |
After downloading the checkpoints and putting them into the target path, you can fine-tune or test the models with the corresponding configs, following the instructions below.
Usage
Build
After installing the above dependencies, run:
git clone https://github.com/UCSC-VLAA/Image-Pretraining-for-Video
cd Image_Pre_Training # first pretrain the 3D model on ImageNet
cd Spatiotemporal_Finetuning # then finetune the model on target video dataset
Pre-Training
We have provided some widely-used 3D model pre-trained weights that you can directly use for evaluation or fine-tuning.
After downloading the pre-trained weights, you can, for example, evaluate the CSN model on ImageNet by running:
bash scripts/csn/distributed_eval.sh [number of gpus]
The pre-training scripts for the listed models are located in scripts. Before training a model on ImageNet, you should specify your data path and, via --output, where to store the checkpoints. By default, we use wandb to log the training curves.
For example, to pre-train a CSN model on ImageNet:
bash scripts/csn/distributed_train.sh [number of gpus]
Fine-tuning
After pre-training, you can use the following command to fine-tune a video model.
Some notes:
- In the config file, set load_from = [your pre-trained model path].
- Setting reshape_t or reshape_st to False in the model config disables the STS Conv (an example config fragment is sketched after this list).
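For illustration, the relevant config fields might look like the sketch below. Only load_from, reshape_t, and reshape_st come from the notes above; the surrounding structure and values are placeholders, so please check the provided configs for the actual layout.

```python
# Sketch of the relevant config fields (hypothetical placement and values;
# see the provided configs under configs/recognition for the real structure).
load_from = '/path/to/imagenet_pretrained_model.pth'  # your pre-trained model path

model = dict(
    backbone=dict(
        # Setting these to False disables the STS Conv (per the note above).
        reshape_t=False,
        reshape_st=False,
    ),
)
```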
Then you can use the following command to fine-tune the models.
bash tools/dist_train.sh ${CONFIG_FILE} [optional arguments]
Example: train a CSN model on the Kinetics-400 dataset with periodic validation.
bash tools/dist_train.sh configs/recognition/csn/ircsn50_32x2_STS_k400_video.py [number of gpus] --validate
Testing
You can use the following command to test a model.
bash tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
Example: test a CSN model on the Kinetics-400 dataset and dump the result to a JSON file.
bash tools/dist_test.sh configs/recognition/csn/ircsn50_32x2_STS_k400_video.py \
checkpoints/SOME_CHECKPOINT.pth [number of gpus] --eval top_k_accuracy mean_class_accuracy \
--out result.json --average-clips prob
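If you want to inspect the dumped predictions yourself, a rough sketch along these lines may help. It assumes the dumped file is a list of per-video class-score arrays in the same order as val.txt; please verify the exact output format of your mmaction2 version.

```python
# Rough sketch: compute top-1 accuracy from a dumped result file.
# Assumption: result.json holds a list of per-video class-score lists, in the
# same order as the videos listed in val.txt (verify for your mmaction2 version).
import json
import numpy as np

with open("result.json") as f:
    scores = np.asarray(json.load(f))          # shape: (num_videos, num_classes)

# Each annotation line is "video_path label"; take the label from the end.
labels = np.asarray([int(line.split()[-1])
                     for line in open("datasets/Kinetics400/val.txt")])

preds = scores.argmax(axis=1)
print(f"top-1 accuracy: {float((preds == labels).mean()):.4f}")
```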
Acknowledgment
This repo is based on timm and mmaction2. Thanks to the contributors of these repos!
Citation
@inproceedings{li2022videopretraining,
title = {In Defense of Image Pre-Training for Spatiotemporal Recognition},
author = {Xianhang Li and Huiyu Wang and Chen Wei and Jieru Mei and Alan Yuille and Yuyin Zhou and Cihang Xie},
booktitle = {ECCV},
year = {2022},
}