In Defense of Image Pre-Training for Spatiotemporal Recognition

[NEW!] 2022/7/8 - Our paper has been accepted by ECCV 2022.

2022/5/5 - We have released the code and models.

Overview

This is a PyTorch/GPU implementation of the paper In Defense of Image Pre-Training for Spatiotemporal Recognition.

<div align="center"> <img src="./imgs/method.png" width = "800" alt="Architecture" align=center /> <br> <div style="color:orange; border-bottom: 2px solid #d9d9d9; display: inline-block; color: #999; padding: 10px;"> An overview of Image Pre-Training & Spatiotemporal Fine-Tuning. </div> </div>

Prerequisites

The code is built with the following libraries:

Video Dataset Preparation

We mainly focus on two widely used video classification benchmarks: Kinetics-400 and Something-Something V2.

Some notes before preparing the two datasets:

  1. We decode videos online to reduce storage cost. In our experiments, the CPU becomes a bottleneck only when more than 8 input frames are used.

  2. The Kinetics-400 frames we use have a short side of 320 pixels. Our experiments use 240,436 training and 19,796 validation videos. We also provide the train/val lists.
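Online decoding is cheap because only the sampled frame indices are ever decoded. The index arithmetic below is a minimal illustration of such a sampler (an assumption for clarity, not the repo's exact implementation):

```python
def sample_frame_indices(total_frames, num_frames=8, stride=2):
    """Pick `num_frames` center-aligned indices, `stride` frames apart,
    clamped to the video length; only these frames need decoding."""
    span = (num_frames - 1) * stride + 1
    start = max((total_frames - span) // 2, 0)
    return [min(start + i * stride, total_frames - 1) for i in range(num_frames)]
```

For a 100-frame video this yields 8 indices centered in the clip; for videos shorter than the sampling span, indices are clamped to the last frame.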

We provide our annotation files and data structure below for easy setup.
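Since this repo builds on mmaction2, the annotation lists likely follow the common one-sample-per-line `relative/path.mp4 label` layout; the parser below is a minimal sketch under that assumption (check the released lists for the exact format):

```python
def parse_video_list(text):
    """Parse 'path label' annotation lines into (path, int_label) pairs.

    rsplit(maxsplit=1) keeps paths containing spaces intact, since only
    the final whitespace-separated token is the class label.
    """
    samples = []
    for line in text.strip().splitlines():
        path, label = line.rsplit(maxsplit=1)
        samples.append((path, int(label)))
    return samples
```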

Model Zoo

Here we provide the video dataset lists and pre-trained weights on OneDrive or Google Drive.

ImageNet-1k

We provide ImageNet-1k pre-trained weights for five video models. All models are trained for 300 epochs. Please follow the provided scripts to evaluate them or fine-tune them on video datasets.

| Models/Configs | Resolution | Top-1 | Checkpoints |
| :--- | :---: | :---: | :---: |
| ir-CSN50 | 224 * 224 | 78.8% | ckpt |
| R2plus1d34 | 224 * 224 | 79.6% | ckpt |
| SlowFast50-4x16 | 224 * 224 | 79.9% | ckpt |
| SlowFast50-8x8 | 224 * 224 | 79.1% | ckpt |
| Slowonly50 | 224 * 224 | 79.9% | ckpt |
| X3D-S | 224 * 224 | 74.8% | ckpt |

Kinetics-400

Here we provide the 50-epoch fine-tuning configs and checkpoints. We also include some 100-epoch checkpoints that achieve better performance with comparable computation.

| Models/Configs | Resolution | Frames * Crops * Clips | 50-epoch Top-1 | 100-epoch Top-1 | Checkpoints folder |
| :--- | :---: | :---: | :---: | :---: | :---: |
| ir-CSN50 | 256 * 256 | 32 * 3 * 10 | 76.8% | 76.7% | ckpt |
| R2plus1d34 | 256 * 256 | 8 * 3 * 10 | 76.2% | Over training budget | ckpt |
| SlowFast50-4x16 | 256 * 256 | 32 * 3 * 10 | 76.2% | 76.9% | ckpt |
| SlowFast50-8x8 | 256 * 256 | 32 * 3 * 10 | 77.2% | 77.9% | ckpt |
| Slowonly50 | 256 * 256 | 8 * 3 * 10 | 75.7% | Over training budget | ckpt |
| X3D-S | 192 * 192 | 13 * 3 * 10 | 72.5% | 73.9% | ckpt |

Something-Something V2

| Models/Configs | Resolution | Frames * Crops * Clips | Top-1 | Checkpoints |
| :--- | :---: | :---: | :---: | :---: |
| ir-CSN50 | 256 * 256 | 8 * 3 * 1 | 61.4% | ckpt |
| R2plus1d34 | 256 * 256 | 8 * 3 * 1 | 63.0% | ckpt |
| SlowFast50-4x16 | 256 * 256 | 32 * 3 * 1 | 57.2% | ckpt |
| Slowonly50 | 256 * 256 | 8 * 3 * 1 | 62.7% | ckpt |
| X3D-S | 256 * 256 | 8 * 3 * 1 | 58.3% | ckpt |

After downloading the checkpoints and placing them in the target path, you can fine-tune or test the models with the corresponding configs by following the instructions below.

Usage

Build

After installing the above dependencies, run:

```shell
git clone https://github.com/UCSC-VLAA/Image-Pretraining-for-Video
cd Image_Pre_Training          # first pre-train the 3D model on ImageNet
cd Spatiotemporal_Finetuning   # then fine-tune the model on the target video dataset
```

Pre-Training

We have provided some widely-used 3D model pre-trained weights that you can directly use for evaluation or fine-tuning.

After downloading the pre-trained weights, you can, for example, evaluate the CSN model on ImageNet by running:

```shell
bash scripts/csn/distributed_eval.sh [number of gpus]
```

The pre-training scripts for the listed models are located in scripts. Before training a model on ImageNet, you should specify your data path and the directory where checkpoints are stored via --output. By default, we use wandb to log the training curves.

For example, pre-train a CSN model on ImageNet:

```shell
bash scripts/csn/distributed_train.sh [number of gpus]
```

Fine-tuning

After pre-training, you can use the following command to fine-tune a video model.

```shell
bash tools/dist_train.sh ${CONFIG_FILE} [optional arguments]
```

Example: train a CSN model on the Kinetics-400 dataset with periodic validation.

```shell
bash tools/dist_train.sh configs/recognition/csn/ircsn50_32x2_STS_k400_video.py [number of gpus] --validate
```

Testing

You can use the following command to test a model.

```shell
bash tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```

Example: test a CSN model on the Kinetics-400 dataset and dump the result to a JSON file.

```shell
bash tools/dist_test.sh configs/recognition/csn/ircsn50_32x2_STS_k400_video.py \
    checkpoints/SOME_CHECKPOINT.pth [number of gpus] --eval top_k_accuracy mean_class_accuracy \
    --out result.json --average-clips prob
```
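Conceptually, `--average-clips prob` averages softmax probabilities over the sampled clips/crops of each video, and `top_k_accuracy` then scores the averaged predictions. The sketch below illustrates both steps (a minimal illustration, not the repo's actual implementation):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def average_clips_prob(clip_logits):
    """Average class probabilities over all clips/crops of one video."""
    probs = [softmax(l) for l in clip_logits]
    n = len(probs)
    return [sum(p[c] for p in probs) / n for c in range(len(probs[0]))]

def top_k_accuracy(video_scores, labels, k=1):
    """Fraction of videos whose true label is among the top-k scores."""
    hits = 0
    for scores, label in zip(video_scores, labels):
        topk = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)
```

Averaging probabilities (rather than raw logits) is what distinguishes the `prob` mode of `--average-clips`.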

Acknowledgment

This repo is based on timm and mmaction2. Thanks to the contributors of these repos!

Citation

```bibtex
@inproceedings{li2022videopretraining,
  title     = {In Defense of Image Pre-Training for Spatiotemporal Recognition},
  author    = {Xianhang Li and Huiyu Wang and Chen Wei and Jieru Mei and Alan Yuille and Yuyin Zhou and Cihang Xie},
  booktitle = {ECCV},
  year      = {2022},
}
```