Non-local Neural Networks for Video Classification
This code is a re-implementation of the video classification experiments in the paper Non-local Neural Networks. It is built on the Caffe2 framework.
<div align="center"> <img src="data/nlnet.jpg" width="700px" /> </div>
License
The code and the models in this repo are released under the CC-BY-NC 4.0 LICENSE.
Citation
If you use our code in your research or wish to refer to the baseline results, please use the following BibTeX entry.
@inproceedings{NonLocal2018,
  author = {Xiaolong Wang and Ross Girshick and Abhinav Gupta and Kaiming He},
  title = {Non-local Neural Networks},
  booktitle = {CVPR},
  year = {2018}
}
Installation
Please find installation instructions for Caffe2 in INSTALL.md. We also suggest checking the Detectron installation instructions and its issues page if you run into problems.
Pre-trained Models for Download
First, go into the data folder and create the directories for pre-trained models and checkpoints:
cd data
mkdir pretrained_model
mkdir checkpoints
ImageNet pre-trained models
They can be downloaded as pretrained_model.tar.gz. Download and extract the models into the current folder:
wget https://dl.fbaipublicfiles.com/video-nonlocal/pretrained_model.tar.gz
tar xzf pretrained_model.tar.gz
Dataset Preparation
Please read DATASET.md for downloading and preparing the Kinetics dataset.
Note: the models released in this repo are trained with the same data as in our paper.
Main Results
All the training scripts with ResNet-50 backbone are here:
cd scripts
We report the benchmarks with the ResNet-50 backbone below. All the numbers are obtained via fully-convolutional testing. All the models and training logs are available for download (some logs might not contain the fully-convolutional testing numbers):
<sub>script</sub> | <sub>input frames</sub> | <sub>freeze bn?</sub> | <sub>3D conv?</sub> | <sub>non-local?</sub> | <sub>top1</sub> | <sub>in paper</sub> | <sub>top5</sub> | <sub>model</sub> | <sub>logs</sub> |
---|---|---|---|---|---|---|---|---|---|
<sub>run_c2d_baseline_400k_32f.sh</sub> | 32 | - | - | - | 72.0 | <sub>71.8</sub> | 90.0 | link | link |
<sub>run_c2d_nlnet_400k_32f.sh</sub> | 32 | - | - | Yes | 73.9 | <sub>73.8</sub> | 91.0 | link | link |
<sub>run_i3d_baseline_400k_32f.sh</sub> | 32 | - | Yes | - | 73.6 | <sub>73.3</sub> | 90.8 | link | link |
<sub>run_i3d_nlnet_400k_32f.sh</sub> | 32 | - | Yes | Yes | 74.9 | <sub>74.9</sub> | 91.6 | link | link |
<sub>run_i3d_baseline_affine_400k_128f.sh</sub> | 128 | Yes | Yes | - | 75.2 | <sub>74.9</sub> | 92.0 | link | link |
<sub>run_i3d_nlnet_affine_400k_128f.sh</sub> | 128 | Yes | Yes | Yes | 76.5 | <sub>76.5</sub> | 92.7 | link | link |
Modifications for improving speed
Besides releasing the models trained with the exact parameter settings in the paper, we ablate a few training settings that significantly improve training/testing speed while keeping almost the same performance.
- Sparser sampling of inputs. We sample N frames with a stride of M frames (so a clip covers N * M raw frames). In the paper we used (N, M) = (32, 2) for short clips and (N, M) = (128, 1) for long clips. The following experiments use (N, M) = (8, 8) for short clips and (N, M) = (32, 4) for long clips. The temporal strides inside the network are adjusted accordingly so that the feature map sizes from res2 to res5 are unchanged. This modification reduces data I/O, which significantly improves speed (a short sampling sketch follows the results table below).
<sub>script</sub> | <sub>input frames</sub> | <sub>freeze bn?</sub> | <sub>3D conv?</sub> | <sub>non-local?</sub> | <sub>top1</sub> | <sub>top5</sub> | <sub>model</sub> | <sub>logs</sub> |
---|---|---|---|---|---|---|---|---|
<sub>run_c2d_baseline_400k.sh</sub> | 8 | - | - | - | 71.9 | 90.0 | link | link |
<sub>run_c2d_nlnet_400k.sh</sub> | 8 | - | - | Yes | 74.4 | 91.4 | link | link |
<sub>run_i3d_baseline_400k.sh</sub> | 8 | - | Yes | - | 73.4 | 90.9 | link | link |
<sub>run_i3d_nlnet_400k.sh</sub> | 8 | - | Yes | Yes | 74.7 | 91.6 | link | link |
<sub>run_i3d_baseline_affine_400k.sh</sub> | 32 | Yes | Yes | - | 75.5 | 92.0 | link | link |
<sub>run_i3d_nlnet_affine_400k.sh</sub> | 32 | Yes | Yes | Yes | 76.5 | 92.6 | link | link |
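To make the sparser sampling concrete, here is a minimal illustrative sketch (not code from this repo) of how N frames sampled every M raw frames map to frame indices; the function name and the random-offset choice are assumptions for illustration only:

```python
import random

def sample_clip_indices(num_raw_frames, n_frames, stride):
    """Return raw-frame indices of one clip: n_frames frames, one every `stride` raw frames."""
    span = n_frames * stride                               # raw frames covered by the clip
    start = random.randint(0, max(0, num_raw_frames - span))
    return [start + i * stride for i in range(n_frames)]

# Paper settings: (N, M) = (32, 2) short / (128, 1) long.
# Faster settings in this section: (N, M) = (8, 8) short / (32, 4) long.
print(sample_clip_indices(num_raw_frames=300, n_frames=8, stride=8))
```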
- Fewer training iterations. With sparser sampling of inputs, we further reduce training time by reducing the number of training iterations: instead of the 400K iterations used in the paper, we train for 300K iterations. This reduces the number of training epochs by 25% without losing much performance.
<sub>script</sub> | <sub>input frames</sub> | <sub>freeze bn?</sub> | <sub>3D conv?</sub> | <sub>non-local?</sub> | <sub>top1</sub> | <sub>top5</sub> | <sub>model</sub> | <sub>logs</sub> |
---|---|---|---|---|---|---|---|---|
<sub>run_i3d_baseline_300k.sh</sub> | 8 | - | Yes | - | 73.2 | 90.8 | link | link |
Training with fewer GPUs
- The following two models were run by Xiaolong Wang on 4-GPU (GTX 1080) machines outside of Facebook after his internship. The training data was downloaded on 12/20/2017 (see DATASET.md), which misses some videos due to invalid URLs. The training schedule is shorter (4-GPU 600k vs. 8-GPU 400k above). These changes lead to a slight accuracy drop.
- We also provide training scripts/models with half the iterations (300K with 4 GPUs) and less regularization. This baseline is fast and intended as a sanity check: it takes less than 3 days to train on a machine with 4 GPUs (see "run_i3d_baseline_300k_4gpu.sh").
<sub>script</sub> | <sub>input frames</sub> | <sub>GPUs</sub> | <sub>freeze bn?</sub> | <sub>3D conv?</sub> | <sub>non-local?</sub> | <sub>top1</sub> | <sub>top5</sub> | <sub>model</sub> | <sub>logs</sub> |
---|---|---|---|---|---|---|---|---|---|
<sub>run_i3d_baseline_600k_4gpu.sh</sub> | 8 | 4 | - | Yes | - | 73.0 | 90.4 | link | link |
<sub>run_i3d_baseline_300k_4gpu.sh</sub> | 8 | 4 | - | Yes | - | 72.0 | 90.1 | link | link |
Script details
We now explain the scripts, taking the models trained with 3D convolutions, 400k iterations, 8 GPUs, and sparser inputs (see Modifications for improving speed above) as examples.
- The following script trains the baseline i3d model from an ImageNet pre-trained network:
run_i3d_baseline_400k.sh
- The following script trains the i3d model with 5 non-local layers from an ImageNet pre-trained network:
run_i3d_nlnet_400k.sh
- To train the i3d non-local network with longer clips (32-frame input), we first need the model trained by "run_i3d_baseline_400k.sh" as a pre-trained model. We then convert its Batch Normalization layers into affine layers by running:
cd ../process_data/convert_models
python modify_caffe2_ftvideo.py ../../data/checkpoints/run_i3d_baseline_400k/checkpoints/c2_model_iter400000.pkl ../../data/pretrained_model/run_i3d_baseline_400k/affine_model_400k.pkl
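For intuition, converting a trained Batch Normalization layer into an affine layer amounts to folding its frozen statistics into a per-channel scale and bias. The sketch below shows only this algebra and is not the repo's conversion code; the actual conversion is performed by modify_caffe2_ftvideo.py above:

```python
import numpy as np

def bn_to_affine(gamma, beta, running_mean, running_var, eps=1e-5):
    """Fold frozen BatchNorm statistics into an equivalent per-channel affine transform.

    BN:      y = gamma * (x - mean) / sqrt(var + eps) + beta
    Affine:  y = scale * x + bias
    """
    scale = gamma / np.sqrt(running_var + eps)
    bias = beta - running_mean * scale
    return scale, bias
```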
Note that we have provided one example model (run_i3d_baseline_400k/affine_model_400k.pkl) in pretrained_model.tar.gz. Given this converted model, we run the script for training the i3d Non-local Networks with longer clips:
run_i3d_nlnet_affine_400k.sh
- The models with a ResNet-101 backbone can be trained by setting:
TRAIN.PARAMS_FILE ../data/pretrained_model/r101_pretrain_c2_model_iter450450_clean.pkl
MODEL.DEPTH 101
MODEL.VIDEO_ARC_CHOICE 4  # 3 for c2d, 4 for i3d
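For reference, space-separated KEY VALUE pairs like the ones above override entries of a nested configuration. The sketch below is a generic illustration of that merging, assuming a Detectron-style config system; the exact helper used in this repo may differ:

```python
def merge_overrides(cfg, overrides):
    """Apply dotted KEY VALUE pairs, e.g. ['MODEL.DEPTH', '101', ...], to a nested config dict."""
    for key, value in zip(overrides[0::2], overrides[1::2]):
        node = cfg
        *parents, leaf = key.split('.')
        for name in parents:
            node = node.setdefault(name, {})   # descend, creating intermediate dicts as needed
        node[leaf] = value
    return cfg

cfg = merge_overrides({}, ['MODEL.DEPTH', '101', 'MODEL.VIDEO_ARC_CHOICE', '4'])
print(cfg)  # {'MODEL': {'DEPTH': '101', 'VIDEO_ARC_CHOICE': '4'}}
```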
Testing
The models are tested immediately after training. For each video, we sample 10 clips along the temporal dimension, as in the paper. For each clip, we resize the shorter side to 256 pixels and use 3 spatial crops to cover the entire frame. We run fully-convolutional testing on each of the 256x256 crops. This is a slower approximation of the fully-convolutional testing (on the variable full size, e.g., 256x320) done in the paper, which requires a specific implementation not provided in this repo.
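To illustrate the evaluation protocol, the sketch below (not code from this repo) averages per-view class scores over the 10 temporal clips x 3 spatial crops to obtain a single video-level prediction:

```python
import numpy as np

def aggregate_video_score(view_scores):
    """view_scores: (num_views, num_classes) array of per-view class scores,
    e.g. 10 temporal clips * 3 spatial crops = 30 views over 400 Kinetics classes."""
    mean_scores = np.asarray(view_scores).mean(axis=0)    # average over all views
    return int(mean_scores.argmax()), mean_scores         # predicted class, averaged scores

predicted_class, _ = aggregate_video_score(np.random.rand(30, 400))
```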
Taking the model trained with "run_i3d_nlnet_400k.sh" as an example, we can run testing by specifying:
TEST.TEST_FULLY_CONV True
as in the script:
run_test_multicrop.sh
Fine-tuning
The fine-tuning process is almost exactly the same as the training process. The only difference is that you first need to modify our Kinetics pre-trained model by removing the iteration number, momentum, and last-layer parameters, which is done with
process_data/convert_models/modify_blob_rm.py
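Conceptually, that script strips the optimizer state and the classification head from a checkpoint so the remaining weights can initialize a new model. The sketch below illustrates the idea only; the blob names and checkpoint layout are assumptions, and the actual logic lives in modify_blob_rm.py:

```python
import pickle

def strip_for_finetune(src_path, dst_path):
    """Drop the iteration counter, momentum blobs, and last-layer (classifier) weights."""
    with open(src_path, 'rb') as f:
        checkpoint = pickle.load(f, encoding='latin1')
    blobs = checkpoint['blobs']                  # assumed layout: {'blobs': {name: array, ...}}
    kept = {name: w for name, w in blobs.items()
            if 'momentum' not in name            # optimizer (momentum) state
            and not name.startswith('pred')      # assumed name of the last (classification) layer
            and name != 'model_iter'}            # iteration counter
    with open(dst_path, 'wb') as f:
        pickle.dump({'blobs': kept}, f, pickle.HIGHEST_PROTOCOL)
```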
Acknowledgement
The authors would like to thank Haoqi Fan for training the models and reproducing the results at FAIR with this code.