Non-local Neural Networks for Video Classification

This code is a re-implementation of the video classification experiments in the paper Non-local Neural Networks. The code is developed based on the Caffe2 framework.

<div align="center"> <img src="data/nlnet.jpg" width="700px" /> </div>

License

The code and the models in this repo are released under the CC-BY-NC 4.0 LICENSE.

Citation

If you use our code in your research or wish to refer to the baseline results, please use the following BibTeX entry.

@article{NonLocal2018,
  author =   {Xiaolong Wang and Ross Girshick and Abhinav Gupta and Kaiming He},
  title =    {Non-local Neural Networks},
  journal =  {CVPR},
  year =     {2018}
}

Installation

Please find installation instructions for Caffe2 in INSTALL.md. We also suggest checking the Detectron installation instructions and its issues page if you run into problems.

Pre-trained Models for Download

First, go into the data folder and create directories for the pre-trained models and checkpoints:

cd data
mkdir pretrained_model
mkdir checkpoints

ImageNet pre-trained models

The ImageNet pre-trained models can be downloaded from pretrained_model.tar.gz. Extract them into the current folder:

wget https://dl.fbaipublicfiles.com/video-nonlocal/pretrained_model.tar.gz
tar xzf pretrained_model.tar.gz
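The checkpoints in this repo are stored as Python pickles of parameter blobs. As a quick sanity check after extracting the archive, you can list the blobs of the example model shipped in pretrained_model.tar.gz. This is a minimal sketch: the nesting of parameters under a 'blobs' key and the latin1 encoding are assumptions about how these pickles were written.

    import pickle

    # Inspect the example converted model included in pretrained_model.tar.gz
    # (adjust the path to whatever the archive actually extracted).
    path = 'pretrained_model/run_i3d_baseline_400k/affine_model_400k.pkl'
    with open(path, 'rb') as f:
        checkpoint = pickle.load(f, encoding='latin1')  # Python2-era pickle

    # Assumed layout: a dict, possibly nesting parameters under 'blobs'.
    blobs = checkpoint.get('blobs', checkpoint)
    print(len(blobs), 'parameter blobs')
    for name in sorted(blobs)[:10]:
        print(name, getattr(blobs[name], 'shape', None))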

Dataset Preparation

Please read DATASET.md for downloading and preparing the Kinetics dataset.

Note: In this repo, we release models trained with the same data as in our paper.

Main Results

All the training scripts with ResNet-50 backbone are here:

cd scripts

We report the benchmarks with the ResNet-50 backbone below. All the numbers are obtained via fully-convolutional testing. All the models and training logs are available for download (some logs might not contain the fully-convolutional testing numbers):

| script | input frames | freeze bn? | 3D conv? | non-local? | top1 | in paper | top5 | model | logs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| run_c2d_baseline_400k_32f.sh | 32 | - | - | - | 72.0 | 71.8 | 90.0 | link | link |
| run_c2d_nlnet_400k_32f.sh | 32 | - | - | Yes | 73.9 | 73.8 | 91.0 | link | link |
| run_i3d_baseline_400k_32f.sh | 32 | - | Yes | - | 73.6 | 73.3 | 90.8 | link | link |
| run_i3d_nlnet_400k_32f.sh | 32 | - | Yes | Yes | 74.9 | 74.9 | 91.6 | link | link |
| run_i3d_baseline_affine_400k_128f.sh | 128 | Yes | Yes | - | 75.2 | 74.9 | 92.0 | link | link |
| run_i3d_nlnet_affine_400k_128f.sh | 128 | Yes | Yes | Yes | 76.5 | 76.5 | 92.7 | link | link |

Modifications for improving speed

Besides releasing the models that follow the exact parameter settings in the paper, we also ablate a few training settings which significantly improve training/testing speed with almost the same performance.

| script | input frames | freeze bn? | 3D conv? | non-local? | top1 | top5 | model | logs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| run_c2d_baseline_400k.sh | 8 | - | - | - | 71.9 | 90.0 | link | link |
| run_c2d_nlnet_400k.sh | 8 | - | - | Yes | 74.4 | 91.4 | link | link |
| run_i3d_baseline_400k.sh | 8 | - | Yes | - | 73.4 | 90.9 | link | link |
| run_i3d_nlnet_400k.sh | 8 | - | Yes | Yes | 74.7 | 91.6 | link | link |
| run_i3d_baseline_affine_400k.sh | 32 | Yes | Yes | - | 75.5 | 92.0 | link | link |
| run_i3d_nlnet_affine_400k.sh | 32 | Yes | Yes | Yes | 76.5 | 92.6 | link | link |
The following script uses a shorter training schedule (300k iterations):

| script | input frames | freeze bn? | 3D conv? | non-local? | top1 | top5 | model | logs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| run_i3d_baseline_300k.sh | 8 | - | Yes | - | 73.2 | 90.8 | link | link |

Training with fewer GPUs

| script | input frames | GPUs | freeze bn? | 3D conv? | non-local? | top1 | top5 | model | logs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| run_i3d_baseline_600k_4gpu.sh | 8 | 4 | - | Yes | - | 73.0 | 90.4 | link | link |
| run_i3d_baseline_300k_4gpu.sh | 8 | 4 | - | Yes | - | 72.0 | 90.1 | link | link |

Script details

We now explain the scripts, taking as examples the models trained with 3D convolutions, 400k iterations, 8 GPUs, and sparser inputs (from Modifications for improving speed).

  1. The following script trains the baseline i3d model from the ImageNet pre-trained network:

    run_i3d_baseline_400k.sh
    
  2. The following script trains the i3d model with 5 non-local layers, starting from the ImageNet pre-trained network:

    run_i3d_nlnet_400k.sh
    
  3. To train the i3d Non-local Networks with longer clips (32-frame input), we first need the model trained by "run_i3d_baseline_400k.sh" as a pre-trained model. We then convert its Batch Normalization layers into Affine layers (a sketch of this conversion is given after this list) by running:

    cd ../process_data/convert_models
    python modify_caffe2_ftvideo.py ../../data/checkpoints/run_i3d_baseline_400k/checkpoints/c2_model_iter400000.pkl  ../../data/pretrained_model/run_i3d_baseline_400k/affine_model_400k.pkl
    

    Note that we have provided one example model (run_i3d_baseline_400k/affine_model_400k.pkl) in pretrained_model.tar.gz. Given this converted model, we run the script for training the i3d Non-local Networks with longer clips:

    run_i3d_nlnet_affine_400k.sh
    
  4. The models with ResNet-101 backbone can be trained by setting:

    TRAIN.PARAMS_FILE ../data/pretrained_model/r101_pretrain_c2_model_iter450450_clean.pkl
    MODEL.DEPTH 101
    MODEL.VIDEO_ARC_CHOICE 4 # 3 for c2d, and 4 for i3d
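
For reference, the arithmetic performed when folding frozen Batch Normalization layers into Affine layers (step 3 above) is the standard folding shown below. This is only a sketch of the math, not the actual conversion script; the real blob names and bookkeeping live in modify_caffe2_ftvideo.py:

    import numpy as np

    def bn_to_affine(gamma, beta, running_mean, running_var, eps=1e-5):
        """Fold frozen BatchNorm statistics into an equivalent affine (scale, bias).

        At inference, BN computes  y = gamma * (x - mean) / sqrt(var + eps) + beta,
        which equals               y = scale * x + bias
        with the scale and bias returned here.
        """
        scale = gamma / np.sqrt(running_var + eps)
        bias = beta - running_mean * scale
        return scale, bias

    # Hypothetical per-channel statistics for one 64-channel BN layer:
    gamma = np.ones(64, dtype=np.float32)
    beta = np.zeros(64, dtype=np.float32)
    mean = np.zeros(64, dtype=np.float32)
    var = np.ones(64, dtype=np.float32)
    scale, bias = bn_to_affine(gamma, beta, mean, var)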
    

Testing

The models are tested immediately after training. For each video, we sample 10 clips along the temporal dimension as in the paper. For each video clip, we resize the shorter side to 256 pixels and use 3 crops to cover the entire spatial extent. We run fully-convolutional testing on each of the 256x256 crops. This is a slower approximation of the fully-convolutional testing (on the variable full size, e.g., 256x320) done in the paper, which requires a specific implementation not provided in this repo.
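The sampling geometry is easier to see in code. The sketch below is not the repo's actual data loader; the uniform clip spacing, the placeholder clip length, and the averaging of view predictions are assumptions made for illustration:

    import numpy as np

    def test_views(video, num_clips=10, clip_len=8, crop_size=256):
        """Enumerate the 10 clips x 3 crops = 30 test-time views of one video.

        `video` is a (T, H, W, C) array already resized so min(H, W) == 256.
        clip_len is a placeholder; the actual length depends on the training
        script (e.g., 8 or 32 frames).
        """
        T, H, W, _ = video.shape
        # Temporal: spread the clip start frames over the whole video.
        starts = np.linspace(0, max(T - clip_len, 0), num_clips).astype(int)
        # Spatial: three crops along the longer side cover the whole frame.
        long_side = max(H, W)
        offsets = [0, (long_side - crop_size) // 2, long_side - crop_size]

        views = []
        for s in starts:
            clip = video[s:s + clip_len]
            for off in offsets:
                if W >= H:  # landscape video: slide the crops horizontally
                    views.append(clip[:, :crop_size, off:off + crop_size])
                else:       # portrait video: slide the crops vertically
                    views.append(clip[:, off:off + crop_size, :crop_size])
        return views  # the network's predictions over these views are averaged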

Taking the model trained with "run_i3d_nlnet_400k.sh" as an example, we can run testing by specifying:

TEST.TEST_FULLY_CONV True

as in the script:

run_test_multicrop.sh

Fine-tuning

The fine-tuning process is almost exactly the same as the training process. The only difference is that you first need to modify our Kinetics pre-trained model by removing the iteration number, momentum, and last-layer parameters, which is done with

process_data/convert_models/modify_blob_rm.py
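Conceptually, that script has to drop a few kinds of blobs from the pickled checkpoint before fine-tuning. The sketch below only illustrates the idea; the blob-name patterns ('model_iter', 'lr', '_momentum', and the 'pred' prefix for the last layer) are assumptions, and the real filtering rules are defined in modify_blob_rm.py:

    import pickle

    def strip_for_finetune(in_path, out_path, last_layer_prefix='pred'):
        """Remove iteration/momentum/last-layer blobs from a checkpoint pickle."""
        with open(in_path, 'rb') as f:
            checkpoint = pickle.load(f, encoding='latin1')
        blobs = checkpoint.get('blobs', checkpoint)

        # NOTE: the blob names below are illustrative, not the exact names
        # used in the released checkpoints.
        kept = {}
        for name, value in blobs.items():
            if name in ('model_iter', 'lr'):         # iteration counter / LR state
                continue
            if '_momentum' in name:                  # SGD momentum buffers
                continue
            if name.startswith(last_layer_prefix):   # last (classification) layer
                continue
            kept[name] = value

        with open(out_path, 'wb') as f:
            pickle.dump({'blobs': kept}, f, protocol=2)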

Acknowledgement

The authors would like to thank Haoqi Fan for training the models and reproducing the results at FAIR with this code.