[CVPR 2022] Stand-Alone Inter-Frame Attention in Video Models

This repository includes the SIFA code and related configurations. The architectures SIFA-Net and SIFA-Transformer are implemented in Python with the PyTorch framework, and the kernel of the SIFA operation is implemented in C++ with the CUDA library.

The related pre-trained model weights will be released soon.

Paper Introduction

<div align=center> <img src="https://raw.githubusercontent.com/FuchenUSTC/SIFA/master/pic/abstract.JPG" width="300" alt="image"/> <img src="https://raw.githubusercontent.com/FuchenUSTC/SIFA/master/pic/framework.JPG" width="475" alt="image"/> </div>

Motion, as the uniqueness of a video, has been critical to the development of video understanding models. Modern deep learning models leverage motion by either executing spatio-temporal 3D convolutions, factorizing 3D convolutions into spatial and temporal convolutions separately, or computing self-attention along temporal dimension. The implicit assumption behind such successes is that the feature maps across consecutive frames can be nicely aggregated. Nevertheless, the assumption may not always hold especially for the regions with large deformation. In this paper, we present a new recipe of inter-frame attention block, namely Stand-alone Inter-Frame Attention (SIFA), that novelly delves into the deformation across frames to estimate local self-attention on each spatial location. Technically, SIFA remoulds the deformable design via re-scaling the offset predictions by the difference between two frames. Taking each spatial location in the current frame as the query, the locally deformable neighbors in the next frame are regarded as the keys/values. Then, SIFA measures the similarity between query and keys as stand-alone attention to weighted average the values for temporal aggregation. We further plug SIFA block into ConvNets and Vision Transformer, respectively, to devise SIFA-Net and SIFA-Transformer. Extensive experiments conducted on four video datasets demonstrate the superiority of SIFA-Net and SIFA-Transformer as stronger backbones. More remarkably, SIFA-Transformer achieves an accuracy of 83.1% on Kinetics-400 dataset.
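To make the attention mechanism above concrete, here is a naive, non-deformable PyTorch sketch: each spatial location of the current frame acts as the query, and a fixed k x k neighborhood at the same location in the next frame supplies the keys/values. The real SIFA kernel additionally deforms this neighborhood with offsets re-scaled by the frame difference and is implemented in CUDA; all names below are illustrative, not the repository's API.

```python
import torch
import torch.nn.functional as F

def naive_inter_frame_attention(cur, nxt, k=3):
    """Naive sketch of SIFA-style inter-frame attention (no deformation).

    cur, nxt: feature maps of two consecutive frames, shape (B, C, H, W).
    k: size of the local neighborhood sampled from the next frame.
    """
    B, C, H, W = cur.shape
    pad = k // 2
    # Unfold k x k neighborhoods of the next frame: (B, C, k*k, H*W)
    neigh = F.unfold(nxt, kernel_size=k, padding=pad).view(B, C, k * k, H * W)
    query = cur.view(B, C, 1, H * W)                      # (B, C, 1, H*W)
    # Dot-product similarity between the query and each neighbor (key)
    attn = (query * neigh).sum(dim=1) / C ** 0.5          # (B, k*k, H*W)
    attn = attn.softmax(dim=1)
    # Weighted average of the values (same neighbors) for temporal aggregation
    out = (neigh * attn.unsqueeze(1)).sum(dim=2)          # (B, C, H*W)
    return out.view(B, C, H, W)

# Example: two consecutive frames with 64-channel feature maps
cur = torch.randn(2, 64, 14, 14)
nxt = torch.randn(2, 64, 14, 14)
print(naive_inter_frame_attention(cur, nxt).shape)  # torch.Size([2, 64, 14, 14])
```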

Required Environment

To guarantee successful compilation of the SIFA CUDA kernel, the nvcc CUDA compiler must be installed in the environment. We have integrated the complete running environment into a Docker image and will release it on Docker Hub in the future.
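As a quick sanity check before compiling (a minimal sketch, not part of the repository), you can verify from Python that PyTorch sees both a GPU and the CUDA toolkit that nvcc belongs to:

```python
import torch
from torch.utils.cpp_extension import CUDA_HOME

# The extension build needs a CUDA-enabled PyTorch and a CUDA toolkit providing nvcc.
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA toolkit (CUDA_HOME):", CUDA_HOME)  # None means nvcc will not be found
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))  # e.g. (7, 0) -> sm_70
```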

Compilation of SIFA

cd ./cuda_sifa

Check that the compilation parameter -gencode (arch and code) in setup.py matches your GPU type, e.g., sm_70 for Tesla V100 (see the setup.py sketch at the end of this section). Then run

bash make.sh

If the compilation is successful, the C++ extension file _ext.cpython-38-x86_64-linux-gnu.so will be generated in the ./cuda_sifa folder. Copy it to the main directory:

cp _ext.cpython-38-x86_64-linux-gnu.so ../
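For reference, the -gencode setting mentioned above usually appears among the nvcc arguments of the CUDA extension in setup.py. The sketch below shows the typical pattern only; the source file names and extension name are illustrative, not the repository's exact setup.py.

```python
from setuptools import setup
from torch.utils.cpp_extension import CUDAExtension, BuildExtension

# Hypothetical sketch of a CUDA extension build; adjust -gencode to your GPU,
# e.g. sm_70 for Tesla V100, sm_80 for A100, sm_86 for RTX 30xx.
setup(
    name="_ext",
    ext_modules=[
        CUDAExtension(
            name="_ext",
            sources=["sifa_cuda.cpp", "sifa_cuda_kernel.cu"],  # illustrative file names
            extra_compile_args={
                "cxx": ["-O2"],
                "nvcc": ["-O2", "-gencode=arch=compute_70,code=sm_70"],
            },
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```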

Training of SIFA

If the frame data has been prepared, please run

python -m torch.distributed.launch --nproc_per_node=4 train_val_3d.py --config_file=settings/c2d_sifa_resnet50-1x16x1.k400.yml

or

python -m torch.distributed.launch --nproc_per_node=4 train_val_3d.py --config_file=settings/c2d_sifa_swin-b-1x64x2.k400.yml

to train SIFA-Net or SIFA-Transformer, respectively.

The related training configuration files are the .yml files in the ./base_config/ and ./settings/ folders.
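As a small, hypothetical example of inspecting one of these configuration files before launching training (assuming the .yml files are plain YAML and that PyYAML is installed):

```python
import yaml  # PyYAML; assumed to be available in the environment

# Path taken from the training command above; any of the settings/*.yml files works.
with open("settings/c2d_sifa_resnet50-1x16x1.k400.yml") as f:
    cfg = yaml.safe_load(f)

# Print the top-level options (dataset, solver, network settings, ...) before editing.
for key, value in cfg.items():
    print(key, ":", value)
```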

Citation

If you use these models in your research, please cite:

@inproceedings{Long:CVPR22,
  title={Stand-Alone Inter-Frame Attention in Video Models},
  author={Fuchen Long and Zhaofan Qiu and Yingwei Pan and Ting Yao and Jiebo Luo and Tao Mei},
  booktitle={CVPR},
  year={2022}
}