ViS4mer

This is the official PyTorch implementation of our ECCV 2022 paper Long Movie Clip Classification with State-Space Video Models. This repository provides PyTorch code for training and testing our proposed ViS4mer model. ViS4mer is an efficient video recognition model that achieves state-of-the-art results on several long-range video understanding benchmarks such as LVU, Breakfast, and COIN.

If you find ViS4mer useful in your research, please use the following BibTeX entry for citation.

@article{islam2022long,
  title={Long movie clip classification with state-space video models},
  author={Islam, Md Mohaiminul and Bertasius, Gedas},
  journal={arXiv preprint arXiv:2204.01692},
  year={2022}
}

Installation

This repository requires Python 3.8+ and PyTorch 1.9+.

# Create and activate a conda environment
conda create --name py38 python=3.8
conda activate py38

# Build and install the custom Cauchy kernel extension used by the S4 layer
cd extensions/cauchy
python setup.py install

For more details on installing the S4 layer, please follow the instructions in the official S4 repository.
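
Before building the extension, it can help to confirm that PyTorch and CUDA are set up correctly. The check below is a minimal sketch using only standard PyTorch calls; nothing in it is specific to this repository:

import torch

print(torch.__version__)          # should report 1.9 or newer
print(torch.cuda.is_available())  # the Cauchy kernel extension targets CUDA-capable setups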

Demo

You can use the model as follows:

import torch
from models import ViS4mer

model = ViS4mer(d_input=1024, l_max=2048, d_output=10, d_model=1024, n_layers=3)
model.cuda()

inputs = torch.randn(32, 2048, 1024).cuda()  # [batch_size, seq_len, input_dim]
outputs = model(inputs)  # [32, 10]
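
If you want to train on your own features, a standard PyTorch loop works with the model above. The snippet below is a minimal sketch that reuses model and inputs from the demo, with dummy labels and a plain Adam optimizer; it is not the training recipe from the paper:

import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

labels = torch.randint(0, 10, (32,)).cuda()  # dummy targets for the 10 output classes
outputs = model(inputs)                      # [32, 10] logits
loss = F.cross_entropy(outputs, labels)

optimizer.zero_grad()
loss.backward()
optimizer.step()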

Run on LVU dataset

First, extract and save the video features using extract_features/extract_features_lvu_vit.py. Then, train and evaluate ViS4mer:

CUDA_VISIBLE_DEVICES=0,1,2,3 python run_lvu.py
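
Once the features are extracted, they can be fed to ViS4mer exactly as in the demo above. The sketch below assumes each video's features were saved as a tensor of shape [seq_len, 1024]; the file path and shape are illustrative, not the repository's actual on-disk format:

import torch
from models import ViS4mer

model = ViS4mer(d_input=1024, l_max=2048, d_output=10, d_model=1024, n_layers=3).cuda()
model.eval()

feats = torch.load('lvu_features/video_0001.pt').cuda()  # hypothetical path, [seq_len, 1024]
with torch.no_grad():
    logits = model(feats.unsqueeze(0))  # add a batch dimension -> [1, 10]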

Run on Breakfast dataset

First, extract and save the video features using extract_features/extract_features_breakfast_swin_train.py and extract_features/extract_features_breakfast_swin_test.py. Then, train and evaluate ViS4mer:

CUDA_VISIBLE_DEVICES=0,1,2,3 python run_breakfast.py

Run on COIN dataset

First, extract and save the video features using extract_features/extract_features_coin_swin_train.py and extract_features/extract_features_coin_swin_test.py. Then, train and evaluate ViS4mer:

CUDA_VISIBLE_DEVICES=0,1,2,3 python run_coin.py