Memory-and-Anticipation Transformer for Online Action Understanding

Introduction

This is a PyTorch implementation for our ICCV 2023 paper "Memory-and-Anticipation Transformer for Online Action Understanding".

(Figure: network architecture)

Environment

Data Preparation

Pre-extracted Feature

You can directly download the pre-extracted features (.zip) from the UTBox links provided by TeSTra.

(Alternative) Prepare dataset from scratch

You can also prepare the datasets from scratch yourself.

THUMOS14 and TVSeries

For THUMOS14 and TVSeries, please refer to LSTR.

EK100

For EK100, please find more details at RULSTM.

Data Structure

  1. If you want to use our dataloaders, please make sure to organize the files in the following structure:

    • THUMOS'14 dataset:

      $YOUR_PATH_TO_THUMOS_DATASET
      ├── rgb_kinetics_resnet50/
      │   ├── video_validation_0000051.npy (of size L x 2048)
      │   ├── ...
      ├── flow_kinetics_bninception/
      │   ├── video_validation_0000051.npy (of size L x 1024)
      │   ├── ...
      ├── target_perframe/
      │   ├── video_validation_0000051.npy (of size L x 22)
      │   ├── ...
      
    • TVSeries dataset:

      $YOUR_PATH_TO_TVSERIES_DATASET
      ├── rgb_kinetics_resnet50/
      │   ├── Breaking_Bad_ep1.npy (of size L x 2048)
      │   ├── ...
      ├── flow_kinetics_bninception/
      │   ├── Breaking_Bad_ep1.npy (of size L x 1024)
      │   ├── ...
      ├── target_perframe/
      │   ├── Breaking_Bad_ep1.npy (of size L x 31)
      │   ├── ...
      
    • EK100 dataset:

      $YOUR_PATH_TO_EK_DATASET
      ├── rgb_kinetics_bninception/
      │   ├── P01_01.npy (of size L x 1024)
      │   ├── ...
      ├── flow_kinetics_bninception/
      │   ├── P01_01.npy (of size L x 1024)
      │   ├── ...
      ├── target_perframe/
      │   ├── P01_01.npy (of size L x 3807)
      │   ├── ...
      ├── noun_perframe/
      │   ├── P01_01.npy (of size L x 301)
      │   ├── ...
      ├── verb_perframe/
      │   ├── P01_01.npy (of size L x 98)
      │   ├── ...
      
  2. Create softlinks to the datasets:

    cd memory-and-anticipation-transformer
    ln -s $YOUR_PATH_TO_THUMOS_DATASET data/THUMOS
    ln -s $YOUR_PATH_TO_TVSERIES_DATASET data/TVSeries
    ln -s $YOUR_PATH_TO_EK_DATASET data/EK100
    
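Assuming the layout above, the RGB, flow, and target files for one video are 2-D NumPy arrays that must share the same temporal length L. The following sketch (a hypothetical helper, not part of this repo) loads the per-frame `.npy` arrays for a video and verifies that their shapes are consistent:

```python
import numpy as np
from pathlib import Path

def check_video(root, video_id, expected_dims):
    """Verify that all per-frame arrays for one video share length L.

    expected_dims maps subdirectory name -> feature dimension, e.g. for
    THUMOS'14: {"rgb_kinetics_resnet50": 2048,
                "flow_kinetics_bninception": 1024,
                "target_perframe": 22}.
    Returns the common temporal length L.
    """
    lengths = set()
    for subdir, dim in expected_dims.items():
        arr = np.load(Path(root) / subdir / f"{video_id}.npy")
        # Each array should be (L, dim): one feature/label vector per frame.
        assert arr.ndim == 2 and arr.shape[1] == dim, (subdir, arr.shape)
        lengths.add(arr.shape[0])
    assert len(lengths) == 1, f"temporal lengths differ: {lengths}"
    return lengths.pop()
```

Running this over every video before training can catch misaligned feature and label files early, instead of failing inside the dataloader.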

Training

The commands are as follows.

cd memory-and-anticipation-transformer
# Training from scratch
python tools/train_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES
# Finetuning from a pretrained model
python tools/train_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
    MODEL.CHECKPOINT $PATH_TO_CHECKPOINT

Online Inference

There are two kinds of evaluation methods in our code.
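As a generic illustration of the online setting, streaming evaluation feeds features to the model one frame at a time, so predictions at time t depend only on frames up to t. The sketch below is a hypothetical example of this causal loop with a bounded memory of past features; it is not the repo's actual evaluation script (the real entry points live under tools/), and `model` stands in for any per-step scoring function:

```python
from collections import deque
import numpy as np

def stream_inference(model, feats, memory_len=32):
    """Causal, frame-by-frame scoring with a bounded memory of past frames.

    feats is an (L, D) array of per-frame features; model is any callable
    mapping an (M, D) memory window to a score vector for the current frame.
    """
    memory = deque(maxlen=memory_len)
    out = []
    for frame in feats:          # frames arrive one at a time, in order
        memory.append(frame)     # oldest frame is evicted once full
        out.append(model(np.stack(memory)))
    return np.stack(out)         # (L, num_classes) per-frame scores
```

Because the memory is bounded, per-frame cost stays constant regardless of video length, which is what makes online inference over long untrimmed videos tractable.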

Main Results and checkpoints

THUMOS14

| method | feature | mAP (%) | config | checkpoint |
|--------|-----------|---------|--------|------------|
| MAT | Anet v1.3 | 70.5 | yaml | Download |
| MAT | Kinetics | 71.6 | yaml | Download |

EK100

| method | feature | verb (overall) | noun (overall) | action (overall) | config | checkpoint |
|--------|----------|----------------|----------------|------------------|--------|------------|
| MAT | RGB+FLOW | 35.0 | 38.8 | 19.5 | yaml | Download |

Citations

If you are using the data/code/model provided here in a publication, please cite our paper:

@inproceedings{wang2023memory,
  title={Memory-and-Anticipation Transformer for Online Action Understanding},
  author={Wang, Jiahao and Chen, Guo and Huang, Yifei and Wang, Limin and Lu, Tong},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={13824--13835},
  year={2023}
}

License

This project is licensed under the Apache-2.0 License.

Acknowledgements

This codebase is built upon LSTR.

The code snippet for evaluation on EK100 is borrowed from TeSTra.

Also, thanks to Mingze Xu and Yue Zhao for their assistance in reproducing the features.