# Memory-and-Anticipation Transformer for Online Action Understanding
## Introduction
This is a PyTorch implementation for our ICCV 2023 paper "Memory-and-Anticipation Transformer for Online Action Understanding".
## Environment
- The code is developed with CUDA 10.2, Python >= 3.7.7, PyTorch >= 1.7.1.

- [Optional but recommended] Create a new conda environment:

  ```
  conda create -n mat python=3.7.7
  ```

  and activate the environment:

  ```
  conda activate mat
  ```

- Install the requirements:

  ```
  pip install -r requirements.txt
  ```
## Data Preparation
### Pre-extracted Feature
You can directly download the pre-extracted features (.zip) from the UTBox links provided by TeSTra.
### (Alternative) Prepare dataset from scratch
You can also prepare the datasets from scratch yourself.
#### THUMOS14 and TVSeries
For THUMOS14 and TVSeries, please refer to LSTR.
#### EK100
For EK100, please find more details at RULSTM.
### Data Structure
- If you want to use our dataloaders, please make sure to put the files in the following structure (a small loading sanity check is sketched after this list):
  - THUMOS'14 dataset:

    ```
    $YOUR_PATH_TO_THUMOS_DATASET
    ├── rgb_kinetics_resnet50/
    |   ├── video_validation_0000051.npy (of size L x 2048)
    |   ├── ...
    ├── flow_kinetics_bninception/
    |   ├── video_validation_0000051.npy (of size L x 1024)
    |   ├── ...
    ├── target_perframe/
    |   ├── video_validation_0000051.npy (of size L x 22)
    |   ├── ...
    ```
  - TVSeries dataset:

    ```
    $YOUR_PATH_TO_TVSERIES_DATASET
    ├── rgb_kinetics_resnet50/
    |   ├── Breaking_Bad_ep1.npy (of size L x 2048)
    |   ├── ...
    ├── flow_kinetics_bninception/
    |   ├── Breaking_Bad_ep1.npy (of size L x 1024)
    |   ├── ...
    ├── target_perframe/
    |   ├── Breaking_Bad_ep1.npy (of size L x 31)
    |   ├── ...
    ```
  - EK100 dataset:

    ```
    $YOUR_PATH_TO_EK_DATASET
    ├── rgb_kinetics_bninception/
    |   ├── P01_01.npy (of size L x 1024)
    |   ├── ...
    ├── flow_kinetics_bninception/
    |   ├── P01_01.npy (of size L x 1024)
    |   ├── ...
    ├── target_perframe/
    |   ├── P01_01.npy (of size L x 3807)
    |   ├── ...
    ├── noun_perframe/
    |   ├── P01_01.npy (of size L x 301)
    |   ├── ...
    ├── verb_perframe/
    |   ├── P01_01.npy (of size L x 98)
    |   ├── ...
    ```
- Create softlinks of datasets:

  ```
  cd memory-and-anticipation-transformer
  ln -s $YOUR_PATH_TO_THUMOS_DATASET data/THUMOS
  ln -s $YOUR_PATH_TO_TVSERIES_DATASET data/TVSeries
  ln -s $YOUR_PATH_TO_EK_DATASET data/EK100
  ```
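As a quick sanity check of the layout above, the following minimal sketch (not part of the released code) loads one video's per-frame features and targets and confirms that their lengths match. The folder names, file name, and feature sizes are taken from the THUMOS'14 tree shown earlier, and the `data/THUMOS` path assumes the softlink created above:

```python
import numpy as np
from pathlib import Path

# Sketch only: verify that RGB, flow, and per-frame targets share the same length L
# for one example THUMOS'14 video, using the layout and softlink described above.
root = Path("data/THUMOS")
vid = "video_validation_0000051"

rgb = np.load(root / "rgb_kinetics_resnet50" / f"{vid}.npy")        # (L, 2048)
flow = np.load(root / "flow_kinetics_bninception" / f"{vid}.npy")   # (L, 1024)
target = np.load(root / "target_perframe" / f"{vid}.npy")           # (L, 22)

assert rgb.shape[0] == flow.shape[0] == target.shape[0], "per-frame lengths must match"
print(rgb.shape, flow.shape, target.shape)
```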
## Training
The commands are as follows.
```
cd memory-and-anticipation-transformer
# Training from scratch
python tools/train_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES
# Finetuning from a pretrained model
python tools/train_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
    MODEL.CHECKPOINT $PATH_TO_CHECKPOINT
```
## Online Inference
There are two kinds of evaluation methods in our code.
- First, you can use the config `SOLVER.PHASES "['train', 'test']"` during training. This process divides each test video into non-overlapping samples and makes predictions on all the frames in the short-term memory as if they were the latest frame. Note that this evaluation result is not the final performance, since (1) for most of the frames, their short-term memory is not fully utilized and (2) for simplicity, samples at the boundaries are mostly ignored (an illustrative sketch contrasting the two evaluation modes follows this list).

  ```
  cd memory-and-anticipation-transformer
  # Inference along with training
  python tools/train_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
      SOLVER.PHASES "['train', 'test']"
  ```
- Second, you can run the online inference in *batch mode*. This process evaluates all video frames by considering each of them as the latest frame and filling the long- and short-term memories by tracing back in time. Note that this evaluation result matches the numbers reported in the paper. Moreover, this mode runs faster when you use a large batch size, and we recommend it for performance benchmarking.

  ```
  cd memory-and-anticipation-transformer
  # Online inference in batch mode
  python tools/test_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
      MODEL.CHECKPOINT $PATH_TO_CHECKPOINT MODEL.LSTR.INFERENCE_MODE batch
  ```
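For intuition about how the two modes differ, here is a minimal, illustrative sketch of window construction. It is not the repository's actual implementation; the helper names and the `window` parameter are hypothetical:

```python
# Illustrative sketch only (hypothetical helpers, not code from this repository):
# how evaluation windows could be built in the two modes described above.

def non_overlapping_samples(num_frames, window):
    """Mode 1: split the video into non-overlapping samples; every frame inside a
    window is predicted as if it were the latest frame. A trailing chunk shorter
    than `window` is dropped, which is why boundary frames are mostly ignored."""
    return [list(range(start, start + window))
            for start in range(0, num_frames - window + 1, window)]

def batch_mode_windows(num_frames, window):
    """Mode 2 (batch mode): treat every frame as the latest frame and fill its
    memory by tracing back in time, clamped at the start of the video."""
    return [list(range(max(0, t - window + 1), t + 1)) for t in range(num_frames)]

# Example: a 100-frame video with a 32-frame short-term memory.
print(len(non_overlapping_samples(100, 32)))  # 3 samples -> 96 frames covered
print(len(batch_mode_windows(100, 32)))       # 100 windows -> every frame gets a prediction
```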
## Main Results and Checkpoints
### THUMOS14
| method | feature | mAP (%) | config | checkpoint |
| :----: | :-----: | :-----: | :----: | :--------: |
| MAT | Anet v1.3 | 70.5 | yaml | Download |
| MAT | Kinetics | 71.6 | yaml | Download |
### EK100
| method | feature | verb (overall) | noun (overall) | action (overall) | config | checkpoint |
| :----: | :-----: | :------------: | :------------: | :---------------: | :----: | :--------: |
| MAT | RGB+FLOW | 35.0 | 38.8 | 19.5 | yaml | Download |
## Citations
If you are using the data/code/model provided here in a publication, please cite our paper:
```bibtex
@inproceedings{wang2023memory,
  title={Memory-and-Anticipation Transformer for Online Action Understanding},
  author={Wang, Jiahao and Chen, Guo and Huang, Yifei and Wang, Limin and Lu, Tong},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={13824--13835},
  year={2023}
}
```
## License
This project is licensed under the Apache-2.0 License.
## Acknowledgements
This codebase is built upon LSTR.

The code snippet for evaluation on EK100 is borrowed from TeSTra.

Also, thanks to Mingze Xu and Yue Zhao for their assistance in reproducing the features.