TeSTra: Real-time Online Video Detection with Temporal Smoothing Transformers

Introduction

This is a PyTorch implementation for our ECCV 2022 paper "Real-time Online Video Detection with Temporal Smoothing Transformers".

(Teaser figure)

Environment

Data Preparation

Pre-extracted Features

You can directly download the pre-extracted features (.zip) from the UTBox links below.

THUMOS'14

| Description | backbone | pretrain | UTBox Link |
| --- | --- | --- | --- |
| frame label | N/A | N/A | link |
| RGB | ResNet-50 | Kinetics-400 | link |
| Flow (TV-L1) | BN-Inception | Kinetics-400 | link |
| Flow (NVOF) | BN-Inception | Kinetics-400 | link |
| RGB | ResNet-50 | ANet v1.3 | link |
| Flow (TV-L1) | ResNet-50 | ANet v1.3 | link |

EK100

| Description | backbone | pretrain | UTBox Link |
| --- | --- | --- | --- |
| action label | N/A | N/A | link |
| noun label | N/A | N/A | link |
| verb label | N/A | N/A | link |
| RGB | BN-Inception | IN-1k + EK100 | link |
| Flow (TV-L1) | BN-Inception | IN-1k + EK100 | link |
| Object | Faster-RCNN | MS-COCO + EK55 | link |

Once the zipped files are downloaded, unzip them and organize the files as described below (see Data Structure).

(Alternative) Static links

It may be easier to download from static links via wget on non-GUI systems. To do so, simply change the UTBox link from https://utexas.box.com/s/xxxx to https://utexas.box.com/shared/static/xxxx.zip. Unfortunately, UTBox does not support customized URL names. Therefore, to wget while keeping the filename readable, please refer to the bash scripts provided in DATASET.md.
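
For example, a minimal sketch of such a download (the xxxx hash is the placeholder from the share URL above, and the output filename is hypothetical):

```
# Hypothetical example: fetch a feature archive via its static link under a readable name.
# Replace xxxx with the hash from the corresponding UTBox share URL.
wget -O thumos_rgb_kinetics_resnet50.zip https://utexas.box.com/shared/static/xxxx.zip
unzip thumos_rgb_kinetics_resnet50.zip -d $YOUR_PATH_TO_THUMOS_DATASET
```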

(Alternative) Prepare dataset from scratch

You can also prepare the datasets from scratch yourself.

THUMOS'14

For THUMOS'14, please refer to LSTR.

EK100

For EK100, please find more details at RULSTM.

Computing Optical Flow

I will release a pure-Python version of DenseFlow in the near future and will post a cross-link here once it is done.

Data Structure

  1. If you want to use our dataloaders, please make sure to organize the files in the following structure:

    • THUMOS'14 dataset:

      $YOUR_PATH_TO_THUMOS_DATASET
      ├── rgb_kinetics_resnet50/
      │   ├── video_validation_0000051.npy (of size L x 2048)
      │   ├── ...
      ├── flow_kinetics_bninception/
      │   ├── video_validation_0000051.npy (of size L x 1024)
      │   ├── ...
      ├── target_perframe/
      │   ├── video_validation_0000051.npy (of size L x 22)
      │   ├── ...
      
    • EK100 dataset:

      $YOUR_PATH_TO_EK_DATASET
      ├── rgb_kinetics_bninception/
      │   ├── P01_01.npy (of size L x 2048)
      │   ├── ...
      ├── flow_kinetics_bninception/
      │   ├── P01_01.npy (of size L x 2048)
      │   ├── ...
      ├── target_perframe/
      │   ├── P01_01.npy (of size L x 3807)
      │   ├── ...
      ├── noun_perframe/
      │   ├── P01_01.npy (of size L x 301)
      │   ├── ...
      ├── verb_perframe/
      │   ├── P01_01.npy (of size L x 98)
      │   ├── ...
      
  2. Create softlinks to the datasets:

    cd TeSTra
    ln -s $YOUR_PATH_TO_THUMOS_DATASET data/THUMOS
    ln -s $YOUR_PATH_TO_EK_DATASET data/EK100
    
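
After linking, a quick sanity check can confirm the layout (a minimal sketch, assuming numpy is installed; the file name follows the structure above):

```
# Verify that a feature file loads with the expected shape (L x 2048 for THUMOS'14 RGB).
python -c "import numpy as np; print(np.load('data/THUMOS/rgb_kinetics_resnet50/video_validation_0000051.npy').shape)"
```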

Training

The commands for training are as follows.

```
cd TeSTra/
python tools/train_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES
# Finetuning from a pretrained model
python tools/train_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
    MODEL.CHECKPOINT $PATH_TO_CHECKPOINT
```
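
For instance, a hypothetical invocation (the config path and GPU id are illustrative placeholders; use one of the yaml files linked in the results tables below):

```
# Hypothetical example; substitute a real config file shipped with the repository.
cd TeSTra/
python tools/train_net.py --config_file configs/THUMOS/TESTRA/testra_example.yaml --gpu 0
```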

Online Inference

For existing checkpoints, please refer to the next section.

Batch mode

Run the online inference in batch mode for performance benchmarking.

```
cd TeSTra/
# Online inference in batch mode
python tools/test_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
    MODEL.CHECKPOINT $PATH_TO_CHECKPOINT MODEL.LSTR.INFERENCE_MODE batch
```

Stream mode

Run the online inference in stream mode to calculate runtime in the streaming setting.

```
cd TeSTra/
# Online inference in stream mode
python tools/test_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
    MODEL.CHECKPOINT $PATH_TO_CHECKPOINT MODEL.LSTR.INFERENCE_MODE stream
# The above command can take quite a while over the entire dataset.
# If you only want to look at a particular video, attach an additional argument:
python tools/test_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
    MODEL.CHECKPOINT $PATH_TO_CHECKPOINT MODEL.LSTR.INFERENCE_MODE stream \
    DATA.TEST_SESSION_SET "['$VIDEO_NAME']"
```
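
As a concrete illustration (the video name below is hypothetical; any session from the corresponding test split works):

```
# Hypothetical example: stream-mode timing on a single (illustrative) THUMOS'14 test video.
python tools/test_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
    MODEL.CHECKPOINT $PATH_TO_CHECKPOINT MODEL.LSTR.INFERENCE_MODE stream \
    DATA.TEST_SESSION_SET "['video_test_0000004']"
```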

For more details on the difference between batch mode and stream mode, please check out LSTR.

Main Results and Checkpoints

THUMOS'14

| method | kernel type | mAP (%) | config | checkpoint |
| --- | --- | --- | --- | --- |
| LSTR (baseline) | Cross Attention | 69.9 | yaml | UTBox link |
| TeSTra | Laplace (α=e^-λ=0.97) | 70.8 | yaml | UTBox link |
| TeSTra | Box (α=e^-λ=1.0) | 71.2 | yaml | UTBox link |
| TeSTra (lite) | Box (α=e^-λ=1.0) | 67.3 | yaml | UTBox link |

EK100

| method | kernel type | verb (overall) | noun (overall) | action (overall) | config | checkpoint |
| --- | --- | --- | --- | --- | --- | --- |
| TeSTra | Laplace (α=e^-λ=0.9) | 30.8 | 35.8 | 17.6 | yaml | UTBox link |
| TeSTra | Box (α=e^-λ=1.0) | 31.4 | 33.9 | 17.0 | yaml | UTBox link |

Citations

If you are using the data/code/model provided here in a publication, please cite our paper:

@inproceedings{zhao2022testra,
	title={Real-time Online Video Detection with Temporal Smoothing Transformers},
	author={Zhao, Yue and Kr{\"a}henb{\"u}hl, Philipp},
	booktitle={European Conference on Computer Vision (ECCV)},
	year={2022}
}

Contacts

For any questions, feel free to raise an issue or drop me an email at yzhao [at] cs.utexas.edu.

License

This project is licensed under the Apache-2.0 License.

Acknowledgements

This codebase is built upon LSTR.

The code snippet for evaluation on EK100 is borrowed from RULSTM.

Also, thanks to Mingze Xu for his assistance in reproducing the features on THUMOS'14.