TeSTra: Real-time Online Video Detection with Temporal Smoothing Transformers

Introduction

This is a PyTorch implementation for our ECCV 2022 paper "Real-time Online Video Detection with Temporal Smoothing Transformers".

(Teaser figure)

Environment

Data Preparation

Pre-extracted Features

You can directly download the pre-extracted features (.zip) from the UTBox links below.

THUMOS'14

| Description | backbone | pretrain | UTBox Link |
| --- | --- | --- | --- |
| frame label | N/A | N/A | link |
| RGB | ResNet-50 | Kinetics-400 | link |
| Flow (TV-L1) | BN-Inception | Kinetics-400 | link |
| Flow (NVOF) | BN-Inception | Kinetics-400 | link |
| RGB | ResNet-50 | ANet v1.3 | link |
| Flow (TV-L1) | ResNet-50 | ANet v1.3 | link |

EK100

| Description | backbone | pretrain | UTBox Link |
| --- | --- | --- | --- |
| action label | N/A | N/A | link |
| noun label | N/A | N/A | link |
| verb label | N/A | N/A | link |
| RGB | BN-Inception | IN-1k + EK100 | link |
| Flow (TV-L1) | BN-Inception | IN-1k + EK100 | link |
| Object | Faster-RCNN | MS-COCO + EK55 | link |

Once the zipped files are downloaded, unzip them and organize the files as described below (see Data Structure).

(Alternative) Static links

It may be easier to download from static links via wget on non-GUI systems. To do so, simply change the UTBox link from https://utexas.box.com/s/xxxx to https://utexas.box.com/shared/static/xxxx.zip. Unfortunately, UTBox does not support customized URL names. Therefore, to wget while keeping the filename readable, please refer to the bash scripts provided in DATASET.md.
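
For example, a minimal sketch of such a download (the xxxx hash is the placeholder from the share URL above, and the output filename is hypothetical):

```
# Hypothetical example: fetch a feature archive via its static link under a readable name.
# Replace xxxx with the hash from the corresponding UTBox share URL.
wget -O thumos_rgb_kinetics_resnet50.zip https://utexas.box.com/shared/static/xxxx.zip
unzip thumos_rgb_kinetics_resnet50.zip -d $YOUR_PATH_TO_THUMOS_DATASET
```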

(Alternative) Prepare dataset from scratch

You can also prepare the datasets from scratch yourself.

THUMOS'14

For THUMOS'14, please refer to LSTR.

EK100

For EK100, please find more details at RULSTM.

Computing Optical Flow

I will release a pure-Python version of DenseFlow in the near future and will post a cross-link here once it is done.

Data Structure

  1. If you want to use our dataloaders, please make sure to organize the files in the following structure:

    • THUMOS'14 dataset:

      $YOUR_PATH_TO_THUMOS_DATASET
      ├── rgb_kinetics_resnet50/
      │   ├── video_validation_0000051.npy (of size L x 2048)
      │   ├── ...
      ├── flow_kinetics_bninception/
      │   ├── video_validation_0000051.npy (of size L x 1024)
      │   ├── ...
      ├── target_perframe/
      │   ├── video_validation_0000051.npy (of size L x 22)
      │   ├── ...
      
    • EK100 dataset:

      $YOUR_PATH_TO_EK_DATASET
      ├── rgb_kinetics_bninception/
      │   ├── P01_01.npy (of size L x 2048)
      │   ├── ...
      ├── flow_kinetics_bninception/
      │   ├── P01_01.npy (of size L x 2048)
      │   ├── ...
      ├── target_perframe/
      │   ├── P01_01.npy (of size L x 3807)
      │   ├── ...
      ├── noun_perframe/
      │   ├── P01_01.npy (of size L x 301)
      │   ├── ...
      ├── verb_perframe/
      │   ├── P01_01.npy (of size L x 98)
      │   ├── ...
      
  2. Create softlinks to the datasets:

    cd TeSTra
    ln -s $YOUR_PATH_TO_THUMOS_DATASET data/THUMOS
    ln -s $YOUR_PATH_TO_EK_DATASET data/EK100
    
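
After linking, a quick sanity check can confirm the layout (a minimal sketch, assuming numpy is installed; the file name follows the structure above):

```
# Verify that a feature file loads with the expected shape (L x 2048 for THUMOS'14 RGB).
python -c "import numpy as np; print(np.load('data/THUMOS/rgb_kinetics_resnet50/video_validation_0000051.npy').shape)"
```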

Training

The commands for training are as follows.

```
cd TeSTra/
python tools/train_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES
# Finetuning from a pretrained model
python tools/train_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
    MODEL.CHECKPOINT $PATH_TO_CHECKPOINT
```
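
For instance, a hypothetical invocation (the config path and GPU id are illustrative placeholders; use one of the yaml files linked in the results tables below):

```
# Hypothetical example; substitute a real config file shipped with the repository.
cd TeSTra/
python tools/train_net.py --config_file configs/THUMOS/TESTRA/testra_example.yaml --gpu 0
```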

Online Inference

For existing checkpoints, please refer to the next section.

Batch mode

Run the online inference in batch mode for performance benchmarking.

```
cd TeSTra/
# Online inference in batch mode
python tools/test_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
    MODEL.CHECKPOINT $PATH_TO_CHECKPOINT MODEL.LSTR.INFERENCE_MODE batch
```

Stream mode

Run the online inference in stream mode to calculate runtime in the streaming setting.

```
cd TeSTra/
# Online inference in stream mode
python tools/test_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
    MODEL.CHECKPOINT $PATH_TO_CHECKPOINT MODEL.LSTR.INFERENCE_MODE stream
# The above command can take quite a while over the entire dataset.
# If you only want to look at a particular video, attach an additional argument:
python tools/test_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
    MODEL.CHECKPOINT $PATH_TO_CHECKPOINT MODEL.LSTR.INFERENCE_MODE stream \
    DATA.TEST_SESSION_SET "['$VIDEO_NAME']"
```
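
As a concrete illustration (the video name below is hypothetical; any session from the corresponding test split works):

```
# Hypothetical example: stream-mode timing on a single (illustrative) THUMOS'14 test video.
python tools/test_net.py --config_file $PATH_TO_CONFIG_FILE --gpu $CUDA_VISIBLE_DEVICES \
    MODEL.CHECKPOINT $PATH_TO_CHECKPOINT MODEL.LSTR.INFERENCE_MODE stream \
    DATA.TEST_SESSION_SET "['video_test_0000004']"
```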

For more details on the difference between batch mode and stream mode, please check out LSTR.

Main Results and Checkpoints

THUMOS'14

| method | kernel type | mAP (%) | config | checkpoint |
| --- | --- | --- | --- | --- |
| LSTR (baseline) | Cross Attention | 69.9 | yaml | UTBox link |
| TeSTra | Laplace (α=e^-λ=0.97) | 70.8 | yaml | UTBox link |
| TeSTra | Box (α=e^-λ=1.0) | 71.2 | yaml | UTBox link |
| TeSTra (lite) | Box (α=e^-λ=1.0) | 67.3 | yaml | UTBox link |

EK100

| method | kernel type | verb (overall) | noun (overall) | action (overall) | config | checkpoint |
| --- | --- | --- | --- | --- | --- | --- |
| TeSTra | Laplace (α=e^-λ=0.9) | 30.8 | 35.8 | 17.6 | yaml | UTBox link |
| TeSTra | Box (α=e^-λ=1.0) | 31.4 | 33.9 | 17.0 | yaml | UTBox link |

Citations

If you are using the data/code/model provided here in a publication, please cite our paper:

@inproceedings{zhao2022testra,
	title={Real-time Online Video Detection with Temporal Smoothing Transformers},
	author={Zhao, Yue and Kr{\"a}henb{\"u}hl, Philipp},
	booktitle={European Conference on Computer Vision (ECCV)},
	year={2022}
}

Contacts

For any questions, feel free to raise an issue or drop me an email at yzhao [at] cs.utexas.edu.

License

This project is licensed under the Apache-2.0 License.

Acknowledgements

This codebase is built upon LSTR.

The code snippet for evaluation on EK100 is borrowed from RULSTM.

Also, thanks to Mingze Xu for his assistance in reproducing the features on THUMOS'14.