
The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction

<a href="https://alexandrosstergiou.github.io/project_pages/TemPr/index.html">[Project page šŸŒ]</a> <a href="http://arxiv.org/abs/2204.13340">[ArXiv preprint šŸ“ƒ]</a> <a href="https://youtu.be/dcmd8U47BT8">[Video šŸŽžļø]</a>


This is the code implementation for the CVPR'23 paper <a href="http://arxiv.org/abs/2204.13340">The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction</a>.

Abstract

Early action prediction deals with inferring the ongoing action from partially-observed videos, typically at the outset of the video. We propose a bottleneck-based attention model that captures the evolution of the action, through progressive sampling over fine-to-coarse scales. Our proposed <b>Tem</b>poral <b>Pr</b>ogressive (TemPr) model is composed of multiple attention towers, one for each scale. The predicted action label is based on the collective agreement considering confidences of these towers. Extensive experiments over four video datasets showcase state-of-the-art performance on the task of Early Action Prediction across a range of encoder architectures. We demonstrate the effectiveness and consistency of TemPr through detailed ablations.

<p align="center"> <img src="./figures/TemPr_h_back_hl.png" width="700" height="370" /> </p>

Dependencies

Ensure that the required packages are installed on your machine.

You can install the available PyPi packages with the command below:

$ pip install coloredlogs dataset2database einops ffmpeg-python imgaug opencv-python ptflops torch torchvision youtube-dl

and compile the adaPool package as:

$ git clone https://github.com/alexandrosstergiou/adaPool.git && cd adaPool/pytorch && make install
--- (optional) ---
$ make test
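After installation, you can sanity-check that the Python dependencies resolved correctly. A minimal sketch (note that the import names of opencv-python and ffmpeg-python differ from their PyPI names):

```python
import importlib.util

def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Import names for the PyPI packages listed above
# (opencv-python imports as cv2, ffmpeg-python as ffmpeg).
required = ["coloredlogs", "einops", "cv2", "ffmpeg", "ptflops", "torch", "torchvision"]
print("missing:", missing_modules(required) or "none")
```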

Datasets

A custom format is used for the train/val label files of each dataset:

| label | youtube_id/id | time_start (optional) | time_end (optional) | split |
|-------|---------------|-----------------------|----------------------|-------|

Label files can be created through the scripts provided in labels.
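As an illustration only (the scripts in labels are the authoritative way to build these files), a label file in the format above could be written like this, with hypothetical ids:

```python
import csv

# Hypothetical rows following the label-file format:
# label, youtube_id/id, time_start (optional), time_end (optional), split
rows = [
    {"label": "archery", "id": "v_Archery_g01_c01",
     "time_start": "", "time_end": "", "split": "train"},
    {"label": "biking", "id": "v_Biking_g02_c03",
     "time_start": "", "time_end": "", "split": "val"},
]

with open("labels_example.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["label", "id", "time_start", "time_end", "split"])
    writer.writeheader()
    writer.writerows(rows)
```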

We have tested our code over the following datasets:

Videos and image-based datasets

Two options are supported by the repo, depending on the format in which the dataset is stored on disk:

By default, the data are assumed to be in video format; however, you can override this by setting the use_frames call argument to True/true.

Data directory format

We assume a fixed directory formatting that should be of the following structure:

<data>
|
ā””ā”€ā”€ā”€<dataset>
        |
        ā””ā”€ā”€ā”€ <class_i>
        ā”‚     ā”‚
        ā”‚     ā”‚ā”€ā”€ā”€ <video_id_j>
        ā”‚     ā”‚         (for datasets w/ videos saved as frames)
        ā”‚     ā”‚         ā”‚
        ā”‚     ā”‚         ā”‚ā”€ā”€ā”€ frame1.jpg
        ā”‚     ā”‚         ā””ā”€ā”€ā”€ framen.jpg
        ā”‚     ā”‚    
        ā”‚     ā”‚ā”€ā”€ā”€ <video_id_j+1>
        ā”‚     ā”‚         (for datasets w/ videos saved as frames)
        ā”‚     ā”‚         ā”‚
        ā”‚     ā”‚         ā”‚ā”€ā”€ā”€ frame1.jpg
        ā”‚     ā”‚         ā””ā”€ā”€ā”€ framen.jpg
       ...   ...
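A small sketch for checking that a dataset on disk follows this layout, assuming the data/dataset/class/video nesting shown above (the helper name is hypothetical, not part of the repo):

```python
from pathlib import Path

def index_dataset(root):
    """Map each class directory under <data>/<dataset> to its entries
    (video files, or per-video frame folders when use_frames is set)."""
    root = Path(root)
    return {c.name: sorted(p.name for p in c.iterdir())
            for c in sorted(root.iterdir()) if c.is_dir()}
```

Running this over the dataset root gives a quick overview of classes and how many videos each one contains before launching a full training run.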

Usage

Training for each of the datasets is configured through the identically named .yaml configuration files in configs.

You can also use the argument parsers in train.py and inference.py for custom arguments.

Examples

Train on UCF-101 with observation ratio 0.3, 3 scales, a movinet backbone, the pretrained UCF-101 backbone checkpoint stored in weights, and 4 GPUs:

python train.py --video_per 0.3 --num_samplers 3 --gpus 0 1 2 3 --precision mixed --dataset UCF-101 --frame_size 224 --batch_size 64 --data_dir data/UCF-101/ --label_dir /labels/UCF-101 --workers 16 --backbone movinet --end_epoch 70 --pretrained_dir weights/UCF-101/movinet_ada_best.pth

Run inference on Something-Something v2 with TemPr and adaptive ensemble on a single GPU, with checkpoint file my_chckpt.pth:

python inference.py --config config/inference/smthng-smthng/config.yml --head TemPr_h --pool ada --gpus 0 --pretrained_dir my_chckpt.pth

Calling arguments (for both train.py & inference.py)

The following arguments can be passed to the parser of any training script.

| Argument name | Functionality |
|---------------|---------------|
| debug-mode | Boolean for debugging messages. Useful for custom implementations/datasets. |
| dataset | String for the name of the dataset, used to obtain the respective configurations. |
| data_dir | String for the directory to load data from. |
| label_dir | String for the directory holding the train and val splits (should contain train.csv and val.csv). |
| clip-length | Integer determining the number of frames to be used for each video. |
| clip-size | Tuple for the spatial size (height x width) of each frame. |
| backbone | String for the name of the feature-extractor network. |
| accum_grads | Integer for the number of iterations over which gradients are accumulated before a backward pass. Set to 1 to disable gradient accumulation. |
| use_frames | Boolean flag. When set to True, the dataset directory should contain folders of .jpg frames; otherwise, video files. |
| head | String for the name of the attention-tower network. Only TemPr_h is currently supported. |
| pool | String for the predictor-aggregation method to be used. |
| gpus | List of the GPU ids to be used. |
| pretrained-3d | String for the .pth filepath when weights are to be initialised from a previously trained model. A non-strict weight-loading implementation exists that removes certain prefixes from the state_dict keys. |
| config | String for the .yaml configuration file to be used. Arguments passed explicitly by the user take precedence over those in the YAML file. |
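For pretrained-3d, the non-strict loading described above amounts to remapping checkpoint keys before calling load_state_dict. A minimal sketch (the helper name and default prefix are illustrative, not the repo's exact implementation):

```python
def strip_prefix(state_dict, prefix="module."):
    """Drop a wrapper prefix (e.g. added by nn.DataParallel) from checkpoint keys,
    leaving all other keys untouched."""
    return {k[len(prefix):] if k.startswith(prefix) else k: v
            for k, v in state_dict.items()}

# Typical use with PyTorch (strict=False tolerates missing/unexpected keys):
# model.load_state_dict(strip_prefix(torch.load(path, map_location="cpu")), strict=False)
```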

Checkpoints

UCF-101

| Backbone | $\rho=0.1$ | $\rho=0.2$ | $\rho=0.3$ | $\rho=0.4$ | $\rho=0.5$ | $\rho=0.6$ | $\rho=0.7$ | $\rho=0.8$ | $\rho=0.9$ |
|----------|------------|------------|------------|------------|------------|------------|------------|------------|------------|
| x3d | chkp | chkp | chkp | chkp | chkp | chkp | chkp | chkp | chkp |
| movinet | chkp | chkp | chkp | chkp | chkp | chkp | chkp | chkp | chkp |

SSsub21

| Backbone | $\rho=0.1$ | $\rho=0.2$ | $\rho=0.3$ | $\rho=0.5$ | $\rho=0.7$ | $\rho=0.9$ |
|----------|------------|------------|------------|------------|------------|------------|
| movinet | chkp | chkp | chkp | chkp | chkp | chkp |

Citation

@inproceedings{stergiou2023wisdom,
    title = {The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction},
    author = {Stergiou, Alexandros and Damen, Dima},
    booktitle = {IEEE/CVF Computer Vision and Pattern Recognition (CVPR)},
    year = {2023}
}

License

MIT