EPIC-KITCHENS Action Recognition baselines

Train/Val/Test splits and annotations are available at Annotations Repo

To participate and submit to this challenge, register at Action Recognition Codalab Challenge

This repo contains:

News

2021/02/20 - It was pointed out in Issue 6 that our checkpoints were trained with the mean and standard deviation for data preprocessing set to the same value (the ImageNet mean). We retrained the networks with the mean set to the ImageNet mean and the std set to the ImageNet std, and observed worse results for all models (TSN/TRN/TSM); see Issue 8 for details. Consequently, we will retain our original, higher-accuracy checkpoints, which were trained with mean = [0.485, 0.456, 0.406] and std = [0.485, 0.456, 0.406].
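To make the effect of this setting concrete, the sketch below applies the per-channel normalisation (x - mean) / std to a single RGB pixel with channel values in [0, 1]. The helper `normalise_pixel` is hypothetical and not part of this codebase; it only illustrates that, with std equal to mean, the transform reduces to x / mean - 1.

```python
# Illustrative only: `normalise_pixel` is a hypothetical helper, not part of this repo.
MEAN = [0.485, 0.456, 0.406]  # ImageNet mean
STD = [0.485, 0.456, 0.406]   # same values as MEAN, matching the released checkpoints


def normalise_pixel(pixel, mean=MEAN, std=STD):
    """Normalise one RGB pixel (channel values in [0, 1]) channel by channel."""
    return [(x - m) / s for x, m, s in zip(pixel, mean, std)]


# A pixel equal to the mean maps to zero in every channel.
print(normalise_pixel(MEAN))
# A white pixel maps to 1 / mean - 1 in each channel because std == mean.
print(normalise_pixel([1.0, 1.0, 1.0]))
```

If you preprocess data yourself for use with the released checkpoints, the same mean and std values should be used at inference time as at training time.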

Environment setup

We provide a conda environment definition in environment.yml that defines the dependencies you need to run this codebase. Set up the environment by running:

$ conda env create -n epic-models -f environment.yml
$ conda activate epic-models

Prep data

Gulp the train/validation/test sets from the provided extracted frames

RGB

$ python src/gulp_data.py \
    /path/to/rgb/frames \
    gulp/rgb_train \
    /path/to/EPIC_100_train.pkl \
    rgb
$ python src/gulp_data.py \
    /path/to/rgb/frames \
    gulp/rgb_validation \
    /path/to/EPIC_100_validation.pkl \
    rgb
$ python src/gulp_data.py \
    /path/to/rgb/frames \
    gulp/rgb_test \
    /path/to/EPIC_100_test_timestamps.pkl \
    rgb

Optical Flow

First, convert the frame numbers from those of the RGB frames to those of the flow frames (in 2018 we extracted optical flow for every other frame).

$ python src/convert_rgb_to_flow_frame_idxs.py \
    /path/to/EPIC_100_train.pkl \
    EPIC_100_train_flow.pkl
$ python src/convert_rgb_to_flow_frame_idxs.py \
    /path/to/EPIC_100_validation.pkl \
    EPIC_100_validation_flow.pkl
$ python src/convert_rgb_to_flow_frame_idxs.py \
    /path/to/EPIC_100_test_timestamps.pkl \
    EPIC_100_test_timestamps_flow.pkl

We can then proceed with gulping the data.


$ python src/gulp_data.py \
    /path/to/flow/frames \
    gulp/flow_train \
    EPIC_100_train_flow.pkl \
    flow
$ python src/gulp_data.py \
    /path/to/flow/frames \
    gulp/flow_validation \
    EPIC_100_validation_flow.pkl \
    flow
$ python src/gulp_data.py \
    /path/to/flow/frames \
    gulp/flow_test \
    EPIC_100_test_timestamps_flow.pkl \
    flow

Validating the data

Check out notebooks/dataset.ipynb to visualise the gulped RGB and optical flow as a sanity check.

Training

We provide configurations for training the models to reproduce the results in Table 3 of "Rescaling Egocentric Vision".

We first train a network on each modality separately. We then produce results on the validation/test set and fuse the modalities by averaging their pre-softmax scores. See the next section for how to do this.

Training is implemented using PyTorch Lightning, and configuration is managed by Hydra.

To train a network, run the following:

# See configs/tsn_rgb.yaml for an example configuration file.
# You can override config values by passing key-value pairs as arguments.
# You can change the config by setting --config-name to the name of a file in configs
# without the yaml suffix.
$ python src/train.py \
    --config-name tsn_rgb \
    data._root_gulp_dir=/path/to/gulp/root \
    data.worker_count=$(nproc) \
    learning.batch_size=64 \
    trainer.gpus=4 \
    hydra.run.dir=outputs/experiment-name

# View logs with tensorboard
$ tensorboard --logdir outputs/experiment-name --bind_all

If you want to resume training from a checkpoint partway through, run:

$ python src/train.py \
    --config-name tsn_rgb \
    data._root_gulp_dir=/path/to/gulp/root \
    data.worker_count=$(nproc) \
    learning.batch_size=64 \
    trainer.gpus=4 \
    hydra.run.dir=outputs/experiment-name \
    +trainer.resume_from_checkpoint="'$PWD/outputs/experiment-name/lightning_logs/version_0/checkpoints/epoch=N.ckpt'"

Note the use of single quotes within the double quotes: this stops hydra from interpreting the = in the checkpoint filename as a malformed key-value pair.

Any keyword arguments can be injected into the pytorch-lightning Trainer object through the CLI by using +trainer.<kwarg>=<value>.
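
If you launch training from a Python script rather than a shell, the same quoting concern applies to the override string itself. The sketch below builds an argument list for this purpose; `build_resume_args` is a hypothetical helper, and the config name and checkpoint path mirror the example above. Because a list of arguments bypasses the shell, only the inner single quotes (for hydra) are needed, and the path is made absolute in the same spirit as the $PWD prefix above.

```python
import os


def build_resume_args(checkpoint_path: str, config_name: str = "tsn_rgb"):
    """Build an argv list for resuming training (hypothetical helper).

    The checkpoint path is wrapped in single quotes so that hydra treats the
    '=' inside 'epoch=N.ckpt' as part of the value, not as a new key-value pair.
    """
    abs_path = os.path.abspath(checkpoint_path)
    return [
        "python",
        "src/train.py",
        "--config-name",
        config_name,
        f"+trainer.resume_from_checkpoint='{abs_path}'",
    ]


args = build_resume_args(
    "outputs/experiment-name/lightning_logs/version_0/checkpoints/epoch=N.ckpt"
)
print(args[-1])
# The list can then be passed to subprocess.run(args, check=True) from the repo root.
```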

Testing

Once you have trained a model, you can test that model by using the test.py script which takes the checkpoint file and writes a prediction.pt file containing the model output for all examples in the validation or test set.

# Get model results on the validation set for computing metrics
$ python src/test.py \
    outputs/experiment-name/lightning_logs/version_0/checkpoints/epoch=N.ckpt \
    outputs/experiment-name/lightning_logs/version_0/results/val_results_epoch=N.pt \
    --split val

# Get model results on the test set for submission to the challenge
$ python src/test.py \
    outputs/experiment-name/lightning_logs/version_0/checkpoints/epoch=N.ckpt \
    outputs/experiment-name/lightning_logs/version_0/results/test_results_epoch=N.pt \
    --split test

You can fuse results from multiple modalities:

$ python src/fuse.py \
    outputs/experiment-name-rgb/lightning_logs/version_0/results/test_results_epoch=N.pt \
    outputs/experiment-name-flow/lightning_logs/version_0/results/test_results_epoch=N.pt \
    experiment-name-fused.pt

These fused results can then be passed to the evaluation script or JSON submission generation script like any other single-modality results file.
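
Conceptually, the fusion step averages the per-class scores of each modality before any softmax is applied and then takes the highest-scoring class. The sketch below shows this on plain Python lists; `fuse_scores`, `predict`, and the toy numbers are illustrative assumptions and do not reflect the actual file format read by src/fuse.py.

```python
def fuse_scores(*modality_scores):
    """Average pre-softmax class scores from several modalities, element-wise."""
    if not modality_scores:
        raise ValueError("need at least one modality")
    n_classes = len(modality_scores[0])
    if any(len(scores) != n_classes for scores in modality_scores):
        raise ValueError("all modalities must score the same classes")
    return [sum(vals) / len(modality_scores) for vals in zip(*modality_scores)]


def predict(scores):
    """Return the index of the highest-scoring class."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy scores for three classes (made-up numbers).
rgb_scores = [2.0, 0.5, -1.0]
flow_scores = [0.0, 3.0, -1.0]

fused = fuse_scores(rgb_scores, flow_scores)
print(fused)           # [1.0, 1.75, -1.0]
print(predict(fused))  # 1
```

In this toy case the RGB scores alone favour class 0, while the fused scores favour class 1.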

Evaluating models and competition submissions

Please see details in https://github.com/epic-kitchens/C1-Action-Recognition for how to evaluate the models in this repo.

Training from existing weights

If you already have weights you wish to use as an initialisation, specify +model.weights=path/to/weights as an argument when training. This must be a torch.save serialised state dictionary for the full model. If you only have partial weights with which to initialise the model, dump a randomly initialised state dict for the model and then inject the weights you have into it.
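
One way to follow this recipe is sketched below. `inject_partial_weights` is a hypothetical helper, not part of this repo, and the parameter names in the toy example are made up. It works on any mapping from parameter names to values, so in practice `full_state` would come from `model.state_dict()` of a freshly constructed model, and the merged result would be written out with `torch.save`.

```python
def inject_partial_weights(full_state, partial_state):
    """Return a copy of `full_state` with entries overridden by `partial_state`.

    Raises if the partial weights contain names the model does not have, or if
    a tensor-like value has a different shape to the one it replaces.
    """
    unknown = sorted(set(partial_state) - set(full_state))
    if unknown:
        raise KeyError(f"weights not present in the model: {unknown}")
    merged = dict(full_state)
    for name, value in partial_state.items():
        old_shape = getattr(merged[name], "shape", None)
        new_shape = getattr(value, "shape", None)
        if old_shape != new_shape:
            raise ValueError(f"shape mismatch for {name}: {old_shape} vs {new_shape}")
        merged[name] = value
    return merged


# Toy example with plain floats standing in for tensors (hypothetical names).
random_init = {"backbone.weight": 0.1, "classifier.weight": 0.2}
pretrained = {"backbone.weight": 0.9}
print(inject_partial_weights(random_init, pretrained))
```

Checking for unknown names up front is a deliberate choice: it surfaces naming mismatches between the partial weights and the model instead of silently ignoring them.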

Pretrained models

We provide models pretrained on the training set of EPIC-KITCHENS-100.

| Model | Modality | Action@1 (Val) | Action@1 (Test) | Link |
|-------|----------|----------------|-----------------|------|
| TSN | RGB | 27.40 | 24.11 | https://www.dropbox.com/s/4i99mzddk95edyq/tsn_rgb.ckpt?dl=1 |
| TSN | Flow | 22.86 | 24.62 | https://www.dropbox.com/s/res0i1ns7v30g9y/tsn_flow.ckpt?dl=1 |
| TRN | RGB | 32.64 | 29.54 | https://www.dropbox.com/s/l1cs7kozz3f03r4/trn_rgb.ckpt?dl=1 |
| TRN | Flow | 22.97 | 23.43 | https://www.dropbox.com/s/4rehj36vyip82mu/trn_flow.ckpt?dl=1 |
| TSM | RGB | 35.75 | 32.82 | https://www.dropbox.com/s/5yxnzubch7b6niu/tsm_rgb.ckpt?dl=1 |
| TSM | Flow | 27.79 | 27.99 | https://www.dropbox.com/s/8x9hh404k641rqj/tsm_flow.ckpt?dl=1 |

Acknowledgements

If you make use of this repository, please cite our dataset papers:

@ARTICLE{Damen2020RESCALING,
   title={Rescaling Egocentric Vision},
   author={Damen, Dima and Doughty, Hazel and Farinella, Giovanni Maria and Furnari, Antonino
           and Ma, Jian and Kazakos, Evangelos and Moltisanti, Davide and Munro, Jonathan
           and Perrett, Toby and Price, Will and Wray, Michael},
           journal   = {CoRR},
           volume    = {abs/2006.13256},
           year      = {2020},
           ee        = {http://arxiv.org/abs/2006.13256},
}

@INPROCEEDINGS{Damen2018EPICKITCHENS,
   title={Scaling Egocentric Vision: The EPIC-KITCHENS Dataset},
   author={Damen, Dima and Doughty, Hazel and Farinella, Giovanni Maria  and Fidler, Sanja and
           Furnari, Antonino and Kazakos, Evangelos and Moltisanti, Davide and Munro, Jonathan
           and Perrett, Toby and Price, Will and Wray, Michael},
   booktitle={European Conference on Computer Vision (ECCV)},
   year={2018}
}

TSN

We thank the authors of TSN for providing their codebase, from which we took:

Please cite their work if you make use of this network:

@InProceedings{wang2016_TemporalSegmentNetworks,
    title={Temporal Segment Networks: Towards Good Practices for Deep Action Recognition},
    author={Limin Wang and Yuanjun Xiong and Zhe Wang and Yu Qiao and Dahua Lin and
            Xiaoou Tang and Luc {Van Gool}},
    booktitle={The European Conference on Computer Vision (ECCV)},
    year={2016}
}

TRN

We thank the authors of TRN for providing their codebase, from which we took:

Please cite their work if you make use of this network:

@article{zhou2017temporalrelation,
    title = {Temporal Relational Reasoning in Videos},
    author = {Zhou, Bolei and Andonian, Alex and Oliva, Aude and Torralba, Antonio},
    journal={European Conference on Computer Vision},
    year={2018}
}

TSM

We thank the authors of TSM for providing their codebase, from which we took:

Please cite their work if you make use of this network:

@inproceedings{lin2019tsm,
  title={TSM: Temporal Shift Module for Efficient Video Understanding},
  author={Lin, Ji and Gan, Chuang and Han, Song},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision},
  year={2019}
}

License

Copyright University of Bristol. The repository is published under the Creative Commons Attribution-NonCommercial 4.0 International License. This means that you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. You may not use the material for commercial purposes.