SEVERE Benchmark
📰 News
[2023.8.22] Code and pre-trained models of Tubelet Contrast will be released soon! Keep an eye on this repo!<br> [2023.8.22] Code for evaluating Tubelet Contrast pretrained models has been added to this repo. 🎉<br> [2023.7.13] Our [Tubelet Contrast](https://arxiv.org/abs/2303.11003) paper is accepted at ICCV 2023! 🎉<br>
Official code for our ECCV 2022 paper How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning?
TL;DR. We propose the SEVERE (<ins>SE</ins>nsitivity of <ins>V</ins>id<ins>E</ins>o <ins>RE</ins>presentations) benchmark for evaluating the generalizability of representations obtained by existing and future self-supervised video learning methods.
Overview of Experiments
We evaluate 9 video self-supervised learning (VSSL) methods on 7 video datasets for 6 video understanding tasks.
Evaluated VSSL models
Below are the video self-supervised methods that we evaluate.
- For SeLaVi, MoCo, VideoMoCo, Pretext-Contrast, CtP, TCLR and GDT, we use the Kinetics-400 pretrained R(2+1)D-18 weights provided by the authors.
- For RSPNet and AVID-CMA, the author-provided weights differ from the R(2+1)D-18 architecture defined in 'A Closer Look at Spatiotemporal Convolutions for Action Recognition'. Thus we use the official implementations of RSPNet and AVID-CMA to pretrain with the common R(2+1)D-18 backbone on the Kinetics-400 dataset.
- For the supervised baseline, we use the Kinetics-400 pretrained R(2+1)D-18 weights from the torchvision library.
Download the Kinetics-400 pretrained R(2+1)D-18 weights for each method from here. Unzipping the downloaded file creates a folder checkpoints_pretraining/ containing all the pretrained model weights.
Experiments
We divide these downstream evaluations across four axes:
I. Downstream domain-shift
We evaluate the sensitivity of self-supervised methods to domain shift in the downstream dataset with respect to the pre-training dataset, i.e. Kinetics-400.
Please refer to action_recognition/README.md for steps to reproduce the experiments on downstream datasets from varying domains.
II. Downstream sample-sizes
We evaluate the sensitivity of self-supervised methods to the amount of downstream samples available for finetuning.
Please refer to action_recognition/README.md for steps to reproduce the experiments with varying downstream samples.
III. Downstream action granularities
We investigate whether self-supervised methods can learn fine-grained features required for recognizing semantically similar actions.
We evaluate on various subsets defined for the [Fine-Gym](https://sdolivia.github.io/FineGym/) dataset. Please refer to action_recognition/README.md for steps to reproduce the experiments with varying downstream action granularities.
IV. Downstream task-shift
We study the sensitivity of video self-supervised methods to the nature of the downstream task.
In-domain task shift: For task-shift within the pre-training domain, we evaluate on the UCF-101 dataset for the task of repetition counting. Please refer to Repetition-Counting/README.md for steps to reproduce the experiments.
Out-of-domain task shift: For task-shift as well as domain shift, we evaluate on multi-label action classification on Charades and action detection on AVA. Please refer to action_detection_multi_label_classification/README.md for steps to reproduce the experiments.
The SEVERE Benchmark
From our analysis we distill the SEVERE benchmark, a subset of our experiments that can be useful for evaluating current and future video representations beyond standard benchmarks.
Citation
If you use our work or code, please consider citing our paper:
@inproceedings{thoker2022severe,
  author    = {Thoker, Fida Mohammad and Doughty, Hazel and Bagad, Piyush and Snoek, Cees},
  title     = {How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning?},
  booktitle = {ECCV},
  year      = {2022},
}
Acknowledgements
Maintainers
:bell: If you face an issue or have suggestions, please create a GitHub issue and we will do our best to address it soon.