

SEVERE Benchmark

📰 News

[2023.8.22] Code and pre-trained models of Tubelet Contrast will be released soon! Keep an eye on this repo!<br>
[2023.8.22] Code for evaluating Tubelet Contrast pretrained models has been added to this repo. 🎉<br>
[2023.7.13] Our [Tubelet Contrast](https://arxiv.org/abs/2303.11003) paper has been accepted to ICCV 2023! 🎉<br>

Official code for our ECCV 2022 paper How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning?

TL;DR. We propose the SEVERE (<ins>SE</ins>nsitivity of <ins>V</ins>id<ins>E</ins>o <ins>RE</ins>presentations) benchmark for evaluating the generalizability of representations obtained by existing and future self-supervised video learning methods.

Overview of Experiments

We evaluate 9 video self-supervised learning (VSSL) methods on 7 video datasets for 6 video understanding tasks.

Evaluated VSSL models

Below are the video self-supervised methods that we evaluate.


Download the Kinetics-400 pretrained R(2+1)D-18 weights for each method from here. Unzipping the downloaded file creates a folder checkpoints_pretraining/ containing all the pretrained model weights.
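If you prefer to unpack the archive from Python rather than with a command-line tool, the following is a minimal sketch using only the standard library. The function name `extract_checkpoints` and the archive file name in the usage comment are assumptions made for illustration; substitute the name of the file you actually downloaded.

```python
import zipfile
from pathlib import Path


def extract_checkpoints(archive_path, dest_dir="."):
    """Extract a zip archive into dest_dir and return the extracted file paths, sorted."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest)
        # Skip directory entries so that only files are returned.
        names = [name for name in archive.namelist() if not name.endswith("/")]
    return sorted(dest / name for name in names)


# Usage (hypothetical archive name):
#   for path in extract_checkpoints("checkpoints_pretraining.zip"):
#       print(path)
```

Printing the returned paths is a quick way to check that the weights for every method ended up under checkpoints_pretraining/.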


We divide these downstream evaluations across four axes:

I. Downstream domain-shift

We evaluate the sensitivity of self-supervised methods to the domain shift between the downstream dataset and the pre-training dataset, i.e., Kinetics.

Please refer to action_recognition/README.md for steps to reproduce the experiments with varying downstream domain datasets.

II. Downstream sample-sizes

We evaluate the sensitivity of self-supervised methods to the number of downstream samples available for finetuning.

Please refer to action_recognition/README.md for steps to reproduce the experiments with varying downstream samples.
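To illustrate what a reduced finetuning set can look like, here is a minimal sketch that draws a fixed number of training samples spread evenly across classes. It is not the protocol used in this repo (follow action_recognition/README.md for that); the function name `subsample` and the `(video_id, label)` input format are assumptions made for illustration.

```python
import random
from collections import defaultdict


def subsample(samples, num_samples, seed=0):
    """Pick up to num_samples (video_id, label) pairs, spread evenly over classes."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_label = defaultdict(list)
    for video_id, label in samples:
        by_label[label].append((video_id, label))
    for items in by_label.values():
        rng.shuffle(items)

    labels = sorted(by_label)
    subset = []
    # Round-robin over classes so every class is represented as evenly as possible.
    while len(subset) < num_samples and any(by_label[label] for label in labels):
        for label in labels:
            if by_label[label] and len(subset) < num_samples:
                subset.append(by_label[label].pop())
    return subset
```

Calling `subsample(train_list, 1000)` with the same seed always returns the same subset, which keeps comparisons between pretraining methods on equal footing.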

III. Downstream action granularities

We investigate whether self-supervised methods can learn fine-grained features required for recognizing semantically similar actions.

<!--- We evaluate on various subsets defined for [Fine-Gym](https://sdolivia.github.io/FineGym/) dataset. -->

Please refer to action_recognition/README.md for steps to reproduce the experiments with varying downstream actions.

IV. Downstream task-shift

We study the sensitivity of video self-supervised methods to the nature of the downstream task.

In-domain task shift: For task shift within the same domain, we evaluate on the UCF dataset for the task of repetition counting. Please refer to Repetition-Counting/README.md for steps to reproduce the experiments.

Out-of-domain task shift: For task-shift as well as domain shift, we evaluate on multi-label action classification on Charades and action detection on AVA. Please refer to action_detection_multi_label_classification/README.md for steps to reproduce the experiments.

The SEVERE Benchmark

From our analysis we distill the SEVERE-benchmark, a subset of our experiments, that can be useful for evaluating current and future video representations beyond standard benchmarks.


If you use our work or code, kindly consider citing our paper:

@inproceedings{thoker2022severe,
  author    = {Thoker, Fida Mohammad and Doughty, Hazel and Bagad, Piyush and Snoek, Cees},
  title     = {How Severe is Benchmark-Sensitivity in Video Self-Supervised Learning?},
  journal   = {ECCV},
  year      = {2022},
}



:bell: If you face an issue or have suggestions, please create a GitHub issue and we will do our best to address it soon.