
# <img src="figs/bear.png" width="30"/> BEAR: a new BEnchmark on video Action Recognition

This repo contains the data and pre-trained models presented in "A Large-scale Study of Spatiotemporal Representation Learning with a New Benchmark on Action Recognition".

Andong Deng*, Taojiannan Yang*, Chen Chen<br> Center for Research in Computer Vision, University of Central Florida

[CVF]

If you find our work useful in your research, please cite:

```bibtex
@article{deng2023BEAR,
  title={A Large-scale Study of Spatiotemporal Representation Learning with a New Benchmark on Action Recognition},
  author={Deng, Andong and Yang, Taojiannan and Chen, Chen},
  journal={arXiv preprint arXiv:2303.13505},
  year={2023}
}
```

## Updates

04/21/2024 Update HuggingFace link for pre-trained models.

08/08/2023 Update Dropbox link for pre-trained models.

07/17/2023 BEAR is accepted by ICCV 2023!

03/24/2023 Update Dropbox link for Mini-Sports1M.

03/23/2023 Initial commits

## Introduction

<div align="center"> <img src="figs/BEAR_teaser.jpg"> </div> The goal of building a benchmark (suite of datasets) is to provide a unified protocol for fair evaluation and thus facilitate the evolution of a specific area. Nonetheless, we point out that existing protocols of action recognition could yield partial evaluations due to several limitations.

To comprehensively probe the effectiveness of spatiotemporal representation learning, we introduce <img src="figs/bear.png" width="14"/>BEAR, a new BEnchmark on video Action Recognition. <img src="figs/bear.png" width="14"/>BEAR is a collection of 18 video datasets grouped into 5 categories (anomaly, gesture, daily, sports, and instructional), covering a diverse set of real-world applications. With <img src="figs/bear.png" width="14"/>BEAR, we thoroughly evaluate 6 common spatiotemporal models pre-trained with both supervised and self-supervised learning, and we report transfer performance via standard finetuning, few-shot finetuning, and unsupervised domain adaptation. Our observations suggest that current state-of-the-art models do not reliably achieve high performance on datasets close to real-world applications, and we hope <img src="figs/bear.png" width="14"/>BEAR can serve as a fair and challenging evaluation benchmark that offers insights for building next-generation spatiotemporal learners.

Evaluation is straightforward: all scripts are included in this codebase, so users only need to download the datasets and run the provided scripts.

## Datasets

The following table summarizes the statistics of the 18 datasets collected in <img src="figs/bear.png" width="14"/>BEAR:

| Dataset | Domain | # Classes | # Clips | Avg Length (sec.) | Training data per class (min, max) | Split ratio | Video source | Video viewpoint |
|---|---|---|---|---|---|---|---|---|
| XD-Violence | Anomaly | 5 | 4135 | 14.94 | (36, 2046) | 3.64:1 | Movies, sports, CCTV, etc. | 3rd, sur. |
| UCF Crime | Anomaly | 12 | 600 | 132.51 | 38 | 3.17:1 | CCTV Camera | 3rd, sur. |
| MUVIM | Anomaly | 2 | 1127 | 68.1 | (296, 604) | 3.96:1 | Self-collected | 3rd, sur. |
| WLASL100 | Gesture | 100 | 1375 | 1.23 | (7, 20) | 5.37:1 | Sign language website | 3rd |
| Jester | Gesture | 27 | 133349 | 3 | (3216, 9592) | 8.02:1 | Self-collected | 3rd |
| UAV Human | Gesture | 155 | 22476 | 5 | (20, 114) | 2:1 | Self-collected | 3rd, dro. |
| CharadesEgo | Daily | 157 | 42107 | 10.93 | (26, 1120) | 3.61:1 | YouTube | 1st |
| Toyota Smarthome | Daily | 31 | 14262 | 1.78 | (23, 2312) | 1.63:1 | Self-collected | 3rd, sur. |
| Mini-HACS | Daily | 200 | 10000 | 2 | 50 | 4:1 | YouTube | 1st, 3rd |
| MPII Cooking | Daily | 67 | 3748 | 153.04 | (5, 217) | 4.69:1 | Self-collected | 3rd |
| Mini-Sports1M | Sports | 487 | 24350 | 10 | 50 | 4:1 | YouTube | 3rd |
| FineGym99 | Sports | 99 | 20389 | 1.65 | (33, 951) | 2.24:1 | Competition videos | 3rd |
| MOD20 | Sports | 20 | 2324 | 7.4 | (73, 107) | 2.29:1 | YouTube and self-collected | 3rd, dro. |
| COIN | Instructional | 180 | 10426 | 37.01 | (10, 63) | 3.22:1 | YouTube | 1st, 3rd |
| MECCANO | Instructional | 61 | 7880 | 2.82 | (2, 1157) | 1.79:1 | Self-collected | 1st |
| INHARD | Instructional | 14 | 5303 | 1.36 | (27, 955) | 2.16:1 | Self-collected | 3rd |
| PETRAW | Instructional | 7 | 9727 | 2.16 | (122, 1262) | 1.5:1 | Self-collected | 1st |
| MISAW | Instructional | 20 | 1551 | 3.8 | (1, 316) | 2.38:1 | Self-collected | 1st |

## Datasets Download and Pre-processing

We provide the downloading and pre-processing pipeline for each dataset here.

The HuggingFace links for part of the BEAR datasets are here:

Mini-Sports1M Jester FineGym MOD20 MPII-Cooking2
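
For reference, each of these HuggingFace datasets can also be fetched programmatically with the `huggingface_hub` package. The sketch below is a minimal example; the repository ID is a placeholder to be replaced with the actual repo linked above.

```python
# Minimal sketch: download one BEAR dataset snapshot from HuggingFace.
# The repo_id is a placeholder; substitute the repository linked above
# (e.g., the Mini-Sports1M or Jester entry).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<org>/<bear-dataset>",  # placeholder repository id
    repo_type="dataset",             # assuming the data are hosted as dataset repos
    local_dir="data/bear_dataset",   # destination folder
)
print(f"Dataset files downloaded to {local_dir}")
```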

## Pre-trained Models

We prepare Kinetics-400 pre-trained models with both supervised and self-supervised pre-training:

The updated HuggingFace links for both self-supervised and supervised pre-training are here:

SSL SUP

The pre-trained models can be downloaded below if needed:

| Model | Supervised (Top-1 Accuracy) | Self-supervised (KNN evaluation) |
|---|---|---|
| TSN | 77.6 Dropbox | 43.1 Dropbox |
| TSM | 76.4 Dropbox | 43.2 Dropbox |
| I3D | 74.2 Dropbox | 51.3 Dropbox |
| NL | 73.9 Dropbox | 50.7 Dropbox |
| TimeSformer | 75.8 Dropbox | 50.3 Dropbox |
| VideoSwin | 77.6 Dropbox | 51.1 Dropbox |
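
After downloading, a checkpoint can be sanity-checked with plain PyTorch before wiring it into a finetuning config. The file name below is a placeholder for whichever model you grabbed, and the `state_dict` key is an assumption based on the usual MMAction2 checkpoint layout.

```python
# Minimal sketch: inspect a downloaded pre-trained checkpoint.
# The file name is a placeholder; the "state_dict" key assumes an
# MMAction2-style checkpoint layout and may differ for other formats.
import torch

ckpt = torch.load("tsn_k400_pretrained.pth", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # fall back to the raw dict if there is no wrapper
for name, value in list(state_dict.items())[:10]:
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(name, shape)
```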

## Benchmark

Based on models pre-trained on Kinetics-400, we provide 4 evaluation paradigms in <img src="figs/bear.png" width="14"/>BEAR:

BEAR-Standard<br> BEAR-Fewshot<br> BEAR-Zeroshot<br> BEAR-UDA<br>

### Standard Finetuning

We build our standard finetuning on the popular video understanding toolbox MMAction2.

We provide specific training steps here.
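
As a quick sanity check after finetuning, a model can be loaded and run on a single clip through MMAction2's Python API. This is a minimal sketch in which the config, checkpoint, and video paths are placeholders, and the exact return format of `inference_recognizer` depends on your MMAction2 version.

```python
# Minimal sketch: run a finetuned recognizer on one clip via MMAction2.
# All paths are placeholders; the return format of inference_recognizer
# differs between MMAction2 0.x and 1.x.
from mmaction.apis import inference_recognizer, init_recognizer

config = "configs/recognition/tsn/your_bear_finetune_config.py"  # placeholder
checkpoint = "work_dirs/your_bear_finetune/latest.pth"           # placeholder
model = init_recognizer(config, checkpoint, device="cuda:0")

result = inference_recognizer(model, "demo/example_clip.mp4")    # placeholder clip
print(result)  # class scores / predictions, format depends on the MMAction2 version
```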

The finetuning results of supervised pre-training are shown below:

| Dataset | TSN | TSM | I3D | NL | TimeSformer | VideoSwin |
|---|---|---|---|---|---|---|
| XD-Violence | 85.54 | 82.96 | 79.93 | 79.91 | 82.51 | 82.40 |
| UCF-Crime | 35.42 | 42.36 | 31.94 | 34.03 | 36.11 | 34.72 |
| MUVIM | 79.30 | 100 | 97.80 | 98.68 | 94.71 | 100 |
| WLASL | 29.63 | 43.98 | 49.07 | 52.31 | 37.96 | 45.37 |
| Jester | 86.31 | 95.21 | 92.99 | 93.49 | 93.42 | 94.27 |
| UAV-Human | 27.89 | 38.84 | 33.49 | 33.03 | 28.93 | 38.66 |
| CharadesEGO | 8.26 | 8.11 | 6.13 | 6.42 | 8.58 | 8.55 |
| Toyota Smarthome | 74.73 | 82.22 | 79.51 | 76.86 | 69.21 | 79.88 |
| Mini-HACS | 84.69 | 80.87 | 77.74 | 79.51 | 79.81 | 84.94 |
| MPII Cooking | 38.39 | 46.74 | 48.71 | 42.19 | 40.97 | 46.59 |
| Mini-Sports1M | 54.11 | 50.06 | 46.90 | 46.16 | 51.79 | 55.34 |
| FineGym | 63.73 | 80.95 | 72.00 | 71.21 | 63.92 | 65.02 |
| MOD20 | 98.30 | 96.75 | 96.61 | 96.18 | 94.06 | 92.64 |
| COIN | 81.15 | 78.49 | 73.79 | 74.30 | 82.99 | 76.27 |
| MECCANO | 41.06 | 39.28 | 36.88 | 36.13 | 40.95 | 38.89 |
| InHARD | 84.39 | 88.08 | 82.06 | 86.31 | 85.16 | 87.60 |
| PETRAW | 94.30 | 95.72 | 94.84 | 94.54 | 94.30 | 96.43 |
| MISAW | 61.44 | 75.16 | 68.19 | 64.27 | 71.46 | 69.06 |

The finetuning results of self-supervised pre-training are shown below:

| Dataset | TSN | TSM | I3D | NL | TimeSformer | VideoSwin |
|---|---|---|---|---|---|---|
| XD-Violence | 80.49 | 81.73 | 80.38 | 80.94 | 77.47 | 77.91 |
| UCF-Crime | 37.50 | 35.42 | 34.03 | 34.72 | 36.11 | 34.03 |
| MUVIM | 99.12 | 100 | 66.96 | 66.96 | 99.12 | 100 |
| WLASL | 27.01 | 27.78 | 29.17 | 30.56 | 25.56 | 28.24 |
| Jester | 83.22 | 95.32 | 87.23 | 93.89 | 90.33 | 90.18 |
| UAV-Human | 15.70 | 30.75 | 31.95 | 26.28 | 21.02 | 35.12 |
| CharadesEGO | 6.29 | 6.59 | 6.24 | 6.31 | 7.59 | 7.65 |
| Toyota Smarthome | 68.71 | 81.34 | 77.82 | 76.16 | 61.64 | 80.18 |
| Mini-HACS | 64.60 | 63.24 | 70.24 | 60.57 | 73.92 | 75.58 |
| MPII Cooking | 34.45 | 50.08 | 42.79 | 40.36 | 35.81 | 47.19 |
| Mini-Sports1M | 43.02 | 43.59 | 46.28 | 45.56 | 44.60 | 47.60 |
| FineGym | 54.62 | 75.87 | 69.62 | 68.79 | 47.60 | 58.94 |
| MOD20 | 91.23 | 92.08 | 91.94 | 92.08 | 90.81 | 92.36 |
| COIN | 61.48 | 64.53 | 71.57 | 72.78 | 67.64 | 68.78 |
| MECCANO | 32.34 | 35.10 | 34.86 | 33.62 | 33.30 | 37.80 |
| InHARD | 75.63 | 87.66 | 82.54 | 80.81 | 71.28 | 80.10 |
| PETRAW | 93.18 | 95.51 | 95.02 | 94.38 | 85.56 | 91.46 |
| MISAW | 59.04 | 73.64 | 70.37 | 64.27 | 60.78 | 68.85 |

### Few-shot Finetuning

Please follow the instructions here to perform few-shot evaluation on <img src="figs/bear.png" width="14"/>BEAR.
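
Conceptually, the few-shot setting restricts finetuning to K labeled clips per class. The sketch below shows one way to carve such a subset out of a plain-text annotation list; the file paths and the "video_path label" line format are assumptions for illustration, not necessarily the layout used by the BEAR few-shot scripts.

```python
# Minimal sketch: carve a K-shot training subset out of a plain-text
# annotation list. The "video_path label" format and file paths are
# assumptions for illustration, not the exact layout used by the BEAR scripts.
import random
from collections import defaultdict

def sample_k_shot(ann_file: str, k: int, out_file: str, seed: int = 0) -> None:
    per_class = defaultdict(list)
    with open(ann_file) as f:
        for line in f:
            if not line.strip():
                continue
            _, label = line.strip().rsplit(" ", 1)
            per_class[label].append(line)

    rng = random.Random(seed)
    with open(out_file, "w") as f:
        for label, lines in per_class.items():
            for line in rng.sample(lines, min(k, len(lines))):
                f.write(line if line.endswith("\n") else line + "\n")

# Hypothetical usage: build a 5-shot split from a full training list.
sample_k_shot("annotations/train_full.txt", k=5, out_file="annotations/train_5shot.txt")
```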

The few-shot results are shown below:

<div align="center"> <img src="figs/fewshot.png"> </div>

### Zero-shot Evaluation

We build our zero-shot evaluation on the popular CLIP and ActionCLIP. Follow the instructions here to evaluate zero-shot performance on <img src="figs/bear.png" width="14"/>BEAR.
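
For intuition, a CLIP-style zero-shot baseline embeds sampled video frames and class-name prompts in a shared space and predicts the class with the highest similarity. Below is a minimal sketch using the OpenAI `clip` package; the prompt template and uniform frame sampling are simplifications rather than the exact ActionCLIP pipeline.

```python
# Minimal sketch of a frame-averaged CLIP zero-shot baseline.
# This is a simplified stand-in for the CLIP/ActionCLIP pipeline used in BEAR;
# the prompt template and uniform frame sampling are assumptions.
import clip
import cv2
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def zero_shot_predict(video_path: str, class_names: list, num_frames: int = 8) -> str:
    # Uniformly sample frames from the video.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in torch.linspace(0, max(total - 1, 0), num_frames).long().tolist():
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(preprocess(Image.fromarray(rgb)))
    cap.release()

    images = torch.stack(frames).to(device)
    texts = clip.tokenize([f"a video of a person {c}" for c in class_names]).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(images).mean(dim=0, keepdim=True)
        text_feat = model.encode_text(texts)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    scores = (image_feat @ text_feat.T).squeeze(0)
    return class_names[int(scores.argmax())]

# Hypothetical usage with made-up class names:
# print(zero_shot_predict("example.mp4", ["running", "jumping", "cycling"]))
```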

The zero-shot results are shown below:

<div align="center"> <img src="figs/zeroshot.jpeg"> </div>

### Domain Adaptation

Please follow the instructions here to perform UDA evaluation on <img src="figs/bear.png" width="14"/>BEAR.

The UDA baseline results are shown below. "Source only" trains on the labeled source domain and evaluates directly on the target, while "Supervised target" trains with target labels and serves as a reference upper bound:

| Setting | T>M | M>T | MS>MOD | MOD>MS | U>X | X>U | P>MS | Jester | IT>IL | IT>IR | IL>IR | IL>IT | IR>IT | IR>IL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Source only | 5.32 | 7.36 | 18.25 | 12.76 | 54.20 | 33.33 | 61.45 | 68.73 | 4.18 | 30.39 | 19.01 | 22.65 | 24.14 | 12.42 |
| Supervised target | 70.21 | 65.13 | 34.08 | 35.52 | 75.06 | 63.89 | 94.40 | 97.61 | 26.00 | 83.55 | 83.55 | 85.52 | 85.52 | 26.00 |