Pretext-Contrastive Learning: Toward Good Practices in Self-supervised Video Representation Learning
Currently, only PCL (VCP) is supported. The code for VCOP and 3DRotNet is still being refactored.
One sentence to conclude this paper: if you are developing novel pretext-task-based methods for video self-supervised learning, do not hesitate to combine them with a contrastive learning loss; it is simple to use and can boost performance.
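As a rough illustration of the idea (a minimal sketch under assumed tensor shapes, not the exact code in this repository), the joint objective simply adds an InfoNCE-style contrastive term to whatever pretext-task loss you already have:

```python
import torch
import torch.nn.functional as F

def pcl_loss(pretext_logits, pretext_labels,
             z_anchor, z_positive, z_negatives,
             temperature=0.1, weight=1.0):
    """Pretext-task loss plus an InfoNCE-style contrastive loss.

    `pretext_logits` / `pretext_labels`: predictions and targets of any
    pretext task (e.g., clip-operation classification in VCP).
    `z_*`: L2-normalized clip embeddings, (B, D) for anchor/positive and
    (K, D) for negatives; `weight` and `temperature` are hypothetical
    hyper-parameter names used only for this sketch.
    """
    # Standard pretext-task loss (cross-entropy here).
    l_pretext = F.cross_entropy(pretext_logits, pretext_labels)

    # Contrastive loss: pull each anchor toward its positive,
    # push it away from the negatives.
    pos = (z_anchor * z_positive).sum(dim=-1, keepdim=True) / temperature  # (B, 1)
    neg = z_anchor @ z_negatives.t() / temperature                         # (B, K)
    logits = torch.cat([pos, neg], dim=1)
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    l_contrast = F.cross_entropy(logits, targets)

    return l_pretext + weight * l_contrast
```

The pretext task stays untouched; the contrastive term only adds one weighting hyper-parameter to tune.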
Highlights
- This paper presents a joint-optimization framework for self-supervised video representation learning, which can achieve high performance without proposing new pretext tasks;
- The effectiveness of our proposal is validated by 3 pretext task baselines and 4 different network backbones;
- The proposal is flexible enough to be applied to other methods.
Requirements
This is the experimental environment I used when preparing this demo code.
- Ubuntu 18.04.4 LTS
- conda 4.8.4
- PyTorch 1.4.0
[Warning] In other projects we have found that a different PyTorch version (e.g., 1.7.0) can produce completely different results.
- python 3.8.3
- cuda 10.1
- accimage
Usage
Data preparation
I used resized RGB frames from this repo. Frames of the videos in the UCF101 and HMDB51 datasets can be downloaded directly, with no decoding required.
Tip: there is a folder called `TSP_Flows` inside the `v_LongJump_j18_c03` folder of the UCF101 dataset, and you may run into problems if you do not handle it. One solution is simply to delete this folder.
The expected folder structure is `path/to/dataset/jpegs_256/video_id/frames.jpg`.
Then edit `datasets/ucf101.py` and `datasets/hmdb51.py` to specify the dataset path: change `*_dataset_path` at line 19.
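For reference, the assignment at line 19 might look like the following (the exact variable name and value are placeholders; check the file itself):

```python
# datasets/ucf101.py, line 19 (the variable name here is illustrative)
ucf101_dataset_path = '/path/to/dataset'  # must contain jpegs_256/<video_id>/*.jpg
```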
Training the self-supervised learning part with our PCL
python train_vcp_contrast.py
Default settings are
- Method: PCL (VCP)
- Backbone: R3D
- Modality: Res (residual frames; see the sketch below)
- Augmentation: RandomCrop, ColorJitter, RandomGrayScale, GaussianBlur, RandomHorizontalFlip
- Dataset: UCF101
- Split: 1
These settings are kept for the following steps, so there is no need to specify `--model=r3d --modality=res` again.
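Here, `Res` means residual-frame input, i.e., the difference between consecutive RGB frames (see the residual-frame papers cited below). A minimal sketch of the idea, assuming a `(C, T, H, W)` clip tensor:

```python
import torch

def to_residual_clip(clip: torch.Tensor) -> torch.Tensor:
    """Convert an RGB clip of shape (C, T, H, W) into residual frames by
    subtracting consecutive frames. Illustrative only; the repository's
    data pipeline handles this internally."""
    return clip[:, 1:] - clip[:, :-1]  # shape (C, T-1, H, W)
```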
Training takes about 33 hours on a single V100 in our experimental environment.
Models will be saved to `./logs/exp_name`, where `exp_name` is generated automatically from the corresponding settings.
Evaluation using video retrieval
python retrieve_clips.py --ckpt=/path/to/ssl/best_model --dataset=ucf101
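For reference, retrieval evaluation of this kind typically extracts one feature per clip and counts a test clip as correct when any of its k nearest training clips shares its class; a minimal sketch of that logic (an assumption, not necessarily exactly what `retrieve_clips.py` does):

```python
import torch
import torch.nn.functional as F

def topk_retrieval_accuracy(test_feats, test_labels, train_feats, train_labels,
                            ks=(1, 5, 10, 20, 50)):
    """Top-k retrieval accuracy: a test clip counts as correct if any of its
    k nearest training clips (by cosine similarity) shares its class label."""
    test_feats = F.normalize(test_feats, dim=1)
    train_feats = F.normalize(train_feats, dim=1)
    sims = test_feats @ train_feats.t()              # (N_test, N_train)
    nn_idx = sims.topk(max(ks), dim=1).indices       # nearest training clips
    nn_labels = train_labels[nn_idx]                 # (N_test, max(ks))
    hits = nn_labels == test_labels.unsqueeze(1)
    return {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}
```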
Fine-tuning on video recognition
python ft_classify.py --ckpt=/path/to/ssl/best_model --dataset=ucf101
The testing process will automatically run after training is done.
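For reference, loading the self-supervised weights into a classifier typically looks like the following (a sketch using torchvision's `r3d_18`; the actual checkpoint layout and backbone class in this repository may differ):

```python
import torch
from torchvision.models.video import r3d_18

# Illustrative only: ft_classify.py performs the equivalent steps itself.
model = r3d_18(num_classes=101)  # 101 classes for UCF101
ckpt = torch.load('/path/to/ssl/best_model', map_location='cpu')
state = ckpt.get('state_dict', ckpt)  # checkpoint layout is an assumption

# strict=False skips pretext-task heads and keeps the new classifier random.
missing, unexpected = model.load_state_dict(state, strict=False)
print('ignored checkpoint keys:', unexpected)
```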
Results
Video retrieval
Our PCL outperforms a set of existing methods by a large margin. Here we list top-k retrieval accuracy (%) using R3D-18 as the network backbone; for more results, please refer to our paper.
Method | Backbone | Top1 | Top5 | Top10 | Top20 | Top50 |
---|---|---|---|---|---|---|
Random | R3D-18 | 15.3 | 25.1 | 32.1 | 40.8 | 53.7 |
3DRotNet | R3D-18 | 14.2 | 25.2 | 33.5 | 43.7 | 59.5 |
VCP | R3D-18 | 22.1 | 33.8 | 42.0 | 51.3 | 64.7 |
RTT | R3D-18 | 26.1 | 48.5 | 59.1 | 69.6 | 82.8 |
PacePred | R3D-18 | 23.8 | 38.1 | 46.4 | 56.6 | 69.8 |
IIC | R3D-18 | 36.8 | 54.1 | 63.1 | 72.0 | 83.3 |
PCL (3DRotNet) | R3D-18 | 33.7 | 53.5 | 64.1 | 73.4 | 85.0 |
PCL (VCP) | R3D-18 | 55.1 | 71.2 | 78.9 | 85.5 | 92.3 |
Video recognition
The table lists action recognition accuracy (%) on the UCF101 and HMDB51 datasets; the other results are taken from the corresponding papers. Because this is the most widely used metric, we show results based on 4 different network backbones.
Methods in this table do not include those that use additional data modalities such as sound or text.
Method | Date | Pre-train | Clip Size | Network | UCF101 | HMDB51 |
---|---|---|---|---|---|---|
OPN | 2017 | UCF | 227x227 | VGG | 59.6 | 23.8 |
DPC | 2019 | K400 | 16x224x224 | R3D-34 | 75.7 | 35.7 |
CBT | 2019 | K600+ | 16x112x112 | S3D | 79.5 | 44.6 |
SpeedNet | 2020 | K400 | 64x224x224 | S3D-G | 81.1 | 48.8 |
MemDPC | 2020 | K400 | 40x224x224 | R-2D3D | 78.1 | 41.2 |
3DRotNet | 2018 | K400 | 16x112x112 | R3D-18 | 62.9 | 33.7 |
ST-Puzzle | 2019 | K400 | 16x112x112 | R3D-18 | 65.8 | 33.7 |
DPC | 2019 | K400 | 16x128x128 | R3D-18 | 68.2 | 34.5 |
RTT | 2020 | UCF | 16x112x112 | R3D-18 | 77.3 | 47.5 |
RTT | 2020 | K400 | 16x112x112 | R3D-18 | 79.3 | 49.8 |
PCL (3DRotNet) | 2021 | UCF | 16x112x112 | R3D-18 | 82.8 | 47.2 |
PCL (VCP) | 2021 | UCF | 16x112x112 | R3D-18 | 83.4 | 48.8 |
PCL (VCP) | 2021 | K400 | 16x112x112 | R3D-18 | 85.6 | 48.0 |
VCOP | 2019 | UCF | 16x112x112 | R3D | 64.9 | 29.5 |
VCP | 2020 | UCF | 16x112x112 | R3D | 66.0 | 31.5 |
PRP | 2020 | UCF | 16x112x112 | R3D | 66.5 | 29.7 |
IIC | 2020 | UCF | 16x112x112 | R3D | 74.4 | 38.3 |
PCL (VCOP) | 2021 | UCF | 16x112x112 | R3D | 78.2 | 40.5 |
PCL (VCP) | 2021 | UCF | 16x112x112 | R3D | 81.1 | 45.0 |
VCOP | 2019 | UCF | 16x112x112 | C3D | 65.6 | 28.4 |
VCP | 2020 | UCF | 16x112x112 | C3D | 68.5 | 32.5 |
PRP | 2020 | UCF | 16x112x112 | C3D | 69.1 | 34.5 |
RTT | 2020 | K400 | 16x112x112 | C3D | 69.9 | 39.6 |
PCL (VCOP) | 2021 | UCF | 16x112x112 | C3D | 79.8 | 41.8 |
PCL (VCP) | 2021 | UCF | 16x112x112 | C3D | 81.4 | 45.2 |
VCOP | 2019 | UCF | 16x112x112 | R(2+1)D | 72.4 | 30.9 |
VCP | 2020 | UCF | 16x112x112 | R(2+1)D | 66.3 | 32.2 |
PRP | 2020 | UCF | 16x112x112 | R(2+1)D | 72.1 | 35.0 |
RTT | 2020 | UCF | 16x112x112 | R(2+1)D | 81.6 | 46.4 |
PacePred | 2020 | UCF | 16x112x112 | R(2+1)D | 75.9 | 35.9 |
PacePred | 2020 | K400 | 16x112x112 | R(2+1)D | 77.1 | 36.6 |
PCL (VCOP) | 2021 | UCF | 16x112x112 | R(2+1)D | 79.2 | 41.6 |
PCL (VCP) | 2021 | UCF | 16x112x112 | R(2+1)D | 79.9 | 45.6 |
PCL (VCP) | 2021 | K400 | 16x112x112 | R(2+1)D | 85.7 | 47.4 |
Citation
If you find our work helpful for your research, please consider citing the paper:
@article{tao2021pcl,
title={Pretext-Contrastive Learning: Toward Good Practices in Self-supervised Video Representation Leaning},
author={Tao, Li and Wang, Xueting and Yamasaki, Toshihiko},
journal={arXiv preprint arXiv:2010.15464},
year={2021}
}
If you find the residual input helpful for video-related tasks, please consider citing these papers:
@article{tao2020rethinking,
title={Rethinking Motion Representation: Residual Frames with 3D ConvNets for Better Action Recognition},
author={Tao, Li and Wang, Xueting and Yamasaki, Toshihiko},
journal={arXiv preprint arXiv:2001.05661},
year={2020}
}
@inproceedings{tao2020motion,
title={Motion Representation Using Residual Frames with 3D CNN},
author={Tao, Li and Wang, Xueting and Yamasaki, Toshihiko},
booktitle={2020 IEEE International Conference on Image Processing (ICIP)},
pages={1786--1790},
year={2020},
organization={IEEE}
}