
Pretext-Contrastive Learning: Toward Good Practices in Self-supervised Video Representation Leaning

arXiv paper

Currently, only PCL (VCP) is supported. The code for VCOP and 3DRotNet is still being refactored.

To summarize this paper in one sentence: if you are developing a novel pretext-task-based method for video self-supervised learning, do not hesitate to combine it with a contrastive learning loss, which is simple to add and can boost performance.
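In practice, this means jointly optimizing the pretext-task loss and a contrastive (InfoNCE-style) loss on the same backbone. The snippet below is a minimal PyTorch sketch of that idea, not the exact code in this repository; the module names (backbone, pretext_head, proj_head), the loss weight lam, and the temperature tau are illustrative assumptions.

import torch
import torch.nn.functional as F

def pcl_loss(backbone, pretext_head, proj_head, clips_a, clips_b, pretext_labels, lam=1.0, tau=0.5):
    # Illustrative joint objective: pretext-task loss + contrastive loss (names are hypothetical).
    feat_a = backbone(clips_a)   # features of one augmented view of each video
    feat_b = backbone(clips_b)   # features of another view of the same videos

    # Pretext-task loss, e.g. classifying which clip operation was applied (as in VCP).
    pretext_loss = F.cross_entropy(pretext_head(feat_a), pretext_labels)

    # InfoNCE contrastive loss: two views of the same video are positives,
    # all other videos in the batch serve as negatives.
    z_a = F.normalize(proj_head(feat_a), dim=1)
    z_b = F.normalize(proj_head(feat_b), dim=1)
    logits = z_a @ z_b.t() / tau                      # (N, N) similarity matrix
    targets = torch.arange(z_a.size(0), device=logits.device)
    contrast_loss = F.cross_entropy(logits, targets)

    return pretext_loss + lam * contrast_loss

The key design point is that the two losses share the same backbone features, so the pretext task and contrastive learning regularize each other.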

Highlights

  1. This paper presents a joint optimization framework for self-supervised video representation learning, which can achieve high performance without proposing new pretext tasks;
  2. The effectiveness of our proposal is validated with 3 pretext-task baselines and 4 different network backbones;
  3. The proposal is flexible enough to be applied to other methods.

Requirements

This is the experimental environment used when preparing this demo code.

[Warning] In other projects, we have found that a different PyTorch version (e.g., 1.7.0) can produce totally different results.

Usage

Data preparation

I used resized RGB frames from this repo. Frames of videos in the UCF101 and HMDB51 datasets can be downloaded directly, without decoding the videos.

Tip: in the UCF101 dataset there is a folder called TSP_Flows inside the v_LongJump_j18_c03 folder, which can cause problems if it is not handled. One solution is simply to delete this folder.

The folder structure should look like path/to/dataset/jpegs_256/video_id/frames.jpg.
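If you prefer to script the cleanup of the stray TSP_Flows folder mentioned in the tip above, something like the following would do it (the dataset root is a placeholder):

import shutil

# Remove the extra TSP_Flows folder inside the v_LongJump_j18_c03 video folder (UCF101).
shutil.rmtree('/path/to/dataset/jpegs_256/v_LongJump_j18_c03/TSP_Flows', ignore_errors=True)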

Then, edit datasets/ucf101.py and datasets/hmdb51.py to specify the dataset paths. Please change *_dataset_path at line #19.
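For example, after editing, the relevant line in datasets/ucf101.py might look like the following (the exact variable name follows the *_dataset_path pattern; the path is a placeholder for your own location):

ucf101_dataset_path = '/path/to/dataset/jpegs_256'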

Training the self-supervised learning part with our PCL

python train_vcp_contrast.py

Default settings include --model=r3d and --modality=res.

These settings are also kept fixed for the following steps, so there is no need to specify --model=r3d --modality=res explicitly.
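For context, --modality=res refers to the residual-frame input from our earlier papers listed in the Citation section, i.e., differences of consecutive RGB frames rather than the raw frames themselves. Below is a minimal sketch of how such a residual clip can be obtained; the function name and tensor layout are illustrative, not the exact code in this repository.

import torch

def to_residual_clip(clip):
    # clip: (C, T, H, W) tensor of RGB frames; returns (C, T-1, H, W) frame differences.
    return clip[:, 1:] - clip[:, :-1]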

Training takes around 33 hours on one V100 GPU in our experimental environment.

Models will be saved to ./logs/exp_name, where exp_name is generated automatically from the corresponding settings.

Evaluation using video retrieval

python retrieve_clips.py --ckpt=/path/to/ssl/best_model --dataset=ucf101
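Retrieval here means using each clip in the test split to query the training split by nearest neighbors; a query counts as correct at top-k if any of its k nearest training clips has the same action class. The snippet below is a minimal NumPy sketch of that metric, not the exact implementation in retrieve_clips.py; the array names are assumptions.

import numpy as np

def topk_retrieval(test_feats, test_labels, train_feats, train_labels, k=5):
    # L2-normalize so the dot product equals cosine similarity.
    test_feats = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    train_feats = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sims = test_feats @ train_feats.T                 # (N_test, N_train) similarities
    topk = np.argsort(-sims, axis=1)[:, :k]           # indices of the k nearest training clips
    hits = (train_labels[topk] == test_labels[:, None]).any(axis=1)
    return hits.mean()                                # fraction of queries with a correct match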

Fine-tuning on video recognition

python ft_classify.py --ckpt=/path/to/ssl/best_model --dataset=ucf101

The testing process will automatically run after training is done.

Results

Video retrieval

Our PCL outperforms a set of methods by a large margin. Here we list results using R3D-18 as the network backbone. For more results, please refer to our paper.

| Methods | Backbone | Top1 | Top5 | Top10 | Top20 | Top50 |
|---|---|---|---|---|---|---|
| Random | R3D-18 | 15.3 | 25.1 | 32.1 | 40.8 | 53.7 |
| 3DRotNet | R3D-18 | 14.2 | 25.2 | 33.5 | 43.7 | 59.5 |
| VCP | R3D-18 | 22.1 | 33.8 | 42.0 | 51.3 | 64.7 |
| RTT | R3D-18 | 26.1 | 48.5 | 59.1 | 69.6 | 82.8 |
| PacePred | R3D-18 | 23.8 | 38.1 | 46.4 | 56.6 | 69.8 |
| IIC | R3D-18 | 36.8 | 54.1 | 63.1 | 72.0 | 83.3 |
| PCL (3DRotNet) | R3D-18 | 33.7 | 53.5 | 64.1 | 73.4 | 85.0 |
| PCL (VCP) | R3D-18 | 55.1 | 71.2 | 78.9 | 85.5 | 92.3 |

Video recognition

The table below lists recognition results on the UCF101 and HMDB51 datasets. Other results are taken from the corresponding papers. Because this is the most widely used metric, we show results based on 4 different network backbones.

Methods in this table do not include those using other data modalities such as sound and text.

| Method | Date | Pre-train | Clip Size | Network | UCF | HMDB |
|---|---|---|---|---|---|---|
| OPN | 2017 | UCF | 227x227 | VGG | 59.6 | 23.8 |
| DPC | 2019 | K400 | 16x224x224 | R3D-34 | 75.7 | 35.7 |
| CBT | 2019 | K600+ | 16x112x112 | S3D | 79.5 | 44.6 |
| SpeedNet | 2020 | K400 | 64x224x224 | S3D-G | 81.1 | 48.8 |
| MemDPC | 2020 | K400 | 40x224x224 | R-2D3D | 78.1 | 41.2 |
| 3D-RotNet | 2018 | K400 | 16x112x112 | R3D-18 | 62.9 | 33.7 |
| ST-Puzzle | 2019 | K400 | 16x112x112 | R3D-18 | 65.8 | 33.7 |
| DPC | 2019 | K400 | 16x128x128 | R3D-18 | 68.2 | 34.5 |
| RTT | 2020 | UCF | 16x112x112 | R3D-18 | 77.3 | 47.5 |
| RTT | 2020 | K400 | 16x112x112 | R3D-18 | 79.3 | 49.8 |
| PCL (3DRotNet) | | UCF | 16x112x112 | R3D-18 | 82.8 | 47.2 |
| PCL (VCP) | | UCF | 16x112x112 | R3D-18 | 83.4 | 48.8 |
| PCL (VCP) | | K400 | 16x112x112 | R3D-18 | 85.6 | 48.0 |
| VCOP | 2019 | UCF | 16x112x112 | R3D | 64.9 | 29.5 |
| VCP | 2020 | UCF | 16x112x112 | R3D | 66.0 | 31.5 |
| PRP | 2020 | UCF | 16x112x112 | R3D | 66.5 | 29.7 |
| IIC | 2020 | UCF | 16x112x112 | R3D | 74.4 | 38.3 |
| PCL (VCOP) | | UCF | 16x112x112 | R3D | 78.2 | 40.5 |
| PCL (VCP) | | UCF | 16x112x112 | R3D | 81.1 | 45.0 |
| VCOP | 2019 | UCF | 16x112x112 | C3D | 65.6 | 28.4 |
| VCP | 2020 | UCF | 16x112x112 | C3D | 68.5 | 32.5 |
| PRP | 2020 | UCF | 16x112x112 | C3D | 69.1 | 34.5 |
| RTT | 2020 | K400 | 16x112x112 | C3D | 69.9 | 39.6 |
| PCL (VCOP) | | UCF | 16x112x112 | C3D | 79.8 | 41.8 |
| PCL (VCP) | | UCF | 16x112x112 | C3D | 81.4 | 45.2 |
| VCOP | 2019 | UCF | 16x112x112 | R(2+1)D | 72.4 | 30.9 |
| VCP | 2020 | UCF | 16x112x112 | R(2+1)D | 66.3 | 32.2 |
| PRP | 2020 | UCF | 16x112x112 | R(2+1)D | 72.1 | 35.0 |
| RTT | 2020 | UCF | 16x112x112 | R(2+1)D | 81.6 | 46.4 |
| PacePred | 2020 | UCF | 16x112x112 | R(2+1)D | 75.9 | 35.9 |
| PacePred | 2020 | K400 | 16x112x112 | R(2+1)D | 77.1 | 36.6 |
| PCL (VCOP) | | UCF | 16x112x112 | R(2+1)D | 79.2 | 41.6 |
| PCL (VCP) | | UCF | 16x112x112 | R(2+1)D | 79.9 | 45.6 |
| PCL (VCP) | | K400 | 16x112x112 | R(2+1)D | 85.7 | 47.4 |

Citation

If you find our work helpful for your research, please consider citing the paper:

@article{tao2021pcl,
    title={Pretext-Contrastive Learning: Toward Good Practices in Self-supervised Video Representation Leaning},
    author={Tao, Li and Wang, Xueting and Yamasaki, Toshihiko},
    journal={arXiv preprint arXiv:2010.15464},
    year={2021}
}

If you find the residual-frame input helpful for video-related tasks, please consider citing the following papers:

@article{tao2020rethinking,
  title={Rethinking Motion Representation: Residual Frames with 3D ConvNets for Better Action Recognition},
  author={Tao, Li and Wang, Xueting and Yamasaki, Toshihiko},
  journal={arXiv preprint arXiv:2001.05661},
  year={2020}
}

@inproceedings{tao2020motion,
  title={Motion Representation Using Residual Frames with 3D CNN},
  author={Tao, Li and Wang, Xueting and Yamasaki, Toshihiko},
  booktitle={2020 IEEE International Conference on Image Processing (ICIP)},
  pages={1786--1790},
  year={2020},
  organization={IEEE}
}

Acknowledgements

Part of this code is reused from IIC and VCOP.