Self-Supervised Learning by Cross-Modal Audio-Video Clustering

This repository holds the pretrained models for the Cross-Modal Deep Clustering (XDC) method, presented as a spotlight at NeurIPS 2020.

Self-Supervised Learning by Cross-Modal Audio-Video Clustering. Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, Du Tran. In NeurIPS, 2020.

Load Pretrained Models

We provide the following pretrained R(2+1)D-18 video models. For each model, we report the top-1 video-level accuracy on UCF101 and HMDB51 after full finetuning, averaged over all splits of each dataset.

Pretraining Name	Description	UCF101	HMDB51	Weights
`r2plus1d_18_xdc_ig65m_kinetics`	XDC pretrained on IG-Kinetics	95.5	68.9	[PyTorch] [Caffe2]
`r2plus1d_18_xdc_ig65m_random`	XDC pretrained on IG-Random	94.6	66.5	[PyTorch] [Caffe2]
`r2plus1d_18_xdc_audioset`	XDC pretrained on AudioSet	93.0	63.7	[PyTorch] [Caffe2]
`r2plus1d_18_fs_kinetics`	fully-supervised pretraining on Kinetics	94.2	65.1	[PyTorch] [Caffe2]
`r2plus1d_18_fs_imagenet`	fully-supervised pretraining on ImageNet	84.0	48.1	[PyTorch] [Caffe2]
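
As a convenience, the table above can be captured as a small lookup. The sketch below is not part of this repository: the names `XDC_RESULTS`, `check_pretraining_name`, and `best_pretraining` are hypothetical helpers, and the numbers are copied from the table. It may help to catch a misspelled pretraining name before any weights are downloaded.

```python
# Hypothetical helper, not part of the XDC repository.
# Values copied from the table above: (UCF101, HMDB51) top-1 accuracy in percent.
XDC_RESULTS = {
    "r2plus1d_18_xdc_ig65m_kinetics": (95.5, 68.9),
    "r2plus1d_18_xdc_ig65m_random": (94.6, 66.5),
    "r2plus1d_18_xdc_audioset": (93.0, 63.7),
    "r2plus1d_18_fs_kinetics": (94.2, 65.1),
    "r2plus1d_18_fs_imagenet": (84.0, 48.1),
}

_COLUMNS = {"UCF101": 0, "HMDB51": 1}


def check_pretraining_name(name):
    """Return `name` unchanged if it appears in the table, else raise ValueError."""
    if name not in XDC_RESULTS:
        valid = ", ".join(sorted(XDC_RESULTS))
        raise ValueError(f"Unknown pretraining name {name!r}; expected one of: {valid}")
    return name


def best_pretraining(dataset="UCF101"):
    """Return the pretraining name with the highest reported accuracy on `dataset`."""
    column = _COLUMNS[dataset]
    return max(XDC_RESULTS, key=lambda name: XDC_RESULTS[name][column])
```

For both datasets in the table, `best_pretraining` returns `r2plus1d_18_xdc_ig65m_kinetics`.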

There are two ways to load the XDC pretrained models in PyTorch: (1) via PyTorch Hub or (2) via source code.

Via PyTorch Hub (Recommended)

:warning: [Known Issue] Loading XDC models via PyTorch Hub breaks with torchvision v0.13 or higher due to backward-incompatible changes introduced in torchvision. Please make sure to use torchvision v0.12 or earlier when loading XDC models via the torch.hub.load() API. Loading models via source code still works as expected.
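
One way to guard against this issue is to check the installed torchvision version before calling torch.hub.load(). The following is a minimal sketch, assuming version strings of the usual form such as `0.12.0` or `0.13.1+cu117`; `parse_major_minor` and `hub_loading_supported` are hypothetical names, not functions provided by this repository or by torchvision.

```python
import re
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '0.13.1+cu117'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def hub_loading_supported(torchvision_version: str) -> bool:
    """Return True if the version is earlier than 0.13 (see the known issue above)."""
    return parse_major_minor(torchvision_version) < (0, 13)


# Example usage (requires torchvision to be installed):
#   import torchvision
#   if not hub_loading_supported(torchvision.__version__):
#       raise RuntimeError("Use torchvision v0.12 or earlier, or load XDC via source code.")
```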

You can load all our pretrained models using the torch.hub.load() API.

import torch

model = torch.hub.load('HumamAlwassel/XDC', 'xdc_video_encoder', 
                        pretraining='r2plus1d_18_xdc_ig65m_kinetics',
                        num_classes=42)

Use the pretraining parameter to specify which pretrained model from the table above to load (the default is r2plus1d_18_xdc_ig65m_kinetics). Pretrained weights are loaded for all layers except the FC classifier layer, which has size 512 x num_classes and is randomly initialized. Set the keyword argument num_classes based on your application (the default is 400). Run print(torch.hub.help('HumamAlwassel/XDC', 'xdc_video_encoder')) for the model documentation. See the PyTorch Hub documentation to learn more.
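
To get a sense of how many parameters start from random initialization, the size of the FC layer can be worked out from the 512 x num_classes shape stated above. This is a rough sketch: `fc_parameter_count` is a hypothetical helper, and the bias term is an assumption (a standard linear layer with one bias per class), not something stated in this README.

```python
FEATURE_DIM = 512  # input width of the FC layer, as stated in the text above


def fc_parameter_count(num_classes: int = 400, bias: bool = True) -> int:
    """Count the parameters of a FEATURE_DIM x num_classes linear classifier.

    The bias term is an assumption: one extra parameter per class.
    """
    if num_classes <= 0:
        raise ValueError("num_classes must be positive")
    return FEATURE_DIM * num_classes + (num_classes if bias else 0)


print(fc_parameter_count(42))  # 512 * 42 + 42 = 21546
print(fc_parameter_count())    # 512 * 400 + 400 = 205200
```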

Via Source Code

Clone this repo and create the conda environment.

git clone https://github.com/HumamAlwassel/XDC.git
cd XDC
conda env create -f environment.yml
conda activate xdc

Load the pretrained models from the file xdc.py.

from xdc import xdc_video_encoder

model = xdc_video_encoder(pretraining='r2plus1d_18_xdc_ig65m_kinetics',
                          num_classes=42)

Feature Extraction and Model Finetuning

Please refer to the Facebook Video Model Zoo (VMZ) repo for PyTorch/Caffe2 scripts for feature extraction and model finetuning on datasets such as UCF101 and HMDB51.

Please cite this work if you find XDC useful for your research.

@inproceedings{alwassel_2020_xdc,
  title={Self-Supervised Learning by Cross-Modal Audio-Video Clustering},
  author={Alwassel, Humam and Mahajan, Dhruv and Korbar, Bruno and 
          Torresani, Lorenzo and Ghanem, Bernard and Tran, Du},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2020}
}