Compositional Contrastive Learning

PyTorch implementation of Distilling Audio-Visual Knowledge by Compositional Contrastive Learning.

Introduction

Distilling knowledge from pre-trained teacher models helps to learn a small student model that generalizes better. While existing works mostly focus on distilling knowledge within the same modality, we explore distilling the multi-modal knowledge available in video data (i.e. audio and vision). Specifically, we propose to transfer audio and visual knowledge from pre-trained image and audio teacher models to learn more expressive video representations.

In multi-modal distillation, there often exists a semantic gap across modalities, e.g. a video shows applying lipstick visually while its accompanying audio is music. To ensure effective multi-modal distillation in the presence of a cross-modal semantic gap, we propose compositional contrastive learning, which features learnable compositional embeddings to close the cross-modal semantic gap, and a multi-class contrastive distillation objective to align different modalities jointly in the shared latent space.
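To give a feel for what "aligning modalities in a shared latent space" means, here is a minimal, dependency-free sketch of a generic InfoNCE-style contrastive loss. It is an illustration only and not the objective implemented in this repository: the function names `l2_normalize` and `contrastive_loss`, the cosine similarity, and the `temperature` value are all assumptions, and the learnable compositional embeddings are not modelled here.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def contrastive_loss(student, teacher, temperature=0.1):
    """Generic InfoNCE-style loss (hypothetical helper, not the repo's code).

    Row i of `student` is treated as the positive match of row i of `teacher`;
    every other teacher row in the batch acts as a negative.
    """
    s = [l2_normalize(v) for v in student]
    t = [l2_normalize(v) for v in teacher]
    total = 0.0
    for i, si in enumerate(s):
        # Cosine similarity to every teacher embedding, scaled by temperature.
        logits = [sum(a * b for a, b in zip(si, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching index as the target class.
        total += log_denom - logits[i]
    return total / len(s)
```

The loss is small when each student embedding is closest to its own teacher embedding and large when the pairs are mismatched, which is the behaviour a contrastive distillation objective relies on.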

We demonstrate that our method can distill knowledge from the audio and visual modalities to learn a stronger video model for recognition and retrieval tasks on video action recognition datasets.

<p align="center"> <img src="https://github.com/yanbeic/CCL/blob/main/figure/overview.png" width="75%"> </p>

Getting Started

Prerequisites:

Data Preparation on UCF101 (example):

python util_scripts/generate_video_hdf5.py --dir_path=${UCF101_DATA_DIR} --dst_path=datasets/UCF101/hdf5data --dataset=ucf101
python util_scripts/ucf101_json.py --dir_path=datasets/UCF101/ucfTrainTestlist --video_path=datasets/UCF101/hdf5data --audio_path=datasets/UCF101/audiocnn14embed512_features --dst_path=datasets/UCF101/ --video_type=hdf5
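
The second command reads from three directories under `datasets/UCF101`. A small check like the one below can confirm that they exist before the command is run. This is only a sketch: `EXPECTED_DIRS` and `missing_dirs` are hypothetical names that are not part of this repository, and the directory names are simply copied from the commands above.

```python
from pathlib import Path

# Sub-directories named in the data preparation commands (assumed layout).
EXPECTED_DIRS = (
    "hdf5data",
    "ucfTrainTestlist",
    "audiocnn14embed512_features",
)


def missing_dirs(root="datasets/UCF101"):
    """Return the expected sub-directories that do not exist under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs()
    print("missing:", missing if missing else "none")
```

An empty result means all three directories are in place; otherwise the returned names show which preparation step still needs to be completed.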

Training & Testing:

The commands for both training and testing are written in the same script file. Experiments were conducted on 2 GPUs. Please refer to the script files in the scripts directory for details. Use the following commands to test on the UCF51 dataset.

sh scripts/run_baseline.sh
sh scripts/run_ccl_audio.sh
sh scripts/run_ccl_image.sh
sh scripts/run_ccl_ai.sh

Bibtex

@inproceedings{chen2021distilling,
  title={Distilling Audio-Visual Knowledge by Compositional Contrastive Learning},
  author={Chen, Yanbei and Xian, Yongqin and Koepke, Sophia and Shan, Ying and Akata, Zeynep},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition},
  year={2021},
  organization={IEEE}
}

Acknowledgement

This repository is partially built with two open-source implementations: (1) 3D-ResNets-PyTorch is used in video data preparation; (2) PANNs is used for audio feature extraction.