<div align="center">

ใ€NeurIPS'2022 ๐Ÿ”ฅใ€‘Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations


</div>

The implementation of the NeurIPS 2022 paper *Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations*.

<details open><summary>💡 I also have other video-language projects that may interest you ✨.</summary><p>

Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning<br> Accepted by CVPR 2023 (Highlight) | [HBI Code]<br> Peng Jin, Jinfa Huang, Pengfei Xiong, Shangxuan Tian, Chang Liu, Xiangyang Ji, Li Yuan, Jie Chen

DiffusionRet: Generative Text-Video Retrieval with Diffusion Model<br> Accepted by ICCV 2023 | [DiffusionRet Code]<br> Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Xiangyang Ji, Chang Liu, Li Yuan, Jie Chen

Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment<br> Accepted by IJCAI 2023 | [DiCoSA Code]<br> Peng Jin, Hao Li, Zesen Cheng, Jinfa Huang, Zhennan Wang, Li Yuan, Chang Liu, Jie Chen

</p></details>

📣 Updates

*(figure: results)*

🚀 Quick Start

Datasets

<div align=center>

| Datasets | Google Cloud | Baidu Yun | Peking University Yun |
|:-----------:|:------------:|:---------:|:---------------------:|
| MSR-VTT | Download | Download | Download |
| MSVD | Download | Download | Download |
| ActivityNet | TODO | Download | Download |
| DiDeMo | TODO | Download | Download |

</div>

Model Zoo

<div align=center>

| Checkpoint | Google Cloud | Baidu Yun | Peking University Yun |
|:-----------:|:------------:|:---------:|:---------------------:|
| MSR-VTT | Download | TODO | Download |
| ActivityNet | Download | Download | Download |

</div>

Text-video Retrieval

Video-question Answering

📕 Overview

Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project video and text features into a common latent space according to the semantic similarities of text-video pairs. However, the learned shared latent space is often suboptimal, and the modality gap between visual and textual representations cannot be fully eliminated. In this paper, we propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations.
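For reference, the CLIP-style objective mentioned above is a symmetric InfoNCE loss over the in-batch text-video pairs. Below is a minimal PyTorch sketch; the function name, temperature value, and batch layout are illustrative assumptions, not code from this repository.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over in-batch pairs (illustrative sketch).

    video_emb, text_emb: (B, D) embeddings where matched text-video
    pairs share the same row index. `temperature` is an assumed value.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (B, B) cosine-similarity matrix; the diagonal holds the positives.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Contrast in both directions: video-to-text and text-to-video.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2t + loss_t2v) / 2
```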

*(figure: motivation)*

📚 Method

*(figure: EMCL)*
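At a high level, EMCL runs a few Expectation-Maximization iterations to re-express the batch features as combinations of a small set of shared bases, yielding a more compact latent space for both modalities. The following is a minimal sketch of that EM iteration under assumed hyperparameters (basis count, iteration count, temperature, and initialization); see the repository code for the exact implementation.

```python
import torch
import torch.nn.functional as F

def em_decompose(features, num_bases=32, num_iters=9, temperature=0.1):
    """EM-style decomposition sketch: features (N, D) -> compact (N, D).

    `features` stacks the video and text embeddings of a batch so that
    both modalities share the same basis set.
    """
    features = F.normalize(features, dim=-1)
    # Assumed initialization: a random subset of the features as bases.
    idx = torch.randperm(features.size(0))[:num_bases]
    bases = features[idx].clone()                                     # (K, D)

    for _ in range(num_iters):
        # E-step: soft-assign every feature to the current bases.
        resp = F.softmax(features @ bases.t() / temperature, dim=-1)  # (N, K)
        # M-step: each basis becomes the responsibility-weighted mean
        # of the features assigned to it.
        weights = resp / (resp.sum(dim=0, keepdim=True) + 1e-6)
        bases = F.normalize(weights.t() @ features, dim=-1)           # (K, D)

    # Reconstruction: project the features onto the compact basis set.
    resp = F.softmax(features @ bases.t() / temperature, dim=-1)
    return resp @ bases
```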

📌 Citation

If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:

@inproceedings{jin2022expectationmaximization,
  title={Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations},
  author={Peng Jin and Jinfa Huang and Fenglin Liu and Xian Wu and Shen Ge and Guoli Song and David A. Clifton and Jie Chen},
  booktitle={Advances in Neural Information Processing Systems},
  volume={35},
  pages={30291--30306},
  editor={Alice H. Oh and Alekh Agarwal and Danielle Belgrave and Kyunghyun Cho},
  year={2022}
}

🎗️ Acknowledgments

Our code is based on MMT, CLIP, CLIP4Clip, DRL, and CLIP2Video. We sincerely appreciate their contributions.