
Awesome-self-supervised-multimodal-learning


A curated list of awesome self-supervised multimodal learning resources. Check our survey paper for details!

```bibtex
@article{zong2024self,
  title={Self-Supervised Multimodal Learning: A Survey},
  author={Zong, Yongshuo and Mac Aodha, Oisin and Hospedales, Timothy},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2024},
  publisher={IEEE}
}
```

Table of Contents

Overview

Taxonomy: Multimodal learning, which aims to understand and analyze information from multiple modalities, has achieved substantial progress in the supervised regime in recent years. However, the heavy dependence on data paired with expensive human annotations impedes scaling up models. Meanwhile, given the availability of large-scale unannotated data in the wild, self-supervised learning has become an attractive strategy to alleviate the annotation bottleneck. Building on these two directions, self-supervised multimodal learning (SSML) provides ways to learn from raw multimodal data. In this survey, we provide a comprehensive review of the state-of-the-art in SSML, in which we elucidate three major challenges intrinsic to self-supervised learning with multimodal data: (1) learning representations from multimodal data without labels, (2) fusion of different modalities, and (3) learning with unaligned data. We then detail existing solutions to these challenges. Specifically, we consider (1) objectives for learning from multimodal unlabeled data via self-supervision, (2) model architectures from the perspective of different multimodal fusion strategies, and (3) pair-free learning strategies for coarse-grained and fine-grained alignment.

<p align="center"> <img src="figs/taxo.png" width="500"> </p>

Learning Paradigms: The figure below illustrates self-supervised vision-and-language pretraining followed by downstream supervised learning for visual question answering. (a) Supervised multimodal learning. (b) Self-supervised multimodal learning: top, self-supervised pretraining without manual annotations; bottom, supervised fine-tuning or linear readout for downstream tasks.

<p align="center"> <img src="figs/paradigms.png" width="500"> </p>

Related Survey Papers

Objectives

Instance Discrimination

In the context of multimodal learning, instance discrimination often aims to determine whether samples from two input modalities are from the same instance, i.e., paired. By doing so, it attempts to align the representation space of the paired modalities while pushing the representation space of different instance pairs further apart. There are two types of instance discrimination objectives: contrastive and matching prediction, depending on how the input is sampled.

<p align="center"> <img src="figs/InstanceD.png" width="700"> </p>
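As a concrete illustration of the contrastive variant, the symmetric InfoNCE objective (as popularized by CLIP-style models) can be sketched in a few lines. This is a minimal NumPy sketch under toy assumptions — the batch size, embedding dimension, and temperature are illustrative, not any specific paper's implementation:

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss for paired modalities.
    Row i of each matrix is one instance: matched rows are positives,
    every other combination in the batch serves as a negative."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # (B, B) cosine similarities

    def xent_diagonal(l):
        # cross-entropy with the diagonal (the paired sample) as the target
        l = l - l.max(axis=1, keepdims=True)      # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()

    # align image->text and text->image directions symmetrically
    return (xent_diagonal(logits) + xent_diagonal(logits.T)) / 2

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))                         # 4 instances, 8-dim features
loss_aligned = info_nce(a, a)                       # perfectly paired views
loss_random = info_nce(a, rng.normal(size=(4, 8)))  # unrelated "pairs"
```

Pulling paired embeddings together drives `loss_aligned` well below `loss_random`. The matching-prediction variant instead replaces the softmax over the batch with a binary paired/not-paired classification per sampled pair.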

Clustering

Clustering methods assume that end-to-end trained clustering will group the data by semantically salient characteristics. In practice, these methods iteratively predict the cluster assignments of the encoded representations and use these assignments, known as pseudo-labels, as supervision signals to update the feature representation. Multimodal clustering provides the opportunity to learn multimodal representations and also to improve conventional clustering, by using each modality's pseudo-labels to supervise the other.

<p align="center"> <img src="figs/clustering.png" width="700"> </p>
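The cross-modal pseudo-labeling idea can be sketched with a toy example: cluster one modality's features and treat the assignments as pseudo-labels for the paired modality. The k-means routine, toy features, and cluster count below are illustrative assumptions, not a specific method's implementation:

```python
import numpy as np

def kmeans(x, k, iters=10):
    """Minimal k-means with deterministic farthest-point initialization;
    returns per-sample cluster assignments (the pseudo-labels)."""
    centers = [x[0]]
    for _ in range(k - 1):                        # greedy farthest-point init
        d = np.min([((x - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):                        # standard Lloyd iterations
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = x[assign == j].mean(0)
    return assign

# Toy paired features: both "modalities" are noisy views of the same
# 3-group structure, standing in for two encoders' outputs.
rng = np.random.default_rng(1)
base = np.repeat(np.eye(3) * 5, 20, axis=0)       # 60 instances, 3 groups
mod_a = base + rng.normal(scale=0.3, size=base.shape)
mod_b = base + rng.normal(scale=0.3, size=base.shape)

pseudo_a = kmeans(mod_a, k=3)   # modality A's pseudo-labels
pseudo_b = kmeans(mod_b, k=3)   # modality B's pseudo-labels
# In SSML, pseudo_a would supervise modality B's encoder (and vice versa);
# here we only check that the two partitions agree (up to label permutation)
# via their pairwise co-assignment matrices.
same_a = pseudo_a[:, None] == pseudo_a[None, :]
same_b = pseudo_b[:, None] == pseudo_b[None, :]
agreement = (same_a == same_b).mean()
```

In a real pipeline the encoders are updated against the other modality's pseudo-labels and the clustering is re-run, alternating until the partitions stabilize.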

Masked Prediction

The masked prediction task can be performed either in an auto-encoding manner (similar to BERT) or an auto-regressive manner (similar to GPT).

<p align="center"> <img src="figs/MP.png" width="700"> </p>
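A minimal sketch of the auto-encoding (BERT-style) setup: randomly mask input tokens and build targets so the reconstruction loss is computed only at masked positions; the auto-regressive (GPT-style) variant instead predicts each token from its predecessors. The token ids, `MASK_ID` value, and mask ratio below are toy assumptions:

```python
import numpy as np

MASK_ID = 0   # assumed id reserved for the [MASK] token (toy choice)

def mask_tokens(tokens, mask_ratio=0.3, seed=0):
    """BERT-style masked-prediction setup: corrupt a random subset of
    tokens and return (corrupted input, targets). Targets are -1 at
    unmasked positions, so loss is computed only where masking occurred."""
    rng = np.random.default_rng(seed)
    tokens = np.asarray(tokens)
    masked = rng.random(tokens.shape) < mask_ratio
    corrupted = np.where(masked, MASK_ID, tokens)
    targets = np.where(masked, tokens, -1)
    return corrupted, targets

seq = np.arange(1, 11)            # toy token ids 1..10
inp, tgt = mask_tokens(seq)

# Auto-regressive (GPT-style) alternative: predict token t from tokens < t,
# i.e. inputs are the sequence shifted right relative to the targets.
ar_input, ar_target = seq[:-1], seq[1:]
```

In the multimodal setting the masked positions can be reconstructed from the remaining tokens of the same modality and from the paired modality, which encourages cross-modal fusion.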

Hybrid

Applications

State Representation Learning

Healthcare

Remote Sensing

Machine Translation

Autonomous Driving

Robotics

Challenges

Resources

Robustness/Fairness

Summary of Common Multimodal Datasets

Image-Text Datasets

| Name | # Images | # Text | Domain | Task | Access | Github |
| --- | --- | --- | --- | --- | --- | --- |
| COCO | >330,000 | >1.5M | Natural images | image captioning, image-text retrieval | Link | Github |
| Flickr30k | 31,000 | 5 sentences per image | Natural images | image captioning, image-text retrieval | Link | - |
| FlickrStyle10K | 10,000 | 10,000 | Natural images | image captioning (stylized), image-text retrieval | Link | Github |
| Flickr8k | 8,000 | 5 per image | Natural images | image captioning, image-text retrieval | Link | Github |
| Flickr8k-CN | 8,000 | 8,000 | Natural images | image captioning, image-text retrieval | Link | Github |
| SentiCap | 1671/1500 | 4892/3977 | Natural images | image captioning (stylized), image-text retrieval | Link | - |
| SBU Captions | 1M | 1M | Natural images | image captioning, image-text retrieval | Link | Link |
| Conceptual Captions | 3M | 3M | Natural images | image captioning, image-text retrieval | Link | Github |
| AIC-ICC | 210K | 210K | Natural images | image captioning, image-text retrieval | Link | Github |
| Wikipedia | 2866 | 2866 | document | image captioning, image-text retrieval | ? | Github |
| NUS-WIDE-10K | 10K | 10K | Natural images | image captioning, image-text retrieval | Link | - |
| Yelp | 200,100 | 6,990,280 | product reviews | summarization | Link | - |
| VQA v2.0 | 204,721 | 1,105,904/11,059,040 (Q/A) | Natural images | VQA | Link | - |
| ImageCLEF 2019 VQA-Med | 3825 | 3825 | Medicine | VQA | Link | Github |
| VCR | 110k | 290k/290k/290k (Q/A/Rationale) | Natural images | visual commonsense reasoning (VCR) | Link | Github |
| GD-VCR | 328 | 886/886 (Q/A) | Geo-Diverse | visual commonsense reasoning (VCR) | Link | Github |
| SNLI-VE | Details | - | Natural images | Visual Entailment | Link | Github |
| NLVR2 | 107,292 | 107,292 | Natural images | natural language for visual reasoning | Link | Github |
| NLVR | 92,244 | 92,244 | synthetic images | natural language for visual reasoning | Link | Github |
| rendered SST2 | ~1k | ~1k | images of text | optical character recognition (OCR) | Link | - |
| OCR-CC | 1.4M | 1.4M | Natural images | optical character recognition (OCR) | Link | Github |
| Hateful Memes | 10k+ | 10k+ | memes | optical character recognition (OCR) | Link | Github |
| CORD | 1K | 1k | document | OCR | Link | Github |
| RefCOCO+ | 19,992 | 141,564 | Natural images | Visual Grounding | Link | Github |

Image-Text-Audio Datasets

| Name | # Images | # Text | Domain | Task | Access | Github |
| --- | --- | --- | --- | --- | --- | --- |
| Localized Narratives | 848,749 | 873,107 | natural | image captioning, paragraph generation, VQA, phrase grounding, etc. | Link | Github |
| Open Images | 0.6M | 0.6M | natural | image captioning, detection, segmentation, VQA, etc. | Link | Github |

Video-Text Datasets

| Name | # Videos / # Clips | # Text | Domain | Task | Link | Github |
| --- | --- | --- | --- | --- | --- | --- |
| ActivityNet Captions | 20k / 100k | 100k | natural | Video Captioning, video-text retrieval | Link | Github |
| V2C | 9k | 27k | natural (human action) | Video Captioning, video-text retrieval | Link | Github |
| VATEX | 41.3k | 826k | natural | Video Captioning, video-text retrieval | Link | Github |
| YouCook2 | 2k / 15.4k | 15.4k | Cooking | Video Captioning, video-text retrieval | Link | Github |
| Charades | 10k / 10k | 27.8k | Indoor activity | Video Captioning, video-text retrieval, action recognition | Link | - |
| MSR-VTT | 7k / 10k | 200k | natural | Video Captioning, video-text retrieval | Link | Github |
| MSVD | 2k / 2k | 70k | natural | Video Captioning, video-text retrieval | - | - |
| HowTo100M | 1.2M / 136M | 136M | instructional videos | Video Captioning, video-text retrieval, action localization | Link | - |
| TGIF | 102k / 102k | 126k | animated GIFs | Video Captioning, video-text retrieval | Link | Github |
| TACoS-MLevel | 185 / 25k | 75k | Cooking | Video Captioning, video-text retrieval | Link | - |
| CrossTask | 4.7K / - | 4.7k | instructional videos | Temporal action localization | Link | Github |
| MiningYoutube | 20k / 200k | 200k | Cooking | Temporal action localization | Link | Github |
| COIN | 11,827 / - | 46,354 | 12 different domains | Action Segmentation | Link | Github |
| Breakfast | - / 11,267 | 11,267 | Cooking | Action Segmentation | Link | - |
| LSMDC | 200 / 128k | 128k | Movies | Video Captioning, video-text retrieval | Link | - |
| HOMAGE | 1.75K | - | Indoor activity | Activity Classification | Link | Github |

Video-Audio Datasets

| Name | # Video-audio | # Utterances | Domain | Task | Access | Github |
| --- | --- | --- | --- | --- | --- | --- |
| SoundNet | 2M | - | natural | Audio-visual correspondence | Link | Github |
| MUSIC | 714 | - | music instruments | Audio-visual correspondence | Link | Github |
| AVSpeech | 290k | - | Person | Audio-visual correspondence | Link | Github |
| URMP | 44 | - | music instruments | Audio-visual correspondence | Link | - |
| AV-Bench | v1 ~5k, v2 ~7k | - | natural | Audio-Visual Correspondence (AVC), Audio-Visual Event Localization (AVEL), audio-visual video parsing (AVVP), Sound Source Localization (SSL), etc. | Link | Github |
| AVE | 4143 | - | natural | temporal localization | Link | Github |
| 360° video | 1146 | - | camera | Spatial Audio generation | Link | Github |
| openpose | - | - | person | Audio-visual correspondence, music-to-video generation | Link | Github |
| LRS2 | - | 144,481 | person | speech recognition, lip reading | Link | Github |
| LRS3 | 9506 | 151,819 | person | speech recognition, lip reading | Link | - |

Point Cloud Datasets

| Name | # Meshes | Domain | Task | Access | Github |
| --- | --- | --- | --- | --- | --- |
| ModelNet40 | 12,311 | CAD models | Classification, reconstruction | Link | Github |
| ShapeNet | 220,000 | 3D models | Classification, reconstruction | Link | - |
| ScanObjectNN | 2902 | real-world point clouds | Classification, reconstruction | Link | Github |

Image-Lidar Datasets

| Name | # Images | # Points (M) | Domain | Task | Link | Github |
| --- | --- | --- | --- | --- | --- | --- |
| Eigen split KITTI | 7481+7518 | 1799 | autonomous driving | detection | Link | - |
| nuScenes | - | - | autonomous driving | 3D detection and tracking | Link | Github |
| SemanticKITTI | 23,201+20,351 | 4549 | autonomous driving | segmentation | Link | Github |

Contribute

PRs are welcome; please use the following markdown format:

- Paper Name. 
  - *Conference Year*. [[paper]](link) [[code]](link)