Awesome-self-supervised-multimodal-learning
A curated list of awesome self-supervised multimodal learning resources. Check our survey paper for details!
@article{zong2024self,
title={Self-Supervised Multimodal Learning: A Survey},
author={Zong, Yongshuo and Mac Aodha, Oisin and Hospedales, Timothy},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2024},
publisher={IEEE}
}
Table of Contents
- Overview
- Related Survey Papers
- Objectives
- Applications
- Challenges
- Summary of Common Multimodal Datasets
Overview
Taxonomy: Multimodal learning, which aims to understand and analyze information from multiple modalities, has achieved substantial progress in the supervised regime in recent years. However, the heavy dependence on data paired with expensive human annotations impedes scaling up models. Meanwhile, given the availability of large-scale unannotated data in the wild, self-supervised learning has become an attractive strategy to alleviate the annotation bottleneck. Building on these two directions, self-supervised multimodal learning (SSML) provides ways to learn from raw multimodal data. In this survey, we provide a comprehensive review of the state-of-the-art in SSML, in which we elucidate three major challenges intrinsic to self-supervised learning with multimodal data: (1) learning representations from multimodal data without labels, (2) fusion of different modalities, and (3) learning with unaligned data. We then detail existing solutions to these challenges. Specifically, we consider (1) objectives for learning from multimodal unlabeled data via self-supervision, (2) model architectures from the perspective of different multimodal fusion strategies, and (3) pair-free learning strategies for coarse-grained and fine-grained alignment.
<p align="center"> <img src=figs/taxo.png width="500"> </p>
Learning Paradigms: The figure below contrasts (a) supervised multimodal learning with (b) self-supervised multimodal learning, using vision-and-language pretraining for visual question answering as the example. In (b), the top row shows self-supervised pretraining without manual annotations, and the bottom row shows supervised fine-tuning or linear readout for downstream tasks.
<p align="center"> <img src="figs/paradigms.png" width="500"> </p>
Related Survey Papers
-
Multimodal machine learning: A survey and taxonomy.
- IEEE TPAMI 2018 [paper]
-
Foundations and recent trends in multimodal machine learning: Principles, challenges, and open questions.
- arXiv 2022 [paper]
-
Deep multimodal learning: A survey on recent advances and trends.
- IEEE Signal Processing Magazine 2017 [paper]
-
Multimodal research in vision and language: A review of current and emerging trends.
- Information Fusion 2022 [paper]
-
Self-Supervised Representation Learning: Introduction, advances, and challenges.
- IEEE Signal Processing Magazine 2022 [paper]
-
Self-supervised learning: Generative or contrastive.
- IEEE TKDE 2021 [paper]
-
Self-supervised visual feature learning with deep neural networks: A survey.
- IEEE TPAMI 2020 [paper]
-
Vision-language pre-training: Basics, recent advances, and future trends.
- arXiv 2022 [paper]
Objectives
Instance Discrimination
In the context of multimodal learning, instance discrimination often aims to determine whether samples from two input modalities are from the same instance, i.e., paired. By doing so, it attempts to align the representation space of the paired modalities while pushing the representation space of different instance pairs further apart. There are two types of instance discrimination objectives: contrastive and matching prediction, depending on how the input is sampled.
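As a rough illustration of the contrastive variant, the sketch below computes a symmetric InfoNCE-style loss over a batch of paired embeddings. It is a minimal pure-Python sketch, not the implementation of any paper listed here: the function names, the toy embeddings, and the temperature value are all assumptions made for the example.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def contrastive_loss(emb_a, emb_b, temperature=0.1):
    """Symmetric contrastive loss; emb_a[i] and emb_b[i] are assumed to be paired."""
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    # Cosine similarity between every item of modality A and every item of modality B.
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the A-to-B and B-to-A directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Hypothetical toy embeddings for three image-text pairs.
image_emb = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
text_emb_aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
# Rotating the text batch breaks the pairing.
text_emb_shuffled = [text_emb_aligned[1], text_emb_aligned[2], text_emb_aligned[0]]

loss_aligned = contrastive_loss(image_emb, text_emb_aligned)
loss_shuffled = contrastive_loss(image_emb, text_emb_shuffled)
print(loss_aligned, loss_shuffled)
```

The loss is lower when paired samples sit close together in the shared space and other instances in the batch are far away. Matching prediction would instead feed a sampled pair to a binary classifier that predicts paired versus unpaired.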
<p align="center"> <img src=figs/InstanceD.png width="700"> </p>
-
Learning transferable visual models from natural language supervision.
- ICML 2021 [paper]
-
Self-supervised multimodal versatile networks.
-
End-to-end learning of visual representations from uncurated instructional videos.
-
Scaling up visual and vision-language representation learning with noisy text supervision.
- ICML 2021 [paper]
-
Contrastive Multiview Coding.
-
Audioclip: Extending Clip to Image, Text and Audio.
-
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding.
-
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval.
-
PointCLIP: Point Cloud Understanding by CLIP.
-
Image-and-Language Understanding from Pixels Only.
-
Scaling Language-Image Pre-training via Masking.
- arXiv 2022 [paper]
-
COOKIE: Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation.
-
Slip: Self-supervision meets language-image pre-training.
-
Crossclr: Cross-modal contrastive learning for multi-modal video representations.
-
CrossPoint: Self-Supervised Cross-Modal Contrastive Learning for 3D Point Cloud Understanding.
-
Learnable PINs: Cross-Modal Embeddings for Person Identity.
- ECCV 2018 [paper]
-
Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text.
-
Learning Video Representations using Contrastive Bidirectional Transformer.
- arXiv [paper]
-
Learning representations from audio-visual spatial alignment.
-
Sound Localization by Self-Supervised Time Delay Estimation.
-
Unified visual-semantic embeddings: Bridging vision and language with structured meaning representations.
-
Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings.
-
Fine-grained Multi-Modal Self-Supervised Learning.
- BMVC 2021 [paper]
-
Self-supervised Feature Learning by Cross-modality and Cross-view Correspondences.
- CVPR Workshops 2020 [paper]
-
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization.
- NeurIPS 2018 [paper]
-
Audio-Visual Instance Discrimination with Cross-Modal Agreement.
-
Look, Listen and Learn.
- ICCV 2017 [paper]
-
Objects that Sound.
- ECCV 2018 [paper]
-
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features.
-
The Sound of Pixels.
-
The Sound of Motions.
- ICCV 2019 [paper]
-
Music Gesture for Visual Sound Separation.
- CVPR 2020 [paper]
-
Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning.
- ACM MM 2020 [paper]
Clustering
Clustering methods assume that applying end-to-end trained clustering will group the data by semantically salient characteristics. In practice, these methods iteratively predict the cluster assignments of the encoded representation and use these predictions, also known as pseudo-labels, as supervision signals to update the feature representation. Multimodal clustering provides the opportunity to learn multimodal representations and also to improve conventional clustering by using each modality’s pseudo-labels to supervise the other.
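The cross-modal supervision idea can be sketched on toy data as follows. This is only an illustration under strong assumptions: a small k-means stands in for the clustering step, and fitting class prototypes stands in for the encoder update that real methods perform with gradient descent. All function names and feature values are hypothetical.

```python
import random

def sq_dist(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))

def mean(points):
    return [sum(c) / len(points) for c in zip(*points)]

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points]

def kmeans(points, k, iters=10, seed=0):
    """Plain k-means; the returned assignments serve as pseudo-labels."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = mean(members)
    return assign(points, centroids)

def fit_on_pseudo_labels(points, pseudo_labels, k):
    """Stand-in for a supervised update: one prototype per pseudo-label class."""
    prototypes = []
    for j in range(k):
        members = [p for p, l in zip(points, pseudo_labels) if l == j]
        # An empty class falls back to an arbitrary point (toy-example shortcut).
        prototypes.append(mean(members) if members else points[0])
    return prototypes

# Hypothetical paired features: audio[i] and video[i] come from the same clip.
audio = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [5.0, 5.1], [5.1, 5.0], [5.2, 5.1]]
video = [[-3.0], [-2.9], [-3.1], [4.0], [4.1], [3.9]]

audio_pseudo = kmeans(audio, k=2)
video_pseudo = kmeans(video, k=2)

# Each modality is fitted on the pseudo-labels produced by the other one.
video_protos = fit_on_pseudo_labels(video, audio_pseudo, k=2)
audio_protos = fit_on_pseudo_labels(audio, video_pseudo, k=2)

video_pred = assign(video, video_protos)
audio_pred = assign(audio, audio_protos)
print(audio_pseudo, video_pred)
```

In a full training loop, the clustering and the cross-modal update would alternate, with the encoders retrained on the refreshed pseudo-labels at each round.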
<p align="center"> <img src=figs/clustering.png width="700"> </p>
-
Self-Supervised Learning by Cross-Modal Audio-Video Clustering.
-
Labelling unlabelled videos from scratch with multi-modal self-supervision.
-
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction.
-
u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality.
-
Deep Multimodal Clustering for Unsupervised Audiovisual Learning.
-
Self-labelling via simultaneous clustering and representation learning.
Masked Prediction
The masked prediction task can be performed in either an auto-encoding manner (similar to BERT) or an auto-regressive manner (similar to GPT).
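The difference between the two variants lies mainly in how inputs and targets are built from a token sequence. The sketch below shows both constructions on a toy multimodal sequence. The token names, the `[MASK]` symbol, and the masking ratio are assumptions for illustration and are not taken from any specific model.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol

def autoencoding_example(tokens, mask_ratio=0.3, seed=0):
    """BERT-style: hide random positions and predict them from both sides."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_ratio * len(tokens)))
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    inputs = [MASK if i in masked_positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(masked_positions)}
    return inputs, targets

def autoregressive_examples(tokens):
    """GPT-style: predict each token from the prefix that precedes it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Hypothetical sequence: two visual tokens followed by a caption.
sequence = ["<img_patch_1>", "<img_patch_2>", "a", "dog", "on", "grass"]

ae_inputs, ae_targets = autoencoding_example(sequence)
ar_pairs = autoregressive_examples(sequence)

print(ae_inputs, ae_targets)
for prefix, target in ar_pairs:
    print(prefix, "->", target)
```

In the auto-encoding case the model sees context on both sides of each masked position, whereas in the auto-regressive case each prediction is conditioned only on earlier tokens.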
<p align="center"> <img src=figs/MP.png width="700"> </p>
-
VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning.
-
CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations.
-
Jointly Learning Visual and Auditory Speech Representations from Raw Data.
-
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks.
-
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision.
-
VideoBERT: A Joint Model for Video and Language Representation Learning.
-
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks.
-
VL-BEiT: Generative Vision-Language Pretraining.
-
OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation.
-
SelfDoc: Self-Supervised Document Representation Learning.
- CVPR 2021 [paper]
-
Deep Bidirectional Language-Knowledge Graph Pretraining.
-
ERNIE: Enhanced Language Representation with Informative Entities.
-
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding.
-
Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions.
Hybrid
-
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation.
-
DM2C: Deep Mixed-Modal Clustering.
-
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks.
-
Hero: Hierarchical Encoder for Video+Language Omni-representation Pre-training.
-
UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation.
-
ActBERT: Learning Global-Local Video-Text Representations.
-
MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound.
-
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation.
-
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision.
-
UNITER: UNiversal Image-TExt Representation Learning.
-
VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts.
-
FLAVA: A Foundational Language And Vision Alignment Model.
-
Vlmixer: Unpaired vision-language pre-training via cross-modal cutmix.
- ICML 2022 [paper]
-
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks.
-
Unsupervised Vision-and-Language Pretraining via Retrieval-based Multi-Granular Alignment.
-
Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning.
-
Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs.
-
Multimodal Deep Autoencoder for Human Pose Recovery.
- IEEE TIP 2015 [paper]
-
Self-supervised object detection from audio-visual correspondence.
- CVPR 2021 [paper]
-
Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos.
-
Self-Supervised Learning of Audio-Visual Objects from Video.
-
Coot: Cooperative hierarchical transformer for video-text representation learning.
-
Unpaired Image Captioning via Scene Graph Alignments.
Applications
State Representation Learning
-
State representation learning for control: An overview
- Neural Networks 2018 [paper]
-
Unsupervised Representation Learning in Deep Reinforcement Learning: A Review
- arXiv 2022 [paper]
-
Action-Conditional Video Prediction using Deep Networks in Atari Games
-
Recurrent World Models Facilitate Policy Evolution
-
Learning latent dynamics for planning from pixels
-
Learning to Poke by Poking: Experiential Learning of Intuitive Physics
- NeurIPS 2016 [paper]
-
Learning Predictive Representations for Deformable Objects Using Contrastive Estimation
Healthcare
-
Multimodal biomedical AI.
- Nature Medicine 2022 [paper]
-
MedCLIP: Contrastive Learning from Unpaired Medical Images and Text.
-
ContIG: Self-supervised multimodal contrastive learning for medical imaging with genetics.
-
CoMIR: Contrastive multimodal image representation for registration.
-
Contrastive learning of medical visual representations from paired images and text.
-
GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-efficient Medical Image Recognition.
-
Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning.
-
Generalized radiograph representation learning via cross-supervision between images and free-text radiology reports.
Remote Sensing
-
Multimodal remote sensing benchmark datasets for land cover classification with a shared and specific feature learning model.
-
Self-Supervised SAR-Optical Data Fusion of Sentinel-1/-2 Images.
- IEEE Transactions on Geoscience and Remote Sensing 2022 [paper]
-
Semi-Supervised Learning for Joint SAR and Multispectral Land Cover Classification.
- IEEE Geoscience and Remote Sensing Letters 2021 [paper]
-
Self-Supervised Change Detection in Multiview Remote Sensing Images.
-
Self-Supervised Multisensor Change Detection.
-
Self-supervised Audiovisual Representation Learning for Remote Sensing Data.
Machine Translation
-
A Survey of Multilingual Neural Machine Translation.
- ACM Computing Surveys 2019 [paper]
-
Unsupervised Machine Translation Using Monolingual Corpora Only.
-
Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation.
- TACL 2016 [paper]
-
Visual Grounding in Video for Unsupervised Word Translation.
-
Multilingual Unsupervised NMT using Shared Encoder and Language-Specific Decoders.
- ACL 2019 [paper]
-
The Missing Ingredient in Zero-Shot Neural Machine Translation.
- arXiv 2019 [paper]
Auto-driving
-
Multi-modal Sensor Fusion for Auto Driving Perception: A Survey.
- arXiv 2022 [paper]
-
Image-to-Lidar Self-Supervised Distillation for Autonomous Driving Data.
-
Advancing Self-supervised Monocular Depth Learning with Sparse LiDAR.
-
There is More than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking with Sound by Distilling Multimodal Knowledge.
-
Unsupervised Learning of Depth, Optical Flow and Pose with Occlusion from 3D Geometry.
Robotics
-
Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks.
- ICRA 2018 [paper]
-
Self-Supervised Visual Terrain Classification From Unsupervised Acoustic Feature Learning.
- IEEE Transactions on Robotics 2019 [paper]
-
Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks.
- CVPR 2020 [paper]
-
Two stream networks for self-supervised ego-motion estimation.
- CoRL 2019 [paper]
-
Connecting Touch and Vision via Cross-Modal Prediction.
- CVPR 2019 [paper]
Challenges
Resources
-
Contrastive Vision-Language Pre-training with Limited Resources.
-
Beyond neural scaling laws: beating power law scaling via data pruning.
-
Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm.
Robustness/Fairness
-
When and why Vision-Language Models behave like Bags-of-Words, and what to do about it?
-
Robustness Analysis of Video-Language Models Against Visual and Language Perturbations.
-
Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty.
-
Badencoder: Backdoor attacks to pre-trained encoders in self-supervised learning.
-
Are Multimodal Models Robust to Image and Text Perturbations?
-
A Study of Gender Impact in Self-supervised Models for Speech-to-Text Systems.
- Interspeech 2022 [paper]
-
Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications.
- arXiv 2021 [paper]
-
Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning.
-
Multimodal datasets: misogyny, pornography, and malignant stereotypes.
- arXiv 2021 [paper]
-
On the opportunities and risks of foundation models.
- arXiv 2021 [paper]
-
Extracting Training Data from Diffusion Models.
- arXiv 2023 [paper]
-
Worst of Both Worlds: Biases Compound in Pre-trained Vision-and-Language Models.
- Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP) 2022 [paper]
-
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers.
-
What makes for good views for contrastive learning?
Summary of Common Multimodal Datasets
Image-Text Datasets
Name | # Images | # Text | Domain | Task | Access | Github |
---|---|---|---|---|---|---|
COCO | >330000 | >1.5M | Natural images | image captioning, image-text retrieval | Link | Github |
Flickr30k | 31,000 | 5 sentences for each image | Natural images | image captioning, image-text retrieval | Link | - |
FlickrStyle10K | 10,000 | 10,000 | Natural images | image captioning (stylized), image-text retrieval | Link | Github |
Flickr8k | 8,000 | 5 for each | Natural images | image captioning, image-text retrieval | Link | Github |
Flickr8k-CN | 8,000 | 8,000 | Natural images | image captioning, image-text retrieval | Link | Github |
SentiCap | 1671/1500 | 4892/3977 | Natural images | image captioning (stylized), image-text retrieval | Link | - |
SBU Captions | 1M | 1M | Natural images | image captioning, image-text retrieval | Link | Link |
Conceptual Captions | 3M | 3M | Natural images | image captioning, image-text retrieval | Link | Github |
AIC-ICC | 210K | 210K | Natural images | image captioning, image-text retrieval | Link | Github |
Wikipedia | 2866 | 2866 | document | image captioning, image-text retrieval | ? | Github |
NUS-WIDE-10K | 10K | 10K | Natural images | image captioning, image-text retrieval | Link | - |
Yelp | 200,100 | 6,990,280 | product review | summarization | Link | - |
VQA v2.0 | 204,721 | 1105904/11,059,040 (Q/A) | Natural images | VQA | Link | - |
ImageCLEF 2019 VQA-Med | 3825 | 3825 | Medicine | VQA | Link | Github |
VCR | 110k | 290k/290k/290k (Q/A/Rationale) | natural | visual commonsense reasoning (VCR) | Link | Github |
GD-VCR | 328 | 886/886(Q/A) | Geo-Diverse | visual commonsense reasoning (VCR) | Link | Github |
SNLI-VE | Details | | Natural images | Visual Entailment | Link | Github |
NLVR2 | 107,292 | 107,292 | Natural images | natural language for visual reasoning | Link | Github |
NLVR | 92244 | 92244 | synthetic images | natural language for visual reasoning | Link | Github |
rendered SST2 | ~1k | ~1k | image of text | optical character recognition (OCR) | Link | - |
OCR-CC | 1.4M | 1.4M | Natural images | optical character recognition (OCR) | Link | Github |
Hateful Memes | 10k+ | 10k+ | memes | optical character recognition (OCR) | Link | Github |
CORD | 1K | 1k | document | OCR | Link | Github |
RefCOCO+ | 19,992 | 141,564 | Natural images | Visual Grounding | Link | Github |
Image-Text-Audio Datasets
Name | # Images | # Text | Domain | Task | Access | Github |
---|---|---|---|---|---|---|
Localized Narratives | 848,749 | 873,107 | natural | Image captioning, Paragraph generation, VQA, Phrase grounding etc. | Link | Github |
Open Images | 0.6M | 0.6M | natural | Image captioning, detection, segmentation, VQA, etc. | Link | Github |
Video-Text Datasets
Name | # Videos : # Clips | # Text | Domain | Task | Access | Github |
---|---|---|---|---|---|---|
ActivityNet Captions | 20k: 100k | 100k | natural | Video Captioning, video-text retrieval | Link | Github |
V2C | 9k | 27k | natural (human action) | Video Captioning, video-text retrieval | Link | Github |
VATEX | 41.3k | 826k | natural | Video Captioning, video-text retrieval | Link | Github |
YouCook2 | 2k:15.4k | 15.4k | Cooking | Video Captioning, video-text retrieval | Link | Github |
Charades | 10k:10k | 27.8k | Indoor activity | Video Captioning, video-text retrieval, action recognition | Link | - |
MSR-VTT | 7k:10k | 200k | natural | Video Captioning, video-text retrieval | Link | Github |
MSVD | 2k:2k | 70k | natural | Video Captioning, video-text retrieval | - | - |
HowTo100M | 1.2M: 136M | 136M | instructional video | Video Captioning, video-text retrieval, action localization | Link | - |
TGIF | 102k: 102k | 126k | animated GIFs | Video Captioning, video-text retrieval | Link | Github |
TACoS-MLevel | 185:25k | 75k | Cooking | Video Captioning, video-text retrieval | Link | - |
CrossTask | 4.7K:- | 4.7k | instructional | Temporal action localization | Link | Github |
MiningYoutube | 20k:200k | 200k | Cooking | Temporal action localization | Link | Github |
COIN | 11,827:- | 46,354 | 12 different domains | Action Segmentation | Link | Github |
Breakfast | -:11267 | 11267 | cooking | Action Segmentation | Link | - |
LSMDC | 200:128k | 128k | Movie | Video Captioning, video-text retrieval | Link | - |
HOMAGE | 1.75K | | Indoor activity | Activity Classification | Link | Github |
Video-Audio Datasets
Name | # Video-audio | # utterance | Domain | Task | Access | Github |
---|---|---|---|---|---|---|
SoundNet | 2M | | natural | Audio-visual correspondence | Link | Github |
MUSIC | 714 | | music instruments | Audio-visual correspondence | Link | Github |
AVSpeech | 290k | | Person | Audio-visual correspondence | Link | Github |
URMP | 44 | | music instruments | Audio-visual correspondence | Link | - |
AV-Bench | v1 ~5k, v2 ~7k | | natural | Audio-Visual Correspondence (AVC), Audio-Visual Event Localization (AVEL) and video parsing (AVVP), Sound Source Localization (SSL), etc. | Link | Github |
AVE | 4143 | | natural | temporal localization | Link | Github |
360° video | 1146 | | camera | Spatial Audio generation | Link | Github |
openpose | | | person | Audio-visual correspondence, music-to-video generation | Link | Github |
LRS2 | - | 144481 | person | speech recognition, lip reading | Link | Github |
LRS3 | 9506 | 151819 | person | speech recognition, lip reading | Link | - |
Point Cloud Datasets
Name | # mesh | Domain | Task | Access | Github |
---|---|---|---|---|---|
ModelNet40 | 12,311 | CAD models | Classification, reconstruction | Link | Github |
ShapeNet | 220,000 | 3D models | Classification, reconstruction | Link | - |
ScanObjectNN | 2902 | real-world point cloud | Classification, reconstruction | Link | Github |
Image-LiDAR Datasets
Name | # images | # points (M) | Domain | Task | Access | Github |
---|---|---|---|---|---|---|
Eigen split KITTI | 7481+7518 | 1799 | auto driving | detection | Link | - |
nuScenes | | | auto driving | 3D detection and tracking | Link | Github |
SemanticKITTI | 23201+20351 | 4549 | auto driving | segmentation | Link | Github |
Contribute
PRs are welcome. Please use the following markdown format:
- Paper Name.
- *Conference Year*. [[paper]](link) [[code]](link)