# Awesome Vision-and-Language Pre-Training

<p align="center"> <img width="250" src="https://camo.githubusercontent.com/1131548cf666e1150ebd2a52f44776d539f06324/68747470733a2f2f63646e2e7261776769742e636f6d2f73696e647265736f726875732f617765736f6d652f6d61737465722f6d656469612f6c6f676f2e737667" alt="Awesome!"> </p>

A curated list of vision-and-language pre-training resources. :-)

## Contributing

Please feel free to send pull requests or email me (chihung.chan@outlook.com) to add links.

## Table of Contents

- [Papers](#papers)
  - [Survey](#survey)
  - [Research Papers](#research-papers)
    - [Fusion Encoders](#fusion-encoders)
    - [Dual Encoders](#dual-encoders)
    - [Unified Models](#unified-models)
- [Datasets](#datasets)
- [Evaluation](#evaluation)
- [Tutorials](#tutorials)
- [Licenses](#licenses)
- [Acknowledgement](#acknowledgement)

## Papers

### Survey

| Survey | Authors |
| --- | --- |
| A Survey of Vision-Language Pre-Trained Models | Yifan Du, Zikang Liu, Junyi Li, Wayne Xin Zhao |
| VLP: A Survey on Vision-Language Pre-training | Feilong Chen, Duzhen Zhang, Minglun Han, Xiuyi Chen, Jing Shi, Shuang Xu, Bo Xu |
| Vision-and-Language Pretrained Models: A Survey | Siqu Long, Feiqi Cao, Soyeon Caren Han, Haiqin Yang |
| Vision-and-Language Pretraining | Thong Nguyen, Cong-Duy Nguyen, Xiaobao Wu, Anh Tuan Luu |

### Research Papers

#### Fusion Encoders

| Method | Venue | Reference | Authors |
| --- | --- | --- | --- |
| **2019** | | | |
| VisualBERT | Arxiv-2019 | VisualBERT: A Simple and Performant Baseline for Vision and Language | Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang |
| ViLBERT | NeurIPS-2019 | ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee |
| LXMERT | EMNLP-2019 | LXMERT: Learning Cross-Modality Encoder Representations from Transformers | Hao Tan, Mohit Bansal |
| **2020** | | | |
| ImageBERT | Arxiv-2020 | ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data | Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, Arun Sacheti |
| InterBERT | Arxiv-2020 | InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining | Junyang Lin, An Yang, Yichang Zhang, Jie Liu, Jingren Zhou, Hongxia Yang |
| PixelBERT | Arxiv-2020 | Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers | Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, Jianlong Fu |
| VALUE | ECCV-2020 | Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models | Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, Jingjing Liu |
| UNITER | ECCV-2020 | UNITER: UNiversal Image-TExt Representation Learning | Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu |
| VisDial-BERT | ECCV-2020 | Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline | Vishvak Murahari, Dhruv Batra, Devi Parikh, Abhishek Das |
| OSCAR | ECCV-2020 | Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, Jianfeng Gao |
| X-LXMERT | EMNLP-2020 | X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers | Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi |
| Unicoder-VL | AAAI-2020 | Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training | Gen Li, Nan Duan, Yuejian Fang, Ming Gong, Daxin Jiang, Ming Zhou |
| VLP | AAAI-2020 | Unified Vision-Language Pre-Training for Image Captioning and VQA | Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, Jianfeng Gao |
| ERNIE-ViL | AAAI-2021 | ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph | Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang |
| VL-BERT | ICLR-2020 | VL-BERT: Pre-training of Generic Visual-Linguistic Representations | Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai |
| 12-IN-1 | CVPR-2020 | 12-in-1: Multi-Task Vision and Language Representation Learning | Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee |
| VILLA | NeurIPS-2020 | Large-Scale Adversarial Training for Vision-and-Language Representation Learning | Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, Jingjing Liu |
| **2021** | | | |
| X-VLM | Arxiv-2021 | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | Yan Zeng, Xinsong Zhang, Hang Li |
| KD-VLP | Arxiv-2021 | KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation | Yongfei Liu, Chenfei Wu, Shao-yen Tseng, Vasudev Lal, Xuming He, Nan Duan |
| VLMO | Arxiv-2021 | VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts | Wenhui Wang, Hangbo Bao, Li Dong, Furu Wei |
| UNICORN | Arxiv-2021 | Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling | Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, Lijuan Wang |
| MANGO | Arxiv-2021 | A Closer Look at the Robustness of Vision-and-Language Pre-trained Models | Linjie Li, Zhe Gan, Jingjing Liu |
| XGPT | NLPCC-2021 | XGPT: Cross-modal Generative Pre-Training for Image Captioning | Qiaolin Xia, Haoyang Huang, Nan Duan, Dongdong Zhang, Lei Ji, Zhifang Sui, Edward Cui, Taroon Bharti, Xin Liu, Ming Zhou |
| ROSITA | ACMMM-2021 | ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration | Yuhao Cui, Zhou Yu, Chunqi Wang, Zhongzhou Zhao, Ji Zhang, Meng Wang, Jun Yu |
| Analysis | Findings-2021 | Does Vision-and-Language Pretraining Improve Lexical Grounding? | Tian Yun, Chen Sun, Ellie Pavlick |
| Analysis | TACL-2021 | Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers | Lisa Anne Hendricks, John Mellor, Rosalia Schneider, Jean-Baptiste Alayrac, Aida Nematzadeh |
| Volta | TACL-2021 | Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs | Emanuele Bugliarello, Ryan Cotterell, Naoaki Okazaki, Desmond Elliott |
| VL-T5 | ICML-2021 | Unifying Vision-and-Language Tasks via Text Generation | Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal |
| ViLT | ICML-2021 | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | Wonjae Kim, Bokyung Son, Ildoo Kim |
| Visual Parsing | NeurIPS-2021 | Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training | Hongwei Xue, Yupan Huang, Bei Liu, Houwen Peng, Jianlong Fu, Houqiang Li, Jiebo Luo |
| ALBEF | NeurIPS-2021 | Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, Steven Hoi |
| E2E-VLP | ACL-2021 | E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning | Haiyang Xu, Ming Yan, Chenliang Li, Bin Bi, Songfang Huang, Wenming Xiao, Fei Huang |
| SOHO | CVPR-2021 | Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | Zhicheng Huang, Zhaoyang Zeng, Yupan Huang, Bei Liu, Dongmei Fu, Jianlong Fu |
| VLN-BERT | CVPR-2021 | A Recurrent Vision-and-Language BERT for Navigation | Yicong Hong, Qi Wu, Yuankai Qi, Cristian Rodriguez-Opazo, Stephen Gould |
| VinVL | CVPR-2021 | VinVL: Revisiting Visual Representations in Vision-Language Models | Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, Jianfeng Gao |
| SimVLM | ICLR-2022 | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, Yuan Cao |
| **2022** | | | |
| mPLUG | Arxiv-2022 | mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou |
| CoCa | Arxiv-2022 | Contrastive Captioners are Image-Text Foundation Models | Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu |
| Flamingo | Arxiv-2022 | Flamingo: a Visual Language Model for Few-Shot Learning | Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, Karen Simonyan |
| BLIP | Arxiv-2022 | BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi |
| Bridge-Tower | Arxiv-2022 | Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning | Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Nan Duan |
| VLMbench | Arxiv-2022 | VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation | Kaizhi Zheng, Xiaotong Chen, Odest Chadwicke Jenkins, Xin Eric Wang |
| MixGen | Arxiv-2022 | MixGen: A New Multi-Modal Data Augmentation | Xiaoshuai Hao, Yi Zhu, Srikar Appalaraju, Aston Zhang, Wanqian Zhang, Bo Li, Mu Li |
| DaVinci | Arxiv-2022 | Prefix Language Models are Unified Modal Learners | Shizhe Diao, Wangchunshu Zhou, Xinsong Zhang, Jiawei Wang |
| MetaLM | Arxiv-2022 | Language Models are General-Purpose Interfaces | Yaru Hao, Haoyu Song, Li Dong, Shaohan Huang, Zewen Chi, Wenhui Wang, Shuming Ma, Furu Wei |
| VL-BEIT | Arxiv-2022 | VL-BEIT: Generative Vision-Language Pretraining | Hangbo Bao, Wenhui Wang, Li Dong, Furu Wei |
| VLUE | Arxiv-2022 | VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models | Wangchunshu Zhou, Yan Zeng, Shizhe Diao, Xinsong Zhang |
| VL-CheckList | Arxiv-2022 | VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations | Tiancheng Zhao, Tianqi Zhang, Mingwei Zhu, Haozhan Shen, Kyusong Lee, Xiaopeng Lu, Jianwei Yin |
| Analysis | AAAI-2022 | Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective | Emmanuelle Salin, Badreddine Farah, Stéphane Ayache, Benoit Favre |
| CLIP-ViL | ICLR-2022 | How Much Can CLIP Benefit Vision-and-Language Tasks? | Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer |
| METER | CVPR-2022 | An Empirical Study of Training End-to-End Vision-and-Language Transformers | Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, Zicheng Liu, Michael Zeng |
| UVLP | CVPR-2022 | Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment | Mingyang Zhou, Licheng Yu, Amanpreet Singh, Mengjiao Wang, Zhou Yu, Ning Zhang |
| TCL | CVPR-2022 | Vision-Language Pre-Training with Triple Contrastive Learning | Jinyu Yang, Jiali Duan, Son Tran, Yi Xu, Sampath Chanda, Liqun Chen, Belinda Zeng, Trishul Chilimbi, Junzhou Huang |
| OFA | ICML-2022 | Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, Hongxia Yang |
| VLMixer | ICML-2022 | VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix | Teng Wang, Wenhao Jiang, Zhichao Lu, Feng Zheng, Ran Cheng, Chengguo Yin, Ping Luo |
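
Fusion encoders jointly encode image and text tokens with cross-modal attention, so downstream tasks such as VQA are handled by a lightweight head on top of the fused representation. As a minimal sketch (assuming the `transformers`, `torch`, `Pillow`, and `requests` packages and the public `dandelin/vilt-b32-finetuned-vqa` checkpoint; the image URL and question are placeholders), the snippet below runs VQA inference with ViLT from the table above:

```python
# Minimal VQA inference with a fusion-encoder model (ViLT).
import requests
from PIL import Image
from transformers import ViltForQuestionAnswering, ViltProcessor

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # placeholder image
image = Image.open(requests.get(url, stream=True).raw)
question = "How many cats are there?"  # placeholder question

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# The processor patches the image and tokenizes the question;
# the model fuses both streams and classifies over an answer vocabulary.
encoding = processor(image, question, return_tensors="pt")
logits = model(**encoding).logits
answer = model.config.id2label[logits.argmax(-1).item()]
print("Predicted answer:", answer)
```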

#### Dual Encoders

| Method | Venue | Reference | Authors |
| --- | --- | --- | --- |
| **2021** | | | |
| ALIGN | Arxiv-2021 | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig |
| FILIP | Arxiv-2021 | FILIP: Fine-grained Interactive Language-Image Pre-Training | Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, Chunjing Xu |
| SLIP | Arxiv-2021 | SLIP: Self-supervision meets Language-Image Pre-training | Norman Mu, Alexander Kirillov, David Wagner, Saining Xie |
| CLIP | ICML-2021 | Learning Transferable Visual Models From Natural Language Supervision | Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever |
| **2022** | | | |
| Analysis | Arxiv-2022 | Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP) | Alex Fang, Gabriel Ilharco, Mitchell Wortsman, Yuhao Wan, Vaishaal Shankar, Achal Dave, Ludwig Schmidt |
| ProtoCLIP | Arxiv-2022 | Prototypical Contrastive Language Image Pretraining | Delong Chen, Zhao Wu, Fan Liu, Zaiquan Yang, Yixiang Huang, Yiping Bao, Erjin Zhou |
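
Dual encoders embed images and texts separately and compare them with a similarity score, which makes large-scale retrieval and zero-shot classification cheap. As a hedged sketch (assuming the `transformers`, `torch`, `Pillow`, and `requests` packages and the public `openai/clip-vit-base-patch32` checkpoint; the image URL and prompts are placeholders), the snippet below scores an image against a few text prompts with CLIP from the table above:

```python
# Zero-shot image-text matching with a dual-encoder model (CLIP).
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # placeholder image
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a photo of two cats", "a photo of a dog", "a photo of a car"]  # placeholder prompts

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    # logits_per_image[i, j] is the scaled cosine similarity between image i and text j.
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```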

#### Unified Models

| Method | Venue | Reference | Authors |
| --- | --- | --- | --- |
| **2021** | | | |
| ViT-BERT | Arxiv-2021 | Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text | Qing Li, Boqing Gong, Yin Cui, Dan Kondratyuk, Xianzhi Du, Ming-Hsuan Yang, Matthew Brown |
| UNIMO | ACL-2021 | UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning | Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, Haifeng Wang |
| **2022** | | | |
| SkillNet | Arxiv-2022 | One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code | Yong Dai, Duyu Tang, Liangxin Liu, Minghuan Tan, Cong Zhou, Jingquan Wang, Zhangyin Feng, Fan Zhang, Xueyu Hu, Shuming Shi |
| data2vec | Arxiv-2022 | data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language | Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli |
| UNIFIED-IO | Arxiv-2022 | Unified-IO: A Unified Model for Vision, Language, and Multi-modal Tasks | Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, Aniruddha Kembhavi |
| Uni-Perceiver | CVPR-2022 | Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks | Xizhou Zhu, Jinguo Zhu, Hao Li, Xiaoshi Wu, Xiaogang Wang, Hongsheng Li, Xiaohua Wang, Jifeng Dai |
| FLAVA | CVPR-2022 | FLAVA: A Foundational Language And Vision Alignment Model | Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela |

## Datasets

| Dataset | Images | Image-Text Pairs | Duration (hrs) | Note |
| --- | --- | --- | --- | --- |
| SBU | 875k | 875k | - | reference, website |
| Flickr | 29k | 145k | - | reference, website |
| COCO | 113k | 567k | - | reference, website |
| COCO/OI Narratives | 849k | 873k | - | reference, website |
| VG | 108k | 5.4m | - | reference, website |
| VGQA | 108k | 1.8m | - | reference, website |
| VQA | 83k | 444k | - | reference, website |
| GQA | 82k | 1m | - | reference, website |
| CC3M | 3m | 3m | - | reference, website |
| CC12M | 12m | 12m | - | reference, website |
| YFCC-15M | 15m | 15m | - | reference, website |
| WebImageText | 400m | 400m | - | reference |
| LAION-400M | 400m | 400m | - | website |
| LAION-2B | 2b | 2b | - | website |
| RedCaps | 12m | 12m | - | reference, website |
| AltText | 1.8b | 1.8b | - | reference |
| ImageNet-Captions | 464k | 464k | - | reference, website |
| Kinetics | - | - | 1.4k | reference, website |
| TVQA | - | - | 0.4k | reference, website |
| HT100M | - | - | 134k | reference, website |
| WebVid2M | - | - | 13k | reference, website |

## Evaluation

The following content is adapted from this survey.

| Task | Description |
| --- | --- |
| **1. Classification** | |
| Visual Question Answering (VQA) | Given a visual input (image or video), VQA is the task of correctly answering a question about it. |
| Visual Reasoning and Compositional Question Answering (GQA) | GQA is an upgraded version of VQA and aims to advance research on visual reasoning over natural scenes. |
| Natural Language for Visual Reasoning (NLVR) | NLVR takes two images and a text description as input and outputs whether the description is consistent with the image pair (two labels: true or false). |
| Visual Entailment (VE) | In VE, the image is the premise and the text is the hypothesis. The goal is to predict whether the text is entailed by the image, with three labels: Entailment, Neutral, and Contradiction. |
| Visual Commonsense Reasoning (VCR) | VCR is posed as multiple-choice questions: the model must choose the correct answer to a question and then select the rationale justifying that answer from several alternatives. |
| Grounding Referring Expressions (GRE) | GRE localizes an image region given a text reference. The model outputs a score for each candidate region, and the highest-scoring region is taken as the prediction. |
| Visual Spatial Reasoning (VSR) | The VSR corpus is a collection of caption-image pairs with true/false labels. Each caption describes the spatial relation between two objects in the image, and a vision-language model (VLM) must judge whether the caption describes the image correctly (True) or not (False). |
| **2. Regression** | |
| Multi-modal Sentiment Analysis (MSA) | MSA detects sentiment in videos by leveraging multi-modal signals (e.g., vision and language), predicting the affective orientation of an utterance as a continuous intensity variable. |
| **3. Retrieval** | |
| Vision-Language Retrieval (VLR) | VLR requires understanding both vision (image or video) and language with an appropriate matching strategy. It includes two subtasks: vision-to-text retrieval, which fetches the most relevant text descriptions for a given visual query from a pool of candidates, and text-to-vision retrieval, which does the reverse (see the Recall@K sketch after this table). |
| **4. Generation** | |
| Visual Captioning (VC) | VC aims to generate semantically and syntactically appropriate text descriptions for a given visual (image or video) input. |
| Novel Object Captioning at Scale (NoCaps) | NoCaps extends the VC task to test a model's capability of describing novel objects from the Open Images dataset that are unseen in the training corpus. |
| Visual Dialogue (VD) | Given an image (or video), a dialogue history, and a question, VD asks the model to generate an answer to the question. |
| **5. Others** | |
| Multi-modal Machine Translation (MMT) | MMT is a two-fold task of translation and text generation, translating text from one language to another with additional information from other modalities, e.g., images. |
| Vision-Language Navigation (VLN) | VLN requires an agent to navigate and explore a real-world environment by grounding linguistic instructions in what it sees along the way. |
| Optical Character Recognition (OCR) | OCR refers to detecting and recognizing text in images, which includes two parts: text detection (similar to regression) and text recognition (similar to classification). |
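
For the retrieval task above, results are commonly reported as Recall@K over a held-out set of image-text pairs. The sketch below shows one way to compute it from a precomputed similarity matrix (e.g., the output of a dual encoder); the matrix shape and the convention that matching pairs sit on the diagonal are assumptions for illustration:

```python
# Recall@K for image-to-text retrieval from an (N, N) similarity matrix,
# where entry (i, j) scores image i against text j and the matching pair
# is assumed to lie on the diagonal.
import torch


def recall_at_k(similarity: torch.Tensor, k: int = 1) -> float:
    ranks = similarity.argsort(dim=-1, descending=True)       # ranked text indices per image
    targets = torch.arange(similarity.size(0)).unsqueeze(-1)  # ground-truth text index per image
    hits = (ranks[:, :k] == targets).any(dim=-1)              # is the correct text within the top k?
    return hits.float().mean().item()


# Toy example with 4 image-text pairs and random scores.
sim = torch.randn(4, 4)
print("R@1:", recall_at_k(sim, k=1))
```

Text-to-image recall follows by applying the same function to the transposed similarity matrix.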

## Tutorials

## Licenses

CC0

To the extent possible under law, Zhihong Chen has waived all copyright and related or neighboring rights to this work.

## Acknowledgement

This repo started from this survey. We thank the authors for their comprehensive review of existing studies.