
# Awesome Vision-and-Language

A curated list of awesome vision and language resources, inspired by awesome-computer-vision.

## Table of Contents

- [Survey](#survey)
- [Dataset](#dataset)
- [Image Captioning](#image-captioning)
- [Image Retrieval](#image-retrieval)
- [Scene Text Recognition](#scene-text-recognition)
- [Scene Graph](#scene-graph)
- [text2image](#text2image)
- [Video Captioning](#video-captioning)
- [Video Question Answering](#video-question-answering)
- [Video Understanding](#video-understanding)
- [Vision and Language Navigation](#vision-and-language-navigation)
- [Vision-and-Language Pretraining](#vision-and-language-pretraining)
- [Visual Dialog](#visual-dialog)
- [Visual Grounding](#visual-grounding)
- [Visual Question Answering](#visual-question-answering)
- [Visual Reasoning](#visual-reasoning)
- [Visual Relationship Detection](#visual-relationship-detection)
- [Visual Storytelling](#visual-storytelling)
- [Contributing](#contributing)
- [Licenses](#licenses)

## Survey

| Title | Conference / Journal | Paper | Code | Remarks |
| --- | --- | --- | --- | --- |
| A Survey of Current Datasets for Vision and Language Research | 2015 EMNLP | 1506.06833 | | |
| Multimodal Machine Learning: A Survey and Taxonomy | | 1705.09406 | | |
| A Comprehensive Survey of Deep Learning for Image Captioning | | 1810.04020 | | |
| Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods | | 1907.09358 | | |
| A Survey of Scene Graph Generation and Application | | Scene-Graph-Survey | | |
| Challenges and Prospects in Vision and Language Research | | 1904.09317 | | |
| Deep Multimodal Representation Learning: A Survey | 2019 ACCESS | ACCESS 2019 | | |
| Multimodal Intelligence: Representation Learning, Information Fusion, and Applications | | 1911.03977 | | |
| Vision and Language: from Visual Perception to Content Creation | 2020 APSIPA | 1912.11872 | | |
| Multimodal Research in Vision and Language: A Review of Current and Emerging Trends | | 2010.09522 | | |

## Dataset

| Title | Conference / Journal | Paper | Code | Remarks |
| --- | --- | --- | --- | --- |
| VQA: Visual Question Answering | 2015 ICCV | 1505.00468 | | visualqa |
| Visual Storytelling | 2016 NAACL | 1604.03968 | ai-visual-storytelling-seq2seq | VIST |
| Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations | 2017 IJCV | 1602.07332 | visual_genome_python_driver | visualgenome |
| CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning | 2017 CVPR | 1612.06890 | | |
| AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions | 2018 CVPR | 1705.08421 | | AVA |
| Embodied Question Answering | 2018 CVPR | 1711.11543 | | embodiedqa |
| Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments | 2018 CVPR | 1711.07280 | | bringmeaspoon |
| GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering | 2019 CVPR | 1902.09506 | | visualreasoning |
| From Recognition to Cognition: Visual Commonsense Reasoning | 2019 CVPR | 1811.10830 | r2c | VCR |
| VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research | 2019 ICCV | 1904.03493 | | |
| Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning | 2020 NeurIPS | 2010.00763 | Bongard-LOGO | |
| Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions | 2022 CVPR | 2205.13803 | Bongard-HOI | |

## Image Captioning

| Title | Conference / Journal | Paper | Code | Remarks |
| --- | --- | --- | --- | --- |
| Long-term Recurrent Convolutional Networks for Visual Recognition and Description | 2015 CVPR | 1411.4389 | | |
| Deep Visual-Semantic Alignments for Generating Image Descriptions | 2015 CVPR | 1412.2306 | | |
| Show and Tell: A Neural Image Caption Generator | 2015 CVPR | 1411.4555 | show_and_tell.tensorflow | |
| Show, Attend and Tell: Neural Image Caption Generation with Visual Attention | 2015 ICML | 1502.03044 | show-attend-and-tell | |
| From Captions to Visual Concepts and Back | 2015 CVPR | 1411.4952 | visual-concepts | |
| Image Captioning with Semantic Attention | 2016 CVPR | 1603.03925 | semantic-attention | |
| Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning | 2017 CVPR | 1612.01887 | AdaptiveAttention | |
| Self-critical Sequence Training for Image Captioning | 2017 CVPR | 1612.00563 | | |
| A Hierarchical Approach for Generating Descriptive Image Paragraphs | 2017 CVPR | 1611.06607 | | |
| Deep reinforcement learning-based image captioning with embedding reward | 2017 CVPR | 1704.03899 | | |
| Semantic compositional networks for visual captioning | 2017 CVPR | 1611.08002 | Semantic_Compositional_Nets | |
| StyleNet: Generating Attractive Visual Captions with Styles | 2017 CVPR | CVPR 2017 | stylenet | |
| Training for Diversity in Image Paragraph Captioning | 2018 EMNLP | EMNLP 2018 | image-paragraph-captioning | |
| Neural Baby Talk | 2018 CVPR | 1803.09845 | NeuralBabyTalk | |
| Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering | 2018 CVPR | 1707.07998 | | |
| “Factual” or “Emotional”: Stylized Image Captioning with Adaptive Learning and Attention | 2018 ECCV | 1807.03871 | | |
| Hierarchically Structured Reinforcement Learning for Topically Coherent Visual Story Generation | 2019 AAAI | 1805.08191 | | |
| Unsupervised Image Captioning | 2019 CVPR | 1811.10787 | unsupervised_captioning | |
| Context-aware visual policy network for fine-grained image captioning | 2019 TPAMI | 1906.02365 | CAVP | |
| Dense Relational Captioning: Triple-Stream Networks for Relationship-Based Captioning | 2019 CVPR | 1903.05942 | | |
| Describing like Humans: on Diversity in Image Captioning | 2019 CVPR | 1903.12020 | | |
| Good News, Everyone! Context driven entity-aware captioning for news images | 2019 CVPR | 1904.01475 | | |
| Auto-Encoding Scene Graphs for Image Captioning | 2019 CVPR | 1812.02378 | SGAE | |
| MSCap: Multi-Style Image Captioning with Unpaired Stylized Text | 2019 CVPR | CVPR 2019 | | |
| Robust Change Captioning | 2019 ICCV | 1901.02527 | | |
| Attention on Attention for Image Captioning | 2019 ICCV | 1908.06954 | | |
| Context-Aware Group Captioning via Self-Attention and Contrastive Features | 2020 CVPR | 2004.03708 | | |
| Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs | 2020 CVPR | 2003.00387 | asg2cap | |
| Comprehensive Image Captioning via Scene Graph Decomposition | 2020 ECCV | 2007.11731 | Sub-GC | |
| Are scene graphs good enough to improve Image Captioning? | 2020 AACL | 2009.12313 | | |
| SG2Caps: Revisiting Scene Graphs for Image Captioning | 2021 arXiv | 2102.04990 | | |

## Image Retrieval

| Title | Conference / Journal | Paper | Code | Remarks |
| --- | --- | --- | --- | --- |
| Visual Word2Vec (vis-w2v): Learning Visually Grounded Word Embeddings Using Abstract Scenes | 2016 CVPR | 1511.07067 | VisualWord2Vec | |
| Composing Text and Image for Image Retrieval - An Empirical Odyssey | 2019 CVPR | 1812.07119 | tirg | |
| Learning Relation Alignment for Calibrated Cross-modal Retrieval | 2021 ACL | 2105.13868 | IAIS | |
| ImageCoDe: Image Retrieval from Contextual Descriptions | 2022 ACL | 2203.15867 | ImageCoDe | |
| Assessing Brittleness of Image-Text Retrieval Benchmarks from Vision-Language Models Perspective | | 2407.15239 | | |
| UniIR: Training and Benchmarking Universal Multimodal Information Retrievers | 2024 ECCV | 2311.17136 | UniIR | |
| Object-Aware Query Perturbation for Cross-Modal Image-Text Retrieval | 2024 ECCV | 2407.12346 | Q-Pert | |

## Scene Text Recognition

| Title | Conference / Journal | Paper | Code | Remarks |
| --- | --- | --- | --- | --- |
| Towards Unconstrained End-to-End Text Spotting | 2019 ICCV | 1908.09231 | | |
| What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis | 2019 ICCV | 1904.01906 | clovaai | |

## Scene Graph

| Title | Conference / Journal | Paper | Code | Remarks |
| --- | --- | --- | --- | --- |
| Image Retrieval Using Scene Graphs | 2015 CVPR | 7298990 | | |
| Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations | 2017 IJCV | 1602.07332 | visual_genome_python_driver | visualgenome |
| Scene Graph Generation by Iterative Message Passing | 2017 CVPR | 1701.02426 | scene-graph-TF-release | |
| Scene Graph Generation from Objects, Phrases and Region Captions | 2017 ICCV | 1707.09700 | MSDN | |
| Neural Motifs: Scene Graph Parsing with Global Context | 2018 CVPR | 1711.06640 | neural-motifs | |
| Generating Triples with Adversarial Networks for Scene Graph Construction | 2018 AAAI | 1802.02598 | | |
| LinkNet: Relational Embedding for Scene Graph | 2018 NIPS | 1811.06410 | | |
| Image Generation from Scene Graphs | 2018 CVPR | 1804.01622 | sg2im | |
| Graph R-CNN for Scene Graph Generation | 2018 ECCV | 1808.00191 | graph-rcnn.pytorch | |
| Scene Graph Generation with External Knowledge and Image Reconstruction | 2019 CVPR | 1904.00560 | | |
| Specifying Object Attributes and Relations in Interactive Scene Generation | 2019 ICCV | 1909.05379 | scene_generation | |
| Attentive Relational Networks for Mapping Images to Scene Graphs | 2019 CVPR | 1811.10696 | | |
| Exploring Context and Visual Pattern of Relationship for Scene Graph Generation | 2019 CVPR | | sceneGraph_Mem | |
| Graphical Contrastive Losses for Scene Graph Parsing | 2019 CVPR | 1903.02728 | ContrastiveLosses4VRD | |
| Knowledge-Embedded Routing Network for Scene Graph Generation | 2019 CVPR | 1903.03326 | KERN | |
| Learning to Compose Dynamic Tree Structures for Visual Contexts | 2019 CVPR | 1812.01880 | VCTree | |
| Counterfactual Critic Multi-Agent Training for Scene Graph Generation | 2019 ICCV | 1812.02347 | | |
| Scene Graph Prediction with Limited Labels | 2019 ICCV | 1904.11622 | limited-label | |
| Unbiased Scene Graph Generation from Biased Training | 2020 CVPR | 2002.11949 | Scene-Graph-Benchmark | |
| GPS-Net: Graph Property Sensing Network for Scene Graph Generation | 2020 CVPR | 2003.12962 | GPS-Net | |
| Learning Visual Commonsense for Robust Scene Graph Generation | 2020 ECCV | 2006.09623 | | |
| Sketching Image Gist: Human-Mimetic Hierarchical Scene Graph Generation | 2020 ECCV | 2007.08760 | het-eccv20 | |

## text2image

| Title | Conference / Journal | Paper | Code | Remarks |
| --- | --- | --- | --- | --- |
| Generative Adversarial Text to Image Synthesis | 2016 ICML | 1605.05396 | icml2016 | |
| StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks | 2017 ICCV | 1612.03242 | StackGAN | |
| AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks | 2018 CVPR | 1711.10485 | AttnGAN | |
| Photographic Text-to-Image Synthesis with a Hierarchically-nested Adversarial Network | 2018 CVPR | 1802.09178 | HDGan | |
| StoryGAN: A Sequential Conditional GAN for Story Visualization | 2019 CVPR | 1812.02784 | StoryGAN | |
| MirrorGAN: Learning Text-to-image Generation by Redescription | 2019 CVPR | 1903.05854 | | |
| DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis | 2019 CVPR | 1904.01310 | | |
| Semantics Disentangling for Text-to-Image Generation | 2019 CVPR | 1904.01480 | | |
| Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction | 2019 ICCV | 1811.09845 | GeNeVA | |
| Specifying Object Attributes and Relations in Interactive Scene Generation | 2019 ICCV | 1909.05379 | scene_generation | |

## Video Captioning

| Title | Conference / Journal | Paper | Code | Remarks |
| --- | --- | --- | --- | --- |
| Long-term Recurrent Convolutional Networks for Visual Recognition and Description | 2015 CVPR | 1411.4389 | | |
| Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks | 2016 CVPR | 1510.07712 | | |
| Attention-Based Multimodal Fusion for Video Description | 2017 CVPR | 1701.03126 | | |
| Semantic compositional networks for visual captioning | 2017 CVPR | 1611.08002 | | |
| Task-Driven Dynamic Fusion: Reducing Ambiguity in Video Description | 2017 CVPR | CVPR 2017 | | |
| Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning | 2018 CVPR | 1804.00100 | | |
| Adversarial Inference for Multi-Sentence Video Description | 2019 CVPR | 1812.05634 | adv-inf | |
| Streamlined Dense Video Captioning | 2019 CVPR | 1904.03870 | DenseVideoCaptioning | |
| Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning | 2019 CVPR | 1906.04375 | | |
| iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering | 2021 WACV | 2011.07735 | iPerceive | |

## Video Question Answering

| Title | Conference / Journal | Paper | Code | Remarks |
| --- | --- | --- | --- | --- |
| MovieQA: Understanding Stories in Movies through Question-Answering | 2016 CVPR | 1512.02902 | MovieQA | |
| TVQA: Localized, Compositional Video Question Answering | 2018 EMNLP | 1809.01696 | TVQA | |
| Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions | 2020 ECCV | 2007.08751 | ROLL-VideoQA | |
| iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering | 2021 WACV | 2011.07735 | iPerceive | |

## Video Understanding

| Title | Conference / Journal | Paper | Code | Remarks |
| --- | --- | --- | --- | --- |
| TSM: Temporal Shift Module for Efficient Video Understanding | 2019 ICCV | 1811.08383 | temporal-shift-module | |
| A Graph-Based Framework to Bridge Movies and Synopses | 2019 ICCV | 1910.11009 | | |

## Vision and Language Navigation

| Title | Conference / Journal | Paper | Code | Remarks |
| --- | --- | --- | --- | --- |
| Embodied Question Answering | 2018 CVPR | 1711.11543 | | embodiedqa |
| Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments | 2018 CVPR | 1711.07280 | | bringmeaspoon |
| Frequency-Enhanced Data Augmentation for Vision-and-Language Navigation | 2023 NeurIPS | fda_pdf | fda_code | |
| Memory-adaptive vision-and-language navigation | 2024 PR | mam_paper | | |

## Vision-and-Language Pretraining

| Title | Conference / Journal | Paper | Code | Remarks |
| --- | --- | --- | --- | --- |
| LXMERT: Learning Cross-Modality Encoder Representations from Transformers | 2019 EMNLP | 1908.07490 | lxmert | |
| VideoBERT: A Joint Model for Video and Language Representation Learning | 2019 ICCV | 1904.01766 | | |
| ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | 2019 NIPS | 1908.02265 | vilbert | |
| OmniNet: A unified architecture for multi-modal multi-task learning | 2019 arXiv | 1907.07804 | OmniNet | |
| Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training | 2020 AAAI | 1908.06066 | Unicoder | |
| Unified Vision-Language Pre-Training for Image Captioning and VQA | 2020 AAAI | 1909.11059 | VLP | |
| Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | 2020 ECCV | 2004.06165 | Oscar | |
| Unsupervised Learning of Visual Features by Contrasting Cluster Assignments | 2020 NIPS | 2006.09882 | swav | |
| Learning to Learn Words from Visual Scenes | 2020 ECCV | 1911.11237 | | |
| ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs | 2021 AAAI | 2006.16934 | ERNIE | |
| VinVL: Revisiting Visual Representations in Vision-Language Models | 2021 CVPR | 2101.00529 | VinVL | |
| VirTex: Learning Visual Representations from Textual Annotations | 2021 CVPR | 2006.06666 | virtex | |
| Learning Transferable Visual Models From Natural Language Supervision | 2021 arXiv | 2103.00020 | | |
| Pretrained Transformers As Universal Computation Engines | 2021 arXiv | 2103.05247 | universal-computation | |
| Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | 2021 arXiv | 2102.05918 | | |
| Self-supervised Pretraining of Visual Features in the Wild | 2021 arXiv | 2103.01988 | | |
| Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer | 2021 arXiv | 2102.10772 | | |
| Zero-Shot Text-to-Image Generation | 2021 arXiv | 2102.12092 | | |
| WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training | 2021 arXiv | 2103.06561 | | |
| Improved baselines for vision-language pre-training | 2023 arXiv | 2305.08675 | | |

## Visual Dialog

| Title | Conference / Journal | Paper | Code | Remarks |
| --- | --- | --- | --- | --- |
| Visual Dialog | 2017 CVPR | 1611.08669 | visdial | visualdialog |
| Two Can Play This Game: Visual Dialog With Discriminative Question Generation and Answering | 2018 CVPR | 1803.11186 | | |
| Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation | 2023 | 2303.05983 | ATVC | |

## Visual Grounding

| Title | Conference / Journal | Paper | Code | Remarks |
| --- | --- | --- | --- | --- |
| Modeling Relationships in Referential Expressions with Compositional Modular Networks | 2017 CVPR | 1611.09978 | cmn | |
| Phrase Localization Without Paired Training Examples | 2019 ICCV | 1908.07553 | | |
| Learning to Assemble Neural Module Tree Networks for Visual Grounding | 2019 ICCV | 1812.03299 | | |
| A Fast and Accurate One-Stage Approach to Visual Grounding | 2019 ICCV | 1908.06354 | | |
| Zero-Shot Grounding of Objects from Natural Language Queries | 2019 ICCV | 1908.07129 | zsgnet | |
| Collaborative Transformers for Grounded Situation Recognition | 2022 CVPR | 2203.16518 | CoFormer | |

## Visual Question Answering

| Title | Conference / Journal | Paper | Code | Remarks |
| --- | --- | --- | --- | --- |
| VQA: Visual Question Answering | 2015 ICCV | 1505.00468 | | visualqa |
| Hierarchical question-image co-attention for visual question answering | 2016 NIPS | 1606.00061 | HieCoAttenVQA | |
| Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding | 2016 EMNLP | 1606.01847 | vqa-mcb | |
| Stacked Attention Networks for Image Question Answering | 2016 CVPR | 1511.02274 | imageqa-san | |
| Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering | 2016 ECCV | 1511.05234 | AAAA | |
| Dynamic Memory Networks for Visual and Textual Question Answering | 2016 ICML | 1603.01417 | dmn-plus | |
| Multimodal Residual Learning for Visual QA | 2016 NIPS | 1606.01455 | nips-mrn-vqa | |
| Graph-Structured Representations for Visual Question Answering | 2017 CVPR | 1609.05600 | | |
| Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering | 2017 CVPR | 1612.00837 | | |
| Learning to Reason: End-to-End Module Networks for Visual Question Answering | 2017 ICCV | 1704.05526 | | |
| Explicit Reasoning over End-to-End Neural Architectures for Visual Question Answering | 2018 AAAI | 1803.08896 | PSLQA | |
| Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering | 2018 CVPR | 1707.07998 | | |
| Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge | 2018 CVPR | 1708.02711 | vqa-winner | |
| Transfer Learning via Unsupervised Task Discovery for Visual Question Answering | 2019 CVPR | 1810.02358 | VQA-Transfer-ExternalData | |
| GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering | 2019 CVPR | 1902.09506 | | visualreasoning |
| Towards VQA Models That Can Read | 2019 CVPR | 1904.08920 | | |
| From Strings to Things: Knowledge-enabled VQA Model that can Read and Reason | 2019 ICCV | ICCV 2019 | | |
| An Empirical Study on Leveraging Scene Graphs for Visual Question Answering | 2019 BMVC | 1907.12133 | scene-graphs-vqa | |
| RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning | 2022 ICLR | 2204.11167 | RelViT | |
| TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation | 2022 arXiv | 2208.01813 | TAG | |

## Visual Reasoning

| Title | Conference / Journal | Paper | Code | Remarks |
| --- | --- | --- | --- | --- |
| CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning | 2017 CVPR | 1612.06890 | | |
| Inferring and Executing Programs for Visual Reasoning | 2017 ICCV | 1705.03633 | | |
| GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering | 2019 CVPR | 1902.09506 | | visualreasoning |
| Explainable and Explicit Visual Reasoning over Scene Graphs | 2019 CVPR | 1812.01855 | | |
| From Recognition to Cognition: Visual Commonsense Reasoning | 2019 CVPR | 1811.10830 | r2c | VCR |
| Dynamic Graph Attention for Referring Expression Comprehension | 2019 ICCV | 1909.08164 | | |
| Visual Semantic Reasoning for Image-Text Matching | 2019 ICCV | 1909.02701 | VSRN | |
| Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning | 2020 NeurIPS | 2010.00763 | Bongard-LOGO | |
| Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions | 2022 CVPR | 2205.13803 | Bongard-HOI | |
| RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning | 2022 ICLR | 2204.11167 | RelViT | |
| PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization | 2023 ICCV | 2307.15199 | PromptStyler | |

## Visual Relationship Detection

| Title | Conference / Journal | Paper | Code | Remarks |
| --- | --- | --- | --- | --- |
| Visual Relationship Detection with Language Priors | 2016 ECCV | 1608.00187 | Visual-Relationship-Detection | |
| ViP-CNN: Visual Phrase Guided Convolutional Neural Network | 2017 CVPR | 1702.07191 | | |
| Visual Translation Embedding Network for Visual Relation Detection | 2017 CVPR | 1702.08319 | drnet | |
| Deep Variation-structured Reinforcement Learning for Visual Relationship and Attribute Detection | 2017 CVPR | 1703.03054 | DeepVariationRL | |
| Detecting Visual Relationships with Deep Relational Networks | 2017 CVPR | 1704.03114 | drnet | |
| Phrase Localization and Visual Relationship Detection with Comprehensive Image-Language Cues | 2017 ICCV | 1611.06641 | pl-clc | |
| Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation | 2017 ICCV | 1707.09423 | | |
| Referring Relationships | 2018 CVPR | 1803.10362 | ReferringRelationships | |
| Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition | 2018 ECCV | 1807.04979 | ZoomNet | |
| Shuffle-Then-Assemble: Learning Object-Agnostic Visual Relationship Features | 2018 ECCV | 1808.00171 | vrd | |
| Leveraging Auxiliary Text for Deep Recognition of Unseen Visual Relationships | 2020 ICLR | 1910.12324 | | |

## Visual Storytelling

| Title | Conference / Journal | Paper | Code | Remarks |
| --- | --- | --- | --- | --- |
| Visual Storytelling | 2016 NAACL | 1604.03968 | ai-visual-storytelling-seq2seq | VIST |
| No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling | 2018 ACL | 1804.09160 | AREL | |
| Show, Reward and Tell: Automatic Generation of Narrative Paragraph from Photo Stream by Adversarial Training | 2018 AAAI | | | |
| Hide-and-Tell: Learning to Bridge Photo Streams for Visual Storytelling | 2020 AAAI | 2002.00774 | | |
| Storytelling from an Image Stream Using Scene Graphs | 2020 AAAI | AAAI 2020 | | |

## Contributing

Please feel free to send me pull requests or email (shmwoo9395@gmail.com) to add links.

## Licenses

CC0

To the extent possible under law, Sangmin Woo has waived all copyright and related or neighboring rights to this work.