Awesome Image Captioning

A curated list of image captioning papers and resources from related areas. :-)

Contributing

Please feel free to send me pull requests or email (chihung.chan@outlook.com) to add links. Markdown format:

- [Paper Name](link) - Author 1 et al, `Conference Year`. [[code]](link)
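To reduce formatting slips in pull requests, an entry in the format above can be generated with a small helper. This is only a sketch: the function name `format_entry`, its parameters, and the example URLs are hypothetical and not part of this repository.

```python
from typing import Optional


def format_entry(title: str, paper_url: str, first_author: str,
                 venue: str, year: int,
                 code_url: Optional[str] = None) -> str:
    """Build one list entry in this README's contribution format.

    Hypothetical helper for illustration; not shipped with this list.
    """
    # Mandatory part: linked title, first author, venue and year in backticks.
    entry = f"- [{title}]({paper_url}) - {first_author} et al, `{venue} {year}`."
    # Optional part: link to a public implementation, if one exists.
    if code_url:
        entry += f" [[code]]({code_url})"
    return entry


if __name__ == "__main__":
    # Placeholder values only; replace them with the real paper details.
    print(format_entry("Paper Name", "https://example.com/paper",
                       "Author A", "CVPR", 2016,
                       code_url="https://example.com/code"))
```

Leaving `code_url` unset produces an entry without the trailing `[[code]]` link, which matches the entries in this list that have no public implementation.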

Change Log

May 25: An up-to-date paper list about vision-and-language pre-training is available here.

Table of Contents

- Papers
  - Survey
  - Before - 2015 - 2016 - 2017 - 2018 - 2019 - 2020
- Dataset
- Image Captioning Challenge
- Popular Implementations
  - PyTorch
  - TensorFlow
  - Torch
  - Others

Papers

Survey

A Comprehensive Survey of Deep Learning for Image Captioning - Hossain M et al, arXiv preprint 2018.

Before

I2t: Image parsing to text description - Yao B Z et al, P IEEE 2011.
Im2Text: Describing Images Using 1 Million Captioned Photographs - Ordonez V et al, NIPS 2011. [project web]
Deep Captioning with Multimodal Recurrent Neural Networks - Mao J et al, arXiv preprint 2014.

2015

2016

`CVPR 2016`

Image captioning with semantic attention - You Q et al, CVPR 2016. [code]
DenseCap: Fully Convolutional Localization Networks for Dense Captioning - Johnson J et al, CVPR 2016. [code]
What value do explicit high level concepts have in vision to language problems? - Wu Q et al, CVPR 2016.
Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data - Hendricks L A et al, CVPR 2016. [code]
SPICE: Semantic Propositional Image Caption Evaluation - Anderson P et al, ECCV 2016. [code]

`ACMMM 2016`

Image Captioning with Deep Bidirectional LSTMs - Wang C et al, ACMMM 2016. [code]

`ACL 2016`

Multimodal Pivots for Image Caption Translation - Hitschler J et al, ACL 2016.

`arXiv preprint 2016`

Image Caption Generation with Text-Conditional Semantic Attention - Zhou L et al, arXiv preprint 2016. [code]
DeepDiary: Automatic Caption Generation for Lifelogging Image Streams - Fan C et al, arXiv preprint 2016.
Learning to generalize to new compositions in image understanding - Atzmon Y et al, arXiv preprint 2016.
Generating captions without looking beyond objects - Heuer H et al, arXiv preprint 2016.
Bootstrap, Review, Decode: Using Out-of-Domain Textual Data to Improve Image Captioning - Chen W et al, arXiv preprint 2016. [code]
Recurrent Image Captioner: Describing Images with Spatial-Invariant Transformation and Attention Filtering - Liu H et al, arXiv preprint 2016.
Recurrent Highway Networks with Language CNN for Image Captioning - Gu J et al, arXiv preprint 2016.

2017

`CVPR 2017`

Captioning Images with Diverse Objects - Venugopalan S et al, CVPR 2017. [code]
Top-down Visual Saliency Guided by Captions - Ramanishka V et al, CVPR 2017. [code]
Self-Critical Sequence Training for Image Captioning - Rennie S J et al, CVPR 2017. [code]
Dense Captioning with Joint Inference and Visual Context - Yang L et al, CVPR 2017. [code]
Skeleton Key: Image Captioning by Skeleton-Attribute Decomposition - Wang Y et al, CVPR 2017. [code]
A Hierarchical Approach for Generating Descriptive Image Paragraphs - Krause J et al, CVPR 2017. [code]
Deep Reinforcement Learning-based Image Captioning with Embedding Reward - Ren Z et al, CVPR 2017.
Incorporating Copying Mechanism in Image Captioning for Learning Novel Objects - Yao T et al, CVPR 2017.
Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning - Lu J et al, CVPR 2017. [code]
Attend to You: Personalized Image Captioning with Context Sequence Memory Networks - Park C C et al, CVPR 2017. [code]
SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning - Chen L et al, CVPR 2017. [code]
Bidirectional Beam Search: Forward-Backward Inference in Neural Sequence Models for Fill-In-The-Blank Image Captioning - Sun Q et al, CVPR 2017.

`ICCV 2017`

Areas of Attention for Image Captioning - Pedersoli M et al, ICCV 2017.
Boosting Image Captioning with Attributes - Yao T et al, ICCV 2017.
An Empirical Study of Language CNN for Image Captioning - Gu J et al, ICCV 2017.
Improved Image Captioning via Policy Gradient Optimization of SPIDEr - Liu S et al, ICCV 2017.
Towards Diverse and Natural Image Descriptions via a Conditional GAN - Dai B et al, ICCV 2017. [code]
Paying Attention to Descriptions Generated by Image Captioning Models - Tavakoli H R et al, ICCV 2017.
Show, Adapt and Tell: Adversarial Training of Cross-domain Image Captioner - Chen T H et al, ICCV 2017. [code]

`AAAI 2017`

Image Caption with Global-Local Attention - Li L et al, AAAI 2017.
Reference Based LSTM for Image Captioning - Chen M et al, AAAI 2017.
Attention Correctness in Neural Image Captioning - Liu C et al, AAAI 2017.
Text-guided Attention Model for Image Captioning - Mun J et al, AAAI 2017. [code]

`NIPS 2017`

Contrastive Learning for Image Captioning - Dai B et al, NIPS 2017. [code]

`TPAMI 2017`

Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge - Vinyals O et al, TPAMI 2017. [code]

`arXiv preprint 2017`

MAT: A Multimodal Attentive Translator for Image Captioning - Liu C et al, arXiv preprint 2017.
Actor-Critic Sequence Training for Image Captioning - Zhang L et al, arXiv preprint 2017.
What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator? - Tanti M et al, arXiv preprint 2017. [code]
Self-Guiding Multimodal LSTM - when we do not have a perfect training dataset for image captioning - Xian Y et al, arXiv preprint 2017.
Phrase-based Image Captioning with Hierarchical LSTM Model - Tan Y H et al, arXiv preprint 2017.
Show-and-Fool: Crafting Adversarial Examples for Neural Image Captioning - Chen H et al, arXiv preprint 2017.

2018

`CVPR 2018`

Neural Baby Talk - Lu J et al, CVPR 2018. [code]
Convolutional Image Captioning - Aneja J et al, CVPR 2018.
Learning to Evaluate Image Captioning - Cui Y et al, CVPR 2018. [code]
Discriminability Objective for Training Descriptive Captions - Luo R et al, CVPR 2018. [code]
SemStyle: Learning to Generate Stylised Image Captions using Unaligned Text - Mathews A et al, CVPR 2018.
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering - Anderson P et al, CVPR 2018. [code]
GroupCap: Group-Based Image Captioning With Structured Relevance and Diversity Constraints - Chen F et al, CVPR 2018.

`ECCV 2018`

Unpaired Image Captioning by Language Pivoting - Gu J et al, ECCV 2018.
Recurrent Fusion Network for Image Captioning - Jiang W et al, ECCV 2018.
Exploring Visual Relationship for Image Captioning - Yao T et al, ECCV 2018.
Rethinking the Form of Latent States in Image Captioning - Dai B et al, ECCV 2018. [code]
Boosted Attention: Leveraging Human Attention for Image Captioning - Chen S et al, ECCV 2018.
"Factual" or "Emotional": Stylized Image Captioning with Adaptive Learning and Attention - Chen T et al, ECCV 2018.

`AAAI 2018`

Learning to Guide Decoding for Image Captioning - Jiang W et al, AAAI 2018.
Stack-Captioning: Coarse-to-Fine Learning for Image Captioning - Gu J et al, AAAI 2018. [code]
Temporal-difference Learning with Sampling Baseline for Image Captioning - Chen H et al, AAAI 2018.

`NeurIPS 2018`

Partially-Supervised Image Captioning - Anderson P et al, NeurIPS 2018.
A Neural Compositional Paradigm for Image Captioning - Dai B et al, NeurIPS 2018.

`NAACL 2018`

Defoiling Foiled Image Captions - Wang J et al, NAACL 2018.
Punny Captions: Witty Wordplay in Image Descriptions - Chandrasekaran A et al, NAACL 2018. [code]
Object Counts! Bringing Explicit Detections Back into Image Captioning - Aneja J et al, NAACL 2018.

`ACL 2018`

Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning - Sharma P et al, ACL 2018. [code]
Attacking visual language grounding with adversarial examples: A case study on neural image captioning - Chen H et al, ACL 2018. [code]

`EMNLP 2018`

simNet: Stepwise Image-Topic Merging Network for Generating Detailed and Comprehensive Image Captions - Liu et al, EMNLP 2018. [code]

`arXiv preprint 2018`

Improved Image Captioning with Adversarial Semantic Alignment - Melnyk I et al, arXiv preprint 2018.
Improving Image Captioning with Conditional Generative Adversarial Nets - Chen C et al, arXiv preprint 2018.
CNN+CNN: Convolutional Decoders for Image Captioning - Wang Q et al, arXiv preprint 2018.
Diverse and Controllable Image Captioning with Part-of-Speech Guidance - Deshpande A et al, arXiv preprint 2018.

2019

`CVPR 2019`

Unsupervised Image Captioning - Yang F et al, CVPR 2019. [code]
Engaging Image Captioning Via Personality - Shuster K et al, CVPR 2019.
Pointing Novel Objects in Image Captioning - Li Y et al, CVPR 2019.
Auto-Encoding Scene Graphs for Image Captioning - Yang X et al, CVPR 2019.
Context and Attribute Grounded Dense Captioning - Yin G et al, CVPR 2019.
Look Back and Predict Forward in Image Captioning - Qin Y et al, CVPR 2019.
Self-critical n-step Training for Image Captioning - Gao J et al, CVPR 2019.
Intention Oriented Image Captions with Guiding Objects - Zheng Y et al, CVPR 2019.
Describing like humans: on diversity in image captioning - Wang Q et al, CVPR 2019.
Adversarial Semantic Alignment for Improved Image Captions - Dognin P et al, CVPR 2019.
MSCap: Multi-Style Image Captioning With Unpaired Stylized Text - Gao L et al, CVPR 2019.
Fast, Diverse and Accurate Image Captioning Guided By Part-of-Speech - Deshpande A et al, CVPR 2019.
Good News, Everyone! Context driven entity-aware captioning for news images - Biten A F et al, CVPR 2019. [code]
CapSal: Leveraging Captioning to Boost Semantics for Salient Object Detection - Zhang L et al, CVPR 2019. [code]
Dense Relational Captioning: Triple-Stream Networks for Relationship-Based Captioning - Kim D et al, CVPR 2019. [code]
Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions - Cornia M et al, CVPR 2019. [code]
Exact Adversarial Attack to Image Captioning via Structured Output Learning With Latent Variables - Xu Y et al, CVPR 2019.

`AAAI 2019`

Meta Learning for Image Captioning - Li N et al, AAAI 2019.
Learning Object Context for Dense Captioning - Li X et al, AAAI 2019.
Hierarchical Attention Network for Image Captioning - Wang W et al, AAAI 2019.
Deliberate Residual based Attention Network for Image Captioning - Gao L et al, AAAI 2019.
Improving Image Captioning with Conditional Generative Adversarial Nets - Chen C et al, AAAI 2019.
Connecting Language to Images: A Progressive Attention-Guided Network for Simultaneous Image Captioning and Language Grounding - Song L et al, AAAI 2019.

`ACL 2019`

Dense Procedure Captioning in Narrated Instructional Videos - Shi B et al, ACL 2019.
Informative Image Captioning with External Sources of Information - Zhao S et al, ACL 2019.
Bridging by Word: Image Grounded Vocabulary Construction for Visual Captioning - Fan Z et al, ACL 2019.

`BMVC 2019`

Image Captioning with Unseen Objects - Demirel et al, BMVC 2019.
Look and Modify: Modification Networks for Image Captioning - Sammani et al, BMVC 2019. [code]
Show, Infer and Tell: Contextual Inference for Creative Captioning - Khare et al, BMVC 2019. [code]
SC-RANK: Improving Convolutional Image Captioning with Self-Critical Learning and Ranking Metric-based Reward - Yan et al, BMVC 2019.

`ICCV 2019`

Hierarchy Parsing for Image Captioning - Yao T et al, ICCV 2019.
Entangled Transformer for Image Captioning - Li G et al, ICCV 2019.
Attention on Attention for Image Captioning - Huang L et al, ICCV 2019. [code]
Reflective Decoding Network for Image Captioning - Ke L et al, ICCV 2019.
Learning to Collocate Neural Modules for Image Captioning - Yang X et al, ICCV 2019.

`NeurIPS 2019`

Image Captioning: Transforming Objects into Words - Herdade S et al, NeurIPS 2019.
Adaptively Aligned Image Captioning via Adaptive Attention Time - Huang L et al, NeurIPS 2019. [code]
Variational Structured Semantic Inference for Diverse Image Captioning - Chen F et al, NeurIPS 2019.
Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations - Liu F et al, NeurIPS 2019. [code]

`IJCAI 2019`

Image Captioning with Compositional Neural Module Networks - Tian J et al, IJCAI 2019.
Exploring and Distilling Cross-Modal Information for Image Captioning - Liu F et al, IJCAI 2019.
Swell-and-Shrink: Decomposing Image Captioning by Transformation and Summarization - Wang H et al, IJCAI 2019.
Hornet: a hierarchical offshoot recurrent network for improving person re-ID via image captioning - Yan S et al, IJCAI 2019.

`EMNLP 2019`

Image Captioning with Very Scarce Supervised Data: Adversarial Semi-Supervised Learning Approach - Kim D J et al, EMNLP 2019.
TIGEr: Text-to-Image Grounding for Image Caption Evaluation - Jiang M et al, EMNLP 2019.
REO-Relevance, Extraness, Omission: A Fine-grained Evaluation for Image Captioning - Jiang M et al, EMNLP 2019.
Decoupled Box Proposal and Featurization with Ultrafine-Grained Semantic Labels Improve Image Captioning and Visual Question Answering - Changpinyo S et al, EMNLP 2019.

`CoNLL 2019`

Compositional Generalization in Image Captioning - Nikolaus M et al, CoNLL 2019. [code]

2020

`AAAI 2020`

MemCap: Memorizing Style Knowledge for Image Captioning - Zhao et al, AAAI 2020.
Unified Vision-Language Pre-Training for Image Captioning and VQA - Zhou L et al, AAAI 2020.
Show, Recall, and Tell: Image Captioning with Recall Mechanism - Wang L et al, AAAI 2020.
Reinforcing an Image Caption Generator using Off-line Human Feedback - Seo P H et al, AAAI 2020.
Interactive Dual Generative Adversarial Networks for Image Captioning - Liu et al, AAAI 2020.
Feature Deformation Meta-Networks in Image Captioning of Novel Objects - Cao et al, AAAI 2020.
Joint Commonsense and Relation Reasoning for Image and Video Captioning - Hou et al, AAAI 2020.
Learning Long- and Short-Term User Literal-Preference with Multimodal Hierarchical Transformer Network for Personalized Image Caption - Zhang et al, AAAI 2020.

`CVPR 2020`

Normalized and Geometry-Aware Self-Attention Network for Image Captioning - Guo L et al, CVPR 2020.
Object Relational Graph with Teacher-Recommended Learning for Video Captioning - Zhang Z et al, CVPR 2020.
Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs - Chen S et al, CVPR 2020.
X-Linear Attention Networks for Image Captioning - Pan et al, CVPR 2020.

`ACL 2020`

Improving Image Captioning with Better Use of Caption - Shi Z et al, ACL 2020.
Cross-modal Coherence Modeling for Caption Generation - Alikhani M et al, ACL 2020.
Improving Image Captioning Evaluation by Considering Inter References Variance - Yi Y et al, ACL 2020.
MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning - Lei J et al, ACL 2020.
Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA - Kim H et al, ACL 2020.

`ECCV 2020`

Length-Controllable Image Captioning - Deng C et al, ECCV 2020.
Captioning Images Taken by People Who Are Blind - Gurari D et al, ECCV 2020.
Towards Unique and Informative Captioning of Images - Wang Z et al, ECCV 2020.
Learning Visual Representations with Caption Annotations - Sariyildiz M et al, ECCV 2020.
Comprehensive Image Captioning via Scene Graph Decomposition - Zhong Y et al, ECCV 2020.
SODA: Story Oriented Dense Video Captioning Evaluation Framework - Fujita S et al, ECCV 2020.
TextCaps: a Dataset for Image Captioning with Reading Comprehension - Sidorov O et al, ECCV 2020.
Compare and Reweight: Distinctive Image Captioning Using Similar Images Sets - Wang J et al, ECCV 2020.
Learning to Generate Grounded Visual Captions without Localization Supervision - Ma C et al, ECCV 2020.
Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards - Yang X et al, ECCV 2020.
Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos - Chen S et al, ECCV 2020.

`EMNLP 2020`

CapWAP: Image Captioning with a Purpose - Fisch A et al, EMNLP 2020.
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers - Cho J et al, EMNLP 2020.
Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning - Fang Z et al, EMNLP 2020.
Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements - Li Y et al, EMNLP 2020.

`NeurIPS 2020`

Diverse Image Captioning with Context-Object Split Latent Spaces - Mahajan S et al, NeurIPS 2020.
RATT: Recurrent Attention to Transient Tasks for Continual Image Captioning - Chiaro R et al, NeurIPS 2020.

Dataset

nocaps, LANG: English.
MS COCO, LANG: English.
Flickr 8k, LANG: English.
Flickr 30k, LANG: English.
AI Challenger, LANG: Chinese.
Visual Genome, LANG: English.
SBU Captioned Photo Dataset, LANG: English.
IAPR TC-12, LANG: English, German and Spanish.

Image Captioning Challenge

Popular Implementations

PyTorch

TensorFlow

Torch

Others

Licenses

To the extent possible under law, Zhihong Chen has waived all copyright and related or neighboring rights to this work.
