Awesome-Image Captioning

A paper list on image captioning (IC), intended as a supplementary reference to this short survey. Based on the survey, we collected recent IC papers and their code.

This paper list is organized as follows:

Ⅰ. Existing surveys in the IC field

Ⅱ. Three main directions of current IC:

The mainstream IC model today is a heterogeneous encoder-decoder architecture, which has been improved along three major directions.

Visual feature: advances in the encoder (CNN)

Attention mechanism: changes in the attended source; modifications to the architecture of the attention module

Visual and language structure: explorations of structural inductive bias

Ⅲ. Transformer & homogeneous architecture

Many remarkable performance improvements have been achieved since the advent of the Transformer.
Thanks to the Transformer's architectural advantages, a promising, purely Transformer-based homogeneous encoder-decoder captioner is around the corner.

Ⅳ. Large-scale pretraining

Motivated by NLP, researchers in the vision-language domain have also proposed training large-scale Transformer architectures. Some of these multi-modal large-scale pre-trained models can also be used for IC and achieve much better performance than small-scale ones.

If needed, I'm glad to add further paper information, such as journal references, and to keep updating the list with the latest work. However, I'm currently busy with other matters and cannot update this paper list in the near term.

Most of the journal references can be found on arXiv (via the PDF links already provided), and I recommend this website for finding source code.

Survey

Current Image Captioning

Classic Encoder-Decoder Captioner

Visual Feature -- CNN

Attention Mechanism

Visual and Language Structure -- Inductive Bias

Transformer & Homogeneous Architecture

Large Scale Pretraining

Prompt