Introduction

This neural system for image captioning is roughly based on the paper "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" by Xu et al. (ICML 2015). The input is an image, and the output is a sentence describing its content. A convolutional neural network extracts visual features from the image, and an LSTM recurrent neural network decodes these features into a sentence. A soft attention mechanism is incorporated to improve the quality of the captions. The project is implemented with the TensorFlow library and allows end-to-end training of both the CNN and RNN parts.
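The soft attention mechanism lets the decoder focus on a different weighted combination of image regions at each decoding step. The following is a minimal NumPy sketch of the idea, not the project's actual code; the parameter names W_f, W_h, and v are hypothetical:

    import numpy as np

    def soft_attention(features, hidden, W_f, W_h, v):
        # features: (L, D) annotation vectors from the CNN feature map
        # hidden:   (H,)   current LSTM hidden state
        # W_f: (D, A), W_h: (H, A), v: (A,) are learned parameters
        scores = np.tanh(features @ W_f + hidden @ W_h) @ v  # (L,) relevance of each region
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()               # attention weights over regions, sum to 1
        context = alpha @ features         # (D,) expected context vector
        return context, alpha

The context vector is then fed to the LSTM together with the previous word when predicting the next word of the caption.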

Prerequisites

Python with the TensorFlow library installed.

Usage

To train a model on the COCO train2014 data, run a command like this:

python main.py --phase=train \
    --load_cnn \
    --cnn_model_file='./vgg16_no_fc.npy' \
    [--train_cnn]
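The --load_cnn flag initializes the CNN from pretrained VGG16 weights stored in a NumPy file. As a rough illustration (not the project's actual loader), weights saved this way are typically a pickled dict mapping layer names to parameter arrays:

    import numpy as np

    # Inspect the pretrained weight file (assumes it stores a pickled dict
    # of layer name -> list of parameter arrays; the exact layout may differ).
    data = np.load('./vgg16_no_fc.npy', allow_pickle=True).item()
    for name, params in data.items():
        print(name, [p.shape for p in params])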

Turn on --train_cnn if you want to jointly train the CNN and RNN parts; otherwise, only the RNN part is trained. The checkpoints will be saved in the folder models. To resume training from a checkpoint, run a command like this:

python main.py --phase=train \
    --load \
    --model_file='./models/xxxxxx.npy' \
    [--train_cnn]

To monitor the progress of training, run the following command:

tensorboard --logdir='./summary/'
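The summary folder is populated during training. As a minimal TF 1.x-style sketch of how such scalar summaries can be written (this is an illustration, not the project's actual logging code):

    import tensorflow as tf

    # Hypothetical TF 1.x summary logging into ./summary/
    writer = tf.summary.FileWriter('./summary/')
    summary = tf.Summary(value=[tf.Summary.Value(tag='loss', simple_value=1.23)])
    writer.add_summary(summary, global_step=0)
    writer.flush()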
To evaluate a trained model on the COCO val2014 data, run a command like this:

python main.py --phase=eval \
    --model_file='./models/xxxxxx.npy' \
    --beam_size=3

The evaluation result will be printed to stdout. Furthermore, the generated captions will be saved in the file val/results.json.
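The --beam_size option controls beam search decoding: at each step the decoder keeps the beam_size highest-scoring partial captions instead of greedily taking the single best word. A minimal sketch of the idea follows; step_fn and the token conventions are hypothetical, not the project's API:

    import numpy as np

    def beam_search(step_fn, start_token, end_token, beam_size=3, max_len=20):
        # step_fn(tokens) returns log-probabilities over the vocabulary
        # for the next word, given the partial caption `tokens`.
        beams = [([start_token], 0.0)]
        for _ in range(max_len):
            candidates = []
            for tokens, score in beams:
                if tokens[-1] == end_token:      # finished caption: keep as-is
                    candidates.append((tokens, score))
                    continue
                log_probs = step_fn(tokens)
                for w in np.argsort(log_probs)[-beam_size:]:   # top-k next words
                    candidates.append((tokens + [int(w)], score + float(log_probs[w])))
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
            if all(t[-1] == end_token for t, _ in beams):
                break
        return beams[0][0]                       # highest-scoring caption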

To generate captions for your own test images, run a command like this:

python main.py --phase=test \
    --model_file='./models/xxxxxx.npy' \
    --beam_size=3

The generated captions will be saved in the folder test/results.

Results

A pretrained model with the default configuration can be downloaded here. This model was trained solely on the COCO train2014 data. It achieves the following BLEU scores on the COCO val2014 data (beam size = 3):

Here are some example captions generated by this model.
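BLEU measures n-gram overlap between a generated caption and one or more reference captions. The COCO evaluation likely uses the coco-caption toolkit; the snippet below merely illustrates the metric itself using NLTK, with toy data:

    from nltk.translate.bleu_score import sentence_bleu

    # Illustrative BLEU-2 computation for a single caption (toy example).
    reference = [['a', 'man', 'riding', 'a', 'horse']]
    candidate = ['a', 'man', 'rides', 'a', 'horse']
    print(sentence_bleu(reference, candidate, weights=(0.5, 0.5)))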

References

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio. ICML 2015.