where-image2

Code used by the paper "Where to put the Image in an Image Caption Generator", which was accepted for the Natural Language Engineering Special Issue on Language for Images. The code for the previous version of the paper (prior to review) is still available at https://github.com/mtanti/where-image.

The paper is a comparative evaluation of different ways of incorporating an image into a neural image caption generator.

Works on both Python 2 and Python 3 (except for the MSCOCO Evaluation toolkit, which requires Python 2).

Dependencies

Python dependencies (install all with pip):

Before running

  1. Download Karpathy's Flickr8k, Flickr30k, and MSCOCO datasets (including image features).
  2. Download the MSCOCO Evaluation toolkit.
  3. Open config.py.
  4. Set debug to True or False (True is used to run a quick test).
  5. Set raw_data_dir so that it returns the directory of the Karpathy dataset given by dataset_name ('flickr8k', 'flickr30k', or 'mscoco').
  6. Set mscoco_dir to the directory of the MSCOCO Evaluation toolkit (a rough sketch of these config.py edits is given after this list).
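
As a rough illustration of steps 3-6, a config.py set up for a full (non-debug) run might look like the sketch below. The variable names debug, raw_data_dir, and mscoco_dir come from the steps above, but their exact form (in particular whether raw_data_dir is a function of dataset_name or a plain string) and all paths are assumptions, so check the real file.

```python
# config.py -- hypothetical sketch only; the real file defines the exact names and structure.

# True runs a quick test on a small amount of data; False runs the full experiments.
debug = False

# Directory of Karpathy's preprocessed datasets (including image features).
# Assumed here to be a function of dataset_name ('flickr8k', 'flickr30k', or 'mscoco').
def raw_data_dir(dataset_name):
    return '/data/karpathy/' + dataset_name  # placeholder path

# Directory of the MSCOCO Evaluation toolkit (the toolkit itself requires Python 2).
mscoco_dir = '/tools/coco-caption'  # placeholder path
```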

File descriptions

| File name | Description |
|---|---|
| results | Folder storing the results of each architecture's evaluation. Generated captions are in results/*/generated_captions.txt; each caption corresponds to the image file name on the corresponding line of results/imgs_*.txt. results/*/retrieved_images.txt contains a matrix with one row per caption and one column per image, giving the log probability of each caption/image pair (see the sketch after this table). The multimodal_diffs_results_*.txt files are described below. |
| hyperparams | Folder storing the results of the hyperparameter tuning. |
| hyperparam_phase1.py (main) | Evaluates different hyperparameters on each architecture. Saves results in hyperparams. Delete hyperparams/completed.txt before running. |
| hyperparam_phase2.py (main) | Fine-tunes the best hyperparameters found by hyperparam_phase1.py. Saves results in hyperparams. Delete hyperparams/completed2.txt before running. |
| experiment.py (main) | Runs the actual experiments that evaluate each architecture. Saves results in results. Delete results/results.txt before running. If you have tuned hyperparameters with hyperparam_phase1.py or hyperparam_phase2.py, first copy the hyperparameters from hyperparams/result_*.txt into config.py. |
| multimodal_vector_diffs.py (main) | Measures each architecture's ability to retain the image information as the caption is generated. Saves results in results/multimodal_diffs_results_*.txt, where '*' is the caption length used. |
| config.py (library) | Configuration file containing hyperparameters, directories, and other settings. |
| data.py (library) | Functions and classes for handling the datasets. |
| helper_datasources.py (library) | Functions and classes that simplify loading datasets. |
| lib.py (library) | General helper functions and classes. |
| model_base.py (library) | Superclass for neural caption generation models that handles general functionality such as beam search and sentence probability. Created to facilitate other model instantiations such as ensembles. |
| model_idealmock.py (library) | Neural caption generator that simply memorises the test set and reproduces it (called 'human' in the results). Used as a test for the generation and retrieval algorithms and as a ceiling for the caption diversity measures. |
| model_normal.py (library) | Neural caption generator with the actual architectures being tested. |
| results.xlsx (processed data) | MS Excel spreadsheet with the results of experiment.py. |
| results_memory.xlsx (processed data) | MS Excel spreadsheet with the results of multimodal_vector_diffs.py. |
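
As a rough illustration of the retrieved_images.txt matrix described in the results row above, the sketch below loads one such matrix and reads off a single caption/image log probability. The assumption that the file is a plain whitespace-separated numeric matrix, and the 'merge' path, are mine; check the actual files produced by experiment.py.

```python
import numpy as np

# Hypothetical sketch: load the caption/image log-probability matrix for one architecture.
# Assumes a plain whitespace-separated text matrix with one row per caption and one
# column per image ('merge' is just an example architecture directory).
logprobs = np.loadtxt('results/merge/retrieved_images.txt')

# Log probability that caption 0 describes image 3; row order follows the generated
# captions and column order follows results/imgs_*.txt.
print(logprobs.shape)    # (num_captions, num_images)
print(logprobs[0, 3])
```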

Results

Descriptions of each column in results/results.txt and results.xlsx. 'Generated captions' refers to captions that were generated for test set images.

| Column name | Description |
|---|---|
| dataset_name | The dataset used (Flickr8k, Flickr30k, or MSCOCO). |
| architecture | The architecture being tested (init, pre, par, merge, or human). |
| run | Each architecture is trained 3 separate times and the results are averaged. This column gives the run number (1, 2, or 3). |
| vocab_size | The number of distinct word types in the vocabulary (all words in the training set that occur at least 5 times). |
| num_training_caps | The number of captions in the training set. |
| mean_training_caps_len | The mean caption length in the training set. |
| num_params | The number of parameters (weights and biases) in the architecture. |
| geomean_pplx | The geometric mean of the perplexities of all test set captions (given the image). |
| num_inf_pplx | The number of test set captions with infinite perplexity, which were excluded from the geometric mean (this happens when at least one word has a probability of 0). |
| vocab_used | The number of vocabulary words used in generating all the captions. |
| vocab_used_frac | The fraction of vocabulary words used in generating all the captions. |
| mean_cap_len | The mean length of the generated captions. |
| num_existing_caps | The number of generated captions that occur verbatim in the training set. |
| num_existing_caps_frac | The fraction of generated captions that occur verbatim in the training set. |
| existing_caps_CIDEr | The CIDEr score of the generated captions that occur verbatim in the training set (used to check whether parroted captions were at least correct). |
| unigram_entropy | The entropy of the unigram (word) frequencies in the generated captions. |
| bigram_entropy | The entropy of the bigram (adjacent word pair) frequencies in the generated captions. |
| CIDEr | The CIDEr score of the generated captions (computed by the MSCOCO Evaluation toolkit). |
| METEOR | The METEOR score of the generated captions (computed by the MSCOCO Evaluation toolkit). |
| ROUGE_L | The ROUGE-L score of the generated captions (computed by the MSCOCO Evaluation toolkit). |
| Bleu_1 | The BLEU-1 score of the generated captions (computed by the MSCOCO Evaluation toolkit). |
| Bleu_2 | The BLEU-2 score of the generated captions (computed by the MSCOCO Evaluation toolkit). |
| Bleu_3 | The BLEU-3 score of the generated captions (computed by the MSCOCO Evaluation toolkit). |
| Bleu_4 | The BLEU-4 score of the generated captions (computed by the MSCOCO Evaluation toolkit). |
| R@1 | The number of correct images that were ranked the most relevant to their corresponding caption among all images (the sketch after this table illustrates how the retrieval columns can be computed). |
| R@5 | The number of correct images that were among the top 5 most relevant to their corresponding caption among all images. |
| R@10 | The number of correct images that were among the top 10 most relevant to their corresponding caption among all images. |
| median_rank | The median rank of the correct images when all images are sorted by relevance to the corresponding caption. |
| R@1_frac | The fraction of correct images that were ranked the most relevant to their corresponding caption among all images. |
| R@5_frac | The fraction of correct images that were among the top 5 most relevant to their corresponding caption among all images. |
| R@10_frac | The fraction of correct images that were among the top 10 most relevant to their corresponding caption among all images. |
| median_rank_frac | The median rank of the correct images when all images are sorted by relevance to the corresponding caption, divided by the number of images. |
| num_epochs | The number of epochs needed to train the model before the perplexity on the validation set started to degrade. |
| training_time | The number of seconds needed to train the model. |
| total_time | The number of seconds needed to train and evaluate the model. |
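
As a minimal sketch of how the retrieval columns above relate to the retrieved_images.txt log-probability matrix, the function below computes R@1, R@5, R@10, the median rank, and their fractional variants. The caption-to-correct-image mapping (correct_image) is passed in explicitly because the repository's own alignment convention (via results/imgs_*.txt) is only implied above; this is an illustration of the definitions, not the repository's evaluation code.

```python
import numpy as np

def retrieval_metrics(logprobs, correct_image):
    """Compute R@1, R@5, R@10 and median rank from a (num_captions, num_images)
    log-probability matrix. correct_image[i] is the column index of the image that
    caption i actually describes (a hypothetical argument; in the repository the
    alignment is implied by the ordering of results/imgs_*.txt)."""
    num_captions, num_images = logprobs.shape
    # For each caption, sort images by decreasing relevance (higher log probability first).
    order = np.argsort(-logprobs, axis=1)
    # 1-based rank of the correct image for each caption.
    ranks = np.empty(num_captions, dtype=int)
    for i in range(num_captions):
        ranks[i] = np.where(order[i] == correct_image[i])[0][0] + 1
    return {
        'R@1':  int(np.sum(ranks <= 1)),
        'R@5':  int(np.sum(ranks <= 5)),
        'R@10': int(np.sum(ranks <= 10)),
        'median_rank': float(np.median(ranks)),
        'R@1_frac':  float(np.mean(ranks <= 1)),
        'R@5_frac':  float(np.mean(ranks <= 5)),
        'R@10_frac': float(np.mean(ranks <= 10)),
        'median_rank_frac': float(np.median(ranks)) / num_images,
    }
```

Called as retrieval_metrics(logprobs, correct_image) on a matrix loaded from retrieved_images.txt, this yields the counts, fractions, and median rank described in the table, provided the caption-to-image alignment is correct.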