This repository contains the code for end-to-end gLSTM (e2e-gLSTM) and sentence-conditional semantic attention, as described in the paper "Watch What You Just Said: Image Captioning with Text-Conditional Attention". `train_new.lua` is the main file for e2e-gLSTM, and `train_sc.lua` is the main file for sentence-conditional semantic attention. Example commands:
For e2e-gLSTM:

```bash
th train_new.lua -cnn_model_resnet /path/to/your/resnet-200-model -language_eval 1 -finetune_cnn_after 100000 -max_iters 600000 -cnn_weight_decay 0.001 -cnn_learning_rate 0.00001 -learning_rate_decay_every 100000 -learning_rate_decay_start 100000
```

For sentence-conditional semantic attention:

```bash
th train_sc.lua -start_from /path/to/your/e2eglstm-checkpoint -language_eval 1 -language_model 'misc_tc.LanguageModel_sc' -max_iters 200000
```
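For reference, in NeuralTalk2 (on which this code is based), the two decay flags implement a smooth exponential schedule: past `-learning_rate_decay_start`, the learning rate is halved once per `-learning_rate_decay_every` iterations. A minimal sketch of that logic, using illustrative names like `opt` and `iter` rather than this repository's exact code:

```lua
-- Sketch of the NeuralTalk2-style schedule implied by the flags above.
-- `opt` and `iter` are illustrative, not necessarily this repo's variables.
local function current_learning_rate(opt, iter)
  local lr = opt.learning_rate
  if opt.learning_rate_decay_start >= 0 and iter > opt.learning_rate_decay_start then
    -- fraction of decay intervals elapsed since decay began
    local frac = (iter - opt.learning_rate_decay_start) / opt.learning_rate_decay_every
    lr = lr * math.pow(0.5, frac) -- halves once per full decay interval
  end
  return lr
end
```

With the settings above (start and interval both 100000), the rate is halved by 200k iterations, quartered by 300k, and so on.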
Note that if you transfer weights from VGG-16 or ResNet-34 instead, smaller `-max_iters` values may suffice. Results are shown in the table below.
Method | BLEU-4 | METEOR | CIDEr
---|---|---|---
sc-vgg-16 | 30.1 | 24.7 | 97.0
sc-resnet-34 | 30.6 | 25.0 | 98.1
sc-resnet-200 | 31.6 | 25.6 | 101.2
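The numbers above are MS COCO-style language evaluation scores (enabled by `-language_eval 1` during training). To score a saved checkpoint separately, NeuralTalk2 provides an `eval.lua` script; the command below is a sketch assuming this repository keeps that interface, with a placeholder checkpoint path:

```bash
th eval.lua -model /path/to/your/checkpoint.t7 -num_images 5000 -language_eval 1
```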
The implementation is based on NeuralTalk2; please follow the NeuralTalk2 instructions to set up and run the code. Contact me if you have any trouble running it. Please cite the following paper if you use the code:
```
@article{zhou2016image,
  title={Image Caption Generation with Text-Conditional Semantic Attention},
  author={Zhou, Luowei and Xu, Chenliang and Koch, Parker and Corso, Jason J},
  journal={arXiv preprint arXiv:1606.04621},
  year={2016}
}
```