Fluency-Guided Cross-Lingual Image Captioning
Introduction
This is the code for the paper: Weiyu Lan, Xirong Li, Jianfeng Dong, Fluency-Guided Cross-Lingual Image Captioning, ACM MM 2017.
In this paper, we present an approach to cross-lingual image captioning that relies on machine translation. A fluency-guided learning framework is proposed to deal with the lack of fluency in machine-translated sentences. This repository provides data and code for training a Chinese captioning model that generates fluent and relevant Chinese captions for a given image. Given machine-translated captions in another language and their estimated fluency scores, you can also train a fluency-guided captioning model for that new target language.
Requirements
Install Required Packages
First ensure that you have installed the following required packages:
- TensorFlow 1.0 or greater
Prepare the Data
Run download_cn_data.sh to get the text data and the ResNet-152 features extracted on Flickr8k and Flickr30k (~296M in total).
Text data includes machine-translated Chinese captions, estimated fluency scores,
and human-translated captions on test sets for evaluation.
Because Chinese sentences are written without explicit word delimiters, word segmentation is performed with boson to tokenize each sentence into a sequence of Chinese words.
The extracted data is placed in $HOME/VisualSearch/.
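To make the word segmentation step concrete, here is a toy sketch of what tokenizing an undelimited Chinese sentence involves. It uses greedy forward maximum matching over a small hypothetical lexicon. This is an illustration only and is not how boson works; the function name `segment`, the `max_word_len` parameter, and the lexicon are assumptions made for this example.

```python
def segment(sentence, lexicon, max_word_len=4):
    """Greedy forward maximum matching (illustrative, not the boson segmenter).

    At each position, take the longest lexicon word that matches;
    fall back to a single character when nothing matches.
    """
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy lexicon, for illustration only.
lexicon = {"一只", "小狗", "草地", "奔跑"}
print(segment("一只小狗在草地上奔跑", lexicon))
# ['一只', '小狗', '在', '草地', '上', '奔跑']
```

The data downloaded by the script is already segmented, so this step is not needed to reproduce the results.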
Training and Evaluating a Model
Run the script:
cd doit
bash do-all.sh
Running the script will do the following things:
- Generate a dictionary on the training set, keeping words that occur >= 5 times
- Train the fluency-guided cross-lingual image captioning model using rejection sampling and dump the model checkpoints
- Run evaluation on the validation set and log loss information of the checkpoints
- Generate captions on the test set using the checkpoint that performs best on the validation set and evaluate the performance
The trained model and the evaluation results can be found in $HOME/VisualSearch/$collection/
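The dictionary step of the script can be sketched as follows. This is a minimal illustration, assuming the training captions are already segmented into word lists; the function name `build_vocab` and the `<unk>` placeholder for out-of-vocabulary words are hypothetical and not taken from the repository.

```python
from collections import Counter


def build_vocab(tokenized_captions, min_count=5):
    """Map each word that occurs at least min_count times to an integer id."""
    counts = Counter(word for caption in tokenized_captions for word in caption)
    # Sort the kept words so the id assignment is reproducible.
    kept = sorted(word for word, count in counts.items() if count >= min_count)
    # Id 0 is reserved for a hypothetical out-of-vocabulary token.
    word_to_id = {"<unk>": 0}
    for word in kept:
        word_to_id[word] = len(word_to_id)
    return word_to_id


# Toy training set: "小狗" and "奔跑" occur 5 times, "草地" only 4 times.
captions = [["小狗", "奔跑"]] * 5 + [["草地"]] * 4
vocab = build_vocab(captions)
print(sorted(vocab))
```

Words below the threshold are dropped from the dictionary, so in this sketch they would be mapped to the `<unk>` id at encoding time.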
Expected Performance
The expected performance of different fluency-guided approaches on Flickr8k-cn is as follows:
Approach | BLEU4 | ROUGE_L | CIDEr |
---|---|---|---|
Without fluency | 24.1 | 45.9 | 47.6 |
Fluency-only | 20.7 | 41.1 | 35.2 |
Rejection sampling | 23.9 | 45.3 | 46.6 |
Weighted loss | 24.0 | 45.0 | 46.3 |
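The "Rejection sampling" and "Weighted loss" rows are two ways of using the estimated fluency scores during training. The sketch below shows one plausible reading of each, assuming every machine-translated caption carries a fluency score in [0, 1]. The function names, the acceptance rule, and the averaging are assumptions made for illustration and may differ from the paper and from the code in this repository.

```python
import random


def rejection_sample(examples, rng):
    """Keep each (caption, fluency_score) pair with probability fluency_score.

    Assumed acceptance rule: draw u uniformly from [0, 1) and keep the
    caption when u < fluency_score.
    """
    return [caption for caption, score in examples if rng.random() < score]


def weighted_loss(losses, scores):
    """Average the per-caption losses after scaling each by its fluency score."""
    weighted = [loss * score for loss, score in zip(losses, scores)]
    return sum(weighted) / len(weighted)


rng = random.Random(0)
batch = [("fluent caption", 1.0), ("broken caption", 0.0)]
print(rejection_sample(batch, rng))        # ['fluent caption']
print(weighted_loss([2.0, 4.0], [1.0, 0.5]))  # 2.0
```

In this reading, rejection sampling changes which captions the model sees at each step, whereas the weighted loss keeps every caption but lets less fluent ones contribute less to the gradient.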