Aligning Linguistic Words and Visual Semantic Units for Image Captioning

Introduction

The VSUA model represents an image as a structured graph whose nodes are the so-called Visual Semantic Units (VSUs): object, attribute, and relationship units. The model exploits the natural alignment between caption words and these VSUs.
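As a rough illustration (not the authors' code), the three kinds of VSUs for an image can be thought of as a small typed graph; the names and structure below are hypothetical:

```python
# Illustrative sketch only (hypothetical names, not the VSUA implementation):
# an image is represented by object, attribute, and relationship units.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class VSUGraph:
    objects: List[str] = field(default_factory=list)                         # object units, e.g. "dog"
    attributes: List[Tuple[int, str]] = field(default_factory=list)          # (object index, attribute)
    relationships: List[Tuple[int, int, str]] = field(default_factory=list)  # (subj idx, obj idx, predicate)

g = VSUGraph(
    objects=["dog", "frisbee"],
    attributes=[(0, "brown")],
    relationships=[(0, 1, "catching")],
)
# Caption words like "brown", "dog", "catching", and "frisbee" each align with one unit.
```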

<p align="center"> <img src="vsua.jpg" width="60%" title="introduction image"> </p>

Citation

If you find this code useful in your research, please cite:

@inproceedings{guo2019vsua,
  title={Aligning Linguistic Words and Visual Semantic Units for Image Captioning},
  author={Longteng Guo and Jing Liu and Jinhui Tang and Jiangwei Li and Wei Luo and Hanqing Lu},
  booktitle={ACM MM},
  year={2019}
}

Requirements

Clone this repository together with all of its submodules:

git clone --recursive https://github.com/ltguo19/VSUA-Captioning.git

Prepare Data

For more details and other datasets, see ruotianluo/self-critical.pytorch.

1. Download COCO captions and preprocess them

Download the preprocessed COCO captions from Karpathy's homepage. Extract dataset_coco.json from the zip file and copy it into data/. This file provides the preprocessed captions and the standard train/val/test splits.
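If you want to sanity-check the file, the sketch below counts images per split; it assumes the usual Karpathy-split layout, i.e. an "images" list whose entries carry a "split" field:

```python
# Quick inspection of dataset_coco.json (assumes the standard Karpathy-split layout).
import json
from collections import Counter

with open("data/dataset_coco.json") as f:
    coco = json.load(f)

# Each entry in coco["images"] is expected to have a "split" field
# (train / restval / val / test) and a list of tokenized "sentences".
print(Counter(img["split"] for img in coco["images"]))
```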

Then do:

python scripts/prepro_labels.py --input_json data/dataset_coco.json --output_json data/cocotalk.json --output_h5 data/cocotalk

prepro_labels.py maps all words that occur 5 times or fewer to a special UNK token and builds a vocabulary from the remaining words. The image information and the vocabulary are dumped into data/cocotalk.json, and the discretized caption data are dumped into data/cocotalk_label.h5.
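Conceptually, the vocabulary step works like the minimal sketch below (illustrative only; the real script also writes the HDF5 label arrays and handles sequence length):

```python
# Minimal sketch of the rare-word thresholding done by prepro_labels.py (illustrative).
from collections import Counter

def build_vocab(tokenized_captions, min_count=5):
    counts = Counter(w for caption in tokenized_captions for w in caption)
    # keep words that occur more than `min_count` times; the rest become UNK
    return {w for w, c in counts.items() if c > min_count}

def encode(caption, vocab):
    return [w if w in vocab else "UNK" for w in caption]

vocab = build_vocab([["a", "dog", "catches", "a", "frisbee"]] * 6)
print(encode(["a", "dog", "rides", "a", "bike"], vocab))  # rare words map to UNK
```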

2. Download Bottom-Up features

We use pre-extracted bottom-up image features from peteanderson80/bottom-up-attention (we use the adaptive features in our experiments). Download them, for example:

mkdir data/bu_data; cd data/bu_data
wget https://storage.googleapis.com/bottom-up-attention/trainval.zip
unzip trainval.zip

Then:

python scripts/make_bu_data.py --output_dir data/cocobu

This will create data/cocobu_fc, data/cocobu_att and data/cocobu_box.
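A quick way to check the result is to load the features for one image id. The sketch below assumes the file layout written by make_bu_data.py (per-image .npz files with a "feat" key in cocobu_att, plain .npy files in cocobu_fc and cocobu_box); adjust it to your setup if the layout differs:

```python
# Sanity check for one (hypothetical) image id; adjust the id and keys to your setup.
import numpy as np

image_id = "391895"  # hypothetical COCO image id
att = np.load("data/cocobu_att/{}.npz".format(image_id))["feat"]  # (num_boxes, feat_dim) region features
fc  = np.load("data/cocobu_fc/{}.npy".format(image_id))           # (feat_dim,) pooled feature
box = np.load("data/cocobu_box/{}.npy".format(image_id))          # (num_boxes, 4) box coordinates
print(att.shape, fc.shape, box.shape)
```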

3. Download image scene graph data

We use the scene graph data from yangxuntu/SGAE. Download coco_img_sg.zip and coco_pred_sg_rela.npy from this link, put them into the data folder, and unzip coco_img_sg.zip. coco_img_sg contains per-image scene graph data, including object and attribute labels for each box in the adaptive bottom-up features, as well as the semantic relationship labels between boxes. coco_pred_sg_rela.npy contains the vocabularies for the object, attribute, and relation labels.
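To get a feel for these files, you can load them with NumPy. The sketch below only assumes that both files store pickled Python dictionaries; the exact key names are not listed here and should be checked on your copy:

```python
# Hedged inspection sketch: both files are assumed to be pickled dicts saved with NumPy.
import numpy as np

# vocabularies for object / attribute / relation labels
vocabs = np.load("data/coco_pred_sg_rela.npy", allow_pickle=True, encoding="latin1").item()
print(sorted(vocabs.keys()))

# per-image scene graph (hypothetical image id; files live under data/coco_img_sg/ after unzipping)
sg = np.load("data/coco_img_sg/391895.npy", allow_pickle=True, encoding="latin1").item()
print(sorted(sg.keys()))
```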

4. Extract geometry relationship data

Download the file vsua_box_info.pkl from this link, which contains the size of each box and the width/height of each image. Then run:

python scripts/cal_geometry_feats.py
python scripts/build_geometry_graph.py

to extract the geometry relation features and build the geometry graph. This will create data/geometry_feats-undirected.pkl and data/geometry-iou0.2-dist0.5-undirected.
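The folder name encodes the thresholds used to connect boxes (IoU 0.2 and a relative center distance of 0.5). As a rough, hypothetical illustration of such a rule (the authoritative definition is in scripts/build_geometry_graph.py):

```python
# Illustrative sketch of the connectivity suggested by the name geometry-iou0.2-dist0.5-undirected:
# two boxes are linked if they overlap enough (IoU >= 0.2) or their centers are close
# relative to the image size (dist <= 0.5). The exact rule is defined in the repo scripts.

def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def center_dist(a, b, img_w, img_h):
    ax, ay = (a[0] + a[2]) / 2.0, (a[1] + a[3]) / 2.0
    bx, by = (b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0
    return (((ax - bx) / img_w) ** 2 + ((ay - by) / img_h) ** 2) ** 0.5

def connected(a, b, img_w, img_h, iou_thresh=0.2, dist_thresh=0.5):
    return iou(a, b) >= iou_thresh or center_dist(a, b, img_w, img_h) <= dist_thresh
```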

Overall, the data folder should contain these files/folders:

cocotalk.json                        # additional information about images and vocab
cocotalk_label.h5                    # captions
coco-train-idxs.p                    # cached token file for CIDEr
cocobu_att                           # bottom-up region features
cocobu_fc                            # bottom-up average (pooled) features
coco_img_sg                          # scene graph data
coco_pred_sg_rela.npy                # scene graph vocabularies
vsua_box_info.pkl                    # box sizes and image width/height
geometry-iou0.2-dist0.5-undirected   # geometry graph data
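Before training, you can run a small optional check (not part of the repo) that everything listed above is in place:

```python
# Optional sanity check: verify the expected files/folders exist under data/.
import os

expected = [
    "data/cocotalk.json", "data/cocotalk_label.h5", "data/coco-train-idxs.p",
    "data/cocobu_att", "data/cocobu_fc", "data/coco_img_sg",
    "data/coco_pred_sg_rela.npy", "data/vsua_box_info.pkl",
    "data/geometry-iou0.2-dist0.5-undirected",
]
missing = [p for p in expected if not os.path.exists(p)]
print("missing:", missing if missing else "none")
```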

Training

1. Cross-entropy loss

python train.py --gpus 0 --id experiment-xe --geometry_relation True

The training script dumps checkpoints into the folder specified by --checkpoint_root and --id.

2. Reinforcement learning with CIDEr reward

python train.py --gpus 0 --id experiment-rl --geometry_relation True --learning_rate 5e-5 --resume_from experiment-xe --resume_from_best True --self_critical_after 0 --max_epochs 50
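For reference, the self-critical objective rewards a sampled caption by how much its CIDEr score beats that of the greedy-decoded baseline. The sketch below is only a conceptual illustration of that loss (hypothetical tensor names; the actual implementation lives in the training code):

```python
# Conceptual sketch of the self-critical (SCST) objective used in this stage.
# reward = CIDEr(sampled caption) - CIDEr(greedy baseline caption)
# loss   = - reward * sum(log p(sampled word))
import torch

def self_critical_loss(sample_logprobs, sample_cider, greedy_cider, mask):
    # sample_logprobs: (batch, seq_len) log-probs of the sampled words
    # sample_cider, greedy_cider: (batch,) CIDEr scores of sampled / greedy captions
    # mask: (batch, seq_len) 1 for real words, 0 for padding
    reward = (sample_cider - greedy_cider).unsqueeze(1)       # (batch, 1) baseline-subtracted reward
    loss = -(reward * sample_logprobs * mask).sum() / mask.sum()
    return loss
```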

Acknowledgement

This code is modified from Ruotian Luo's brilliant image captioning repo ruotianluo/self-critical.pytorch. We use the visual features provided by peteanderson80/bottom-up-attention and the scene graph data provided by yangxuntu/SGAE. Thanks for their work! If you find this code helpful, please consider citing their corresponding papers as well as ours.