COMIC: Towards a Compact Image Captioning Model with Attention

Updated on 25 Feb 2021: Object Relation Transformer with Radix encoding, which achieves a CIDEr score of 1.291 after SCST training. Code at this repo.

Updated on 12 June 2019: Self-Critical Sequence Training (SCST)

Updated on 06 June 2019: Pre-trained model repo

Released on 03 June 2019.

Description

This is the code repo of our TMM 2019 work titled "COMIC: Towards A Compact Image Captioning Model with Attention". In this paper, we tackle the hitherto unexplored problem of compactness in image captioning models. We show competitive results on both the MS-COCO and InstaPIC-1.1M datasets despite using an embedding vocabulary that is 39x to 99x smaller.

<img src="TMM.png" height="200">

Some pre-trained model checkpoints are available at this repo.

Visualisation

You can explore and visualise generated captions using this Streamlit app.

Citation

If you find this repository useful for your research or work, please cite:

@article{tan2019comic,
  title={COMIC: Towards A Compact Image Captioning Model with Attention},
  author={Tan, Jia Huei and Chan, Chee Seng and Chuah, Joon Huang},
  journal={IEEE Transactions on Multimedia},
  year={2019},
  volume={21},
  number={10},
  pages={2686-2696},
  publisher={IEEE}
}

Dependencies

Installing Java 8 on Ubuntu

  1. Download the required tar.gz files from Oracle.
  2. Follow the instructions in this repo (a rough installation sketch is also given below).
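
The sketch below is a minimal example of installing a downloaded Oracle JDK 8 archive; the archive name jdk-8u202-linux-x64.tar.gz, the JDK version, and the install prefix /usr/lib/jvm are assumptions, so adjust them to match the files you actually downloaded.

# Assumed archive name and version; replace with your downloaded file
sudo mkdir -p /usr/lib/jvm
sudo tar -xzf jdk-8u202-linux-x64.tar.gz -C /usr/lib/jvm
# Register the new JDK with update-alternatives
sudo update-alternatives --install /usr/bin/java java /usr/lib/jvm/jdk1.8.0_202/bin/java 1
sudo update-alternatives --install /usr/bin/javac javac /usr/lib/jvm/jdk1.8.0_202/bin/javac 1
# Verify the installation; this should report version 1.8.0_202
java -version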

Running the code

More examples are given in example.sh.

First setup

Run ./src/setup.sh. This will download the required Stanford models and run all the dataset pre-processing scripts.

Training models

The training scheme is as follows:

  1. Start with decoder mode (freezing the CNN)
  2. Followed by cnn_finetune mode
  3. Finally, scst mode

COMIC-256

# MS-COCO
for mode in 'decoder' 'cnn_finetune' 'scst'
do
    python train.py  \
        --train_mode ${mode}
done

# InstaPIC
for mode in 'decoder' 'cnn_finetune' 'scst'
do
    python train.py  \
        --train_mode ${mode}  \
        --dataset_file_pattern 'insta_{}_v25595_s15'  \
        --batch_size_eval 50
done

Baseline

# MS-COCO
for mode in 'decoder' 'cnn_finetune' 'scst'
do
    python train.py  \
        --train_mode ${mode}  \
        --token_type 'word'  \
        --cnn_fm_projection 'none'  \
        --attn_num_heads 1
done

# InstaPIC
for mode in 'decoder' 'cnn_finetune' 'scst'
do
    python train.py  \
        --train_mode ${mode}  \
        --dataset_file_pattern 'insta_{}_v25595_s15'  \
        --token_type 'word'  \
        --cnn_fm_projection 'none'  \
        --attn_num_heads 1  \
        --batch_size_eval 50
done

Inferencing

Point infer.py at the directory containing the checkpoints; the model configuration is loaded from the config.pkl file in that directory. A sketch for inspecting this file is given after the examples below.

# MS-COCO
python infer.py  \
	--infer_checkpoints_dir 'mscoco/word_add_softmax_h8_tie_lstm_run_01'

# InstaPIC
python infer.py  \
	--infer_checkpoints_dir 'insta/word_add_softmax_h8_ind_lstm_run_01'  \
	--dataset_dir '/path/to/insta/dataset'  \
	--annotations_file 'insta_testval_clean.json'
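
If you want to check which settings a checkpoint was trained with, you can inspect its configuration directly. This is a hedged one-liner that assumes config.pkl is a standard Python pickle; the checkpoint path is only an example, and the exact object stored inside may vary:

# Pretty-print the contents of a checkpoint's config.pkl
python -c "import pickle, pprint; pprint.pprint(pickle.load(open('mscoco/word_add_softmax_h8_tie_lstm_run_01/config.pkl', 'rb')))"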

Project structure

.
+-- src
|   +-- {main scripts}
+-- common
|   +-- {shared libraries and utility functions}
+-- datasets
|   +-- preprocessing
|   |   +-- {dataset pre-processing scripts}
|   +-- {dataset folders created by pre-processing scripts, e.g. 'mscoco'}
+-- pretrained
    +-- {pre-trained checkpoints for some COMIC models. Details are provided in a separate README.}

Avoid re-downloading datasets

Re-downloading can be avoided by:

  1. Editing setup.sh, and
  2. Providing the pre-processing scripts with the path to the directory containing the existing dataset files:
python coco_prepro.py --dataset_dir /path/to/coco/dataset
python insta_prepro.py --dataset_dir /path/to/insta/dataset

In the same way, both train.py and infer.py accept alternative dataset paths.

python train.py --dataset_dir /path/to/dataset
python infer.py --dataset_dir /path/to/dataset

This code assumes the following dataset directory structures:

MS-COCO

{coco-folder}
+-- captions
|   +-- {folder and files generated by coco_prepro.py}
+-- test2014
|   +-- {image files}
+-- train2014
|   +-- {image files}
+-- val2014
    +-- {image files}

InstaPIC-1.1M

{insta-folder}
+-- captions
|   +-- {folder and files generated by insta_prepro.py}
+-- images
|   +-- {image files}
+-- json
    +-- insta-caption-test1.json
    +-- insta-caption-train.json

Differences compared to our TMM paper

To match the settings described in our paper, set the legacy argument of train.py to True (the default is False). This will override some of the other provided arguments.
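
As a hedged example, assuming the argument is exposed as a --legacy flag (the exact name and boolean syntax depend on how train.py defines its arguments, so check python train.py --help):

# Hypothetical invocation: train the decoder with the paper's legacy settings
python train.py  \
    --train_mode 'decoder'  \
    --legacy True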

When using the default arguments, the differences compared to our TMM paper are:

Changes that can be enabled:

Performance on MS-COCO

Inception-V1 and LSTM:

Default mode          Decoder params.   BLEU-1   BLEU-4   CIDEr   SPICE
Baseline              12.7 M            0.716    0.311    0.937   0.174
COMIC-256             4.3 M             0.713    0.308    0.944   0.176
  (+ CNN fine-tune)                     0.729    0.328    1.001   0.185
  (+ SCST ^)                            0.753    0.344    1.050   0.190

^ SCST using the beam search sampling strategy described in this paper.

Legacy mode   Decoder params.   BLEU-1    BLEU-4    CIDEr     SPICE
Baseline      12.2 M            0.707     0.300     0.906     0.169
                                (0.701)   (0.296)   (0.885)   (0.167)
COMIC-256     4.0 M             0.711     0.302     0.913     0.170
                                (0.706)   (0.292)   (0.881)   (0.164)

Scores in brackets () are the figures reported in our TMM paper; the differences are due to reimplementation from the original TF 1.2 code.

Checkpoints for the models listed here can be downloaded from the pre-trained model repo linked above.

Object Relation Transformer:

Default mode         Decoder params.   BLEU-1   BLEU-4   CIDEr   SPICE
Baseline             55.44 M           0.756    0.348    1.135   0.213
Radix                45.98 M           0.756    0.349    1.135   0.209
  + SCST, beam = 5                     0.803    0.390    1.291   0.213

Transformer code at this repo.

Main arguments

train.py

infer.py

Microsoft COCO Caption Evaluation

This code uses the standard coco-caption code with the SPICE metric [link to repo].

To perform online server evaluation (a command-line sketch follows these steps):

  1. Infer on coco_test (test2014), rename the JSON output file to captions_test2014__results.json.
  2. Infer on coco_valid (val2014), rename the JSON output file to captions_val2014__results.json.
  3. Zip the files and submit.
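
A minimal sketch of the rename-and-zip steps, with placeholder paths standing in for whichever JSON files infer.py actually produced:

# Rename the inference outputs to the file names expected by the evaluation server
cp /path/to/test2014/inference/output.json captions_test2014__results.json
cp /path/to/val2014/inference/output.json captions_val2014__results.json
# Bundle both files for submission
zip results.zip captions_test2014__results.json captions_val2014__results.json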

Acknowledgements

Thanks to the developers of:

Feedback

Suggestions and opinions (both positive and negative) are most welcome. Please contact the authors by sending an email to tan.jia.huei at gmail.com or cs.chan at um.edu.my.

License and Copyright

This project is open source under the BSD-3-Clause license (see the LICENSE file).

© 2019 Center of Image and Signal Processing, Faculty of Computer Science and Information Technology, University of Malaya.