Explicit Image Caption Editing

This repository contains the datasets and reference code for the paper Explicit Image Caption Editing, accepted to ECCV 2022. Refer to our full paper for detailed instructions and analysis.

Overview

The Explicit Caption Editing (ECE) task is defined as follows. Given an image and a reference caption (Ref-Cap), ECE models aim to explicitly predict a sequence of edit operations (e.g., KEEP/DELETE/ADD) on the Ref-Cap, which translate the Ref-Cap into a caption close to the ground-truth caption (GT-Cap). Typically, the Ref-Cap is slightly misaligned with the image.
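As an illustrative sketch only (not the paper's actual decoding code), applying a predicted edit sequence to a tokenized Ref-Cap can be written as follows; the helper name `apply_edits` and the `(operation, token)` pair encoding are hypothetical, while the KEEP/DELETE/ADD operations follow the task definition above:

```python
# Hypothetical sketch of ECE edit semantics (not the paper's implementation).
# ops is a list of (operation, token) pairs; token is only used by ADD.
def apply_edits(ref_tokens, ops):
    out, i = [], 0
    for op, tok in ops:
        if op == "KEEP":      # copy the current Ref-Cap token
            out.append(ref_tokens[i])
            i += 1
        elif op == "DELETE":  # skip the current Ref-Cap token
            i += 1
        elif op == "ADD":     # insert a new token
            out.append(tok)
    return out

ref = "a dog on the grass".split()
ops = [("KEEP", None), ("DELETE", None), ("ADD", "cat"),
       ("KEEP", None), ("KEEP", None), ("KEEP", None)]
print(" ".join(apply_edits(ref, ops)))  # a cat on the grass
```

Here the misaligned word "dog" is deleted and "cat" is added, turning the Ref-Cap into the GT-Cap while keeping the aligned words untouched.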

ECE datasets

The ECE datasets include the COCO-EE and Flickr30K-EE.

Specifically, COCO-EE was built based on the MSCOCO dataset, and Flickr30K-EE was built based on the e-ViL and Flickr30K datasets.

Each ECE instance contains three main pieces of information: an image, a reference caption (Ref-Cap), and a ground-truth caption (GT-Cap).

Examples from COCO-EE and Flickr30K-EE


Statistical summary of the COCO-EE and Flickr30K-EE

                              COCO-EE                    Flickr30K-EE
                              Train    Dev     Test      Train    Dev     Test
#Editing instances            97,567   5,628   5,366     108,238  4,898   4,910
#Images                       52,587   3,055   2,948     29,783   1,000   1,000
Mean Ref-Cap Length           10.3     10.2    10.1      7.3      7.4     7.4
Mean GT-Cap Length            9.7      9.8     9.8       6.2      6.3     6.3
Mean Edit Distance            10.9     11.0    10.9      8.8      8.8     8.9
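The edit-distance statistics above can be approximated with a standard word-level Levenshtein distance; this is an assumption for illustration, since the dataset's exact metric may count or weight operations differently:

```python
# Word-level Levenshtein distance: the minimum number of single-token
# insertions, deletions, and substitutions turning sequence a into b.
def edit_distance(a, b):
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all remaining tokens of a
    for j in range(n + 1):
        dp[0][j] = j  # insert all remaining tokens of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete a[i-1]
                           dp[i][j - 1] + 1,          # insert b[j-1]
                           dp[i - 1][j - 1] + cost)   # match or substitute
    return dp[m][n]

ref = "a dog on the grass".split()
gt = "a cat on the grass".split()
print(edit_distance(ref, gt))  # 1
```

A single substitution ("dog" → "cat") gives a distance of 1 for this toy pair; averaging such distances over all Ref-Cap/GT-Cap pairs yields the "Mean Edit Distance" rows in the table.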

Dataset Construction

The processed datasets have been placed in the dataset folder; they can also be downloaded directly from here, including COCO-EE and Flickr30K-EE in train, dev, and test splits.

Or, you can follow the instructions below to set up the environment and construct them:

COCO-EE Construction

  1. Set up the coco-edit submodule and follow its instructions from this.

Flickr30K-EE Construction

  1. Set up the environment
    conda create -n flkree python=3.7
    conda activate flkree
    (The json and csv modules are part of the Python standard library, so no extra installation is needed.)
    
  2. Prepare the esnlive data and the output folder
  3. Construct Flickr30K-EE
    python construct_flickr30k_ee.py --split <split>

The ECE model: TIger

The code of our proposed ECE model TIger is now available here.