Crisscrossed Captions

Crisscrossed Captions (CxC) contains 247,315 human-labeled annotations, covering positive and negative associations for image pairs, caption pairs, and image-caption pairs.

For more details, please refer to the accompanying paper:
Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO

Motivation

Image captioning datasets have proven useful for multimodal representation learning, and a common evaluation paradigm based on multimodal retrieval has emerged. Unfortunately, these datasets have only limited cross-modal associations: images are not paired with other images, captions are paired only with other captions describing the same image, there are no negative associations, and many positive cross-modal associations are missing. This undermines retrieval evaluation and limits research into how inter-modality learning impacts intra-modality tasks. CxC addresses this gap by extending MS-COCO (the dev and test sets of the Karpathy split) with new semantic similarity judgments.

Below are some examples of caption pairs rated based on Semantic Textual Similarity:

Caption 1: A kite flying in the air over a sand castle.

Caption 2: Scattered people on a wide dry beach including surfers.

Caption 1: Giraffe watching a man push a wheelbarrow loaded with hay.

Caption 2: Two giraffes stand outside of a large building.

Caption 1: A man is sitting on a bench while another takes a nap.

Caption 2: There is an old woman sitting on a bench.

Caption 1: A train is driving down the tracks in front of a building.

Caption 2: A purple and yellow train traveling down train tracks.

Caption 1: A cut pizza and a glass on a table.

Caption 2: Small pizza sits on a plate on a restaurant table.

Caption 1: A family of sheep standing next to each other on a lush green field.

Caption 2: A herd of sheep standing next to each other on a lush green field.

Structure of the data

There are two CSV files per task (STS, SIS, SITS) and per split (val, test): one with raw annotator scores ('*_raw.csv') and one with aggregated scores per example. The first two columns contain the MS-COCO IDs of the corresponding images or captions, followed by the annotation score. The last column indicates how the example was sampled (a loading sketch follows the list below):

c2c_cocaption: caption pairs from the same MS-COCO example

c2c_isim: caption pairs from different MS-COCO examples, sampled based on image similarity

i2i_csim: image pairs from different MS-COCO examples, sampled based on caption similarity

c2i_intrasim: caption-image pairs from different MS-COCO examples

c2i_original: caption-image pairs from the same MS-COCO example
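
To make that layout concrete, here is a minimal loading sketch using pandas. The file name cxc/sts_val.csv and the reliance on column positions are assumptions based on the description above, not the exact headers shipped with the release.

import pandas as pd

# Aggregated STS scores for the validation split (hypothetical file name).
sts_val = pd.read_csv("cxc/sts_val.csv")

# Per the description above, the columns are: [id_1, id_2, score, sampling_method].
id_1, id_2, score, method = sts_val.columns[:4]

# Example: keep only caption pairs sampled via image similarity.
isim_pairs = sts_val[sts_val[method] == "c2c_isim"]
print(len(isim_pairs), "c2c_isim caption pairs")
print(isim_pairs[[id_1, id_2, score]].head())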

Augment MS-COCO examples with CxC labels

Download the MS-COCO Karpathy split annotations (from here) and pass them as coco_input to the following merge script:

python -m crisscrossed_captions.setup --coco_input "/path/to/coco/json" --cxc_input "/path/to/cxc/sits/*" --output "/path/to/combined/json"
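
For orientation, the sketch below shows the kind of merge such a script could perform: attaching SITS scores to each Karpathy-split image entry. This is an assumption about the behavior, not the actual implementation, and names such as the cxc_scores field are made up for illustration.

import glob
import json

import pandas as pd

def merge_cxc(coco_input, cxc_input, output):
    """Attach CxC SITS scores to the Karpathy-split MS-COCO annotations (illustrative only)."""
    with open(coco_input) as f:
        coco = json.load(f)

    # Read every SITS CSV; per the structure above, the first two columns are
    # the MS-COCO image/caption IDs and the third is the annotation score.
    sits = pd.concat(pd.read_csv(path) for path in glob.glob(cxc_input))
    image_col, caption_col, score_col = sits.columns[:3]

    # Group scores by image so each Karpathy entry can carry its CxC labels.
    by_image = {
        img: {str(cap): float(s) for cap, s in zip(grp[caption_col], grp[score_col])}
        for img, grp in sits.groupby(image_col)
    }

    # "cocoid" is the image identifier in the Karpathy split JSON; the ID
    # formats may need normalization to line up with those in the CxC files.
    for image in coco["images"]:
        image["cxc_scores"] = by_image.get(image.get("cocoid"), {})

    with open(output, "w") as f:
        json.dump(coco, f)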

Reference

If you use or discuss this dataset in your work, please cite our paper:

@article{parekh2020crisscrossed,
  title={Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO},
  author={Parekh, Zarana and Baldridge, Jason and Cer, Daniel and Waters, Austin and Yang, Yinfei},
  journal={arXiv preprint arXiv:2004.15020},
  year={2020}
}

Contact

If you have a technical question regarding the dataset or publication, please create an issue in this repository.