
What's in here?

This repo contains the code for our EMNLP 2021 paper: CLIPScore: A Reference-free Evaluation Metric for Image Captioning. CLIPScore is a metric that you can use to evaluate the quality of an automatic image captioning system. In our paper, we show that CLIPScore achieves high correlation with human judgment on literal image captioning tasks. However, unlike BLEU or CIDEr, CLIPScore doesn't require reference captions.

If you find the paper or this code useful, please consider citing:

@inproceedings{hessel2021clipscore,
  title={{CLIPScore:} A Reference-free Evaluation Metric for Image Captioning},
  author={Hessel, Jack and Holtzman, Ari and Forbes, Maxwell and Bras, Ronan Le and Choi, Yejin},
  booktitle={EMNLP},
  year={2021}
}

How do I run the code?

Command Line

Example usage

> python clipscore.py example/good_captions.json example/images/
...
CLIPScore: 0.8584

References are optional. If you provide them, you will also see RefCLIPScore, alongside the usual set of caption generation evaluation metrics.

> python clipscore.py example/good_captions.json example/images/ --references_json example/refs.json
...
BLEU-1: 0.6667
BLEU-2: 0.4899
BLEU-3: 0.3469
BLEU-4: 0.0000
METEOR: 0.3444
ROUGE: 0.4280
CIDER: 0.5637
SPICE: 0.4000
CLIPScore: 0.8584
RefCLIPScore: 0.8450

Worse captions should get lower scores:

> python clipscore.py example/bad_captions.json example/images/ --references_json example/refs.json
...
BLEU-1: 0.4815
BLEU-2: 0.2404
BLEU-3: 0.1359
BLEU-4: 0.0000
METEOR: 0.1861
ROUGE: 0.3121
CIDER: 0.2790
SPICE: 0.1500
CLIPScore: 0.7153
RefCLIPScore: 0.7253

You can treat and report CLIPScore and RefCLIPScore similarly to the other evaluation metrics. See the paper for more details about CLIPScore and RefCLIPScore. Full usage options are listed by python clipscore.py -h. An example set of inputs, including a candidates json, an image directory, and a references json, is given in this repo under example/.
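
For intuition about what the two numbers mean, here is a minimal sketch of the scoring rules as described in the paper: CLIPScore is a rescaled, clipped cosine similarity between the image and caption embeddings (with w = 2.5), and RefCLIPScore is the harmonic mean of CLIPScore and the best clipped caption-to-reference cosine similarity. The function names are illustrative and are not the ones used in clipscore.py, and plain Python lists stand in for real CLIP embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat a zero vector as "no similarity"
    return dot / (norm_u * norm_v)


def clip_score(image_emb, caption_emb, w=2.5):
    """Rescaled cosine similarity, clipped below at zero."""
    return w * max(cosine(image_emb, caption_emb), 0.0)


def ref_clip_score(image_emb, caption_emb, reference_embs, w=2.5):
    """Harmonic mean of CLIPScore and the best caption-reference similarity."""
    clip_s = clip_score(image_emb, caption_emb, w)
    ref_s = max(max(cosine(caption_emb, r) for r in reference_embs), 0.0)
    if clip_s == 0.0 or ref_s == 0.0:
        return 0.0
    return 2.0 * clip_s * ref_s / (clip_s + ref_s)
```

The numbers printed by the command line tool are averages of these per-image scores over the whole candidates file.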

The input files are formatted as follows:

The candidates json should be a dictionary that maps each string image identifier to its candidate caption, i.e., {"string_image_identifier": "candidate"}, e.g.,

{"image1": "an orange cat and a grey cat are lying together.",
 "image2": "a black dog looks at the camera.",
 ...}
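
If your captioning system produces its outputs in Python, one way to write a file in this format is sketched below. The helper name write_candidates and the output filename are hypothetical; the only thing taken from this README is the {"string_image_identifier": "candidate"} layout.

```python
import json


def write_candidates(captions, path):
    """Write a {image_id: caption} dictionary to `path` as JSON."""
    for image_id, caption in captions.items():
        # Both keys and values must be strings to match the expected layout.
        if not isinstance(image_id, str) or not isinstance(caption, str):
            raise TypeError(f"expected str -> str, got {image_id!r}: {caption!r}")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(captions, f, indent=2)


if __name__ == "__main__":
    write_candidates(
        {
            "image1": "an orange cat and a grey cat are lying together.",
            "image2": "a black dog looks at the camera.",
        },
        "my_captions.json",  # hypothetical output path
    )
```

Using json.dump rather than printing the dictionary guarantees double-quoted strings, which valid JSON requires.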

The image directory should contain the images whose filenames (without the extension) are the keys in the candidates json, e.g.,

images/
├── image1.jpg
└── image2.jpg

Finally, the references json should be a dictionary that maps each string image identifier to a list of reference captions, i.e., {"string_image_identifier": ["list", "of", "references"]}, e.g.,

{"image1": ["two cats are sleeping next to each other.",
            "a grey cat is cuddling with an orange cat on a blanket.",
            "the orange cat is happy that the black cat is close to it."],
 "image2": ["a dog is wearing ear muffs as it lies on a carpet.",
            "a black dog and an orange cat are looking at the photographer.",
            "headphones are placed on a dogs ears."]}

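Before running an evaluation, it can help to check that the three inputs line up. The sketch below is not part of this repo; check_inputs is a hypothetical helper, and it assumes that images are matched to candidate keys by filename without the extension (as in the image1.jpg example above).

```python
import json
from pathlib import Path


def check_inputs(candidates_json, image_dir, references_json=None):
    """Return a list of problems; an empty list means the inputs line up."""
    problems = []
    with open(candidates_json, encoding="utf-8") as f:
        candidates = json.load(f)

    # Assumption: "image1.jpg" corresponds to the key "image1".
    image_ids = {p.stem for p in Path(image_dir).iterdir() if p.is_file()}
    for image_id in sorted(set(candidates) - image_ids):
        problems.append(f"no image file for candidate key {image_id!r}")

    if references_json is not None:
        with open(references_json, encoding="utf-8") as f:
            references = json.load(f)
        for image_id in sorted(candidates):
            refs = references.get(image_id)
            # Each candidate needs a non-empty list of reference strings.
            if not isinstance(refs, list) or not refs:
                problems.append(f"no reference list for {image_id!r}")
    return problems
```

For example, check_inputs("example/good_captions.json", "example/images/", "example/refs.json") would return a list of mismatches to fix before calling clipscore.py.
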
MSCOCO dataset in pycocoevalcap

If you're running on the MSCOCO dataset and using the standard evaluation toolkit, you can use our version of pycocoevalcap to evaluate. You won't even need to download the original MSCOCO images, thanks to a bit of magic :-)

To use pycocoevalcap on the MSCOCO dataset in the MSCOCO format, install our version with:

pip install git+https://github.com/jmhessel/pycocoevalcap.git

There is an example evaluation in that repo under examples/eval.py. After pip installing, if you clone the pycocoevalcap repo and run

python eval.py

after a bit of time, the output should be:

Bleu_1: 0.579
Bleu_2: 0.404
Bleu_3: 0.279
Bleu_4: 0.191
METEOR: 0.195
ROUGE_L: 0.396
CIDEr: 0.600
SPICE: 0.133
CLIPScore: 0.528
RefCLIPScore: 0.605

Reproducibility notes: