
Semantic Propositional Image Caption Evaluation (SPICE)

Evaluation code for machine-generated image captions.

Requirements

Java 1.8.0+

Dependencies

Stanford CoreNLP 3.6.0
Stanford Scene Graph Parser
Meteor 1.5 (for synset matching)

Usage

To run SPICE, call the following (from the target directory):

java -Xmx8G -jar spice-*.jar

Running SPICE with no arguments prints the following help message:

SPICE version 1

Usage: java -Xmx8G -jar spice-*.jar <input.json> [options]

Options:
-out <outfile>                   Output json scores and tuples data to <outfile>
-cache <dir>                     Set directory for caching reference caption parses
-threads <num>                   Defaults to the number of processors
-detailed                        Include propositions for each caption in json output.
-noSynsets                       Disable METEOR-based synonym matching
-subset                          Report results in <outfile> for various semantic tuple subsets
-silent                          Disable stdout results

See README file for additional information and input format details

The input.json file should contain an array of json objects, each representing a single caption and containing image_id, test and refs fields. See example_input.json for a complete example.
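As a rough illustration of the input format described above, the following Python sketch builds such an array. It assumes that test holds the candidate caption as a string and refs holds a list of reference caption strings; the helper names (make_entry, to_spice_json) and the captions are made up for this example, and example_input.json remains the authoritative reference.

```python
import json


def make_entry(image_id, test, refs):
    """Build one caption record with the three fields named above.

    Assumption: `test` is the candidate caption (a string) and `refs`
    is a list of reference captions (strings).
    """
    return {"image_id": image_id, "test": test, "refs": list(refs)}


def to_spice_json(entries):
    """Serialise a list of caption records as a JSON array."""
    return json.dumps(list(entries), indent=2)


# Illustrative data only; these captions are not from any dataset.
entries = [
    make_entry(
        1,
        "a man riding a horse on a beach",
        [
            "a person rides a horse along the shore",
            "a man on horseback at the beach",
        ],
    ),
]

input_json_text = to_spice_json(entries)
```

Writing input_json_text to a file (for example input.json) gives a file that can be passed as the first argument to SPICE.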

We recommend passing the path of an initially empty directory in the -cache argument. SPICE stores reference caption parses there, which makes repeated evaluations much faster.
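For scripted evaluations, the command line can be assembled from Python. The sketch below uses only the options listed in the help message above; the function names, the jar file name and the file paths are placeholders chosen for this example. Note that a wildcard such as spice-*.jar is not expanded when the command is passed as a list, so the jar path must be given explicitly.

```python
import subprocess


def build_spice_command(jar_path, input_json, out_json=None,
                        cache_dir=None, threads=None, heap="8G"):
    """Return the SPICE invocation as an argument list.

    Only flags from the help message are used: -out, -cache, -threads.
    """
    cmd = ["java", f"-Xmx{heap}", "-jar", jar_path, input_json]
    if out_json is not None:
        cmd += ["-out", out_json]
    if cache_dir is not None:
        cmd += ["-cache", cache_dir]
    if threads is not None:
        cmd += ["-threads", str(threads)]
    return cmd


def run_spice(cmd):
    """Run SPICE and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(cmd, check=True)
```

A call such as run_spice(build_spice_command("spice-1.0.jar", "input.json", out_json="scores.json", cache_dir="cache")) would then run one evaluation and reuse the cache directory on later runs; the jar name here is hypothetical and should match the file actually present in the target directory.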

Build

To build SPICE and its dependencies from source, and run tests, use Maven with the following command: mvn clean verify. The jar file spice-*.jar will be created in the target directory, with required dependencies in target/lib.

Building SPICE from source is NOT required as precompiled jar files are available on the project page.

A note on the magnitude of SPICE scores

On MS COCO, with 5 reference captions, scores are typically in the range 0.15 - 0.20. With 40 reference captions, scores are typically in the range 0.03 - 0.07. This is the expected result due to the impact of the recall component of the metric: more reference captions produce more reference tuples for the candidate caption to recover. To make the scores more readable, C40 (40-reference) SPICE scores on the MS COCO leaderboard are multiplied by 10.
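The effect of recall on the score can be seen with a small calculation. The sketch below treats the score as an F1 over matched tuples; the tuple counts are invented for illustration and are not measurements from MS COCO.

```python
def f1_from_counts(matched, n_candidate, n_reference):
    """F1 of precision (matched / n_candidate) and recall (matched / n_reference)."""
    if matched == 0:
        return 0.0
    precision = matched / n_candidate
    recall = matched / n_reference
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: the candidate caption yields 6 tuples, 4 of which match.
# With more reference captions the reference tuple set grows, so recall drops
# even though precision is unchanged.
score_few_refs = f1_from_counts(matched=4, n_candidate=6, n_reference=40)
score_many_refs = f1_from_counts(matched=4, n_candidate=6, n_reference=150)

print(round(score_few_refs, 3))   # 0.174
print(round(score_many_refs, 3))  # 0.051
```

In practice a larger reference set may also raise the number of matched tuples, so the drop is usually less clean than in this fixed-count example.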

Policy gradient optimization of SPICE

We read with interest a paper that directly optimized SPICE (and other metrics) using policy gradients. The results indicated that optimizing a combination of SPICE and CIDEr (SPIDEr) produced the best captions, but that optimizing SPICE on its own led to ungrammatical results. This is because SPICE ignores repeated scene graph tuples and does not penalize them. However, it would be straightforward to adjust the metric to penalize repetition. Contact us for details.
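To make the repetition issue concrete, the sketch below compares a set-based F1, which is blind to repeated tuples, with one possible count-based variant in which every repeated candidate tuple lowers precision. This is only an illustration under simplifying assumptions (exact tuple matching, no synonyms); it is not the adjustment the authors have in mind, and all names and tuples are hypothetical.

```python
def f1(matched, n_candidate, n_reference):
    """F1 written directly in terms of counts."""
    total = n_candidate + n_reference
    return 2 * matched / total if total else 0.0


def set_based_f1(candidate_tuples, reference_tuples):
    """Duplicates in the candidate collapse, so repetition is not penalized."""
    cand, ref = set(candidate_tuples), set(reference_tuples)
    return f1(len(cand & ref), len(cand), len(ref))


def count_based_f1(candidate_tuples, reference_tuples):
    """Hypothetical variant: each reference tuple can be matched only once,
    but every candidate tuple (including repeats) counts in the denominator."""
    ref = set(reference_tuples)
    matched = len(set(candidate_tuples) & ref)
    return f1(matched, len(candidate_tuples), len(ref))


reference = [("man",), ("horse",), ("man", "ride", "horse"), ("beach",)]
clean = [("man",), ("horse",)]
repetitive = clean * 3  # the same two tuples emitted three times

print(set_based_f1(clean, reference) == set_based_f1(repetitive, reference))  # True
print(count_based_f1(repetitive, reference))  # 0.4
```

Under the count-based variant the repetitive caption scores 0.4 against roughly 0.667 for the clean one, whereas the set-based score is identical for both.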

References

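If you report SPICE scores, please cite the SPICE paper:

Peter Anderson, Basura Fernando, Mark Johnson and Stephen Gould. SPICE: Semantic Propositional Image Caption Evaluation. In European Conference on Computer Vision (ECCV), 2016.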

Developers

Peter Anderson (Australian National University) (peter.anderson@anu.edu.au)

Acknowledgements

This work is based on the SceneGraphParser developed by Sebastian Schuster (Stanford).
We re-use the WordNet synset matching code from Meteor 1.5 to identify synonyms.

License

GNU AGPL v3