UpDown Captioner Baseline for `nocaps`

Baseline model for the nocaps benchmark: a re-implementation of the UpDown image captioning model, trained only on the COCO dataset, with added support for decoding using Constrained Beam Search (CBS).

Figure: predictions generated by the UpDown model.

Citation

If you find this code useful, please consider citing our paper, the paper that proposed the original model, and EvalAI, the platform that hosts our evaluation server. All BibTeX entries are available in CITATION.md.

Usage Instructions

Extensive documentation is available at nocaps.org/updown-baseline. Use it as an API reference to navigate and build on top of our code.

Results

Pre-trained checkpoints, trained with the configs provided in the configs directory, are available to download:

UpDown Captioner (no CBS):

Checkpoint (.pth file): updown.pth
Predictions on nocaps val: updown_nocaps_val.json

Note: Although CBS is an inference-only technique, it cannot be used with this checkpoint. CBS requires the model to have 300-dimensional frozen GloVe embeddings, whereas this checkpoint has 1000-dimensional word embeddings that are learned during training.
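The compatibility rule in this note can be written as a small check. This is only an illustrative sketch, not code from this repository: the function name `supports_cbs`, the constant `GLOVE_DIM`, and the vocabulary size of 10000 are hypothetical, and the sketch assumes you already know the shape of the word-embedding matrix and whether it was frozen during training.

```python
from typing import Tuple

# Embedding size that CBS expects, as stated in the note above.
GLOVE_DIM = 300


def supports_cbs(embedding_shape: Tuple[int, int], is_frozen: bool) -> bool:
    """Return True if a (vocab_size, embedding_dim) word-embedding matrix
    meets the stated CBS requirement: 300-dimensional and frozen."""
    _, embedding_dim = embedding_shape
    return embedding_dim == GLOVE_DIM and is_frozen


# The vocabulary size (10000) is a placeholder; only the second dimension matters.
print(supports_cbs((10000, 1000), is_frozen=False))  # learned 1000-d embeddings
print(supports_cbs((10000, 300), is_frozen=True))    # frozen 300-d GloVe embeddings
```

The first call corresponds to the embeddings described for updown.pth and prints `False`; the second corresponds to a model with frozen GloVe embeddings and prints `True`.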

<table> <tr> <th colspan="2">in-domain</th> <th colspan="2">near-domain</th> <th colspan="2">out-of-domain</th> <th colspan="6">overall</th> </tr> <tr> <th>CIDEr</th><th>SPICE</th> <th>CIDEr</th><th>SPICE</th> <th>CIDEr</th><th>SPICE</th> <th>BLEU1</th><th>BLEU4</th><th>METEOR</th><th>ROUGE</th><th>CIDEr</th><th>SPICE</th> </tr> <tr> <td>78.1</td><td>11.6</td> <td>57.7</td><td>10.3</td> <td>31.3</td><td>8.3</td> <td>73.7</td><td>18.3</td><td>22.7</td><td>50.4</td><td>55.3</td><td>10.1</td> </tr> </table>

UpDown Captioner + Constrained Beam Search:

Checkpoint (.pth file): updown_plus_cbs.pth

Note: Since CBS is an inference-only technique, this checkpoint can also be used without CBS decoding. In that case it yields results similar to the UpDown Captioner trained with learned word embeddings.

With CBS Decoding:

Predictions on nocaps val: updown_plus_cbs_nocaps_val_with_cbs.json

<table> <tr> <th colspan="2">in-domain</th> <th colspan="2">near-domain</th> <th colspan="2">out-of-domain</th> <th colspan="6">overall</th> </tr> <tr> <th>CIDEr</th><th>SPICE</th> <th>CIDEr</th><th>SPICE</th> <th>CIDEr</th><th>SPICE</th> <th>BLEU1</th><th>BLEU4</th><th>METEOR</th><th>ROUGE</th><th>CIDEr</th><th>SPICE</th> </tr> <tr> <td>78.6</td><td>12.1</td> <td>73.5</td><td>11.5</td> <td>68.8</td><td>9.8</td> <td>75.8</td><td>17.5</td><td>22.7</td><td>51.1</td><td>73.3</td><td>11.3</td> </tr> </table>

Without CBS Decoding:

Predictions on nocaps val: updown_plus_cbs_nocaps_val_without_cbs.json

<table> <tr> <th colspan="2">in-domain</th> <th colspan="2">near-domain</th> <th colspan="2">out-of-domain</th> <th colspan="6">overall</th> </tr> <tr> <th>CIDEr</th><th>SPICE</th> <th>CIDEr</th><th>SPICE</th> <th>CIDEr</th><th>SPICE</th> <th>BLEU1</th><th>BLEU4</th><th>METEOR</th><th>ROUGE</th><th>CIDEr</th><th>SPICE</th> </tr> <tr> <td>75.7</td><td>11.7</td> <td>58.0</td><td>10.3</td> <td>32.9</td><td>8.2</td> <td>73.1</td><td>18.0</td><td>22.7</td><td>50.2</td><td>55.4</td><td>10.1</td> </tr> </table>
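To make the effect of CBS decoding easier to see, the short sketch below subtracts the "Without CBS Decoding" scores from the "With CBS Decoding" scores for the updown_plus_cbs.pth checkpoint. The numbers are copied from the two tables above; the dictionary layout and the helper name `cbs_gain` are illustrative choices, not part of this codebase.

```python
# nocaps val scores for updown_plus_cbs.pth, copied from the tables above.
with_cbs = {
    "in-domain CIDEr": 78.6,
    "near-domain CIDEr": 73.5,
    "out-of-domain CIDEr": 68.8,
    "overall CIDEr": 73.3,
    "overall SPICE": 11.3,
}
without_cbs = {
    "in-domain CIDEr": 75.7,
    "near-domain CIDEr": 58.0,
    "out-of-domain CIDEr": 32.9,
    "overall CIDEr": 55.4,
    "overall SPICE": 10.1,
}


def cbs_gain(with_scores, without_scores):
    """Per-metric difference (with CBS minus without CBS), rounded to one decimal."""
    return {
        metric: round(with_scores[metric] - without_scores[metric], 1)
        for metric in with_scores
    }


gains = cbs_gain(with_cbs, without_cbs)
for metric, gain in gains.items():
    print(f"{metric}: {gain:+.1f}")
```

The largest difference is on out-of-domain CIDEr (+35.9), while in-domain CIDEr changes by only +2.9.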