RoViST: Learning Robust Metrics for Visual Storytelling
This repository contains the code for the paper *Learning Robust Metrics for Visual Storytelling*.
<div align="center"> Wang, E.*, Han, C.*, & Poon, J. (2022). <br> Learning Robust Metrics for Visual Storytelling <br> Findings of NAACL 2022 </div>
1. Introduction
Visual storytelling (VST) is the task of generating a story paragraph that describes a given image sequence. Most existing storytelling approaches evaluate their models with traditional natural language generation metrics such as BLEU or CIDEr. However, such n-gram matching metrics tend to correlate poorly with human evaluation scores and do not explicitly consider other criteria necessary for storytelling, such as sentence structure or topic coherence. Moreover, a single score is not enough to assess a story, as it does not tell us which specific errors a model made.
In this work, we propose a set of three evaluation metrics that analyse the aspects we look for in a good story:
- Visual Grounding: the story should be relevant to the image content, but unlike image captioning, there is less emphasis on describing relationships between objects, and the story may contain concepts that are inferred from the images.
- Coherence: the story must be topically coherent, similar to how a human would tell a story in a social setting. Sentences should not sound disjointed, e.g. ‘We went to the park. I grew up in Sydney’.
- Non-redundancy: the story should avoid repetition, which is a common issue in current VST models, e.g. ‘we had a good time and had a great time!’ (a toy illustration follows this list).
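As a rough, hedged illustration of the non-redundancy idea (this is not the RoViST-NR formulation from the paper, just a toy sketch), repeated phrases can be surfaced by measuring n-gram overlap between spans of the generated text:

```python
# Toy sketch of redundancy detection via n-gram overlap.
# This is NOT the RoViST-NR metric; it only illustrates that repeated phrases
# push up the similarity between spans of a generated story.

def ngrams(text, n=2):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    # Jaccard similarity between two n-gram sets (1.0 = identical, 0.0 = disjoint).
    return len(a & b) / len(a | b) if (a or b) else 0.0

def half_overlap(sentence, n=2):
    # Compare the first and second half of a sentence.
    tokens = sentence.split()
    mid = len(tokens) // 2
    return jaccard(ngrams(" ".join(tokens[:mid]), n), ngrams(" ".join(tokens[mid:]), n))

print(half_overlap("we had a good time and had a great time"))  # repetitive -> higher overlap
print(half_overlap("we went to the park and then had dinner"))  # varied -> near zero
```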
2. Setup
The code is provided as .ipynb notebooks, so no special setup is required other than a Jupyter environment with a GPU.
3. Inference Notebook
To calculate your own scores, follow the instructions in the demo notebook files; the code can be run on Google Colab. RoViST_VG_Demo calculates the Visual Grounding scores, and RoViST_C_NR calculates the Coherence and Non-redundancy scores.
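The actual metric implementations live in the notebooks above. Purely for intuition, the hedged sketch below shows one generic way to probe sentence-to-sentence coherence with an off-the-shelf next-sentence-prediction model from Hugging Face `transformers`; it is an illustration under our own assumptions, not the RoViST-C code in RoViST_C_NR:

```python
# Hedged sketch: score how plausible each sentence is as a continuation of the
# previous one using BERT's next-sentence-prediction head. This is only an
# illustration of a coherence-style signal, not the RoViST-C implementation.

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

story = [
    "we went to the park for a picnic.",
    "the kids played on the swings all afternoon.",
    "by sunset everyone was ready to head home.",
]

with torch.no_grad():
    for prev, nxt in zip(story, story[1:]):
        inputs = tokenizer(prev, nxt, return_tensors="pt")
        logits = model(**inputs).logits
        p_next = torch.softmax(logits, dim=-1)[0, 0].item()  # index 0 = "is next sentence"
        print(f"{prev!r} -> {nxt!r}: {p_next:.3f}")
```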
4. Reference
If you use this code for your research, please cite:
```bibtex
@inproceedings{wang-etal-2022-rovist,
    title = "{R}o{V}i{ST}: Learning Robust Metrics for Visual Storytelling",
    author = "Wang, Eileen and
      Han, Caren and
      Poon, Josiah",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-naacl.206",
    doi = "10.18653/v1/2022.findings-naacl.206",
    pages = "2691--2702",
}
```