multirefeval

Data and evaluation script for the paper "Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References" in SIGDIAL 2019 - https://arxiv.org/abs/1907.10568

Data

Code

We have provided an evaluation script for multi-reference and single-reference evaluation using the data above. The code is in the Code folder.
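The metrics themselves are computed by score_multiref.py via the dependencies listed below. Purely to illustrate the idea behind multi-reference evaluation (scoring each model reply against every available human reference and keeping the best match), a minimal sketch using NLTK's sentence_bleu might look like the following; NLTK is not a dependency of this repo, and the example inputs and function name are hypothetical:

# Illustrative sketch only; the released score_multiref.py uses sumeval / nlg-eval,
# not NLTK. This is just to show the "best match over references" idea.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def multi_reference_bleu(prediction, references):
    """Score one model reply against several human references and keep the best score."""
    smooth = SmoothingFunction().method1
    hyp = prediction.split()
    scores = [
        sentence_bleu([ref.split()], hyp, smoothing_function=smooth)
        for ref in references
    ]
    return max(scores)

# Hypothetical example inputs, not taken from the dataset.
print(multi_reference_bleu("i am doing fine", ["i am fine", "doing great thanks"]))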

Dependencies

The code has the following package dependencies -
sumeval - pip install sumeval
Maluuba's nlgeval - link

Files

score_multiref.py - Code to run multi-reference evaluation
hredf.txt - Sample model output file
jsons/test.tgt - Single reference file
jsons/test_duid_mapping.json - File containing the mapping from context id to line number (more details below)

Line number - context id mapping

The test dataset consists of 1000 dialogues, which yield 6740 context-reply pairs. For example, a dialogue with 10 utterances corresponds to 9 context-reply pairs (the last utterance does not lead to a reply).
Any model-generated output file will have 6740 lines corresponding to these contexts, as is the case for the sample hredf.txt file. We map the context id, which is the dialogue id concatenated with an utterance id, to the line numbers in the test output file using a Python dictionary stored in the test_duid_mapping.json file.
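As an illustration, the mapping can be used to look up the model output line for a given context id along the following lines. This is only a sketch: it assumes the JSON is a flat dictionary from context-id strings to 0-based line numbers, and the context id shown is hypothetical; the exact key format should be checked against test_duid_mapping.json.

import json

# Load the context-id -> line-number mapping (assumed to be a flat dict of
# context-id strings to line indices; check test_duid_mapping.json for the exact format).
with open("jsons/test_duid_mapping.json") as f:
    duid_to_line = json.load(f)

# Read the 6740 model outputs, one reply per line (here the sample hredf.txt).
with open("hredf.txt") as f:
    predictions = [line.rstrip("\n") for line in f]

# Hypothetical context id: dialogue id concatenated with an utterance id.
context_id = "some_dialogue_id_3"
line_no = duid_to_line[context_id]
print(predictions[line_no])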

Running the evaluation script

For evaluation, you can use the command python score_multiref.py --pred_file hredf.txt. The arguments accepted by this script are the following -