FaithDial: A Faithful Benchmark for Information-Seeking Dialogue

This repository hosts the code and pre-trained models for our paper FaithDial: A Faithful Benchmark for Information-Seeking Dialogue. It also hosts the data annotations for our NAACL 2022 paper On the Origin of Hallucinations in Conversational Models: Is it the Datasets or the Models? For more information, please visit the project page.

Overview

The goal of information-seeking dialogue is to respond to user queries with natural language utterances that are grounded in knowledge sources. Dialogue systems, however, often hallucinate, i.e., they generate unsupported utterances, as they amplify the noise found in existing training datasets. To mitigate this behavior, we adopt a data-centric solution and create FaithDial, a new benchmark for hallucination-free dialogues. Annotators were asked to edit the hallucinated utterances in a pre-existing dataset so that they are faithful to the knowledge sources, and to re-purpose the role of the interlocutor from a human wizard to a domain-expert bot.

Data

The dataset is hosted on the Hugging Face Hub and can be loaded with the datasets library:

from datasets import load_dataset

dataset = load_dataset("McGill-NLP/FaithDial")
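
Once loaded, you can do a quick sanity check on the splits and look at a single example. The following is a minimal sketch that assumes each example exposes the fields described in the Data Format section below (e.g. history, knowledge, response):

# Show the available splits and their sizes.
print({split: dataset[split].num_rows for split in dataset})

# Peek at one training example; the field names follow the Data Format section below.
example = dataset["train"][0]
print(example["knowledge"])
print(example["response"])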

Use with Hugging Face

We'll release our fine-tuned models soon! Stay tuned!

Train Your Models

The code for all the models in the paper is available in the models directory and can be used to reproduce our results or to train your own models.

Requirements

First, install PyTorch 1.7+ from the official website. Then, clone this repository and install the dependencies:

git clone git@github.com:McGill-NLP/FaithDial.git
cd FaithDial
pip install -r requirements.txt

Our code is tested with Python 3.8 and PyTorch 1.7.1 with CUDA 11.0.

Data Format

By default, our code loads data from the Hugging Face datasets library, but you can also provide your own data in the following format:

[
  {
    "utterances": [
      ... // prior utterances, 
      {
        "history": [
          "Have you ever been to a concert? They're so fun!",
          "No I cannot as a bot. However, have you been to Madonna's? Her 10th concert was used to help her 13th album called \"Rebel Heart\".",
          "Yeah I've heard of it but never went or what it was for. Can you tell me more about it?"
        ],
        "speaker": "Wizard",
        "knowledge": "It began on September 9, 2015, in Montreal, Canada, at the Bell Centre and concluded on March 20, 2016, in Sydney, Australia at Allphones Arena.",
        "original_response": "It started in September of 2015 and ran all the way through March of 2016. Can you imagine being on the road that long?",
        "response": "Sure. The concert started in September 9th of 2015 at Montreal, Canada. It continued till 20th of March of 2016, where it ended at Sydney, Australia.",
        "BEGIN": [
          "Hallucination",
          "Entailment"
        ],
        "VRM": [
          "Disclosure",
          "Question"
        ]
      }, 
      ... // more utterances
    ]
  }, 
  ... // more dialogues
]

In the above example, original_response, BEGIN, and VRM are optional and don't have to be provided for your own data.
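
Note that the ... // placeholders above are only illustrative and must not appear in an actual JSON file. Below is a minimal sketch of reading a file in this format with the standard json module; the file name data.json and the printing logic are illustrative and not part of the training scripts:

import json

# Load dialogues stored in the format shown above ("data.json" is an illustrative path).
with open("data.json", encoding="utf-8") as f:
    dialogues = json.load(f)

# Each dialogue holds a list of utterances; each utterance pairs the dialogue
# history and the source knowledge with the faithful response.
for dialogue in dialogues:
    for utterance in dialogue["utterances"]:
        print(utterance["knowledge"])
        print(utterance["response"])  # original_response, BEGIN, and VRM are optional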

Training

Here is how to train a model:

python models/dialog.py --model_name_or_path t5-base \
  --do_train \
  --output_dir /path/to/output_dir \
  --fp16 \
  --train_batch_size 16 \
  --num_train_epochs 10 \
  --warmup_ratio 0.04 \
  --max_seq_length 512

To run on multiple GPUs, set CUDA_VISIBLE_DEVICES (e.g., CUDA_VISIBLE_DEVICES=0,1). By default, training uses early stopping, and the best model is saved at /path/to/output_dir/best_model.

For a complete list of training arguments, take a look at models/dialog.py and models/lightning_base.py.

Evaluation

To compute the perplexity of a model on the validation data, simply run:

python models/dialog.py --model_name_or_path /path/to/model/best_model \
  --do_eval \
  --eval_batch_size 16

For the test data, --do_eval should be replaced with --do_test. Note that evaluation should be run on a single GPU.

To compute the other metrics reported in the paper (BLEU, ROUGE, F1, BERTScore, and Q^2), we used the scripts provided in https://github.com/orhonovich/q-squared.

Generation

To generate a response, simply run:

python models/generate.py --model_name_or_path /path/to/model/best_model --do_sample --top_p 0.6

For a complete list of generation arguments, refer to models/generate.py.

Critic

We also use our collected data to frame the problem of identifying hallucinations as a binary classification task, where the goal is to predict whether an utterance is faithful to the source knowledge. We call the resulting model FaithCritic.

Hugging Face

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("McGill-NLP/roberta-large-faithcritic")
model = AutoModelForSequenceClassification.from_pretrained("McGill-NLP/roberta-large-faithcritic")

knowledge = "A cardigan is a type of knitted garment (sweater) that has an open front."
response = "The old version is the regular one, knitted garment that has open front and buttons!"

# Tokenize the (knowledge, response) pair and return PyTorch tensors.
inputs = tokenizer(knowledge, response, return_tensors="pt")
# Print the index of the predicted class.
print(torch.argmax(model(**inputs).logits))
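
To inspect class probabilities instead of a hard label, you can apply a softmax to the logits. This is a small sketch that continues the snippet above; the meaning of each class index should be read from the model's own label mapping (model.config.id2label) rather than assumed:

# Class probabilities for the (knowledge, response) pair.
probs = torch.softmax(model(**inputs).logits, dim=-1)
# The index-to-label mapping is stored in the model config.
print(model.config.id2label)
print(probs)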

Training

python models/critic.py --model_name_or_path roberta-large --do_train --train_batch_size 16 \
    --learning_rate 1e-5 --weight_decay 0.1 --warmup_ratio 0.08 --pad_to_multiple_of 8 --fp16 \
    --output_dir /path/to/output

Testing

python models/critic.py --model_name_or_path /path/to/model --eval_batch_size 16 --do_test

To test on other datasets, pass --test_task {BEGIN|MNLI}. For BEGIN and MNLI, --test_dataset_path is required; the datasets can be downloaded from here and here, respectively. For MNLI, it is also possible to use the version hosted on :hugs: Datasets by omitting --test_dataset_path, but the results will differ slightly.

Bugs or questions?

If you have any questions (:question:) related to the code, or encounter any problems (:hammer_and_wrench:), or want to report a bug (:bug:), feel free to open an issue.

Citation

If you want to cite our papers, please use:

@article{dziri2022faithdial,
  title = "{FaithDial: A Faithful Benchmark for Information-Seeking Dialogue}",
  author = {Dziri, Nouha and Kamalloo, Ehsan and Milton, Sivan and Zaiane, Osmar and Yu, Mo and Ponti, Edoardo M and Reddy, Siva},
  journal = {Transactions of the Association for Computational Linguistics},
  volume = {10},
  pages = {1473--1490},
  year = {2022},
  month = {12},
  publisher = {MIT Press},
  doi = {10.1162/tacl_a_00529}
}

and

@inproceedings{dziri2022origin,
  title = "On the Origin of Hallucinations in Conversational Models: Is it the Datasets or the Models?",
  author = {Dziri, Nouha and Milton, Sivan and Yu, Mo and Zaiane, Osmar and Reddy, Siva},
  booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
  year = {2022},
  pages = "5271--5285",
  address = "Seattle, United States",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2022.naacl-main.387"
}

The ACL Anthology bibkey for this paper is dziri-etal-2022-origin.

License

This work is licensed under the MIT license. See LICENSE for details.