Awesome
FRANK: Factuality Evaluation Benchmark
This repository contains the data for the FRANK Benchmark for factuality evaluation metrics (see our NAACL 2021 paper for more information). The data combines outputs from 9 models on 2 datasets with a total of 2250 annotated model outputs. We chose to conduct the annotation on recent systems on both CNN/DM and XSum datasets providing a large variety of data and factual errors.
The annotation was conducted based on a typology of factual errors which is described in detail in our paper. Thanks to this fine-grained annotation scheme, the annotations we collected can be used to compare specific strength and weaknesses of factuality metrics.
The leaderboard website is accessible here https://frank-benchmark.herokuapp.com
Updates and Fixes
- 7/20/2021 We fixed an issue with BertScore results reported in the paper. The new results are live on the leaderboard and the file with baseline results
baseline_factuality_metrics_outputs.json
has been updated. It appears that overall BertScore P Art performs best overall in terms of Pearson while FactCC is better in terms of Spearman. However, we also observe BertScore primarily focuses on Content verifiability errors and is not as strong at Semantic Frame errors and Discourse errors. - 6/16/2021 Validation-Test splits for FRANK.
Data
The data
repository contains the data to run new evaluation metrics and the collected human judgements to compute correlations and anaylsis. All the data comes from the test split of each dataset. We use the hashes from the original datasets to identify the documents.
Validation-Test Split for FRANK
The FRANK paper presents results on the entire FRANK dataset since the metrics were not tuned for the FRANK benchmark. However, we expect some tuning in future work. For this reason, we split the data in validation
and test
. All tuning and experimentation should be performed on the validation set, while the performance results should be reported on the test
set. The validation set contains summaries from 149 articles (671 summaries) and the test set contains summaries from 350 articles (1575 summaries).
The files all include a split
field which indicates whether the datapoint is part of the validation
or test
set. The fileds test_split.txt
and validation_split.txt
contain the list of test and validation hashes. Note that a hash corresponds to an article and all summaries of the same article are in the same split.
Data description
We describe the contents of the data files below. Note that all files contain a split
field which indicates whether the datapoint is part of the validation or test split of FRANK.
benchmark_data.json
is the data on which new evaluation metrics have to be executed. It is a list, each element contains a model-generated summary, a reference summary, and an article. There are additional fields used to track the element:hash
or identifyier in the dataset the article was taken from, andmodel_name
with the name of the model used to generate the summary.human_annotations.json
contains one record for each article/model pair. It has aFactuality
field which is the total human judgement assigned to the summary. This is a score between 0 and 1 as we collected judgements on each sentence and average over sentences. The rest of the fields correspond to individual errors or groups of errors. A 1 indicates that there was no such errors in the summary, a 0 indicates that every sentence contained one such error. We also include fields with "flipped" labels for each category. These fields can be used for the ablation study to determine the influence of each category in the overall correlation with human judgement.human_annotations_sentences.json
contains all human annotations that we collected on the system outputs at the sentence level and for the each annotator (anonymized). We use the same naming convention as in the paper to indicate categories of errors. In addition, "NoE" indicates no error, and "OtherE" indicates an error outside of the typology. This file has the same fields ashuman_annotations.json
with two additional fields:summary_sentence
(the result of running spacy's sentence boundary detection) andsummary_sentences_annotations
which contains the annotations for each sentence. The latter is a list where each element corresponds to a sentence and contains the annoations by the three annoators. Note that an annotator can select more than one category of error if they identify more than one error.selected_documents.txt
is a list of hashes of the documents that were selected to be part of the FRANK benchmark.baseline_factuality_metrics_outputs.json
contain the ouputs of running several evaluation metrics on the benchmark data. These results were used to obtain the correlation numbers in the paper and are helpful to compare new metrics to those previously proposed.
Evaluation
To evaluate new metrics we assess their partial correlation with human judgements. We use partial correlation using the summarization system as control variable.
The file evaluate.py
can be used to compute partial correlations along with the Williams test to assess if the difference between two metrics is statistically significant. evaluate.py
can also be used to conduct a more detailed analysis of the performance of the factuality metric. In particular, it can be used to compute an ablation study and estimate how much the metric is able to capture each category of error. The categories of error are described in the typology defined in our paper.
The online leaderboard uses the evaluation scripts in evaluate.py
to evaluate the metrics.
Validation-Test Splits for FRANK
We split the data in validation
and test
. All tuning and experimentation should be performed on the validation set, while the performance results should be reported on the test
set.
Usage
To install requirements:
git clone https://github.com/artidoro/frank.git
cd frank
pip install -r requirements.txt
usage: evaluate.py [-h]
[--mode {hm-correlation,ablations,ablations-plot,mm-correlation}]
[--human_eval_path HUMAN_EVAL_PATH]
[--baseline_metrics_outputs BASELINE_METRICS_OUTPUTS]
[--baseline_metrics BASELINE_METRICS [BASELINE_METRICS ...]]
[--no_baseline_metrics]
[--metrics_outputs METRICS_OUTPUTS]
[--metrics_outputs_info METRICS_OUTPUTS_INFO]
[--ablations ABLATIONS [ABLATIONS ...]] [--human HUMAN]
[--no_partial_correlation]
[--partial_correlation_variable PARTIAL_CORRELATION_VARIABLE]
[--store_path STORE_PATH] [--dataset {None,cnndm,bbc}]
[--model_name MODEL_NAME [MODEL_NAME ...]]
[--split {valid,test,all}]
Arguments for the evaluation script.
optional arguments:
-h, --help show this help message and exit
--mode {hm-correlation,ablations,ablations-plot,mm-correlation}
This script can calculate correlation with human
judgments (hm-correlation), evaluate the performance
of the evaluation metrics at capturing different
types of factual errors (ablations), output the
ablation as a plot (ablations-plot), and compute the
Williams test (mm-correlation)
--human_eval_path HUMAN_EVAL_PATH
file containing human annotations expects csv file.
--baseline_metrics_outputs BASELINE_METRICS_OUTPUTS
file name containing outputs of baseline factuality
metrics.
--baseline_metrics BASELINE_METRICS [BASELINE_METRICS ...]
baseline metrics to evaluate on (should match the
name in the baseline metrics output file).
--no_baseline_metrics
If set, does not evaluate the baseline metrics
--metrics_outputs METRICS_OUTPUTS
names of json files containing metric outputs with
key "score"
--metrics_outputs_info METRICS_OUTPUTS_INFO
json file describing how to parse metrics output
files. This allows to customize the name of the score
key and to have several metrics in one json file.
--ablations ABLATIONS [ABLATIONS ...]
column names for ablations.
--human HUMAN column for human judgements.
--no_partial_correlation
--partial_correlation_variable PARTIAL_CORRELATION_VARIABLE
what column to use as confounding to calculate
partial correlations
--store_path STORE_PATH
--dataset {None,cnndm,bbc}
if None use all data
--model_name MODEL_NAME [MODEL_NAME ...]
by default use all data, availble model names
['bart', 'pgn', 'bus', 'bert_sum', 's2s', 'TranS2S',
'TConvS2S', 'PtGen', 'BERTS2S']
--split {valid,test,all}
Whether to use validation or test splits of FRANK.
For experimentations only use validation set. Using
all the data is only recommended for analysis of
types of error.
To run on the baseline metrics on the validation
set:
python evaluation/evaluate.py
To run on the baseline metrics on the test
set:
python evaluation/evaluate.py --split test
An example submission file is example_benchmark_data_scored.json
. You can evaluate it with or without baseline metrics using:
python evaluation/evaluate.py --metrics_outputs data/example_benchmark_data_scored.json
python evaluation/evaluate.py --metrics_outputs data/example_benchmark_data_scored.json --no_baseline_metrics
If you want to specify how to parse the metric outputs using a json file, you can use the metrics_outputs_info
argument. example_metrics_outputs_info.json
is a example file that defines how to parse the example_benchmark_data_scored.json
. You would use it as follows:
python evaluation/evaluate.py --metrics_outputs_info data/example_metrics_outputs_info.json
To use different modes, use the argument mode
. For example, using the ablations
mode the script measure how much a metric captures a given type of errors. This is done by computing the negative difference between the partial correlation when flipping the label of one type of error and that without flipping the label. Using the ablations-plot
generates a plot of the ablations on the selected categories of error.
python evaluation/evaluate.py --mode ablations --metrics_outputs data/example_benchmark_data_scored.json
To compare different metrics, one should test whether their difference is statistically significant. This can be done with the williams test taking into account the metric-metric correlations. Using the mode mm-correlation
the script computes the the Williams test and the metric-metric correlations.
python evaluation/evaluate.py --mode mm-correlation
One can specify the baseline metrics used for the analysis using the baseline_metrics
argument. Similarly, the ablations
arguments specifies which error category to use to compute the ablation study. Note that these can both be changed in the code directly and the other options are commented out for simplicity.
Finally, the code also allows to customize the data split used for the computation of the statistics (dataset and model), the variable used to compute partial correlation, and whether to store the outputs of this tool. See the argument definition for additional help with these options.
Submission to the Leaderboard
Submit your metric output using this Google Form. Allow one week to have the results display on the online leaderboard.
Submission format
We expect a .json
file like benchmark_data.json
in the data
directory with each element having an additional field score
which will store the score returned by your metric on the corresponding summary/article pair.
You can verify that the evaluate.py
script works with your benchmark_data.json
:
python evaluation/evaluate.py --no_baseline_metrics --metrics_outputs data/example_benchmark_data_scored.json
If you have any questions feel free to submit an issue.
Annotation Tools
The repository with the annotation platoform can be found at https://github.com/artidoro/frank-annotation-platform.
Citation
@inproceedings{pagnoni-etal-2021-understanding,
title = "Understanding Factuality in Abstractive Summarization with {FRANK}: A Benchmark for Factuality Metrics",
author = "Pagnoni, Artidoro and
Balachandran, Vidhisha and
Tsvetkov, Yulia",
booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jun,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.naacl-main.383",
doi = "10.18653/v1/2021.naacl-main.383",
pages = "4812--4829",
abstract = "Modern summarization models generate highly fluent but often factually unreliable outputs. This motivated a surge of metrics attempting to measure the factuality of automatically generated summaries. Due to the lack of common benchmarks, these metrics cannot be compared. Moreover, all these methods treat factuality as a binary concept and fail to provide deeper insights on the kinds of inconsistencies made by different systems. To address these limitations, we devise a typology of factual errors and use it to collect human annotations of generated summaries from state-of-the-art summarization systems for the CNN/DM and XSum datasets. Through these annotations we identify the proportion of different categories of factual errors and benchmark factuality metrics, showing their correlation with human judgement as well as their specific strengths and weaknesses.",
}