Awesome
Summarization Repository
Authors: Alex Fabbri*, Wojciech Kryściński*, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev<br/>
This project is a collaboration between Yale LILY Lab and Salesforce Research. <br/><br/>
<p align="center"> <img src="https://raw.githubusercontent.com/Yale-LILY/SummEval/master/assets/logo-lily.png" height="100" alt="LILY Logo" style="padding-right:160"> <img src="https://raw.githubusercontent.com/Yale-LILY/SummEval/master/assets/logo-salesforce.svg" height="100" alt="Salesforce Logo"> </p><sub><sup>* - Equal contributions from authors</sup></sub>
Updates
04/19/2020 - Updated the human annotation file to include all models from paper and metric scores.<br/> 04/19/2020 - SummEval is now pip-installable! Check out the pypi page.<br/> 04/09/2020 - Please see this comment with code for computing system-level metric correlations! <br/> 11/12/2020 - Added the reference-less BLANC and SUPERT metrics! <br/> 7/16/2020 - Initial commit! :)
Data
As part of this release, we share summaries generated by recent summarization models trained on the CNN/DailyMail dataset here.<br/> We also share human annotations, collected from both crowdsourced workers and experts here.
Both datasets are shared WITHOUT the source articles that were used to generate the summaries. <br/> To recreate the full dataset please follow the instructions listed here.
Model Outputs
IMPORTANT:
All model outputs were obtained from the original authors of the models and shared with their consent.<br/> When using any of the model outputs, please also cite the original paper.
Human annotations
Human annotations of model generated summaries can be found here.
The annotations include summaries generated by 16 models from 100 source news articles (1600 examples in total). <br/> Each of the summaries was annotated by 5 independent crowdsourced workers and 3 independent experts (8 annotations in total). <br/> Summaries were evaluated across 4 dimensions: coherence, consistency, fluency, relevance. <br/> Each source news article comes with the original reference from the CNN/DailyMail dataset and 10 additional crowdsourced reference summaries.
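As a rough illustration of how the annotations for one summary might be aggregated, the sketch below averages scores per dimension for a single record. The field name `expert_annotations`, the record layout, and the score values are assumptions made up for this example, not a documented schema; check them against the downloaded file before relying on them.

```python
from statistics import mean

# The four dimensions named in the annotation description.
DIMENSIONS = ("coherence", "consistency", "fluency", "relevance")


def average_scores(annotations):
    """Average each dimension over a list of per-annotator score dicts."""
    return {dim: mean(a[dim] for a in annotations) for dim in DIMENSIONS}


# Hypothetical record: the field name and the values are illustrative only.
record = {
    "expert_annotations": [
        {"coherence": 4, "consistency": 5, "fluency": 5, "relevance": 3},
        {"coherence": 3, "consistency": 5, "fluency": 5, "relevance": 4},
        {"coherence": 5, "consistency": 5, "fluency": 5, "relevance": 5},
    ],
}

expert_avg = average_scores(record["expert_annotations"])
print(expert_avg)
```

The same helper could be applied to the crowdsourced annotations, assuming they are stored in the same per-annotator layout.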
Data preparation
Both model generated outputs and human annotated data require pairing with the original CNN/DailyMail articles.
To recreate the datasets, follow these instructions:
- Download CNN Stories and Daily Mail Stories from https://cs.nyu.edu/~kcho/DMQA/
- Create a cnndm directory and unpack downloaded files into the directory
- Download and unpack model outputs or human annotations.
- Run the pair_data.py script to pair the data with original articles
Example call for model outputs:
python3 data_processing/pair_data.py --model_outputs <file-with-data-annotations> --story_files <dir-with-stories>
Example call for human annotations:
python3 data_processing/pair_data.py --data_annotations <file-with-data-annotations> --story_files <dir-with-stories>
Evaluation Toolkit
We provide a toolkit for summarization evaluation to unify metrics and promote robust comparison of summarization systems. The toolkit contains popular and recent metrics for summarization as well as several machine translation metrics.
Metrics
Below are the metrics included in the toolkit, followed by the associated paper and code used within the toolkit:
SETUP
You can install summ_eval via pip:
pip install summ-eval
You can also install summ_eval from source:
git clone https://github.com/Yale-LILY/SummEval.git
cd SummEval/evaluation
pip install -e .
You can test your installation (assuming you're in the ./summ_eval folder) and get familiar with the library through tests/:
python -m unittest discover
Command-line interface
We provide a command-line interface, calc-scores, which makes use of gin config files to set metric parameters.
Examples
Run ROUGE on given source and target files and write to rouge.jsonl, analogous to files2rouge.
calc-scores --config-file=examples/basic.config --metrics "rouge" --summ-file summ_eval/1.summ --ref-file summ_eval/1.ref --output-file rouge.jsonl --eos " . " --aggregate True
NOTE: if you're seeing slow-ish startup time, try commenting out the metrics you're not using in the config; otherwise this will load all modules.
Run ROUGE and BERTScore on a .jsonl file which contains reference and decoded (i.e., system output) keys and write to rouge_bertscore.jsonl.
calc-scores --config-file=examples/basic.config --metrics "rouge, bert_score" --jsonl-file data.jsonl --output-file rouge_bertscore.jsonl
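An input file of the kind described above can be produced with a few lines of Python. This is a minimal sketch, assuming one JSON object per line with plain-string values for the reference and decoded keys; write_jsonl is a hypothetical helper and not part of summ_eval.

```python
import json

summaries = ["This is one summary", "This is another summary"]
references = ["This is one reference", "This is another"]


def write_jsonl(path, summaries, references):
    """Write one JSON object per line with `decoded` and `reference` keys."""
    if len(summaries) != len(references):
        raise ValueError("summaries and references must have the same length")
    with open(path, "w", encoding="utf-8") as f:
        for summ, ref in zip(summaries, references):
            # `decoded` holds the system output, `reference` the gold summary.
            f.write(json.dumps({"decoded": summ, "reference": ref}) + "\n")


write_jsonl("data.jsonl", summaries, references)
```

The resulting data.jsonl can then be passed to calc-scores through --jsonl-file.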
For a full list of options, please run:
calc-scores --help
For use in scripts
If you want to use the evaluation metrics as part of other scripts, we have you covered!
from summ_eval.rouge_metric import RougeMetric
rouge = RougeMetric()
Evaluate on a batch
summaries = ["This is one summary", "This is another summary"]
references = ["This is one reference", "This is another"]
rouge_dict = rouge.evaluate_batch(summaries, references)
Evaluate on a single example
rouge_dict = rouge.evaluate_example(summaries[0], references[0])
Evaluate with multiple references
For simplicity, the command-line tool currently does not use multiple references. Each metric has a supports_multi_ref property to tell you if it supports multiple references.
print(rouge.supports_multi_ref) # True
multi_references = [["This is ref 1 for summ 1", "This is ref 2 for summ 1"], ["This is ref 1 for summ 2", "This is ref 2 for summ 2"]]
rouge_dict = rouge.evaluate_batch(summaries, multi_references)
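The nested layout used for multiple references (one list of references per summary) is easy to get wrong when mixing single-reference and multi-reference data. The sketch below shows two hypothetical helpers, as_multi_references and check_batch, that normalize and sanity-check a batch before it is passed to a metric. They are not part of summ_eval, and whether the library performs similar checks itself is not stated here.

```python
def as_multi_references(references):
    """Wrap each bare string reference in a list; copy existing lists."""
    return [[ref] if isinstance(ref, str) else list(ref) for ref in references]


def check_batch(summaries, multi_references):
    """Raise ValueError unless there is one non-empty reference list per summary."""
    if len(summaries) != len(multi_references):
        raise ValueError(
            f"{len(summaries)} summaries but {len(multi_references)} reference lists"
        )
    for i, refs in enumerate(multi_references):
        if not refs:
            raise ValueError(f"summary {i} has no references")


summaries = ["This is one summary", "This is another summary"]
# Mixed input: a bare string for the first summary, a list for the second.
mixed = ["This is one reference", ["This is ref 1 for summ 2", "This is ref 2 for summ 2"]]

multi_references = as_multi_references(mixed)
check_batch(summaries, multi_references)
print(multi_references)
```

After these checks, multi_references has the same shape as the example above and could be handed to a metric whose supports_multi_ref property is True.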
Citation
@article{fabbri2020summeval,
title={SummEval: Re-evaluating Summarization Evaluation},
author={Fabbri, Alexander R and Kry{\'s}ci{\'n}ski, Wojciech and McCann, Bryan and Xiong, Caiming and Socher, Richard and Radev, Dragomir},
journal={arXiv preprint arXiv:2007.12626},
year={2020}
}
Get Involved
Please create a GitHub issue if you have any questions, suggestions, requests, or bug reports. We welcome PRs!