Factual Consistency in Summarization

Can you tell which summary edits are consistent, and which are inconsistent?

<p align="center"> <img width="650" src="images/summedits_examples.png"> </p>

SummEdits Benchmark

Below is the updated benchmark, including the latest LLMs (Gemini-pro added on 12/14/2023):

| Model Name | Podcast | BillSum | SamSum | News | Sales Call | Sales Email | Shakespeare | SciTLDR | QMSum | ECTSum | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama2-7b | 50 | 50 | 50 | 50.6 | 50.9 | 50 | 50 | 50 | 50.7 | 51.4 | 50.4 |
| Dav001 | 53.3 | 50.2 | 51 | 54.4 | 55.5 | 52.5 | 50 | 51 | 50.1 | 50.9 | 51.9 |
| DAE | 54.4 | 55.1 | 58.7 | 60.9 | 50.4 | 53.6 | 53.6 | 54.7 | 52 | 58.3 | 55.2 |
| Cohere-cmd-xl | 51.1 | 52.7 | 51.3 | 52.6 | 60.2 | 59.4 | 50 | 60.5 | 54.5 | 60.5 | 55.3 |
| Vicuna-13b | 52.8 | 52.5 | 51.3 | 63.5 | 57.9 | 51.8 | 55.4 | 59.7 | 54 | 62.4 | 56.1 |
| SummaCConv | 58.1 | 55.2 | 53.1 | 61.9 | 59 | 53.7 | 59.3 | 59.7 | 53.5 | 57.9 | 57.1 |
| Mistral-7b | 50 | 55.5 | 56.7 | 59.8 | 63.4 | 59.7 | 53.5 | 59.6 | 55.9 | 63.7 | 57.8 |
| Llama2-13b | 51.3 | 54.6 | 57.2 | 59.3 | 63.1 | 58.1 | 58.6 | 63.4 | 56.5 | 61.4 | 58.4 |
| Claude V1.3 | 60.4 | 51.9 | 64.5 | 63.4 | 61.3 | 57 | 58.1 | 57.8 | 56.9 | 68.1 | 59.9 |
| Dav002 | 56.4 | 53.9 | 57.1 | 61.9 | 65.1 | 59.1 | 56.6 | 64.6 | 60.6 | 66.2 | 60.1 |
| Bard | 50 | 58.1 | 61.3 | 71.6 | 73.3 | 70.6 | 58.7 | 66 | 53.9 | 72.7 | 63.6 |
| QAFactEval | 63.7 | 54.2 | 66.2 | 74.4 | 68.4 | 63.6 | 61.6 | 67.5 | 62.4 | 72.6 | 65.5 |
| PaLM-bison | 66 | 62 | 69 | 68.4 | 74.4 | 68.1 | 61.6 | 78.1 | 70.4 | 72.4 | 69 |
| Dav003 | 65.7 | 59.9 | 67.6 | 71 | 78.8 | 69.2 | 69.7 | 74.4 | 72.2 | 77.8 | 70.6 |
| ChatGPT | 68.4 | 63.6 | 69.1 | 74.4 | 79.4 | 65.5 | 68 | 75.6 | 69.2 | 78.6 | 71.2 |
| Claude V2 | 68.7 | 61.7 | 75.4 | 75.5 | 81 | 67.4 | 74 | 78.1 | 74.8 | 79.2 | 73.6 |
| Claude V2.1 | 72.6 | 66 | 75.7 | 77.2 | 82 | 68.5 | 73.2 | 78.6 | 72.7 | 77.1 | 74.4 |
| Gemini-pro | 73.7 | 60.2 | 75.7 | 77.6 | 86.9 | 74.2 | 71.9 | 77.6 | 74 | 83.1 | 75.5 |
| GPT4 | 82.7 | 71.1 | 83.1 | 83.3 | 87.9 | 79.5 | 84 | 82.4 | 79.6 | 87 | 82.1 |
| Human Perf. | 90.8 | 87.5 | 89.4 | 90 | 91.8 | 87.4 | 96.9 | 89.3 | 90.7 | 95.4 | 90.9 |
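The Overall column matches the unweighted mean of the ten per-domain scores. A quick sanity check in Python, using values copied from the Llama2-7b row above:

```python
# Recompute the "Overall" column as the unweighted mean of the ten
# per-domain scores (values copied from the Llama2-7b row above).
llama2_7b_scores = [50, 50, 50, 50.6, 50.9, 50, 50, 50, 50.7, 51.4]

overall = sum(llama2_7b_scores) / len(llama2_7b_scores)
print(round(overall, 1))  # 50.4 -- matches the Overall column
```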

SummEdits Benchmark Release (Sections 6-7)

We release the data for the 10 domains of the SummEdits benchmark in the `data/summedits` folder.

The `SummEdits_Benchmark.ipynb` notebook shows how to access, open, and visualize the dataset.
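As a minimal loading sketch: the file name and field names below (`summary`, `label`) are illustrative assumptions, and the notebook documents the actual schema.

```python
import json
from pathlib import Path

# Minimal sketch: load one SummEdits domain and inspect a few samples.
# NOTE: the file name and field names are illustrative assumptions;
# see SummEdits_Benchmark.ipynb for the authoritative schema.
domain_file = Path("data/summedits") / "summedits_podcast.json"  # hypothetical file name

with open(domain_file) as f:
    samples = json.load(f)

print(f"Loaded {len(samples)} samples")
for sample in samples[:3]:
    # Assumed fields: an edited summary and its consistency label.
    print(sample.get("summary"), "->", sample.get("label"))
```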

FactCC Explanation Analysis (Section 3.5)

As part of the paper, we annotated 3.6k model-generated explanations that justify identifying a summary as inconsistent. The annotations are available in `data/factcc/factcc_explanation_annotation.json`, and the `FactCC_Explanation_Annotation.ipynb` notebook shows how to load and view them.
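For a quick look without the notebook, the file can be loaded directly; the record structure printed below is whatever the file actually contains, since the per-record fields are not assumed here.

```python
import json

# Load the annotated model explanations from Section 3.5.
with open("data/factcc/factcc_explanation_annotation.json") as f:
    annotations = json.load(f)

print(f"{len(annotations)} annotated explanations")

# Inspect one record to discover the available fields; the top-level
# container may be a list or a dict, so handle both defensively.
first = annotations[0] if isinstance(annotations, list) else next(iter(annotations.values()))
print(first)
```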

Prompts

We release all prompts used for the paper's experiments in the `prompts/` folder. More specifically: