# Factual Consistency in Summarization
Can you tell which edits of summaries are consistent, and which are inconsistent?
<p align="center"> <img width="650" src="images/summedits_examples.png"> </p>

## SummEdits Benchmark
Here is the updated benchmark with the latest LLMs (Gemini-pro added on 12/14/2023):
Model Name | Podcast | BillSum | SamSum | News | Sales Call | Sales Email | Shakespeare | SciTLDR | QMSum | ECTSum | Overall |
---|---|---|---|---|---|---|---|---|---|---|---|
Llama2-7b | 50 | 50 | 50 | 50.6 | 50.9 | 50 | 50 | 50 | 50.7 | 51.4 | 50.4 |
Dav001 | 53.3 | 50.2 | 51 | 54.4 | 55.5 | 52.5 | 50 | 51 | 50.1 | 50.9 | 51.9 |
DAE | 54.4 | 55.1 | 58.7 | 60.9 | 50.4 | 53.6 | 53.6 | 54.7 | 52 | 58.3 | 55.2 |
Cohere-cmd-xl | 51.1 | 52.7 | 51.3 | 52.6 | 60.2 | 59.4 | 50 | 60.5 | 54.5 | 60.5 | 55.3 |
Vicuna-13b | 52.8 | 52.5 | 51.3 | 63.5 | 57.9 | 51.8 | 55.4 | 59.7 | 54 | 62.4 | 56.1 |
SummaCConv | 58.1 | 55.2 | 53.1 | 61.9 | 59 | 53.7 | 59.3 | 59.7 | 53.5 | 57.9 | 57.1 |
Mistral-7b | 50 | 55.5 | 56.7 | 59.8 | 63.4 | 59.7 | 53.5 | 59.6 | 55.9 | 63.7 | 57.8 |
Llama2-13b | 51.3 | 54.6 | 57.2 | 59.3 | 63.1 | 58.1 | 58.6 | 63.4 | 56.5 | 61.4 | 58.4 |
Claude-v1.3 | 60.4 | 51.9 | 64.5 | 63.4 | 61.3 | 57 | 58.1 | 57.8 | 56.9 | 68.1 | 59.9 |
Dav002 | 56.4 | 53.9 | 57.1 | 61.9 | 65.1 | 59.1 | 56.6 | 64.6 | 60.6 | 66.2 | 60.1 |
Bard | 50 | 58.1 | 61.3 | 71.6 | 73.3 | 70.6 | 58.7 | 66 | 53.9 | 72.7 | 63.6 |
QAFactEval | 63.7 | 54.2 | 66.2 | 74.4 | 68.4 | 63.6 | 61.6 | 67.5 | 62.4 | 72.6 | 65.5 |
PaLM-bison | 66 | 62 | 69 | 68.4 | 74.4 | 68.1 | 61.6 | 78.1 | 70.4 | 72.4 | 69 |
Dav003 | 65.7 | 59.9 | 67.6 | 71 | 78.8 | 69.2 | 69.7 | 74.4 | 72.2 | 77.8 | 70.6 |
ChatGPT | 68.4 | 63.6 | 69.1 | 74.4 | 79.4 | 65.5 | 68 | 75.6 | 69.2 | 78.6 | 71.2 |
Claude-v2 | 68.7 | 61.7 | 75.4 | 75.5 | 81 | 67.4 | 74 | 78.1 | 74.8 | 79.2 | 73.6 |
Claude-v2.1 | 72.6 | 66 | 75.7 | 77.2 | 82 | 68.5 | 73.2 | 78.6 | 72.7 | 77.1 | 74.4 |
Gemini-pro | 73.7 | 60.2 | 75.7 | 77.6 | 86.9 | 74.2 | 71.9 | 77.6 | 74 | 83.1 | 75.5 |
GPT4 | 82.7 | 71.1 | 83.1 | 83.3 | 87.9 | 79.5 | 84 | 82.4 | 79.6 | 87 | 82.1 |
Human Perf. | 90.8 | 87.5 | 89.4 | 90 | 91.8 | 87.4 | 96.9 | 89.3 | 90.7 | 95.4 | 90.9 |
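
The Overall column appears to be the unweighted mean of the ten per-domain scores; the short check below reproduces it for the GPT4 row using the values from the table above.

```python
# Sanity check (not part of the release): Overall is consistent with the
# unweighted mean of the ten per-domain scores, e.g. for the GPT4 row.
gpt4_scores = [82.7, 71.1, 83.1, 83.3, 87.9, 79.5, 84.0, 82.4, 79.6, 87.0]
print(f"{sum(gpt4_scores) / len(gpt4_scores):.1f}")  # 82.1
```
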
## SummEdits Benchmark Release (Sections 6-7)
We release the data for the 10 domains in the SummEdits benchmark in the `data/summedits` folder.
The `SummEdits_Benchmark.ipynb` notebook provides information on how to access, open, and visualize the dataset.
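
If you only want to skim the files without the notebook, the minimal sketch below lists each domain file and peeks at its fields. It assumes each file under `data/summedits` is a JSON list of sample dictionaries; the notebook remains the authoritative reference for the schema.

```python
import json
from pathlib import Path

# Minimal sketch for browsing the released files; assumes each file in
# data/summedits is a JSON list of sample dictionaries (see the notebook
# for the exact schema and recommended loading code).
for fpath in sorted(Path("data/summedits").glob("*.json")):
    with open(fpath) as f:
        samples = json.load(f)
    print(f"{fpath.name}: {len(samples)} samples")
    if samples:
        # Peek at the first sample's keys to discover the field names.
        print("  fields:", sorted(samples[0].keys()))
```
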
## FactCC Explanation Analysis (Section 3.5)
As part of the paper, we annotated 3.6k model-generated explanations justifying the choice to label a summary as inconsistent. The annotations are available in `data/factcc/factcc_explanation_annotation.json`. The `FactCC_Explanation_Annotation.ipynb` notebook shows how to load and view the annotations.
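
A bare-bones way to load the annotation file is sketched below; the record structure shown in the notebook is authoritative, so the field inspection here is only a starting point.

```python
import json

# Load the explanation annotations and inspect their structure; the exact
# record format is documented in FactCC_Explanation_Annotation.ipynb.
with open("data/factcc/factcc_explanation_annotation.json") as f:
    annotations = json.load(f)

print(f"Loaded {len(annotations)} top-level entries")
first = annotations[0] if isinstance(annotations, list) else next(iter(annotations.values()))
print("Example entry:", list(first) if isinstance(first, dict) else first)
```
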
## Prompts
We release all prompts used in the paper's experiments in the `prompts/` folder. More specifically:
- `summedits/factcc` is a folder that contains the 26 prompts we experimented with in the initial FactCC experiments (Section 3.1).
- `summedits/step2_consistent.txt` and `summedits/step2_inconsistent.txt` were the prompts used in Step 2 of the SummEdits protocol to generate edits of seed summaries (Section 5.2).
- `summedits/standard_zs_prompt.txt` is the zero-shot prompt used to assess all LLMs on the SummEdits benchmark (Section 6.3); see the sketch after this list for one way it can be filled in.
- `summedits/edit_typing_gpt4.txt` is a few-shot prompt used to predict the types of edits for inconsistent samples in SummEdits (Section 6.4).
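
As a rough illustration of how the zero-shot prompt might be used, the sketch below fills a document/summary pair into the template. The placeholder tokens `[DOCUMENT]` and `[SUMMARY]` are assumptions; open `prompts/summedits/standard_zs_prompt.txt` to see the actual placeholders before adapting this.

```python
# Hedged sketch: fill the zero-shot prompt with a (document, summary) pair.
# The placeholder names below are assumptions; open the prompt file to see
# the template actually used in the paper.
with open("prompts/summedits/standard_zs_prompt.txt") as f:
    template = f.read()

document = "The company reported Q3 revenue of $10M, up 5% year over year."
summary = "The company reported Q3 revenue of $10M, down 5% year over year."

prompt = template.replace("[DOCUMENT]", document).replace("[SUMMARY]", summary)
print(prompt)
# The filled prompt is then sent to an LLM, and the model's answer is mapped
# to a consistent / inconsistent prediction for scoring against the labels.
```
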