Home

Awesome

RCT-summarization-data

Data from our AMIA 2020 paper "Generating (Factual?) Narrative Summaries of RCTs: Experiments with Neural Multi-Document Summarization". Pre-print: https://arxiv.org/abs/2008.11293.

All data here is from abstracts indexed by and available on PubMed. The inputs are titles and abstracts of articles describing randomized controlled trials (RCTs); the targets are taken as the "Author Conclusions" section from the abstracts of Cochrane (https://www.cochranelibrary.com/) evidence syntheses of the same. These report an overall summary of the evidence conveyed in the constituent trials.

The data format is straightforward: We have divided the data into train, dev, and test sets. For each there are two files, comprising inputs and targets respectively. The former includes the individual trial reports (PMIDs, titles, abstracts) that correspond to a particular evidence synopsis; the latter is the target output (i.e., the authors conclusions as stated in the abstract). Note that there are multiple inputs per target (multi-document summarization), and that the number of trial reports associated with each synthesis varies. See the Jupyter notebook for a bit more explanation.

If you use this data, please cite:

@inproceedings{AMIA-summarization-2021,
    title = {{Generating (Factual?) Narrative Summaries of RCTs: Experiments with Neural Multi-Document Summarization}},
    author = {Byron C. Wallace and Sayantan Saha and Frank Soboczenski and Iain J. Marshall},
    Booktitle = {{Proceedings of AMIA Informatics Summit}},
    year = {2021},
}

Questions to: b.wallace@northeastern.edu.