SQuALITY

This repo contains the SQuALITY (Summary-format QUestion Answering with Long Input Texts, Yes!) dataset and supporting code. SQuALITY is a question-focused, long-document, multi-reference summarization dataset. The source documents are short stories from Project Gutenberg, roughly 4000-6000 words long. Stories that also appear in the QuALITY dataset are assigned to the same split as in QuALITY. Each story is paired with a set of five questions, the first of which is always "What is the plot of the story?" Each question has four reference summaries, all written by Upwork writers and NYU undergraduates who consented to having their writing distributed for research purposes.

Authors

Alex Wang, Richard Yuanzhe Pang, Angelica Chen, Jason Phang, Samuel R. Bowman

Data and Format

The dataset lives in data. There are currently two versions of the dataset:

Each data file ({train/dev/test}.jsonl) is formatted as a JSON lines file. Each row in the data file is a JSON dictionary with the following fields:
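Because each line is a standalone JSON object, a split can be read one line at a time. The following is a minimal sketch, not part of the repo's code: the helper name read_jsonl and the path data/train.jsonl are assumptions for illustration (substitute the actual version directory under data). It makes no assumption about field names and only prints the keys of the first row.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON lines file into a list of dicts, one per non-empty line."""
    rows = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between records
                rows.append(json.loads(line))
    return rows


if __name__ == "__main__":
    # Hypothetical path: adjust to the dataset version you want to load.
    split_path = Path("data") / "train.jsonl"
    if split_path.exists():
        rows = read_jsonl(split_path)
        print(f"{len(rows)} rows")
        if rows:
            # Inspect the available fields rather than assuming their names.
            print(sorted(rows[0].keys()))
```

Reading line by line keeps memory use proportional to the parsed rows rather than to the raw file plus the parsed rows, which matters little at this scale but is the usual idiom for JSON lines.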

Baselines

A preliminary script to train our baselines is available in run_summarization.py.

Human Evaluation Data

Human evaluation data is available in data/human-eval.

License

The stories are distributed under the Project Gutenberg license, and the summaries are distributed under a CC BY license. See data/LICENSE.

Acknowledgements

This project has benefited from financial support to SB by Eric and Wendy Schmidt (made by recommendation of the Schmidt Futures program) and Apple, and from in-kind support by the NYU High-Performance Computing Center and Google Cloud. This material is based upon work supported by the National Science Foundation under Grant Nos. 1922658 and 2046556. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Citation

@article{wang2022squality,
  title={S{Q}u{ALITY}: Building a Long-Document Summarization Dataset the Hard Way},
  author={Wang, Alex and Pang, Richard Yuanzhe and Chen, Angelica and Phang, Jason and Bowman, Samuel R.},
  journal={arXiv preprint 2205.11465},
  year={2022}
}

Contact

Open an issue on this repo or email wangalexc at gmail.com