Datasets for Natural Language Processing

This is a list of datasets/corpora for NLP tasks, listed in reverse chronological order within each area. Suggestions and pull requests are welcome. The goal is a collaboratively maintained, up-to-date list of high-quality datasets.

Areas

Question Answering
Dialogue Systems
Goal-Oriented Dialogue Systems

Question Answering

(NLVR) A Corpus of Natural Language for Visual Reasoning, 2017 [paper] [data]
(MS MARCO) MS MARCO: A Human Generated MAchine Reading COmprehension Dataset, 2016 [paper] [data]
(NewsQA) NewsQA: A Machine Comprehension Dataset, 2016 [paper] [data]
(SQuAD) SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016 [paper] [data]
(GraphQuestions) On Generating Characteristic-rich Question Sets for QA Evaluation, 2016 [paper] [data]
(Story Cloze) A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories, 2016 [paper] [data]
(Children's Book Test) The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations, 2015 [paper] [data]
(SimpleQuestions) Large-scale Simple Question Answering with Memory Networks, 2015 [paper] [data]
(WikiQA) WikiQA: A Challenge Dataset for Open-Domain Question Answering, 2015 [paper] [data]
(CNN-DailyMail) Teaching Machines to Read and Comprehend, 2015 [paper] [code to generate] [data]
(QuizBowl) A Neural Network for Factoid Question Answering over Paragraphs, 2014 [paper] [data]
(MCTest) MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text, 2013 [paper] [data] [alternate data link]
(QASent) What is the Jeopardy Model? A Quasi-Synchronous Grammar for QA, 2007 [paper] [data]

Dialogue Systems

(Ubuntu Dialogue Corpus) The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems, 2015 [paper] [data]

Goal-Oriented Dialogue Systems

(Frames) Frames: A Corpus for Adding Memory to Goal-Oriented Dialogue Systems, 2016 [paper] [data]
(DSTC 2 & 3) Dialog State Tracking Challenge 2 & 3, 2013 [paper] [data]