Datasets for Natural Language Processing
This is a list of datasets and corpora for NLP tasks, grouped by area and listed in reverse chronological order within each area. Suggestions and pull requests are welcome; the goal is a collaboratively maintained, up-to-date list of quality datasets.
Areas
Question Answering
- (NLVR) A Corpus of Natural Language for Visual Reasoning, 2017 [paper] [data]
- (MS MARCO) MS MARCO: A Human Generated MAchine Reading COmprehension Dataset, 2016 [paper] [data]
- (NewsQA) NewsQA: A Machine Comprehension Dataset, 2016 [paper] [data]
- (SQuAD) SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016 [paper] [data]
- (GraphQuestions) On Generating Characteristic-rich Question Sets for QA Evaluation, 2016 [paper] [data]
- (Story Cloze) A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories, 2016 [paper] [data]
- (Children's Book Test) The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations, 2015 [paper] [data]
- (SimpleQuestions) Large-scale Simple Question Answering with Memory Networks, 2015 [paper] [data]
- (WikiQA) WikiQA: A Challenge Dataset for Open-Domain Question Answering, 2015 [paper] [data]
- (CNN-DailyMail) Teaching Machines to Read and Comprehend, 2015 [paper] [code to generate] [data]
- (QuizBowl) A Neural Network for Factoid Question Answering over Paragraphs, 2014 [paper] [data]
- (MCTest) MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text, 2013 [paper] [data] [alternate data link]
- (QASent) What is the Jeopardy Model? A Quasi-Synchronous Grammar for QA, 2007 [paper] [data]
Dialogue Systems
- (Ubuntu Dialogue Corpus) The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems, 2015 [paper] [data]