Reddit-QA-corpus-

A question and answer corpus from the /r/askreddit subreddit. Designed for training seq2seq neural networks. The total corpus is 4,976,760 question and answer pairs.

Creation

Data was gathered from the http://pushshift.io Reddit data dump on 15/06/2018.

All submissions to the subreddit were taken as questions, and the first comment on each submission was taken as the answer.
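
The pairing rule above can be sketched as follows. This is a minimal illustration, not the script used to build the corpus. It assumes submissions and comments are already loaded as dicts with the usual Reddit fields (`id`, `title`, `link_id`, `parent_id`, `body`, `created_utc`), and it assumes "first comment" means the earliest top-level comment. The function name `build_pairs` is hypothetical.

```python
def build_pairs(submissions, comments):
    """Pair each submission title with its earliest top-level comment.

    Assumption: "first comment" is read as the top-level comment with the
    smallest created_utc; the original corpus may have used another rule.
    """
    first_comment = {}
    for comment in comments:
        # A top-level comment replies to the submission itself,
        # so its parent_id equals its link_id (both look like "t3_<id>").
        if comment["parent_id"] != comment["link_id"]:
            continue
        submission_id = comment["link_id"].split("_", 1)[1]
        current = first_comment.get(submission_id)
        if current is None or comment["created_utc"] < current["created_utc"]:
            first_comment[submission_id] = comment

    pairs = []
    for submission in submissions:
        comment = first_comment.get(submission["id"])
        if comment is not None:
            # Submissions without any top-level comment yield no pair.
            pairs.append((submission["title"], comment["body"]))
    return pairs
```

Submissions with no comments produce no pair, which is consistent with the corpus having fewer pairs than questions.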

Cleaning

Data

Each file is just under half a GB, which is why the files are stored on a remote server.

Numbers

7,102,717 Questions

12,039,795 Answers

4,976,759 Question and Answer pairs

Referencing

If used for academic purposes, please contact fionnd [at] pm.me for full referencing information.

@misc{redditqa,
    author = "{Fionn Delahunty}",
    title = {{Reddit QA Corpus}},
    howpublished = {\url{https://github.com/FionnD/Reddit-QA-Corpus}},
    note = {Online; accessed XXX},
    year = 2018,
}