Home

Awesome

Chat corpus repository

This is a chat corpus collection from various open sources, all files are composed of question-answer pairs, where odd lines are questions, even lines are answers.

I use them for training chatbot on seq2seq model. theory: http://arxiv.org/abs/1406.1078 implementation: https://github.com/Marsan-Ma/tf_chatbot_seq2seq_antilm.git

1. open_subtitles

English movie subtitles parsed from http://opus.lingfil.uu.se/download.php?f=OpenSubtitles/en.tar.gz

2. movie_subtitles_en

Cornell Movie-Dialogs Corpus http://www.mpi-sws.org/~cristian/Cornell_Movie-Dialogs_Corpus.html

3. lyrics_zh

lyrics from PTT forum https://www.ptt.cc/bbs/lyrics/index.html

4. twitter_en

corpus scrap from twitter (700k lines), where odd lines are tweet and even lines are corresponding responded tweets. actually you could scrape your own with my twitter scraper repository

5. twitter_en big

twitter corpus in larger size (5M lines), files splitted to walkaround 100m filesize limit,
just cat them to recover the original gz file. cat twitter_en_big.txt.gz.part* > twitter_en_big.txt.gz