nlp-datasets

Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP). Most of the data here is raw, unstructured text; if you are looking for annotated corpora or treebanks, refer to the sources at the bottom.

Datasets (English, multilang)

Sources

Datasets (Albanian)

Datasets (Arabic)

Datasets (Urdu)

Datasets (German)

Datasets (Kinyarwanda and Kirundi)