Wikipedia 2 Corpus

Tools to extract and clean Wikipedia texts and transform them into a text corpus for self-supervised NLP model training. A prepared corpus for English and German is also included (see below).

We use WikiExtractor to extract the text from the Wikipedia database dumps. The texts are then split into sentences with SoMaJo. Each line of the text corpus contains exactly one sentence, and a blank line separates consecutive Wikipedia articles.
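
The corpus layout described above (one sentence per line, a blank line between articles) can be read back into per-article sentence lists with a few lines of Python. This is only a minimal sketch: the function name `iter_articles` and the sample sentences are hypothetical and not part of the project's scripts.

```python
from typing import Iterable, Iterator, List


def iter_articles(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of sentences per article.

    Assumes the layout described above: one sentence per line and
    a blank line between two articles.
    """
    sentences: List[str] = []
    for line in lines:
        sentence = line.rstrip("\n")
        if sentence:
            sentences.append(sentence)
        elif sentences:
            # A blank line closes the current article.
            yield sentences
            sentences = []
    if sentences:
        # The last article may not be followed by a blank line.
        yield sentences


if __name__ == "__main__":
    # Hypothetical in-memory sample; for a real corpus, pass an open
    # file object instead, e.g. open(path, encoding="utf-8").
    sample = ["First article, sentence one.\n",
              "First article, sentence two.\n",
              "\n",
              "Second article, sentence one.\n"]
    for article in iter_articles(sample):
        print(len(article), "sentence(s):", article[0])
```

Because the function is a generator, a large corpus file can be processed article by article without loading it into memory at once.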

Remove Blank Lines

If you want to remove the blank lines from the text corpus, you can use this command (note that -i edits the file in place): sed -i '/^$/d' <filename>
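
If sed is not available, the same filtering can be sketched in Python. The helper names `drop_blank_lines` and `remove_blank_lines` are hypothetical and not part of the project. Unlike `sed -i`, this sketch writes to a separate output file instead of editing in place.

```python
from typing import Iterable, Iterator


def drop_blank_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield every line that is not empty, mirroring sed '/^$/d'.

    Only truly empty lines are dropped; lines that contain spaces
    or other whitespace are kept, as sed would keep them.
    """
    for line in lines:
        if line.rstrip("\n"):
            yield line


def remove_blank_lines(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path without empty lines (paths are the caller's choice)."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(drop_blank_lines(src))
```

Keep in mind that after removing the blank lines the article boundaries are no longer recoverable from the file.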

Download the German Text Corpus

Download the English Text Corpus

How you can replicate our work

License

The Text Corpus

Like Wikipedia itself, the text corpus is published under the Creative Commons Attribution-ShareAlike 3.0 Unported license.

The Script

Copyright (c) 2022 Philip May

Licensed under the MIT License (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License by reviewing the file LICENSE in the repository.