German Wikipedia Text Corpus

A more recent version of the text corpus is published here: https://github.com/GermanT5/wikipedia2corpus

This is a German text corpus built from Wikipedia. It has been cleaned, preprocessed, and split into sentences. Its purpose is to train NLP embeddings such as fastText or ELMo (deep contextualized word representations).

The advantage of this text corpus is that it contains not only the article namespace of the wiki but also the talk pages, which yields a larger corpus and more informal language. This should improve the quality of downstream tasks when you process conversational text such as mails, chats, tweets, or support tickets.

How this corpus was generated

We used a German Wikipedia XML dump as the data source.

The WikiExtractor tool was then used to extract plain text from the XML dump. To also include the talk pages, the keepPage function in WikiExtractor was modified:

def keepPage(ns, page):
    if ns != '0' and ns != '1': # keep only the Article (0) and Talk (1) namespaces
        print('skipped ns:', ns)
        return False
    # remove disambig pages if desired
    if options.filter_disambig_pages:
        for line in page:
            if filter_disambig_page_pattern.match(line):
                return False
    return True
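
With this change in place, the extractor is run as usual on the dump. The following is a minimal sketch of the invocation; the dump file name and output directory are hypothetical examples, and the exact options depend on your WikiExtractor version:

# extract text (articles and talk pages) into the directory "extracted"
python WikiExtractor.py -o extracted dewiki-latest-pages-meta-current.xml.bz2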

A hand-crafted Python script was then used for further processing: https://github.com/PhilipMay/de-wiki-text-corpus-tools/blob/master/process_wiki_files.py

Finally, everything was shuffled at sentence level with the Linux shuf command.
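
Because the processed files contain one sentence per line, this step is a single shuf call. A minimal sketch, with hypothetical input and output file names:

# shuffle the corpus line by line (i.e. sentence by sentence)
shuf wiki-all.txt > wiki-all-shuf.txt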

Download

You can download the texts here:

Unpack

Using these commands, you can unpack the files (Linux and macOS):

cat wiki-all-shuf.tgz.part-* > wiki-all-shuf.tgz
tar xvfz wiki-all-shuf.tgz

License

Like Wikipedia itself, this corpus is published under the Creative Commons Attribution-ShareAlike 3.0 Unported license.