
# German Transformer Training

The goal of this repository is to plan the training of German transformer models.

## 1. Datasets / Data Sources

| Dataset | Raw Size / Characters | Quality / Filtered? | URL | Notes / Status | Dupe Factor | Weighted Size (Total = 178 GB) |
| --- | --- | --- | --- | --- | --- | --- |
| German Wikipedia Dump + Comments | 5.4 GB / 5.3b | ++ | | | 10 | 54 GB = 30 % |
| OSCAR Corpus (Common Crawl 2018-47) | 145 GB / 21b words | | [Download](https://oscar-corpus.com/) | | | |
| FB cc_net (Common Crawl 2019-09), head | 75 GB | + | [Code](https://github.com/facebookresearch/cc_net) | More broadly filtered versions (middle & tail) available too | 1 | 75 GB = 42 % |
| EU Book Shop | 2.3 GB / 2.3b | + | | | 5 | 11.5 GB = 6.5 % |
| News 2018 | 4.3 GB / 4.3b | + | | | 5 | 20 GB = 11 % |
| Wortschatz Uni Leipzig | > 20 × 200 MB | | [Download](https://wortschatz.uni-leipzig.de/de/download/german) | Part of News 2018? | | |
| Paracrawl | 3.2 GB / 3.2b | - | | | | |
| Open Subtitles | 1.3 GB / 288m tokens | o | | | 2 | 2.6 GB = 1.5 % |
| Open Legal Dump | 3.6 GB / 3.5b | + | [Announcement](http://openlegaldata.io/research/2019/02/19/court-decision-dataset.html) | Used by Deepset | 5 | 15 GB = 8.4 % |
| Corpus of German-Language Fiction (txt) | 2735 prose works | | [Download](https://figshare.com/articles/Corpus_of_German-Language_Fiction_txt_/4524680/1) | Old (1510-1940) | | |
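The dupe factor is the oversampling multiplier applied when the corpora are concatenated into the training mix: the Wikipedia dump, for example, contributes 5.4 GB × 10 ≈ 54 GB, i.e. about 30 % of the 178 GB total. A minimal sketch of that mixing step (the `clean/*.txt` file layout is an assumption; the factors come from the table):

```python
from pathlib import Path

# Dupe factors from the table above; file names are hypothetical.
DUPE_FACTORS = {
    "wikipedia": 10,
    "cc_net_head": 1,
    "eu_bookshop": 5,
    "news_2018": 5,
    "open_subtitles": 2,
    "open_legal": 5,
}

with open("train_mix.txt", "w", encoding="utf-8") as out:
    for corpus, factor in DUPE_FACTORS.items():
        # Naive oversampling by repetition; loads each corpus fully into memory.
        text = Path(f"clean/{corpus}.txt").read_text(encoding="utf-8")
        out.write(text * factor)
```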

### Additional Sources

- [One Million Posts Corpus](https://ofai.github.io/million-post-corpus/)

### Data Preparation

1. Clean files
2. Split into distinct sentences (2.2 Create Vocab)
3. Tokenize
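A minimal sketch of these three steps, assuming one raw UTF-8 `.txt` file per source in a hypothetical `raw/` folder, NLTK's German Punkt model for sentence splitting, and HuggingFace `tokenizers` for the WordPiece vocab:

```python
from pathlib import Path

import nltk
from tokenizers import BertWordPieceTokenizer

nltk.download("punkt")  # sentence-splitting models, including German

RAW_DIR, CLEAN_DIR = Path("raw"), Path("clean")  # hypothetical layout
CLEAN_DIR.mkdir(exist_ok=True)

# 1. Clean files and 2. split into one sentence per line
for raw_file in RAW_DIR.glob("*.txt"):
    text = raw_file.read_text(encoding="utf-8", errors="ignore")
    text = " ".join(text.split())  # collapse whitespace and control chars
    sentences = nltk.sent_tokenize(text, language="german")
    (CLEAN_DIR / raw_file.name).write_text(
        "\n".join(s for s in sentences if len(s) > 10),  # drop tiny fragments
        encoding="utf-8",
    )

# 2.2 Create vocab (WordPiece, as used by BERT-style models)
tokenizer = BertWordPieceTokenizer(lowercase=False)  # German is case-sensitive
tokenizer.train(
    files=[str(p) for p in CLEAN_DIR.glob("*.txt")],
    vocab_size=32_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".")  # writes vocab.txt

# 3. Tokenize a sample sentence as a sanity check
print(tokenizer.encode("Das ist ein Beispielsatz.").tokens)
```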

## 2. Training


### NLP Libs

### Training Runs from scratch

| Name | Steps | Result URL | Training Time | Code | Paper |
| --- | --- | --- | --- | --- | --- |
| RoBERTa Base | | | | | [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf) |
| BERT Large | | | | [Github](https://arxiv.org/pdf/1907.11692.pdf) | [BERT](https://arxiv.org/abs/1810.04805) |
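The table does not fix a recipe for these runs yet; as a generic starting point, here is a minimal sketch of BERT-style masked-language-model pretraining from scratch with HuggingFace `transformers` and `datasets`, reusing the `vocab.txt` and the one-sentence-per-line files from the data-preparation step (all paths and hyperparameters are illustrative, not the settings of any listed run):

```python
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Tokenizer from the vocab.txt created in the data-preparation step.
tokenizer = BertTokenizerFast(vocab_file="vocab.txt", do_lower_case=False)

# Randomly initialized model; no pretrained weights are loaded.
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))

# One sentence per line, as produced by the cleaning step.
dataset = load_dataset("text", data_files={"train": "clean/*.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bert-base-german-scratch",
        per_device_train_batch_size=8,
        max_steps=100_000,   # illustrative; compare the step counts below
        learning_rate=1e-4,
        warmup_steps=10_000,
        save_steps=10_000,
    ),
    data_collator=DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm_probability=0.15  # standard BERT masking rate
    ),
    train_dataset=dataset,
)
trainer.train()
```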

### TPU Info
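As a starting point, a minimal sketch for verifying that a Cloud TPU (e.g. the v2-8 class used in the runs below) is reachable from TensorFlow 2; the TPU name/address is an assumption that depends on your setup:

```python
import tensorflow as tf

# "local" works on a TPU VM; from a separate host use "grpc://<tpu-ip>:8470".
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="local")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)
print("TPU cores:", strategy.num_replicas_in_sync)  # 8 on a v2-8 or v3-8
```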

## 3. Evaluation Metrics

### Comparison to other German & Multilingual Models

| Name | Steps | Result URL | Training Time | Code | Metrics |
| --- | --- | --- | --- | --- | --- |
| Deepset German BERT Base | 810k (1024 SL) + 30k (512 SL) | [Deepset](https://deepset.ai/german-bert) | 9 days on a TPU v2-8 | | |
| dbmdz German BERT Base | 1500k (512 SL) | [dbmdz](https://github.com/dbmdz/berts#german-bert) | | [dbmdz](https://github.com/dbmdz/berts#german-bert) | [stefan-it](https://github.com/stefan-it/fine-tuned-berts-seq#german) |
| Europeana BERT | | [dbmdz](https://github.com/dbmdz/berts#german-europeana-bert) | | [Europeana-bert](https://github.com/dbmdz/berts#german-europeana-bert) | |
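Alongside the downstream metrics linked above, a quick qualitative comparison is possible with the `transformers` fill-mask pipeline on a cloze example; the model IDs below are the Deepset and dbmdz checkpoints published on the HuggingFace hub (a sanity check, not a substitute for benchmark metrics):

```python
from transformers import pipeline

SENTENCE = "Die Hauptstadt von Deutschland ist [MASK]."

for model_id in ["bert-base-german-cased", "dbmdz/bert-base-german-cased"]:
    fill = pipeline("fill-mask", model=model_id)
    top = fill(SENTENCE)[0]  # highest-probability completion
    print(f"{model_id}: {top['token_str']!r} (p={top['score']:.2f})")
```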

## 4. Contact