# Awesome German Transformer Training
The goal of this repository is to plan the training of German transformer models.
## 1. Datasets / Data Sources
- GermEval 2017: https://sites.google.com/view/germeval2017-absa/data
Dataset | Raw Size / Characters | Quality/Filtered? | URL | Notes/Status | Dupe Factor | Weighted Size (Total = 178 GB)
---|---|---|---|---|---|---
German Wikipedia Dump + Comments | 5.4 GB / 5.3b | ++ | | | 10 | 54 GB = 30 %
Oscar Corpus (Common Crawl 2018-47) | 145 GB / 21b words | | <a href='https://oscar-corpus.com/'>Download</a> | | |
FB cc_net (Common Crawl 2019-09) | Head: 75 GB | + | <a href='https://github.com/facebookresearch/cc_net'>Code</a> | More broadly filtered versions (middle & tail) available too | 1 | 75 GB = 42 %
EU Book Shop | 2.3 GB / 2.3b | + | | | 5 | 11.5 GB = 6.5 %
News 2018 | 4.3 GB / 4.3b | + | | | 5 | 20 GB = 11 %
Wortschatz Uni Leipzig | > 20 × 200 MB | | <a href='https://wortschatz.uni-leipzig.de/de/download/german'>Download</a> | Possibly part of News 2018? | |
Paracrawl | 3.2 GB / 3.2b | -- | | | |
Open Subtitles | 1.3 GB / 288m tokens | o | | | 2 | 2.6 GB = 1.5 %
Open Legal Dump | 3.6 GB / 3.5b | + | <a href='http://openlegaldata.io/research/2019/02/19/court-decision-dataset.html'>Announcement</a> | Used by Deepset | 5 | 15 GB = 8.4 %
Corpus of German-Language Fiction (txt) | 2735 prose works | | <a href='https://figshare.com/articles/Corpus_of_German-Language_Fiction_txt_/4524680/1'>Download</a> | Old (1510-1940) | |
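To make the Dupe Factor column concrete: each corpus is repeated dupe-factor times in the mix, so its weighted size is raw size × dupe factor. A tiny sketch of that arithmetic, with raw sizes copied from the table above (the table rounds a few weighted entries, so the sum lands near, not exactly at, 178 GB):

```python
# Weighted-mix arithmetic for the corpora with a dupe factor in the table
# above: (raw size in GB, dupe factor).
corpora = {
    "German Wikipedia + Comments": (5.4, 10),
    "FB cc_net (head)": (75.0, 1),
    "EU Book Shop": (2.3, 5),
    "News 2018": (4.3, 5),
    "Open Subtitles": (1.3, 2),
    "Open Legal Dump": (3.6, 5),
}

total = sum(size * dupe for size, dupe in corpora.values())
for name, (size, dupe) in corpora.items():
    weighted = size * dupe
    print(f"{name}: {weighted:.1f} GB ({weighted / total:.1%} of the mix)")
print(f"Total: {total:.1f} GB")
```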
- One Million Posts Corpus: https://ofai.github.io/million-post-corpus/
### Additional Sources
- Originally meant for translation tasks: <a href='http://www.statmt.org/wmt19/translation-task.html#download'>WMT 19</a>
- Maybe identical to News 2018? <a href='https://datasetsearch.research.google.com/search?query=german&docid=37NTDqMDLv%2BKtj8QAAAAAA%3D%3D'>Leipzig Corpus Collection</a>
- <a href ='https://www.ims.uni-stuttgart.de/en/research/resources/corpora/hgc/'>Huge German Corpus (HGC)</a>
### Data Preparation

- Clean files
- Split into distinct sentences (see 2.2 Create Vocab)
- Tokenize (a minimal sketch of the whole pipeline follows below)
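A minimal sketch of the three steps, assuming raw `.txt` files and the Hugging Face `tokenizers` library. The paths, the regex-based sentence splitter, and the vocabulary size are placeholder assumptions; a German-aware splitter (e.g. SoMaJo or NLTK's punkt) would do better on step 2:

```python
# Sketch of the preparation pipeline (paths and parameters are placeholders).
import re
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer  # pip install tokenizers

RAW_DIR, CLEAN_DIR = Path("data/raw"), Path("data/clean")
CLEAN_DIR.mkdir(parents=True, exist_ok=True)

def clean(text: str) -> str:
    """Step 1: drop control characters and collapse runs of spaces/tabs."""
    text = re.sub(r"[\x00-\x08\x0b-\x1f]", " ", text)
    return re.sub(r"[ \t]+", " ", text).strip()

# Step 2: split into distinct sentences, one per line (naive punctuation split).
for src in RAW_DIR.glob("*.txt"):
    raw = src.read_text(encoding="utf-8", errors="ignore")
    sentences = re.split(r"(?<=[.!?])\s+", clean(raw))
    (CLEAN_DIR / src.name).write_text(
        "\n".join(s for s in sentences if s), encoding="utf-8"
    )

# Step 3: train a byte-level BPE vocabulary on the cleaned corpus.
Path("tokenizer").mkdir(exist_ok=True)
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=[str(p) for p in CLEAN_DIR.glob("*.txt")],
    vocab_size=32_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("tokenizer")  # writes vocab.json and merges.txt
```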
## 2. Training
- Pre-training SmallBERTa - A tiny model to train on a tiny dataset: https://gist.github.com/aditya-malte/2d4f896f471be9c38eb4d723a710768b
- Pretraining RoBERTa using your own data (Fairseq): https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.pretraining.md
- How to train a new language model from scratch using Transformers and Tokenizers: https://huggingface.co/blog/how-to-train
- Language model training: https://github.com/huggingface/transformers/tree/master/examples/language-modeling
- Companion Colab notebook for the blog post above: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb (a condensed sketch follows below)
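A condensed sketch of the from-scratch masked-LM recipe from the linked blog post. The tokenizer directory and corpus path come from the preparation sketch above; the model size and hyperparameters are placeholder assumptions for a small first run:

```python
# From-scratch masked-LM pretraining with Transformers (condensed sketch).
from transformers import (
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("tokenizer", max_len=512)
config = RobertaConfig(
    vocab_size=32_000,
    max_position_embeddings=514,
    num_hidden_layers=6,  # small model for a first run
)
model = RobertaForMaskedLM(config=config)

# One sentence per line, as produced by the preparation step.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="data/clean/corpus.txt", block_size=128
)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="checkpoints",
        num_train_epochs=1,
        per_device_train_batch_size=16,
        save_steps=10_000,
    ),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```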
### NLP Libs

- Fairseq: <a href='https://github.com/pytorch/fairseq'>GitHub</a>
- Hugging Face Transformers: <a href='https://github.com/huggingface/transformers'>GitHub</a>
- FARM: <a href='https://github.com/deepset-ai/FARM'>GitHub</a>
- DeepSpeed - Speeding up BERT training @ Microsoft: <a href='https://github.com/microsoft/DeepSpeed'>GitHub</a>
### Training Runs from Scratch
Name | Steps | Result URL | Training Time | Code | Paper
---|---|---|---|---|---
RoBERTa Base | | | | | <a href='https://arxiv.org/pdf/1907.11692.pdf'>RoBERTa</a>
BERT Large | | | | <a href='https://github.com/google-research/bert'>GitHub</a> | <a href='https://arxiv.org/abs/1810.04805'>BERT</a>
### TPU Info

- Overview of preemptible TPUs: <a href='https://github.com/shawwn/tpunicorn'>TPU Unicorn</a>
## 3. Evaluation Metrics
### Comparison to Other German & Multilingual Models
Name | Steps | Result URL | Training Time | Code | Metrics
---|---|---|---|---|---
Deepset German BERT Base | 810k (128 SL, batch 1024) + 30k (512 SL) | <a href='https://deepset.ai/german-bert'>Deepset</a> | 9 days on a TPU v2-8 | |
dbmdz German BERT Base | 1500k (512 SL) | <a href='https://github.com/dbmdz/berts#german-bert'>dbmdz</a> | | <a href='https://github.com/dbmdz/berts#german-bert'>dbmdz</a> | <a href='https://github.com/stefan-it/fine-tuned-berts-seq#german'>stefan-it</a>
Europeana BERT | | <a href='https://github.com/dbmdz/berts#german-europeana-bert'>dbmdz</a> | | <a href='https://github.com/dbmdz/berts#german-europeana-bert'>Europeana-bert</a> |
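For a quick qualitative comparison of the two public German BERTs (a smoke test, not a benchmark; proper downstream metrics are in the linked stefan-it repo), the `fill-mask` pipeline can probe both with a cloze sentence. The model ids below are the public Hugging Face hub names:

```python
# Cloze smoke test comparing the deepset and dbmdz German BERT base models.
from transformers import pipeline

for model_id in ("bert-base-german-cased", "bert-base-german-dbmdz-cased"):
    fill = pipeline("fill-mask", model=model_id)
    top = fill("Berlin ist die [MASK] von Deutschland.")[0]
    print(f"{model_id}: {top['token_str']} (score {top['score']:.3f})")
```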