PoLitBert - Polish RoBERTa model

Polish RoBERTa model trained on Polish Wikipedia, Polish literature and the Oscar corpus. The major assumption is that good-quality text will produce a good model.

We believe in open science and knowledge sharing, so we decided to share the complete code, parameters, experiment details and tensorboards.

Table of Contents

- Experiments setup and goals
- Data
- Training Polish RoBERTa protocol with Fairseq
- Pretrained models and vocabs
- KLEJ evaluation
- Details of models training
- Used libraries
- Installation dependencies and problems
- Acknowledgements
- About Ermlab Software

Experiments setup and goals

During experiments, we want to examine:

Data

Data processing for training

Our main assumption is that good-quality text should produce a good language model. So far, the most popular Polish dataset has been the Polish Wikipedia dump; however, this text is characterized by formal language. The second source of text is the Polish part of the Oscar corpus - text crawled from the Polish internet. When we investigated this corpus in more detail, it turned out to contain many foreign sentences (in Russian, English, German, etc.), sentences that are too short, and ungrammatical sentences (e.g. bare word enumerations).

We prepared a few cleaning heuristics:

Data was cleaned with the process_sentences.py script; the whole process is presented in the polish_process_data.ipynb notebook.
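
As an illustration of these heuristics, the sketch below shows the kind of sentence-level filters described above. It is not the actual process_sentences.py implementation; the file names, thresholds and the langdetect dependency are assumptions.

```python
# Illustrative sketch of the cleaning filters described above.
# NOT the actual process_sentences.py logic; thresholds, file names and the
# use of the langdetect package are assumptions.
from langdetect import detect  # assumed dependency for language detection

MIN_WORDS, MAX_WORDS = 5, 150  # assumed sentence length limits


def looks_grammatical(sentence: str) -> bool:
    """Very rough proxy for 'grammatical': reject bare word enumerations."""
    words = sentence.split()
    # enumerations tend to be long comma-separated lists
    comma_ratio = sentence.count(",") / max(len(words), 1)
    return comma_ratio < 0.5


def keep_sentence(sentence: str) -> bool:
    words = sentence.split()
    if not (MIN_WORDS <= len(words) <= MAX_WORDS):
        return False                      # invalid length
    try:
        if detect(sentence) != "pl":      # non-Polish sentence
            return False
    except Exception:                     # langdetect raises on too little text
        return False
    return looks_grammatical(sentence)    # crude ungrammatical-sentence filter


with open("corpus_oscar_raw.txt", encoding="utf-8") as src, \
     open("corpus_oscar_clean.txt", "w", encoding="utf-8") as dst:
    for line in src:
        sentence = line.strip()
        if sentence and keep_sentence(sentence):
            dst.write(sentence + "\n")
```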

Summary of Cleaned Polish Oscar corpus

| File | All lines | All sentences | Invalid length sent. | Non-Polish sent. | Ungrammatical sent. | Valid sentences |
|---|---|---|---|---|---|---|
| corpus_oscar_2020-04-10_32M_lines.txt | 32 000 506 | 94 332 394 | 1 796 371 | 296 093 | 8 100 750 | 84 139 180 |
| corpus_oscar_2020-04-10_64M_lines.txt | 32 000 560 | 96 614 563 | 1 777 586 | 491 789 | 7 869 507 | 86 475 681 |
| corpus_oscar_2020-04-10_96M_lines.txt | 32 001 738 | 96 457 553 | 1 796 083 | 302 598 | 7 908 090 | 86 450 782 |
| corpus_oscar_2020-04-10_128M_lines.txt | 32 002 212 | 97 761 040 | 1 919 071 | 305 924 | 7 891 846 | 87 644 199 |
| corpus_oscar_2020-04-10_128M_above_lines.txt | 17 519 467 | 53 446 884 | 1 090 714 | 212 657 | 4 343 296 | 47 800 217 |

Training, testing dataset stats

| Train corpus | Lines | Words | Characters |
|---|---|---|---|
| Polish Wikipedia (2020-03) | 11 748 343 | 181 560 313 | 1 309 416 493 |
| Books | 81 140 395 | 829 404 801 | 5 386 053 287 |
| Oscar (32M part, cleaned) | 112 466 497 | 1 198 735 834 | 8 454 177 161 |
| Total | 205 355 235 | 2 209 700 948 | 15 149 646 941 |

For testing, we take ~10% of each corpus.

| Test corpus | Lines | Words | Characters |
|---|---|---|---|
| Polish Wikipedia (2020-03) | 1 305 207 | 21 333 280 | 155 403 453 |
| Books | 9 007 716 | 93 141 853 | 610 111 989 |
| Oscar (32M part, cleaned) | 14 515 735 | 157 303 490 | 1 104 855 397 |
| Total | 24 828 658 | 271 778 623 | 1 870 370 839 |

Training Polish RoBERTa protocol with Fairseq

The general recipe for the final data preparation and model training process is as follows (a minimal sketch of these steps appears after the list):

  1. Prepare a huge text file data.txt, e.g. Wikipedia text, where each sentence is on a new line and articles are separated by two newlines.
  2. Take 10-15M lines and prepare another file for sentencepiece (vocabulary builder) - again, with each sentence on its own line.
  3. Train sentencepiece vocabulary and save it in fairseq format vocab.fairseq.txt.
  4. Encode data.txt with trained sentencepiece model to data.sp.txt.
  5. Preprocess data.sp.txt with fairseq-preprocess.
  6. Run training.
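
The sketch below walks through steps 2-5 under some assumptions: the file names (vocab_sample.txt, valid.sp.txt, data-bin/polish_corpus), the 32k BPE vocabulary size and the dummy-frequency dictionary conversion are illustrative, not the exact values used for PoLitBert (those are in the notebooks referenced below).

```python
# Minimal sketch of steps 2-5 above; file names and hyperparameters are
# illustrative assumptions, see polish_roberta_vocab.ipynb for the real ones.
import subprocess
import sentencepiece as spm

# Step 3: train a sentencepiece model on the sampled file from step 2.
spm.SentencePieceTrainer.train(
    input="vocab_sample.txt",      # 10-15M lines sampled from data.txt (assumed name)
    model_prefix="polish_sp",
    vocab_size=32000,
    model_type="bpe",
)

# Convert the sentencepiece vocab to the fairseq dictionary format
# ("<symbol> <count>" per line, special tokens removed).
with open("polish_sp.vocab", encoding="utf-8") as src, \
     open("vocab.fairseq.txt", "w", encoding="utf-8") as dst:
    for line in src:
        piece = line.split("\t")[0]
        if piece not in ("<unk>", "<s>", "</s>"):
            dst.write(f"{piece} 1\n")  # dummy frequency; fairseq only needs the symbol

# Step 4: encode the full corpus with the trained model.
sp = spm.SentencePieceProcessor()
sp.load("polish_sp.model")
with open("data.txt", encoding="utf-8") as src, \
     open("data.sp.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(" ".join(sp.encode_as_pieces(line.strip())) + "\n")

# Step 5: binarize for fairseq (assumes a validation split encoded the same way).
subprocess.run([
    "fairseq-preprocess", "--only-source",
    "--srcdict", "vocab.fairseq.txt",
    "--trainpref", "data.sp.txt",
    "--validpref", "valid.sp.txt",
    "--destdir", "data-bin/polish_corpus",
    "--workers", "8",
], check=True)
```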

Detailed data preparation steps for fairseq (vocab generation and binarization) are available in a separate notebook, polish_roberta_vocab.ipynb.

Commands needed to reproduce fairseq models with various training protocols may be found in polish_roberta_training.ipynb.
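
For orientation, step 6 is a standard fairseq-train run for masked-LM pretraining. The sketch below follows the public fairseq RoBERTa recipe rather than the exact PoLitBert settings; the schedulers, learning rates and step counts differ per experiment (see the table under "Details of models training" and polish_roberta_training.ipynb).

```python
# Generic RoBERTa masked-LM pretraining call (sketch only); hyperparameters
# follow the public fairseq recipe, not necessarily the PoLitBert runs.
import subprocess

subprocess.run([
    "fairseq-train", "data-bin/polish_corpus",
    "--task", "masked_lm", "--criterion", "masked_lm",
    "--arch", "roberta_base",
    "--sample-break-mode", "complete", "--tokens-per-sample", "512",
    "--optimizer", "adam", "--adam-betas", "(0.9, 0.98)", "--adam-eps", "1e-6",
    "--clip-norm", "0.0",
    "--lr-scheduler", "polynomial_decay", "--lr", "0.0005",
    "--warmup-updates", "10000", "--total-num-update", "125000",
    "--max-update", "125000",
    "--dropout", "0.1", "--attention-dropout", "0.1", "--weight-decay", "0.01",
    # older fairseq releases name this flag --max-sentences
    "--batch-size", "16", "--update-freq", "16",
    "--log-format", "simple", "--log-interval", "100",
], check=True)
```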

Pretrained models and vocabs
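
Assuming a released checkpoint follows the standard fairseq layout (a checkpoint file plus dict.txt in one directory), it can be loaded roughly as sketched below. This is a hypothetical example; the directory layout, file names and the way the sentencepiece tokenizer is wired in are assumptions that may differ per released model.

```python
# Hypothetical loading sketch; directory layout and file names are assumptions.
import sentencepiece as spm
from fairseq.models.roberta import RobertaModel

# The directory is assumed to contain checkpoint_best.pt and the fairseq dict.txt.
roberta = RobertaModel.from_pretrained(
    "PoLitBert_v32k_linear_50k",
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path=".",
)
roberta.eval()

# Tokenize with the same sentencepiece model used for training, then map the
# pieces to fairseq dictionary ids (mirrors what roberta.encode() does).
sp = spm.SentencePieceProcessor()
sp.load("polish_sp.model")
pieces = " ".join(sp.encode_as_pieces("Wikipedia to wolna encyklopedia."))
tokens = roberta.task.source_dictionary.encode_line(
    "<s> " + pieces + " </s>", append_eos=False, add_if_not_exist=False
).long()

features = roberta.extract_features(tokens)  # shape: (1, seq_len, hidden_dim)
print(features.shape)
```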

KLEJ evaluation

All models were evaluated on 26.07.2020 on the 9 KLEJ benchmark tasks. The results below were achieved with the fine-tuning scripts from Polish RoBERTa without any further tweaks, which suggests that the potential of the models may not have been fully utilized yet.

| Model | NKJP-NER | CDSC-E | CDSC-R | CBD | PolEmo2.0-IN | PolEmo2.0-OUT | DYK | PSC | AR | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| PoLitBert_v32k_linear_50k | 92.3 | 91.5 | 92.2 | 64 | 89.8 | 76.1 | 60.2 | 97.9 | 87.6 | 83.51 |
| PoLitBert_v32k_linear_50k_2ep | 91.9 | 91.8 | 90.9 | 64.6 | 89.1 | 75.9 | 59.8 | 97.9 | 87.9 | 83.31 |
| PoLitBert_v32k_tri_125k | 93.6 | 91.7 | 91.8 | 62.4 | 90.3 | 75.7 | 59 | 97.4 | 87.2 | 83.23 |
| PoLitBert_v32k_linear_125k_2ep | 94.3 | 92.1 | 92.8 | 64 | 90.6 | 79.1 | 51.7 | 94.1 | 88.7 | 83.04 |
| PoLitBert_v32k_tri_50k | 93.9 | 91.7 | 92.1 | 57.6 | 88.8 | 77.9 | 56.6 | 96.5 | 87.7 | 82.53 |
| PoLitBert_v32k_linear_125k | 94 | 91.3 | 91.8 | 61.1 | 90.4 | 78.1 | 50.8 | 95.8 | 88.2 | 82.39 |
| PoLitBert_v50k_linear_50k | 92.8 | 92.3 | 91.7 | 57.7 | 90.3 | 80.6 | 42.2 | 97.4 | 88.5 | 81.50 |
| PoLitBert_v32k_cos1_2_50k | 92.5 | 91.6 | 90.7 | 60.1 | 89.5 | 73.5 | 49.1 | 95.2 | 87.5 | 81.08 |
| PoLitBert_v32k_cos1_5_50k | 93.2 | 90.7 | 89.5 | 51.7 | 89.5 | 74.3 | 49.1 | 97.1 | 87.5 | 80.29 |

A comparison with other published models is available in the continuously updated KLEJ leaderboard.

Details of models training


Link to PoLitBert research log (same as below).

| Experiment | Model name | Vocab size | Scheduler | BSZ (batch size) | WPB (words per batch) | Steps | Train tokens | Train loss | Valid loss | Best (test) loss |
|---|---|---|---|---|---|---|---|---|---|---|
| #1 | PoLitBert_v32k_linear_50k (tensorboard) | 32k | linear decay | 8 192 | 4.07E+06 | 50 000 | 2.03E+11 | 1.502 | 1.460 | 1.422 |
| #2 | PoLitBert_v32k_tri_50k (tensorboard) | 32k | triangular | 8 192 | 4.07E+06 | 50 000 | 2.03E+11 | 1.473 | 1.436 | 1.402 |
| #3 | PoLitBert_v32k_cos1_50k (tensorboard) | 32k | cosine mul=1 | 8 192 | 4.07E+06 | 23 030 | 9.37E+10 | 10.930 | 11.000 | 1.832 |
| #4 | PoLitBert_v32k_cos1_2_50k (tensorboard) | 32k | cosine mul=1 peak=0.0005 | 8 192 | 4.07E+06 | 50 000 | 2.03E+11 | 1.684 | 1.633 | 1.595 |
| #5 | PoLitBert_v32k_cos1_3_50k (tensorboard) | 32k | cosine mul=2 | 8 192 | 4.07E+06 | 3 735 | 1.52E+10 | 10.930 | | |
| #6 | PoLitBert_v32k_cos1_4_50k (tensorboard) | 32k | cosine mul=2 grad-clip=0.9 | 8 192 | 4.07E+06 | 4 954 | 2.02E+10 | 10.910 | 10.940 | 2.470 |
| #8 | PoLitBert_v32k_tri_125k (tensorboard) | 32k | triangular | 8 192 | 4.07E+06 | 125 000 | 5.09E+11 | 1.435 | 1.313 | 1.363 |
| #9 | PoLitBert_v32k_cos1_5_50k (tensorboard) | 32k | cosine, mul=2, grad-clip=0.9 | 8 192 | 4.07E+06 | 125 000 | 5.09E+11 | 1.502 | 1.358 | 1.426 |
| #10 | PoLitBert_v32k_linear_125k (tensorboard) | 32k | linear decay | 8 192 | 4.07E+06 | 125 000 | 5.09E+11 | 1.322 | 1.218 | 1.268 |
| #11 | PoLitBert_v50k_linear_50k (tensorboard) | 50k | linear decay | 8 192 | 4.07E+06 | 50 000 | 2.04E+11 | 1.546 | 1.439 | 1.480 |

Used libraries

Installation dependencies and problems

Acknowledgements

This is the joint work of the companies Ermlab Software and Literacka.

Part of the work was financed by a grant from The Polish National Centre for Research and Development, no. POIR.01.01.01-00-1213/19, whose beneficiary was Literacka. Project title: "Asystent wydawniczy - oprogramowanie do analizy treści, wykorzystujące algorytmy sztucznej inteligencji w celu zautomatyzowania procesu wydawniczego i predykcji sukcesów rynkowych publikacji" (Publishing assistant - software for content analysis, using artificial intelligence algorithms to automate the publishing process and predict the market success of publications).

We would like to express our gratitude to the NVIDIA Inception Programme and Amazon AWS for providing free GPU credits - thank you!

Authors:

We also appreciate the help from

About Ermlab Software

Ermlab - Polish machine learning company

:owl: Website | :octocat: Repository

<img src="/images/ermlab_software.png" width="800">