PoLitBert - Polish RoBERTa model
Polish RoBERTa model trained on Polish Wikipedia, Polish literature and the Oscar corpus. Our main assumption is that good-quality text will give a good model.
We believe in open science and knowledge sharing, thus we decided to share the complete code, parameters, experiment details and tensorboards.
Table of Contents
- Experiments setup and goals
- Data
- Training Polish RoBERTa protocol with Fairseq
- Pretrained models and vocabs
- Used libraries
- Acknowledgements
- About Ermlab Software
Experiments setup and goals
During the experiments we want to examine:
- the impact of different learning-rate schedulers on training speed and accuracy; tested schedules (a sketch follows this list):
- linear schedule with warmup
- cyclic schedule: cosine, triangular
- the impact of training time on final accuracy
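For illustration, below is a minimal sketch of the three schedule families. It is a simplification, not Fairseq's exact polynomial_decay, triangular and cosine implementations; the 0.0005 peak LR matches the value in the training table further below, while the warmup and cycle lengths are assumed values.

```python
import math

# Simplified learning-rate schedules compared in the experiments
# (illustrative only; the real runs used Fairseq's built-in schedulers).

def linear_with_warmup(step, peak_lr=5e-4, warmup=10_000, total=50_000):
    """Linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * max(0.0, (total - step) / (total - warmup))

def triangular(step, min_lr=1e-7, max_lr=5e-4, period=25_000):
    """Cyclic triangular schedule: LR bounces between min_lr and max_lr."""
    half = period / 2
    pos = step % period
    frac = pos / half if pos < half else (period - pos) / half
    return min_lr + (max_lr - min_lr) * frac

def cosine_restarts(step, min_lr=1e-7, max_lr=5e-4, first_period=12_500, t_mult=2):
    """Cosine annealing with warm restarts; each cycle is t_mult times longer."""
    period, start = first_period, 0
    while step >= start + period:
        start += period
        period *= t_mult
    t = (step - start) / period
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))

# Print a few points of each schedule to compare their shapes.
for s in range(0, 50_000, 10_000):
    print(s, linear_with_warmup(s), triangular(s), cosine_restarts(s))
```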
Data
- Polish Wikipedia dump 03.2020 - archive link https://dumps.wikimedia.org/plwiki/20200301 (no longer available)
- Polish private book corpus (6 GB)
- Cleaned Polish Oscar corpus (non-Polish sentences removed, only valid sentences kept, etc.) (Cleaned Polish Oscar details)
Data processing for training
Our main assumption is that good-quality text should produce a good language model. So far the most popular Polish dataset has been the Polish Wikipedia dump, but its text is written in rather formal language. The second source of text is the Polish part of the Oscar corpus, i.e. text crawled from the Polish internet. A closer investigation of this corpus showed that it contains many foreign sentences (in Russian, English, German etc.), sentences that are too short, and ungrammatical sentences (e.g. bare word enumerations).
We prepared a few cleaning heuristics:
- remove sentences shorter than a minimum number of words
- remove non-Polish sentences
- remove ungrammatical sentences (no verbs or too many nouns)
- perform sentence tokenization and save each sentence on a new line; an empty line is added after each document
The data was cleaned with the process_sentences.py script; the whole process is presented in the polish_process_data.ipynb notebook. A simplified sketch of the filters follows the list below.
- Polish Wikipedia dump (03.2020)
- Cleaned Polish Oscar corpus
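A simplified sketch of these heuristics, assuming langdetect for language detection and a hypothetical `tagger` callable standing in for the dockerized KRNNT tagger; the thresholds are illustrative, not the project's exact values.

```python
# Simplified sketch of the cleaning heuristics; the real filtering lives in
# process_sentences.py and uses KRNNT for morphological tagging.
from langdetect import detect  # pip install langdetect

MIN_WORDS = 3         # assumed minimum sentence length (in words)
MAX_NOUN_RATIO = 0.8  # assumed upper bound on the share of nouns

def is_polish(sentence):
    """Keep only sentences detected as Polish."""
    try:
        return detect(sentence) == "pl"
    except Exception:  # langdetect fails on very short / unusual input
        return False

def is_grammatical(sentence, tagger):
    """Reject sentences without a verb or dominated by nouns.

    `tagger` is a hypothetical wrapper around the dockerized KRNNT tagger,
    returning a coarse part-of-speech label ("verb", "noun", ...) per token.
    """
    tags = tagger(sentence)
    if "verb" not in tags:
        return False
    return tags.count("noun") / max(len(tags), 1) <= MAX_NOUN_RATIO

def keep_sentence(sentence, tagger):
    return (len(sentence.split()) >= MIN_WORDS
            and is_polish(sentence)
            and is_grammatical(sentence, tagger))
```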
Summary of Cleaned Polish Oscar corpus
File | All lines | All sentences | Invalid length sent. | Non-Polish sent. | Ungrammatical sent. | Valid sentences |
---|---|---|---|---|---|---|
corpus_oscar_2020-04-10_32M_lines.txt | 32 000 506 | 94 332 394 | 1 796 371 | 296 093 | 8 100 750 | 84 139 180 |
corpus_oscar_2020-04-10_64M_lines.txt | 32 000 560 | 96 614 563 | 1 777 586 | 491 789 | 7 869 507 | 86 475 681 |
corpus_oscar_2020-04-10_96M_lines.txt | 32 001 738 | 96 457 553 | 1 796 083 | 302 598 | 7 908 090 | 86 450 782 |
corpus_oscar_2020-04-10_128M_lines.txt | 32 002 212 | 97 761 040 | 1 919 071 | 305 924 | 7 891 846 | 87 644 199 |
corpus_oscar_2020-04-10_128M_above_lines.txt | 17 519 467 | 53 446 884 | 1 090 714 | 212 657 | 4 343 296 | 47 800 217 |
Training and testing dataset statistics
Train Corpus | Lines | Words | Characters |
---|---|---|---|
Polish Wikipedia (2020-03) | 11 748 343 | 181 560 313 | 1 309 416 493 |
Books | 81 140 395 | 829 404 801 | 5 386 053 287 |
Oscar (32M part, cleaned) | 112 466 497 | 1 198 735 834 | 8 454 177 161 |
Total | 205 355 235 | 2 209 700 948 | 15 149 646 941 |
For testing we take ~10% of each corpus (a rough sketch of the split follows the table below).
Test Corpus | Lines | Words | Characters |
---|---|---|---|
Polish Wikipedia (2020-03) | 1 305 207 | 21 333 280 | 155 403 453 |
Books | 9 007 716 | 93 141 853 | 610 111 989 |
Oscar (32M part, cleaned) | 14 515 735 | 157 303 490 | 1 104 855 397 |
Total | 24 828 658 | 271 778 623 | 1 870 370 839 |
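A rough sketch of such a hold-out, assuming documents are blank-line-separated blocks of sentence-per-line text; the exact split used in the project may differ.

```python
# Hold out roughly every 10th document (~10%) for testing; the document-level
# split and the file layout are assumptions, not the project's exact procedure.
def split_corpus(path, train_path, test_path, test_every=10):
    with open(path, encoding="utf-8") as f:
        documents = f.read().split("\n\n")
    with open(train_path, "w", encoding="utf-8") as train, \
         open(test_path, "w", encoding="utf-8") as test:
        for i, doc in enumerate(documents):
            if not doc.strip():
                continue
            out = test if i % test_every == 0 else train
            out.write(doc.strip() + "\n\n")
```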
Training Polish RoBERTa protocol with Fairseq
General recipe of the final data preparation and model training process:
- Prepare a huge text file data.txt, e.g. Wikipedia text, where each sentence is on a new line and each article is separated by two new lines.
- Take 10-15M lines and prepare another file for sentencepiece (vocabulary builder) - again, with each sentence on one line.
- Train sentencepiece vocabulary and save it in fairseq format vocab.fairseq.txt.
- Encode data.txt with trained sentencepiece model to data.sp.txt.
- Preprocess data.sp.txt with fairseq-preprocess.
- Run training.
Detailed data preparation steps for fairseq (vocab gen and binarization) are available in separate notebook polish_roberta_vocab.ipynb.
Commands needed to reproduce fairseq models with various training protocols may be found in polish_roberta_training.ipynb.
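For illustration, here is a minimal Python sketch of steps 2-5, using the sentencepiece API and fairseq-preprocess; the file names, vocabulary size and the conversion to vocab.fairseq.txt are assumptions based on the standard RoBERTa recipe, and the exact commands are in the notebooks above.

```python
# Sketch of steps 2-5 (assumed file names and settings; see the notebooks for
# the exact commands used in the project).
import subprocess
import sentencepiece as spm

# 2-3. Train a sentencepiece vocabulary on a 10-15M line sample of data.txt.
spm.SentencePieceTrainer.Train(
    "--input=vocab_sample.txt --model_prefix=polish_sp32k "
    "--vocab_size=32000 --model_type=bpe"
)

# 4. Encode the full corpus with the trained model.
sp = spm.SentencePieceProcessor()
sp.Load("polish_sp32k.model")
with open("data.txt", encoding="utf-8") as fin, \
     open("data.sp.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(" ".join(sp.EncodeAsPieces(line.strip())) + "\n")

# 5. Binarize with fairseq-preprocess, using the sentencepiece vocabulary
#    converted to fairseq's dictionary format (vocab.fairseq.txt).
subprocess.run([
    "fairseq-preprocess", "--only-source",
    "--srcdict", "vocab.fairseq.txt",
    "--trainpref", "data.sp.txt",
    "--validpref", "valid.sp.txt",   # assumed held-out file
    "--destdir", "data-bin/polish",
    "--workers", "8",
], check=True)
```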
Pretrained models and vocabs
- PoLitBert_v32k_linear_50k
- PoLitBert_v32k_tri_50k
- PoLitBert_v32k_cos1_2_50k
- PoLitBert_v32k_tri_125k
- PoLitBert_32k_cos1_5
- PoLitBert_v32k_cos1_5_50k
- PoLitBert_v50k_linear_50k
KLEJ evaluation
All models were evaluated on 26.07.2020 on the 9 KLEJ benchmark tasks. The results below were obtained with the fine-tuning scripts from Polish RoBERTa without any further tweaks, which suggests that the potential of the models may not be fully utilized yet.
Model | NKJP-NER | CDSC-E | CDSC-R | CBD | PolEmo2.0-IN | PolEmo2.0-OUT | DYK | PSC | AR | Avg |
---|---|---|---|---|---|---|---|---|---|---|
PoLitBert_v32k_linear_50k | 92.3 | 91.5 | 92.2 | 64 | 89.8 | 76.1 | 60.2 | 97.9 | 87.6 | 83.51 |
PoLitBert_v32k_linear_50k_2ep | 91.9 | 91.8 | 90.9 | 64.6 | 89.1 | 75.9 | 59.8 | 97.9 | 87.9 | 83.31 |
PoLitBert_v32k_tri_125k | 93.6 | 91.7 | 91.8 | 62.4 | 90.3 | 75.7 | 59 | 97.4 | 87.2 | 83.23 |
PoLitBert_v32k_linear_125k_2ep | 94.3 | 92.1 | 92.8 | 64 | 90.6 | 79.1 | 51.7 | 94.1 | 88.7 | 83.04 |
PoLitBert_v32k_tri_50k | 93.9 | 91.7 | 92.1 | 57.6 | 88.8 | 77.9 | 56.6 | 96.5 | 87.7 | 82.53 |
PoLitBert_v32k_linear_125k | 94 | 91.3 | 91.8 | 61.1 | 90.4 | 78.1 | 50.8 | 95.8 | 88.2 | 82.39 |
PoLitBert_v50k_linear_50k | 92.8 | 92.3 | 91.7 | 57.7 | 90.3 | 80.6 | 42.2 | 97.4 | 88.5 | 81.50 |
PoLitBert_v32k_cos1_2_50k | 92.5 | 91.6 | 90.7 | 60.1 | 89.5 | 73.5 | 49.1 | 95.2 | 87.5 | 81.08 |
PoLitBert_v32k_cos1_5_50k | 93.2 | 90.7 | 89.5 | 51.7 | 89.5 | 74.3 | 49.1 | 97.1 | 87.5 | 80.29 |
A comparison with other models is available on the continuously updated KLEJ leaderboard.
Details of model training
In the spirit of open science, the complete training details are shared in the PoLitBert research log (the same data as in the table below).
Experiment | Model name | Vocab size | Scheduler | BSZ (batch size, sequences) | WPB (words per batch) | Steps | Train tokens | Train loss | Valid loss | Best (test) loss |
---|---|---|---|---|---|---|---|---|---|---|
#1 | PoLitBert_v32k_linear_50k (tensorboard) | 32k | linear decay | 8 192 | 4.07E+06 | 50 000 | 2.03E+11 | 1.502 | 1.460 | 1.422 |
#2 | PoLitBert_v32k_tri_50k (tensorboard) | 32k | triangular | 8 192 | 4.07E+06 | 50 000 | 2.03E+11 | 1.473 | 1.436 | 1.402 |
#3 | PoLitBert_v32k_cos1_50k (tensorboard) | 32k | cosine mul=1 | 8 192 | 4.07E+06 | 23 030 | 9.37E+10 | 10.930 | 11.000 | 1.832 |
#4 | PoLitBert_v32k_cos1_2_50k (tensorboard) | 32k | cosine mul=1 peak=0.0005 | 8 192 | 4.07E+06 | 50 000 | 2.03E+11 | 1.684 | 1.633 | 1.595 |
#5 | PoLitBert_v32k_cos1_3_50k (tensorboard) | 32k | cosine mul=2 | 8 192 | 4.07E+06 | 3 735 | 1.52E+10 | 10.930 | | |
#6 | PoLitBert_v32k_cos1_4_50k (tensorboard) | 32k | cosine mul=2 grad-clip=0.9 | 8 192 | 4.07E+06 | 4 954 | 2.02E+10 | 10.910 | 10.940 | 2.470 |
#8 | PoLitBert_v32k_tri_125k (tensorboard) | 32k | triangular | 8 192 | 4.07E+06 | 125 000 | 5.09E+11 | 1.435 | 1.313 | 1.363 |
#9 | PoLitBert_v32k_cos1_5_50k (tensorboard) | 32k | cosine, mul=2, grad-clip=0.9 | 8 192 | 4.07E+06 | 125 000 | 5.09E+11 | 1.502 | 1.358 | 1.426 |
#10 | PoLitBert_v32k_linear_125k (tensorboard) | 32k | linear decay | 8 192 | 4.07E+06 | 125 000 | 5.09E+11 | 1.322 | 1.218 | 1.268 |
#11 | PoLitBert_v50k_linear_50k (tensorboard) | 50k | linear decay | 8 192 | 4.07E+06 | 50 000 | 2.04E+11 | 1.546 | 1.439 | 1.480 |
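For reference, a hedged sketch of what a fairseq-train invocation for the linear-decay protocol could look like; the architecture, per-GPU batch size, update frequency and warmup below are assumptions consistent with the table above, and the exact commands are in polish_roberta_training.ipynb.

```python
# Illustrative fairseq-train call for the linear (polynomial_decay) protocol.
# The effective batch of 8 192 sequences is reached via per-GPU sentences x
# update-freq x number of GPUs; the concrete values below are assumed.
import subprocess

subprocess.run([
    "fairseq-train", "data-bin/polish",
    "--task", "masked_lm", "--criterion", "masked_lm",
    "--arch", "roberta_base",                      # assumed architecture
    "--sample-break-mode", "complete", "--tokens-per-sample", "512",
    "--optimizer", "adam", "--adam-betas", "(0.9, 0.98)", "--adam-eps", "1e-6",
    "--lr-scheduler", "polynomial_decay", "--lr", "0.0005",
    "--warmup-updates", "10000", "--total-num-update", "50000",
    "--max-update", "50000",
    "--dropout", "0.1", "--attention-dropout", "0.1", "--weight-decay", "0.01",
    "--max-sentences", "16", "--update-freq", "64",
    "--fp16", "--log-format", "simple", "--log-interval", "100",
], check=True)
```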
Used libraries
- KRNNT - Polish morphological tagger (we use the dockerized version)
- langdetect - for detecting sentence language
- polyglot - for detecting sentence language
- sentencepiece
- Fairseq v0.9
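A minimal usage example of the two language-detection libraries; the libicu-dev system package mentioned in the next section may be required.

```python
# Minimal language-detection example with both libraries.
from langdetect import detect            # pip install langdetect
from polyglot.detect import Detector     # pip install polyglot (may need system packages)

sentence = "To jest poprawne polskie zdanie."

print(detect(sentence))                  # -> 'pl'
print(Detector(sentence).language.code)  # -> 'pl'
```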
Installation dependencies and problems
- langdetect needs an additional system package
  - install it with sudo apt-get install libicu-dev
- sentencepiece was installed from source code
Acknowledgements
This is the joint work of the companies Ermlab Software and Literacka.
Part of the work was financed by a grant from the Polish National Centre for Research and Development, no. POIR.01.01.01-00-1213/19, whose beneficiary was Literacka. Project title: "Publishing assistant - software for content analysis that uses artificial intelligence algorithms to automate the publishing process and predict the market success of publications."
We would like to express our gratitude to the NVIDIA Inception Program and Amazon AWS for providing free GPU credits - thank you!
Authors:
We also appreciate the help from:
- simonefrancia from Musixmatch for his detailed explanations of how they trained the Italian RoBERTa model Umberto
About Ermlab Software
Ermlab - Polish machine learning company
:owl: Website | :octocat: Repository
<img src="/images/ermlab_software.png" width="800">