bleurt_eval

Evaluation of titles generated by a Ukrainian GPT-2 model, scored with BLEURT.

Experiment setup:

$ python -m bleurt.score_files   -candidate_file=candidate.small.txt   -reference_file=reference.small.txt  -bleurt_checkpoint=bleurt/BLEURT-20 -scores_file=scores.small.txt
$ python -m bleurt.score_files   -candidate_file=candidate.medium.txt   -reference_file=reference.medium.txt  -bleurt_checkpoint=bleurt/BLEURT-20 -scores_file=scores.medium.txt
$ python -m bleurt.score_files   -candidate_file=candidate.large.txt   -reference_file=reference.large.txt  -bleurt_checkpoint=bleurt/BLEURT-20 -scores_file=scores.large.txt

$ python -m bleurt.score_files   -candidate_file=candidate.mbart.1k.txt   -reference_file=reference.small.txt  -bleurt_checkpoint=bleurt/BLEURT-20 -scores_file=scores.mbart.1k.txt
$ python -m bleurt.score_files   -candidate_file=candidate.mbart.5k.txt   -reference_file=reference.small.txt  -bleurt_checkpoint=bleurt/BLEURT-20 -scores_file=scores.mbart.5k.txt
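
Equivalently, the scores can be computed directly from Python via the BLEURT scoring API. A minimal sketch for the small run, using the same checkpoint and file names as the commands above:

from bleurt import score

# Load the same BLEURT-20 checkpoint used by the CLI runs.
scorer = score.BleurtScorer("bleurt/BLEURT-20")

# One title per line; candidates are aligned with references.
with open("reference.small.txt") as f:
    references = [line.rstrip("\n") for line in f]
with open("candidate.small.txt") as f:
    candidates = [line.rstrip("\n") for line in f]

# Returns one float per candidate/reference pair,
# the same values score_files writes out.
scores = scorer.score(references=references, candidates=candidates)

with open("scores.small.txt", "w") as f:
    f.writelines(f"{s}\n" for s in scores)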

Combined files for review are available at bleurt.eval.small.csv, bleurt.eval.medium.csv and bleurt.eval.large.csv.

model       mean                  median
small       0.5432437893003226    0.5374010503292084
medium      0.5675271535292268    0.5721611678600311
large       0.5906959722489119    0.5916261672973633
mbart.1k    0.7408332733213902    0.8061753809452057
mbart.5k    0.7135102691948414    0.7874233424663544
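
The summary statistics above can be recomputed from any of the scores files with the standard library alone; a minimal sketch for the small run:

import statistics

# score_files writes one BLEURT score per line.
with open("scores.small.txt") as f:
    scores = [float(line) for line in f if line.strip()]

print("mean:  ", statistics.mean(scores))
print("median:", statistics.median(scores))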

bert_score results on the same data, using bert_score version 0.3.13 with two multilingual models:

xlm-roberta-large eval

$ time bert-score -r reference.small.txt -c candidate.small.txt -m xlm-roberta-large

xlm-roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.26.1)_fast-tokenizer P: 0.910866 R: 0.903273 F1: 0.906873

$ time bert-score -r reference.medium.txt -c candidate.medium.txt -m xlm-roberta-large

xlm-roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.26.1)_fast-tokenizer P: 0.915702 R: 0.906600 F1: 0.910944

$ time bert-score -r reference.large.txt -c candidate.large.txt -m xlm-roberta-large

xlm-roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.26.1)_fast-tokenizer P: 0.916548 R: 0.909308 F1: 0.912751

$ time bert-score -r reference.small.txt -c candidate.mbart.1k.txt -m xlm-roberta-large

xlm-roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.26.1)_fast-tokenizer P: 0.935879 R: 0.942202 F1: 0.938712

$ time bert-score -r reference.small.txt -c candidate.mbart.5k.txt -m xlm-roberta-large

xlm-roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.26.1)_fast-tokenizer P: 0.930015 R: 0.935764 F1: 0.932630

model       P           R           F1
small       0.910866    0.903273    0.906873
medium      0.915702    0.906600    0.910944
large       0.916548    0.909308    0.912751
mbart.1k    0.935879    0.942202    0.938712
mbart.5k    0.930015    0.935764    0.932630
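
The CLI runs above can also be reproduced from Python via the bert_score API. A minimal sketch for the small run; swapping model_type to bert-base-multilingual-cased reproduces the second set of results below:

from bert_score import score

with open("candidate.small.txt") as f:
    cands = [line.rstrip("\n") for line in f]
with open("reference.small.txt") as f:
    refs = [line.rstrip("\n") for line in f]

# Per-pair precision/recall/F1 tensors; their means match the
# system-level numbers printed by the CLI.
P, R, F1 = score(cands, refs, model_type="xlm-roberta-large")
print(f"P: {P.mean():.6f} R: {R.mean():.6f} F1: {F1.mean():.6f}")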

bert-base-multilingual-cased eval

$ time bert-score -r reference.small.txt -c candidate.small.txt -m bert-base-multilingual-cased

bert-base-multilingual-cased_L9_no-idf_version=0.3.12(hug_trans=4.26.1)_fast-tokenizer P: 0.780336 R: 0.766688 F1: 0.772851

$ time bert-score -r reference.medium.txt -c candidate.medium.txt -m bert-base-multilingual-cased

bert-base-multilingual-cased_L9_no-idf_version=0.3.12(hug_trans=4.26.1)_fast-tokenizer P: 0.790484 R: 0.775623 F1: 0.782396

$ time bert-score -r reference.large.txt -c candidate.large.txt -m bert-base-multilingual-cased

bert-base-multilingual-cased_L9_no-idf_version=0.3.12(hug_trans=4.26.1)_fast-tokenizer P: 0.792963 R: 0.780369 F1: 0.786095

$ time bert-score -r reference.small.txt -c candidate.mbart.1k.txt -m bert-base-multilingual-cased

bert-base-multilingual-cased_L9_no-idf_version=0.3.12(hug_trans=4.26.1)_fast-tokenizer P: 0.839130 R: 0.863765 F1: 0.850054

$ time bert-score -r reference.small.txt -c candidate.mbart.5k.txt -m bert-base-multilingual-cased

bert-base-multilingual-cased_L9_no-idf_version=0.3.12(hug_trans=4.26.1)_fast-tokenizer P: 0.819808 R: 0.847989 F1: 0.832732

model       P           R           F1
small       0.780336    0.766688    0.772851
medium      0.790484    0.775623    0.782396
large       0.792963    0.780369    0.786095
mbart.1k    0.839130    0.863765    0.850054
mbart.5k    0.819808    0.847989    0.832732