flair-lms

This repository is part of the NLP research with flair, a state-of-the-art NLP framework from Zalando Research.

This repository collects forward and backward language models that can be used with flair. It is updated frequently, so please star or watch this repository 😅

Changelog

January 2020: Moved the repository to the new FlairNLP group on GitHub.

September 2019: New Multilingual Flair Embeddings trained on JW300 corpus are released.

September 2019: All Flair Embeddings that are now officially available in Flair >= 0.4.3 are listed.

Parameters

All Flair Embeddings are trained with a hidden_size of 2048 and nlayers of 1.
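As a rough illustration of where these two parameters go, the following sketch builds an untrained character language model with flair's `LanguageModel` class. The use of the default `chars` dictionary and the helper name `build_language_model` are assumptions made for this example; the actual training setup of the models listed here may differ.

```python
# Hyperparameters stated in this README.
LM_PARAMS = {"hidden_size": 2048, "nlayers": 1}


def build_language_model(is_forward_lm: bool = True):
    """Create an untrained character LM (hypothetical helper, not part of flair)."""
    # Imported lazily so LM_PARAMS can be inspected without flair installed.
    from flair.data import Dictionary
    from flair.models import LanguageModel

    # Assumption: the default character dictionary shipped with flair.
    dictionary = Dictionary.load("chars")

    # is_forward_lm=True gives a forward model, False a backward model.
    return LanguageModel(dictionary, is_forward_lm, **LM_PARAMS)
```

Calling `build_language_model(False)` would give the backward counterpart; both still need to be trained on a corpus before they are useful.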

Flair Embeddings

| Language model | # Tokens | Forward ppl. | Backward ppl. | Flair Embeddings alias |
| --- | --- | --- | --- | --- |
| Arabic | 736M | 3.39 | 3.45 | ar-forward and ar-backward |
| Bulgarian (fast) | 66M | 2.48 | 2.51 | bg-forward-fast and bg-backward-fast |
| Bulgarian | 111M | 2.46 | 2.47 | bg-forward and bg-backward |
| Czech (v0) | 778M | 3.44 | 3.48 | cs-v0-forward and cs-v0-backward |
| Czech | 442M | 2.89 | 2.90 | cs-forward and cs-backward |
| Danish | 325M | 2.62 | 2.68 | da-forward and da-backward |
| Basque (v0) | 37M | 2.56 | 2.58 | eu-v0-forward and eu-v0-backward |
| Basque (v1) | 37M | 2.64 | 2.31 | eu-v1-forward and eu-v1-backward |
| Basque | 57M | 2.90 | 2.83 | eu-forward and eu-backward |
| Persian | 146M | 3.68 | 3.66 | fa-forward and fa-backward |
| Finnish | 427M | 2.63 | 2.65 | fi-forward and fi-backward |
| Hebrew | 502M | 3.84 | 3.87 | he-forward and he-backward |
| Hindi | 28M | 2.87 | 2.86 | hi-forward and hi-backward |
| Croatian | 625M | 3.13 | 3.20 | hr-forward and hr-backward |
| Indonesian | 174M | 2.80 | 2.74 | id-forward and id-backward |
| Italian | 1.5B | 2.62 | 2.63 | it-forward and it-backward |
| Dutch (v0) | 897M | 2.78 | 2.77 | nl-v0-forward and nl-v0-backward |
| Dutch | 1.2B | 2.43 | 2.55 | nl-forward and nl-backward |
| Norwegian | 156M | 3.01 | 3.01 | no-forward and no-backward |
| Polish | 1.4B | 2.95 | 2.84 | pl-opus-forward and pl-opus-backward |
| Slovenian (v0) | 314M | 3.28 | 3.34 | sl-v0-forward and sl-v0-backward |
| Slovenian | 419M | 2.88 | 2.91 | sl-forward and sl-backward |
| Swedish (v0) | 545M | 2.29 | 2.27 | sv-v0-forward and sv-v0-backward |
| Swedish | 671M | 6.82 (?) | 2.25 | sv-forward and sv-backward |
| Tamil | 18M | 2.23 | 4509 (!) | ta-forward and ta-backward |
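To help read the perplexity columns: perplexity is conventionally the exponential of the average cross-entropy loss. The sketch below assumes the values in the table are per-character perplexities with the loss measured in nats, which is not stated explicitly in this README. The function names are illustrative only.

```python
import math


def perplexity(avg_cross_entropy_nats: float) -> float:
    """Perplexity is exp of the average cross-entropy (in nats)."""
    return math.exp(avg_cross_entropy_nats)


def cross_entropy(ppl: float) -> float:
    """Inverse mapping: average cross-entropy (in nats) for a given perplexity."""
    return math.log(ppl)


def bits_per_character(ppl: float) -> float:
    """Same quantity expressed in bits instead of nats."""
    return math.log2(ppl)


# Example: the Arabic forward model is listed with a perplexity of 3.39.
arabic_loss = cross_entropy(3.39)
arabic_bits = bits_per_character(3.39)
```

Under this reading, a lower perplexity means the model is, on average, less uncertain about the next character, and outliers such as the marked Swedish and Tamil values stand out clearly.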

Multilingual Flair Embeddings

Multilingual Flair Embeddings were trained on the recently released JW300 corpus. Thanks to half precision support in Flair, both the forward and the backward Embeddings were trained for 5 epochs, which took over 10 days. The training corpus has 2,025,826,977 tokens.

| Language model | # Tokens | Forward ppl. | Backward ppl. | Flair Embeddings alias |
| --- | --- | --- | --- | --- |
| JW300 | 2B | 3.25 | 3.37 | multi-forward and multi-backward |

They can be loaded with:

```python
from flair.embeddings import FlairEmbeddings

jw_forward = FlairEmbeddings("multi-forward")
jw_backward = FlairEmbeddings("multi-backward")
```
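Forward and backward models are usually used together. The following is a minimal sketch of deriving both aliases from a prefix and combining the two models with flair's `StackedEmbeddings`. The helpers `alias_pair`, `load_stacked` and `embed_text` are hypothetical names introduced for this example and are not part of flair.

```python
from typing import Tuple


def alias_pair(prefix: str, suffix: str = "") -> Tuple[str, str]:
    """Build the forward/backward alias names used in the tables of this README.

    alias_pair("multi")       gives "multi-forward" and "multi-backward"
    alias_pair("bg", "-fast") gives "bg-forward-fast" and "bg-backward-fast"
    """
    return (f"{prefix}-forward{suffix}", f"{prefix}-backward{suffix}")


def load_stacked(prefix: str, suffix: str = ""):
    """Load both directions and stack them (downloads the models on first use)."""
    # Imported lazily so alias_pair can be used without flair installed.
    from flair.embeddings import FlairEmbeddings, StackedEmbeddings

    forward_alias, backward_alias = alias_pair(prefix, suffix)
    return StackedEmbeddings(
        [FlairEmbeddings(forward_alias), FlairEmbeddings(backward_alias)]
    )


def embed_text(text: str, prefix: str = "multi"):
    """Embed one sentence and return (token text, embedding) pairs."""
    from flair.data import Sentence

    sentence = Sentence(text)
    load_stacked(prefix).embed(sentence)
    return [(token.text, token.embedding) for token in sentence]
```

For example, `embed_text("Berlin is a city .")` would return one concatenated forward and backward vector per token, assuming the `multi-*` models can be downloaded.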

A detailed evaluation on various PoS tagging tasks can be found in this repository.

We would like to thank Željko Agić for providing us access to the corpus (before it was officially released)!