We're in the process of releasing BERT models as well. Get the first one here: https://github.com/mollerhoj/danish_bert
Scandinavian ULMFiT
Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch.
This repository contains the weights for the embedding layer of a ULMFiT language model, which can be used as the first step in fine-tuning for any Natural Language Processing task.
The weights were trained on 90% of all text in the corresponding language's Wikipedia as of 3 July 2018. The remaining 10% was used for validation.
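One way such a train/validation split can be done (a minimal sketch; the dump file name and the blank-line article delimiter are assumptions, not the authors' actual pipeline):

```python
import random

# Minimal sketch of a 90/10 train/validation split over a Wikipedia dump.
# "wiki_da.txt" and the blank-line article delimiter are assumptions.
random.seed(42)  # make the split reproducible
with open("wiki_da.txt", encoding="utf-8") as f:
    articles = f.read().split("\n\n")
random.shuffle(articles)
cut = int(0.9 * len(articles))
train, valid = articles[:cut], articles[cut:]
print(f"{len(train)} training articles, {len(valid)} validation articles")
```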
Supported Languages:
- Danish
Trained on 78,373,122 tokens and validated on 7,837,310 tokens. We achieve a perplexity of 30.9 (see the perplexity sketch after this list). Download files: Link
- Norwegian
Trained on 80,284,231 tokens and validated on 8,920,387 tokens. We achieve a perplexity of 26.31. Download files: Link
- Finnish
Trained on 68,775,370 tokens and validated on 7,641,571 tokens. We achieve a perplexity of 27.66. Download files: Link
Training even higher-performance models is possible, but requires more (costly) training time. If you need a model with higher performance, feel free to contact us.
Our servers crashed while training the Swedish model, but if you need it, contact us and we can train it for you.
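For reference, the perplexity figures above are the exponential of the model's average per-token cross-entropy on the validation tokens. A minimal sketch (shapes and numbers purely illustrative, not the reported results):

```python
import math

import torch
import torch.nn.functional as F

# Perplexity = exp(average per-token cross-entropy) on the validation set.
# Shapes are illustrative: logits (num_tokens, vocab_size), targets (num_tokens,).
def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    avg_nll = F.cross_entropy(logits, targets)  # mean negative log-likelihood
    return math.exp(avg_nll.item())

# Tiny usage example with random tensors (not real model outputs):
logits = torch.randn(1000, 30000)
targets = torch.randint(0, 30000, (1000,))
print(perplexity(logits, targets))
```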
Paper
See "Universal Language Model Fine-tuning for Text Classification" by Jeremy Howard and Sebastian Ruder: https://arxiv.org/abs/1801.06146
File descriptions
- enc.h5: contains the weights in HDF5 ('Hierarchical Data Format')
- enc.pth: contains the weights in PyTorch model format
- itos.pkl ("integers to strings"): contains the vocabulary mapping from ids (0-30000) to strings
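A rough sketch of how these files fit together (assuming itos.pkl is a plain pickled list and enc.pth a standard PyTorch checkpoint; the unknown-token fallback id is an assumption, check the vocabulary yourself):

```python
import pickle

import torch

# Load the vocabulary: a pickled list mapping token ids (0-30000) to strings.
with open("itos.pkl", "rb") as f:
    itos = pickle.load(f)
stoi = {tok: i for i, tok in enumerate(itos)}  # reverse mapping: string -> id

# Load the encoder weights; here the PyTorch file, on CPU.
weights = torch.load("enc.pth", map_location="cpu")

# Numericalize a (pre-tokenized) sentence, falling back to an assumed
# unknown-token id of 0 for out-of-vocabulary words.
tokens = ["det", "er", "en", "test"]
ids = [stoi.get(t, 0) for t in tokens]
print(len(itos), ids)
```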
Sponsor
This work was sponsored by the Danish chatbot company BotXO: http://www.botxo.co/
Thanks
Thanks to Tobias Lindberg from Damvad Analytics for converting the weights to the .pth format.