Norwegian Transformer Model

The project "NoTraM - Norwegian Transformer Model" is owned by the National Library of Norway.

Project Goal

Norwegian Colossal Corpus

The Norwegian Colossal Corpus is an open text corpus comparable in size and quality with available datasets for English.

The core of the corpus is based on a unique project started in 2006. The goal of this digitisation project has been to digitise and store all content ever published in Norwegian. In addition, we have added multiple other public sources of Norwegian text. Details about the sources, as well as how they are built, are available in the Colossal Norwegian Corpus Description.

| Corpus | License | Size | Words | Documents | Avg words per doc |
| --- | --- | --- | --- | --- | --- |
| Library Newspapers | CC0 1.0 | 14.0 GB | 2,019,172,625 | 10,096,424 | 199 |
| Library Books | CC0 1.0 | 6.2 GB | 861,465,907 | 24,253 | 35,519 |
| LovData CD | NLOD 2.0 | 0.4 GB | 54,923,432 | 51,920 | 1,057 |
| Government Reports | NLOD 2.0 | 1.1 GB | 155,318,754 | 4,648 | 33,416 |
| Parliament Collections | NLOD 2.0 | 8.0 GB | 1,301,766,124 | 9,528 | 136,625 |
| Public Reports | NLOD 2.0 | 0.5 GB | 80,064,396 | 3,365 | 23,793 |
| Målfrid Collection | NLOD 2.0 | 14.0 GB | 1,905,481,776 | 6,735,367 | 282 |
| Newspapers Online | CC BY-NC 2.0 | 3.7 GB | 541,481,947 | 3,695,943 | 146 |
| Wikipedia | CC BY-SA 3.0 | 1.0 GB | 140,992,663 | 681,973 | 206 |

The easiest way to access the corpus is to download it from HuggingFace. The dataset page explains in detail how the corpus can be used, gives extensive information about its content, and describes how to filter out certain parts of the corpus and how it can be combined with other Norwegian datasets such as MC4 and OSCAR.
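As a minimal sketch, the corpus can be streamed from HuggingFace and filtered down to selected sub-corpora without downloading everything up front. The dataset id `NbAiLab/NCC` and the field names `doc_type`/`text` follow the dataset card; verify them on the HuggingFace page before relying on them.

```python
# Sketch: stream the Norwegian Colossal Corpus and keep only selected
# sub-corpora (requires `pip install datasets`). Field names are assumptions
# based on the NCC dataset card.
def keep_doc(doc, sources):
    """True if the document comes from one of the wanted sub-corpora."""
    return doc.get("doc_type") in sources

def stream_filtered(sources, limit=100):
    # Streaming avoids downloading the whole corpus before filtering.
    from datasets import load_dataset
    ds = load_dataset("NbAiLab/NCC", split="train", streaming=True)
    texts = []
    for doc in ds:
        if keep_doc(doc, sources):
            texts.append(doc["text"])
            if len(texts) >= limit:
                break
    return texts
```

The same predicate can be reused with the non-streaming loader if you have the disk space for a full download.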

In addition to the corpus itself, we provide a set of scripts for creating and cleaning corpus files. We also provide a step-by-step guide on how to create a corpus file from your own data sources, and a description of how to create and upload a HuggingFace dataset. Other tools and guides can be found on our Guides Page. We have made all our software available for anyone to use; most of it is written in Python 3.

Pretrained Models

The following pretrained models are available. These models need to be finetuned on a specific task. Finetuning is straightforward if you have a dataset available; please take a look at the Colab notebooks below for sample code. Often you will only need to change a couple of lines of code to adapt them to your task.

| Name | Description | Model |
| --- | --- | --- |
| nb-bert-base | The original model, based on the same architecture as the BERT Cased multilingual model. Although it is trained mainly on Norwegian text, it retains some multilingual capabilities; in particular, it scores well on Swedish, Danish and English. | 🤗 Model |
| nb-bert-large | Based on the BERT-large-uncased architecture. For classification tasks, this model gives the best results. Since it is uncased, it may not give as good results on NER tasks, and it may require more processing power for both finetuning and inference. | 🤗 Model |
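Before any finetuning, the base model can be tried out directly on masked-token prediction. A minimal sketch using the transformers fill-mask pipeline (requires `pip install transformers`; the model is downloaded on first use) — the example sentence is our own:

```python
# Sketch: masked-token prediction with nb-bert-base via the fill-mask pipeline.
def best_token(predictions):
    """Return the highest-scoring token from a fill-mask result list."""
    return max(predictions, key=lambda p: p["score"])["token_str"]

def demo():
    # Calling demo() requires `pip install transformers` and a model download.
    from transformers import pipeline
    fill = pipeline("fill-mask", model="NbAiLab/nb-bert-base")
    # BERT-style models expect the literal [MASK] token in the input text.
    predictions = fill("Nasjonalbiblioteket ligger i [MASK].")
    return best_token(predictions)
```

The masked-layer Colab notebook below demonstrates the same idea interactively.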

Finetuned Models

These models are finetuned on a specific task, and can be used directly.

| Name | Description | Model |
| --- | --- | --- |
| nb-bert-base-mnli | nb-bert-base finetuned on the MNLI task. See the model page for more details. | 🤗 Model |
| saattrupdan/nbailab-base-ner-scandi | This NER model was trained by Dan Saattrup on top of our nb-bert-base. It has been finetuned on the concatenation of DaNE, NorNE, SUC 3.0 and the Icelandic and Faroese parts of the WikiANN dataset. The model yields better results on Norwegian NER tasks than models finetuned only on Norwegian. See the model page for more details. | 🤗 Model |
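The MNLI-finetuned model can be used for zero-shot classification out of the box. A minimal sketch (requires `pip install transformers`); the candidate labels and the Norwegian hypothesis template are our own illustrative choices, not prescribed by the model card:

```python
# Sketch: zero-shot classification with nb-bert-base-mnli.
def top_label(result):
    """Pick the best label from a zero-shot pipeline result dict."""
    return max(zip(result["labels"], result["scores"]), key=lambda p: p[1])[0]

def demo():
    # Calling demo() requires `pip install transformers` and a model download.
    from transformers import pipeline
    clf = pipeline("zero-shot-classification", model="NbAiLab/nb-bert-base-mnli")
    result = clf(
        "Regjeringen legger frem nytt statsbudsjett.",
        candidate_labels=["politikk", "sport", "kultur"],
        hypothesis_template="Dette eksempelet handler om {}.",
    )
    return top_label(result)
```

The zero-shot Colab notebook below walks through the same workflow in more detail.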

Results

The NB-BERT-Base model is thoroughly tested in the article cited below. Here are some of our results:

| Task | mBERT-base | NB-BERT-base |
| --- | --- | --- |
| POS - NorNE - Bokmål | 98.32 | 98.86 |
| POS - NorNE - Nynorsk | 98.08 | 98.77 |
| NER - NorNE - Bokmål | 81.75 | 90.03 |
| NER - NorNE - Nynorsk | 84.69 | 87.67 |
| Classification - ToN - Frp/SV | 73.75 | 77.49 |
| Sentence-level binary sentiment classification | 73.27 | 84.04 |

Colab Notebooks

The original models need to be finetuned for the target task. A typical task is classification, and it is then recommended that you train a fully connected top layer for this specific task. The following notebooks let you both test the models and train your own specialised model on top of ours. In particular, the classification notebook, which trains a sentiment classifier, can easily be adapted to any NLP classification task.

| Task | Colaboratory Notebook |
| --- | --- |
| How to use the model for masked layer predictions (easy) | <a href="https://colab.research.google.com/gist/peregilk/f3054305cfcbefb40f72ea405b031438/nbailab-masked-layer-pipeline-example.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |
| How to use finetuned MNLI-version for zero-shot-classification (easy) | <a href="https://colab.research.google.com/gist/peregilk/769b5150a2f807219ab8f15dd11ea449/nbailab-mnli-norwegian-demo.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |
| How to finetune a classification model (advanced) | <a href="https://colab.research.google.com/gist/peregilk/3c5e838f365ab76523ba82ac595e2fcc/nbailab-finetuning-and-evaluating-a-bert-model-for-classification.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |
| How to finetune a NER/POS-model (advanced) | <a href="https://colab.research.google.com/gist/peregilk/6f5efea432e88199f5d68a150cef237f/-nbailab-finetuning-and-evaluating-a-bert-model-for-ner-and-pos.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> |
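The finetuning recipe from the classification notebook can be sketched roughly as follows: a fully connected classification head on top of nb-bert-base, trained with the Trainer API. The label set, the `text` field name and all hyperparameters are placeholders for your own task (requires `pip install transformers datasets`).

```python
# Rough sketch of finetuning nb-bert-base for sequence classification.
def label_maps(labels):
    """Build the id2label/label2id maps a classification config expects."""
    id2label = dict(enumerate(labels))
    label2id = {label: i for i, label in id2label.items()}
    return id2label, label2id

def build_trainer(train_ds, eval_ds, labels=("negative", "positive"),
                  model_name="NbAiLab/nb-bert-base"):
    # Requires `pip install transformers datasets`; downloads the model.
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)
    id2label, label2id = label_maps(list(labels))
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=len(labels),
        id2label=id2label, label2id=label2id)

    def encode(batch):
        # Expects each example to carry "text" and an integer "label".
        return tokenizer(batch["text"], truncation=True, max_length=128)

    args = TrainingArguments(output_dir="finetuned", num_train_epochs=3,
                             per_device_train_batch_size=16,
                             learning_rate=2e-5)
    return Trainer(model=model, args=args, tokenizer=tokenizer,
                   train_dataset=train_ds.map(encode, batched=True),
                   eval_dataset=eval_ds.map(encode, batched=True))
```

Calling `build_trainer(...).train()` runs the finetuning loop; the Colab notebooks above show the full version, including evaluation.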

Disclaimer

The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or other undesirable distortions. When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models), or become users of the models themselves, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of artificial intelligence. In no event shall the owner of the models (the National Library of Norway) be liable for any results arising from the use made by third parties of these models.

Citation

If you use our models or our corpus, please cite our article:

@inproceedings{kummervold-etal-2021-operationalizing,
title = {Operationalizing a National Digital Library: The Case for a Norwegian Transformer Model},
author = {Kummervold, Per E  and
  De la Rosa, Javier  and
  Wetjen, Freddy  and
  Brygfjeld, Svein Arne},
booktitle = {Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)},
year = {2021},
address = {Reykjavik, Iceland (Online)},
publisher = {Link{\"o}ping University Electronic Press, Sweden},
url = {https://aclanthology.org/2021.nodalida-main.3},
pages = {20--29},
abstract = {In this work, we show the process of building a large-scale training set from digital and digitized collections at a national library. The resulting Bidirectional Encoder Representations from Transformers (BERT)-based language model for Norwegian outperforms multilingual BERT (mBERT) models in several token and sequence classification tasks for both Norwegian Bokm{\aa}l and Norwegian Nynorsk. Our model also improves the mBERT performance for other languages present in the corpus such as English, Swedish, and Danish. For languages not included in the corpus, the weights degrade moderately while keeping strong multilingual properties. Therefore, we show that building high-quality models within a memory institution using somewhat noisy optical character recognition (OCR) content is feasible, and we hope to pave the way for other memory institutions to follow.},
}