Experimental Norwegian (Bokmål) language model for spaCy (including NER)
A project for training NER and dependency (DEP) taggers for Norwegian Bokmål. This repository has not been properly cleaned up yet; more will be done later.
Originally trained for Nudge AS and their product Tagbox.ai (http://tagbox.ai/).
Original dataset (source): https://github.com/ltgoslo/norne
Installation
To install the `nb_core_news_sm` package, use this command:

```bash
pip install https://github.com/ohenrik/nb_news_ud_sm/raw/master/packaged_models/nb_core_sm_v2/nb_core_news_sm-1.0.0/dist/nb_core_news_sm-1.0.0.tar.gz
```
To install the `nb_ext_news_sm` package, use this command:

```bash
pip install https://github.com/ohenrik/nb_news_ud_sm/raw/master/packaged_models/nb_core_sm_v3/nb_ext_news_sm-1.0.0/dist/nb_ext_news_sm-1.0.0.tar.gz
```
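To verify that the packages installed correctly and are compatible with the installed spaCy version, spaCy's own validate command can be used:

```bash
python -m spacy validate
```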
Usage
```python
import spacy

# Load the two packaged models
nb_core = spacy.load("nb_core_news_sm")
nb_ext = spacy.load("nb_ext_news_sm")

# "It is cold in the winter in Norway."
doc_core = nb_core("Det er kaldt på vinteren i Norge.")
doc_ext = nb_ext("Det er kaldt på vinteren i Norge.")
```
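Both models return a standard spaCy `Doc`, so the results can be inspected with the usual attributes (a quick sketch; the exact entities and tags shown depend on the model):

```python
# Named entities found by the core model, e.g. "Norge" as a GPE-type entity
for ent in doc_core.ents:
    print(ent.text, ent.label_)

# Dependency parse: token, fine-grained tag, relation, and syntactic head
for token in doc_core:
    print(token.text, token.tag_, token.dep_, token.head.text)
```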
Test results
All values are percentages, as reported by spaCy's evaluation output (rounded to two decimals):

| Metric                     | Core   | Extended |
|----------------------------|--------|----------|
| UAS                        | 88.43  | 88.33    |
| LAS                        | 85.76  | 85.81    |
| NER precision (ents_p)     | 84.93  | 82.30    |
| NER recall (ents_r)        | 85.40  | 82.39    |
| NER F-score (ents_f)       | 85.16  | 82.34    |
| Tag accuracy (tags_acc)    | 95.55  | 95.72    |
| Token accuracy (token_acc) | 100.00 | 100.00   |
Core and Extended models
The folder `packaged_models` contains two trained models. The first (v2) is trained on a simplified version of the original dataset; the only difference is that combined tags (mostly GPE_LOC) are converted to plain GPE. This model is named "core", and the simplification improved the NER test results from ≈0.83 to ≈0.85. The second model (v3) is trained on the original dataset. This model is named "ext" and performs slightly worse than the core model (≈0.83).
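For reference, the simplification behind the core model amounts to collapsing combined NorNE labels into their first component. A minimal sketch of the idea (the function name is illustrative, not the exact preprocessing script used here):

```python
def simplify_label(bio_tag):
    """Collapse combined labels such as B-GPE_LOC or I-GPE_ORG into B-GPE / I-GPE."""
    if bio_tag == "O":
        return bio_tag
    prefix, label = bio_tag.split("-", 1)
    return "{}-{}".format(prefix, label.split("_")[0])

assert simplify_label("B-GPE_LOC") == "B-GPE"
assert simplify_label("I-GPE_ORG") == "I-GPE"
assert simplify_label("O") == "O"
```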
Re-splitting the dataset

The models trained on the original dataset splits did not perform well, and their test results differed widely from the cross-validated results obtained on the dev set during training. After re-splitting the combined original dataset into training, dev, and test sets, the model performed better and gave significantly better test results that also resembled the results achieved during training.
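A minimal sketch of such a re-split, assuming the original .conllu files have first been concatenated into one file and that splitting happens at the sentence level (the file names and the 80/10/10 ratio are assumptions, not the exact procedure used for this project):

```python
import random

# Sentences in CoNLL-U files are blank-line separated blocks
with open("no-ud-all-ner.conllu", encoding="utf-8") as f:
    sentences = [block for block in f.read().split("\n\n") if block.strip()]

random.seed(42)
random.shuffle(sentences)

n = len(sentences)
splits = {
    "train": sentences[: int(0.8 * n)],
    "dev": sentences[int(0.8 * n) : int(0.9 * n)],
    "test": sentences[int(0.9 * n) :],
}

for name, blocks in splits.items():
    with open("no-ud-{}-ner.conllu".format(name), "w", encoding="utf-8") as out:
        out.write("\n\n".join(blocks) + "\n")
```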
Conversion from CoNLL-U + BIO files

```bash
python -m spacy convert /path/to/project/original_data/no-ud-dev-ner.conllu /path/to/project/original_data/json_results --converter=conllubio -m
python -m spacy convert /path/to/project/original_data/no-ud-test-ner.conllu /path/to/project/original_data/json_results --converter=conllubio -m
python -m spacy convert /path/to/project/original_data/no-ud-train-ner.conllu /path/to/project/original_data/json_results --converter=conllubio -m
```
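Since the three commands differ only in the split name, they can equivalently be run as a loop:

```bash
for split in dev test train; do
  python -m spacy convert "/path/to/project/original_data/no-ud-${split}-ner.conllu" \
    /path/to/project/original_data/json_results --converter=conllubio -m
done
```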
Training the entity recognizer and dependency parser

```bash
python -m spacy train nb model_out2 ner_data/no-ud-train-ner.json ner_data/no-ud-dev-ner.json --use-gpu=0 -n 10
```
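After training, a checkpoint can be scored against the held-out test file with spaCy's evaluate command, which reports the same metrics as listed under "Test results" (the checkpoint path is illustrative):

```bash
python -m spacy evaluate model_out2/model-final ner_data/no-ud-test-ner.json
```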
Completed packages

The package `nb_core_news_sm-1.0.0` is based on `model_out8/model14` and has combined tags such as GPE_LOC and GPE_ORG converted to just GPE.

The package `nb_ext_news_sm-1.0.0` is based on `model_out10/model42` and is trained on the original, unmodified dataset.
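The tarballs under `packaged_models` follow the layout produced by spaCy's package command; a sketch of that packaging workflow (directory names are illustrative):

```bash
# Wrap a trained model directory as an installable Python package
python -m spacy package model_out8/model14 packaged_models/nb_core_sm_v2
cd packaged_models/nb_core_sm_v2/nb_core_news_sm-1.0.0
python setup.py sdist  # writes dist/nb_core_news_sm-1.0.0.tar.gz
```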
CUDA environment variables

```bash
export CUDA_HOME=/usr/local/cuda
export PATH=${CUDA_HOME}/bin:${PATH}
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:$LD_LIBRARY_PATH
```
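To check that the toolkit is actually picked up from these paths:

```bash
nvcc --version           # should report the installed CUDA toolkit release
echo "$LD_LIBRARY_PATH"  # should include /usr/local/cuda/lib64
```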