Norwegian model for spaCy

The goal of this project is to develop a Norwegian language model for the Python library spaCy.

About the data

The model is based on an unpublished dataset developed by Nasjonalbiblioteket, Schibsted and the Language Technology Group at UiO (to be published in autumn 2018). The data is in .conllu format and follows the Universal Dependencies annotation scheme. It is similar to this treebank: https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal, except that it contains NER tags in the 10th column of each token line and interprets some parts of speech differently (mostly pronouns/determiners).
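
For illustration, a CoNLL-U token line has ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with the NER tag carried in the last one. The line below is a made-up example; the exact encoding of the NER tag in this dataset may differ:

1	Oslo	Oslo	PROPN	_	_	2	nsubj	_	B-LOC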

How to train the model

First of all, I converted the .conllu files (train, dev and test) to .json, since that is the training format spaCy expects:

python -m spacy convert path/to/train-file.conllu path/to/output/directory -m
python -m spacy convert path/to/dev-file.conllu path/to/output/directory -m
python -m spacy convert path/to/test-file.conllu path/to/output/directory -m

I chose the '-m' option to append morphological features to the POS tags, so I end up with tags like 'NOUN__Definite=Ind|Gender=Neut|Number=Sing' instead of just plain 'NOUN'.

To train the model I run the command:

python -m spacy train nb path/to/output/directory path/to/train-file.json path/to/dev-file.json -n 10

This gave me the following results:

Itn.  P.Loss     N.Loss    UAS     NER P.  NER R.  NER F.  Tag %   Token %  wps
0     17237.645  1388.183  84.014  75.985  74.985  75.482  93.674  100.000  5044.2
1     423.721    13.932    86.053  78.795  78.276  78.535  94.655  100.000  5988.9
2     350.019    9.434     86.922  77.964  77.917  77.941  95.026  100.000  6096.9
3     306.036    7.817     87.333  79.476  78.097  78.781  95.146  100.000  5308.6
4     272.160    6.333     87.579  80.036  79.174  79.603  95.198  100.000  5649.8
5     247.511    5.729     87.845  78.030  78.217  78.123  95.292  100.000  5796.3
6     226.625    5.190     87.824  78.652  78.935  78.793  95.363  100.000  5884.3
7     208.418    4.364     87.852  78.481  77.917  78.198  95.308  100.000  5853.0
8     191.116    3.985     88.132  77.718  77.858  77.788  95.212  100.000  5856.2
9     178.484    3.454     88.129  78.797  77.618  78.203  47.975  100.000  5853.8

The best model seems to be model 6, so I turn it into a package for convenience:

python -m spacy package path/to/model path/to/output/directory

Before I do that, I manually fill in the information in the meta.json file that lies in path/to/model (name, version, description etc.). The script will throw an exception if those fields are empty.
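
For reference, the fields I fill in could look roughly like this (all values below are placeholders; spaCy generates most of the remaining fields itself):

{
  "lang": "nb",
  "name": "model-name",
  "version": "1.0.0",
  "description": "Norwegian Bokmål model with tagger, parser and NER",
  "author": "Your Name",
  "email": "you@example.com",
  "license": "MIT",
  "pipeline": ["tagger", "parser", "ner"]
}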

In order to install the package I now go to the package's directory and run:

python setup.py sdist

and

pip install /path/to/dist/name-of-model.tar.gz

Loading and using the model

I can now load the model from the Python shell with:

import spacy
nlp = spacy.load('name-of-model')

or evaluate it:

python -m spacy evaluate name-of-model path/to/test-file.json

which gives me the following result:

Time               5.00 s         
Words              30034          
Words/s            6010           
TOK                100.00         
POS                94.40          
UAS                87.28          
LAS                84.46          
NER P              70.85          
NER R              70.34          
NER F              70.59   
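
As a quick sanity check, a minimal usage sketch could look like this (the package name and the sample sentence are just placeholders):

import spacy

# Load the installed model package by name
nlp = spacy.load('name-of-model')

# Run the full pipeline (tagger, parser, NER) on a sample sentence
doc = nlp('Jens Stoltenberg bor i Oslo.')

# Coarse POS, fine-grained tag (with morphological features) and dependencies
for token in doc:
    print(token.text, token.pos_, token.tag_, token.dep_, token.head.text)

# Named entities recognized by the model
for ent in doc.ents:
    print(ent.text, ent.label_)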

Custom EntityMatcher

I also added a custom pipeline component, EntityMatcher. Its code is based on this solution: https://support.prodi.gy/t/adding-custom-attribute-to-doc-having-ner-use-attribute/356/6

In order for it to work, I had to add entity_matcher to the pipeline in the model's meta.json and register it with the language's factories (in __init__.py). The files that the component extracts data from are placed in a separate folder, entity_matcher, and their names contain one of the four labels used: PER, LOC, ORG or MISC.

EntityMatcher reads phrases from these text files and labels their occurrences in a document with the corresponding entity tag (PER, LOC, ORG or MISC). It also enforces the 'PROPN___' tag on all matched entity tokens (since all named entities are proper nouns). One can check whether an entity label comes from the custom EntityMatcher or the built-in EntityRecognizer via the 'via_patterns' extension, which returns True in the first case and False in the second.
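
Based on the linked solution, a minimal sketch of such a component might look like the following (the term lists, the sample phrases and the exact extension handling are assumptions; the real component in this repo reads its terms from the files in the entity_matcher folder):

import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span, Token


class EntityMatcher(object):
    """Pipeline component that tags phrases from term lists as named entities."""
    name = 'entity_matcher'

    def __init__(self, nlp, terms_by_label):
        # terms_by_label: e.g. {'PER': ['Jens Stoltenberg'], 'LOC': ['Oslo']}
        self.matcher = PhraseMatcher(nlp.vocab)
        for label, terms in terms_by_label.items():
            self.matcher.add(label, None, *[nlp.make_doc(term) for term in terms])
        # Extension used to tell pattern-based entities from EntityRecognizer ones
        if not Token.has_extension('via_patterns'):
            Token.set_extension('via_patterns', default=False)

    def __call__(self, doc):
        spans = []
        for match_id, start, end in self.matcher(doc):
            span = Span(doc, start, end, label=match_id)
            for token in span:
                token._.via_patterns = True
                token.tag_ = 'PROPN___'  # enforce proper-noun tag on entity tokens
            spans.append(span)
        # Note: assigning overlapping spans to doc.ents raises an error
        doc.ents = list(doc.ents) + spans
        return doc


# Register the component so meta.json's pipeline can refer to 'entity_matcher'
Language.factories['entity_matcher'] = lambda nlp, **cfg: EntityMatcher(nlp, cfg.get('terms_by_label', {}))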

Extending the model with custom entity labels

To add entities with a custom label, check the guide here (using Matcher/PhraseMatcher, which doesn't require context/sentences): https://spacy.io/usage/linguistic-features#adding-phrase-patterns, especially the section "Adding on_match rules" and the examples. Or, if your goal is to train the EntityRecognizer with custom labels (which requires whole sentences tagged with named entity labels), see: https://spacy.io/usage/examples#section-training
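
For the second approach, a rough sketch of an update loop in spaCy 2.x could look like this (the training example, the custom label and the number of iterations are all placeholders; see the linked example script for the full recipe):

import random
import spacy

# A made-up training example with a custom entity label
TRAIN_DATA = [
    ('Han kjøpte en iPhone', {'entities': [(14, 20, 'PRODUCT')]}),
]

nlp = spacy.load('name-of-model')
ner = nlp.get_pipe('ner')
ner.add_label('PRODUCT')

# Don't call nlp.begin_training() on a trained model: it re-initializes the weights
optimizer = nlp.entity.create_optimizer()

# Only update the NER component; keep the tagger and parser fixed
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
with nlp.disable_pipes(*other_pipes):
    for itn in range(20):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, drop=0.35, losses=losses)
        print(itn, losses)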

Examples

Examples of use can be found in spacy_examples.py