
Benchmarking Finnish POS taggers and lemmatizers

This repository evaluates the accuracy of open-source Finnish part-of-speech taggers and lemmatization algorithms.

Tested algorithms

spaCy 3.3.0
Experimental Finnish model for spaCy 0.10.0
FinnPos git commit 81c1f735 (Oct 2019)
Simplemma 0.6.0
Stanza 1.4.0
Trankit 1.1.1
Turku neural parser pipeline git commit 8c9425dd (Jan 2022)
UDPipe (through spacy-udpipe 1.0.0)
UralicNLP 1.3.0
libvoikko 4.3.1 and Python voikko module 0.5
Raudikko git commit 572b8104 (Jan 2022)

Test datasets

FinnTreeBank 1 v1: randomly sampled subset of about 1000 sentences
FinnTreeBank 2: news, Sofie and Wikipedia subsets
UD_Finnish-TDT r2.9: the test set

Setup

Install dependencies:

Python 3.9
libvoikko with Finnish morphology data files
clang (or other C++ compiler)
Dependencies needed to compile FinnPos and cg3
Java 11

The virtual environment must use Python 3.9 because the Turku parser is incompatible with more recent Python versions. Set up the git submodules, create the virtual environment, and download the test data and models by running the following commands:

git submodule init
git submodule update

python3.9 -m venv venv
source venv/bin/activate
pip install wheel
pip install -r requirements.txt

# Compile FinnPos
(cd models/FinnPos/src && make -j 4)

# Compile cg3 in models/cg3
# See https://visl.sdu.dk/cg3/chunked/installation.html

# Compile Raudikko
(cd models/raudikko && ./gradlew shadowJar)

./download_data.sh
./download_models.sh

Run

./run.sh

The numerical results are saved in results/evaluation.csv, the POS and lemma errors made by each model in results/errorcases, and the plots in results/images.

Results

Lemmatization

Lemmatization speed

Execution duration as a function of the F1 score on the concatenated data. Larger values are better on both axes. Note that the Y-axis uses a log scale.

The execution duration is measured as a batched evaluation (a batch contains all sentences from one dataset) on a 4-core CPU. Some methods can also be run on a GPU, which would most likely make them faster, but I haven't tested that.
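As a rough illustration of what a batched timing measurement can look like, here is a minimal sketch. It is not the code this repository uses: `time_batch` and `toy_lemmatizer` are hypothetical names, and the toy lemmatizer merely lowercases tokens so that the example runs without any models installed.

```python
import time
from typing import Callable, List, Tuple

Sentences = List[List[str]]


def time_batch(
    process_batch: Callable[[Sentences], Sentences], sentences: Sentences
) -> Tuple[Sentences, float]:
    """Run process_batch once over the whole batch and return (output, seconds)."""
    start = time.perf_counter()
    output = process_batch(sentences)
    elapsed = time.perf_counter() - start
    return output, elapsed


def toy_lemmatizer(sentences: Sentences) -> Sentences:
    # Hypothetical stand-in for a real model: lowercases every token.
    return [[token.lower() for token in sentence] for sentence in sentences]


# One batch holds all sentences of a (made-up) dataset.
sentences = [["Kissa", "istuu", "matolla"], ["Koira", "haukkuu"]]
lemmas, seconds = time_batch(toy_lemmatizer, sentences)
token_count = sum(len(sentence) for sentence in sentences)
print(f"{token_count} tokens processed in {seconds:.6f} s")
```

Timing the whole batch in one call, rather than sentence by sentence, lets each method use whatever internal batching it supports.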

Lemmatization error rates

Lemmatization F1 scores for the benchmarked algorithms on the test datasets.
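This README does not spell out how the F1 score is computed. One common token-level definition, assumed here purely for illustration, counts a prediction as correct when a predicted token covers the same character span as a gold token and carries the same lemma. Precision is then the correct count divided by the number of predicted tokens, and recall is the correct count divided by the number of gold tokens. The function name `f1_score` and the toy data below are hypothetical.

```python
from typing import Dict, Tuple

Span = Tuple[int, int]  # (start, end) character offsets of a token


def f1_score(gold: Dict[Span, str], predicted: Dict[Span, str]) -> float:
    """Token-level F1 over labels (lemmas or POS tags) keyed by character span."""
    # A prediction is correct only if the span exists in gold with the same label.
    correct = sum(1 for span, label in predicted.items() if gold.get(span) == label)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up example sentence: "Kissa istuu matolla"
gold = {(0, 5): "kissa", (6, 11): "istua", (12, 19): "matto"}

# Same tokenization, one wrong lemma: precision = recall = 2/3.
same_tokens = {(0, 5): "kissa", (6, 11): "istuu", (12, 19): "matto"}

# Different tokenization (last word split in two): precision 2/4, recall 2/3.
split_tokens = {(0, 5): "kissa", (6, 11): "istua", (12, 15): "mat", (15, 19): "olla"}

print(f1_score(gold, same_tokens))
print(f1_score(gold, split_tokens))
```

Keying tokens by character span lets the same function handle systems whose tokenization differs from the gold standard, which is why precision and recall can diverge.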

Part-of-speech tagging

Part-of-speech speed

Execution duration as a function of the POS F1 score on the concatenated data.

Note that FinnPos and Voikko do not distinguish between auxiliary and main verbs. They label every auxiliary (AUX) as VERB, so their scores are 4-5% lower in this evaluation.
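The effect of the missing AUX/VERB distinction can be seen on a toy example. The sketch below uses made-up tags for a single sentence and plain accuracy rather than the F1 score reported here. The helper names `accuracy` and `merge_aux` are hypothetical and not part of this repository.

```python
from typing import List


def accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def merge_aux(tags: List[str]) -> List[str]:
    """Collapse AUX into VERB, mimicking a tagset without auxiliaries."""
    return ["VERB" if tag == "AUX" else tag for tag in tags]


# Made-up example: "Kissa on istunut matolla" ("The cat has sat on the mat").
gold = ["NOUN", "AUX", "VERB", "NOUN"]
# A tagger that never emits AUX labels the auxiliary "on" as VERB.
predicted = ["NOUN", "VERB", "VERB", "NOUN"]

strict = accuracy(gold, predicted)              # AUX counted as an error
lenient = accuracy(merge_aux(gold), predicted)  # AUX and VERB treated as one class
print(strict, lenient)  # 0.75 1.0
```

The gap between the strict and lenient numbers is the share of tokens that are auxiliaries in the gold data, which is the penalty described above.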

Part-of-speech error rates

Part-of-speech F1 scores for the benchmarked algorithms.

Simplemma does not include a POS tagging feature.