Stemmers for Ukrainian

This repository introduces tree_stem, a new stemmer for the Ukrainian language built with machine learning. Measured by the error rate relative to truncation (ERRT) (Paice 1994), it outperforms every other stemmer available to date, as well as some lemmatizers. It also has the lowest share of understemming errors among the available stemming algorithms.

The proposed algorithm uses no dictionary lookups and stays reasonably small (48 KB of Python bytecode). It runs 24 times faster than the lemmatization approach and is also faster than the other stemming algorithms.

In addition to the new algorithm, this repository also contains Python ports of some of the previously published stemmers.

Comparison of stemmers for the Ukrainian language

| Stemmer | Languages | UI | OI | ERRT |
| --- | --- | --- | --- | --- |
| Dictionary-based (reference) | | 0.0172 | 3.59e-06 | 0.0244 |
| tree_stem | Python | 0.0907 | 2.71e-06 | 0.125 |
| pymorphy2 (Paper) | Python | 0.324 | 2.01e-07 | 0.391 |
| stemka | C++ | 0.329 | 2.34e-06 | 0.447 |
| tapkomet | Snowball, C, Java | 0.447 | 2.73e-06 | 0.603 |
| vgrichina | Groovy, Python | 0.497 | 1.05e-06 | 0.651 |
| drupal | JS, Python | 0.511 | 7.54e-07 | 0.666 |
| tochytskyi (Paper) | PHP, Python | 0.623 | 5.72e-07 | 0.795 |
| No stemming | | 1.00 | 1.69e-08 | |

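where:

- UI is the understemming index,
- OI is the overstemming index,
- ERRT is the error rate relative to truncation (Paice 1994).

Lower values are better for all three.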
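As a rough illustration of how the three metrics in the table are counted, here is a minimal sketch of Paice's method: the stemmer is scored against hand-built groups of related words, UI is the share of desired merges it misses, OI is the share of undesired merges it makes, and ERRT compares the resulting (UI, OI) point with a line obtained from plain truncation stemmers. The function names `paice_indices` and `errt`, the toy English word groups, and the two-point truncation line are all assumptions made for this example; this is not the evaluation script used to produce the numbers above.

```python
from collections import Counter, defaultdict


def paice_indices(groups, stem):
    """Return (UI, OI) for `stem` over `groups`, a list of lists of related words.

    Assumes every word appears in exactly one group.
    """
    total_words = sum(len(g) for g in groups)
    gdmt = gdnt = gumt = 0.0             # desired merges, desired non-merges, unachieved merges
    stem_members = defaultdict(Counter)  # stem -> how many words of each group map to it
    for gid, group in enumerate(groups):
        n = len(group)
        gdmt += 0.5 * n * (n - 1)
        gdnt += 0.5 * n * (total_words - n)
        by_stem = Counter(stem(w) for w in group)
        gumt += 0.5 * sum(u * (n - u) for u in by_stem.values())
        for s, u in by_stem.items():
            stem_members[s][gid] += u
    gwmt = 0.0                           # wrongly merged pairs (same stem, different groups)
    for counts in stem_members.values():
        n_s = sum(counts.values())
        gwmt += 0.5 * sum(v * (n_s - v) for v in counts.values())
    return gumt / gdmt, gwmt / gdnt


def errt(ui, oi, truncation_line):
    """|OP| / |OT|: P = (ui, oi), T = where the ray from the origin through P
    crosses the polyline `truncation_line`, given as a list of (UI, OI) points."""
    def cross(a, b):
        return a[0] * b[1] - a[1] * b[0]

    p = (ui, oi)
    for a, b in zip(truncation_line, truncation_line[1:]):
        d = (b[0] - a[0], b[1] - a[1])
        denom = cross(p, d)
        if denom == 0:
            continue                     # ray is parallel to this segment
        t = cross(a, d) / denom          # T = t * P
        s = cross(a, p) / denom          # position of T along the segment
        if t > 0 and 0 <= s <= 1:
            return 1 / t
    raise ValueError("ray does not cross the truncation line")


# Toy data (hypothetical): two groups of related words.
groups = [["run", "runs", "running"], ["rung", "rungs"]]

too_aggressive = paice_indices(groups, lambda w: w[:3])  # (0.0, 1.0): merges everything
no_stemming = paice_indices(groups, lambda w: w)         # (1.0, 0.0): merges nothing
example_errt = errt(0.25, 0.25, [(0, 1), (1, 0)])        # 0.5
```

The identity stemmer scoring UI = 1 mirrors the "No stemming" row of the table. In a real evaluation the truncation line would be built from the (UI, OI) points of truncation stemmers of several lengths, measured on the same word groups as the stemmer under test.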

Notes:

References