NLTK Slovenian POS tagger

This project uses the IJS JOS-1M corpus to train a part-of-speech (POS) tagger for the Slovenian language.

Quick usage

The POS tagger is available on PyPI with a prebuilt dictionary. Installation:

pip install slopos

Usage:

import slopos

slopos.tag("Jaz sem iz okolice Ljubljane")

> [('Jaz', 'ZOP-EI'),
 ('sem', 'GP-SPE-N'),
 ('iz', 'DR'),
 ('okolice', 'SOZER'),
 ('Ljubljane', 'SLZER.')]

The tag reference is available in tag_reference-sl.txt (Slovenian) and tag_reference-en.txt (English).

Prepared files

The corpus was processed in several ways to prepare it for NLTK consumption. Partial files are part of this repository.

Original corpus

The original corpus is stored in multiple split XML files, which are kept here in the xml directory.

Partial text files

The XML files have been converted into an NLTK-readable word/tag format using the convert_xml_to_txt.py script. The processed files are stored in the txt directory.

NLTK tagged corpus

Files from the txt directory have been combined into a single file, which is stored in the data/tagged_corpus directory for nltk-trainer consumption.
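
For readers unfamiliar with the word/tag format, the following is a minimal sketch in plain Python of how such a line can be split back into (word, tag) pairs. It is not NLTK's own reader. The function name parse_tagged_line is hypothetical, and the sketch assumes one sentence per line with whitespace-separated tokens and "/" as the separator.

```python
def parse_tagged_line(line, sep="/"):
    """Split a line of word/tag tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the last separator so words that contain "/" stay intact.
        word, found, tag = token.rpartition(sep)
        if not found:
            # No separator present: treat the token as an untagged word.
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


# Sample line built from the tags shown in the usage example.
sample = "Jaz/ZOP-EI sem/GP-SPE-N iz/DR"
print(parse_tagged_line(sample))
```

Splitting on the last separator rather than the first is a design choice of this sketch, so that a token such as a/b/X yields the word a/b and the tag X.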

Training the POS tagger

The POS tagger is trained using the nltk-trainer project, which is included in this project as a submodule.

Install dependencies

virtualenv .
source bin/activate
pip install -r requirements
pip install numpy
python nltk-trainer/setup.py install

Convert input files

python convert_xml_to_txt.py

Train

In the top-level project directory, run the trainer:

python nltk-trainer/train_tagger.py data/tagged_corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --filename slopos/sl-tagger.pickle

Training takes a short while, and you should see output of the following form:

loading data/tagged_corpus
15758 tagged sents, training on 15758
training AffixTagger with affix -3 and backoff <DefaultTagger: tag=-None->
training <class 'nltk.tag.sequential.UnigramTagger'> tagger with backoff <AffixTagger: size=11492>
training <class 'nltk.tag.sequential.BigramTagger'> tagger with backoff <UnigramTagger: size=109127>
training <class 'nltk.tag.sequential.TrigramTagger'> tagger with backoff <BigramTagger: size=130795>
evaluating TrigramTagger
accuracy: 0.930942
creating directory out
dumping TrigramTagger to out/sl-tagger.pickle

The trained tagger is saved as sl-tagger.pickle in the out directory.
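
The training log shows a chain of taggers, each one backing off to the previous, less specific one (trigram, bigram, unigram, affix, default). The following toy sketch in plain Python illustrates only the backoff idea, with a word lookup that falls back to a three-character suffix lookup. It is not the nltk-trainer or NLTK implementation: the function names are hypothetical, the training data is a single sentence taken from the usage example, and the bigram and trigram context models are left out.

```python
from collections import Counter, defaultdict


def train_lookup(tagged_sents, key):
    """Map key(word) to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[key(word)][tag] += 1
    return {k: c.most_common(1)[0][0] for k, c in counts.items()}


def make_backoff_tagger(tagged_sents, default_tag=None):
    """Build a tagger: whole-word lookup, then suffix lookup, then default."""
    by_word = train_lookup(tagged_sents, lambda w: w)
    by_suffix = train_lookup(tagged_sents, lambda w: w[-3:])

    def tag(words):
        result = []
        for w in words:
            # Try the most specific model first, then back off.
            t = by_word.get(w, by_suffix.get(w[-3:], default_tag))
            result.append((w, t))
        return result

    return tag


train = [[("Jaz", "ZOP-EI"), ("sem", "GP-SPE-N"), ("iz", "DR"), ("okolice", "SOZER")]]
tag = make_backoff_tagger(train)
print(tag(["iz", "police", "xyz"]))
```

Here "iz" is resolved by the word lookup, "police" falls back to the suffix shared with "okolice", and "xyz" reaches the default tag.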

Using the POS tagger

After training, the POS tagger is stored as a Python pickle file. NLTK must be installed to load it.

Usage:

import pickle

# Load the trained tagger; NLTK must be installed for unpickling to work
with open('out/sl-tagger.pickle', 'rb') as f:
    sl_tagger = pickle.load(f)

sl_tagger.tag(["Jaz", "sem", "iz", "okolice", "Ljubljane"])

> [('Jaz', 'ZOP-EI'),
 ('sem', 'GP-SPE-N'),
 ('iz', 'DR'),
 ('okolice', 'SOZER'),
 ('Ljubljane', 'SLZER.')]

Note that punctuation should be stripped from words for proper tagging. The tag reference is available in tag_reference-sl.txt (Slovenian) and tag_reference-en.txt (English).
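
One possible way to strip punctuation before tagging is sketched below. The helper name tokenize_for_tagging is hypothetical and not part of this project. It assumes whitespace-separated words and removes only the ASCII punctuation in Python's string.punctuation, so other quotation marks would need extra handling.

```python
import string


def tokenize_for_tagging(sentence):
    """Split on whitespace and strip leading/trailing ASCII punctuation."""
    tokens = []
    for raw in sentence.split():
        word = raw.strip(string.punctuation)
        if word:  # drop tokens that consisted of punctuation only
            tokens.append(word)
    return tokens


print(tokenize_for_tagging("Jaz sem iz okolice Ljubljane."))
```

The resulting list can then be passed to the tagger, for example sl_tagger.tag(tokenize_for_tagging("Jaz sem iz okolice Ljubljane.")).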