kea
kea is a simple rule-based tokenizer for French. The tokenization process is decomposed into two steps:
- A rule-based tokenization step uses punctuation as an indication of token boundaries.
- A large-coverage lexicon is used to merge over-tokenized units (e.g. fixed contractions such as aujourd'hui are treated as a single token).
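The two steps above can be illustrated with a minimal sketch (this is not kea's actual implementation; the regular expression and the tiny lexicon are assumptions for demonstration only):

```python
import re

# Illustrative lexicon entry; kea ships a much larger lexicon.
LEXICON = {"aujourd'hui"}

def rule_based_split(text):
    # Step 1: use punctuation as token boundaries by separating
    # runs of word characters from individual punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def merge_with_lexicon(tokens):
    # Step 2: merge over-tokenized units (e.g. word + apostrophe + word)
    # back into a single token when the lexicon contains the joined form.
    merged, i = [], 0
    while i < len(tokens):
        if i + 2 < len(tokens) and "".join(tokens[i:i + 3]).lower() in LEXICON:
            merged.append("".join(tokens[i:i + 3]))
            i += 3
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = merge_with_lexicon(rule_based_split("Aujourd'hui il pleut."))
print(tokens)
# ['Aujourd'hui', 'il', 'pleut', '.']
```

Without the second step, the contraction would be split into three tokens (Aujourd, ', hui); the lexicon lookup restores it as one unit.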
A typical usage of this module is:

import kea

sentence = "Le Kea est le seul perroquet alpin au monde."
keatokenizer = kea.tokenizer()
tokens = keatokenizer.tokenize(sentence)
print(tokens)
# ['Le', 'Kea', 'est', 'le', 'seul', 'perroquet', 'alpin', 'au', 'monde', '.']