Colibri Utils
This collection of command-line Natural Language Processing utilities currently contains only a single tool:
- **colibri-lang**: Language identification. Detects in what language (parts of) a document are written. Works on both FoLiA XML documents and plain text. Given FoLiA XML input, the document is enriched with language annotation, which may be applied at any structural level FoLiA supports (e.g. paragraphs, sentences). Given plain-text input, each input line is classified. The tool currently supports a limited set of languages, but is easily extendable:
  - English, Spanish, Dutch, French, Portuguese, German, Italian, Swedish, Danish (trained on Europarl)
  - Latin (trained on the Clementine Vulgate Bible and a few Latin works from Project Gutenberg)
  - Historical Dutch:
    - Middle Dutch (trained on Corpus Van Reenen/Mulder and Corpus Gysseling)
    - Early New Dutch (trained on Brieven als Buit)
Installation
Colibri Utils is included in our LaMachine distribution, which is the easiest and recommended way of obtaining it.
Colibri Utils is written in C++. Building from source is also possible if you have the expertise, but requires various dependencies, including ticcutils, libfolia, and colibri-core, which all have to be obtained and compiled separately. Once the dependencies are in place, build and install as follows:
```
$ bash bootstrap.sh
$ ./configure
$ make
$ sudo make install
```
Usage
See `colibri-lang --help` for an overview of the available options.
Methodology
To identify languages, input tokens are matched against a trained lexicon with token frequencies for each language; these lexicons are loaded into memory. No higher-order n-grams are used.
A pseudo-probability is computed for the given sequence of input tokens for each language, and the language with the highest probability wins. A confidence value is computed simply as the number of tokens found in the vocabulary divided by the length of the token sequence. Out-of-vocabulary tokens are assigned a very low probability.
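The scoring scheme described above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual colibri-lang implementation; the function names, the OOV probability constant, and the toy lexicons are all assumptions made for the example.

```python
# Sketch of frequency-lexicon language identification: each language model is
# a token -> frequency mapping; a pseudo-probability is accumulated per
# language, and unknown tokens get a fixed very low probability.
import math

OOV_PROB = 1e-10  # assumed "very low probability" for out-of-vocabulary tokens


def score(tokens, lexicon):
    """Return (log pseudo-probability, confidence) for one language."""
    total = sum(lexicon.values())
    logprob = 0.0
    known = 0
    for token in tokens:
        freq = lexicon.get(token)
        if freq is None:
            logprob += math.log(OOV_PROB)
        else:
            logprob += math.log(freq / total)
            known += 1
    # Confidence: fraction of tokens found in the vocabulary.
    confidence = known / len(tokens) if tokens else 0.0
    return logprob, confidence


def classify(tokens, models):
    """Pick the language whose lexicon yields the highest pseudo-probability."""
    return max(models, key=lambda lang: score(tokens, models[lang])[0])


# Toy lexicons for demonstration only.
models = {
    "en": {"the": 50, "cat": 5, "sat": 3},
    "nl": {"de": 60, "kat": 4, "zat": 2},
}
print(classify("the cat sat".split(), models))  # -> en
```

Note that because out-of-vocabulary tokens still contribute a (tiny) probability, a winning language is always returned; the confidence value is what signals how trustworthy that decision is.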
Training
New models can easily be trained and added, independently of the other models. Simply train an unindexed pattern model with Colibri Core and put the model file and the class file in your data directory. Ensure the data is tokenised and lower-cased prior to building the pattern model (ucto can do both for you). A full example:
```
$ ucto -n -l -Lgeneric corpus.txt corpus.tok.txt   # tokenise and lowercase
$ colibri-classencode corpus.tok.txt               # encode the corpus
$ colibri-patternmodeller -u -t 5 -l 1 -f corpus.tok.colibri.dat -o corpus.colibri.model   # unindexed unigram model, occurrence threshold 5
$ mv corpus.tok.colibri.cls corpus.colibri.cls
$ sudo cp corpus.colibri.* /usr/local/share/colibri-utils/data/
```