Colibri Utils

This collection of command-line Natural Language Processing utilities currently contains only a single tool: colibri-lang, for language identification.

Installation

Colibri Utils is included in our LaMachine distribution, which is the easiest and recommended way of obtaining it.

Colibri Utils is written in C++. Building from source is also possible if you have the expertise, but it requires various dependencies, including ticcutils, libfolia, and colibri-core, which all have to be obtained and compiled separately.

$ bash bootstrap.sh
$ ./configure
$ make
$ sudo make install

Usage

See colibri-lang --help

Methodology

To identify languages, input tokens are matched against a trained lexicon with token frequencies, which is loaded into memory. No higher-order n-grams are used.

A pseudo-probability is computed for the given sequence of input tokens for each language, and the language with the highest probability wins. A confidence value is computed simply as the ratio of in-vocabulary tokens to the length of the token sequence. Out-of-vocabulary tokens are assigned a very low probability.
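The scoring described above can be sketched as follows. This is an illustrative Python sketch, not the actual colibri-lang implementation: the lexicon representation, the OOV constant, and all names are assumptions made for clarity.

```python
# Each language model is assumed to be a dict mapping tokens to relative
# frequencies (a unigram lexicon); no higher-order n-grams are involved.

OOV_PROB = 1e-10  # out-of-vocabulary tokens get a very low probability (illustrative value)

def score(tokens, lexicon):
    """Return (pseudo_probability, confidence) for one language."""
    prob = 1.0
    in_vocab = 0
    for token in tokens:
        if token in lexicon:
            prob *= lexicon[token]
            in_vocab += 1
        else:
            prob *= OOV_PROB
    # confidence: ratio of in-vocabulary tokens to sequence length
    confidence = in_vocab / len(tokens)
    return prob, confidence

def classify(tokens, models):
    """Pick the language whose lexicon yields the highest pseudo-probability."""
    return max(models, key=lambda lang: score(tokens, models[lang])[0])

models = {
    "en": {"the": 0.06, "cat": 0.001, "sat": 0.0005},
    "nl": {"de": 0.06, "kat": 0.001, "zat": 0.0005},
}
print(classify(["the", "cat", "sat"], models))  # all tokens in the "en" lexicon, so "en" wins
```

Because the score is a plain product of per-token frequencies rather than a normalised probability, it is only meaningful for comparison between languages, hence "pseudo-probability".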

Training

New models can easily be trained and added, and are independent of the other models. Simply train an unindexed pattern model with Colibri Core and put the model file and the class file in your data directory. Ensure the data is tokenised and lower-cased prior to building a pattern model (ucto can do both for you). A full example:

$ ucto -n -l -Lgeneric corpus.txt corpus.tok.txt
$ colibri-classencode corpus.tok.txt
$ colibri-patternmodeller -u -t 5 -l 1 -f corpus.tok.colibri.dat -o corpus.colibri.model
$ mv corpus.tok.colibri.cls corpus.colibri.cls
$ sudo cp corpus.colibri.* /usr/local/share/colibri-utils/data/