Prerequisites

Before compiling, you must have the following:

A C++ compiler and GNU make

Boost 1.47.0 or later http://www.boost.org

Eigen 3.1.x http://eigen.tuxfamily.org

Optional:

Intel MKL 11.x http://software.intel.com/en-us/intel-mkl (recommended for better performance)

Python 2.7.x (not 3.x) http://python.org

Cython 0.19.x http://cython.org (needed only for building the Python bindings)

Building

To compile, edit the Makefile to reflect the locations of the Boost and Eigen include directories.

If you want to use the Intel MKL library (recommended if you have it), uncomment the line MKL=/path/to/mkl, editing it to point to the MKL root directory.

By default, multithreading using OpenMP is enabled. To turn it off, comment out the line OMP=1

Then run 'make install'. This creates several programs in the bin/ directory and a library lib/neuralLM.a.
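For example, the relevant Makefile settings might end up looking something like this; the variable names and paths for the Boost and Eigen headers are illustrative and may differ in your copy of the Makefile, while MKL and OMP are the lines described above:

# Locations of the Boost and Eigen include directories (illustrative paths)
BOOST=/usr/local/boost_1_47_0/include
EIGEN=/usr/include/eigen3

# Uncomment and edit to use Intel MKL (optional)
#MKL=/opt/intel/mkl

# Multithreading with OpenMP (comment out to disable)
OMP=1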

Notes on particular configurations:

Training a language model

Building a language model requires some preprocessing. In addition to any preprocessing of your own (tokenization, lowercasing, mapping of digits, etc.), prepareNeuralLM (run with --help for options) extracts n-grams from the training text, maps words to a fixed-size vocabulary, and can hold out a validation set.

A typical invocation would be:

prepareNeuralLM --train_text mydata.txt --ngram_size 3 \
                --n_vocab 5000 --words_file words \
                --train_file train.ngrams \
                --validation_size 500 --validation_file validation.ngrams

which would generate the files train.ngrams, validation.ngrams, and words.

These files are fed into trainNeuralNetwork (run with --help for options). A typical invocation would be:

trainNeuralNetwork --train_file train.ngrams \
                   --validation_file validation.ngrams \
                   --num_epochs 10 \
                   --words_file words \
                   --model_prefix model

After each pass through the data, the trainer will print the log-likelihood of both the training data and validation data (higher is better) and generate a series of model files called model.1, model.2, and so on. You choose which model you want based on the validation log-likelihood.

You can find a working example in the example/ directory. The Makefile there generates a language model from a raw text file.

Notes:

Python code

prepareNeuralLM.py performs the same function as prepareNeuralLM, but in Python. This may be handy if you want to make modifications.
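Assuming it accepts the same options as the C++ version (check --help to confirm), an invocation might look like:

python prepareNeuralLM.py --train_text mydata.txt --ngram_size 3 \
                          --n_vocab 5000 --words_file words \
                          --train_file train.ngrams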

nplm.py is a pure Python module for reading and using language models created by trainNeuralNetwork. See testNeuralLM.py for example usage.

In src/python are Python bindings (using Cython) for the C++ code. To build them, run 'make python/nplm.so'.

Using in a decoder

To use the language model in a decoder, include neuralLM.h and link against neuralLM.a. This provides a class nplm::neuralLM, with the following methods:

void set_normalization(bool normalization);

Turn normalization on or off (default: off). If normalization is off, the probabilities output by the model are not normalized; in general, summing over all possible words will not give one. If normalization is on, the model computes exact probabilities (this is too slow to be recommended for decoding).

void set_map_digits(char c);

Map all digits (0-9) to the specified character. This should match whatever mapping you used during preprocessing.

void set_log_base(double base);

Set the base of the log-probabilities returned by lookup_ngram. The default is e (natural log), whereas most other language modeling toolkits use base 10.

void read(const string &filename);

Read model from file.

int get_order();

Return the order of the language model.

int lookup_word(const string &word);

Map a word to an index for use with lookup_ngram().

double lookup_ngram(const vector<int> &ngram);
double lookup_ngram(const int *ngram, int n);

Look up the log-probability of ngram.
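Putting these together, a minimal integration sketch might look like the following; the model filename, digit-mapping character, and example words are illustrative, and only the methods listed above are assumed:

#include <iostream>
#include <string>
#include <vector>
#include "neuralLM.h"

int main()
{
    nplm::neuralLM lm;
    lm.read("model.10");          // a model file produced by trainNeuralNetwork
    lm.set_normalization(false);  // unnormalized scores are faster for decoding
    lm.set_map_digits('@');       // must match the digit mapping used in preprocessing
    lm.set_log_base(10.0);        // report log10 probabilities instead of natural log

    // Build an n-gram of word indices; this example assumes a model of order 3.
    std::vector<int> ngram;
    ngram.push_back(lm.lookup_word("the"));
    ngram.push_back(lm.lookup_word("quick"));
    ngram.push_back(lm.lookup_word("fox"));

    std::cout << "order: " << lm.get_order() << std::endl;
    std::cout << "log-prob: " << lm.lookup_ngram(ngram) << std::endl;
    return 0;
}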
