
Welcome to Apache OpenNLP Models!

The Apache OpenNLP library provides binary models for processing natural language text. This repository is intended for the distribution of model files as Maven artifacts.

Useful Links

For additional information, visit the OpenNLP Home Page.

You can use OpenNLP with many languages. Additional demo models are provided here.

The models are fully compatible with the latest OpenNLP release. They can be used for testing or getting started.

[!NOTE]
Please train your own models for all other, specialized use cases.

Documentation, including JavaDocs, code usage, and command-line interface examples, is available here.

You can also follow our mailing lists for news and updates.

Overview

We provide Tokenizer, Sentence Detector and Part-of-Speech Tagger models for the following 32 languages:

Armenian
Basque
Bulgarian
Catalan
Croatian
Czech
Danish
Dutch
English
Estonian
Finnish
French
Georgian
German
Greek
Icelandic
Italian
Kazakh
Korean
Latvian
Norwegian
Polish
Portuguese
Romanian
Russian
Serbian
Slovak
Slovenian
Spanish
Swedish
Turkish
Ukrainian

These models are compatible with OpenNLP >= 1.0.0. Further details are available at the OpenNLP Models page and in the CHANGELOG.

In addition, we provide a Language Detector, which can detect 103 languages, identified by their ISO 639-3 codes. It works best on longer texts that contain at least two sentences in the same language.

It is compatible with OpenNLP >= 1.8.3. Model details are available here.

Getting Started

The Universal Dependencies (UD) community provides a framework for consistent annotation of grammar across different human languages. The project is developing cross-linguistically consistent treebank annotation for 150+ languages.

Referencing published Models

You can import UD-based model artifacts directly via Maven, SBT or Gradle, for instance:

Maven

<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-models-pos-de</artifactId>
    <version>${opennlp.models.version}</version>
</dependency>

Corresponding artifacts exist for all 32 supported languages, listed on the Apache OpenNLP Model page.

The broader langdetect model can be referenced like this:

<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-models-langdetect</artifactId>
    <version>${opennlp.models.version}</version>
</dependency>

SBT

libraryDependencies += "org.apache.opennlp" % "opennlp-models-langdetect" % "${opennlp.models.version}"

Gradle

implementation group: "org.apache.opennlp", name: "opennlp-models-langdetect", version: "${opennlp.models.version}"

For more details, please check our documentation.

Training Models

All released sentence detection, tokenization, lemmatizer, and POS tagging models were trained with the ud-train.sh script, and you can use it to train them yourself. It is located in the opennlp-models-training-ud directory in this repository.

Preparing the environment

Before training UD-based OpenNLP models, the execution environment needs the latest OpenNLP release and the latest set of UD treebanks. Download the corresponding archive files and uncompress both in the directory in which the training script resides. Rename the two resulting folders to match the OPENNLP_HOME and UD_HOME variables.

[!IMPORTANT] Adjust the version strings in both variables so that they match the versions you have actually downloaded.
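Before starting a long training run, it can help to confirm that both folders are where the script expects them. The following is a minimal sketch of such a pre-flight check. The function name `check_training_env` and the placeholder paths are hypothetical and not part of ud-train.sh; pass in the values you have set for OPENNLP_HOME and UD_HOME.

```shell
# Hypothetical pre-flight check (not part of ud-train.sh):
# succeeds only if both given directories exist.
check_training_env() {
  rc=0
  for dir in "$1" "$2"; do
    if [ ! -d "$dir" ]; then
      echo "missing directory: $dir" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Example call; both paths are placeholders, replace them with the
# values of OPENNLP_HOME and UD_HOME from your copy of the script.
check_training_env "./opennlp-home-placeholder" "./ud-home-placeholder" \
  || echo "environment incomplete, see messages above"
```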

Selecting model types

Next, select what type of models should be trained. By default, the script defines:

TRAIN_TOKENIZER="true"
TRAIN_POSTAGGER="true"
TRAIN_SENTDETECT="true"
TRAIN_LEMMATIZER="true"

To switch off a model type, set the corresponding variable to "false".
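As an illustration, the sketch below turns off lemmatizer training and prints which model types remain enabled. The helpers `is_enabled` and `enabled_types` are hypothetical and only show how such flags can be read; how ud-train.sh evaluates them internally is not shown here.

```shell
# Same four flags as above, with lemmatizer training switched off.
TRAIN_TOKENIZER="true"
TRAIN_POSTAGGER="true"
TRAIN_SENTDETECT="true"
TRAIN_LEMMATIZER="false"

# Hypothetical helper: succeeds only when the flag value is exactly "true".
is_enabled() {
  [ "$1" = "true" ]
}

# Hypothetical helper: print the enabled model types, space-separated.
enabled_types() {
  out=""
  if is_enabled "$TRAIN_TOKENIZER"; then out="$out tokenizer"; fi
  if is_enabled "$TRAIN_POSTAGGER"; then out="$out postagger"; fi
  if is_enabled "$TRAIN_SENTDETECT"; then out="$out sentdetect"; fi
  if is_enabled "$TRAIN_LEMMATIZER"; then out="$out lemmatizer"; fi
  echo "${out# }"   # drop the leading space
}

enabled_types
```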

Selecting languages

By default, the MODELS variable of the script includes the treebanks of the 32 supported languages. If you only need a smaller or different subset, edit this variable. Each entry must follow the format <Language>|<2-letter-locale-code>|<UD treebank name>, for example: English|en|EWT or Swedish|sv|Talbanken.

[!NOTE] The full list of supported languages and related treebanks is available here. However, even for a treebank listed on the UD page, training OpenNLP models might not succeed. If it does succeed, check the evaluation logs (*.eval) to see whether the computed accuracy meets your expectations.
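When editing the language selection, a malformed entry is easy to miss. The sketch below validates entries of the three-field form described above and splits them into their parts. The function `parse_model_entry` is a hypothetical helper, not part of ud-train.sh, and it assumes the locale code consists of two lowercase letters, as in the two examples.

```shell
# Hypothetical helper (not part of ud-train.sh): split one entry of the
# form Language|locale|treebank into its three fields and validate it.
parse_model_entry() {
  entry="$1"
  # Require exactly two '|' separators.
  case "$entry" in
    *'|'*'|'*'|'*) echo "invalid entry: $entry" >&2; return 1 ;;
    *'|'*'|'*) ;;
    *) echo "invalid entry: $entry" >&2; return 1 ;;
  esac
  language="${entry%%|*}"   # text before the first '|'
  rest="${entry#*|}"        # text after the first '|'
  locale="${rest%%|*}"      # text between the two '|'
  treebank="${rest#*|}"     # text after the second '|'
  # Assumption: the locale code is two lowercase letters, e.g. "en".
  case "$locale" in
    [a-z][a-z]) ;;
    *) echo "invalid locale code in: $entry" >&2; return 1 ;;
  esac
  if [ -z "$language" ] || [ -z "$treebank" ]; then
    echo "invalid entry: $entry" >&2
    return 1
  fi
  printf '%s %s %s\n' "$language" "$locale" "$treebank"
}

# The two example entries from the text above.
for entry in "English|en|EWT" "Swedish|sv|Talbanken"; do
  parse_model_entry "$entry"
done
```

Running the sketch prints one line per entry with the fields separated by spaces. An entry with a missing field or a locale code of the wrong length is reported on standard error instead.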

Adjusting training parameters

Once you're done with the preparations, check the ud-train.conf file. With this config file, you can adjust the number of threads used for certain training steps. Moreover, it is possible to adjust the number of iterations (default: 150) to achieve (slightly) better model performance.

Executing 'ud-train.sh'

Make sure the ud-train.sh script is executable. On Unix-like environments, set the execute bit: chmod 744 ud-train.sh.

[!TIP] Because model training can take a long time, depending on CPU type and number of CPU cores, consider starting the script inside a screen session.

Finally, execute the script by invoking ./ud-train.sh and start brewing and enjoying some :coffee:.

The script logs each training (and evaluation) step per selected language / treebank, thus allowing progress tracking.

Evaluating trained Models

After a training step succeeds, a corresponding evaluation step is executed. If you want to skip it, set EVAL_AFTER_TRAINING to false. If the evaluation runs, the resulting performance (accuracy) is written to files ending with .eval.
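To review the results of a run in one place, the evaluation files can be collected with standard tools. The sketch below is one way to do so. The helper names are hypothetical, the directory to search is an assumption (it defaults to the current directory), and the files are printed verbatim because their internal layout is not assumed here.

```shell
# List evaluation result files below a directory (default: current directory).
list_eval_files() {
  find "${1:-.}" -type f -name '*.eval' | sort
}

# Print each evaluation file, preceded by its name, without interpreting it.
show_eval_results() {
  list_eval_files "$1" | while IFS= read -r f; do
    printf '== %s ==\n' "$f"
    cat "$f"
  done
}

show_eval_results .
```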

Adding new Models

When adding new models to the pom.xml, make sure to also add them to the expected-models.txt file located in opennlp-models-test. In addition, make sure a SHA-256 hash is computed for each binary artifact. The corresponding value must be set or updated correctly for each model type and language.
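The hash of a binary artifact can be computed with standard command-line tools. The following is a minimal sketch, assuming that either `sha256sum` (GNU coreutils) or `shasum` (commonly available on macOS) is installed. The function name `sha256_of` is hypothetical; pass the paths of your model files as arguments to the script.

```shell
# Print only the hex SHA-256 digest of the given file.
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | cut -d ' ' -f 1
  else
    shasum -a 256 "$1" | cut -d ' ' -f 1
  fi
}

# Print "digest  path" for every file passed on the command line.
for artifact in "$@"; do
  printf '%s  %s\n' "$(sha256_of "$artifact")" "$artifact"
done
```

Where the resulting value has to be entered depends on the model type and language, as described above.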

Contributing

The Apache OpenNLP project is developed by volunteers and is always looking for new contributors to work on all parts of the project. Every contribution is welcome and needed to make it better. A contribution can be anything from a small documentation typo fix to a new component.

If you would like to get involved, please follow the instructions here.