NanigoNet

Masato Hagiwara

NanigoNet is a language detector for code-mixed input that supports 150 human languages and 19 programming languages. It is implemented using AllenNLP and PyTorch.

The architecture of NanigoNet

Unlike most other language detectors, NanigoNet detects the language of each character using a convolutional neural network-based sequence labeling model. This makes it suitable for code-mixed input where the language changes within the text (such as source code with comments or documents with markup). It can also produce a prediction for the entire text.

There is another language detector, LanideNN, which also makes a prediction per character. There are some notable differences between NanigoNet and LanideNN, including:

Many design decisions of NanigoNet, including the choice of the training data, are influenced by LanideNN. I hereby sincerely thank the authors of the software.

"Nanigo" (何語) means "what language" in Japanese.

Supported languages

See languages.tsv.

NanigoNet uses a unified set of language IDs for both human and programming languages. Human languages are identified by the prefix h: followed by a 3-letter ISO 639-2 code (for example, h:eng for English). The only exceptions are h:cmn-hans for Simplified Chinese and h:cmn-hant for Traditional Chinese.

Programming languages are identified by the prefix p: followed by the file extension most commonly used for that language (for example, p:js for JavaScript and p:py for Python).
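The ID scheme described above is regular enough to split mechanically. The following is a minimal sketch, not part of NanigoNet itself; the `LanguageId` type and the `parse_language_id` helper are hypothetical names introduced here for illustration only.

```python
from typing import NamedTuple, Optional


class LanguageId(NamedTuple):
    kind: str               # 'human' or 'programming'
    code: str               # language code or file extension, e.g. 'eng', 'cmn', 'py'
    script: Optional[str]   # 'hans' / 'hant' for the Chinese exceptions, else None


def parse_language_id(lang_id: str) -> LanguageId:
    """Split an ID such as 'h:eng', 'h:cmn-hans', or 'p:py' into its parts."""
    prefix, sep, rest = lang_id.partition(':')
    if not sep or not rest:
        raise ValueError(f'not a prefixed language ID: {lang_id!r}')
    if prefix == 'h':
        # Human language: optional '-script' suffix (only used for Chinese).
        code, _, script = rest.partition('-')
        return LanguageId('human', code, script or None)
    if prefix == 'p':
        # Programming language: the rest is a file extension.
        return LanguageId('programming', rest, None)
    raise ValueError(f'unknown prefix in language ID: {lang_id!r}')


print(parse_language_id('h:eng'))
print(parse_language_id('h:cmn-hant'))
print(parse_language_id('p:js'))
```

Keeping the prefix as a separate field makes it easy to filter predictions down to only human or only programming languages.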

Pre-requisites

Install

Usage

From command line:

$ python run.py [path to model.tar.gz] < [input text file]

From Python code:

from nanigonet import NanigoNet

net = NanigoNet(model_path='path/to/model.tar.gz')  # replace with the path to your model archive
texts = ['Hello!', '你好!']
results = net.predict_batch(texts)

This produces a JSON object (or a Python dictionary) per input instance. The keys of the object/dictionary are:

Example:

$ echo 'Hello!' | python run.py model.744k.256d.gcnn.11layers.tar.gz | jq .
{
  "char_probs": [
    {
      "h:eng": 0.9916031956672668,
      "h:mar": 0.004953697789460421,
      "h:sco": 0.0008433321490883827
    },
    ...
  ],
  "text_probs": {
    "h:eng": 0.9324732422828674,
    "h:ita": 0.0068493434228003025,
    "h:spa": 0.006260495167225599
  },
  "char_best": [
    "h:eng",
    "h:eng",
    "h:eng",
    "h:eng",
    "h:eng",
    "h:eng"
  ],
  "text_best": "h:eng"
}
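Because `char_best` holds one label per character, consecutive characters that share a label can be merged into language spans, which is often more convenient for code-mixed input. The following is a minimal sketch that relies only on the `char_best` structure shown in the example above. `language_spans` is a hypothetical helper, not part of NanigoNet, and it assumes one label per Python character. The labels in the second call are hand-written for illustration and are not actual model output.

```python
from itertools import groupby
from typing import List, Tuple


def language_spans(text: str, char_best: List[str]) -> List[Tuple[str, str]]:
    """Merge runs of identical per-character labels into (label, substring) pairs."""
    if len(text) != len(char_best):
        raise ValueError('expected exactly one label per character')
    spans = []
    start = 0
    # groupby yields one group per run of consecutive equal labels.
    for label, group in groupby(char_best):
        length = sum(1 for _ in group)
        spans.append((label, text[start:start + length]))
        start += length
    return spans


# Labels taken from the example output above.
print(language_spans('Hello!', ['h:eng'] * 6))

# Hand-written labels (not real model output) to show a language switch.
print(language_spans('x=1 # hi', ['p:py'] * 4 + ['h:eng'] * 4))
```

In practice, `text` and `char_best` would come from the same entry of the list returned by `predict_batch`.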

Usage of run.py:

usage: run.py [-h] [--top-k TOP_K] [--cuda-device CUDA_DEVICE]
              [--batch-size BATCH_SIZE]
              archive_file

Parameters to the constructor of NanigoNet:

Notes