JamSpell

JamSpell is a spell checking library with the following features:

Colab example

JamSpellPro

jamspell.com - check out the new JamSpell version with the following features:

Benchmarks

<table>
<tr> <td></td> <td>Errors</td> <td>Top 7 Errors</td> <td>Fix Rate</td> <td>Top 7 Fix Rate</td> <td>Broken</td> <td>Speed<br> (words/second)</td> </tr>
<tr> <td>JamSpell</td> <td>3.25%</td> <td>1.27%</td> <td>79.53%</td> <td>84.10%</td> <td>0.64%</td> <td>4854</td> </tr>
<tr> <td>Norvig</td> <td>7.62%</td> <td>5.00%</td> <td>46.58%</td> <td>66.51%</td> <td>0.69%</td> <td>395</td> </tr>
<tr> <td>Hunspell</td> <td>13.10%</td> <td>10.33%</td> <td>47.52%</td> <td>68.56%</td> <td>7.14%</td> <td>163</td> </tr>
<tr> <td>Dummy</td> <td>13.14%</td> <td>13.14%</td> <td>0.00%</td> <td>0.00%</td> <td>0.00%</td> <td>-</td> </tr>
</table>

The model was trained on 300K Wikipedia sentences plus 300K news sentences (English); 95% of the data was used for training and 5% for evaluation. An error model was used to generate text with errors from the original text. The JamSpell corrector was compared with Norvig's corrector, Hunspell, and a dummy corrector that makes no corrections.
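The error model itself is not described in this README. Purely as an illustration, and not as JamSpell's actual error model, a synthetic-typo generator could apply one random character edit (substitute, delete, insert, or transpose) to a fraction of the words. The function names, the `error_rate` parameter, and the lowercase ASCII alphabet below are all assumptions made for this sketch.

```python
import random
import string

ALPHABET = string.ascii_lowercase  # assumed alphabet for this sketch


def corrupt_word(word, rng):
    """Apply one random character edit to `word` (hypothetical helper)."""
    if len(word) < 2:
        return word  # leave one-letter words alone
    i = rng.randrange(len(word))
    op = rng.choice(["substitute", "delete", "insert", "transpose"])
    if op == "substitute":
        # May pick the same letter again, leaving the word unchanged.
        return word[:i] + rng.choice(ALPHABET) + word[i + 1:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    # transpose: swap the character at j with its right neighbour
    j = min(i, len(word) - 2)
    return word[:j] + word[j + 1] + word[j] + word[j + 2:]


def corrupt_sentence(words, error_rate, rng):
    """Corrupt each word independently with probability `error_rate`."""
    return [corrupt_word(w, rng) if rng.random() < error_rate else w
            for w in words]


rng = random.Random(42)  # fixed seed so runs are repeatable
original = "i am the best spell checker".split()
errored = corrupt_sentence(original, 0.3, rng)
print(errored)
```

Keeping the original and errored word lists aligned one-to-one makes it straightforward to score a corrector word by word afterwards.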

We used the following metrics:

To check that the model is not overfitted to Wikipedia and news text, we also evaluated it on "The Adventures of Sherlock Holmes":

<table>
<tr> <td></td> <td>Errors</td> <td>Top 7 Errors</td> <td>Fix Rate</td> <td>Top 7 Fix Rate</td> <td>Broken</td> <td>Speed (words per second)</td> </tr>
<tr> <td>JamSpell</td> <td>3.56%</td> <td>1.27%</td> <td>72.03%</td> <td>79.73%</td> <td>0.50%</td> <td>5524</td> </tr>
<tr> <td>Norvig</td> <td>7.60%</td> <td>5.30%</td> <td>35.43%</td> <td>56.06%</td> <td>0.45%</td> <td>647</td> </tr>
<tr> <td>Hunspell</td> <td>9.36%</td> <td>6.44%</td> <td>39.61%</td> <td>65.77%</td> <td>2.95%</td> <td>284</td> </tr>
<tr> <td>Dummy</td> <td>11.16%</td> <td>11.16%</td> <td>0.00%</td> <td>0.00%</td> <td>0.00%</td> <td>-</td> </tr>
</table>

More details on reproducing these results are available in the "Train" section.
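The table columns can be read as simple word-level ratios. The sketch below is one plausible reading, not the project's evaluation script: it assumes Errors is the share of all words still wrong after correction, Fix Rate is the share of errored words restored to the original, and Broken is the share of originally correct words that the corrector changed. The function name `evaluate` and the sample sentences are made up for illustration.

```python
def evaluate(original, errored, corrected):
    """Return (errors, fix_rate, broken) as fractions for aligned word lists."""
    assert len(original) == len(errored) == len(corrected)
    total = len(original)
    had_error = [o != e for o, e in zip(original, errored)]
    wrong_after = [o != c for o, c in zip(original, corrected)]

    n_errored = sum(had_error)
    n_clean = total - n_errored
    # Errored words that the corrector restored to the original
    fixed = sum(1 for h, w in zip(had_error, wrong_after) if h and not w)
    # Words that were fine before correction but are wrong after it
    broken = sum(1 for h, w in zip(had_error, wrong_after) if not h and w)

    errors = sum(wrong_after) / total
    fix_rate = fixed / n_errored if n_errored else 0.0
    broken_rate = broken / n_clean if n_clean else 0.0
    return errors, fix_rate, broken_rate


original = "i am the best spell checker in the world".split()
errored = "i am the begt spell cherken in teh world".split()
corrected = "i am the best spell checker on teh world".split()

errors, fix_rate, broken = evaluate(original, errored, corrected)
print(f"Errors {errors:.2%}, Fix Rate {fix_rate:.2%}, Broken {broken:.2%}")
# Errors 22.22%, Fix Rate 66.67%, Broken 16.67%
```

Under these assumed definitions the first table is self-consistent: starting from the Dummy error rate of 13.14%, JamSpell's row gives 0.1314 × (1 − 0.7953) + (1 − 0.1314) × 0.0064 ≈ 3.25%, which matches its Errors column.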

Usage

Python

  1. Install swig3 (usually available from your distribution's package manager)

  2. Install jamspell:

pip install jamspell

  3. Download or train a language model

  4. Use it:

import jamspell

corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('en.bin')

# Fix a whole text fragment
corrector.FixFragment('I am the begt spell cherken!')
# u'I am the best spell checker!'

# Get candidates for the word at index 3 ('begt')
corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 3)
# (u'best', u'beat', u'belt', u'bet', u'bent', ... )

# Get candidates for the word at index 5 ('cherken')
corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 5)
# (u'checker', u'chicken', u'checked', u'wherein', u'coherent', ...)
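Because `GetCandidates` takes the whole word list plus the index of one word, collecting suggestions for a sentence means calling it once per position. The helper below is a minimal sketch of that loop. It assumes that the top candidate for a correctly spelled word is the word itself, which this README does not confirm. `suggest_corrections` and `FakeCorrector` are hypothetical names; the fake class only stands in for `jamspell.TSpellCorrector` so the sketch runs without a trained model.

```python
def suggest_corrections(corrector, words, max_candidates=3):
    """Map word index -> top candidates, skipping words whose best
    candidate is the word itself (assumed to mean "looks correct")."""
    suggestions = {}
    for i, word in enumerate(words):
        candidates = list(corrector.GetCandidates(words, i))
        if candidates and candidates[0] != word:
            suggestions[i] = candidates[:max_candidates]
    return suggestions


class FakeCorrector:
    """Hypothetical stand-in for jamspell.TSpellCorrector (no model needed)."""
    TABLE = {
        "begt": ("best", "beat", "belt", "bet", "bent"),
        "cherken": ("checker", "chicken", "checked", "wherein", "coherent"),
    }

    def GetCandidates(self, words, index):
        word = words[index]
        # Unknown words are echoed back, mimicking "no correction needed".
        return self.TABLE.get(word, (word,))


words = ['i', 'am', 'the', 'begt', 'spell', 'cherken']
print(suggest_corrections(FakeCorrector(), words))
# {3: ['best', 'beat', 'belt'], 5: ['checker', 'chicken', 'checked']}
```

With a real model, pass the loaded `corrector` from the example above instead of `FakeCorrector()`.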

C++

  1. Add jamspell and contrib dirs to your project

  2. Use it:

#include <jamspell/spell_corrector.hpp>

int main(int argc, const char** argv) {

    NJamSpell::TSpellCorrector corrector;
    corrector.LoadLangModel("model.bin");

    // Fix a whole text fragment
    corrector.FixFragment(L"I am the begt spell cherken!");
    // "I am the best spell checker!"

    // Get candidates for the word at index 3 ("begt")
    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 3);
    // "best", "beat", "belt", "bet", "bent", ...

    // Get candidates for the word at index 5 ("cherken")
    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 5);
    // "checker", "chicken", "checked", "wherein", "coherent", ...
    return 0;
}

Other languages

You can generate extensions for other languages by following the SWIG tutorial. The SWIG interface file is jamspell.i. Pull requests with build scripts are welcome.

HTTP API

  1. Clone and build jamspell:

git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make

  2. Run the HTTP server, passing the model file, host, and port:

./web_server/web_server en.bin localhost 8080

  3. GET request example:

$ curl "http://localhost:8080/fix?text=I%20am%20the%20begt%20spell%20cherken"
I am the best spell checker

  4. POST request example:

$ curl -d "I am the begt spell cherken" http://localhost:8080/fix
I am the best spell checker

  5. Candidates request example:

$ curl "http://localhost:8080/candidates?text=I%20am%20the%20begt%20spell%20cherken"
# or
$ curl -d "I am the begt spell cherken" http://localhost:8080/candidates
{
    "results": [
        {
            "candidates": [
                "best",
                "beat",
                "belt",
                "bet",
                "bent",
                "beet",
                "beit"
            ],
            "len": 4,
            "pos_from": 9
        },
        {
            "candidates": [
                "checker",
                "chicken",
                "checked",
                "wherein",
                "coherent",
                "cheered",
                "cherokee"
            ],
            "len": 7,
            "pos_from": 20
        }
    ]
}

Here pos_from is the zero-based position of the misspelled word's first character in the text, and len is the length of the misspelled word.
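A client can use pos_from and len to patch the original text. The sketch below builds a percent-encoded /candidates URL and then replaces each flagged span with its first candidate, working from right to left so that earlier offsets stay valid. The helper names are hypothetical, the response is a shortened copy of the JSON above, and the offsets are assumed to be character offsets (for ASCII text this is the same as byte offsets).

```python
import json
from urllib.parse import quote


def candidates_url(text, host="localhost", port=8080):
    """Build a /candidates GET URL with the text percent-encoded."""
    return f"http://{host}:{port}/candidates?text={quote(text)}"


def apply_top_candidates(text, response):
    """Replace each flagged span with its first candidate."""
    # Go right to left so a replacement never shifts a span still to be patched.
    results = sorted(response["results"], key=lambda r: r["pos_from"], reverse=True)
    for item in results:
        if not item["candidates"]:
            continue
        start = item["pos_from"]
        end = start + item["len"]
        text = text[:start] + item["candidates"][0] + text[end:]
    return text


# Shortened copy of the server response shown above
response = json.loads("""
{"results": [
  {"candidates": ["best", "beat", "belt"], "len": 4, "pos_from": 9},
  {"candidates": ["checker", "chicken", "checked"], "len": 7, "pos_from": 20}
]}
""")

text = "I am the begt spell cherken"
print(candidates_url(text))
# http://localhost:8080/candidates?text=I%20am%20the%20begt%20spell%20cherken
print(apply_top_candidates(text, response))
# I am the best spell checker
```

Keeping the candidates instead of calling /fix lets an application show alternatives to the user rather than silently accepting the top one.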

Train

To train a custom model, you need to:

  1. Install cmake

  2. Clone and build jamspell:

git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make

  3. Prepare a UTF-8 text file with sentences to train on (e.g. sherlockholmes.txt) and another file with the language alphabet (e.g. alphabet_en.txt)

  4. Train the model:

./main/jamspell train ../test_data/alphabet_en.txt ../test_data/sherlockholmes.txt model_sherlock.bin

  5. To evaluate the spell checker, you can use the evaluate/evaluate.py script:

python evaluate/evaluate.py -a alphabet_file.txt -jsp your_model.bin -mx 50000 your_test_data.txt

  6. You can use evaluate/generate_dataset.py to generate your train/test data. It supports txt files, the Leipzig Corpora Collection format, and fb2 books.
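If you would rather prepare the data by hand, the 95%/5% split described in the Benchmarks section can be sketched in a few lines. This is only an illustration and is not what generate_dataset.py does internally; `split_sentences`, its parameters, and the sample sentences are assumptions.

```python
import random


def split_sentences(sentences, eval_fraction=0.05, seed=0):
    """Shuffle sentences and split them into (train, evaluation) lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


sentences = [f"sentence number {i}" for i in range(100)]
train, evaluation = split_sentences(sentences)
print(len(train), len(evaluation))
# 95 5
```

Write each list to its own UTF-8 file, one sentence per line, then pass the training file to the train command and the evaluation file to evaluate.py.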

Download models

Here are a few simple models. They were trained on 300K news + 300K Wikipedia sentences. We strongly recommend training your own model on at least a few million sentences to achieve better quality. See the Train section above.