liblevenshtein-java-cli

Command-line interface to liblevenshtein-java. Tagged releases of liblevenshtein-java-cli follow the corresponding tagged releases of liblevenshtein-java.

Cloning the repository

$ git clone https://github.com/universal-automata/liblevenshtein-java-cli.git
Cloning into 'liblevenshtein-java-cli'...
remote: Counting objects: 61, done.
remote: Compressing objects: 100% (45/45), done.
remote: Total 61 (delta 7), reused 56 (delta 2), pack-reused 0
Unpacking objects: 100% (61/61), done.
Checking connectivity... done.

$ cd liblevenshtein-java-cli

Building the command-line interface

$ ./gradlew installDist
:compileJavawarning: No processor claimed any of these annotations: lombok.extern.slf4j.Slf4j,lombok.experimental.ExtensionMethod,lombok.Getter,lombok.RequiredArgsConstructor,edu.umd.cs.findbugs.annotations.SuppressFBWarnings
1 warning

:processResources
:classes
:jar
:startScripts
:installDist

BUILD SUCCESSFUL

Total time: 4.925 secs

This build could be faster, please consider using the Gradle Daemon: https://docs.gradle.org/2.12/userguide/gradle_daemon.html

Getting help on its usage

$ ./build/install/liblevenshtein-java-cli/bin/liblevenshtein-java-cli --help
20:00:34.433 [main] INFO  c.g.l.CommandLineInterface - Parsing command-line args [--help]
usage: liblevenshtein-java-cli [-a <ALGORITHM>] [--colorize] [-d
       <PATH|URI>] [-h] [-i] [-m <INTEGER>] [-q <STRING> <...>] [-s]
       [--serialize <PATH>] [--source-format <FORMAT>] [--target-format
       <FORMAT>]

Command-Line Interface to liblevenshtein (Java)

<FORMAT> specifies the serialization format of the dictionary,
and may be one of the following:
  1. PROTOBUF
     - (de)serialize the dictionary as a protobuf stream.
     - This is the preferred format.
     - See: https://developers.google.com/protocol-buffers/
  2. BYTECODE
     - (de)serialize the dictionary as a Java, bytecode stream.
  3. PLAIN_TEXT
     - (de)serialize the dictionary as a plain text file.
     - Terms are delimited by newlines.

<ALGORITHM> specifies the Levenshtein algorithm to use for
querying-against the dictionary, and may be one of the following:
  1. STANDARD
     - Use the standard, Levenshtein distance which considers the
     following elementary operations:
       o Insertion
       o Deletion
       o Substitution
     - An elementary operation is an operation that incurs a penalty of
     one unit.
  2. TRANSPOSITION
     - Extend the standard, Levenshtein distance to include transpositions
     as elementary operations.
       o A transposition is a swapping of two, consecutive characters as
       follows: ba -> ab
       o With the standard distance, this would require at least two
       operations:
         + An insertion and a deletion
         + A deletion and an insertion
         + Two substitutions
  3. MERGE_AND_SPLIT
     - Extend the standard, Levenshtein distance to include merges and
     splits as elementary operations.
       o A merge takes two characters and merges them into a single one.
         + For example: ab -> c
       o A split takes a single character and splits it into two others
         + For example: a -> bc
       o With the standard distance, these would require at least two
       operations:
         + Merge:
           > A deletion and a substitution
           > A substitution and a deletion
         + Split:
           > An insertion and a substitution
           > A substitution and an insertion

 -a,--algorithm <ALGORITHM>    Levenshtein algorithm to use (Default:
                               TRANSPOSITION)
    --colorize                 Colorize output
 -d,--dictionary <PATH|URI>    Filesystem path or Java-compatible URI to a
                               dictionary of terms
 -h,--help                     print this help text
 -i,--include-distance         Include the Levenshtein distance with each
                               spelling candidate (Default: false)
 -m,--max-distance <INTEGER>   Maximun, Levenshtein distance a spelling
                               candidatemay be from the query term
                               (Default: 2)
 -q,--query <STRING> <...>     Terms to query against the dictionary.  You
                               may specify multiple terms.
 -s,--is-sorted                Specifies that the dictionary is sorted
                               lexicographically, in ascending order
                               (Default: false)
    --serialize <PATH>         Path to save the serialized dictionary
    --source-format <FORMAT>   Format of the source dictionary (Default:
                               adaptively-try each format until one works)
    --target-format <FORMAT>   Format of the serialized dictionary
                               (Default: PROTOBUF)

Example: liblevenshtein-java-cli \
  --algorithm TRANSPOSITION \
  --max-distance 2 \
  --include-distance \
  --query mispelled mispelling \
  --colorize
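
The three <ALGORITHM> values in the help text differ only in which edits count as one elementary operation. The following is a minimal sketch of that difference using a plain dynamic-programming distance. It is not the library's implementation, and the names DistanceSketch, Algorithm, and distance are hypothetical. TRANSPOSITION is modeled here as the restricted variant (adjacent swaps only, with no further edits inside a swapped pair), which is an assumption about the intended semantics.

```java
class DistanceSketch {
    /** Hypothetical enum that mirrors the CLI's ALGORITHM values. */
    enum Algorithm { STANDARD, TRANSPOSITION, MERGE_AND_SPLIT }

    /** Edit distance from v to w, where each elementary operation costs one unit. */
    static int distance(String v, String w, Algorithm algorithm) {
        int m = v.length();
        int n = w.length();
        int[][] d = new int[m + 1][n + 1];
        for (int i = 0; i <= m; i++) d[i][0] = i; // delete every character of v's prefix
        for (int j = 0; j <= n; j++) d[0][j] = j; // insert every character of w's prefix

        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                int substitution = v.charAt(i - 1) == w.charAt(j - 1) ? 0 : 1;
                // Operations shared by all three algorithms.
                int best = Math.min(
                    Math.min(d[i - 1][j] + 1,        // deletion
                             d[i][j - 1] + 1),       // insertion
                    d[i - 1][j - 1] + substitution); // substitution or match

                if (algorithm == Algorithm.TRANSPOSITION && i > 1 && j > 1
                        && v.charAt(i - 1) == w.charAt(j - 2)
                        && v.charAt(i - 2) == w.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1); // swap of two consecutive characters
                }
                if (algorithm == Algorithm.MERGE_AND_SPLIT) {
                    if (i > 1) best = Math.min(best, d[i - 2][j - 1] + 1); // merge: two characters become one
                    if (j > 1) best = Math.min(best, d[i - 1][j - 2] + 1); // split: one character becomes two
                }
                d[i][j] = best;
            }
        }
        return d[m][n];
    }

    public static void main(String[] args) {
        System.out.println(distance("ba", "ab", Algorithm.STANDARD));        // 2
        System.out.println(distance("ba", "ab", Algorithm.TRANSPOSITION));   // 1
        System.out.println(distance("ab", "c", Algorithm.STANDARD));         // 2
        System.out.println(distance("ab", "c", Algorithm.MERGE_AND_SPLIT));  // 1
        System.out.println(distance("a", "bc", Algorithm.MERGE_AND_SPLIT));  // 1
        System.out.println(distance("mispelled", "misspelled", Algorithm.TRANSPOSITION)); // 1
    }
}
```

The first five calls reproduce the examples from the help text: "ba -> ab" needs two standard operations but one transposition, and "ab -> c" and "a -> bc" each need two standard operations but a single merge or split.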

Converting from Plain Text to Protocol Buffers

$ ./build/install/liblevenshtein-java-cli/bin/liblevenshtein-java-cli --dictionary https://raw.githubusercontent.com/universal-automata/liblevenshtein-java/2.2.1/src/test/resources/wordsEn.txt --source-format PLAIN_TEXT --serialize /tmp/dictionary.protobuf.bytes --target-format PROTOBUF
20:40:25.945 [main] INFO  c.g.l.CommandLineInterface - Parsing command-line args [--dictionary, https://raw.githubusercontent.com/universal-automata/liblevenshtein-java/2.2.1/src/test/resources/wordsEn.txt, --source-format, PLAIN_TEXT, --serialize, /tmp/dictionary.protobuf.bytes, --target-format, PROTOBUF]
20:40:26.909 [main] INFO  c.g.d.l.collection.dawg.AbstractDawg - Added [10000] of [109582] terms
20:40:26.932 [main] INFO  c.g.d.l.collection.dawg.AbstractDawg - Added [20000] of [109582] terms
20:40:26.954 [main] INFO  c.g.d.l.collection.dawg.AbstractDawg - Added [30000] of [109582] terms
20:40:26.971 [main] INFO  c.g.d.l.collection.dawg.AbstractDawg - Added [40000] of [109582] terms
20:40:26.987 [main] INFO  c.g.d.l.collection.dawg.AbstractDawg - Added [50000] of [109582] terms
20:40:27.003 [main] INFO  c.g.d.l.collection.dawg.AbstractDawg - Added [60000] of [109582] terms
20:40:27.021 [main] INFO  c.g.d.l.collection.dawg.AbstractDawg - Added [70000] of [109582] terms
20:40:27.037 [main] INFO  c.g.d.l.collection.dawg.AbstractDawg - Added [80000] of [109582] terms
20:40:27.052 [main] INFO  c.g.d.l.collection.dawg.AbstractDawg - Added [90000] of [109582] terms
20:40:27.069 [main] INFO  c.g.d.l.collection.dawg.AbstractDawg - Added [100000] of [109582] terms
20:40:27.093 [main] INFO  c.g.d.l.l.factory.TransducerBuilder - Building transducer out of [109582] terms with algorithm [TRANSPOSITION], defaultMaxDistance [2], includeDistance [false], and maxCandidates [2147483647]
20:40:27.103 [main] INFO  c.g.l.CommandLineInterface - Serializing [109582] terms in the dictionary to [/tmp/dictionary.protobuf.bytes] as format [PROTOBUF]

Querying the dictionary while including candidate distances

$ ./build/install/liblevenshtein-java-cli/bin/liblevenshtein-java-cli --dictionary /tmp/dictionary.protobuf.bytes --source-format PROTOBUF --algorithm TRANSPOSITION --max-distance 2 --include-distance --query mispelled mispelling --colorize
12:24:09.029 [main] INFO  c.g.l.CommandLineInterface - Parsing command-line args [--dictionary, /tmp/dictionary.protobuf.bytes, --source-format, PROTOBUF, --algorithm, TRANSPOSITION, --max-distance, 2, --include-distance, --query, mispelled, mispelling, --colorize]
12:24:09.224 [main] INFO  c.g.d.l.l.factory.TransducerBuilder - Building transducer out of [109582] terms with algorithm [TRANSPOSITION], defaultMaxDistance [2], includeDistance [true], and maxCandidates [2147483647]
+-------------------------------------------------------------------------------
| Spelling Candidates for Query Term: "mispelled"
+-------------------------------------------------------------------------------
| d("mispelled", "spelled") = [2]
| d("mispelled", "impelled") = [2]
| d("mispelled", "dispelled") = [1]
| d("mispelled", "miscalled") = [2]
| d("mispelled", "respelled") = [2]
| d("mispelled", "misspelled") = [1]
+-------------------------------------------------------------------------------
| Spelling Candidates for Query Term: "mispelling"
+-------------------------------------------------------------------------------
| d("mispelling", "spelling") = [2]
| d("mispelling", "impelling") = [2]
| d("mispelling", "dispelling") = [1]
| d("mispelling", "misbilling") = [2]
| d("mispelling", "miscalling") = [2]
| d("mispelling", "misdealing") = [2]
| d("mispelling", "respelling") = [2]
| d("mispelling", "misspelling") = [1]
| d("mispelling", "misspellings") = [2]

Querying the dictionary without including candidate distances

$ ./build/install/liblevenshtein-java-cli/bin/liblevenshtein-java-cli --dictionary /tmp/dictionary.protobuf.bytes --source-format PROTOBUF --algorithm TRANSPOSITION --max-distance 2 --query mispelled mispelling --colorize
12:24:30.437 [main] INFO  c.g.l.CommandLineInterface - Parsing command-line args [--dictionary, /tmp/dictionary.protobuf.bytes, --source-format, PROTOBUF, --algorithm, TRANSPOSITION, --max-distance, 2, --query, mispelled, mispelling, --colorize]
12:24:30.636 [main] INFO  c.g.d.l.l.factory.TransducerBuilder - Building transducer out of [109582] terms with algorithm [TRANSPOSITION], defaultMaxDistance [2], includeDistance [false], and maxCandidates [2147483647]
+-------------------------------------------------------------------------------
| Spelling Candidates for Query Term: "mispelled"
+-------------------------------------------------------------------------------
| "mispelled" ~ "spelled"
| "mispelled" ~ "impelled"
| "mispelled" ~ "dispelled"
| "mispelled" ~ "miscalled"
| "mispelled" ~ "respelled"
| "mispelled" ~ "misspelled"
+-------------------------------------------------------------------------------
| Spelling Candidates for Query Term: "mispelling"
+-------------------------------------------------------------------------------
| "mispelling" ~ "spelling"
| "mispelling" ~ "impelling"
| "mispelling" ~ "dispelling"
| "mispelling" ~ "misbilling"
| "mispelling" ~ "miscalling"
| "mispelling" ~ "misdealing"
| "mispelling" ~ "respelling"
| "mispelling" ~ "misspelling"
| "mispelling" ~ "misspellings"

Supported dictionary sources

The library is designed to read dictionaries from filesystem paths, Java-compatible URIs (including web URLs and JAR resources), process substitutions in Unix shells, and standard input (e.g. piped input).
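
To make the distinction between these sources concrete, here is a minimal sketch of how a <PATH|URI> argument could be turned into a stream of newline-delimited terms. It is an illustration only, not the CLI's actual code: DictionarySourceSketch, open, and readTerms are hypothetical names, and treating a missing source as standard input is an assumption. A process substitution reaches the program as an ordinary path (for example under /dev/fd), so it falls into the filesystem branch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class DictionarySourceSketch {
    /** Opens a dictionary source; a null source is assumed to mean standard input. */
    static InputStream open(String source) throws IOException {
        if (source == null) {
            return System.in; // piped input
        }
        // A scheme of two or more characters followed by ':' is treated as a URI
        // (http:, https:, file:, jar:). One-letter prefixes are left to the path
        // branch so that Windows drive letters are not mistaken for schemes.
        if (source.matches("[A-Za-z][A-Za-z0-9+.-]+:.*")) {
            return URI.create(source).toURL().openStream();
        }
        return Files.newInputStream(Paths.get(source)); // plain path, including /dev/fd/N
    }

    /** Reads newline-delimited terms (the PLAIN_TEXT layout), skipping empty lines. */
    static List<String> readTerms(String source) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(open(source), StandardCharsets.UTF_8))) {
            return reader.lines()
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("dictionary", ".txt");
        Files.write(file, Arrays.asList("spelled", "misspelled"));
        System.out.println(readTerms(file.toString()));         // [spelled, misspelled]
        System.out.println(readTerms(file.toUri().toString())); // [spelled, misspelled]
        Files.delete(file);
    }
}
```

The same temporary file is read once through its filesystem path and once through its file: URI, which is why both lines print the same list.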