Home

Awesome

Build status

Installation

The package is registered in the General registry and so can be installed at the REPL with ] add StringDistances.

Supported Distances

String distances act over any pair of iterators that define length (e.g. AbstractStrings, GraphemeIterators, or AbstractVectors)

The available distances are:

Syntax

Following the Distances.jl package, string distances can inherit from two abstract types: StringSemiMetric <: SemiMetric or StringMetric <: Metric.

Computing the distance between two strings (or iterators)

You can always compute a certain distance between two strings using the following syntax

r = evaluate(dist, x, y)
r = dist(x, y)

Here, dist is an instance of a distance type: for example, the type for the Levenshtein distance is Levenshtein. You can compute the Levenshtein distance between x and y as

r = evaluate(Levenshtein(), x, y)
r = Levenshtein()(x, y)

The function compare returns the similarity score, defined as 1 minus the normalized distance between two strings. It always returns an element of type Float64. A value of 0.0 means completely different and a value of 1.0 means completely similar.

Levenshtein()("martha", "martha")
#> 0
compare("martha", "martha", Levenshtein())
#> 1.0

Computing the distance between two AbstractVectors of strings (or iterators)

Consider X and Y two AbstractVectors of iterators. You can compute the matrix of distances across elements, dist(X[i], Y[j]), as follows:

pairwise(dist, X, Y)

For instance,

pairwise(Jaccard(3), ["martha", "kitten"], ["marhta", "sitting"])

pairwise is optimized in various ways (e.g., for the case of QGram-distances, dictionary of qgrams are pre-computed)

Find closest string

The package also adds convenience functions to find elements in a iterator of strings closest to a given string

The functions findnearest and findall are particularly optimized for the Levenshtein and OptimalStringAlignment distances, as these algorithm can stop early if the distance becomes higher than a certain threshold.

fuzzywuzzy

The package also defines Distance "modifiers" that are inspired by the Python package - fuzzywuzzy. These modifiers are particularly helpful to match strings composed of multiple words (e.g. addresses, company names).

Levenshtein()("this string", "this string is longer") = 10
Partial(Levenshtein())("this string", "this string is longer") = 0

Notes