Home

jaro_winkler is an implementation of the Jaro-Winkler similarity algorithm, written as a C extension. It falls back to a pure Ruby version on platforms other than MRI/KRI, such as JRuby or Rubinius. Both the C and Ruby implementations support any string encoding, such as UTF-8, EUC-JP, Big5, etc.

Installation

gem install jaro_winkler

Usage

require 'jaro_winkler'

# Jaro Winkler Similarity

JaroWinkler.similarity "MARTHA", "MARHTA"
# => 0.9611
JaroWinkler.similarity "MARTHA", "marhta", ignore_case: true
# => 0.9611
JaroWinkler.similarity "MARTHA", "MARHTA", weight: 0.2
# => 0.9778

# Jaro Similarity

JaroWinkler.jaro_similarity "MARTHA", "MARHTA"
# => 0.9444444444444445

There is no JaroWinkler.jaro_winkler_similarity; the name would be tediously long.

Options

| Name | Type | Default | Note |
| --- | --- | --- | --- |
| ignore_case | boolean | false | All lower case characters are converted to upper case prior to the comparison. |
| weight | number | 0.1 | A constant scaling factor for how much the score is adjusted upwards for having common prefixes. |
| threshold | number | 0.7 | The prefix bonus is only added when the compared strings have a Jaro similarity above the threshold. |
| adj_table | boolean | false | Gives partial credit for characters that may be errors due to known phonetic or character recognition errors. A typical example is to match the letter "O" with the number "0". |
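
To make the interaction between weight and threshold concrete, here is a minimal sketch of the prefix bonus applied on top of an already computed Jaro score. The helper name `winkler_adjust_sketch` is hypothetical and not part of the gem's API. The cap of 4 on the common prefix follows Winkler's usual formulation and is an assumption about this gem, as is the handling of a score exactly equal to the threshold.

```ruby
# Hypothetical helper (not the gem's API): applies the Winkler prefix bonus
# to a Jaro score that has already been computed.
def winkler_adjust_sketch(jaro, s1, s2, weight: 0.1, threshold: 0.7)
  # The bonus is only added for scores above the threshold.
  # Assumption: a score exactly equal to the threshold gets no bonus.
  return jaro if jaro <= threshold

  # Length of the common prefix, capped at 4 characters (assumed cap).
  prefix = s1.chars.zip(s2.chars).take(4).take_while { |a, b| a == b }.length

  # Scale the remaining distance to 1.0 by prefix length and weight.
  jaro + prefix * weight * (1.0 - jaro)
end

jaro = 17.0 / 18 # Jaro similarity of "MARTHA" and "MARHTA" (0.9444...)
puts winkler_adjust_sketch(jaro, "MARTHA", "MARHTA").round(4)              # prints 0.9611
puts winkler_adjust_sketch(jaro, "MARTHA", "MARHTA", weight: 0.2).round(4) # prints 0.9778
```

These values agree with the Usage examples above: the shared prefix "MAR" has length 3, so the default weight adds 3 * 0.1 * (1 - 0.9444) to the Jaro score.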

Adjusting Table

Default Table

['A', 'E'], ['A', 'I'], ['A', 'O'], ['A', 'U'], ['B', 'V'], ['E', 'I'], ['E', 'O'], ['E', 'U'], ['I', 'O'], ['I', 'U'],
['O', 'U'], ['I', 'Y'], ['E', 'Y'], ['C', 'G'], ['E', 'F'], ['W', 'U'], ['W', 'V'], ['X', 'K'], ['S', 'Z'], ['X', 'S'],
['Q', 'C'], ['U', 'V'], ['M', 'N'], ['L', 'I'], ['Q', 'O'], ['P', 'R'], ['I', 'J'], ['2', 'Z'], ['5', 'S'], ['8', 'B'],
['1', 'I'], ['1', 'L'], ['0', 'O'], ['0', 'Q'], ['C', 'K'], ['G', 'J'], ['E', ' '], ['Y', ' '], ['S', ' ']
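
As an illustration of how such a table can be consulted, the sketch below builds a lookup from a few of the pairs listed above. The names `ADJ_PAIRS_SKETCH`, `ADJ_LOOKUP_SKETCH` and `similar_sketch?` are hypothetical, and treating each pair as symmetric is an assumption, not something this document states about the gem.

```ruby
require 'set'

# A small subset of the default table above, copied for brevity.
ADJ_PAIRS_SKETCH = [['A', 'E'], ['B', 'V'], ['0', 'O'], ['1', 'I'], ['5', 'S']].freeze

# Assumption: pairs are symmetric, so both orders are stored.
ADJ_LOOKUP_SKETCH = ADJ_PAIRS_SKETCH.flat_map { |a, b| [[a, b], [b, a]] }.to_set.freeze

# Returns true when the two characters are listed as a similar pair.
def similar_sketch?(c1, c2)
  ADJ_LOOKUP_SKETCH.include?([c1, c2])
end

puts similar_sketch?('O', '0') # prints true
puts similar_sketch?('A', 'B') # prints false
```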

How does it work?

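Original Formula (the standard Jaro similarity):

$$\mathrm{sim}_j = \begin{cases} 0 & \text{if } m = 0 \\ \dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise} \end{cases}$$

where $m$ is the number of matching characters and $t$ is half the number of transpositions.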

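With Adjusting Table:

[Formula image: Jaro similarity with the adjusting table applied (image and symbol definitions not preserved)]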
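
For readers who prefer code to formulas, here is a minimal pure Ruby sketch of the standard Jaro similarity, assuming that is what the original formula in this section refers to. It is not the gem's source; the name `jaro_sketch` is hypothetical, returning 0.0 for empty input is a simplifying choice, and the adjusting table is not modelled.

```ruby
# Minimal sketch of the standard Jaro similarity (not the gem's implementation).
def jaro_sketch(s1, s2)
  a = s1.chars
  b = s2.chars
  return 0.0 if a.empty? || b.empty? # simplifying choice for empty input

  # Two characters can only match within this distance of each other.
  window = [[a.length, b.length].max / 2 - 1, 0].max
  a_matched = Array.new(a.length, false)
  b_matched = Array.new(b.length, false)

  # m: number of matching characters.
  m = 0
  a.each_with_index do |ch, i|
    lo = [i - window, 0].max
    hi = [i + window, b.length - 1].min
    (lo..hi).each do |j|
      next if b_matched[j] || b[j] != ch
      a_matched[i] = b_matched[j] = true
      m += 1
      break
    end
  end
  return 0.0 if m.zero?

  # t: half the number of matched characters that appear in a different order.
  a_seq = a.each_index.select { |i| a_matched[i] }.map { |i| a[i] }
  b_seq = b.each_index.select { |j| b_matched[j] }.map { |j| b[j] }
  t = a_seq.zip(b_seq).count { |x, y| x != y } / 2.0

  (m.fdiv(a.length) + m.fdiv(b.length) + (m - t) / m) / 3.0
end

puts jaro_sketch("MARTHA", "MARHTA").round(4) # prints 0.9444
```

For "MARTHA" and "MARHTA" all six characters match and one pair ("T"/"H") is transposed, so m = 6 and t = 1, giving (1 + 1 + 5/6) / 3.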

Why This?

There is another similar gem, fuzzy-string-match, which also provides both C and Ruby versions.

I reinvented this wheel because the naming in fuzzy-string-match, such as getDistance, breaks Ruby convention, and it contains some odd code such as a1 = s1.split( // ) (s1.chars would be better). Furthermore, it has bugs (see the tables below).

Compare with other gems

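| Feature | jaro_winkler | fuzzystringmatch | hotwater | amatch |
| --- | --- | --- | --- | --- |
| Encoding Support | Yes | Pure Ruby only | No | No |
| Windows Support | Yes | ? | No | Yes |
| Adjusting Table | Yes | No | No | No |
| Native | Yes | Yes | Yes | Yes |
| Pure Ruby | Yes | Yes | No | No |
| Speed | 1st | 3rd | 2nd | 4th |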

The table below compares the accuracy of each gem:

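| str_1 | str_2 | origin | jaro_winkler | fuzzystringmatch | hotwater | amatch |
| --- | --- | --- | --- | --- | --- | --- |
| "henka" | "henkan" | 0.9667 | 0.9667 | 0.9722 | 0.9667 | 0.9444 |
| "al" | "al" | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| "martha" | "marhta" | 0.9611 | 0.9611 | 0.9611 | 0.9611 | 0.9444 |
| "jones" | "johnson" | 0.8324 | 0.8324 | 0.8324 | 0.8324 | 0.7905 |
| "abcvwxyz" | "cabvwxyz" | 0.9583 | 0.9583 | 0.9583 | 0.9583 | 0.9583 |
| "dwayne" | "duane" | 0.84 | 0.84 | 0.84 | 0.84 | 0.8222 |
| "dixon" | "dicksonx" | 0.8133 | 0.8133 | 0.8133 | 0.8133 | 0.7667 |
| "fvie" | "ten" | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |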

Benchmark

$ bundle exec rake benchmark
ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-darwin16]

# C Extension
Rehearsal --------------------------------------------------------------
jaro_winkler (8c16e09)       0.240000   0.000000   0.240000 (  0.241347)
fuzzy-string-match (1.0.1)   0.400000   0.010000   0.410000 (  0.403673)
hotwater (0.1.2)             0.250000   0.000000   0.250000 (  0.254503)
amatch (0.4.0)               0.870000   0.000000   0.870000 (  0.875930)
----------------------------------------------------- total: 1.770000sec

                                 user     system      total        real
jaro_winkler (8c16e09)       0.230000   0.000000   0.230000 (  0.236921)
fuzzy-string-match (1.0.1)   0.380000   0.000000   0.380000 (  0.381942)
hotwater (0.1.2)             0.250000   0.000000   0.250000 (  0.254977)
amatch (0.4.0)               0.860000   0.000000   0.860000 (  0.861207)

# Pure Ruby
Rehearsal --------------------------------------------------------------
jaro_winkler (8c16e09)       0.440000   0.000000   0.440000 (  0.438470)
fuzzy-string-match (1.0.1)   0.860000   0.000000   0.860000 (  0.862850)
----------------------------------------------------- total: 1.300000sec

                                 user     system      total        real
jaro_winkler (8c16e09)       0.440000   0.000000   0.440000 (  0.439237)
fuzzy-string-match (1.0.1)   0.910000   0.010000   0.920000 (  0.920259)
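
The "Rehearsal" sections above are the output format of Benchmark.bmbm from Ruby's standard library. The sketch below shows how a comparison of this shape could be set up; it is not the project's actual rake task. The candidate lambdas, the word pairs and the iteration count are hypothetical stand-ins, so that the example runs without any of the gems installed.

```ruby
require 'benchmark'

# Hypothetical stand-ins for the gems under test. With the gems installed,
# a real entry could call e.g. JaroWinkler.similarity(a, b) instead.
CANDIDATES_SKETCH = {
  'candidate A (chars)' => ->(a, b) { a.chars == b.chars },
  'candidate B (bytes)' => ->(a, b) { a.bytes == b.bytes }
}.freeze

# Runs every candidate over all pairs, with a rehearsal pass first (bmbm).
# Returns one Benchmark::Tms per candidate.
def run_benchmark_sketch(candidates, pairs, iterations: 1_000)
  width = candidates.keys.map(&:length).max
  Benchmark.bmbm(width) do |x|
    candidates.each do |label, fn|
      x.report(label) do
        iterations.times { pairs.each { |a, b| fn.call(a, b) } }
      end
    end
  end
end

pairs = [%w[martha marhta], %w[dwayne duane]]
results = run_benchmark_sketch(CANDIDATES_SKETCH, pairs)
```

bmbm runs each block twice and reports the second run, which reduces the effect of memory allocation and garbage collection on the timings.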

Todo