Home

Awesome

splink_scalaudfs

This module started as an extension of an example provided in [1], in which Phillip Lee (ONS) created a UDF defined in Scala, callable from PySpark, that wraps a call to JaroWinklerDistance from Apache Commons.

After understanding how this mechanism worked, the intention was to add more text distance and similarity metrics from Apache Commons for use in fuzzy matching with splink in PySpark. Some additional utility functions were added down the line, and more will surely follow as time goes on...

The current plan is to make this package available through the Maven/Ivy repositories (think PyPI in Python) as uk.gov.moj.dash.linkage, so that it can be accessed more easily from AWS Glue jobs.

TLDR

Latest jar can be found at: jars/scala-udf-similarity-0.1.1_spark3.x.jar


Usage

To build the jar (the project currently uses the Maven build system):

mvn package

To add the jar to PySpark, set the following config:

spark.driver.extraClassPath /path/to/jarfile.jar
spark.jars /path/to/jarfile.jar
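For example, these settings could live in $SPARK_HOME/conf/spark-defaults.conf (the jar name is taken from the TLDR above; the path is a placeholder you should adjust):

```
spark.driver.extraClassPath  /path/to/scala-udf-similarity-0.1.1_spark3.x.jar
spark.jars                   /path/to/scala-udf-similarity-0.1.1_spark3.x.jar
```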

To register the functions with PySpark > 2.3.1:


import pyspark.sql.types

spark.udf.registerJavaFunction('jaro_winkler', 'uk.gov.moj.dash.linkage.JaroWinklerSimilarity',
                               pyspark.sql.types.DoubleType())
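As a rough illustration of the values jaro_winkler (and jaro_sim, registered below) return, here is a minimal pure-Python sketch of the Jaro and Jaro-Winkler similarities. The UDFs themselves delegate to Apache Commons, so treat this only as a reference for the arithmetic:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings (pure-Python sketch)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # characters match if equal and within half the longer length
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched2 = [False] * len(s2)
    m1 = []
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched2[j] = True
                m1.append(c)
                break
    if not m1:
        return 0.0
    m2 = [s2[j] for j, used in enumerate(matched2) if used]
    # transpositions: matched characters out of order, counted in halves
    t = sum(a != b for a, b in zip(m1, m2)) / 2
    m = len(m1)
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Boost the Jaro score for a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```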
                                
                                
spark.udf.registerJavaFunction('jaccard_sim', 'uk.gov.moj.dash.linkage.JaccardSimilarity',
                               pyspark.sql.types.DoubleType())
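For intuition, jaccard_sim can be sketched in pure Python over character sets (to the best of my knowledge Commons Text's JaccardSimilarity also works on character sets, but check the library for its exact tokenisation):

```python
def jaccard_sim(s1: str, s2: str) -> float:
    """Jaccard similarity over character sets: |A n B| / |A u B|."""
    a, b = set(s1), set(s2)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


print(jaccard_sim("night", "nacht"))  # 3/7 = 0.42857...
```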
                                
spark.udf.registerJavaFunction('cosine_distance', 'uk.gov.moj.dash.linkage.CosineDistance',
                               pyspark.sql.types.DoubleType())
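cosine_distance wraps Apache Commons' CosineDistance, which (as I understand it) builds word-count vectors from whitespace-separated tokens and returns one minus their cosine similarity; a pure-Python sketch of that arithmetic:

```python
from collections import Counter
from math import sqrt


def cosine_distance(s1: str, s2: str) -> float:
    """1 - cosine similarity of word-count vectors (whitespace tokens)."""
    v1, v2 = Counter(s1.split()), Counter(s2.split())
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm = sqrt(sum(c * c for c in v1.values())) * sqrt(sum(c * c for c in v2.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm


print(round(cosine_distance("hello world", "hello spark"), 4))  # 0.5
```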

spark.udf.registerJavaFunction('sqlEscape', 'uk.gov.moj.dash.linkage.sqlEscape',
                               pyspark.sql.types.StringType())
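The exact behaviour of sqlEscape is defined in the Scala source; purely as an assumption, a typical escaping strategy doubles single quotes so a value can be embedded in a SQL string literal:

```python
def sql_escape(s: str) -> str:
    """Hypothetical sketch: double single quotes for SQL string literals.
    Check the Scala source for what the real UDF escapes."""
    return s.replace("'", "''")


print(sql_escape("O'Brien"))  # O''Brien
```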

spark.udf.registerJavaFunction('levdamerau_distance', 'uk.gov.moj.dash.linkage.LevDamerauDistance',
                               pyspark.sql.types.DoubleType())
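levdamerau_distance computes a Damerau-Levenshtein edit distance, sketched below as the common optimal-string-alignment variant in pure Python. Whether the UDF returns a raw count or a normalised value is defined in the Scala source, so this is illustrative only:

```python
def lev_damerau(s1: str, s2: str) -> int:
    """Optimal-string-alignment Damerau-Levenshtein distance sketch."""
    n, m = len(s1), len(s2)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # adjacent transposition (each substring transposed at most once)
            if (i > 1 and j > 1 and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[n][m]


print(lev_damerau("ab", "ba"))  # 1 (a single transposition)
```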
spark.udf.registerJavaFunction('jaro_sim', 'uk.gov.moj.dash.linkage.JaroSimilarity',
                               pyspark.sql.types.DoubleType())

from pyspark.sql import types as T
rt = T.ArrayType(T.StructType([T.StructField("_1",T.StringType()), 
                               T.StructField("_2",T.StringType())]))
                               
spark.udf.registerJavaFunction(name='DualArrayExplode', 
            javaClassName='uk.gov.moj.dash.linkage.DualArrayExplode', returnType=rt)
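Judging by the name and the ArrayType(StructType(_1, _2)) return type, DualArrayExplode appears to pair up the elements of two input arrays. This pure-Python sketch assumes a cartesian product; check the Scala source for the actual semantics (latlongexplode below is presumably analogous, for lat/long structs):

```python
from itertools import product


def dual_array_explode(xs, ys):
    """Hypothetical sketch: pair every element of xs with every element
    of ys, matching the ArrayType(StructType(_1, _2)) return type."""
    return [(x, y) for x, y in product(xs, ys)]


print(dual_array_explode(["a", "b"], ["x"]))  # [('a', 'x'), ('b', 'x')]
```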
                                


rt2 = T.ArrayType(
    T.StructType((
        T.StructField(
            "place1",
            T.StructType((
                T.StructField("lat", T.DoubleType()),
                T.StructField("long", T.DoubleType())
            ))
        ),
        T.StructField(
            "place2",
            T.StructType((
                T.StructField("lat", T.DoubleType()),
                T.StructField("long", T.DoubleType())
            ))
        )
    ))
)

                               
spark.udf.registerJavaFunction(name='latlongexplode', 
            javaClassName='uk.gov.moj.dash.linkage.latlongexplode', returnType=rt2)

                                

JVM? How can I install this on my macbook?

Maven is written in Java (and primarily used for building JVM programs).

If you don't have one, installing either OpenJDK or the AWS Corretto JDK is recommended. Oracle JDK is not recommended anymore.


Scala? How can I install this on my macbook?

Install sdkman

An easy way to install Scala on your macbook is to use sdkman

curl -s "https://get.sdkman.io" | bash

and follow the instructions on screen, then initialise sdkman in your current shell:

source "$HOME/.sdkman/bin/sdkman-init.sh"

If you use zsh, ensure that these lines are the last lines of your .zshrc file:

#THIS MUST BE AT THE END OF THE FILE FOR SDKMAN TO WORK!!!
export SDKMAN_DIR="/Users/MYUSERNAME/.sdkman"
[[ -s "/Users/MYUSERNAME/.sdkman/bin/sdkman-init.sh" ]] && source "/Users/MYUSERNAME/.sdkman/bin/sdkman-init.sh" 

where MYUSERNAME is your mac username. If these lines are not at the end of the file, move them there!

Install Scala via sdkman

It is important to install a version of Scala in the 2.11 to 2.12 range, as these are the versions compatible with Spark 2.4.5.

Now that you have sdkman installed run

sdk install scala 2.12.10

Voila! You now have Scala on your machine.

Run scala -version and scalac -version to check.


Maven? How can I install this on my macbook?

Manual way: download a Maven binary distribution from the Apache Maven website, unpack it, and add its bin directory to your PATH.

sdkman way:

sdk install maven

lol. sdkman is quite useful, isn't it?


Ok, everything is installed. How can I build a jar file to use with PySpark?

From the root of the repository, run mvn package (see Usage above); Maven writes the built jar to the target/ directory.


Progress

v.0.1.1

v.0.1.0

v.0.0.10

v.0.0.9

v.0.0.8

v.0.0.7

v.0.0.6

v.0.0.5

v.0.0.4

v.0.0.3

v.0.0.2

v.0.0.1


References:

[1] Using a Scala UDF Example @philip-lee-ons