PySpark2PMML

Python package for converting Apache Spark ML pipelines to PMML.

Features

This package is a thin PySpark wrapper for the JPMML-SparkML library.

Prerequisites

Installation

Install a release version from PyPI:

pip install pyspark2pmml

Alternatively, install the latest snapshot version from GitHub:

pip install --upgrade git+https://github.com/jpmml/pyspark2pmml.git

Configuration and usage

PySpark2PMML must be paired with JPMML-SparkML based on the following compatibility matrix:

Apache Spark version | JPMML-SparkML branch | Latest JPMML-SparkML version
3.0.X                | 2.0.X                | 2.0.3
3.1.X                | 2.1.X                | 2.1.3
3.2.X                | 2.2.X                | 2.2.3
3.3.X                | 2.3.X                | 2.3.2
3.4.X                | 2.4.X                | 2.4.1
3.5.X                | master               | 2.5.0

Launch PySpark; use the --packages command-line option to specify the coordinates of relevant JPMML-SparkML modules:

Launching PySpark with the core module:

pyspark --packages org.jpmml:pmml-sparkml:${version}
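The ${version} placeholder must be filled in from the compatibility matrix. As a minimal sketch, the lookup can be scripted as follows. The names LATEST_JPMML_SPARKML and jpmml_sparkml_coordinates are hypothetical helpers and not part of PySpark2PMML. The version numbers are copied from the matrix in this README and may be outdated by the time you read this.

```python
# Hypothetical helper, not part of PySpark2PMML.
# Maps the Apache Spark "major.minor" version to the latest JPMML-SparkML
# version listed in the compatibility matrix of this README.
LATEST_JPMML_SPARKML = {
    "3.0": "2.0.3",
    "3.1": "2.1.3",
    "3.2": "2.2.3",
    "3.3": "2.3.2",
    "3.4": "2.4.1",
    "3.5": "2.5.0",
}


def jpmml_sparkml_coordinates(spark_version):
    """Return the core module coordinates for a Spark version such as "3.5.1"."""
    major_minor = ".".join(spark_version.split(".")[:2])
    if major_minor not in LATEST_JPMML_SPARKML:
        raise ValueError("Unsupported Apache Spark version: " + spark_version)
    return "org.jpmml:pmml-sparkml:" + LATEST_JPMML_SPARKML[major_minor]


# Print the launch command for an Apache Spark 3.5.1 installation.
print("pyspark --packages " + jpmml_sparkml_coordinates("3.5.1"))
# prints: pyspark --packages org.jpmml:pmml-sparkml:2.5.0
```

Inside a running PySpark session, the Spark version string is available as spark.version, so the same helper can be used to check that the session was launched with a matching JPMML-SparkML release.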

Fitting a Spark ML pipeline:

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import RFormula

df = spark.read.csv("Iris.csv", header = True, inferSchema = True)

formula = RFormula(formula = "Species ~ .")
classifier = DecisionTreeClassifier()
pipeline = Pipeline(stages = [formula, classifier])
pipelineModel = pipeline.fit(df)

Exporting the fitted Spark ML pipeline to a PMML file:

from pyspark2pmml import PMMLBuilder

# sc is the SparkContext that the PySpark shell predefines
pmmlBuilder = PMMLBuilder(sc, df, pipelineModel)

pmmlBuilder.buildFile("DecisionTreeIris.pmml")
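To check that the export produced a well-formed PMML document, the file can be inspected with the Python standard library alone. The sketch below is an illustration, not a PySpark2PMML feature. The function name summarize_pmml is hypothetical. It only assumes that the file is XML with a PMML root element carrying a version attribute.

```python
import xml.etree.ElementTree as ET


def _local_name(tag):
    # ElementTree reports namespaced tags as "{namespace}Name"; keep "Name".
    return tag.rsplit("}", 1)[-1]


def summarize_pmml(path):
    """Return (schema version, list of top-level element names) of a PMML file."""
    root = ET.parse(path).getroot()
    if _local_name(root.tag) != "PMML":
        raise ValueError("Not a PMML document: " + path)
    return root.get("version"), [_local_name(child.tag) for child in root]
```

Calling summarize_pmml("DecisionTreeIris.pmml") after the export returns the PMML schema version together with the names of the top-level elements, which is a quick way to confirm that a model element was written.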

The representation of individual Spark ML pipeline stages can be customized via conversion options:

from pyspark2pmml import PMMLBuilder

classifierModel = pipelineModel.stages[1]

pmmlBuilder = PMMLBuilder(sc, df, pipelineModel) \
	.putOption(classifierModel, "compact", False) \
	.putOption(classifierModel, "estimate_featureImportances", True)

pmmlBuilder.buildFile("DecisionTreeIris.pmml")
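When several options are set for the same stage, the chained putOption calls can be driven from a dictionary. The helper below is a minimal sketch and not part of PySpark2PMML. The name put_options is hypothetical. It relies only on putOption(stage, name, value) returning the builder, which is what the chained calls in the example above depend on.

```python
def put_options(pmml_builder, stage, options):
    """Apply a dict of conversion options to a single pipeline stage."""
    for name, value in options.items():
        # Reassign on every step, mirroring the chained-call style.
        pmml_builder = pmml_builder.putOption(stage, name, value)
    return pmml_builder
```

With the objects from the example above, put_options(PMMLBuilder(sc, df, pipelineModel), classifierModel, {"compact": False, "estimate_featureImportances": True}) would configure the builder in the same way as the two chained calls.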

License

PySpark2PMML is licensed under the terms and conditions of the GNU Affero General Public License, Version 3.0.

If you would like to use PySpark2PMML in a proprietary software project, then it is possible to enter into a licensing agreement which makes PySpark2PMML available under the terms and conditions of the BSD 3-Clause License instead.

Additional information

PySpark2PMML is developed and maintained by Openscoring Ltd, Estonia.

Interested in using Java PMML API software in your company? Please contact info@openscoring.io