SkLearn2PMML
Python package for converting Scikit-Learn pipelines to PMML.
Features
This package is a thin Python wrapper around the JPMML-SkLearn library.
News and Updates
The current version is 0.112.1 (14 December, 2024):
pip install sklearn2pmml==0.112.1
See the NEWS.md file.
Prerequisites
- Java 1.8 or newer. The Java executable must be available on the system path.
- Python 3.8 or newer.
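Both prerequisites can be checked up front with a short script. The following is a minimal sketch and not part of the package: the helper name check_prerequisites is hypothetical, and the Java check only confirms that an executable named java is discoverable on the system path, not that its version is 1.8 or newer.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 8)):
    """Report whether each prerequisite appears to be met.

    Hypothetical helper for illustration; not part of sklearn2pmml.
    """
    return {
        # True if the running interpreter is at least min_python.
        "python": sys.version_info[:2] >= tuple(min_python),
        # True if an executable named "java" is found on the system path.
        # The Java version itself is not inspected here.
        "java": shutil.which("java") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```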
Installation
Installing a release version from PyPI:
pip install sklearn2pmml
Alternatively, installing the latest snapshot version from GitHub:
pip install --upgrade git+https://github.com/jpmml/sklearn2pmml.git
Usage
Command-line application
The sklearn2pmml module is executable.
The main application loads the estimator object from the Pickle file (-i or --input; supports joblib, pickle or dill variants), performs the conversion, and saves the result to a PMML file (-o or --output):
python -m sklearn2pmml --input pipeline.pkl --output pipeline.pmml
Getting help:
python -m sklearn2pmml --help
On some platforms, the Pip package installer additionally makes the main application available as a top-level command:
sklearn2pmml --input pipeline.pkl --output pipeline.pmml
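The command-line application can also be driven from a Python script, which is convenient when the conversion is one step of a larger job. The following is a minimal sketch that uses only the --input and --output options shown above; the helper names build_command and convert are hypothetical.

```python
import subprocess
import sys


def build_command(input_path, output_path):
    """Build the argument list for the sklearn2pmml command-line application."""
    return [
        # Reuse the current interpreter so that the same environment is used.
        sys.executable, "-m", "sklearn2pmml",
        "--input", str(input_path),
        "--output", str(output_path),
    ]


def convert(input_path, output_path):
    """Run the conversion in a child process.

    Raises subprocess.CalledProcessError if the child process exits
    with a non-zero status.
    """
    subprocess.run(build_command(input_path, output_path), check=True)
```

Calling convert("pipeline.pkl", "pipeline.pmml") would then run the same command as the example above. Passing the arguments as a list avoids shell quoting problems with file paths that contain spaces.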
Library
A typical workflow can be summarized as follows:
- Create a PMMLPipeline object, and populate it with pipeline steps as usual. The sklearn2pmml.pipeline.PMMLPipeline class extends the sklearn.pipeline.Pipeline class with the following functionality:
  - If the PMMLPipeline.fit(X, y) method is invoked with a pandas.DataFrame or pandas.Series object as the X argument, then its column names are used as feature names. Otherwise, feature names default to "x1", "x2", ..., "x{number_of_features}".
  - If the PMMLPipeline.fit(X, y) method is invoked with a pandas.Series object as the y argument, then its name is used as the target name (for supervised models). Otherwise, the target name defaults to "y".
- Fit and validate the pipeline as usual.
- Optionally, compute and embed verification data into the PMMLPipeline object by invoking the PMMLPipeline.verify(X) method with a small but representative subset of training data.
- Convert the PMMLPipeline object to a PMML file in the local filesystem by invoking the sklearn2pmml.sklearn2pmml(estimator, pmml_path) utility method.
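The naming rules above can be pictured with a small stand-alone sketch. It is not the package's implementation: the helper names resolve_feature_names and resolve_target_name are hypothetical, and two behaviors are assumptions made for illustration, namely that a pandas.Series passed as X contributes its name as the single feature name, and that an unnamed object falls back to the defaults.

```python
def resolve_feature_names(X, number_of_features):
    """Illustrate the feature naming rule (hypothetical helper)."""
    # pandas.DataFrame exposes its column names via the "columns" attribute.
    columns = getattr(X, "columns", None)
    if columns is not None:
        return [str(column) for column in columns]
    # Assumption: a pandas.Series contributes its "name" as the only feature.
    name = getattr(X, "name", None)
    if name is not None:
        return [str(name)]
    # Anything else (e.g. a plain array) gets "x1", "x2", ...
    return [f"x{i}" for i in range(1, number_of_features + 1)]


def resolve_target_name(y):
    """Illustrate the target naming rule (hypothetical helper)."""
    name = getattr(y, "name", None)
    return str(name) if name is not None else "y"


# A plain list of rows carries no names, so the defaults apply.
print(resolve_feature_names([[5.1, 3.5], [4.9, 3.0]], 2))  # ['x1', 'x2']
print(resolve_target_name(["setosa", "setosa"]))  # y
```

In practice this means that fitting with pandas objects gives the resulting PMML file meaningful field names, while fitting with plain arrays gives generic ones.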
Developing a simple decision tree model for the classification of iris species:
import pandas

iris_df = pandas.read_csv("Iris.csv")

# Every column except "Species" is a feature
iris_X = iris_df[iris_df.columns.difference(["Species"])]
iris_y = iris_df["Species"]

from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline([
    ("classifier", DecisionTreeClassifier())
])
pipeline.fit(iris_X, iris_y)

from sklearn2pmml import sklearn2pmml

# Convert the fitted pipeline to a PMML file
sklearn2pmml(pipeline, "DecisionTreeIris.pmml", with_repr = True)
Developing a more elaborate logistic regression model for the same:
import pandas

iris_df = pandas.read_csv("Iris.csv")

# Every column except "Species" is a feature
iris_X = iris_df[iris_df.columns.difference(["Species"])]
iris_y = iris_df["Species"]

from sklearn_pandas import DataFrameMapper
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn2pmml.decoration import ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline([
    ("mapper", DataFrameMapper([
        (["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], [ContinuousDomain(), SimpleImputer()])
    ])),
    ("pca", PCA(n_components = 3)),
    ("selector", SelectKBest(k = 2)),
    ("classifier", LogisticRegression(multi_class = "ovr"))
])
pipeline.fit(iris_X, iris_y)

# Embed verification data computed from a small random sample
pipeline.verify(iris_X.sample(n = 15))

from sklearn2pmml import sklearn2pmml

# Convert the fitted pipeline to a PMML file
sklearn2pmml(pipeline, "LogisticRegressionIris.pmml", with_repr = True)
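After a conversion it can be useful to confirm which fields ended up in the generated file. The following is a minimal sketch that uses only the Python standard library and relies on the PMML convention of declaring fields as DataField elements; the helper name list_data_fields is hypothetical and is not part of the package.

```python
import xml.etree.ElementTree as ET


def list_data_fields(pmml_text):
    """Return a (name, optype, dataType) tuple for each DataField element.

    pmml_text is the content of a PMML document as str or bytes.
    """
    root = ET.fromstring(pmml_text)
    fields = []
    for element in root.iter():
        # Compare the local tag name only, so that the sketch does not
        # depend on the namespace of a particular PMML schema version.
        if element.tag.rsplit("}", 1)[-1] == "DataField":
            fields.append((
                element.get("name"),
                element.get("optype"),
                element.get("dataType"),
            ))
    return fields
```

For a file produced by one of the examples above, the helper could be called as list_data_fields(pathlib.Path("LogisticRegressionIris.pmml").read_bytes()). If the pipeline was fitted with pandas objects, the returned names should match the original column names.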
Documentation
Integrations:
- Training Scikit-Learn GridSearchCV StatsModels pipelines
- Converting Scikit-Learn H2O.ai pipelines to PMML
- Converting customized Scikit-Learn estimators to PMML
- Training Scikit-Learn StatsModels pipelines
- Upgrading Scikit-Learn XGBoost pipelines
- Training Python-based XGBoost accelerated failure time models
- Converting Scikit-Learn PyCaret 3 pipelines to PMML
- Training Scikit-Learn H2O.ai pipelines
- One-hot encoding categorical features in Scikit-Learn XGBoost pipelines
- Training Scikit-Learn TF(-IDF) plus XGBoost pipelines
- Converting Scikit-Learn TF(-IDF) pipelines to PMML
- Converting Scikit-Learn Imbalanced-Learn pipelines to PMML
- Converting logistic regression models to PMML
- Stacking Scikit-Learn, LightGBM and XGBoost models
- Converting Scikit-Learn GridSearchCV pipelines to PMML
- Converting Scikit-Learn TPOT pipelines to PMML
- Converting Scikit-Learn LightGBM pipelines to PMML
Extensions:
- Extending Scikit-Learn with feature cross-references
- Extending Scikit-Learn with UDF expression transformer
- Extending Scikit-Learn with CHAID models
- Extending Scikit-Learn with prediction post-processing
- Extending Scikit-Learn with outlier detector transformer
- Extending Scikit-Learn with date and datetime features
- Extending Scikit-Learn with feature specifications
- Extending Scikit-Learn with GBDT+LR ensemble models
- Extending Scikit-Learn with business rules model
Miscellaneous:
- Upgrading Scikit-Learn decision tree models
- Measuring the memory consumption of Scikit-Learn models
- Benchmarking Scikit-Learn against JPMML-Evaluator
- Analyzing Scikit-Learn feature importances via PMML
De-installation
Uninstalling:
pip uninstall sklearn2pmml
License
SkLearn2PMML is licensed under the terms and conditions of the GNU Affero General Public License, Version 3.0.
If you would like to use SkLearn2PMML in a proprietary software project, then it is possible to enter into a licensing agreement which makes SkLearn2PMML available under the terms and conditions of the BSD 3-Clause License instead.
Additional information
SkLearn2PMML is developed and maintained by Openscoring Ltd, Estonia.
Interested in using Java PMML API software in your company? Please contact info@openscoring.io