Awesome

JPMML-SkLearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML.

Features
- Overview
- Supported packages
Prerequisites
- The Python side of operations
- The JPMML-SkLearn side of operations
Installation
Usage
- The Python side of operations
- The JPMML-SkLearn side of operations
Documentation
License
Additional information

Features

Overview

Functionality:
- Three times more supported Python packages, transformers and estimators than all the competitors combined!
- Thorough collection, analysis and encoding of feature information:
  - Names.
  - Data and operational types.
  - Valid, invalid and missing value spaces.
  - Descriptive statistics.
- Pipeline extensions:
  - Pruning.
  - Decision engineering (prediction post-processing).
  - Model verification.
- Conversion options.
Extensibility:
- Rich Java APIs for developing custom converters.
- Automatic discovery and registration of custom converters based on META-INF/sklearn2pmml.properties resource files.
- Direct interfacing with other JPMML conversion libraries such as JPMML-H2O, JPMML-LightGBM, JPMML-StatsModels and JPMML-XGBoost.
Production quality:
- Complete test coverage.
- Fully compliant with the JPMML-Evaluator library.

Supported packages

<details> <summary>Scikit-Learn</summary>

Examples: main.py

Probability Calibration:
- calibration.CalibratedClassifierCV
Clustering:
- cluster.KMeans
- cluster.MiniBatchKMeans
Composite estimators:
- compose.ColumnTransformer
- compose.TransformedTargetRegressor
Matrix decomposition:
- decomposition.PCA
- decomposition.IncrementalPCA
- decomposition.TruncatedSVD
Discriminant analysis:
- discriminant_analysis.LinearDiscriminantAnalysis
Dummies:
- dummy.DummyClassifier
- dummy.DummyRegressor
Ensemble methods:
- ensemble.AdaBoostRegressor
- ensemble.BaggingClassifier
- ensemble.BaggingRegressor
- ensemble.ExtraTreesClassifier
- ensemble.ExtraTreesRegressor
- ensemble.GradientBoostingClassifier
- ensemble.GradientBoostingRegressor
- ensemble.HistGradientBoostingClassifier
- ensemble.HistGradientBoostingRegressor
- ensemble.IsolationForest
- ensemble.RandomForestClassifier
- ensemble.RandomForestRegressor
- ensemble.StackingClassifier
- ensemble.StackingRegressor
- ensemble.VotingClassifier
- ensemble.VotingRegressor
Feature extraction:
- feature_extraction.DictVectorizer
- feature_extraction.text.CountVectorizer
- feature_extraction.text.TfidfVectorizer
Feature selection:
- feature_selection.GenericUnivariateSelect (only via sklearn2pmml.SelectorProxy)
- feature_selection.RFE (only via sklearn2pmml.SelectorProxy)
- feature_selection.RFECV (only via sklearn2pmml.SelectorProxy)
- feature_selection.SelectFdr (only via sklearn2pmml.SelectorProxy)
- feature_selection.SelectFpr (only via sklearn2pmml.SelectorProxy)
- feature_selection.SelectFromModel (either directly or via sklearn2pmml.SelectorProxy)
- feature_selection.SelectFwe (only via sklearn2pmml.SelectorProxy)
- feature_selection.SelectKBest (either directly or via sklearn2pmml.SelectorProxy)
- feature_selection.SelectPercentile (only via sklearn2pmml.SelectorProxy)
- feature_selection.VarianceThreshold (only via sklearn2pmml.SelectorProxy)
Impute:
- impute.MissingIndicator
- impute.SimpleImputer
Isotonic regression:
- isotonic.IsotonicRegression
Generalized linear models:
- linear_model.ARDRegression
- linear_model.BayesianRidge
- linear_model.ElasticNet
- linear_model.ElasticNetCV
- linear_model.GammaRegressor
- linear_model.HuberRegressor
- linear_model.Lars
- linear_model.LarsCV
- linear_model.Lasso
- linear_model.LassoCV
- linear_model.LassoLars
- linear_model.LassoLarsCV
- linear_model.LinearRegression
- linear_model.LogisticRegression
- linear_model.LogisticRegressionCV
- linear_model.OrthogonalMatchingPursuit
- linear_model.OrthogonalMatchingPursuitCV
- linear_model.PoissonRegressor
- linear_model.QuantileRegressor
- linear_model.Ridge
- linear_model.RidgeCV
- linear_model.RidgeClassifier
- linear_model.RidgeClassifierCV
- linear_model.SGDClassifier
- linear_model.SGDOneClassSVM
- linear_model.SGDRegressor
- linear_model.TheilSenRegressor
Model selection:
- model_selection.GridSearchCV
- model_selection.RandomizedSearchCV
Multiclass classification:
- multiclass.OneVsRestClassifier
Multioutput regression and classification:
- multioutput.ClassifierChain
- multioutput.MultiOutputClassifier
- multioutput.MultiOutputRegressor
- multioutput.RegressorChain
Naive Bayes:
- naive_bayes.GaussianNB
Nearest neighbors:
- neighbors.KNeighborsClassifier
- neighbors.KNeighborsRegressor
- neighbors.NearestCentroid
- neighbors.NearestNeighbors
Pipelines:
- pipeline.FeatureUnion
- pipeline.Pipeline
Neural network models:
- neural_network.MLPClassifier
- neural_network.MLPRegressor
Preprocessing and normalization:
- preprocessing.Binarizer
- preprocessing.FunctionTransformer
- preprocessing.Imputer
- preprocessing.KBinsDiscretizer
- preprocessing.LabelBinarizer
- preprocessing.LabelEncoder
- preprocessing.MaxAbsScaler
- preprocessing.MinMaxScaler
- preprocessing.OneHotEncoder
- preprocessing.OrdinalEncoder
- preprocessing.PolynomialFeatures
- preprocessing.PowerTransformer
- preprocessing.RobustScaler
- preprocessing.SplineTransformer
- preprocessing.StandardScaler
- preprocessing.TargetEncoder
Support vector machines:
- svm.LinearSVC
- svm.LinearSVR
- svm.OneClassSVM
- svm.SVC
- svm.NuSVC
- svm.SVR
- svm.NuSVR
Decision trees:
- tree.DecisionTreeClassifier
- tree.DecisionTreeRegressor
- tree.ExtraTreeClassifier
- tree.ExtraTreeRegressor

</details> <details> <summary>BorutaPy</summary>

Examples: extensions/boruta.py

boruta.BorutaPy

</details> <details> <summary>Category Encoders</summary>

Examples: extensions/category_encoders.py and extensions/category_encoders-xgboost.py

</details> <details> <summary>H2O.ai</summary>

Examples: main-h2o.py

</details> <details> <summary>Hyperopt-sklearn</summary>

Examples: extensions/hpsklearn.py

hpsklearn.HyperoptEstimator

</details> <details> <summary>Imbalanced-Learn</summary>

Examples: extensions/imblearn.py

Under-sampling methods:
Over-sampling methods:
Combination of over- and under-sampling methods:
- imblearn.combine.SMOTEENN
- imblearn.combine.SMOTETomek
Ensemble methods:
- imblearn.ensemble.BalancedBaggingClassifier
- imblearn.ensemble.BalancedRandomForestClassifier
Pipeline:
- imblearn.pipeline.Pipeline

</details> <details> <summary>InterpretML</summary>

Examples: extensions/interpret.py

</details> <details> <summary>LightGBM</summary>

Examples: main-lightgbm.py

</details> <details> <summary>Mlxtend</summary>

Examples: N/A

mlxtend.preprocessing.DenseTransformer

</details> <details> <summary>OptBinning</summary>

Examples: extensions/optbinning.py

</details> <details> <summary>PyCaret</summary>

Examples: extensions/pycaret.py

- pycaret.internal.pipeline.Pipeline
- pycaret.internal.preprocess.transformers.CleanColumnNames
- pycaret.internal.preprocess.transformers.FixImbalancer
- pycaret.internal.preprocess.transformers.RareCategoryGrouping
- pycaret.internal.preprocess.transformers.RemoveMulticollinearity
- pycaret.internal.preprocess.transformers.RemoveOutliers
- pycaret.internal.preprocess.transformers.TransformerWrapper
- pycaret.internal.preprocess.transformers.TransformerWrapperWithInverse

</details> <details> <summary>Scikit-Lego</summary>

Examples: extensions/sklego.py

- sklego.meta.EstimatorTransformer
  - Predict functions apply, decision_function, predict and predict_proba.
- sklego.meta.OrdinalClassifier
- sklego.pipeline.DebugPipeline
- sklego.preprocessing.IdentityTransformer

</details> <details> <summary>Scikit-Tree</summary>

Examples: extensions/sktree.py

</details> <details> <summary>SkLearn2PMML</summary>

Examples: main.py and extensions/sklearn2pmml.py

Helpers:
- sklearn2pmml.EstimatorProxy
- sklearn2pmml.SelectorProxy
- sklearn2pmml.h2o.H2OEstimatorProxy
Feature cross-references:
- sklearn2pmml.cross_reference.Memorizer
- sklearn2pmml.cross_reference.Recaller
Feature specification and decoration:
- sklearn2pmml.decoration.Alias
- sklearn2pmml.decoration.CategoricalDomain
- sklearn2pmml.decoration.ContinuousDomain
- sklearn2pmml.decoration.ContinuousDomainEraser
- sklearn2pmml.decoration.DateDomain
- sklearn2pmml.decoration.DateTimeDomain
- sklearn2pmml.decoration.DiscreteDomainEraser
- sklearn2pmml.decoration.MultiAlias
- sklearn2pmml.decoration.MultiDomain
- sklearn2pmml.decoration.OrdinalDomain
Ensemble methods:
- sklearn2pmml.ensemble.EstimatorChain
- sklearn2pmml.ensemble.GBDTLMRegressor
  - The GBDT side: All Scikit-Learn decision tree ensemble regressors, LGBMRegressor, XGBRegressor, XGBRFRegressor.
  - The LM side: A Scikit-Learn linear regressor (e.g. ElasticNet, LinearRegression, SGDRegressor).
- sklearn2pmml.ensemble.GBDTLRClassifier
  - The GBDT side: All Scikit-Learn decision tree ensemble classifiers, LGBMClassifier, XGBClassifier, XGBRFClassifier.
  - The LR side: A Scikit-Learn binary linear classifier (e.g. LinearSVC, LogisticRegression, SGDClassifier).
- sklearn2pmml.ensemble.SelectFirstClassifier
- sklearn2pmml.ensemble.SelectFirstRegressor
UDF models:
- sklearn2pmml.expression.ExpressionClassifier
- sklearn2pmml.expression.ExpressionRegressor
Feature selection:
- sklearn2pmml.feature_selection.SelectUnique
Linear models:
- sklearn2pmml.statsmodels.StatsModelsClassifier
- sklearn2pmml.statsmodels.StatsModelsOrdinalClassifier
- sklearn2pmml.statsmodels.StatsModelsRegressor
Neural networks:
- sklearn2pmml.neural_network.MLPTransformer
Pipeline:
- sklearn2pmml.pipeline.PMMLPipeline
Postprocessing:
- sklearn2pmml.postprocessing.BusinessDecisionTransformer
Preprocessing:
- sklearn2pmml.preprocessing.Aggregator
- sklearn2pmml.preprocessing.BSplineTransformer
- sklearn2pmml.preprocessing.CastTransformer
- sklearn2pmml.preprocessing.ConcatTransformer
- sklearn2pmml.preprocessing.CutTransformer
- sklearn2pmml.preprocessing.DataFrameConstructor
- sklearn2pmml.preprocessing.DateTimeFormatter
- sklearn2pmml.preprocessing.DaysSinceYearTransformer
- sklearn2pmml.preprocessing.ExpressionTransformer
  - Ternary conditional expression <expression_true> if <condition> else <expression_false>.
  - Array indexing expressions X[<column index>] and X[<column name>].
  - String concatenation expressions.
  - String slicing expressions <str>[<start>:<stop>].
  - Arithmetic operators +, -, *, / and %.
  - The power operator **.
  - Identity comparison operators is None and is not None.
  - Comparison operators in <list>, not in <list>, <=, <, ==, !=, > and >=.
  - Logical operators and, or and not.
  - Math constants math.e, math.nan, math.pi and math.tau.
  - Math functions (too numerous to list).
  - Numpy constants numpy.e, numpy.NaN, numpy.NZERO, numpy.pi and numpy.PZERO.
  - Numpy function numpy.where.
  - Numpy universal functions (too numerous to list).
  - Pandas constants pandas.NA and pandas.NaT.
  - Pandas functions pandas.isna, pandas.isnull, pandas.notna and pandas.notnull.
  - Scipy functions scipy.special.expit and scipy.special.logit.
  - String functions startswith(<prefix>), endswith(<suffix>), lower, upper and strip.
  - String length function len(<str>).
  - Perl Compatible Regular Expression (PCRE) functions pcre.search and pcre.sub.
  - Regular Expression (RE) functions re.search and re.sub.
  - User-defined functions.
- sklearn2pmml.preprocessing.FilterLookupTransformer
- sklearn2pmml.preprocessing.IdentityTransformer
- sklearn2pmml.preprocessing.LookupTransformer
- sklearn2pmml.preprocessing.MatchesTransformer
- sklearn2pmml.preprocessing.MultiLookupTransformer
- sklearn2pmml.preprocessing.NumberFormatter
- sklearn2pmml.preprocessing.PMMLLabelBinarizer
- sklearn2pmml.preprocessing.PMMLLabelEncoder
- sklearn2pmml.preprocessing.PowerFunctionTransformer
- sklearn2pmml.preprocessing.ReplaceTransformer
- sklearn2pmml.preprocessing.SecondsSinceMidnightTransformer
- sklearn2pmml.preprocessing.SecondsSinceYearTransformer
- sklearn2pmml.preprocessing.SelectFirstTransformer
- sklearn2pmml.preprocessing.SeriesConstructor
- sklearn2pmml.preprocessing.StringNormalizer
- sklearn2pmml.preprocessing.SubstringTransformer
- sklearn2pmml.preprocessing.WordCountTransformer
- sklearn2pmml.preprocessing.h2o.H2OFrameConstructor
- sklearn2pmml.util.Reshaper
- sklearn2pmml.util.Slicer
Rule sets:
- sklearn2pmml.ruleset.RuleSetClassifier
Decision trees:
- sklearn2pmml.tree.chaid.CHAIDClassifier
- sklearn2pmml.tree.chaid.CHAIDRegressor

</details> <details> <summary>Sklearn-Pandas</summary>

Examples: main.py

- sklearn_pandas.CategoricalImputer
- sklearn_pandas.DataFrameMapper

</details> <details> <summary>StatsModels</summary>

Examples: main-statsmodels.py

</details> <details> <summary>TPOT</summary>

Examples: extensions/tpot.py

tpot.builtins.stacking_estimator.StackingEstimator

</details> <details> <summary>XGBoost</summary>

Examples: main-xgboost.py, extensions/category_encoders-xgboost.py and extensions/categorical.py

</details>

Prerequisites

The Python side of operations

- Python 2.7, 3.4 or newer.
- scikit-learn 0.16.0 or newer.
- sklearn-pandas 0.0.10 or newer.
- sklearn2pmml 0.14.0 or newer.

Validating Python installation:

import joblib, sklearn, sklearn_pandas, sklearn2pmml

print(joblib.__version__)
print(sklearn.__version__)
print(sklearn_pandas.__version__)
print(sklearn2pmml.__version__)
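The check above stops at the first missing package with an ImportError. As a minimal sketch, assuming Python 3.8 or newer (where importlib.metadata is part of the standard library), the same information can be gathered without importing the packages. The helper name installed_versions and the REQUIRED list are illustrative and not part of any library; the entries are PyPI distribution names, which differ from import names for scikit-learn and sklearn-pandas.

```python
from importlib.metadata import PackageNotFoundError, version

# PyPI distribution names (not import names) of the packages listed above.
REQUIRED = ["joblib", "scikit-learn", "sklearn-pandas", "sklearn2pmml"]

def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in distributions:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Record the gap instead of aborting, so all packages get reported.
            found[name] = None
    return found

if __name__ == "__main__":
    for name, ver in installed_versions(REQUIRED).items():
        print(name, ver if ver is not None else "NOT INSTALLED")
```

The reported versions still need to be compared against the minimums listed above by hand; this sketch only reports what is installed.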

The JPMML-SkLearn side of operations

Java 1.8 or newer.
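Validating Java installation can be sketched along the same lines as the Python check. The helper java_version_banner below is a hypothetical name, not part of JPMML-SkLearn; it only reports what the java executable on PATH prints and does not parse or compare the version number.

```python
import shutil
import subprocess

def java_version_banner():
    """Return the text printed by `java -version`, or None if java is not on PATH."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` writes its banner to stderr rather than stdout.
    completed = subprocess.run([java, "-version"], capture_output=True, text=True)
    return completed.stderr.strip() or completed.stdout.strip()

if __name__ == "__main__":
    print(java_version_banner() or "No java executable found on PATH")
```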

Installation

Enter the project root directory and build using Apache Maven:

mvn clean install

The build produces a library JAR file pmml-sklearn/target/pmml-sklearn-1.8-SNAPSHOT.jar, and an executable uber-JAR file pmml-sklearn-example/target/pmml-sklearn-example-executable-1.8-SNAPSHOT.jar.

Usage

A typical workflow can be summarized as follows:

- Use Python to train a model.
- Serialize the model in pickle data format to a file in a local filesystem.
- Use the JPMML-SkLearn command-line converter application to turn the pickle file into a PMML file.

The Python side of operations

Loading data to a pandas.DataFrame object:

import pandas

df = pandas.read_csv("Iris.csv")

iris_X = df[df.columns.difference(["Species"])]
iris_y = df["Species"]

First, creating a sklearn_pandas.DataFrameMapper object, which performs column-oriented feature engineering and selection work:

from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler
from sklearn2pmml.decoration import ContinuousDomain

column_preprocessor = DataFrameMapper([
    (["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], [ContinuousDomain(), StandardScaler()])
])

Second, creating Transformer and Selector objects, which perform table-oriented feature engineering and selection work:

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn2pmml import SelectorProxy

table_preprocessor = Pipeline([
    ("pca", PCA(n_components = 3)),
    ("selector", SelectorProxy(SelectKBest(k = 2)))
])

Please note that stateless Scikit-Learn selector objects need to be wrapped into an sklearn2pmml.SelectorProxy object.
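To see what the table-oriented step does to the data, the following sketch runs the same PCA and SelectKBest combination in plain Scikit-Learn. It makes two assumptions that differ from the workflow above: the data comes from the iris dataset bundled with Scikit-Learn rather than from Iris.csv, and the selector is used bare, without SelectorProxy, because the sketch only inspects array shapes and does not convert anything to PMML.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline

# Bundled iris data: 150 rows, 4 numeric feature columns.
X, y = load_iris(return_X_y=True)

shape_demo = Pipeline([
    ("pca", PCA(n_components=3)),    # 4 input columns -> 3 principal components
    ("selector", SelectKBest(k=2))   # 3 components -> the 2 best-scoring ones
])
Xt = shape_demo.fit_transform(X, y)

print(X.shape, "->", Xt.shape)
```

The classifier that follows therefore sees two derived columns, not the four original measurements.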

Third, creating an Estimator object:

from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(min_samples_leaf = 5)

Combining the above objects into a sklearn2pmml.pipeline.PMMLPipeline object, and running the experiment:

from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline([
    ("columns", column_preprocessor),
    ("table", table_preprocessor),
    ("classifier", classifier)
])
pipeline.fit(iris_X, iris_y)

Recording feature importance information in a pickle data format-compatible manner:

classifier.pmml_feature_importances_ = classifier.feature_importances_

Embedding model verification data:

pipeline.verify(iris_X.sample(n = 15))

Storing the fitted PMMLPipeline object in pickle data format:

import joblib

joblib.dump(pipeline, "pipeline.pkl.z", compress = 9)

Please see the test script file main.py for more classification (binary and multi-class) and regression workflows.

The JPMML-SkLearn side of operations

Converting the pipeline pickle file pipeline.pkl.z to a PMML file pipeline.pmml:

java -jar pmml-sklearn-example/target/pmml-sklearn-example-executable-1.8-SNAPSHOT.jar --pkl-input pipeline.pkl.z --pmml-output pipeline.pmml
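When the conversion needs to be triggered from a Python script, the same command line can be assembled and run with the standard library. This is a minimal sketch: the helper names converter_command and convert are hypothetical, and the JAR path is simply the build output named earlier, which may differ in a local checkout.

```python
import subprocess
from pathlib import Path

# Build output path as stated in the Installation section; adjust as needed.
CONVERTER_JAR = Path("pmml-sklearn-example/target/pmml-sklearn-example-executable-1.8-SNAPSHOT.jar")

def converter_command(pkl_input, pmml_output, jar=CONVERTER_JAR):
    """Build the argument list for the command-line converter application."""
    return [
        "java", "-jar", str(jar),
        "--pkl-input", str(pkl_input),
        "--pmml-output", str(pmml_output),
    ]

def convert(pkl_input, pmml_output, jar=CONVERTER_JAR):
    """Run the converter; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(converter_command(pkl_input, pmml_output, jar), check=True)

if __name__ == "__main__":
    # Only print the command here; call convert(...) once the JAR and pickle file exist.
    print(" ".join(converter_command("pipeline.pkl.z", "pipeline.pmml")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.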

Getting help:

java -jar pmml-sklearn-example/target/pmml-sklearn-example-executable-1.8-SNAPSHOT.jar --help
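Once pipeline.pmml has been written, a quick sanity check is to list the fields declared in its DataDictionary element, which is where the PMML standard records field names, operational types and data types. The sketch below uses only the Python standard library. The helper data_fields is a hypothetical name, and SAMPLE is a hand-written stand-in, not actual converter output.

```python
import xml.etree.ElementTree as ET

def data_fields(root):
    """Return (name, optype, dataType) for every DataField element under root."""
    fields = []
    for element in root.iter():
        # Drop the "{namespace}" prefix so any PMML schema version is accepted.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "DataField":
            fields.append((element.get("name"), element.get("optype"), element.get("dataType")))
    return fields

# Hand-written stand-in for a PMML document; real converter output is much larger.
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <DataDictionary>
    <DataField name="Species" optype="categorical" dataType="string"/>
    <DataField name="Sepal.Length" optype="continuous" dataType="double"/>
  </DataDictionary>
</PMML>"""

for field in data_fields(ET.fromstring(SAMPLE)):
    print(field)
```

For a real file, pass ET.parse("pipeline.pmml").getroot() instead of the parsed sample.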

Documentation

Integrations:

Extensions:

Miscellaneous:

Archived:

Converting Scikit-Learn to PMML

License

JPMML-SkLearn is licensed under the terms and conditions of the GNU Affero General Public License, Version 3.0.

If you would like to use JPMML-SkLearn in a proprietary software project, then it is possible to enter into a licensing agreement which makes JPMML-SkLearn available under the terms and conditions of the BSD 3-Clause License instead.

Additional information

JPMML-SkLearn is developed and maintained by Openscoring Ltd, Estonia.

Interested in using Java PMML API software in your company? Please contact info@openscoring.io
