JPMML-SkLearn
Java library and command-line application for converting Scikit-Learn pipelines to PMML.
Features
Overview
- Functionality:
- Three times more supported Python packages, transformers and estimators than all the competitors combined!
- Thorough collection, analysis and encoding of feature information:
- Names.
- Data and operational types.
- Valid, invalid and missing value spaces.
- Descriptive statistics.
- Pipeline extensions:
- Pruning.
- Decision engineering (prediction post-processing).
- Model verification.
- Conversion options.
- Extensibility:
- Rich Java APIs for developing custom converters.
- Automatic discovery and registration of custom converters based on META-INF/sklearn2pmml.properties resource files.
- Direct interfacing with other JPMML conversion libraries such as JPMML-H2O, JPMML-LightGBM, JPMML-StatsModels and JPMML-XGBoost.
- Production quality:
- Complete test coverage.
- Fully compliant with the JPMML-Evaluator library.
Supported packages
<details> <summary>Scikit-Learn</summary>Examples: main.py
- Probability Calibration:
- Clustering:
- Composite estimators:
- Matrix decomposition:
- Discriminant analysis:
- Dummies:
- Ensemble methods:
ensemble.AdaBoostRegressor
ensemble.BaggingClassifier
ensemble.BaggingRegressor
ensemble.ExtraTreesClassifier
ensemble.ExtraTreesRegressor
ensemble.GradientBoostingClassifier
ensemble.GradientBoostingRegressor
ensemble.HistGradientBoostingClassifier
ensemble.HistGradientBoostingRegressor
ensemble.IsolationForest
ensemble.RandomForestClassifier
ensemble.RandomForestRegressor
ensemble.StackingClassifier
ensemble.StackingRegressor
ensemble.VotingClassifier
ensemble.VotingRegressor
- Feature extraction:
- Feature selection:
feature_selection.GenericUnivariateSelect (only via sklearn2pmml.SelectorProxy)
feature_selection.RFE (only via sklearn2pmml.SelectorProxy)
feature_selection.RFECV (only via sklearn2pmml.SelectorProxy)
feature_selection.SelectFdr (only via sklearn2pmml.SelectorProxy)
feature_selection.SelectFpr (only via sklearn2pmml.SelectorProxy)
feature_selection.SelectFromModel (either directly or via sklearn2pmml.SelectorProxy)
feature_selection.SelectFwe (only via sklearn2pmml.SelectorProxy)
feature_selection.SelectKBest (either directly or via sklearn2pmml.SelectorProxy)
feature_selection.SelectPercentile (only via sklearn2pmml.SelectorProxy)
feature_selection.VarianceThreshold (only via sklearn2pmml.SelectorProxy)
- Impute:
- Isotonic regression:
- Generalized linear models:
linear_model.ARDRegression
linear_model.BayesianRidge
linear_model.ElasticNet
linear_model.ElasticNetCV
linear_model.GammaRegressor
linear_model.HuberRegressor
linear_model.Lars
linear_model.LarsCV
linear_model.Lasso
linear_model.LassoCV
linear_model.LassoLars
linear_model.LassoLarsCV
linear_model.LinearRegression
linear_model.LogisticRegression
linear_model.LogisticRegressionCV
linear_model.OrthogonalMatchingPursuit
linear_model.OrthogonalMatchingPursuitCV
linear_model.PoissonRegressor
linear_model.QuantileRegressor
linear_model.Ridge
linear_model.RidgeCV
linear_model.RidgeClassifier
linear_model.RidgeClassifierCV
linear_model.SGDClassifier
linear_model.SGDOneClassSVM
linear_model.SGDRegressor
linear_model.TheilSenRegressor
- Model selection:
- Multiclass classification:
- Multioutput regression and classification:
- Naive Bayes:
- Nearest neighbors:
- Pipelines:
- Neural network models:
- Preprocessing and normalization:
preprocessing.Binarizer
preprocessing.FunctionTransformer
preprocessing.Imputer
preprocessing.KBinsDiscretizer
preprocessing.LabelBinarizer
preprocessing.LabelEncoder
preprocessing.MaxAbsScaler
preprocessing.MinMaxScaler
preprocessing.OneHotEncoder
preprocessing.OrdinalEncoder
preprocessing.PolynomialFeatures
preprocessing.PowerTransformer
preprocessing.RobustScaler
preprocessing.SplineTransformer
preprocessing.StandardScaler
preprocessing.TargetEncoder
- Support vector machines:
- Decision trees:
Examples: extensions/boruta.py
boruta.BorutaPy
Examples: extensions/category_encoders.py and extensions/category_encoders-xgboost.py
category_encoders.BaseNEncoder
category_encoders.BinaryEncoder
category_encoders.CatBoostEncoder
category_encoders.CountEncoder
category_encoders.LeaveOneOutEncoder
category_encoders.OneHotEncoder
category_encoders.OrdinalEncoder
category_encoders.TargetEncoder
category_encoders.WOEEncoder
Examples: main-h2o.py
h2o.estimators.extended_isolation_forest.H2OExtendedIsolationForestEstimator
h2o.estimators.gbm.H2OGradientBoostingEstimator
h2o.estimators.glm.H2OGeneralizedLinearEstimator
h2o.estimators.isolation_forest.H2OIsolationForestEstimator
h2o.estimators.random_forest.H2ORandomForestEstimator
h2o.estimators.stackedensemble.H2OStackedEnsembleEstimator
h2o.estimators.xgboost.H2OXGBoostEstimator
Examples: extensions/hpsklearn.py
hpsklearn.HyperoptEstimator
Examples: extensions/imblearn.py
- Under-sampling methods:
imblearn.under_sampling.AllKNN
imblearn.under_sampling.ClusterCentroids
imblearn.under_sampling.CondensedNearestNeighbour
imblearn.under_sampling.EditedNearestNeighbours
imblearn.under_sampling.InstanceHardnessThreshold
imblearn.under_sampling.NearMiss
imblearn.under_sampling.NeighbourhoodCleaningRule
imblearn.under_sampling.OneSidedSelection
imblearn.under_sampling.RandomUnderSampler
imblearn.under_sampling.RepeatedEditedNearestNeighbours
imblearn.under_sampling.TomekLinks
- Over-sampling methods:
- Combination of over- and under-sampling methods:
- Ensemble methods:
- Pipeline:
Examples: extensions/interpret.py
interpret.glassbox.ClassificationTree
interpret.glassbox.ExplainableBoostingClassifier
interpret.glassbox.ExplainableBoostingRegressor
interpret.glassbox.LinearRegression
interpret.glassbox.LogisticRegression
interpret.glassbox.RegressionTree
Examples: main-lightgbm.py
</details> <details> <summary>Mlxtend</summary>Examples: N/A
</details> <details> <summary>OptBinning</summary>Examples: extensions/optbinning.py
optbinning.BinningProcess
optbinning.ContinuousOptimalBinning
optbinning.MulticlassOptimalBinning
optbinning.OptimalBinning
optbinning.scorecard.Scorecard
Examples: extensions/pycaret.py
pycaret.internal.pipeline.Pipeline
pycaret.internal.preprocess.transformers.CleanColumnNames
pycaret.internal.preprocess.transformers.FixImbalancer
pycaret.internal.preprocess.transformers.RareCategoryGrouping
pycaret.internal.preprocess.transformers.RemoveMulticollinearity
pycaret.internal.preprocess.transformers.RemoveOutliers
pycaret.internal.preprocess.transformers.TransformerWrapper
pycaret.internal.preprocess.transformers.TransformerWrapperWithInverse
Examples: extensions/sklego.py
sklego.meta.EstimatorTransformer
- Predict functions apply, decision_function, predict and predict_proba.
sklego.meta.OrdinalClassifier
sklego.pipeline.DebugPipeline
sklego.preprocessing.IdentityTransformer
Examples: extensions/sktree.py
sktree.ensemble.ExtendedIsolationForest
sktree.ensemble.ObliqueRandomForestClassifier
sktree.ensemble.ObliqueRandomForestRegressor
sktree.tree.ObliqueDecisionTreeClassifier
sktree.tree.ObliqueDecisionTreeRegressor
Examples: main.py and extensions/sklearn2pmml.py
- Helpers:
sklearn2pmml.EstimatorProxy
sklearn2pmml.SelectorProxy
sklearn2pmml.h2o.H2OEstimatorProxy
- Feature cross-references:
sklearn2pmml.cross_reference.Memorizer
sklearn2pmml.cross_reference.Recaller
- Feature specification and decoration:
sklearn2pmml.decoration.Alias
sklearn2pmml.decoration.CategoricalDomain
sklearn2pmml.decoration.ContinuousDomain
sklearn2pmml.decoration.ContinuousDomainEraser
sklearn2pmml.decoration.DateDomain
sklearn2pmml.decoration.DateTimeDomain
sklearn2pmml.decoration.DiscreteDomainEraser
sklearn2pmml.decoration.MultiAlias
sklearn2pmml.decoration.MultiDomain
sklearn2pmml.decoration.OrdinalDomain
- Ensemble methods:
sklearn2pmml.ensemble.EstimatorChain
sklearn2pmml.ensemble.GBDTLMRegressor
- The GBDT side: All Scikit-Learn decision tree ensemble regressors, LGBMRegressor, XGBRegressor, XGBRFRegressor.
- The LM side: A Scikit-Learn linear regressor (e.g. ElasticNet, LinearRegression, SGDRegressor).
sklearn2pmml.ensemble.GBDTLRClassifier
- The GBDT side: All Scikit-Learn decision tree ensemble classifiers, LGBMClassifier, XGBClassifier, XGBRFClassifier.
- The LR side: A Scikit-Learn binary linear classifier (e.g. LinearSVC, LogisticRegression, SGDClassifier).
sklearn2pmml.ensemble.SelectFirstClassifier
sklearn2pmml.ensemble.SelectFirstRegressor
- UDF models:
sklearn2pmml.expression.ExpressionClassifier
sklearn2pmml.expression.ExpressionRegressor
- Feature selection:
sklearn2pmml.feature_selection.SelectUnique
- Linear models:
sklearn2pmml.statsmodels.StatsModelsClassifier
sklearn2pmml.statsmodels.StatsModelsOrdinalClassifier
sklearn2pmml.statsmodels.StatsModelsRegressor
- Neural networks:
sklearn2pmml.neural_network.MLPTransformer
- Pipeline:
sklearn2pmml.pipeline.PMMLPipeline
- Postprocessing:
sklearn2pmml.postprocessing.BusinessDecisionTransformer
- Preprocessing:
sklearn2pmml.preprocessing.Aggregator
sklearn2pmml.preprocessing.BSplineTransformer
sklearn2pmml.preprocessing.CastTransformer
sklearn2pmml.preprocessing.ConcatTransformer
sklearn2pmml.preprocessing.CutTransformer
sklearn2pmml.preprocessing.DataFrameConstructor
sklearn2pmml.preprocessing.DateTimeFormatter
sklearn2pmml.preprocessing.DaysSinceYearTransformer
sklearn2pmml.preprocessing.ExpressionTransformer
- Ternary conditional expression <expression_true> if <condition> else <expression_false>.
- Array indexing expressions X[<column index>] and X[<column name>].
- String concatenation expressions.
- String slicing expressions <str>[<start>:<stop>].
- Arithmetic operators +, -, *, / and %.
- The power operator **.
- Identity comparison operators is None and is not None.
- Comparison operators in <list>, not in <list>, <=, <, ==, !=, > and >=.
- Logical operators and, or and not.
- Math constants math.e, math.nan, math.pi and math.tau.
- Math functions (too numerous to list).
- Numpy constants numpy.e, numpy.NaN, numpy.NZERO, numpy.pi and numpy.PZERO.
- Numpy function numpy.where.
- Numpy universal functions (too numerous to list).
- Pandas constants pandas.NA and pandas.NaT.
- Pandas functions pandas.isna, pandas.isnull, pandas.notna and pandas.notnull.
- Scipy functions scipy.special.expit and scipy.special.logit.
- String functions startswith(<prefix>), endswith(<suffix>), lower, upper and strip.
- String length function len(<str>).
- Perl Compatible Regular Expression (PCRE) functions pcre.search and pcre.sub.
- Regular Expression (RE) functions re.search and re.sub.
- User-defined functions.
sklearn2pmml.preprocessing.FilterLookupTransformer
sklearn2pmml.preprocessing.IdentityTransformer
sklearn2pmml.preprocessing.LookupTransformer
sklearn2pmml.preprocessing.MatchesTransformer
sklearn2pmml.preprocessing.MultiLookupTransformer
sklearn2pmml.preprocessing.NumberFormatter
sklearn2pmml.preprocessing.PMMLLabelBinarizer
sklearn2pmml.preprocessing.PMMLLabelEncoder
sklearn2pmml.preprocessing.PowerFunctionTransformer
sklearn2pmml.preprocessing.ReplaceTransformer
sklearn2pmml.preprocessing.SecondsSinceMidnightTransformer
sklearn2pmml.preprocessing.SecondsSinceYearTransformer
sklearn2pmml.preprocessing.SelectFirstTransformer
sklearn2pmml.preprocessing.SeriesConstructor
sklearn2pmml.preprocessing.StringNormalizer
sklearn2pmml.preprocessing.SubstringTransformer
sklearn2pmml.preprocessing.WordCountTransformer
sklearn2pmml.preprocessing.h2o.H2OFrameConstructor
sklearn2pmml.util.Reshaper
sklearn2pmml.util.Slicer
- Rule sets:
sklearn2pmml.ruleset.RuleSetClassifier
- Decision trees:
sklearn2pmml.tree.chaid.CHAIDClassifier
sklearn2pmml.tree.chaid.CHAIDRegressor
Examples: main.py
sklearn_pandas.CategoricalImputer
sklearn_pandas.DataFrameMapper
Examples: main-statsmodels.py
statsmodels.api.GLM
statsmodels.api.Logit
statsmodels.api.MNLogit
statsmodels.api.OLS
statsmodels.api.Poisson
statsmodels.api.QuantReg
statsmodels.api.WLS
statsmodels.miscmodels.ordinal_model.OrderedModel
Examples: extensions/tpot.py
tpot.builtins.stacking_estimator.StackingEstimator
Examples: main-xgboost.py, extensions/category_encoders-xgboost.py and extensions/categorical.py
xgboost.XGBClassifier
xgboost.XGBRanker
xgboost.XGBRegressor
xgboost.XGBRFClassifier
xgboost.XGBRFRegressor
Prerequisites
The Python side of operations
- Python 2.7, 3.4 or newer.
- scikit-learn 0.16.0 or newer.
- sklearn-pandas 0.0.10 or newer.
- sklearn2pmml 0.14.0 or newer.
Validating Python installation:
import joblib, sklearn, sklearn_pandas, sklearn2pmml
print(joblib.__version__)
print(sklearn.__version__)
print(sklearn_pandas.__version__)
print(sklearn2pmml.__version__)
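The printed versions still have to be compared against the minimums by eye. A minimal sketch of automating that comparison with the standard library follows. The helper names (`parse_version`, `at_least`, `check_prerequisites`) are hypothetical and not part of any JPMML package, and the version parser is deliberately simplistic (it ignores pre-release and post-release suffixes).

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions as listed in the prerequisites above.
MINIMUM_VERSIONS = {
    "scikit-learn": "0.16.0",
    "sklearn-pandas": "0.0.10",
    "sklearn2pmml": "0.14.0",
}

def parse_version(text):
    """Turn '1.4.2' into (1, 4, 2); parsing stops at the first part without leading digits."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def at_least(installed, minimum):
    """Compare two version strings numerically, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b

def check_prerequisites(minimums=MINIMUM_VERSIONS):
    """Return {package: (installed_version_or_None, ok)} for each required package."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            # The distribution is not installed at all.
            report[name] = (None, False)
            continue
        report[name] = (installed, at_least(installed, minimum))
    return report

if __name__ == "__main__":
    for name, (installed, ok) in check_prerequisites().items():
        print(name, installed or "not installed", "OK" if ok else "needs attention")
```

Packages are looked up by their distribution names (for example `scikit-learn`), which differ from the import names used in the snippet above (`sklearn`).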
The JPMML-SkLearn side of operations
- Java 1.8 or newer.
Installation
Enter the project root directory and build using Apache Maven:
mvn clean install
The build produces a library JAR file pmml-sklearn/target/pmml-sklearn-1.8-SNAPSHOT.jar, and an executable uber-JAR file pmml-sklearn-example/target/pmml-sklearn-example-executable-1.8-SNAPSHOT.jar.
Usage
A typical workflow can be summarized as follows:
- Use Python to train a model.
- Serialize the model in pickle data format to a file in a local filesystem.
- Use the JPMML-SkLearn command-line converter application to turn the pickle file to a PMML file.
The Python side of operations
Loading data to a pandas.DataFrame object:
import pandas
df = pandas.read_csv("Iris.csv")
iris_X = df[df.columns.difference(["Species"])]
iris_y = df["Species"]
First, creating a sklearn_pandas.DataFrameMapper object, which performs column-oriented feature engineering and selection work:
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler
from sklearn2pmml.decoration import ContinuousDomain
column_preprocessor = DataFrameMapper([
    (["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], [ContinuousDomain(), StandardScaler()])
])
Second, creating Transformer and Selector objects, which perform table-oriented feature engineering and selection work:
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn2pmml import SelectorProxy
table_preprocessor = Pipeline([
    ("pca", PCA(n_components = 3)),
    ("selector", SelectorProxy(SelectKBest(k = 2)))
])
Please note that stateless Scikit-Learn selector objects need to be wrapped into an sklearn2pmml.SelectorProxy object.
Third, creating an Estimator object:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(min_samples_leaf = 5)
Combining the above objects into a sklearn2pmml.pipeline.PMMLPipeline object, and running the experiment:
from sklearn2pmml.pipeline import PMMLPipeline
pipeline = PMMLPipeline([
    ("columns", column_preprocessor),
    ("table", table_preprocessor),
    ("classifier", classifier)
])
pipeline.fit(iris_X, iris_y)
Recording feature importance information in a pickle data format-compatible manner:
classifier.pmml_feature_importances_ = classifier.feature_importances_
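A likely reason for this extra assignment is that `feature_importances_` on Scikit-Learn tree models is a computed property rather than a stored attribute, so it is not part of the instance state that gets pickled. The sketch below illustrates the difference with a toy class. `ToyTree` and `raw_scores_` are hypothetical stand-ins, not Scikit-Learn or sklearn2pmml names.

```python
class ToyTree:
    """Hypothetical stand-in for an estimator whose importances are computed on demand."""

    def __init__(self, raw_scores):
        self.raw_scores_ = raw_scores

    @property
    def feature_importances_(self):
        # Computed on every access; never stored in the instance __dict__.
        total = sum(self.raw_scores_)
        return [score / total for score in self.raw_scores_]

toy = ToyTree([3.0, 1.0])

# By default, pickle serializes the instance __dict__, which vars() exposes.
state_before = sorted(vars(toy))
print(state_before)  # ['raw_scores_']

# Copying the computed value into a plain attribute makes it part of the pickled state.
toy.pmml_feature_importances_ = toy.feature_importances_
state_after = sorted(vars(toy))
print(state_after)  # ['pmml_feature_importances_', 'raw_scores_']
```

A converter that only reads the pickled state sees the plain attribute but has no way of calling the Python property.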
Embedding model verification data:
pipeline.verify(iris_X.sample(n = 15))
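Verification data pairs sample inputs with the predictions the fitted pipeline produced for them, so that a PMML engine can later check that it reproduces those predictions. The sketch below shows one common way such a comparison is made, using a relative precision and a zero threshold. The function name and the default tolerances are assumptions for illustration, not the sklearn2pmml or JPMML-Evaluator implementation.

```python
def within_tolerance(expected, actual, precision=1e-13, zero_threshold=1e-13):
    """Check a reproduced prediction against the value recorded at training time.

    The default tolerances are illustrative assumptions.
    """
    if abs(expected) <= zero_threshold:
        # Relative precision is meaningless around zero, so compare absolute magnitude.
        return abs(actual) <= zero_threshold
    # Otherwise the deviation must stay within a fraction of the expected value.
    return abs(actual - expected) <= precision * abs(expected)

# Hypothetical recorded probability versus values a scoring engine might return.
print(within_tolerance(0.5, 0.5))     # True
print(within_tolerance(0.5, 0.5001))  # False
print(within_tolerance(0.0, 1e-15))   # True
```

Sampling a small number of rows, as above, keeps the embedded data compact while still exercising the whole pipeline.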
Storing the fitted PMMLPipeline object in pickle data format:
import joblib
joblib.dump(pipeline, "pipeline.pkl.z", compress = 9)
Please see the test script file main.py for more classification (binary and multi-class) and regression workflows.
The JPMML-SkLearn side of operations
Converting the pipeline pickle file pipeline.pkl.z to a PMML file pipeline.pmml:
java -jar pmml-sklearn-example/target/pmml-sklearn-example-executable-1.8-SNAPSHOT.jar --pkl-input pipeline.pkl.z --pmml-output pipeline.pmml
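When the conversion needs to run from a Python script right after training, the same command can be launched via `subprocess`. This is a minimal sketch: the helper names are hypothetical, only the `--pkl-input` and `--pmml-output` options shown above are used, and it assumes that `java` is on the `PATH` and that the converter signals failure with a non-zero exit status.

```python
import subprocess

def build_convert_command(jar_path, pkl_path, pmml_path):
    """Assemble the converter command line as an argument list (no shell quoting needed)."""
    return [
        "java", "-jar", jar_path,
        "--pkl-input", pkl_path,
        "--pmml-output", pmml_path,
    ]

def convert(jar_path, pkl_path, pmml_path):
    """Run the converter; raises subprocess.CalledProcessError on a non-zero exit status."""
    subprocess.run(build_convert_command(jar_path, pkl_path, pmml_path), check=True)

command = build_convert_command(
    "pmml-sklearn-example/target/pmml-sklearn-example-executable-1.8-SNAPSHOT.jar",
    "pipeline.pkl.z",
    "pipeline.pmml",
)
print(" ".join(command))
```

Passing the arguments as a list avoids shell interpretation of file names that contain spaces.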
Getting help:
java -jar pmml-sklearn-example/target/pmml-sklearn-example-executable-1.8-SNAPSHOT.jar --help
Documentation
Integrations:
- Training Scikit-Learn GridSearchCV StatsModels pipelines
- Converting Scikit-Learn H2O.ai pipelines to PMML
- Converting customized Scikit-Learn estimators to PMML
- Training Scikit-Learn StatsModels pipelines
- Upgrading Scikit-Learn XGBoost pipelines
- Training Python-based XGBoost accelerated failure time models
- Converting Scikit-Learn PyCaret 3 pipelines to PMML
- Training Scikit-Learn H2O.ai pipelines
- One-hot encoding categorical features in Scikit-Learn XGBoost pipelines
- Training Scikit-Learn TF(-IDF) plus XGBoost pipelines
- Converting Scikit-Learn TF(-IDF) pipelines to PMML
- Converting Scikit-Learn Imbalanced-Learn pipelines to PMML
- Converting logistic regression models to PMML
- Stacking Scikit-Learn, LightGBM and XGBoost models
- Converting Scikit-Learn GridSearchCV pipelines to PMML
- Converting Scikit-Learn TPOT pipelines to PMML
- Converting Scikit-Learn LightGBM pipelines to PMML
Extensions:
- Extending Scikit-Learn with feature cross-references
- Extending Scikit-Learn with UDF expression transformer
- Extending Scikit-Learn with CHAID models
- Extending Scikit-Learn with prediction post-processing
- Extending Scikit-Learn with outlier detector transformer
- Extending Scikit-Learn with date and datetime features
- Extending Scikit-Learn with feature specifications
- Extending Scikit-Learn with GBDT+LR ensemble models
- Extending Scikit-Learn with business rules model
Miscellaneous:
- Upgrading Scikit-Learn decision tree models
- Measuring the memory consumption of Scikit-Learn models
- Benchmarking Scikit-Learn against JPMML-Evaluator
- Analyzing Scikit-Learn feature importances via PMML
Archived:
License
JPMML-SkLearn is licensed under the terms and conditions of the GNU Affero General Public License, Version 3.0.
If you would like to use JPMML-SkLearn in a proprietary software project, then it is possible to enter into a licensing agreement which makes JPMML-SkLearn available under the terms and conditions of the BSD 3-Clause License instead.
Additional information
JPMML-SkLearn is developed and maintained by Openscoring Ltd, Estonia.
Interested in using Java PMML API software in your company? Please contact info@openscoring.io