JPMML-XGBoost

Java library and command-line application for converting XGBoost models to PMML.

Features

Supports all XGBoost versions from 0.4 through 2.0.3.

News and Updates

See the NEWS.md file.

Prerequisites

Installation

Enter the project root directory and build using Apache Maven:

mvn clean install

The build produces a library JAR file pmml-xgboost/target/pmml-xgboost-1.8-SNAPSHOT.jar, and an executable uber-JAR file pmml-xgboost-example/target/pmml-xgboost-example-executable-1.8-SNAPSHOT.jar.

Usage

A typical workflow can be summarized as follows:

  1. Use XGBoost to train a model.
  2. Save the model and the associated feature map to files in a local filesystem.
  3. Use the JPMML-XGBoost command-line converter application to convert those two files into a PMML file.

The XGBoost side of operations

Training a binary classification model using the Audit.csv dataset.

R language

library("r2pmml")
library("xgboost")

df = read.csv("Audit.csv", stringsAsFactors = TRUE)

# Three continuous features, followed by five categorical features
X = df[c("Age", "Hours", "Income", "Education", "Employment", "Gender", "Marital", "Occupation")]
y = df["Adjusted"]

audit.formula = formula("~ . - 1")
audit.frame = model.frame(audit.formula, data = X, na.action = na.pass)
# Define rules for binarizing categorical features into binary indicator features
audit.contrasts = lapply(X[sapply(X, is.factor)], contrasts, contrasts = FALSE)
# Perform binarization
audit.matrix = model.matrix(audit.formula, data = audit.frame, contrasts.arg = audit.contrasts)

# Generate feature map based on audit.frame (not audit.matrix), because data.frame holds richer column meta-information than matrix
audit.fmap = r2pmml::as.fmap(audit.frame)
r2pmml::write.fmap(audit.fmap, "Audit.fmap")

audit.xgb = xgboost(data = audit.matrix, label = as.matrix(y), objective = "binary:logistic", nrounds = 131)
xgb.save(audit.xgb, "XGBoostAudit.model")

Python language - Learning API

Using an Audit.fmap feature map file (works with any XGBoost version):

from sklearn2pmml.xgboost import make_feature_map
from xgboost import DMatrix

import pandas
import xgboost

df = pandas.read_csv("Audit.csv")

# Three continuous features, followed by five categorical features
X = df[["Age", "Hours", "Income", "Education", "Employment", "Gender", "Marital", "Occupation"]]
y = df["Adjusted"]

# Convert categorical features into binary indicator features
X = pandas.get_dummies(data = X, prefix_sep = "=", dtype = bool)

audit_fmap = make_feature_map(X, enable_categorical = False)
audit_fmap.save("Audit.fmap")

audit_dmatrix = DMatrix(data = X, label = y)

audit_xgb = xgboost.train(params = {"objective" : "binary:logistic"}, dtrain = audit_dmatrix, num_boost_round = 131)
audit_xgb.save_model("XGBoostAudit.model")
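
For orientation, the sketch below shows what a feature map might look like as plain text. It assumes the tab-separated `index<TAB>name<TAB>type` layout of XGBoost feature map files and reuses the type codes that appear elsewhere in this README ("int", "float", "i"). The function name `format_fmap_lines` and the sample feature list are hypothetical, and the file written by `make_feature_map` may differ in detail.

```python
from typing import Iterable, List, Tuple

# Type codes as used by the to_fmap_type() helper in this README:
# "int" (continuous integer), "float" (continuous float), "i" (binary indicator)
ALLOWED_TYPES = {"int", "float", "i"}

def format_fmap_lines(features: Iterable[Tuple[str, str]]) -> List[str]:
    """Return one 'index<TAB>name<TAB>type' line per feature (assumed layout)."""
    lines = []
    for index, (name, ftype) in enumerate(features):
        if ftype not in ALLOWED_TYPES:
            raise ValueError(ftype)
        lines.append("{}\t{}\t{}".format(index, name, ftype))
    return lines

# Hypothetical excerpt: two continuous features and one binary indicator
# named in the "<feature>=<category>" style produced by prefix_sep = "="
lines = format_fmap_lines([("Age", "int"), ("Income", "float"), ("Gender=Male", "i")])
print("\n".join(lines))
```

The indicator name keeps the `=` separator chosen in `pandas.get_dummies`, so the original categorical feature and its category value both remain recoverable from the column name.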

The same, but using an embedded feature map (works with XGBoost 1.4 and newer):

from xgboost import DMatrix

import pandas
import xgboost

def to_fmap_type(dtype):
    # Continuous integers
    if dtype == "int":
        return "int"
    # Continuous floats
    elif dtype == "float":
        return "float"
    # Binary indicators (i.e. 0/1 values) generated by pandas.get_dummies(X)
    elif dtype == "bool":
        return "i"
    else:
        raise ValueError(dtype)

df = pandas.read_csv("Audit.csv")

# Three continuous features, followed by five categorical features
X = df[["Age", "Hours", "Income", "Education", "Employment", "Gender", "Marital", "Occupation"]]
y = df["Adjusted"]

# Convert categorical features into binary indicator features
X = pandas.get_dummies(data = X, prefix_sep = "=", dtype = bool)

feature_names = X.columns.values
feature_types = [to_fmap_type(dtype) for dtype in X.dtypes]

# Constructing a DMatrix with explicit feature names and feature types
audit_dmatrix = DMatrix(data = X, label = y, feature_names = feature_names, feature_types = feature_types)

audit_xgb = xgboost.train(params = {"objective" : "binary:logistic"}, dtrain = audit_dmatrix, num_boost_round = 131)
audit_xgb.save_model("XGBoostAudit.model")
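
Comparing a dtype against the strings "int" and "float" may depend on the platform's default integer and float widths. A possible alternative, sketched here as an assumption rather than as the project's recommended approach, is to dispatch on the single-character NumPy `dtype.kind` code. The function name `to_fmap_type_by_kind` is hypothetical.

```python
def to_fmap_type_by_kind(kind: str) -> str:
    """Map a NumPy dtype.kind code to a feature type string."""
    # Signed ("i") and unsigned ("u") integers
    if kind in ("i", "u"):
        return "int"
    # Floating-point numbers
    elif kind == "f":
        return "float"
    # Booleans, such as the indicator columns from pandas.get_dummies(X)
    elif kind == "b":
        return "i"
    else:
        raise ValueError(kind)

# Kind codes for an integer, a float and a boolean column
print([to_fmap_type_by_kind(kind) for kind in ["i", "f", "b"]])
```

With a pandas data frame, this would be used as `feature_types = [to_fmap_type_by_kind(dtype.kind) for dtype in X.dtypes]`.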

Python language - Scikit-Learn API

Using an Audit.fmap feature map file (works with any XGBoost version):

from sklearn.preprocessing import LabelEncoder
from sklearn2pmml.xgboost import make_feature_map
from xgboost.sklearn import XGBClassifier

import pandas

df = pandas.read_csv("Audit.csv")

# Three continuous features, followed by five categorical features
X = df[["Age", "Hours", "Income", "Education", "Employment", "Gender", "Marital", "Occupation"]]
y = df["Adjusted"]

# Convert categorical features into binary indicator features
X = pandas.get_dummies(data = X, prefix_sep = "=", dtype = bool)

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

audit_fmap = make_feature_map(X, enable_categorical = False)
audit_fmap.save("Audit.fmap")

classifier = XGBClassifier(objective = "binary:logistic", n_estimators = 131)
classifier.fit(X, y)

audit_xgb = classifier.get_booster()
audit_xgb.save_model("XGBoostAudit.model")

The JPMML-XGBoost side of operations

Converting the model file XGBoostAudit.model (binary data format) together with the associated feature map file Audit.fmap to a PMML file XGBoostAudit.pmml:

java -jar pmml-xgboost-example/target/pmml-xgboost-example-executable-1.8-SNAPSHOT.jar --model-input XGBoostAudit.model --fmap-input Audit.fmap --target-name Adjusted --pmml-output XGBoostAudit.pmml

If the XGBoost model contains an embedded feature map, then the --fmap-input command-line option may be omitted.
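
When the conversion is driven from a Python script, the command line can be assembled programmatically. The following is a minimal sketch that uses only the options shown above. The helper name `build_converter_command` is hypothetical, and the JAR path is an assumption that should match the local build.

```python
from typing import List, Optional

def build_converter_command(jar_path: str, model_input: str, target_name: str,
                            pmml_output: str, fmap_input: Optional[str] = None) -> List[str]:
    """Assemble the converter command line as an argument list."""
    command = ["java", "-jar", jar_path, "--model-input", model_input]
    # The feature map file is only needed if the model has no embedded feature map
    if fmap_input is not None:
        command += ["--fmap-input", fmap_input]
    command += ["--target-name", target_name, "--pmml-output", pmml_output]
    return command

command = build_converter_command(
    "pmml-xgboost-example/target/pmml-xgboost-example-executable-1.8-SNAPSHOT.jar",
    "XGBoostAudit.model", "Adjusted", "XGBoostAudit.pmml", fmap_input = "Audit.fmap")
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check = True)`, which raises an exception if the converter exits with a non-zero status.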

Getting help:

java -jar pmml-xgboost-example/target/pmml-xgboost-example-executable-1.8-SNAPSHOT.jar --help

Documentation

License

JPMML-XGBoost is licensed under the terms and conditions of the GNU Affero General Public License, Version 3.0.

If you would like to use JPMML-XGBoost in a proprietary software project, then it is possible to enter into a licensing agreement which makes JPMML-XGBoost available under the terms and conditions of the BSD 3-Clause License instead.

Additional information

JPMML-XGBoost is developed and maintained by Openscoring Ltd, Estonia.

Interested in using Java PMML API software in your company? Please contact info@openscoring.io