ppscore - a Python implementation of the Predictive Power Score (PPS)

From the makers of bamboolib - a GUI for pandas DataFrames

If you don't know yet what the Predictive Power Score is, please read the following blog post:

RIP correlation. Introducing the Predictive Power Score

The PPS is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two columns. The score ranges from 0 (no predictive power) to 1 (perfect predictive power). It can be used as an alternative to the correlation (matrix).

Installation

You need Python 3.6 or above.

From the terminal (or Anaconda prompt in Windows), enter:

pip install -U ppscore

Getting started

The examples refer to the newest version (1.2.0) of ppscore. See the changelog for what changed.

First, let's create some data:

import pandas as pd
import numpy as np
import ppscore as pps

df = pd.DataFrame()
df["x"] = np.random.uniform(-2, 2, 1_000_000)
df["error"] = np.random.uniform(-0.5, 0.5, 1_000_000)
df["y"] = df["x"] * df["x"] + df["error"]

Based on the dataframe we can calculate the PPS of x predicting y:

pps.score(df, "x", "y")

We can calculate the PPS of all the predictors in the dataframe against a target y:

pps.predictors(df, "y")

Here is how we can calculate the PPS matrix between all columns:

pps.matrix(df)

Visualization of the results

For the visualization of the results you can use seaborn or your favorite viz library.

Plotting the PPS predictors:

import seaborn as sns
predictors_df = pps.predictors(df, y="y")
sns.barplot(data=predictors_df, x="x", y="ppscore")

Plotting the PPS matrix:

(This needs some minor preprocessing because seaborn.heatmap unfortunately does not accept tidy data)

import seaborn as sns
matrix_df = pps.matrix(df)[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')
sns.heatmap(matrix_df, vmin=0, vmax=1, cmap="Blues", linewidths=0.5, annot=True)
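To see what the pivot step does on its own, here is a minimal sketch with a hand-written tidy table. The column names match the ones used above, but the scores are made-up placeholder values, not real ppscore output:

```python
import pandas as pd

# Hypothetical tidy table: one row per (x, y) pair, with a placeholder score.
tidy = pd.DataFrame(
    {
        "x": ["a", "a", "b", "b"],
        "y": ["a", "b", "a", "b"],
        "ppscore": [1.0, 0.3, 0.0, 1.0],
    }
)

# Wide layout: one row per target (y), one column per feature (x).
wide = tidy.pivot(columns="x", index="y", values="ppscore")
print(wide)
```

The resulting wide table has the 2D layout that seaborn.heatmap expects, with targets on the rows and features on the columns.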

API

ppscore.score(df, x, y, sample=5_000, cross_validation=4, random_seed=123, invalid_score=0, catch_errors=True)

Calculate the Predictive Power Score (PPS) for "x predicts y"

Parameters

Returns

ppscore.predictors(df, y, output="df", sorted=True, **kwargs)

Calculate the Predictive Power Score (PPS) for all columns in the dataframe against a target (y) column

Parameters

Returns

ppscore.matrix(df, output="df", sorted=False, **kwargs)

Calculate the Predictive Power Score (PPS) matrix for all columns in the dataframe

Parameters

Returns

Calculation of the PPS

If you are uncertain about some details, feel free to jump into the code to have a look at the exact implementation.

There are multiple ways to calculate the PPS. The ppscore package provides a sample implementation that is based on the following calculations:

Learning algorithm

As a learning algorithm, we currently use a Decision Tree because the Decision Tree has the following properties:

We differentiate the exact implementation based on the data type of the target column:

Please note that we prefer generally good performance across a wide variety of use cases over better performance in a few narrow use cases. If you have a proposal for a better or different learning algorithm, please open an issue.

However, please note why we actively decided against the following algorithms:

Data preprocessing

Even though the Decision Tree is a very flexible learning algorithm, we need to perform the following preprocessing steps if a column represents categorical values, meaning it has the pandas dtype object, category, string or boolean.

Choosing the prediction case

This logic was updated in version 1.0.0.

The choice of the case (classification or regression) influences the final PPS, so it is important that the correct case is chosen. The case is chosen based on the data types of the columns. For example, to change the case from regression to classification, you have to change the data type from float to string.
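The dtype change described above can be sketched with plain pandas. The data is made up, and the sketch assumes that y is the target column whose dtype should be changed. It only shows the conversion and does not call ppscore:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical data: y holds floats, so it would be read as a numeric column.
df = pd.DataFrame({"x": [0.1, 0.4, 0.9, 1.3], "y": [0.0, 1.0, 1.0, 0.0]})
y_is_numeric_before = is_numeric_dtype(df["y"])

# Convert the float column to strings so that it is no longer numeric.
df["y"] = df["y"].astype(str)
y_is_numeric_after = is_numeric_dtype(df["y"])

print(y_is_numeric_before, y_is_numeric_after)
```

After the conversion the values are strings such as "0.0" and "1.0", so the column represents class labels instead of numbers.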

Here are the two main cases:

Cases and their score metrics

Each case uses a different evaluation score for calculating the final predictive power score (PPS).

Regression

In the case of a regression, ppscore uses the mean absolute error (MAE) as the underlying evaluation metric (MAE_model). The best possible MAE is 0, and higher is worse. As a baseline score, we calculate the MAE of a naive model (MAE_naive) that always predicts the median of the target column. The PPS is the result of the following normalization (and is never smaller than 0):

PPS = 1 - (MAE_model / MAE_naive)
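As an illustration of this normalization, here is a minimal pure-Python sketch. The function names are hypothetical and not part of ppscore. The sketch also simplifies by scoring given predictions directly, without the decision tree, sampling, or cross-validation that the package uses:

```python
from statistics import median


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def regression_pps(y_true, y_pred):
    """Normalize the model MAE against a baseline that predicts the median."""
    naive_prediction = median(y_true)
    mae_naive = mean_absolute_error(y_true, [naive_prediction] * len(y_true))
    if mae_naive == 0:
        # A constant target makes the ratio undefined; this sketch refuses it.
        raise ValueError("target is constant, the normalization is undefined")
    mae_model = mean_absolute_error(y_true, y_pred)
    # Clip at 0 so a model worse than the baseline does not score negative.
    return max(0.0, 1 - mae_model / mae_naive)


y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.5, 2.0, 3.0, 4.0, 4.5]
print(regression_pps(y_true, y_pred))
```

Here MAE_naive is 1.2 (always predicting the median 3.0) and MAE_model is 0.2, so the score is 1 - 0.2 / 1.2, roughly 0.83.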

Classification

If the task is a classification, we compute the weighted F1 score (wF1) as the underlying evaluation metric (F1_model). The F1 score is the harmonic mean of the precision and recall; it reaches its best value at 1 and its worst at 0. Precision and recall contribute equally to the F1 score. The weighted F1 takes into account the precision and recall of all classes, weighted by their support. As a baseline score (F1_naive), we calculate the weighted F1 score for a model that always predicts the most common class of the target column (F1_most_common) and for a model that predicts random values (F1_random). F1_naive is set to the maximum of F1_most_common and F1_random. The PPS is the result of the following normalization (and is never smaller than 0):

PPS = (F1_model - F1_naive) / (1 - F1_naive)
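The classification normalization can be sketched in pure Python as well. The function names are hypothetical and not part of ppscore. As a simplifying assumption, F1_random is passed in as a number (defaulting to 0) rather than computed, because the text above does not spell out how the random predictions are drawn:

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / count
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        total += f1 * count
    return total / len(y_true)


def classification_pps(y_true, y_pred, f1_random=0.0):
    """Normalize the model's weighted F1 against the naive baseline."""
    most_common = Counter(y_true).most_common(1)[0][0]
    f1_most_common = weighted_f1(y_true, [most_common] * len(y_true))
    f1_naive = max(f1_most_common, f1_random)
    if f1_naive == 1:
        # A perfect baseline makes the ratio undefined; this sketch refuses it.
        raise ValueError("baseline is already perfect, normalization undefined")
    f1_model = weighted_f1(y_true, y_pred)
    # Clip at 0 so a model worse than the baseline does not score negative.
    return max(0.0, (f1_model - f1_naive) / (1 - f1_naive))


y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "a", "b", "c", "c"]
print(classification_pps(y_true, y_pred))
```

In this example F1_model is 5/6 and F1_most_common is 1/3 (always predicting "a"), so the score is (5/6 - 1/3) / (1 - 1/3) = 0.75.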

Special cases

There are various cases in which the PPS can be defined without fitting a model to save computation time or in which the PPS cannot be calculated at all. Those cases are described below.

Valid scores

In the following cases, the PPS is defined but we can save ourselves the computation time:

Invalid scores and other errors

In the following cases, the PPS is not defined and the score is set to invalid_score:

Citing ppscore

DOI

About

ppscore is developed by 8080 Labs - we create tools for Python Data Scientists. If you like ppscore, you might want to check out our other project, bamboolib - a GUI for pandas DataFrames.