Awesome

<img width="400" style="width: 50% !important; max-width: 400px;" src="assets/getml_logo_dark.png#gh-dark-mode-only" /> <img width="400" style="width: 50% !important; max-width: 400px;" src="assets/getml_logo.png#gh-light-mode-only" /> getML combines feature learning with AutoML to build end-to-end prediction pipelines <a href="https://getml.com/latest/contact" target="_blank"> <img src="https://img.shields.io/badge/schedule-a_meeting-blueviolet.svg" /></a> <a href="mailto:hello@getml.com" target="_blank"> <img src="https://img.shields.io/badge/contact-us_by_mail-orange.svg" /></a>

Introduction

This repository contains Jupyter notebooks that demonstrate the capabilities of getML for machine learning on relational datasets in various domains. getML's feature engineering algorithms (FastProp, Multirel, Relboost, RelMT), its predictors (LinearRegression, LogisticRegression, XGBoostClassifier, XGBoostRegressor) and its hyperparameter optimizers (RandomSearch, LatinHypercubeSearch, GaussianHyperparameterSearch) are benchmarked against competing tools in similar categories, such as featuretools, tsfresh and prophet. While FastProp usually outperforms the competition in terms of runtime and resource requirements, the more sophisticated algorithms (Multirel, Relboost, RelMT), which are part of the Enterprise edition, often achieve even higher accuracy while keeping resource requirements low. The demonstrations use publicly available datasets that are commonly used for such comparisons.

Table of Contents

- Introduction
- Table of Contents
- Usage
  - Reading Online
  - Experimenting Locally
    - Using Docker
    - On the Machine (Linux/x64 & arm64)
- Notebooks

Usage

The provided notebooks can be read and used in different ways.

Reading Online

As GitHub renders the notebooks, each of them can be viewed by simply opening it and scrolling through. For convenience, the output of each cell's execution is included.

Experimenting Locally

To experiment with the notebooks, such as playing with different pipelines and predictors, it is best to run them on a local machine. Linux users with an x64 architecture can choose from one of the options provided below. Soon, we will offer a simple, container-based solution compatible with all major systems (Windows, Mac) and will also support ARM-based architectures.

Using Docker

A docker-compose.yml and a Dockerfile are provided for easy usage.

Simply clone this repository and run the docker command to start the notebooks service. The image it depends on will be built if it is not already available.

$ git clone https://github.com/getml/getml-demo.git  
$ cd getml-demo  
$ docker compose up notebooks

To open Jupyter Lab in the browser, look for the following lines in the output and copy-paste the URL into your browser:

Or copy and paste one of these URLs:

http://localhost:8888/lab?token=<generated_token>
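If the startup log is long, the URL can also be picked out programmatically. The following is a minimal sketch, assuming the log line has the form shown above. The helper name `find_jupyter_url` and the sample log text are hypothetical and serve only as illustration.

```python
import re
from typing import Optional

# Matches URLs of the form shown above: http://localhost:8888/lab?token=<token>
# Assumption: the token is a run of non-whitespace characters.
_URL_PATTERN = re.compile(r"http://localhost:8888/lab\?token=\S+")


def find_jupyter_url(log_text: str) -> Optional[str]:
    """Return the first Jupyter Lab URL found in the log text, or None."""
    match = _URL_PATTERN.search(log_text)
    return match.group(0) if match else None


if __name__ == "__main__":
    # Hypothetical log excerpt; the token is a made-up placeholder.
    sample_log = (
        "Or copy and paste one of these URLs:\n"
        "    http://localhost:8888/lab?token=abc123\n"
    )
    print(find_jupyter_url(sample_log))
```

The log text could, for example, be the captured output of the `docker compose up notebooks` command.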

After the first getml.engine.launch(...) is executed and the Engine is started, the corresponding Monitor can be opened in the browser under

http://localhost:1709/#/token/token
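If the Monitor page does not load, a quick check is whether anything is listening on that port yet. The sketch below uses only the Python standard library; the function name `is_port_open` is hypothetical, and it assumes the Monitor port 1709 from the URL above is reachable from the machine running the check.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not resolvable.
        return False


if __name__ == "__main__":
    # 1709 is the Monitor port taken from the URL above.
    if is_port_open("localhost", 1709):
        print("Monitor port is reachable")
    else:
        print("Monitor port is not reachable yet; has the Engine been launched?")
```

A successful connection only shows that some service is listening on the port; it does not confirm that the service is the getML Monitor.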

[!NOTE]
Using alternatives to Docker Desktop like

- Podman,
- Podman Desktop or
- Rancher Desktop with a container engine like dockerd (moby) or containerd (nerdctl)

allows bind-mounting the notebooks in a user-writable way (this might require adding userns_mode: keep-id) instead of having to COPY them in. In combination with volume-binding /home/user/.getML/logs and /home/user/.getML/projects, runs and changes can be persisted across containers.

On the Machine (Linux/x64 & arm64)

Alternatively, getML and the notebooks can be run natively on a local Linux machine. This requires Python with some Python libraries, Jupyter-Lab and the getML Engine. The getML Python library provides an Engine version without Enterprise features. To replicate the Enterprise functionality used in the notebooks, you may obtain an Enterprise trial version.

The following commands set up a Python environment with the necessary Python libraries, getML and Jupyter-Lab:

$ git clone https://github.com/getml/getml-demo.git  
$ cd getml-demo  
$ pipx install hatch
$ hatch env create
$ hatch shell
$ pip install -r requirements.txt
$ jupyter-lab

[!TIP]
Install the Enterprise trial version via the Install getML on Linux guide to try the Enterprise features.

With the jupyter-lab command, Jupyter-Lab should automatically open in the browser. If not, look for the following lines in the output and copy-paste the URL into your browser:

Or copy and paste one of these URLs:

http://localhost:8888/lab?token=<generated_token>

After the first getml.engine.launch(...) is executed and the Engine is started, the corresponding Monitor can be opened in the browser under

http://localhost:1709/#/token/token
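If a notebook fails right at the start, it can help to confirm that the pieces mentioned in this section are visible from the active Python environment. The following is a minimal sketch using only the standard library; the function name `check_prerequisites` is hypothetical, and the check only tests whether the `getml` package and the `jupyter-lab` command can be found, not whether they work.

```python
import importlib.util
import shutil
from typing import Dict


def check_prerequisites() -> Dict[str, bool]:
    """Report whether the getml package and the jupyter-lab command are found."""
    return {
        # True if `import getml` would find a package in this environment.
        "getml": importlib.util.find_spec("getml") is not None,
        # True if a `jupyter-lab` executable is on the PATH.
        "jupyter-lab": shutil.which("jupyter-lab") is not None,
    }


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

Run it inside the environment opened by `hatch shell`, since a check from another interpreter may report different results.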

Notebooks

This repository contains various demonstration projects to help you get started with relational learning and getML. They cover different aspects of the software and can serve as documentation or as blueprints for your own projects.

Each project solves a typical data science problem in a specific domain. You can choose a project either by domain or by the underlying machine learning problem, e.g. binary classification on a time series or regression using a relational data schema involving many tables.

Overview

| Notebook | Task | Data | Size | Domain |
| --- | --- | --- | --- | --- |
| AdventureWorks: Predicting customer churn | Classification | Relational | 71 tables, 233 MB | Commerce |
| Air pollution prediction | Regression | Multivariate time series | 1 table, 41k rows | Environment |
| Disease lethality prediction | Classification | Relational | 3 tables, 22 MB | Health |
| Baseball (Lahman): Predicting salaries | Regression | Relational | 25 tables, 74 MB | Sports |
| Expenditure categorization | Classification | Relational | 3 tables, 150 MB | E-commerce |
| CORA: Categorizing academic studies | Classification | Relational | 3 tables, 4.6 MB | Academia |
| Traffic volume prediction (LA) | Regression | Multivariate time series | 1 table, 47k rows | Transportation |
| Formula 1 (ErgastF1): Predicting the winner | Classification | Relational | 13 tables, 56 MB | Sports |
| IMDb: Predicting actors' gender | Classification | Relational with text | 7 tables, 477.1 MB | Entertainment |
| Traffic volume prediction (I94) | Regression | Multivariate time series | 1 table, 24k rows | Transportation |
| Financial: Loan default prediction | Classification | Relational | 8 tables, 60 MB | Financial |
| MovieLens: Predicting users' gender | Classification | Relational | 7 tables, 20 MB | Entertainment |
| Occupancy detection | Classification | Multivariate time series | 1 table, 32k rows | Energy |
| Order cancellation | Classification | Relational | 1 table, 398k rows | E-commerce |
| Predicting a force vector from sensor data | Regression | Multivariate time series | 1 table, 15k rows | Robotics |
| Seznam: Predicting the transaction volume | Regression | Relational | 4 tables, 147 MB | E-commerce |
| SFScores: Predicting health check scores | Regression | Relational | 3 tables, 9 MB | Restaurants |
| Stats: Predicting users' reputation | Regression | Relational | 8 tables, 658 MB | Internet |
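Choosing a project by task, data type or domain amounts to filtering the overview table. The sketch below illustrates this with a few rows copied from the table; the `Project` type and the `select` helper are hypothetical names introduced only for this example and are not part of getML.

```python
from typing import List, NamedTuple, Optional


class Project(NamedTuple):
    name: str
    task: str
    data: str
    domain: str


# A few rows copied from the overview table (not the full list).
PROJECTS = [
    Project("Air pollution prediction", "Regression", "Multivariate time series", "Environment"),
    Project("Disease lethality prediction", "Classification", "Relational", "Health"),
    Project("Baseball (Lahman): Predicting salaries", "Regression", "Relational", "Sports"),
    Project("Formula 1 (ErgastF1): Predicting the winner", "Classification", "Relational", "Sports"),
    Project("Occupancy detection", "Classification", "Multivariate time series", "Energy"),
]


def select(
    projects: List[Project],
    task: Optional[str] = None,
    data: Optional[str] = None,
    domain: Optional[str] = None,
) -> List[Project]:
    """Keep the projects that match every criterion that is not None."""
    return [
        p
        for p in projects
        if (task is None or p.task == task)
        and (data is None or p.data == data)
        and (domain is None or p.domain == domain)
    ]


if __name__ == "__main__":
    # Example: classification on a time series.
    for project in select(PROJECTS, task="Classification", data="Multivariate time series"):
        print(project.name)
```

Criteria left as `None` are ignored, so the same helper covers selection by domain alone, by task alone, or by any combination.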

Descriptions

<details> <summary>Adventure Works - Predicting customer churn</summary>

In this notebook, we demonstrate how getML can be used for a customer churn project using a synthetic dataset of a fictional company. We also benchmark getML against featuretools.

AdventureWorks is a fictional company that sells bicycles. Microsoft uses it to showcase how MS SQL Server can be used to manage business data. Since the dataset resembles a real-world customer database and is open-source, we use it to showcase how getML can be used for a classic customer churn project. (Real customer databases are not easily available for showcasing and benchmarking, for reasons of data privacy.)

Prediction type: Classification model
Domain: Customer loyalty
Prediction target: churn
Population size: 19704

</details>

Benchmarks

| Notebook | Benchmarks | Metric | getML | Other |
| --- | --- | --- | --- | --- |
| AdventureWorks: Predicting customer churn | featuretools | AUC | 97.8% | featuretools 96.8% |
| Air pollution prediction | featuretools, tsfresh | R-squared | 61.0% | next best 53.7% |
| Baseball (Lahman): Predicting salaries | featuretools | R-squared | 83.7% | featuretools 78.0% |
| CORA: Categorizing academic studies | Academic literature: RelF, LBP, EPRN, PRN, ACORA | Accuracy | 89.9% | next best 85.7% |
| Traffic volume prediction (LA) | Prophet (fbprophet), tsfresh | R-squared | 76% | next best 67% |
| Formula 1 (ErgastF1): Predicting the winner | featuretools | AUC | 92.6% | featuretools 92.0% |
| IMDb: Predicting actors' gender | Academic literature: RDN, Wordification, RPT | AUC | 91.34% | next best 86% |
| Traffic volume prediction (I94) | Prophet (fbprophet) | R-squared | 98.1% | prophet 83.3% |
| MovieLens: Predicting users' gender | Academic literature: PRM, MBN | Accuracy | 81.6% | next best 69% |
| Occupancy detection | Academic literature: Neural networks | AUC | 99.8% | next best 99.6% |
| Seznam: Predicting the transaction volume | featuretools | R-squared | 78.2% | featuretools 63.2% |
| SFScores: Predicting health check scores | featuretools | R-squared | 29.1% | featuretools 26.5% |
| Stats: Predicting users' reputation | featuretools | R-squared | 98.1% | featuretools 96.6% |
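To compare the rows at a glance, the gap between getML and the next best tool can be expressed in percentage points. The following is a small sketch of that arithmetic on three rows copied from the table; the helper name `margin_points` is hypothetical.

```python
from typing import List, Tuple

# (notebook, metric, getML score in %, best competing score in %),
# copied from three rows of the benchmark table.
RESULTS: List[Tuple[str, str, float, float]] = [
    ("AdventureWorks", "AUC", 97.8, 96.8),
    ("Air pollution", "R-squared", 61.0, 53.7),
    ("Seznam", "R-squared", 78.2, 63.2),
]


def margin_points(getml_score: float, other_score: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(getml_score - other_score, 1)


if __name__ == "__main__":
    for notebook, metric, getml_score, other_score in RESULTS:
        points = margin_points(getml_score, other_score)
        print(f"{notebook} ({metric}): +{points} percentage points")
```

Margins on different metrics (AUC, R-squared, accuracy) are not directly comparable with each other, so the numbers are best read row by row.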

FastProp Benchmarks

| Notebook | Faster vs. featuretools | Faster vs. tsfresh | Remarks |
| --- | --- | --- | --- |
| Air pollution | ~65x | ~33x | The predictive accuracy can be significantly improved by using RelMT instead of propositionalization approaches, please refer to this notebook. |
| Dodgers | ~42x | ~75x | The predictive accuracy can be significantly improved by using the mapping preprocessor and/or more advanced feature learning algorithms, please refer to this notebook. |
| Interstate94 | ~55x | | |
| Occupancy | ~87x | ~41x | |
| Robot | ~162x | ~77x | |

Further Benchmarks in the Relational Dataset Repository

| Notebook | Official page |
| --- | --- |
| AdventureWorks: Predicting customer churn | AdventureWorks |
| Baseball (Lahman): Predicting salaries | Lahman |
| CORA: Categorizing academic studies | CORA |
| Financial: Loan default prediction | Financial |
| Formula 1 (ErgastF1): Predicting the winner | ErgastF1 |
| IMDb: Predicting actors' gender | IMDb |
| MovieLens: Predicting users' gender | MovieLens |
| Seznam: Predicting the transaction volume | Seznam |
| SFScores: Predicting health check scores | SFScores |
| Stats: Predicting users' reputation | Stats |
