Rubix ML - Titanic - Machine Learning from Disaster
Content
- Installation
- Requirements
- Recommended
- Tutorial
- Introduction
- Extracting the training Data
- Preprocessing the training Data
- Saving transformers
- Model training
- Saving the estimator
- Extracting the test Data
- Loading transformers
- Preprocessing the test Data
- Loading estimator
- Making predictions
- Saving predictions
- Conclusion
An example Rubix ML project that predicts which passengers survived the Titanic shipwreck using a Random Forest classifier and a very famous dataset from a Kaggle competition. In this tutorial, you'll learn about classification and advanced preprocessing techniques. By the end of the tutorial, you'll be able to submit your own predictions to the Kaggle competition.
- Difficulty: Medium
- Training time: Minutes
From Kaggle:
This is the legendary Titanic ML competition – the best, first challenge for you to dive into ML competitions
The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.
In this competition, you’ll gain access to two similar datasets that include passenger information like name, age, gender, socio-economic class, etc. One dataset is titled train.csv and the other is titled test.csv.
Train.csv will contain the details of a subset of the passengers on board (891 to be exact) and importantly, will reveal whether they survived or not, also known as the “ground truth”.
The test.csv dataset contains similar information but does not disclose the “ground truth” for each passenger. It’s your job to predict these outcomes.
Using the patterns you find in the train.csv data, predict whether the other 418 passengers on board (found in test.csv) survived.
Installation
Clone the project locally using Composer:
$ composer create-project jenutka/titanic_php
Requirements
- PHP 7.4 or above
Recommended
- 1G of system memory or more
Tutorial
Introduction
Kaggle is a platform that allows you to test your data science skills by engaging with contests. This is the legendary Titanic ML competition – the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works. The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck. The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).
We'll choose Random Forest as our learner since it offers good performance and is capable of handling both categorical and continuous features.
Note: The source code for this example can be found in the train.php and predict.php files in the project root.
Script description
The script is separated into two parts:
- train.php extracts the training data from CSV, transforms the features, and trains and saves the predictive model
- predict.php loads the trained model, makes predictions on the unlabeled dataset, and exports them
The training data are given to us in train.csv, which contains both the features and the labels for training the model. We train the model on the whole dataset because our testing data in test.csv are unlabeled, so in this case we can only validate our predictions by submitting them to the Kaggle competition.
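Although we cannot score against test.csv ourselves, we can still estimate the model's performance locally by holding out part of the labeled data. The following is a minimal, optional sketch that is not part of the original scripts, assuming $dataset and $estimator are the Labeled dataset and RandomForest built later in this tutorial:
use Rubix\ML\CrossValidation\Metrics\Accuracy;

// Hold out 20% of the labeled data; the stratified split keeps
// the class proportions equal in both subsets.
[$training, $testing] = $dataset->stratifiedSplit(0.8);

$estimator->train($training);

$predictions = $estimator->predict($testing);

// Accuracy is the fraction of correctly predicted labels.
echo (new Accuracy())->score($predictions, $testing->labels());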
Extracting the training Data
Each feature is defined by a column in train.csv. For our purposes we choose only the features with the most informative value for our model; these are both continuous and categorical. To extract them from train.csv into a dataset object we use ColumnPicker. We name our target (label) feature, Survived, as the last extracted column.
use Rubix\ML\Extractors\CSV;
use Rubix\ML\Extractors\ColumnPicker;

$extractor = new ColumnPicker(new CSV('train.csv', true), [
    'Pclass', 'Age', 'Fare', 'SibSp', 'Parch', 'Sex', 'Embarked', 'Survived',
]);
Preprocessing the training Data
Since the *.csv files contain missing values, we need to preprocess them for use with the MissingDataImputer. For this purpose we use a LambdaFunction, to which we pass the mapping function $toPlaceholder.
use Rubix\ML\Transformers\LambdaFunction;
$toPlaceholder = function (&$sample, $offset, $types) {
    foreach ($sample as $column => &$value) {
        // Missing values appear as empty strings in the CSV. We check for ''
        // explicitly rather than with empty(), which would also treat
        // legitimate 0 values (e.g. in SibSp, Parch, Fare) as missing.
        if ($value === '' && $types[$column]->isContinuous()) {
            $value = NAN;
        } else if ($value === '' && $types[$column]->isCategorical()) {
            $value = '?';
        }
    }
};
The target values in train.csv are 0 and 1. Our model would otherwise handle them as floating point numbers, so we map them to the categorical labels Dead and Survived.
$transformLabel = function ($label) {
    return $label == 0 ? 'Dead' : 'Survived';
};
For numerical variables we transform the data with MinMaxNormalizer. For categorical variables we use OneHotEncoder. We instantiate new objects for these two transformers and for MissingDataImputer.
use Rubix\ML\Transformers\MinMaxNormalizer;
use Rubix\ML\Transformers\OneHotEncoder;
use Rubix\ML\Transformers\MissingDataImputer;

$minMaxNormalizer = new MinMaxNormalizer();
$oneHotEncoder = new OneHotEncoder();
$imputer = new MissingDataImputer();
Finally, we create the Labeled dataset and apply our preprocessing transformers to it.
use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Transformers\NumericStringConverter;

$dataset = Labeled::fromIterator($extractor)
    ->apply(new NumericStringConverter())
    ->transformLabels($transformLabel);

$dataset->apply(new LambdaFunction($toPlaceholder, $dataset->types()))
    ->apply($imputer)
    ->apply($minMaxNormalizer)
    ->apply($oneHotEncoder);
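To sanity-check the preprocessing, we can optionally print a statistics report of the transformed dataset. This step is not part of the original script, but uses the standard Dataset API:
// Print per-column statistics of the transformed dataset.
echo $dataset->describe();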
Saving Transformers
Because we want to apply the same fitted preprocessing to the testing dataset test.csv, and the prediction step is handled by the separate script predict.php, we need to save our fitted transformers as serialized objects. For this purpose we create new Filesystem persisters using the RBX file format.
use Rubix\ML\Persisters\Filesystem;
use Rubix\ML\Serializers\RBX;

$serializer = new RBX();

$serializer->serialize($imputer)->saveTo(new Filesystem('imputer.rbx'));
$serializer->serialize($minMaxNormalizer)->saveTo(new Filesystem('minmax.rbx'));
$serializer->serialize($oneHotEncoder)->saveTo(new Filesystem('onehot.rbx'));
Model training
After we have prepared our data, we can train our predictive model. As the estimator we use RandomForest, an ensemble of ClassificationTrees that is well suited to our relatively small dataset.
use Rubix\ML\Classifiers\RandomForest;
use Rubix\ML\Classifiers\ClassificationTree;

// An ensemble of 500 trees with a maximum height of 10, each trained
// on a random 80% subset of the data, without class balancing.
$estimator = new RandomForest(new ClassificationTree(10), 500, 0.8, false);

$estimator->train($dataset);
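Because RandomForest ranks features, we can optionally inspect which columns the trained model found most informative. A short sketch, not part of the original script:
// Importance scores are indexed by feature column offset.
$importances = $estimator->featureImportances();

print_r($importances);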
Saving the estimator
Finally, we save our predictive model for use with the predict.php script. As with the transformers, we use a Filesystem object with the RBX file format. This time, however, instead of serializing the model directly we wrap it in a PersistentModel object. To protect against overwriting an existing model, we ask the user to confirm saving the newly trained model.
use Rubix\ML\PersistentModel;

if (strtolower(readline('Save this model? (y|[n]): ')) === 'y') {
    $estimator = new PersistentModel($estimator, new Filesystem('model.rbx'));

    $estimator->save();

    // Assumes a PSR-3 logger (e.g. Rubix ML's Screen logger) was
    // instantiated earlier in the script.
    $logger->info('Model saved as model.rbx');
}
Now we have finished our training script train.php, which we execute by calling it from the command line.
$ php train.php
Now we can move on to creating the prediction script predict.php.
Extracting the test Data
For the prediction step we need to extract our test data, which do not contain labels. During extraction we name the same features as for the training set, but we omit the target Survived.
use Rubix\ML\Extractors\CSV;
use Rubix\ML\Extractors\ColumnPicker;

$extractor = new ColumnPicker(new CSV('test.csv', true), [
    'Pclass', 'Age', 'Fare', 'SibSp', 'Parch', 'Sex', 'Embarked',
]);
Loading transformers
To transform our test dataset we need to apply the same transformations fitted on the training dataset, so we load and deserialize our previously saved transformers.
use Rubix\ML\Persisters\Filesystem;
use Rubix\ML\Serializers\RBX;

$persister_imputer = new Filesystem('imputer.rbx', true, new RBX());
$imputer = $persister_imputer->load()->deserializeWith(new RBX());

$persister_minMax = new Filesystem('minmax.rbx', true, new RBX());
$minMaxNormalizer = $persister_minMax->load()->deserializeWith(new RBX());

$persister_oneHot = new Filesystem('onehot.rbx', true, new RBX());
$oneHotEncoder = $persister_oneHot->load()->deserializeWith(new RBX());
Preprocessing the test Data
For the testing data we need to create a new Unlabeled dataset object from our $extractor. Since we have loaded our fitted transformers, we can apply them to this dataset object. As with the training data, we use the $toPlaceholder function to map missing values into placeholders that the MissingDataImputer can handle.
use Rubix\ML\Datasets\Unlabeled;
use Rubix\ML\Transformers\LambdaFunction;
use Rubix\ML\Transformers\NumericStringConverter;

$dataset = Unlabeled::fromIterator($extractor)
    ->apply(new NumericStringConverter());

// $toPlaceholder is the same mapping function defined in train.php.
$dataset->apply(new LambdaFunction($toPlaceholder, $dataset->types()))
    ->apply($imputer)
    ->apply($minMaxNormalizer)
    ->apply($oneHotEncoder);
Loading estimator
Now we can load our persisted RandomForest estimator into the script using the static load() method.
use Rubix\ML\PersistentModel;
use Rubix\ML\Persisters\Filesystem;
use Rubix\ML\Serializers\RBX;

$estimator = PersistentModel::load(new Filesystem('model.rbx'));
Making predictions
To make predictions on our unlabeled testing dataset, we call the predict() method on the loaded estimator and store the predicted classes in the $predictions variable.
$predictions = $estimator->predict($dataset);
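Since RandomForest is a probabilistic classifier and PersistentModel proxies its methods, we could alternatively ask for class probabilities instead of hard labels, for example to gauge the model's confidence. A hedged sketch:
// Each row maps a class label to its estimated probability,
// e.g. ['Dead' => 0.82, 'Survived' => 0.18].
$probabilities = $estimator->proba($dataset);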
Saving predictions
Now we need to put our stored predictions into the required format so we can submit them to the Kaggle competition.
First we map our labels back to 1 and 0. For this we create the function bin_mapper, which we pass as a parameter to the built-in PHP function array_map.
function bin_mapper($v)
{
    if ($v === 'Survived') {
        return '1';
    } else {
        return '0';
    }
}

$predictions_mapped = array_map('bin_mapper', $predictions);
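Since the project requires PHP 7.4 or above, the same mapping could also be written inline with an arrow function instead of a named one; this is a matter of style rather than behavior:
$predictions_mapped = array_map(fn ($v) => $v === 'Survived' ? '1' : '0', $predictions);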
Next we extract the PassengerId column from test.csv into the array $ids. We apply the array_unshift function to both columns to prepend their header names. Then we instantiate a CSV extractor for predictions.csv and finally export both columns of data into it with the array_transpose function.
use function Rubix\ML\array_transpose;

$extractor = new ColumnPicker(new CSV('test.csv', true), ['PassengerId']);

$ids = array_column(iterator_to_array($extractor), 'PassengerId');

array_unshift($ids, 'PassengerId');
array_unshift($predictions_mapped, 'Survived');

$extractor = new CSV('predictions.csv');

$extractor->export(array_transpose([$ids, $predictions_mapped]));
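The resulting predictions.csv should follow Kaggle's submission format: a header row followed by one row per test passenger (the Survived values below are purely illustrative):
PassengerId,Survived
892,0
893,1
894,0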
Now we can run our prediction script by calling it from the command line.
$ php predict.php
After successfully generating the file predictions.csv, we can submit it to the [Kaggle competition](https://www.kaggle.com/competitions/titanic) and view our result on the public leaderboard.
Conclusion
This tutorial describes the whole process of building a predictive machine learning model with the Rubix ML PHP library. We can take this example as a starting point for further improvements. For example, we could apply advanced feature engineering to extract more information for training the model. We could also examine the predictive model itself: tuning its hyperparameters or testing the performance of other classifiers (for example a support vector machine or a neural network).
As a next activity, we could deploy the predictive model to a web page or server, where a visitor, after filling in a form with the features, would learn their likely fate had they embarked on the Titanic.
This example can also serve as a template workflow that can be applied to other machine learning problems.
License
The code is licensed MIT and the tutorial is licensed CC BY-NC 4.0.