<p align="left"> <img width=15% src="https://dai.lids.mit.edu/wp-content/uploads/2018/06/Logo_DAI_highres.png" alt="DAI" /> <i>An open source project from Data to AI Lab at MIT.</i> </p> <p align="left"> <img width=20% src="https://dai.lids.mit.edu/wp-content/uploads/2019/03/GreenGuard.png" alt="Draco" /> </p> <p align="left"> AutoML for Time Series. </p>


<!-- [![Coverage Status](https://codecov.io/gh/sintel-dev/Draco/branch/master/graph/badge.svg)](https://codecov.io/gh/sintel-dev/Draco) -->

Draco

Overview

The Draco project is a collection of end-to-end solutions for machine learning problems commonly found in time series monitoring systems. Most tasks utilize sensor data emitted by these monitoring systems. We build on the foundational innovations developed for the automation of machine learning at the Data to AI Lab at MIT.

The salient aspects of this customized project are:

Resources

Install

Requirements

Draco has been developed and runs on Python 3.6, 3.7 and 3.8.

Although it is not strictly required, using a virtualenv is highly recommended in order to avoid interfering with other software installed on the system where you run Draco.

Download and Install

Draco can be installed locally using pip with the following command:

pip install draco-ml

This will pull and install the latest stable release from PyPI.

If you want to install from source or contribute to the project please read the Contributing Guide.

Data Format

The minimum input expected by the Draco system consists of the following two elements, which need to be passed as pandas.DataFrame objects:

Target Times

A table containing the specification of the problem that we are solving, which has three columns:

|    | turbine_id | cutoff_time         | target |
|----|------------|---------------------|--------|
| 0  | T1         | 2001-01-02 00:00:00 | 0      |
| 1  | T1         | 2001-01-03 00:00:00 | 1      |
| 2  | T2         | 2001-01-04 00:00:00 | 0      |
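
For reference, an equivalent target_times table can be built by hand with pandas (the values are just the ones shown above):

```python
import pandas as pd

# A minimal target_times table with the three required columns.
target_times = pd.DataFrame({
    'turbine_id': ['T1', 'T1', 'T2'],
    'cutoff_time': pd.to_datetime(['2001-01-02', '2001-01-03', '2001-01-04']),
    'target': [0, 1, 0],
})
```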

Readings

A table containing the signal data from the different sensors, with the following columns:

|    | turbine_id | signal_id | timestamp           | value |
|----|------------|-----------|---------------------|-------|
| 0  | T1         | S1        | 2001-01-01 00:00:00 | 1     |
| 1  | T1         | S1        | 2001-01-01 12:00:00 | 2     |
| 2  | T1         | S1        | 2001-01-02 00:00:00 | 3     |
| 3  | T1         | S1        | 2001-01-02 12:00:00 | 4     |
| 4  | T1         | S1        | 2001-01-03 00:00:00 | 5     |
| 5  | T1         | S1        | 2001-01-03 12:00:00 | 6     |
| 6  | T1         | S2        | 2001-01-01 00:00:00 | 7     |
| 7  | T1         | S2        | 2001-01-01 12:00:00 | 8     |
| 8  | T1         | S2        | 2001-01-02 00:00:00 | 9     |
| 9  | T1         | S2        | 2001-01-02 12:00:00 | 10    |
| 10 | T1         | S2        | 2001-01-03 00:00:00 | 11    |
| 11 | T1         | S2        | 2001-01-03 12:00:00 | 12    |
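
Likewise, a small readings table matching this schema can be assembled directly with pandas (illustrative values only):

```python
import pandas as pd

# A minimal readings table: one turbine, two signals, two timestamps each.
readings = pd.DataFrame({
    'turbine_id': ['T1', 'T1', 'T1', 'T1'],
    'signal_id': ['S1', 'S1', 'S2', 'S2'],
    'timestamp': pd.to_datetime([
        '2001-01-01 00:00:00', '2001-01-01 12:00:00',
        '2001-01-01 00:00:00', '2001-01-01 12:00:00',
    ]),
    'value': [1, 2, 7, 8],
})
```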

Turbines

Optionally, a third table can be added containing metadata about the turbines. The only requirement for this table is to have a turbine_id field, and it can have an arbitrary number of additional fields.

|    | turbine_id | manufacturer | ... | ... | ... |
|----|------------|--------------|-----|-----|-----|
| 0  | T1         | Siemens      | ... | ... | ... |
| 1  | T2         | Siemens      | ... | ... | ... |
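
If available, this metadata can be provided as one more pandas.DataFrame, for example:

```python
import pandas as pd

# An optional turbines metadata table: turbine_id is required,
# any additional columns are free-form metadata.
turbines = pd.DataFrame({
    'turbine_id': ['T1', 'T2'],
    'manufacturer': ['Siemens', 'Siemens'],
})
```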

CSV Format

Apart from the in-memory data format explained above, which is limited by the memory available on the system where it is run, Draco is also prepared to load and work with data stored as a collection of CSV files, drastically increasing the amount of data it can handle. Further details about this format can be found in the project documentation site.
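
As a rough illustration of the idea only, and not of Draco's own loaders (the file name below is hypothetical), plain pandas can already read such CSV files in chunks so that only the relevant rows are kept in memory:

```python
import pandas as pd

# Hypothetical file name; Draco's documented CSV loader handles whole
# collections of files for you. Reading in chunks keeps memory usage low.
chunks = pd.read_csv(
    'readings/T1.csv',
    parse_dates=['timestamp'],
    chunksize=100_000,
)

# Keep only the rows of interest from each chunk (here, signal S1).
readings_t1_s1 = pd.concat(
    (chunk[chunk['signal_id'] == 'S1'] for chunk in chunks),
    ignore_index=True,
)
```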

Quickstart

In this example we will load some demo data and classify it using a Draco Pipeline.

1. Load and split the demo data

The first step is to load the demo data.

For this, we will import and call the draco.demo.load_demo function without any arguments:

from draco.demo import load_demo

target_times, readings = load_demo()

The returned objects are the target_times and readings tables described in the Data Format section above, both as pandas.DataFrame objects.
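
A quick way to get familiar with them is to inspect both tables:

```python
# Both objects are regular pandas DataFrames, so the usual tools apply.
print(target_times.head())
print(readings.head())
print(target_times.shape, readings.shape)
```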

Once we have loaded the target_times, and before training any machine learning pipeline, we will split them into two partitions, one for training and one for testing.

In this case, we will split them using the train_test_split function from scikit-learn, but it can be done with any other suitable tool.

from sklearn.model_selection import train_test_split

train, test = train_test_split(target_times, test_size=0.25, random_state=0)

Notice how we are only splitting the target_times data and not the readings. This is because the pipelines will later take care of selecting the parts of the readings table needed for training, based on the information found inside the train and test inputs.

Additionally, if we want to calculate a goodness-of-fit score later on, we can separate the testing target values from the test table by popping them from it:

test_targets = test.pop('target')
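
After the pop, the test table keeps only the turbine_id and cutoff_time columns, while the targets are stored separately for scoring later:

```python
# Sanity check: the target column now lives only in test_targets.
print(test.columns.tolist())   # ['turbine_id', 'cutoff_time']
print(test_targets.head())
```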

2. Exploring the available Pipelines

Once we have the data ready, we need to find a suitable pipeline.

The list of available Draco Pipelines can be obtained using the draco.get_pipelines function.

from draco import get_pipelines

pipelines = get_pipelines()

The returned pipelines variable will be a list containing the names of all the pipelines available in the Draco system:

['lstm',
 'lstm_with_unstack',
 'double_lstm',
 'double_lstm_with_unstack']

For the rest of this tutorial, we will select and use the pipeline lstm_with_unstack as our template.

pipeline_name = 'lstm_with_unstack'

3. Fitting the Pipeline

Once we have loaded the data and selected the pipeline that we will use, we have to fit it.

For this, we will create an instance of a DracoPipeline object passing the name of the pipeline that we want to use:

from draco.pipeline import DracoPipeline

pipeline = DracoPipeline(pipeline_name)

And then we can directly fit it to our data by calling its fit method and passing in the training target_times and the complete readings table:

pipeline.fit(train, readings)

4. Make predictions

After fitting the pipeline, we are ready to make predictions on new data by calling the pipeline.predict method passing the testing target_times and, again, the complete readings table.

predictions = pipeline.predict(test, readings)

5. Evaluate the goodness-of-fit

Finally, after making predictions, we can evaluate how good the predictions are using any suitable metric.

from sklearn.metrics import f1_score

f1_score(test_targets, predictions)
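
Since test_targets and predictions are simply the true and predicted labels, any other scikit-learn metric can be computed in the same way, for example:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Additional views on the same predictions.
accuracy_score(test_targets, predictions)
confusion_matrix(test_targets, predictions)
```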

What's next?

For more details about Draco and all its possibilities and features, please check the project documentation site. Also, do not forget to have a look at the tutorials!