Data Modeler

Using machine learning, create generative models based on your data.

Installation

Add gem 'data_modeler' to your Gemfile and run $ bundle, or install it manually with $ gem install data_modeler.

If you're new to Ruby or Bundler, check these detailed installation instructions first.

Full documentation

I wish for my code to stay well documented. If you find the documentation lacking or outdated, please do let me know. You can find it here.

Getting started

Obtaining a working configuration on example data

Make a copy of /spec/example to play with. The config*.rb files are example configurations. Each configuration is written as a plain Ruby Hash, and the files can be executed directly (e.g. ruby config_01.rb) thanks to the few lines at the bottom.

The .csv files are examples of the format the data must be pre-processed into: a CSV table with a numeric time as the first column, followed by one column for each available time series. The data should be complete (i.e. no missing values) and already normalized (depending on the model of choice). The file prepare_demo_csv can help you get started on the task, as it was used to generate the demo CSV.
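
As a rough illustration of the requirements above, the following sketch checks that a table is complete and rescales each series into [0, 1]. The helper names complete? and normalize_columns, the sample values, and the choice of min-max scaling are all assumptions made for this example; they are not part of the gem, and the right normalization depends on your model.

```ruby
# Hypothetical helpers, not part of data_modeler.
# Each row is [time, series_1, series_2, ...].

# True when every row has the same width and every cell is a finite number.
def complete?(rows)
  width = rows.first.size
  rows.all? do |row|
    row.size == width && row.all? { |v| v.is_a?(Numeric) && v.to_f.finite? }
  end
end

# Min-max scale every column except the first (time) into [0, 1].
def normalize_columns(rows)
  time, *series = rows.transpose
  scaled = series.map do |col|
    min, max = col.minmax
    range = (max - min).to_f
    # A constant column has no range; map it to 0.0 rather than divide by zero.
    col.map { |v| range.zero? ? 0.0 : (v - min) / range }
  end
  [time, *scaled].transpose
end

# Made-up sample data.
rows = [
  [0, 10.0, 200.0],
  [1, 15.0, 400.0],
  [2, 20.0, 300.0]
]

raise ArgumentError, 'missing or non-numeric values' unless complete?(rows)

normalized = normalize_columns(rows)
normalized.each { |row| puts row.join(',') }
```

The time column is left untouched, since its unit matters for the time series settings described later.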

Start by just running one of the configurations, then play around with the configs and customize them to your taste. And off you go!

Understanding the results

Running a config file will create a folder holding the results; the path can be customized in the config file.
Note that DataModeler#id_from returns a numeric ID from the end of a string (e.g. a file name), so you cannot forget to update the output folder after creating a new config by copying an old one.
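
To illustrate the idea of deriving the output folder from the config file name, here is a minimal sketch. trailing_id is a hypothetical stand-in written for this example; it is not the gem's DataModeler#id_from and may differ from it in details such as the return type or how file extensions are handled. The results_<id> folder name is likewise an assumption.

```ruby
# Hypothetical stand-in for illustration; not the gem's implementation.
# Returns the digits that end a file name (ignoring its extension) as an
# Integer, or nil when the name does not end in digits.
def trailing_id(path)
  name = File.basename(path, '.*')
  digits = name[/\d+\z/]
  digits && digits.to_i
end

id = trailing_id('config_01.rb')
results_dir = "results_#{id}" # assumed naming; pick whatever suits you
puts results_dir
```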

Inside the results folder you will find a result file (CSV) for each run. They follow the naming convention tpredobs_<nrun>.csv as a reminder of their internal structure:

Loading this raw result data allows for easy calculation of residuals and statistics, and to plot your predictions against the ground truth.
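
As a sketch of such post-processing, the snippet below computes residuals, mean absolute error and root mean squared error using Ruby's standard CSV library. The column layout (time, prediction, observation, for a single target series and no header row) is an assumption suggested by the tpredobs name; check it against your own result files. The inline data is made up for the example.

```ruby
require 'csv'

# Made-up sample standing in for the contents of one result file.
# Assumed layout per row: time, prediction, observation.
raw = <<~DATA
  0,1.0,1.5
  1,2.0,2.0
  2,3.5,3.0
DATA

# The :numeric converter turns numeric-looking cells into Integer/Float.
rows = CSV.parse(raw, converters: :numeric)

# Residual = observation - prediction, one per row.
residuals = rows.map { |_time, pred, obs| obs - pred }

mae  = residuals.sum { |r| r.abs } / residuals.size
rmse = Math.sqrt(residuals.sum { |r| r**2 } / residuals.size)

puts "residuals: #{residuals.inspect}"
puts "MAE: #{mae.round(4)}  RMSE: #{rmse.round(4)}"
```

To work on a real file, CSV.read(path, converters: :numeric) can replace the CSV.parse call.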

Customizing your experiment

Outdated documentation is often worse than a lack of documentation. To understand all configuration options, consider the following:

This means that, to learn all available options, you should rely on an existing config file plus the documentation (or implementation) of the initialize method of your model of choice (it should be small).

Leveraging time series data

There are three settings under :tset in the config which may be cryptic: ninput_points, tspread and look_ahead. These names may change in the future, as I found it hard to name these three; please open an issue if I forget to update this section (or if you have suggestions).

If you don't work with time series, just set them to [1,0,0], use a line counter for time, and ignore the following. These three only make sense if the data is composed of aligned time series, with a numeric column time -- its unit will also be the unit for tspread and look_ahead.

The data needs to be indexed (i.e. no repetitions) and sorted by time. This implies that different data "lines" in the following explanation have different time values.
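
A quick way to verify this requirement before running an experiment is sketched below. strictly_increasing_time? is a hypothetical helper written for this example, not something the gem provides, and it assumes rows are arrays with the time value first.

```ruby
# Hypothetical check, not part of data_modeler.
# True when the time column is strictly increasing, which implies both
# "sorted by time" and "no repeated time values".
def strictly_increasing_time?(rows)
  rows.map(&:first).each_cons(2).all? { |a, b| a < b }
end

good = [[0, 0.1], [1, 0.2], [2, 0.3]]
dup  = [[0, 0.1], [1, 0.2], [1, 0.3]] # repeated time value
back = [[0, 0.1], [2, 0.2], [1, 0.3]] # out of order

puts strictly_increasing_time?(good)
puts strictly_increasing_time?(dup)
puts strictly_increasing_time?(back)
```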

Example configurations:

Important: from each line, only the data coming from the listed input time series is considered for input, while the target time series list is used to construct the output.

Example inputs and targets, considering t0 the "current" time for a given iteration:
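
One possible reading of the three settings can be sketched in code. The interpretation is an assumption made for this example: inputs are taken at t0, t0 - tspread, and so on for ninput_points points, and the target is taken at t0 + look_ahead. window_times is a hypothetical helper, not the gem's API; refer to the gem's documentation for the authoritative behavior.

```ruby
# Hypothetical helper illustrating one reading of the :tset settings.
# Assumption: inputs are sampled at t0, t0 - tspread, t0 - 2*tspread, ...
# (ninput_points of them, returned oldest first) and the target is
# sampled at t0 + look_ahead.
def window_times(t0, ninput_points:, tspread:, look_ahead:)
  inputs = (0...ninput_points).map { |k| t0 - k * tspread }.reverse
  { inputs: inputs, target: t0 + look_ahead }
end

# Non-time-series case from the text: [1, 0, 0].
# Input and target both come from the line at t0.
p window_times(10, ninput_points: 1, tspread: 0, look_ahead: 0)

# Three input points spaced 2 time units apart, predicting 5 units ahead.
p window_times(10, ninput_points: 3, tspread: 2, look_ahead: 5)
```

Under this reading, [1,0,0] degenerates to plain line-by-line modeling, which matches the advice above for data that is not a time series.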

Contributing

Suggestions / requests

Feel free to open new issues. I mean it. We can work together from there.

Adding new models

This system has a plug-in architecture by design. To add your own models, you just need to create a new wrapper in lib/data_modeler/model:

Ideally, a DataModeler::Model should be a wrapper around an external independent functionality: keep it as compact as possible. To implement the interface you can use BDD on the spec, which verifies both the availability of the interface and basic modeling capabilities.

Remember to update lib/data_modeler.rb to load your file, and add an option to select it in lib/data_modeler/model/selector.rb.

THEN: please do propose a pull request! Share your work with the community!
Even if you think it's not polished enough: I'll help out before accepting.

License

The gem is available as open source under the terms of the MIT License.

Notes

This build specifically leverages time series. Further work on data preparation will be released as a separate project.