data-formatter

Takes in a .csv file and formats it for machine learning with scikit-learn or machineJS.

What it does

data-formatter is designed to take care of the chores of machine learning to let you focus on the fun stuff!

How to label each column:

Each column must have a label. Your options are few; this is designed to be easy!

Column label options:

  1. ID
  2. Output Regression
  3. Output Category (this is the one to use for numer.ai's early competitions)
  4. Output Multi-Category
  5. Categorical
  6. Continuous
  7. Ignore
  8. Validation Split (used for the validation column from numer.ai, rarely used otherwise)
  9. NLP (please remove all commas and quotes from this column)

For more detailed info, see below.
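As a rough illustration of the label options, the sketch below checks a dataDescription row for labels that are not on the list. The helper `findUnknownLabels` is hypothetical and not part of data-formatter, and treating labels as case-insensitive is an assumption.

```javascript
// Labels copied from the list above; treating them as case-insensitive is an assumption.
const COLUMN_LABELS = [
  'ID', 'Output Regression', 'Output Category', 'Output Multi-Category',
  'Categorical', 'Continuous', 'Ignore', 'Validation Split', 'NLP'
];

// Hypothetical helper (not part of data-formatter): returns the labels
// in a dataDescription row that do not match any documented option.
function findUnknownLabels(descriptionRow) {
  const allowed = new Set(COLUMN_LABELS.map((label) => label.toLowerCase()));
  return descriptionRow.filter((label) => !allowed.has(label.trim().toLowerCase()));
}

console.log(findUnknownLabels(['ID', 'Continuous', 'Categorial', 'Output Category']));
// prints [ 'Categorial' ]
```

A check like this can catch a misspelled label before the file is handed to the formatter.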

Format of Input File:

  1. .csv file
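  2. The first row holds information describing each column. Specifically, it must specify a label for every column, chosen from the column label options listed above.

Any column that is not an ID, output, or other special-purpose column must be labeled as holding either Categorical or Continuous data.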

  1. Next row (the second row) must be a header row containing the names of each column.
  2. Make sure there are no empty rows!
  3. Make sure each row has the same number of columns (even if those columns are blank, they must exist).
  4. Make sure any strings are formatted using UTF-8 (don't worry about this one unless you get an odd error message).
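The row rules above can be checked before formatting. The sketch below is a hypothetical pre-flight check, not part of data-formatter. It splits on commas naively, so it assumes no quoted fields that contain commas, and the sample data is made up.

```javascript
// Hypothetical pre-flight check (not part of data-formatter). It splits on
// commas naively, so it assumes no quoted fields containing commas.
function checkCsvShape(csvText) {
  const problems = [];
  const rows = csvText.replace(/\r\n/g, '\n').replace(/\n$/, '').split('\n');
  const expected = rows[0].split(',').length;
  rows.forEach((row, index) => {
    if (row.trim() === '') {
      problems.push('row ' + (index + 1) + ' is empty');
    } else if (row.split(',').length !== expected) {
      problems.push('row ' + (index + 1) + ' has ' + row.split(',').length +
        ' columns, expected ' + expected);
    }
  });
  return problems;
}

const sample = [
  'ID,Continuous,Categorical,Output Category', // dataDescription row
  'id,age,city,churned',                       // header row
  '1,34,Lisbon,0',
  '2,,Porto,1',                                // blank cell still counts as a column
  '3,51,Faro'                                  // one column short
].join('\n');

console.log(checkCsvShape(sample));
// prints [ 'row 5 has 3 columns, expected 4' ]
```

Note that the row with a blank cell passes, since the column still exists; only the short row is reported.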

Installation:

This comes pre-bundled with machineJS. To use it for other projects:

To include as a dependency for a specific repo:

npm install data-formatter

To use from the command line anywhere in your system:

npm install -g data-formatter

Some key benefits:

Novice?

Does some of that make your head spin? Have no idea what one (or more) of those bullet points means? No worries, that's the entire point of letting a library do this work for you!

Expert?

Did any of the above get your heart racing and make you want to dive in and customize it for your own project or Kaggle competition? Awesome, follow along with mainPythonProcess.py and customize to your heart's content, while still having a structure in place to automate the process for you!

Format of Output Files:

The formatted data will be broken out into a number of different files, to be compatible with scikit-learn's API:

How to Use Outside of machineJS:

Again, this is baked into machineJS, but if you're using it in a different project:

  1. Add a dataDescription row to the top of your training data (see Format of Input File above)

Within node.js code using require:

  1. Require the module: var df = require('data-formatter');
  2. Invoke with an object that has trainingData and testingData properties, and an optional callback:
df({
  trainingData: '/full/absolute/path/to/training/data.csv',
  testingData:  '/full/absolute/path/to/testing/data.csv'
}, callbackFunc);

The optional callback will be called once all data formatting has completed.
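Since the args object expects absolute paths, it can help to resolve them up front. The following is a minimal sketch: `buildArgs` is a hypothetical helper, the file names are placeholders, and the call into data-formatter is left commented out so the snippet runs without the module installed.

```javascript
const path = require('path');

// Hypothetical helper: turn relative paths into the absolute paths
// that the trainingData and testingData properties expect.
function buildArgs(trainingPath, testingPath, baseDir) {
  return {
    trainingData: path.resolve(baseDir, trainingPath),
    testingData: path.resolve(baseDir, testingPath)
  };
}

// Placeholder file names, resolved against the current working directory.
const args = buildArgs('data/train.csv', 'data/test.csv', process.cwd());
console.log(args);

// With data-formatter installed, the call would then look like:
// var df = require('data-formatter');
// df(args, function (results) { console.log(results); });
```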

From the command line

data-formatter relative/path/to/training/data.csv relative/path/to/testing/data.csv

Make sure that you have used the -g flag when installing using npm if you want to use data-formatter from the command line.

API Documentation:

The module takes an args object with the following properties:

trainingData

A full, absolute path to a .csv file. See above for more info on adding an additional dataDescription row to the .csv file itself above the header row.

testingData

The testing data. This file is assumed to have only a header row, not a dataDescription row. The columns must be in the same order as they are in the trainingData file, which is almost always the case anyway.

joinFileName [OPTIONAL]

A full, absolute path to a .csv file that you would like to join in with the testing and training datasets. This file must have both a dataDescription row and a header row. By default, it will be joined on any column name in the header row that is shared between the training/testing datasets and the join file.
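To make the default join behavior concrete, here is a small sketch of how shared header names could be found. It illustrates the idea only and is not the library's implementation; `sharedColumns` and the column names are hypothetical.

```javascript
// Hypothetical helper: header names present in both files, in the order
// they appear in the first header row.
function sharedColumns(dataHeader, joinHeader) {
  const joinNames = new Set(joinHeader);
  return dataHeader.filter((name) => joinNames.has(name));
}

console.log(sharedColumns(['id', 'storeId', 'sales'], ['storeId', 'region', 'sizeSqFt']));
// prints [ 'storeId' ]
```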

outputFolder [OPTIONAL]

If included, all formatted files will be written to this folder, which will be created if it does not already exist. DEFAULT: If a value is not passed in, this will default to creating a folder called data-formatterResults in whichever directory this library is invoked in. This is designed to make files easy to find if, say, you invoke this library from a directory where you are already working on a machine learning project.
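The default described above could be reproduced in your own code roughly as follows. This is a sketch under assumptions: `resolveOutputFolder` is a hypothetical helper, and it takes "the directory this library is invoked in" to mean the current working directory.

```javascript
const path = require('path');

// Hypothetical helper mirroring the documented default: use args.outputFolder
// when given, otherwise data-formatterResults under the invoking directory.
function resolveOutputFolder(args, invokedFrom) {
  if (args.outputFolder) {
    return path.resolve(invokedFrom, args.outputFolder);
  }
  return path.join(invokedFrom, 'data-formatterResults');
}

// Assumption: the invoking directory is the current working directory.
console.log(resolveOutputFolder({}, process.cwd()));
console.log(resolveOutputFolder({ outputFolder: 'formatted' }, process.cwd()));
```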

callback [OPTIONAL]

After the args object, you may pass in a callback function that will be invoked once formatting is done. If provided, the callback will be invoked with an object containing the file paths to all of the formatted data files created.
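If you prefer promises, the callback can be wrapped. The sketch below assumes the callback receives a single results object rather than an error-first pair, as described above. `formatAsync` is a hypothetical wrapper, and `fakeFormatter` is a stub standing in for the real module so the example runs on its own; the property name it returns is made up.

```javascript
// Hypothetical wrapper: `formatter` stands in for the function returned by
// require('data-formatter'). The callback is assumed to receive a single
// results object (no error-first argument), as described above.
function formatAsync(formatter, args) {
  return new Promise((resolve) => {
    formatter(args, (results) => resolve(results));
  });
}

// Stub used only to make this sketch runnable without the library installed;
// the property name below is made up for illustration.
function fakeFormatter(args, callback) {
  setTimeout(() => callback({ exampleFile: args.trainingData + '.formatted' }), 0);
}

formatAsync(fakeFormatter, { trainingData: '/tmp/train.csv', testingData: '/tmp/test.csv' })
  .then((results) => console.log(Object.keys(results)));
// prints [ 'exampleFile' ]
```

In real use, you would pass the required module in place of the stub.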

keepAllFeatures [OPTIONAL]

If you do not want to perform any feature selection, and want to keep all the features (both the ones in the original training data and the ones created by data-formatter), pass in true for this flag.

allFeatureCombinations [OPTIONAL]

This is still a beta feature. If you want to try adding all possible combinations of continuous features together, set this flag to true. Since it creates all possible combinations of all the continuous features, this can rapidly create a memory problem, and should only be used on small datasets, or if you have a ton of RAM.
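To get a feel for how quickly this grows, the sketch below counts combinations under the assumption that features are combined in pairs (the text above does not say how many features go into each combination). With n continuous features there are n(n - 1)/2 pairs.

```javascript
// Number of ways to choose 2 features out of n: n * (n - 1) / 2.
// Assumes pairwise combinations only; larger combinations grow even faster.
function pairCount(n) {
  return (n * (n - 1)) / 2;
}

[10, 100, 1000].forEach((n) => {
  console.log(n + ' continuous features -> ' + pairCount(n) + ' pairwise combinations');
});
// prints:
// 10 continuous features -> 45 pairwise combinations
// 100 continuous features -> 4950 pairwise combinations
// 1000 continuous features -> 499500 pairwise combinations
```

Each new combination is an extra column on every row, which is why memory use climbs so fast on wide datasets.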

Using from the command line

As of the 1.2 release, data-formatter can be invoked right from the command line.

Installation

npm install -g data-formatter

Note the "-g" flag directing npm to install the module globally. This makes it available from the command line in any directory on your system.

Invoking from the command line

data-formatter path/to/training/data.csv path/to/testing/data.csv

The formatted data files will be written into whichever directory you invoke data-formatter from.

Other Random Info

machineJS

If you find this library useful, you might want to check out machineJS, which helps reduce the drudge work of other parts of the machine learning process!

Contributing- yes please!

There are few things that make me as happy as reading through Pull Requests over a morning espresso :)

Starring- yes please!

I've had a great time building this out so far. If you find it useful too, let me know by starring it!