thunder-regression

algorithms for mass univariate regression

Mass univariate regression is the process of independently regressing multiple response variables against a single set of explanatory features. It is common in any domain in which a large number of response variables are measured. Because each fit is independent of the others, fitting large collections of such models can benefit significantly from parallelization.
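To make the definition concrete, here is a minimal numpy-only sketch (it does not use this package). It fits one ordinary least squares model per response variable in a loop, then checks that a single multi-response least squares call gives the same coefficients. The variable names and the simulated data are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_targets = 100, 3, 10

# shared design matrix, samples x features
X = rng.normal(size=(n_samples, n_features))
# simulated responses, targets x samples
true_betas = rng.normal(size=(n_targets, n_features))
Y = true_betas @ X.T + 0.1 * rng.normal(size=(n_targets, n_samples))

# one independent least squares fit per response variable
betas_loop = np.vstack([np.linalg.lstsq(X, y, rcond=None)[0] for y in Y])

# the same fits expressed as a single call with many right-hand sides
betas_batch = np.linalg.lstsq(X, Y.T, rcond=None)[0].T

print(betas_loop.shape)  # targets x features
```

Since no fit depends on any other, the loop can be split across workers in any order, which is what makes this kind of analysis easy to parallelize.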

This package provides a simple API for fitting these kinds of models. It includes a collection of algorithms for performing different types of mass regression, all following the scikit-learn style, and it also supports wrapping custom algorithms taken directly from scikit-learn. Fitting an algorithm to data returns a fitted model that contains the regression coefficients and allows for prediction and scoring on new data. Compatible with Python 2.7+ and 3.4+. Works well alongside thunder and supports parallelization via spark, but can also be used as a standalone module on local numpy arrays.

installation

pip install thunder-regression

example

In this example we'll create data and fit a collection of models

# generate data

from sklearn.datasets import make_regression
X, Y = make_regression(n_samples=100, n_features=3, n_informative=3, n_targets=10, noise=1.0)

# create and fit the model

from regression import LinearRegression
algorithm = LinearRegression(fit_intercept=False)
model = algorithm.fit(X, Y.T)

After fitting, model.betas is an array with the 3 coefficients for each of 10 response variables.

usage

Import and construct an algorithm

from regression import LinearRegression
algorithm = LinearRegression(fit_intercept=False)

Fit the algorithm to data in the form of a samples x features design matrix X and a targets x samples response matrix Y.

model = algorithm.fit(X, Y)

The results of the fit are accessible on the fitted model, and the model can be used to score new data

betas = model.betas
rsq = model.score(X, Y)

For all methods, X should be a local numpy array, and Y can be either a local numpy array, a bolt array, or a thunder Series object.

api

algorithm

All algorithms have the following methods:

algorithm.fit(X, Y)

Fit the algorithm to data

model

The result of fitting an algorithm is a model with the following properties and methods:

model.betas

Array of regression coefficients, dimensions targets x features. If an intercept was fit, it will be the first feature.

model.betas_and_scores

Array of regression coefficients, followed by prediction scores on the fitted data, dimensions targets x (features + 1). If an intercept was fit, it will be the first feature.

model.models

Array of individual fitted models, dimensions 1 x targets.

model.coef_

Array of coefficients, not including a possible intercept term, for consistency with scikit-learn.

model.intercept_

Array of intercepts, for consistency with scikit-learn. If no intercepts were fit, all will have values 0.0.
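The relationship between betas, coef_ and intercept_ can be illustrated with a small numpy-only sketch. It assumes the intercept is estimated by prepending a column of ones to X, which matches the "first feature" convention described above but is not confirmed to be how the package implements it. All names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                       # samples x features

true_coef = np.array([[1.0, -2.0, 0.5],
                      [0.0, 3.0, 1.0]])            # targets x features
true_intercept = np.array([4.0, -1.0])             # one intercept per target
Y = true_coef @ X.T + true_intercept[:, None]      # targets x samples, noiseless

# assumption: the intercept is fit as an extra leading column of ones
X1 = np.column_stack([np.ones(len(X)), X])
betas = np.linalg.lstsq(X1, Y.T, rcond=None)[0].T  # targets x (features + 1)

# split in the scikit-learn style
intercept_ = betas[:, 0]
coef_ = betas[:, 1:]
```

When no intercept is fit, there is no leading column to split off, so coef_ is the whole array and the intercepts are all 0.0.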

model.predict(X)

Predicts the response to new inputs.

model.score(X, Y)

Computes the goodness of fit (r-squared, unless otherwise stated) of the model for given data
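For reference, r-squared computed separately for each response variable can be sketched as follows. This is the standard coefficient of determination, written in plain numpy. The helper name r_squared is hypothetical, and the package may compute its scores differently internally.

```python
import numpy as np

def r_squared(Y, Y_hat):
    """Coefficient of determination for each row (target) of Y."""
    ss_res = np.sum((Y - Y_hat) ** 2, axis=1)
    ss_tot = np.sum((Y - Y.mean(axis=1, keepdims=True)) ** 2, axis=1)
    return 1.0 - ss_res / ss_tot

# two response variables, four samples each (targets x samples)
Y = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 0.0, 2.0, 0.0]])

# a perfect prediction scores 1 for every target
perfect = r_squared(Y, Y)

# predicting each target's mean scores 0 for every target
row_means = np.tile(Y.mean(axis=1, keepdims=True), (1, Y.shape[1]))
mean_only = r_squared(Y, row_means)
```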

model.predict_and_score(X, Y)

Simultaneously computes the results of predict(X) and score(X, Y)

list of algorithms

Here are all the algorithms currently available.

LinearRegression(fit_intercept=False, normalize=False)

Linear regression through ordinary least squares as implemented in scikit-learn's LinearRegression algorithm.

CustomRegression(algorithm)

Use a custom regression algorithm in a mass regression analysis. The provided algorithm should operate on single response variables, and must conform to the scikit-learn API as follows

This allows you to define an algorithm in scikit-learn and then wrap it for mass fitting, for example

from regression import CustomRegression
from sklearn.linear_model import LassoCV
algorithm = CustomRegression(LassoCV(normalize=True, fit_intercept=False))
model = algorithm.fit(X, Y)
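Conceptually, wrapping a single-response estimator for mass fitting amounts to fitting an independent copy of it to each row of Y. The sketch below shows that idea with numpy only. TinyRidge and fit_each are hypothetical names invented for this illustration, not part of this package or scikit-learn, and the package's own implementation may differ.

```python
import copy
import numpy as np

class TinyRidge(object):
    """Hypothetical single-response estimator with a scikit-learn-like interface."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        # closed-form ridge solution, no intercept
        A = X.T @ X + self.alpha * np.eye(X.shape[1])
        self.coef_ = np.linalg.solve(A, X.T @ y)
        self.intercept_ = 0.0
        return self

    def predict(self, X):
        return X @ self.coef_ + self.intercept_

def fit_each(estimator, X, Y):
    """Fit an independent copy of `estimator` to each row of Y (targets x samples)."""
    return [copy.deepcopy(estimator).fit(X, y) for y in Y]

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))   # samples x features
Y = rng.normal(size=(5, 40))   # targets x samples

fitted = fit_each(TinyRidge(alpha=0.5), X, Y)
coefs = np.vstack([m.coef_ for m in fitted])  # targets x features
```

Copying the estimator before each fit keeps the fitted models independent, so one target's coefficients never overwrite another's.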

tests

Run tests with

py.test

Tests run locally with numpy by default, but the same tests can be run against a local spark installation using

py.test --engine=spark