Home

Awesome

LowRankModels.jl

Build Status Build status codecov

LowRankModels.jl is a Julia package for modeling and fitting generalized low rank models (GLRMs). GLRMs model a data array by a low rank matrix, and include many well known models in data analysis, such as principal components analysis (PCA), matrix completion, robust PCA, nonnegative matrix factorization, k-means, and many more.

For more information on GLRMs, see our paper. There is a python interface to this package, and a GLRM implementation in the H2O machine learning platform with interfaces in a variety of languages.

LowRankModels.jl makes it easy to mix and match loss functions and regularizers to construct a model suitable for a particular data set. In particular, it supports

Installation

To install, just call

Pkg.add("LowRankModels")

at the Julia prompt.

Generalized Low Rank Models

GLRMs form a low rank model for tabular data A with m rows and n columns, which can be input as an array or any array-like object (for example, a data frame). It is fine if only some of the entries have been observed (i.e., the others are missing); the GLRM will only be fit on the !ismissing entries. The desired model is specified by choosing a rank k for the model, an array of loss functions losses, and two regularizers, rx and ry. The data is modeled as X'*Y, where X is a kxm matrix and Y is a kxn matrix. X and Y are found by solving the optimization problem

<!--``\mbox{minimize} \quad \sum_{(i,j) \in \Omega} L_{ij}(x_i y_j, A_{ij}) + \sum_{i=1}^m r_i(x_i) + \sum_{j=1}^n \tilde r_j(y_j)``-->
minimize sum_{(i,j) in obs} losses[j]((X'*Y)[i,j], A[i,j]) + sum_i rx(X[:,i]) + sum_j ry(Y[:,j])

The basic type used by LowRankModels.jl is the GLRM. To form a GLRM, the user specifies

The user may also specify

obs is a list of tuples of the indices of the observed entries in the matrix, and may be omitted if all the entries in the matrix have been observed. If A is a sparse matrix, implicit zeros are interpreted as missing entries by default; see the discussion of sparse matrices below for more details. X₀ and Y₀ are initialization matrices that represent a starting guess for the optimization.

Losses and regularizers must be of type Loss and Regularizer, respectively, and may be chosen from a list of supported losses and regularizers, which include

Losses:

The constructors for all the ordinal and categorical losses take as an argument the maximum (or both minimum and maximum) value the variable may take. Using the one-vs-all loss is equivalent to transforming a categorical value to a one-hot vector and using a binary loss on each entry in that vector. Using the bigger-vs-smaller loss is equivalent to transforming the ordinal value to a Boolean vector and using a binary loss on each entry in that vector. By default, the binary loss used is the logistic loss.

Regularizers:

Each of these losses and regularizers can be scaled (for example, to increase the importance of the loss relative to the regularizer) by calling mul!(loss, newscale). Users may also implement their own losses and regularizers, or adjust internal parameters of the losses and regularizers; see losses.jl and regularizers.jl for more details.

Example

For example, the following code forms a k-means model with k=5 on the 100x100 matrix A:

using LowRankModels
m, n, k = 100, 100, 5
losses = QuadLoss() # minimize squared distance to cluster centroids
rx = UnitOneSparseConstraint() # each row is assigned to exactly one cluster
ry = ZeroReg() # no regularization on the cluster centroids
glrm = GLRM(A, losses, rx, ry, k)

To fit the model, call

X, Y, ch = fit!(glrm)

which runs an alternating directions proximal gradient method on glrm to find the X and Y minimizing the objective function. (ch gives the convergence history; see Technical details below for more information.)

The losses argument can also be an array of loss functions, with one for each column (in order). For example, for a data set with 3 columns, you could use

losses = Loss[QuadLoss(), LogisticLoss(), HingeLoss()]

Similiarly, the ry argument can be an array of regularizers, with one for each column (in order). For example, for a data set with 3 columns, you could use

ry = Regularizer[QuadReg(1), QuadReg(10), FixedLatentFeaturesConstraint([1.,2.,3.])]

This regularizes the first to columns of Y with ||Y[:,1]||^2 + 10||Y[:,2]||^2 and constrains the third (and last) column of Y to be equal to [1,2,3].

More examples here.

Missing data

If not all entries are present in your data table, just tell the GLRM which observations to fit the model to by listing tuples of their indices in obs, e.g., if obs=[(1,2), (5,3)], exactly two entries have been observed. Then initialize the model using

GLRM(A, losses, rx, ry, k, obs=obs)

If A is a DataFrame and you just want the model to ignore any entry that is missing, you can use

obs = observations(A)

Standard low rank models

Low rank models can easily be used to fit standard models such as PCA, k-means, and nonnegative matrix factorization. The following functions are available:

See the code for usage. Any keyword argument valid for a GLRM object, such as an initial value for X or Y or a list of observations, can also be used with these standard low rank models.

Scaling and offsets <a name="scaling"></a>

If you choose, LowRankModels.jl can add an offset to your model and scale the loss functions and regularizers so all columns have the same pull in the model. Simply call

glrm = GLRM(A, losses, rx, ry, k, offset=true, scale=true)

This transformation generalizes standardization, a common preprocessing technique applied before PCA. (For more about offsets and scaling, see the code or the paper.)

You can also add offsets and scalings to previously unscaled models:

add_offset!(glrm)
equilibrate_variance!(glrm)
prob_scale!(glrm)

Fitting DataFrames

Perhaps all this sounds like too much work. Perhaps you happen to have a DataFrame df lying around that you'd like a low rank (e.g., k=2) model for. For example,

import RDatasets
df = RDatasets.dataset("psych", "msq")

Never fear! Just call

glrm, labels = GLRM(df, k)
X, Y, ch = fit!(glrm)

This will fit a GLRM with rank k to your data, using a QuadLoss loss for real valued columns, HingeLoss loss for boolean columns, and ordinal HingeLoss loss for integer columns, a small amount of QuadLoss regularization, and scaling and adding an offset to the model as described here. It returns the column labels for the columns it fit, along with the model. Right now, all other data types are ignored. NaN values are treated as missing values (missings) and ignored in the fit.

The full call signature is

function GLRM(df::DataFrame, k::Int;
              losses = Loss[], rx = QuadReg(.01), ry = QuadReg(.01),
              offset = true, scale = false,
              prob_scale = true, NaNs_to_NAs = true)

You can modify the losses or regularizers, or turn off offsets or scaling, using these keyword arguments.

Or to specify a map from data types to losses, define a new loss_map from datatypes to losses (like probabilistic_losses, below):

probabilistic_losses = Dict{Symbol, Any}(
    :real        => QuadLoss,
    :bool        => LogisticLoss,
    :ord         => MultinomialOrdinalLoss,
    :cat         => MultinomialLoss
)

and input an array of datatypes (one for each column of your data frame: GLRM(A, k, datatypes; loss_map = loss_map). The full call signature is

function GLRM(df::DataFrame, k::Int, datatypes::Array{Symbol,1};
              loss_map = probabilistic_losses,
              rx = QuadReg(.01), ry = QuadReg(.01),
              offset = true, scale = false, prob_scale = true,
              transform_data_to_numbers = true, NaNs_to_NAs = true)

You can modify the losses or regularizers, or turn off offsets or scaling, using these keyword arguments.

To fit a data frame with categorical values, you can use the function expand_categoricals! to turn categorical columns into a Boolean column for each level of the categorical variable. For example, expand_categoricals!(df, [:gender]) will replace the gender column with a column corresponding to gender=male, a column corresponding to gender=female, and other columns corresponding to labels outside the gender binary, if they appear in the data set.

You can use the model to get some intuition for the data set. For example, try plotting the columns of Y with the labels; you might see that similar features are close to each other!

Fitting Sparse Matrices

If you have a very large, sparsely observed dataset, then you may want to encode your data as a sparse matrix. By default, LowRankModels interprets the sparse entries of a sparse matrix as missing entries (i.e. NA values). There is no need to pass the indices of observed entries (obs) -- this is done automatically when GLRM(A::SparseMatrixCSC,...) is called. In addition, calling fit!(glrm) when glrm.A is a sparse matrix will use the sparse variant of the proximal gradient descent algorithm, fit!(glrm, SparseProxGradParams(); kwargs...).

If, instead, you'd like to interpret the sparse entries as zeros, rather than missing or NA entries, use:

glrm = GLRM(...; sparse_na=false)

In this case, the dataset is dense in terms of observations, but sparse in terms of nonzero values. Thus, it may make more sense to fit the model with the vanilla proximal gradient descent algorithm, fit!(glrm, ProxGradParams(); kwargs...).

Parallel fitting (experimental)

LowRankModels makes use of Julia v0.5's new multithreading functionality to fit models in parallel. To fit a LowRankModel in parallel using multithreading, simply set the number of threads from the command line before starting Julia: e.g.,

export JULIA_NUM_THREADS=4

Technical details

Optimization

The function fit! uses an alternating directions proximal gradient method to minimize the objective. This method is not guaranteed to converge to the optimum, or even to a local minimum. If your code is not converging or is converging to a model you dislike, there are a number of parameters you can tweak.

Warm start

The algorithm starts with glrm.X and glrm.Y as the initial estimates for X and Y. If these are not given explicitly, they will be initialized randomly. If you have a good guess for a model, try setting them explicitly. If you think that you're getting stuck in a local minimum, try reinitializing your GLRM (so as to construct a new initial random point) and see if the model you obtain improves.

The function fit! sets the fields glrm.X and glrm.Y after fitting the model. This is particularly useful if you want to use the model you generate as a warm start for further iterations. If you prefer to preserve the original glrm.X and glrm.Y (e.g., for cross validation), you should call the function fit, which does not mutate its arguments.

You can even start with an easy-to-optimize loss function, run fit!, change the loss function (glrm.losses = newlosses), and keep going from your warm start by calling fit! again to fit the new loss functions.

Initialization

If you don't have a good guess at a warm start for your model, you might try one of the initializations provided in LowRankModels.

Parameters

As mentioned earlier, LowRankModels uses alternating proximal gradient descent to derive estimates of X and Y. This can be done by two slightly different procedures: (A) compute the full reconstruction, X' * Y, to compute the gradient and objective function; (B) only compute the model estimate for entries of A that are observed. The first method is likely preferred when there are few missing entries for A because of hardware level optimizations (e.g. chunking the operations so they just fit in various caches). The second method is likely preferred when there are many missing entries of A.

To fit with the first (dense) method:

fit!(glrm, ProxGradParams(); kwargs...)

To fit with the second (sparse) method:

fit!(glrm, SparseProxGradParams(); kwargs...)

The first method is used by default if glrm.A is a standard matrix/array. The second method is used by default if glrm.A is a SparseMatrixCSC.

ProxGradParams() and SparseProxGradParams() run these respective methods with the default parameters:

The default parameters are: ProxGradParams(stepsize=1.0;max_iter=100,inner_iter=1,abs_tol=0.00001,rel_tol=0.0001,min_stepsize=0.01*stepsize)

Convergence

ch gives the convergence history so that the success of the optimization can be monitored; ch.objective stores the objective values, and ch.times captures the times these objective values were achieved. Try plotting this to see if you just need to increase max_iter to converge to a better model.

Imputation

After fitting a GLRM, you can use it to impute values of A in four different ways:

Cross validation

A number of useful functions are available to help you check whether a given low rank model overfits to the test data set. These functions should help you choose adequate regularization for your model.

Cross validation

Regularization paths

Utilities

ScikitLearn

This library implements the ScikitLearn.jl interface. These models are available: SkGLRM, PCA, QPCA, NNMF, KMeans, RPCA. See their docstrings for more information (e.g. ?QPCA). All models support the ScikitLearnBase.fit! and ScikitLearnBase.transform interface. Examples:

## Apply PCA to the iris dataset
using LowRankModels
import ScikitLearnBase
using RDatasets    # may require Pkg.add("RDatasets")

A = convert(Matrix, dataset("datasets", "iris")[[:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]])
ScikitLearnBase.fit_transform!(PCA(k=3, max_iter=500), A)
## Fit K-Means to a fake dataset of two Gaussians
using LowRankModels
import ScikitLearnBase

# Generate two disjoint Gaussians with 100 and 50 points
gaussian1 = randn(100, 2) + 5
gaussian2 = randn(50, 2) - 10
# Merge them into a single dataset
A = vcat(gaussian1, gaussian2)

model = ScikitLearnBase.fit!(LowRankModels.KMeans(), A)
# Count how many points are assigned to each Gaussians (should be 100 and 50)
Set(sum(ScikitLearnBase.transform(model, A), 1))

See also this notebook demonstrating K-Means.

These models can be used inside a ScikitLearn pipeline, and every hyperparameter can be tuned with GridSearchCV.

Citing this package

If you use LowRankModels for published work, we encourage you to cite the software.

Use the following BibTeX citation:

@article{glrm,
    title = {Generalized Low Rank Models},
    author ={Madeleine Udell and Horn, Corinne and Zadeh, Reza and Boyd, Stephen},
    doi = {10.1561/2200000055},
    year = {2016},
    archivePrefix = "arXiv",
    eprint = {1410.0342},
    primaryClass = "stat-ml",
    journal = {Foundations and Trends in Machine Learning},
    number = {1},
    volume = {9},
    issn = {1935-8237},
    url = {http://dx.doi.org/10.1561/2200000055},
}