Home

Awesome

MinimumViableTables

Build Status Coverage Status codecov.io

This package is an attempt to make a minimal interface for fast tables and relational operations in Julia. The primary goals are to provide:

The package is a work-in-progress and currently only provides a rather low-level interface to relational algebra operations and acceleration indexes. A higher-level interface for end users may be desirable.

Tables

A "table" is a finite relation, which is simply a collection of named tuples (rows). Here, we think of this as any AbstractVector{<:NamedTuple{names}} (though we may extend this to AbstractDicts later). The package provides a concrete Table type which stores data as columns, but presents it to Julia as rows of NamedTuple (i.e. a relation).

Constructing a Table is straight forward, such as:

Table(a = [1,2,3], b = [2.0, 4.0, 6.0])

Relational algebra consists of a small set of operations. One can "project" a table to fewer columns using:

project(table, (:a, :b))
# or
Project{(:a, :b)}()(table)

Similarly columns can be renamed with the rename function (or Rename singleton type).

Rows can be manipulated with the standard Julia AbstractVector interface. Operations like vcat, union, intersect, setdiff will produce new vectors of named tuples. We can "select" certain rows using filter, or find the row indexes of certain rows using findall, or use map to construct a boolean array indicating the image of matching- and non-matching rows. For example, we may chose to use a perfectly standard Julia filter operation to obtain only the rows where the values in column a are equal to 1 like so:

filter(row -> row.a == 1, table)

To make certain operations faster, tables may provide a set of "acceleration indexes" which allow mapping of some condition to the row indices. These might include a mapping from the hash of some columns to a row index, or the row indices sorted in a certain order, or simply the knowledge that the values in one or more columns is unique.

Acceleration Indices

An acceleration index provides information about one-or-more columns of a Table to allow for fast lookup of relevant rows. These indexes may accelerate operations such as filtering rows, performing grouping and aggregates, or joins with other tables, and are typically attached to a Table with the accelerate function like so:

table2 = accelerate(table, HashIndex{(:a,)})

Multiple indexes may be attached to a single Table. The built-in set of AbstractIndexes are:

External packages are free to create subtypes of AbstractIndex{names} and specialized algorithms that make use of the new accelerations.

Predicates on a single table

Julia needs a way of determining how to use an acceleration index to make a given operation faster. We might attempt to use an anonymous function to indicate which rows to select, like our earlier example:

filter(row -> row.a == 1, table)

Unfortunately, this obscures a lot of information about our query such as that we only care about the values in column :a, and that we are performing an equality comparison to 1. This means we have no way of knowing how to use one of acceleration indexes to make this operation faster.

To solve this, this package provides the Predicate abstract type. Subtypes of Predicate{names} are Functions that always return a Bool and depend only on the values in the columns names. Each Predicate may be used with filter, findall or map in combination with an acceleration index to achieve a faster result than a full table scan. The package will attempt to use the best acceleration index for the given Predicate and operation. Some example usage:

filter(IsEqual(a = 1), table)
findall(IsLess(b = 2.0), table)
map(In(c = 3..10), table)

The built-in predicates designed to work on a single table are:

Products of tables, and joins

The final relational algebra operation we haven't mentioned is the Cartesian product of two tables. This package provides the ProductTable type which is an AbstractMatrix of NamedTuples. Example usage:

t3 = ProductTable(t1, t2)

So far, we haven't mentioned a "join" operation. In relation algebra, a join is simply the combination of a Cartesian product operation followed by a filter over a Predicate such as equality between two columns. Predicates that relate multiple columns of a table are provided and can be used to accelerate filter, findall and map of a ProductTable, i.e. facilitate fast joins. For example

t4 = filter(Equals(:a, :b), t3)

will join t1 and t2 on equality of columns :a and :b. The set of built-in comparitive Predicates are

The joining operation can occur in one of three ways

What next?

This package needs some more concise surface-level syntax, such as some concept of join and groupby and so-on. One possibility is to integrate this with SplitApplyCombine.jl.

So far all Tables are vectors and ProductTables are limited to 2D. Ideally we could take the Cartesian product of arbitrarily many tables.

For ultimate speed of complex queries, we will probably need to make some more operations behave in a lazy, streaming fashion so e.g. multiple filtering operations may occur in a single loop.

All the predicates are using isequal and isless, for simplicity. We should really support sensible data behavior with NaN and missing.

As always, performance can be optimized. There are some simple benchmarks in the benchmarks/ directory.

I/O operations (including show) should be provided somehow (hopefully using the package ecosystem).