Awesome
DBStage – Flexible, Staged Query Compilation Playground
Introduction
This repository contains a proof of concept for a configurable language-integrated runtime query compiler based on staging. The implementation relies on the Squid type-safe metaprogramming framework for Scala, which makes its code manipulation and generation capabilities fairly robust.
The main features are:
-
An expressive SQL embedded DSL (still currently missing many features), with type-safe integration to normal Scala programs (LinQ-style);
-
A backend implemented using powerful abstractions and Scala modular programming, which allows great configurability at no runtime cost: experiment and combine different ways to store relation tables (column store, row store, hash map), different ways to index them, different ways to query them (push, pull), etc.
Step by step
1. define the database relations
case object Person extends Relation {
val Id = Column[Int]("Id", primary = true)
val Name = Column[String]("Name")
val Age = Column[Int]("Age")
val Sex = Column[Sex]("Sex")
}
2.a. register queries to be executed later, using a SQL-like DSL
import Person._
val q0 = from(Person) where ir"$Age > 18" where ir"$Sex == Male" select (Name,Age)
(Of course, one can write where ir"$Age > 18 && $Sex == Male"
equivalently.)
Note that column types are checked at compile-time, but column reference consistency and ambiguities are checked at query construction time (runtime).
For example if I had written select (Name,Age,Salary)
it would have complained at runtime that there are no such Salary column available. (It would be easy to have a compile-time linter written in Squid to catch these errors earlier.)
2.b. load the data from the file system
Person.loadDataFromFile("data/persons.csv", compileCode = true)
This compiles a program on-the-fly to efficiently load the data given the relation schema.
2.c. on-the-fly compile and execute queries
q0.plan.foreach { case (name, age) => assert(age > 18); println(s"$name $age") }
Notice that the types for name
and age
are correctly inferred as String and Int, respectively.
Importantly, steps 2.a, 2.b and 2.c can be done in any order and can be interleaved.
Another example: all pairs of people of the same age but opposite sex:
val m = from(Person)
val f = from(Person)
val q = ((m where ir"$Sex == Male") join (f where ir"$Sex == Female"))(ir"${m.Age} == ${f.Age}")
.select (m.Age, m.Name, f.Name, m.Id, f.Id)
q.printLines
Which prints the following:
Age(0) | Name(0) | Name(1) | Id(0) | Id(1) |
---|---|---|---|---|
41 | bob parker | julia kenn | 1 | 6 |
... | ... | ... | ... | ... |
The currently supported functionalities are:
- Selection, projection, filtering, (hash) joins
- Option to load data in a hashmap where the keys are the primary keys of the relation; this structure is then used to perform faster joins
- Option to store data in column store, on a per-relation basis (if the above is not applied on the given relation)
- User-defined functions and data types
- Pushing and pulling are both supported
- The type-safe DSL means one can integrate queries inside general purpose program, using DBStage as a simple Scala library
- Engine is agnostic in the underlying data structures and row representation; tables currently use tuples and Scala ArrayBuffer/HashMap's, but we could easily experiment with off-heap memory to avoid boxing, for example.
What I'd like to have in the future:
- Aggregations, grouping, sorting
- Customize the storage of data optimizing for registered queries (possibly adapt it dynamically as more queries are registered)
- Option to instrument the data loading code to add more error recovery and/or add data analytics guiding subsequent query compilation
- Extend the SQL subset with updates, perhaps transactions
- Handling of data on disk, and associated cache management?