<img src="static/logo_c.png" width="200" height="175" alt="Logo">

CozoDB

Table of contents

  1. Introduction
  2. Getting started
  3. Install
  4. Architecture
  5. Status of the project
  6. Links
  7. Licensing and contributing

🎉🎉🎉 New versions 🎉🎉🎉

Version v0.7: after HNSW vector search in 0.6, version 0.7 brings you MinHash-LSH for near-duplicate search, full-text search, JSON value support and more! See here for more details.


Version v0.6 released! This version brings vector search with HNSW indices inside Datalog, which can be integrated seamlessly with powerful features like ad-hoc joins, recursive Datalog and classical whole-graph algorithms. This significantly expands the horizon of possibilities of CozoDB.

Highlights:

See here for more details.

Introduction

CozoDB is a general-purpose, transactional, relational database that uses Datalog for queries, is embeddable but can also handle huge amounts of data and concurrency, and focuses on graph data and algorithms. It supports time travel and it is performant!

What does embeddable mean here?

A database is almost surely embedded if you can use it on a phone which never connects to any network (this situation is not as unusual as you might think). SQLite is embedded. MySQL/Postgres/Oracle are client-server.

A database is embedded if it runs in the same process as your main program. This is in contradistinction to client-server databases, where your program connects to a database server (maybe running on a separate machine) via a client library. Embedded databases generally require no setup and can be used in a much wider range of environments.

We say CozoDB is embeddable instead of embedded since you can also use it in client-server mode, which can make better use of server resources and allow much more concurrency than in embedded mode.
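To make "runs in the same process" concrete, here is a minimal sketch using SQLite (the embedded example named above) through Python's standard `sqlite3` module. The table and rows are toy data made up for illustration, and none of this is CozoDB's API.

```python
import sqlite3

# An embedded database is just a library call away: no server to start,
# no network connection, everything runs inside this process.
conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
conn.execute("CREATE TABLE route (fr TEXT, dest TEXT)")
conn.executemany(
    "INSERT INTO route VALUES (?, ?)",
    [("FRA", "YYZ"), ("FRA", "AUH"), ("AUH", "BNE")],  # toy rows, not real data
)

# Count the distinct destinations directly connected to FRA.
direct = conn.execute(
    "SELECT COUNT(DISTINCT dest) FROM route WHERE fr = ?", ("FRA",)
).fetchone()[0]
print(direct)
conn.close()
```

A client-server database would need a running server and connection details before the first query could be sent.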

Why graphs?

Because data are inherently interconnected. Most insights about data can only be obtained if you take this interconnectedness into account.

Most existing graph databases start by requiring you to shoehorn your data into the labelled-property graph model. We don't go this route because we think the traditional relational model is much easier to work with for storing data, much more versatile, and can deal with graph data just fine. Even more importantly, the most piercing insights about data usually come from graph structures implicit several levels deep in your data. The relational model, being an algebra, can deal with it just fine. The property graph model, not so much, since that model is not very composable.

What is so cool about Datalog?

Datalog can express all relational queries. Recursion in Datalog is much easier to express, much more powerful, and usually runs faster than in SQL. Datalog is also extremely composable: you can build your queries piece by piece.

Recursion is especially important for graph queries. CozoDB's dialect of Datalog supercharges it even further by allowing recursion through a safe subset of aggregations, and by providing extremely efficient canned algorithms (such as PageRank) for the kinds of recursions frequently required in graph analysis.

As you learn Datalog, you will discover that the rules of Datalog are like functions in a programming language. Rules are composable, and decomposing a query into rules can make it clearer and more maintainable, with no loss in efficiency. This is unlike the monolithic approach taken by the SQL select-from-where in nested forms, which can sometimes read like golfing.

Time travel?

Time travel in the database setting means tracking changes to data over time and allowing queries to be logically executed at a point in time to get a historical view of the data.

In a sense, this makes your database immutable, since nothing is really deleted from the database ever.

In Cozo, instead of having all data automatically support time travel, we let you decide whether you want the capability for each of your relations. Every extra functionality comes with its cost, and you don't want to pay the price if you don't use it.

We have written a short story about why you might want time travel for your data.
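The idea of querying "as of" a point in time can be sketched with a tiny append-only store. This is only an illustration of the concept under our own assumptions (integer timestamps, `None` as a retraction marker, a hypothetical `VersionedStore` class); it is not how CozoDB implements time travel.

```python
from bisect import bisect_right


class VersionedStore:
    """Append-only store: every write is kept, tagged with a timestamp."""

    def __init__(self):
        self._history = {}  # key -> list of (timestamp, value), ascending

    def put(self, key, value, ts):
        versions = self._history.setdefault(key, [])
        if versions and ts <= versions[-1][0]:
            raise ValueError("timestamps must increase per key")
        versions.append((ts, value))

    def retract(self, key, ts):
        # A deletion is just another recorded fact; older versions remain.
        self.put(key, None, ts)

    def get(self, key, as_of):
        """Return the value that was current at time `as_of` (None if absent)."""
        versions = self._history.get(key, [])
        idx = bisect_right([t for t, _ in versions], as_of)
        return versions[idx - 1][1] if idx else None


store = VersionedStore()
store.put("mood", "happy", ts=1)
store.put("mood", "sad", ts=5)
store.retract("mood", ts=9)
print(store.get("mood", as_of=3), store.get("mood", as_of=10))
```

Nothing is overwritten, which is why the historical view at time 3 is still available after the retraction.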

How performant?

On a 2020 Mac Mini with the RocksDB persistent storage engine (CozoDB supports many storage engines):

For more numbers and further details, we have a writeup about performance here.

Getting started

Usually, to learn a database, you need to install it first. This is unnecessary for CozoDB, a testament to its extreme embeddability: you can run a complete CozoDB instance in your browser, at near-native speed for most operations!

So open up the CozoDB in WASM page, and then:

Or you can skip ahead for the information about installing CozoDB into your favourite environment first.

Teasers

If you are in a hurry and just want a taste of what querying with CozoDB is like, here it is. In the following, *route is a relation with two columns, fr and to, representing a route between two airports, and FRA is the code for Frankfurt Airport.

How many airports are directly connected to FRA?

?[count_unique(to)] := *route{fr: 'FRA', to}
count_unique(to)
310

How many airports are reachable from FRA by one stop?

?[count_unique(to)] := *route{fr: 'FRA', to: stop},
                       *route{fr: stop, to}
count_unique(to)
2222

How many airports are reachable from FRA by any number of stops?

reachable[to] := *route{fr: 'FRA', to}
reachable[to] := reachable[stop], *route{fr: stop, to}
?[count_unique(to)] := reachable[to]
count_unique(to)
3462
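If recursive rules are new to you, it may help to see what the two `reachable` rules mean operationally. The following is a naive sketch in Python on a made-up toy graph: apply the rules repeatedly until no new facts appear. The function name and data are ours, and a real engine evaluates such rules far more cleverly.

```python
def reachable_from(routes, start):
    """Naive fixpoint for the two rules:
    reachable[to] := *route{fr: start, to}
    reachable[to] := reachable[stop], *route{fr: stop, to}
    """
    # Base rule: everything directly connected to the start.
    reachable = {to for fr, to in routes if fr == start}
    while True:
        # Recursive rule: follow one more route from anything already reached.
        derived = {to for fr, to in routes if fr in reachable}
        if derived <= reachable:  # nothing new: fixpoint reached
            return reachable
        reachable |= derived


# Toy routes (hypothetical); note the cycle back to FRA.
routes = {("FRA", "YYZ"), ("YYZ", "YTS"), ("YTS", "FRA"), ("AUH", "BNE")}
print(len(reachable_from(routes, "FRA")))
```

Because the result is a set, the cycle does no harm: the loop stops as soon as a round derives nothing new.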

What are the two most difficult-to-reach airports by the minimum number of hops required, starting from FRA?

shortest_paths[to, shortest(path)] := *route{fr: 'FRA', to},
                                      path = ['FRA', to]
shortest_paths[to, shortest(path)] := shortest_paths[stop, prev_path],
                                      *route{fr: stop, to},
                                      path = append(prev_path, to)
?[to, path, p_len] := shortest_paths[to, path], p_len = length(path)

:order -p_len
:limit 2
| to | path | p_len |
|---|---|---|
| YPO | ["FRA","YYZ","YTS","YMO","YFA","ZKE","YAT","YPO"] | 8 |
| BVI | ["FRA","AUH","BNE","ISA","BQL","BEU","BVI"] | 7 |
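For comparison, here is roughly what the fewest-hops query computes, written as a breadth-first search in Python. This is a sketch on hypothetical toy routes, not the engine's evaluation strategy: it keeps one shortest path per destination, then sorts by path length like `:order -p_len` and `:limit 2`.

```python
from collections import deque


def fewest_hop_paths(routes, start):
    """Map each reachable airport to one path from `start` with the fewest hops."""
    adjacency = {}
    for fr, to in routes:
        adjacency.setdefault(fr, []).append(to)

    paths = {}
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        for nxt in adjacency.get(path[-1], []):
            if nxt not in paths:  # first visit in BFS order is a shortest one
                paths[nxt] = path + [nxt]
                queue.append(paths[nxt])
    return paths


# Toy routes (hypothetical).
routes = [("FRA", "YYZ"), ("YYZ", "YTS"), ("FRA", "YTS"), ("YTS", "YMO")]
paths = fewest_hop_paths(routes, "FRA")

# Equivalent of ":order -p_len" followed by ":limit 2".
hardest = sorted(paths.items(), key=lambda kv: -len(kv[1]))[:2]
print(hardest)
```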

What is the shortest path between FRA and YPO, by actual distance travelled?

start[] <- [['FRA']]
end[] <- [['YPO']]
?[src, dst, distance, path] <~ ShortestPathDijkstra(*route[], start[], end[])
| src | dst | distance | path |
|---|---|---|---|
| FRA | YPO | 4544.0 | ["FRA","YUL","YVO","YKQ","YMO","YFA","ZKE","YAT","YPO"] |
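The canned algorithm saves you from writing the search yourself. For readers who want to see the underlying idea, here is a textbook Dijkstra sketch in Python using `heapq`. The edge list and distances are invented toy numbers, and this is not CozoDB's implementation.

```python
import heapq


def dijkstra(edges, start, goal):
    """Return (distance, path) for the cheapest route, or None if unreachable."""
    adjacency = {}
    for fr, to, dist in edges:
        adjacency.setdefault(fr, []).append((to, dist))

    frontier = [(0.0, start, [start])]  # (distance so far, node, path taken)
    settled = set()
    while frontier:
        dist, node, path = heapq.heappop(frontier)
        if node == goal:
            return dist, path
        if node in settled:
            continue  # already expanded via a cheaper entry
        settled.add(node)
        for nxt, step in adjacency.get(node, []):
            if nxt not in settled:
                heapq.heappush(frontier, (dist + step, nxt, path + [nxt]))
    return None


# Toy edges with made-up distances: the direct flight is not the shortest.
edges = [
    ("FRA", "YPO", 8000.0),
    ("FRA", "YUL", 5800.0),
    ("YUL", "YPO", 1200.0),
    ("FRA", "YYZ", 6300.0),
    ("YYZ", "YPO", 900.0),
]
print(dijkstra(edges, "FRA", "YPO"))
```

As in the query results above, the path with the fewest hops and the path with the shortest distance need not coincide.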

CozoDB attempts to provide nice error messages when you make mistakes:

?[x, Y] := x = 1, y = x + 1
<pre><span style="color: rgb(204, 0, 0);">eval::unbound_symb_in_head</span><span> </span><span style="color: rgb(204, 0, 0);">×</span><span> Symbol 'Y' in rule head is unbound ╭──── </span><span style="color: rgba(0, 0, 0, 0.5);">1</span><span> │ ?[x, Y] := x = 1, y = x + 1 · </span><span style="font-weight: bold; color: rgb(255, 0, 255);"> ─</span><span> ╰──── </span><span style="color: rgb(0, 153, 255);"> help: </span><span>Note that symbols occurring only in negated positions are not considered bound </span></pre>

Install

We suggest that you try out CozoDB before you install it in your environment.

How you install CozoDB depends on which environment you want to use it in. Follow the links in the table below:

| Language/Environment | Official platform support | Storage |
|---|---|---|
| Python | Linux (x86_64), Mac (ARM64, x86_64), Windows (x86_64) | MQR |
| NodeJS | Linux (x86_64, ARM64), Mac (ARM64, x86_64), Windows (x86_64) | MQR |
| Web browser | Modern browsers supporting WebAssembly | M |
| Java (JVM) | Linux (x86_64, ARM64), Mac (ARM64, x86_64), Windows (x86_64) | MQR |
| Clojure (JVM) | Linux (x86_64, ARM64), Mac (ARM64, x86_64), Windows (x86_64) | MQR |
| Android | Android (ARM64, ARMv7, x86_64, x86) | MQ |
| iOS/MacOS (Swift) | iOS (ARM64, simulators), Mac (ARM64, x86_64) | MQ |
| Rust | Source only, usable on any platform with std support | MQRST |
| Golang | Linux (x86_64, ARM64), Mac (ARM64, x86_64), Windows (x86_64) | MQR |
| C/C++/language with C FFI | Linux (x86_64, ARM64), Mac (ARM64, x86_64), Windows (x86_64) | MQR |
| Standalone HTTP server | Linux (x86_64, ARM64), Mac (ARM64, x86_64), Windows (x86_64) | MQRST |
| Lisp | Linux (x86_64 so far) | MR |
| Smalltalk | Win10 & Linux (Ubuntu 23.04) x86_64 tested, MacOS should probably work | MQR |

For the storage column:

The Rust doc has some tips on choosing storage, which are helpful even if you are not using Rust. Even if a storage/platform is not officially supported, you can still try to compile your own version, maybe with some tweaks in the code.

Tuning the RocksDB backend for CozoDB

RocksDB has a lot of options, and by tuning them you can achieve better performance for your workload. This is probably unnecessary for 95% of users, but if you are in the remaining 5% and use the RocksDB storage engine, CozoDB lets you tune RocksDB directly.

When you create the CozoDB instance with the RocksDB backend option, you are asked to provide a path to a directory in which to store the data (the directory will be created if it does not exist). If you put a file named options inside this directory, the engine will expect this to be a RocksDB options file and use it. If you are using the standalone cozo executable, you will get a log message if this feature is activated.

Note that improperly set options can make your database misbehave! In general, you should run your database once, copy the options file data/OPTIONS-XXXXXX from within your database directory, and use that as a base for your customization. If you are not an expert on RocksDB, we suggest you limit your changes to adjusting those numerical options of which you have at least a vague understanding.
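The copy step described above could be scripted along these lines. This is a sketch assuming the layout stated in this section (generated files at `data/OPTIONS-XXXXXX` inside the database directory, and a custom file named `options` at its top level); the helper name `seed_options_file` is ours and not part of CozoDB.

```python
from pathlib import Path
import shutil


def seed_options_file(db_dir):
    """Copy the most recent RocksDB-generated OPTIONS file to <db_dir>/options."""
    db_dir = Path(db_dir)
    candidates = sorted(
        (db_dir / "data").glob("OPTIONS-*"),
        key=lambda p: p.stat().st_mtime,  # newest file last
    )
    if not candidates:
        raise FileNotFoundError(
            "no OPTIONS-* file found; run the database once first"
        )
    target = db_dir / "options"
    shutil.copyfile(candidates[-1], target)
    return target


# Usage (the path is a placeholder for your own database directory):
# seed_options_file("/path/to/your/db")
```

You would then edit the resulting `options` file by hand, changing only the settings you understand.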

Architecture

CozoDB consists of three layers stacked on top of each other, with each layer only calling into the layer below:

<table> <tbody> <tr><td>(<i>User code</i>)</td></tr> <tr><td>Language/environment wrapper</td></tr> <tr><td>Query engine</td></tr> <tr><td>Storage engine</td></tr> <tr><td>(<i>Operating system</i>)</td></tr> </tbody> </table>

Storage engine

The storage engine defines a storage trait for the storage backend, which is an interface with required operations, mainly the provision of a key-value store for binary data with range scan capabilities. There are various implementations:

Depending on the build configuration, not all backends may be available in a binary release. The SQLite backend is special in that it is also used as the backup file format, which allows the exchange of data between databases with different backends. If you are using the database embedded in Rust, you can even provide your own custom backend.
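To give a feel for the kind of interface a backend has to offer, here is a toy in-memory key-value store with range scans, sketched in Python. The class and method names are hypothetical and much simpler than the actual Rust storage trait, which we do not reproduce here.

```python
from bisect import bisect_left, insort


class MemoryKV:
    """Toy ordered key-value store over bytes keys with range scans."""

    def __init__(self):
        self._keys = []    # keys kept in sorted (lexicographic) order
        self._values = {}  # key -> value

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._values:
            insort(self._keys, key)
        self._values[key] = value

    def get(self, key: bytes):
        return self._values.get(key)

    def delete(self, key: bytes) -> None:
        if key in self._values:
            del self._values[key]
            self._keys.pop(bisect_left(self._keys, key))

    def range_scan(self, lower: bytes, upper: bytes):
        """Yield (key, value) pairs with lower <= key < upper, in key order."""
        start = bisect_left(self._keys, lower)
        end = bisect_left(self._keys, upper)
        for key in self._keys[start:end]:
            yield key, self._values[key]


kv = MemoryKV()
for k in (b"route/FRA/YYZ", b"route/AUH/BNE", b"route/FRA/AUH", b"airport/FRA"):
    kv.put(k, b"")
print([k for k, _ in kv.range_scan(b"route/FRA/", b"route/FRA0")])
```

The range scan is what makes prefix lookups cheap: all keys sharing a prefix sit next to each other in sorted order.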

The storage engine also defines a row-oriented binary data format, which the storage engine implementation does not need to know anything about. This format contains an implementation of the memcomparable format used for the keys, which enables the storage of rows of data as binary blobs that, when sorted lexicographically, give the correct order. This also means that data files for the SQLite backend cannot be queried with SQL in the usual way, and access must be through the decoding process in CozoDB.
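The key property of a memcomparable format is that comparing the encoded bytes gives the same result as comparing the original values. The sketch below shows one common way to achieve this for signed integers and strings in Python; the exact encoding is our own illustration and is not CozoDB's actual format.

```python
import struct


def encode_int(n: int) -> bytes:
    """Encode a signed 64-bit int so that byte order equals numeric order."""
    # Shift [-2**63, 2**63) to [0, 2**64), then store big-endian.
    return struct.pack(">Q", n + (1 << 63))


def encode_str(s: str) -> bytes:
    """Encode a string so that it sorts correctly even when followed by more fields."""
    # Escape embedded 0x00 as 0x00 0xFF, then terminate with 0x00 0x00.
    return s.encode("utf-8").replace(b"\x00", b"\x00\xff") + b"\x00\x00"


def encode_row(name: str, n: int) -> bytes:
    """Concatenate the field encodings into one sortable key."""
    return encode_str(name) + encode_int(n)


rows = [("b", -1), ("a", 10), ("a", -5), ("ab", 0), ("a", 2)]
by_bytes = sorted(rows, key=lambda r: encode_row(*r))
print(by_bytes == sorted(rows))
```

With keys encoded this way, a backend that only knows how to sort and scan raw bytes still returns rows in the right logical order.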

Query engine

The query engine part provides various functionalities:

Most of CozoDB's code lives in this part. The CozoScript manual has a chapter about the execution process.

Users interact with the query engine through the Rust API.

Language/environment wrapper

For all languages/environments except Rust, this part just translates the Rust API into something that can be easily consumed by the targets. For Rust, there is no wrapper. For example, in the case of the standalone server, the Rust API is translated into HTTP endpoints, whereas in the case of NodeJS, the (synchronous) Rust API is translated into a series of asynchronous calls from the JavaScript runtime.
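The general pattern of exposing a synchronous call to an asynchronous runtime can be sketched in Python with `asyncio`. Here `run_query_blocking` is a hypothetical stand-in for a blocking engine call; it is not the CozoDB API, and the NodeJS wrapper itself is built differently.

```python
import asyncio


def run_query_blocking(script: str) -> dict:
    """Stand-in for a synchronous engine call (hypothetical, not CozoDB's API)."""
    return {"ok": True, "script": script}


async def run_query(script: str) -> dict:
    # Run the blocking call on a worker thread so the event loop stays free.
    return await asyncio.to_thread(run_query_blocking, script)


async def main():
    # Both queries are awaited concurrently from the caller's point of view.
    return await asyncio.gather(
        run_query("?[x] := x = 1"),
        run_query("?[y] := y = 2"),
    )


results = asyncio.run(main())
print(results)
```

The wrapper adds no query logic of its own; it only adapts the calling convention to what the host environment expects.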

If you want to make CozoDB usable in other languages, this part is where your focus should be. Any existing generic interop libraries between Rust and your target language would make the job much easier. Otherwise, you can consider wrapping the C API, as this is supported by most languages. For the languages officially supported, only Golang wraps the C API directly.

Status of the project

CozoDB is still very young, but we encourage you to try it out for your use case. Any feedback is welcome.

Versions before 1.0 do not promise syntax/API stability or storage compatibility.

Links

Licensing and contributing

This project is licensed under MPL-2.0 or later. See here if you are interested in contributing to the project.