kwack - In-Memory Analytics for Kafka using DuckDB

kwack supports in-memory analytics for Kafka data using DuckDB.

Getting Started

Note that kwack requires Java 11 or higher.

To run kwack, download a release and unpack it. Then change to the kwack-${version} directory and run the following to see the command-line options:

$ bin/kwack -h

Usage: kwack [-hV] [-t=<topic>]... [-p=<partition>]... [-b=<broker>]...
             [-m=<ms>] [-F=<config-file>] [-o=<offset>] [-k=<topic=serde>]...
             [-v=<topic=serde>]... [-r=<url>] [-q=<query>] [-a=<attr>]...
             [-d=<db>] [-X=<prop=val>]...
In-Memory Analytics for Kafka using DuckDB.
  -t, --topic=<topic>               Topic(s) to consume from and produce to
  -p, --partition=<partition>       Partition(s)
  -b, --bootstrap-server=<broker>   Bootstrap broker(s) (host:[port])
  -m, --metadata-timeout=<ms>       Metadata (et.al.) request timeout
  -F, --file=<config-file>          Read configuration properties from file
  -o, --offset=<offset>             Offset to start consuming from:
                                      beginning | end |
                                      <value>  (absolute offset) |
                                      -<value> (relative offset from end)
                                      @<value> (timestamp in ms to start at)
                                      Default: beginning
  -k, --key-serde=<topic=serde>     (De)serialize keys using <serde>
  -v, --value-serde=<topic=serde>   (De)serialize values using <serde>
                                    Available serdes:
                                      short | int | long | float |
                                      double | string | json | binary |
                                      avro:<schema|@file> |
                                      json:<schema|@file> |
                                      proto:<schema|@file> |
                                      latest (use latest version in SR) |
                                      <id>   (use schema id from SR)
                                      Default for key:   binary
                                      Default for value: latest
                                    The proto/latest/<id> serde formats can
                                    also take a message type name, e.g.
                                      proto:<schema|@file>;msg:<name>
                                    in case multiple message types exist
  -r, --schema-registry-url=<url>   SR (Schema Registry) URL
  -q, --query=<query>               SQL query to execute. If none is specified,
                                      interactive sqlline mode is used
  -a, --row-attribute=<attr>        Row attribute(s) to show:
                                      none
                                      rowkey (record key)
                                      ksi    (key schema id)
                                      vsi    (value schema id)
                                      top    (topic)
                                      par    (partition)
                                      off    (offset)
                                      ts     (timestamp)
                                      tst    (timestamp type)
                                      epo    (leadership epoch)
                                      hdr    (headers)
                                      Default: rowkey,ksi,vsi,par,off,ts,hdr
  -d, --db=<db>                     DuckDB db, appended to 'jdbc:duckdb:'
                                      Default: :memory:
  -x, --skip-bytes=<bytes>          Extra bytes to skip when deserializing with
                                      an external schema
  -X, --property=<prop=val>         Set configuration property.
  -h, --help                        Show this help message and exit.
  -V, --version                     Print version information and exit.
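
For example, combining several of the options above (broker and topic names are placeholders), the following sketch counts the last 1000 records in partition 0 of a topic, using the relative-offset syntax described under -o:

$ bin/kwack -b mybroker -t mytopic -p 0 -o -1000 -q "SELECT COUNT(*) FROM mytopic"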

kwack shares many command-line options with kcat (formerly kafkacat). In addition, configuration properties can be read from a file; the available configuration properties are listed in the project documentation.
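
A minimal properties file might look like the following sketch; the property names shown follow standard Kafka client and Schema Registry conventions and are assumptions, so check the shipped config/kwack.properties for the authoritative names:

# Kafka broker(s) to connect to
bootstrap.servers=mybroker:9092
# Confluent Schema Registry URL
schema.registry.url=http://schema-registry-url:8081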

Simply modify config/kwack.properties to point to an existing Kafka broker and Schema Registry. Then run the following:

# Run with properties file
$ bin/kwack -F config/kwack.properties

Starting kwack is as easy as specifying a Kafka broker, topic, and Schema Registry URL:

$ bin/kwack -b mybroker -t mytopic -r http://schema-registry-url:8081
Welcome to kwack!
Enter "!help" for usage hints.

      ___(.)>
~~~~~~\___)~~~~~~

jdbc:duckdb::memory:>

When kwack starts, it enters interactive mode, where you can enter SQL queries at the prompt to analyze Kafka data, for example (col1 is a placeholder column name):
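
jdbc:duckdb::memory:> SELECT col1, COUNT(*) FROM mytopic GROUP BY col1;

For non-interactive mode, specify a query on the command line: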

$ bin/kwack -b mybroker -t mytopic -r http://schema-registry-url:8081 -q "SELECT * FROM mytopic"

The output of the above command will be in JSON, and so can be piped to other commands like jq.
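
For example, assuming the output is one JSON object per row, a single field (here the placeholder col1) can be extracted as follows:

$ bin/kwack -b mybroker -t mytopic -r http://schema-registry-url:8081 -q "SELECT * FROM mytopic" | jq '.col1'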

One can load multiple topics, and then perform a query that joins the resulting tables on a common column:

$ bin/kwack -b mybroker -t mytopic -t mytopic2 -r http://schema-registry-url:8081 -q "SELECT * FROM mytopic JOIN mytopic2 USING (col1)"

One can convert Kafka data into Parquet format by using the COPY command in DuckDB:

$ bin/kwack -b mybroker -t mytopic -r http://schema-registry-url:8081 -q "COPY mytopic to 'mytopic.parquet' (FORMAT 'parquet')"
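
The same COPY command can target other DuckDB output formats; for instance, the following sketch writes CSV with a header row:

$ bin/kwack -b mybroker -t mytopic -r http://schema-registry-url:8081 -q "COPY mytopic TO 'mytopic.csv' (FORMAT 'csv', HEADER)"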

If not using Confluent Schema Registry, one can pass an external schema:

$ bin/kwack -b mybroker -t mytopic -v mytopic=proto:@/path/to/myschema.proto
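
If the schema file defines multiple message types, append the message name using the msg parameter shown in the help above (MyMessage is a placeholder; the quotes keep the shell from interpreting the semicolon):

$ bin/kwack -b mybroker -t mytopic -v 'mytopic=proto:@/path/to/myschema.proto;msg:MyMessage'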

For a given schema, kwack creates DuckDB columns by mapping Avro, Protobuf, and JSON Schema types to DuckDB types as follows:

| Avro                     | Protobuf                  | JSON Schema  | DuckDB       |
|--------------------------|---------------------------|--------------|--------------|
| boolean                  | boolean                   | boolean      | BOOLEAN      |
| int                      | int32, sint32, sfixed32   |              | INTEGER      |
|                          | uint32, fixed32           |              | UINTEGER     |
| long                     | int64, sint64, sfixed64   | integer      | BIGINT       |
|                          | uint64, fixed64           |              | UBIGINT      |
| float                    | float                     |              | FLOAT        |
| double                   | double                    | number       | DOUBLE       |
| string                   | string                    | string       | VARCHAR      |
| bytes, fixed             | bytes                     |              | BLOB         |
| enum                     | enum                      | enum         | ENUM         |
| record                   | message                   | object       | STRUCT       |
| array                    | repeated                  | array        | LIST         |
| map                      | map                       |              | MAP          |
| union                    | oneof                     | oneOf, anyOf | UNION        |
| decimal                  | confluent.type.Decimal    |              | DECIMAL      |
| date                     | google.type.Date          |              | DATE         |
| time-millis, time-micros | google.type.TimeOfDay     |              | TIME         |
| timestamp-millis         |                           |              | TIMESTAMP_MS |
| timestamp-micros         |                           |              | TIMESTAMP    |
| timestamp-nanos          | google.protobuf.Timestamp |              | TIMESTAMP_NS |
| duration                 | google.protobuf.Duration  |              | INTERVAL     |
| uuid                     |                           |              | UUID         |
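
Because nested schemas become DuckDB STRUCT and LIST columns, standard DuckDB syntax applies when querying them. As a sketch, assuming a hypothetical STRUCT column named address with a city field and a hypothetical LIST column named tags:

SELECT address.city, UNNEST(tags) AS tag FROM mytopic;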

For more on how to use kwack, see this blog.