duckdb_protobuf
a duckdb extension for parsing sequences of protobuf messages encoded in either the standard varint delimited format or a u32 big endian delimited format.
quick start
ensure you're using duckdb 1.1.0 for support with the latest features. if you need new features on an older version, please open an issue.
$ duckdb -version
v1.1.0 fa5c2fe15f
start duckdb with the -unsigned flag to allow loading unsigned libraries.
$ duckdb -unsigned
or if you're using the jdbc connector, you can do this with the allow_unsigned_extensions jdbc connection property.
now install the extension:
INSTALL protobuf from 'https://duckdb.0xcaff.xyz';
next load it (you'll need to do this once for every session you want to use the extension)
LOAD protobuf;
and start shredding up your protobufs!
SELECT *
FROM protobuf(
descriptors = './descriptor.pb',
files = './scrape/data/SceneVersion/**/*.bin',
message_type = 'test_server.v1.GetUserSceneVersionResponse',
delimiter = 'BigEndianFixed'
)
LIMIT 10;
if you want builds for a platform or version which currently doesn't have builds, please open an issue.
<details> <summary>install from file</summary>
download the latest version from releases. if you're on macOS, blow away the quarantine params with the following to allow the file to be loaded:
$ xattr -d com.apple.quarantine /Users/martin/Downloads/protobuf.duckdb_extension
next load the extension
LOAD '/Users/martin/Downloads/protobuf.duckdb_extension';
</details>
why
sometimes you want to land your raw primary data in a format with a well-defined structure and pretty good decode performance, then poke around without a load step. maybe you're scraping an endpoint which returns protobuf responses and figuring out the schema as you go, so iteration speed matters much more than query performance.
duckdb_protobuf allows for making a new choice along the flexibility-performance tradeoff continuum for fast exploration of protobuf streams with little upfront load complexity or time.
configuration
- descriptors: path to the protobuf descriptor file. Generated using something like protoc --descriptor_set_out=descriptor.pb ...
- files: glob pattern for the files to read. Uses the glob crate for evaluating globs.
- message_type: the fully qualified message type to parse.
- delimiter: specifies where one message starts and the next one begins
  - BigEndianFixed: every message is prefixed with a u32 big endian value specifying its length. files are a sequence of messages
  - Varint: every message is prefixed with a protobuf Varint value specifying its length (see the protobuf encoding documentation). files are a sequence of messages
  - SingleMessagePerFile: each file contains a single message
- filename, position and size: boolean values enabling columns which add source information about where the messages originated from
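the two length-prefixed delimiter formats can be sketched in a few lines of python. this is a minimal illustration of the framing only, not the extension's implementation: the helper names (encode_varint, read_varint, write_messages, read_messages) are made up for this example, and payloads are treated as opaque bytes that would normally be serialized protobuf messages.

```python
import io
import struct
from typing import Iterable, Iterator, Optional


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def read_varint(stream: io.BytesIO) -> Optional[int]:
    """Read one varint from the stream; return None at a clean end of stream."""
    result = 0
    shift = 0
    while True:
        raw = stream.read(1)
        if not raw:
            if shift == 0:
                return None
            raise EOFError("truncated varint")
        result |= (raw[0] & 0x7F) << shift
        if not raw[0] & 0x80:
            return result
        shift += 7


def write_messages(messages: Iterable[bytes], delimiter: str) -> bytes:
    """Frame each payload with a length prefix in the chosen format."""
    out = bytearray()
    for payload in messages:
        if delimiter == "BigEndianFixed":
            out += struct.pack(">I", len(payload))  # u32, big endian
        elif delimiter == "Varint":
            out += encode_varint(len(payload))
        else:
            raise ValueError(f"unsupported delimiter: {delimiter}")
        out += payload
    return bytes(out)


def read_messages(data: bytes, delimiter: str) -> Iterator[bytes]:
    """Yield the payloads from a length-prefixed sequence of messages."""
    stream = io.BytesIO(data)
    while True:
        if delimiter == "BigEndianFixed":
            header = stream.read(4)
            if not header:
                return
            if len(header) < 4:
                raise EOFError("truncated length prefix")
            (size,) = struct.unpack(">I", header)
        elif delimiter == "Varint":
            size = read_varint(stream)
            if size is None:
                return
        else:
            raise ValueError(f"unsupported delimiter: {delimiter}")
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("truncated message")
        yield payload
```

something like write_messages can be handy for producing small test files to point the files option at, assuming the payloads are real serialized messages of the configured message_type.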
features
- converts google.protobuf.Timestamp messages to duckdb timestamp
- supports nested messages with repeating fields
- scales decoding across as many threads as duckdb allows
- supports projection pushdown (for first level of columns) ensuring only necessary columns are decoded.
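to make the timestamp conversion concrete: a google.protobuf.Timestamp carries seconds and nanos fields counted from the unix epoch, while a duckdb TIMESTAMP has microsecond precision. the sketch below shows the equivalent conversion in python. the function name is hypothetical, and truncating sub-microsecond nanos is an assumption for illustration, since how the extension handles them isn't stated here.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def timestamp_to_datetime(seconds: int, nanos: int) -> datetime:
    """Convert Timestamp-style (seconds, nanos) fields to a UTC datetime.

    Nanoseconds are truncated to microseconds (an assumption, see above).
    """
    if not 0 <= nanos < 1_000_000_000:
        raise ValueError("nanos must be in the range [0, 1e9)")
    return EPOCH + timedelta(seconds=seconds, microseconds=nanos // 1000)
```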
limitations
- doesn't support a few types (bytes, maps, {s,}fixed{32,64}, sint{32,64}). contributions, and even just feedback that these field types are in use, are welcome!
i'm releasing this to understand how other folks are using protobuf streams and duckdb. i'm open to PRs, issues and other feedback.