TIP Catalog for Apache Iceberg

This is TIP: A Rust-native implementation of the Apache Iceberg REST Catalog specification based on apache/iceberg-rust.

If you have questions, feature requests or just want a chat, we are hanging around in Discord!

Scope and Features

The Iceberg Protocol (TIP), based on REST, has become the standard for catalogs in open Lakehouses. It natively enables multi-table commits, server-side deconflicting and much more. It is, figuratively, the TIP of the iceberg.

We started this implementation because existing Iceberg catalogs were missing customizability, support for on-premise deployments and other features that are important to us.

An overview of currently supported features is given in the Status section below. Please also check the Issues if you are missing something.

Quickstart

A Docker container is available on quay.io. We have prepared a self-contained docker-compose file to demonstrate the usage of Spark with our catalog:

git clone https://github.com/hansetag/iceberg-catalog.git
cd iceberg-catalog/examples
docker compose up

Then open your browser and head to localhost:8888.

For more information on deployment, please check the User Guide.
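
To attach an external Spark session to the catalog instead of using the bundled notebook, a configuration along the following lines should work. This is only a sketch: the Iceberg Spark runtime version, the catalog URL (assuming the default listen port 8080 and the /catalog path prefix) and the warehouse name my-warehouse are placeholders to adapt to your deployment.

# Sketch only: adjust the runtime version, URL and warehouse name to your setup.
# If OIDC is enabled, additionally pass --conf spark.sql.catalog.demo.token=<access-token>.
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2 \
  --conf spark.sql.catalog.demo=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.demo.type=rest \
  --conf spark.sql.catalog.demo.uri=http://localhost:8080/catalog \
  --conf spark.sql.catalog.demo.warehouse=my-warehouse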

Status

Supported Operations - Iceberg-Rest

| Operation | Status | Description |
|-----------|--------|-------------|
| Namespace | done | All operations implemented |
| Table | done | All operations implemented - additional integration tests in development |
| Views | done | Remove unused files and log entries |
| Metrics | open | Endpoint is available but doesn't store the metrics |

Storage Profile Support

| Storage | Status | Comment |
|---------|--------|---------|
| S3 - AWS | semi-done | vended-credentials & remote-signing, assume role missing |
| S3 - Custom | done | vended-credentials & remote-signing, tested against minio |
| Azure ADLS Gen2 | done | |
| Azure Blob | open | |
| Microsoft OneLake | open | |
| Google Cloud Storage | open | |

Details on how to configure the storage profiles can be found in the Storage Guide.

Supported Catalog Backends

| Backend | Status | Comment |
|---------|--------|---------|
| Postgres | done | |
| MongoDB | open | |

Supported Secret Stores

| Backend | Status | Comment |
|---------|--------|---------|
| Postgres | done | |
| kv2 (hcp-vault) | done | userpass auth |

Supported Event Stores

| Backend | Status | Comment |
|---------|--------|---------|
| Nats | done | |
| Kafka | open | |

Supported Operations - Management API

| Operation | Status | Description |
|-----------|--------|-------------|
| Warehouse Management | done | Create / Update / Delete a Warehouse |
| AuthZ | open | Manage access to warehouses, namespaces and tables |
| More to come! | open | |

Auth(N/Z) Handlers

| Operation | Status | Description |
|-----------|--------|-------------|
| OIDC (AuthN) | done | Secure access to the catalog via OIDC |
| Custom (AuthZ) | done | If you are willing to implement a single Rust trait, the AuthZHandler can be implemented to connect to your system |
| OpenFGA (AuthZ) | open | Internal authorization management |

Multiple Projects

The iceberg-rest server can host multiple independent warehouses, which in turn are grouped into projects. The overall structure looks like this:

<project-1-uuid>/
├─ foo-warehouse
├─ bar-warehouse
<project-2-uuid>/
├─ foo-warehouse
├─ bas-warehouse

All warehouses use isolated namespaces and can be addressed in clients by specifying the warehouse as '<project-uuid>/<warehouse-name>'. Warehouse names inside a project must be unique. We recommend using human-readable names for warehouses.

If you do not need the hierarchy level of projects, set the ICEBERG_REST__DEFAULT_PROJECT_ID environment variable to the project you want to use. For single-project deployments we recommend using the NULL UUID ("00000000-0000-0000-0000-000000000000") as the project ID. Users then just specify warehouse as <warehouse-name> when connecting.
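
As a sketch of what this looks like from a client's perspective, the warehouse is passed through the standard Iceberg REST configuration handshake. The host, project UUID and warehouse names below are placeholders:

# warehouse addressed with an explicit project UUID
curl "http://localhost:8080/catalog/v1/config?warehouse=<project-uuid>/foo-warehouse"

# with ICEBERG_REST__DEFAULT_PROJECT_ID set, the warehouse name alone is sufficient
curl "http://localhost:8080/catalog/v1/config?warehouse=foo-warehouse"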

Undrop Tables

When a table or view is dropped, it is not immediately deleted from the catalog. Instead, it is marked as dropped and a job for its cleanup is scheduled. The table, including its data if purgeRequested=True, is only deleted after the configured expiration delay has passed (see ICEBERG_DEFAULT_TABULAR_EXPIRATION_DELAY_SECONDS under Configuration). This allows recovering tables that have been dropped by accident.
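
The purge flag mentioned above is the standard purgeRequested query parameter of the Iceberg REST drop-table endpoint. A sketch of such a call, with placeholder prefix, namespace and table names (the authorization header is only needed if OIDC is configured), might look like this:

curl -X DELETE \
  -H "authorization: Bearer {your-token-here}" \
  "{your-catalog-url}/catalog/v1/{warehouse-prefix}/namespaces/my_namespace/tables/my_table?purgeRequested=true"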

Configuration

The basic setup of the catalog is configured via environment variables. As this catalog supports a multi-tenant setup, each catalog ("warehouse") also comes with its own configuration options, including its storage configuration. The documentation of the Management API for warehouses is hosted at the unprotected /swagger-ui endpoint.

The following options are global and apply to all warehouses:

General

| Variable | Example | Description |
|----------|---------|-------------|
| ICEBERG_REST__BASE_URI | https://example.com:8080 | Base URL where the catalog is externally reachable. Default: https://localhost:8080 |
| ICEBERG_REST__DEFAULT_PROJECT_ID | 00000000-0000-0000-0000-000000000000 | The default project ID to use if the user does not specify a project when connecting. We recommend setting the project ID only in single-project setups. Each project can still contain multiple warehouses. Default: not set |
| ICEBERG_REST__RESERVED_NAMESPACES | system,examples | Reserved namespaces that cannot be created via the REST interface |
| ICEBERG_REST__METRICS_PORT | 9000 | Port where the metrics endpoint is reachable. Default: 9000 |
| ICEBERG_REST__LISTEN_PORT | 8080 | Port the server listens on. Default: 8080 |
| ICEBERG_REST__SECRET_BACKEND | postgres | The secret backend to use. If kv2 is chosen, you need to provide the additional parameters documented in the KV2 section below. Default: postgres, one-of: [postgres, kv2] |
| ICEBERG_DEFAULT_TABULAR_EXPIRATION_DELAY_SECONDS | 86400 | Time after which a dropped tabular, i.e. a view or table, is deleted. Default: 86400 seconds (1 day) |
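
For illustration, a minimal set of these global options could be exported like this before starting the server; all values are examples only:

export ICEBERG_REST__BASE_URI="http://localhost:8080"
export ICEBERG_REST__LISTEN_PORT=8080
export ICEBERG_REST__METRICS_PORT=9000
export ICEBERG_REST__DEFAULT_PROJECT_ID="00000000-0000-0000-0000-000000000000"
export ICEBERG_REST__RESERVED_NAMESPACES="system,examples"
export ICEBERG_REST__SECRET_BACKEND=postgres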

Task queues

Currently, the catalog uses two task queues: one to eventually delete soft-deleted tabulars and one to purge tabulars that have been dropped with the purgeRequested=True query parameter. The task queues are configured as follows:

| Variable | Example | Description |
|----------|---------|-------------|
| ICEBERG_REST__QUEUE_CONFIG__MAX_RETRIES | 5 | Number of retries before a task is considered failed. Default: 5 |
| ICEBERG_REST__QUEUE_CONFIG__MAX_AGE | 3600 | Amount of seconds before a task is considered stale and could be picked up by another worker. Default: 3600 |
| ICEBERG_REST__QUEUE_CONFIG__POLL_INTERVAL | 10 | Amount of seconds between polling for new tasks. Default: 10 |

The queues are currently implemented using the sqlx Postgres backend. If you want to use a different backend, you need to implement the TaskQueue trait.

Postgres

The following configuration parameters apply if Postgres is used as a backend. You may either provide connection strings or use the PG_* environment variables; connection strings take precedence:

| Variable | Example | Description |
|----------|---------|-------------|
| ICEBERG_REST__PG_DATABASE_URL_READ | postgres://postgres:password@localhost:5432/iceberg | Postgres database connection string used for reading |
| ICEBERG_REST__PG_DATABASE_URL_WRITE | postgres://postgres:password@localhost:5432/iceberg | Postgres database connection string used for writing |
| ICEBERG_REST__PG_ENCRYPTION_KEY | <This is unsafe, please set a proper key> | If ICEBERG_REST__SECRET_BACKEND=postgres, this key is used to encrypt secrets. It is required to change this for production deployments. |
| ICEBERG_REST__PG_READ_POOL_CONNECTIONS | 10 | Number of connections in the read pool |
| ICEBERG_REST__PG_WRITE_POOL_CONNECTIONS | 5 | Number of connections in the write pool |
| ICEBERG_REST__PG_HOST_R | localhost | Hostname for read operations |
| ICEBERG_REST__PG_HOST_W | localhost | Hostname for write operations |
| ICEBERG_REST__PG_PORT | 5432 | Port number |
| ICEBERG_REST__PG_USER | postgres | Username for authentication |
| ICEBERG_REST__PG_PASSWORD | password | Password for authentication |
| ICEBERG_REST__PG_DATABASE | iceberg | Database name |
| ICEBERG_REST__PG_SSL_MODE | require | SSL mode (disable, allow, prefer, require) |
| ICEBERG_REST__PG_SSL_ROOT_CERT | /path/to/root/cert | Path to SSL root certificate |
| ICEBERG_REST__PG_ENABLE_STATEMENT_LOGGING | true | Enable SQL statement logging |
| ICEBERG_REST__PG_TEST_BEFORE_ACQUIRE | true | Test connections before acquiring from the pool |
| ICEBERG_REST__PG_CONNECTION_MAX_LIFETIME | 1800 | Maximum lifetime of connections in seconds |
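
As a sketch, either style can be used to point the catalog at Postgres; remember that the connection strings take precedence if both are set. The credentials below are placeholders:

# Style 1: full connection strings
export ICEBERG_REST__PG_DATABASE_URL_READ="postgres://postgres:password@localhost:5432/iceberg"
export ICEBERG_REST__PG_DATABASE_URL_WRITE="postgres://postgres:password@localhost:5432/iceberg"

# Style 2: individual PG_* variables
export ICEBERG_REST__PG_HOST_R=localhost
export ICEBERG_REST__PG_HOST_W=localhost
export ICEBERG_REST__PG_PORT=5432
export ICEBERG_REST__PG_USER=postgres
export ICEBERG_REST__PG_PASSWORD=password
export ICEBERG_REST__PG_DATABASE=iceberg

# Required when ICEBERG_REST__SECRET_BACKEND=postgres
export ICEBERG_REST__PG_ENCRYPTION_KEY="<set-a-proper-key-for-production>"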

KV2 (HCP Vault)

Configuration parameters if a KV2 compatible storage is used as a backend. Currently, we only support the userpass authentication method. You may provide the envs as single values like ICEBERG_REST__KV2__URL=http://vault.local etc. or as a compound value like: ICEBERG_REST__KV2='{url="http://localhost:1234", user="test", password="test", secret_mount="secret"}'

| Variable | Example | Description |
|----------|---------|-------------|
| ICEBERG_REST__KV2__URL | https://vault.local | URL of the KV2 backend |
| ICEBERG_REST__KV2__USER | admin | Username to authenticate against the KV2 backend |
| ICEBERG_REST__KV2__PASSWORD | password | Password to authenticate against the KV2 backend |
| ICEBERG_REST__KV2__SECRET_MOUNT | kv/data/iceberg | Path to the secret mount in the KV2 backend |

Nats

If you want the server to publish events to a NATS server, set the following environment variables:

| Variable | Example | Description |
|----------|---------|-------------|
| ICEBERG_REST__NATS_ADDRESS | nats://localhost:4222 | The URL of the NATS server to connect to |
| ICEBERG_REST__NATS_TOPIC | iceberg | The subject to publish events to |
| ICEBERG_REST__NATS_USER | test-user | User to authenticate against NATS, needs ICEBERG_REST__NATS_PASSWORD |
| ICEBERG_REST__NATS_PASSWORD | test-password | Password to authenticate against NATS, needs ICEBERG_REST__NATS_USER |
| ICEBERG_REST__NATS_CREDS_FILE | /path/to/file.creds | Path to a file containing NATS credentials |
| ICEBERG_REST__NATS_TOKEN | xyz | NATS token to authenticate against the server |

OpenID Connect

If you want to limit access to the API, set ICEBERG_REST__OPENID_PROVIDER_URI to the URI of your OpenID Connect provider. The catalog will then verify access tokens against this provider. The provider must serve the .well-known/openid-configuration endpoint under ${ICEBERG_REST__OPENID_PROVIDER_URI}/.well-known/openid-configuration, and the openid-configuration needs to have the jwks_uri and issuer defined.

If ICEBERG_REST__OPENID_PROVIDER_URI is set, every request needs to have an authorization header, e.g.:

curl {your-catalog-url}/catalog/v1/transactions/commit -X POST -H "authorization: Bearer {your-token-here}" -H "content-type: application/json" -d ...

| Variable | Example | Description |
|----------|---------|-------------|
| ICEBERG_REST__OPENID_PROVIDER_URI | https://keycloak.local/realms/{your-realm} | OpenID provider URL. With Keycloak this is the URL pointing to your realm; for an Azure App Registration it would be something like https://login.microsoftonline.com/{your-tenant-id-here}/v2.0/. If this variable is not set, endpoints are not secured. |
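
For a quick smoke test against a Keycloak realm, an access token can typically be obtained via the client-credentials flow and then passed as the Bearer token shown above. Realm, client id and secret are placeholders:

# the "access_token" field of the JSON response is the Bearer token
curl -X POST "https://keycloak.local/realms/{your-realm}/protocol/openid-connect/token" \
  -d "grant_type=client_credentials" \
  -d "client_id={your-client-id}" \
  -d "client_secret={your-client-secret}"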

License

Licensed under the Apache License, Version 2.0