IoTeX ETL

Overview

IoTeX ETL lets you set up an ETL pipeline on Google Cloud Platform for ingesting IoTeX blockchain data into BigQuery and Pub/Sub. It also comes with CLI tools for exporting IoTeX data into newline-delimited JSON files partitioned by day.
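
For ad-hoc exports you can run the CLI directly. The sketch below is illustrative only: the exact subcommand, flags, and gRPC endpoint are assumptions and may differ between versions, so check iotexetl --help for the actual interface.

    # Sketch: export a range of blocks to newline-delimited JSON.
    # Subcommand, flags, and endpoint are assumptions; verify with `iotexetl --help`.
    iotexetl export_blocks \
        --start-block 5000000 \
        --end-block 5010000 \
        --provider-uri grpcs://api.iotex.one:443 \
        --output blocks.json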

Data is available for you to query right away in Google BigQuery.
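
For example, assuming the public dataset is published as public-data-finance.crypto_iotex (the dataset path is an assumption; substitute your own project if you load the data yourself), the latest ingested block can be fetched with the bq CLI:

    # Fetch the latest ingested block (dataset path is an assumption).
    bq query --use_legacy_sql=false \
        'SELECT height, timestamp FROM `public-data-finance.crypto_iotex.blocks` ORDER BY height DESC LIMIT 1'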

Architecture

(Architecture diagram: iotex_etl_architecture.svg)

Google Slides version

  1. The nodes are run in a Kubernetes cluster. Refer to IoTeX Node in Kubernetes for deployment instructions.

  2. Airflow DAGs export and load IoTeX data to BigQuery daily. Refer to IoTeX ETL Airflow for deployment instructions.

  3. IoTeX data is polled periodically from the nodes and pushed to Google Pub/Sub (see the Pub/Sub sketch after this list). Refer to IoTeX ETL Streaming for deployment instructions.

  4. IoTeX data is pulled from Pub/Sub, transformed and streamed to BigQuery. Refer to IoTeX ETL Dataflow for deployment instructions.
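
To sanity-check steps 3 and 4, you can pull a few messages straight from a Pub/Sub subscription with gcloud. The topic and subscription names below are assumptions; substitute the ones used by your Streamer deployment.

    # Attach a throwaway subscription and pull a few messages (names are assumptions).
    gcloud pubsub subscriptions create iotex-blocks-debug --topic=crypto_iotex.blocks
    gcloud pubsub subscriptions pull iotex-blocks-debug --auto-ack --limit=5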

Setting Up

  1. Follow the instructions in IoTeX Node in Kubernetes to deploy an IoTeX node in GKE. Wait until it's fully synced. Make note of the Load Balancer IP from the node deployment; it will be used in the Airflow and Streamer components below (see the kubectl sketch after this list).

  2. Follow the instructions in IoTeX ETL Airflow to deploy a Cloud Composer cluster for exporting and loading historical IoTeX data. It may take several hours for the export DAG to catch up. During this time, the "load" and "verify_streaming" DAGs will fail.

  3. Follow the instructions in IoTeX ETL Streaming to deploy the Streamer component. For the value in last_synced_block.txt, specify the last block number of the previous day. You can query it in BigQuery: SELECT height FROM crypto_iotex.blocks ORDER BY height DESC LIMIT 1 (see the bq sketch after this list).

  4. Follow the instructions in IoTeX ETL Dataflow to deploy the Dataflow component. Monitor the "verify_streaming" DAG in the Airflow console; once the Dataflow job catches up to the latest block, the DAG will succeed (see the gcloud dataflow sketch after this list).
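
For step 1, the Load Balancer IP can be read off the node's Kubernetes Service once GKE has provisioned it. The namespace and service name below are assumptions; adjust them to your deployment.

    # Show the node's Service and its external IP (namespace/name are assumptions).
    kubectl get services -n iotex
    # Or extract the IP directly once it is provisioned:
    kubectl get service iotex-node -n iotex \
        -o jsonpath='{.status.loadBalancer.ingress[0].ip}'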
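
For step 3, the value for last_synced_block.txt can be produced with the bq CLI, using the query from that step (qualify the dataset with your project ID if needed):

    # Write the latest loaded block height to last_synced_block.txt.
    bq query --use_legacy_sql=false --format=csv \
        'SELECT height FROM crypto_iotex.blocks ORDER BY height DESC LIMIT 1' \
        | tail -n 1 > last_synced_block.txt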
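
For step 4, besides the Airflow console you can watch the Dataflow job from the CLI (the region is an assumption; use the one the job was launched in):

    # List active Dataflow jobs to check the streaming job is running and catching up.
    gcloud dataflow jobs list --status=active --region=us-central1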