Tezos ETL Airflow
Airflow DAGs for exporting and loading the Tezos blockchain data to Google BigQuery. Data is available for you to query right away in Google BigQuery.
Prerequisites
- Linux/macOS terminal
- git
- gcloud
Setting Up
- Create a GCS bucket to hold export files:
  gcloud config set project <your_gcp_project>
  PROJECT=$(gcloud config get-value project 2> /dev/null)
  ENVIRONMENT_INDEX=0
  BUCKET=${PROJECT}-${ENVIRONMENT_INDEX}
  gsutil mb gs://${BUCKET}/
- Create a Google Cloud Composer environment:
  ENVIRONMENT_NAME=${PROJECT}-${ENVIRONMENT_INDEX} && echo "Environment name is ${ENVIRONMENT_NAME}"
  gcloud composer environments create ${ENVIRONMENT_NAME} --location=us-central1 --zone=us-central1-a \
      --disk-size=30GB --machine-type=custom-1-4096 --node-count=3 --python-version=3 --image-version=composer-1.8.3-airflow-1.10.3 \
      --network=default --subnetwork=default
  gcloud composer environments update $ENVIRONMENT_NAME --location=us-central1 --update-pypi-package=tezos-etl==1.2.1
  Note that if the Composer API is not enabled, the create command above will prompt you to enable it.
- Follow the steps in Configuring Airflow Variables to configure Airflow variables.
- Follow the steps in Deploying Airflow DAGs to deploy Airflow DAGs to the Cloud Composer environment.
- Follow the steps here to configure email notifications.
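The setup commands derive the bucket and environment names from the project ID and an environment index. As a minimal sketch of that naming convention (the function name and the project ID my-gcp-project are hypothetical, used only for illustration), the same values can be computed in Python:

```python
def derive_resource_names(project: str, environment_index: int = 0) -> dict:
    """Mirror the shell naming convention from the setup commands."""
    suffix = f"{project}-{environment_index}"
    return {
        # BUCKET=${PROJECT}-${ENVIRONMENT_INDEX}
        "bucket": suffix,
        # Argument passed to `gsutil mb`
        "bucket_url": f"gs://{suffix}/",
        # ENVIRONMENT_NAME=${PROJECT}-${ENVIRONMENT_INDEX}
        "environment_name": suffix,
    }


# "my-gcp-project" is a placeholder project ID, not a real one.
names = derive_resource_names("my-gcp-project")
print(names["bucket_url"])
print(names["environment_name"])
```

Changing environment_index yields a separate bucket and environment, which is useful when creating a dedicated environment for integration testing.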
Configuring Airflow Variables
- For a new environment clone Tezos ETL Airflow:
  git clone https://github.com/blockchain-etl/tezos-etl-airflow && cd tezos-etl-airflow
  For an existing environment use the airflow_variables.json file from the Cloud Source Repository for your environment.
- Edit airflow_variables.json and update the configuration options with your values. The variables are described in the table below. For the mainnet_output_bucket variable specify the bucket created in step 1 of Setting Up. You can get its name by running: echo $BUCKET
- Open the Airflow UI. You can get its URL from the airflowUri configuration option in the output of:
  gcloud composer environments describe ${ENVIRONMENT_NAME} --location us-central1
- Navigate to Admin > Variables in the Airflow UI, click Choose File, select airflow_variables.json, and click Import Variables.
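Editing airflow_variables.json by hand works, but the bucket update can also be scripted. The sketch below assumes the file is a flat JSON object mapping variable names to values; the function name and the sample content are hypothetical and are not taken from the repository.

```python
import json


def set_output_bucket(variables_json: str, bucket: str, chain: str = "mainnet") -> str:
    """Return the variables JSON with <chain>_output_bucket set to `bucket`.

    Assumes a flat {"name": "value"} JSON object.
    """
    variables = json.loads(variables_json)
    variables[f"{chain}_output_bucket"] = bucket
    return json.dumps(variables, indent=4, sort_keys=True)


# Invented sample content, standing in for the real airflow_variables.json.
original = '{"mainnet_output_bucket": "", "notification_emails": ""}'
updated = set_output_bucket(original, "my-gcp-project-0")
print(updated)
```

To apply it to a real file, read the file into a string, pass it through set_output_bucket, and write the result back before importing it in the Airflow UI.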
Airflow Variables
Note that the variable names must be prefixed with {chain}_, e.g. mainnet_output_bucket.
Variable | Description |
---|---|
output_bucket | GCS bucket where exported files with blockchain data will be stored |
export_start_date | export start date, default: 2018-06-30 |
export_end_date | export end date, used for integration testing, default: None |
export_schedule_interval | export cron schedule, default: 0 1 * * * |
provider_uris | comma-separated list of provider URIs for tezosetl export command |
notification_emails | comma-separated list of emails where notifications on DAG failures, retries and successes will be delivered. This variable must not be prefixed with {chain}_ |
export_max_active_runs | max active DAG runs for export, default: 3 |
export_max_workers | max workers for tezosetl export command, default: 30 |
destination_dataset_project_id | GCP project ID where the destination BigQuery dataset is located |
load_schedule_interval | load cron schedule, default: 0 2 * * * |
load_end_date | load end date, used for integration testing, default: None |
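The prefix rule for the table above can be sketched as follows: every variable gets the {chain}_ prefix except notification_emails. The helper name and the sample values (bucket name, email address) are hypothetical.

```python
# Variables that must NOT receive the {chain}_ prefix, per the table above.
UNPREFIXED = {"notification_emails"}


def prefix_variables(chain: str, settings: dict) -> dict:
    """Build Airflow variable names from unprefixed setting names."""
    result = {}
    for name, value in settings.items():
        key = name if name in UNPREFIXED else f"{chain}_{name}"
        result[key] = value
    return result


# Illustrative values only.
variables = prefix_variables("mainnet", {
    "output_bucket": "my-gcp-project-0",
    "export_start_date": "2018-06-30",
    "notification_emails": "ops@example.com",
})
for key in sorted(variables):
    print(key, "=", variables[key])
```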
Creating a Cloud Source Repository for Airflow variables
It is recommended to keep airflow_variables.json in a version control system such as git. The commands below create a Cloud Source Repository to hold airflow_variables.json:
REPO_NAME=${PROJECT}-airflow-config-${ENVIRONMENT_INDEX} && echo "Repo name ${REPO_NAME}"
gcloud source repos create ${REPO_NAME}
gcloud source repos clone ${REPO_NAME} && cd ${REPO_NAME}
# Put airflow_variables.json in the root of the repo
git add airflow_variables.json && git commit -m "Initial commit"
git push
# TODO: Set up a Cloud Build trigger to deploy variables to the Composer environment when they are updated. For now this has to be done manually.
Deploying Airflow DAGs
- Get the value of the dagGcsPrefix configuration option from the output of:
  gcloud composer environments describe ${ENVIRONMENT_NAME} --location us-central1
- Upload the DAGs to the bucket. Make sure to replace <dag_gcs_prefix> with the value from the previous step:
  ./upload_dags.sh <dag_gcs_prefix>
- To understand more about how the Airflow DAGs are structured read this article.
- Note that it will take one or more days for mainnet_export_dag to finish exporting the historical data.
- To set up automated deployment of DAGs refer to Cloud Build Configuration.
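Reading dagGcsPrefix (and airflowUri) out of the describe output by eye is error-prone, so a small parser can help. This sketch assumes the describe command is run with gcloud's global --format=json flag and that both fields sit under the top-level config key; the sample document is invented for illustration and is not real command output.

```python
import json


def composer_endpoints(describe_json: str) -> dict:
    """Extract the Airflow UI URL and the DAG folder prefix.

    Assumes JSON produced by:
      gcloud composer environments describe ... --format=json
    """
    config = json.loads(describe_json)["config"]
    return {
        "airflow_uri": config["airflowUri"],
        "dag_gcs_prefix": config["dagGcsPrefix"],
    }


# Invented sample, trimmed to the two fields used here.
sample = """
{
  "config": {
    "airflowUri": "https://example-tp.appspot.com",
    "dagGcsPrefix": "gs://example-bucket/dags"
  }
}
"""
endpoints = composer_endpoints(sample)
print(endpoints["dag_gcs_prefix"])
```

The dag_gcs_prefix value is what gets passed to ./upload_dags.sh in place of <dag_gcs_prefix>.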
Integration Testing
It is recommended to use a dedicated Cloud Composer environment for integration testing with Airflow.
To run integration tests:
- Create a new environment following the steps in the Setting Up section.
- On the Configuring Airflow Variables step specify the following additional configuration variables:
  - export_end_date: 2018-06-30
  - load_end_date: 2018-06-30
- This will run the DAGs only for the first day. At the end of the load DAG the verification tasks will ensure the correctness of the result.
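A minimal sketch of why these settings limit the run to one day: the end dates equal the default export_start_date (2018-06-30), so the range covers a single day. The helper names are hypothetical, the variable names are assumed to take the {chain}_ prefix described under Airflow Variables, and the end date is assumed to be inclusive.

```python
from datetime import date


def integration_test_overrides(chain: str = "mainnet", day: str = "2018-06-30") -> dict:
    """Extra variables that stop export and load after `day`."""
    return {
        f"{chain}_export_end_date": day,
        f"{chain}_load_end_date": day,
    }


def days_in_range(start: str, end: str) -> int:
    """Number of days from start to end, treating the end date as inclusive."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days + 1


overrides = integration_test_overrides()
# Default export_start_date from the variables table is 2018-06-30.
print(days_in_range("2018-06-30", overrides["mainnet_export_end_date"]))
```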
Troubleshooting
To troubleshoot issues with Airflow tasks use the View Log button in the Airflow console for individual tasks. Read Airflow UI overview and Troubleshooting DAGs for more info.
In rare cases you may need to inspect the GKE cluster logs in the GKE console.
Speed up the initial export
To speed up the initial data export it is recommended to use the n1-standard-2 instance type for the Cloud Composer cluster.
After the initial export is finished a new cluster with custom-1-4096 should be created with the export_start_date Airflow variable set to the previous date.