Ethereum 2.0 ETL Airflow

Airflow DAGs for exporting and loading Ethereum 2.0 blockchain data into Google BigQuery. The data is available for you to query in BigQuery right away.

Prerequisites
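
The commands in the sections below assume a Google Cloud project with billing enabled, and the Google Cloud SDK (gcloud, gsutil) and git installed and authenticated in your terminal.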

Setting Up

  1. Create a GCS bucket to hold export files:

    gcloud config set project <your_gcp_project>
    PROJECT=$(gcloud config get-value project 2> /dev/null)
    ENVIRONMENT_INDEX=0
    BUCKET=${PROJECT}-${ENVIRONMENT_INDEX}
    gsutil mb gs://${BUCKET}/
    
  2. Create a Google Cloud Composer environment:

    ENVIRONMENT_NAME=${PROJECT}-${ENVIRONMENT_INDEX} && echo "Environment name is ${ENVIRONMENT_NAME}"
    gcloud composer environments create ${ENVIRONMENT_NAME} --location=us-central1 --zone=us-central1-a \
        --disk-size=30GB --machine-type=custom-1-4096 --node-count=3 --python-version=3 --image-version=composer-1.8.3-airflow-1.10.3 \
        --network=default --subnetwork=default
    
    gcloud composer environments update $ENVIRONMENT_NAME --location=us-central1 --update-pypi-package=ethereum2-etl==0.0.4
    

    Note that if the Cloud Composer API is not enabled, the command above will prompt you to enable it.

  3. Follow the steps in Configuring Airflow Variables to configure Airflow variables.

  4. Follow the steps in Deploying Airflow DAGs to deploy Airflow DAGs to Cloud Composer Environment.

  5. Follow the steps here to configure email notifications.

Configuring Airflow Variables

Airflow Variables

Note that the variable names must be prefixed with {chain}_, e.g. mainnet_output_bucket.

Variable                          Description
output_bucket                     GCS bucket where exported files with blockchain data will be stored
export_start_date                 export start date, default: 2018-06-30
export_end_date                   export end date, used for integration testing, default: None
export_schedule_interval          export cron schedule, default: 0 1 * * *
provider_uris                     comma-separated list of provider URIs for ethereum2etl commands
notification_emails               comma-separated list of emails where notifications on DAG failures, retries and successes will be delivered; this variable must not be prefixed with {chain}_
export_max_active_runs            max active DAG runs for export, default: 3
export_max_workers                max workers for the ethereum2etl command, default: 5
destination_dataset_project_id    GCP project id where the destination BigQuery dataset is located
load_schedule_interval            load cron schedule, default: 0 2 * * *
load_end_date                     load end date, used for integration testing, default: None
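
As an illustration only, a minimal airflow_variables.json for the mainnet chain might look like the following; the bucket, project and provider values are placeholders, and notification_emails is deliberately left unprefixed:

    {
        "mainnet_output_bucket": "<your_gcp_project>-0",
        "mainnet_export_start_date": "2020-12-01",
        "mainnet_provider_uris": "<your_eth2_node_provider_uri>",
        "mainnet_destination_dataset_project_id": "<your_gcp_project>",
        "notification_emails": "you@example.com"
    }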

Creating a Cloud Source Repository for Airflow variables

It is recommended to keep airflow_variables.json in a version control system, e.g. Git. Below are the commands for creating a Cloud Source Repository to hold airflow_variables.json:

REPO_NAME=${PROJECT}-airflow-config-${ENVIRONMENT_INDEX} && echo "Repo name ${REPO_NAME}"
gcloud source repos create ${REPO_NAME}
gcloud source repos clone ${REPO_NAME} && cd ${REPO_NAME}

# Put airflow_variables.json in the root of the repo

git add airflow_variables.json && git commit -m "Initial commit"
git push

# TODO: Set up a Cloud Build trigger to deploy the variables to the Composer environment when they change. For now this has to be done manually.

Deploying Airflow DAGs
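
The commands below are a sketch and assume the DAG definitions live in a local dags/ directory with airflow_variables.json next to it; adjust the paths to your checkout:

    # Copy the DAG definitions into the Composer environment's DAGs bucket
    gcloud composer environments storage dags import \
        --environment=${ENVIRONMENT_NAME} --location=us-central1 --source=dags/

    # Upload airflow_variables.json to the environment's data folder and import it
    gcloud composer environments storage data import \
        --environment=${ENVIRONMENT_NAME} --location=us-central1 --source=airflow_variables.json
    gcloud composer environments run ${ENVIRONMENT_NAME} --location=us-central1 \
        variables -- --import /home/airflow/gcs/data/airflow_variables.json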

Integration Testing

It is recommended to use a dedicated Cloud Composer environment for integration testing with Airflow.

To run integration tests:
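
One approach, using the variables described above (the dates here are examples only): set {chain}_export_start_date together with {chain}_export_end_date and {chain}_load_end_date to a short, fixed date range, let the export and load DAGs backfill that range, and compare the loaded BigQuery tables with expected results. For example:

    # Example only: restrict the mainnet DAGs to a two-day range for a test run
    gcloud composer environments run ${ENVIRONMENT_NAME} --location=us-central1 \
        variables -- --set mainnet_export_start_date 2021-01-01
    gcloud composer environments run ${ENVIRONMENT_NAME} --location=us-central1 \
        variables -- --set mainnet_export_end_date 2021-01-02
    gcloud composer environments run ${ENVIRONMENT_NAME} --location=us-central1 \
        variables -- --set mainnet_load_end_date 2021-01-02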

Troubleshooting

To troubleshoot issues with Airflow tasks, use the View Log button in the Airflow console for individual tasks. Read Airflow UI overview and Troubleshooting DAGs for more info.

In rare cases you may need to inspect the GKE cluster logs in the GKE console.

Speed up the initial export

To speed up the initial data export, it is recommended to use the n1-standard-2 machine type for the Cloud Composer cluster. After the initial export is finished, create a new cluster with the custom-1-4096 machine type and set the export_start_date Airflow variable to the previous date.
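
For example, the create command from the Setting Up section could be run with the larger machine type for the backfill, all other flags unchanged:

    # Same environment as in Setting Up, but sized for the initial backfill
    gcloud composer environments create ${ENVIRONMENT_NAME} --location=us-central1 --zone=us-central1-a \
        --disk-size=30GB --machine-type=n1-standard-2 --node-count=3 --python-version=3 --image-version=composer-1.8.3-airflow-1.10.3 \
        --network=default --subnetwork=default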