dag-stack
Demo data pipeline with dbt, Airflow, Great Expectations.
See another possible architecture for this at https://github.com/astronomer/airflow-dbt-demo
☕ Buy me a coffee ☕
If you enjoy this workshop and want to say thanks, you can buy me a coffee here: https://www.buymeacoffee.com/sambail Thank you 😄
How to run
This repo contains a runnable demo using Astronomer (containerized Airflow), which is a convenient option to run everything in a Docker container.
- Install the Astronomer CLI (containerized Airflow) by following its installation instructions
  - Note: If you only want to run Airflow locally for development, you do not need to sign up for an Astronomer Cloud account. Simply follow the instructions to install the Astronomer CLI.
- Run `astro dev start` to start up the Airflow Docker containers
  - I had to follow the Docker config instructions to handle a "buildkit not supported" error
  - I also had to reduce the number of `AIRFLOW__WEBSERVER__WORKERS` in the Dockerfile as well as allocate more resources to Docker in order for the webserver to run on my very old very slow laptop :) (2013 MacBook Air ftw)
  - Thanks to this post for the `agate` version pin to work with dbt
- This will start up the Airflow scheduler, webserver, and a Postgres database
- Once the webserver is up (takes about 30 seconds), you can access the Airflow web UI at `localhost:8080`
- You can run `astro dev stop` to stop the container again
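If you script the startup, the "wait about 30 seconds" step above can be replaced by polling the web UI. The helper below is a minimal sketch using only the Python standard library; the function name `wait_for_webserver` and its default timeout and interval are assumptions for illustration, not part of this repo or of the Astronomer CLI.

```python
import time
import urllib.error
import urllib.request


def wait_for_webserver(url="http://localhost:8080", timeout=60.0, interval=2.0):
    """Poll `url` until it answers over HTTP or `timeout` seconds pass.

    Returns True once any HTTP response arrives, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered, just not with a 2xx status, so it is up.
            return True
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: not up yet, so retry.
            time.sleep(interval)
    return False
```

Calling `wait_for_webserver()` right after `astro dev start` would return `True` as soon as the UI responds, or `False` if nothing answers within the timeout.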
You can also run the DAG in this repo with a standard Airflow installation if you want. You'll have to install the relevant dependencies (Airflow, dbt, Great Expectations, the respective operators, etc) and probably handle some more configurations to get it to work.
Development
In order to develop the dbt DAG and Great Expectations locally instead of in the containers (for faster dev loops), I created a new virtual environment and installed the relevant packages with `pip install -r requirements.txt`
dbt setup
- Ran `dbt init dbt` to create the dbt directory in this repo
- I copied `~/.dbt/profiles.yml` into the root of this project and added the Astronomer postgres creds to have a database available -- you wouldn't use this database in production or keep the file in the repo, this is just a shortcut for this demo!!
- The `profiles.yml` target setup allows me to run the dbt pipeline both locally and within the container:
  - Container:
    - connect to shell within the scheduler container
    - run `cd dbt`
    - run `dbt run --profiles-dir /usr/local/airflow --target astro_dev`
  - Local:
    - run `cd dbt`
    - run `dbt run --profiles-dir /Users/sam/code/dag-stack --target local_dev`
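The two invocations above differ only in the profiles directory and the target name, so they can be generated from a small lookup. This is a rough sketch, not code from this repo: the `PROFILES_DIRS` mapping and the `dbt_run_command` and `run_dbt` helpers are hypothetical names, while the paths and target names are the ones listed above.

```python
import subprocess

# Paths and target names come from the commands above;
# the dict layout itself is only illustrative.
PROFILES_DIRS = {
    "astro_dev": "/usr/local/airflow",         # inside the scheduler container
    "local_dev": "/Users/sam/code/dag-stack",  # on the host machine
}


def dbt_run_command(target):
    """Build the `dbt run` argument list for one of the two targets."""
    if target not in PROFILES_DIRS:
        raise ValueError(f"unknown dbt target: {target!r}")
    return ["dbt", "run", "--profiles-dir", PROFILES_DIRS[target], "--target", target]


def run_dbt(target, project_dir="dbt"):
    """Run dbt from the project directory, like `cd dbt` followed by `dbt run ...`."""
    return subprocess.run(dbt_run_command(target), cwd=project_dir, check=True)
```

For example, `run_dbt("local_dev")` would execute the local variant from the `dbt` directory, assuming dbt is installed and on the `PATH`.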
Great Expectations setup
- Ran `great_expectations init` to create the great_expectations directory in this repo
- Created Datasources for the `data` directory and the Astronomer postgres database using the `great_expectations datasource new` command
  - Note that I have two Datasources for the different host names, similar to the two dbt targets
  - I copied the credentials from `uncommitted/config_variables.yml` into the `datasources` section in `great_expectations.yml` for the purpose of this demo, since the `uncommitted` directory is git-ignored
- Created new Expectation Suites using `great_expectations suite scaffold` against the `data_dir` and `postgres_local` Datasources and manually tweaked the scaffold output a little using `suite edit`
Airflow DAG setup
- I'm using the custom dbt and Great Expectations Airflow operators, but this could also be done with Python and bash operators
- Note that the source data and loaded data validation both use the same Expectation Suite, which is a neat feature of Great Expectations -- a test suite can be run against any data asset to assert the same properties
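To illustrate the idea of reusing one suite for both the source data and the loaded data, here is a toy sketch in plain Python. It deliberately does not use the Great Expectations API: the check functions only borrow Great Expectations' naming style, and the `SUITE` structure, the `validate` helper, the `id` column, and the sample rows are all made up for this example.

```python
def expect_column_values_to_not_be_null(rows, column):
    """Pass only if every row has a non-None value in `column`."""
    return all(row.get(column) is not None for row in rows)


def expect_table_row_count_to_be_between(rows, min_value, max_value):
    """Pass only if the number of rows lies within the inclusive bounds."""
    return min_value <= len(rows) <= max_value


# A "suite" here is just a list of (check, keyword arguments) pairs.
SUITE = [
    (expect_column_values_to_not_be_null, {"column": "id"}),
    (expect_table_row_count_to_be_between, {"min_value": 1, "max_value": 1000}),
]


def validate(rows, suite):
    """Run every check in `suite` against `rows`; return {check name: passed}."""
    return {check.__name__: check(rows, **kwargs) for check, kwargs in suite}


# Stand-ins for rows read from a source file and rows read back from the database.
source_rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
loaded_rows = [{"id": 1, "name": "a"}, {"id": None, "name": "b"}]

source_result = validate(source_rows, SUITE)
loaded_result = validate(loaded_rows, SUITE)
```

The same `SUITE` object is applied to both inputs, so a null `id` introduced during loading shows up as a failed check in `loaded_result` without writing a second set of tests.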
Serving dbt and Great Expectations docs
- The DAG contains example tasks that copy the docs for each framework into the `include` folder in the container, which is mapped to the host machine, so you can inspect them manually
- In production (and when deploying the container to Astronomer Cloud), both docs could (should) be copied to and hosted on an external service, e.g. on Netlify or in an S3 bucket
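A docs-copy task like the ones described above could be as simple as the following sketch. It is not the code used in this repo's DAG: `copy_docs` is a hypothetical helper, and any source paths you pass in (for example dbt's `target` directory or the Great Expectations `data_docs` site) are assumptions that depend on your project layout.

```python
import shutil
from pathlib import Path


def copy_docs(source_dir, include_dir, name):
    """Copy a generated docs directory to `<include_dir>/<name>`, replacing any old copy."""
    source = Path(source_dir)
    if not source.is_dir():
        raise FileNotFoundError(f"docs directory not found: {source}")
    destination = Path(include_dir) / name
    if destination.exists():
        # Remove the stale copy so deleted pages do not linger.
        shutil.rmtree(destination)
    shutil.copytree(source, destination)
    return destination
```

Wrapped in a Python task, a call such as `copy_docs("dbt/target", "include", "dbt_docs")` would make the generated site visible on the host through the mapped `include` folder.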
Additional resources
This repo is based on several existing resources:
- Great Expectations Airflow + dbt tutorial (which I had originally built)
- The example DAGs in the Great Expectations Airflow Provider (which I also originally built haha)
- Building a Scalable Analytics Architecture with Airflow and dbt