dbt-beyond-the-basics

A repository demonstrating advanced use cases of dbt in the following areas: continuous integration, continuous deployment, dev containers, and running dbt from Python.

See something incorrect? Open an issue!

Want to see something else included? Open an issue 😉!

Continuous Integration

Continuous Integration (CI) is the process of codifying standards. These range from the formatting of file contents to validating the correctness of generated data in a data warehouse.

Pre-commit

Pre-commit provides a standardised process for running CI checks before committing to your local branch. This has several benefits: it gives the developer a quick feedback loop on their work and ensures that changes which do not align with standards are automatically identified before being merged. Pre-commit operates via hooks, all of which are specified in a .pre-commit-config.yaml file. There are several hooks that are relevant to a dbt project.

dbt Artifacts and Pytest

dbt produces four artifacts in the form of JSON files: manifest.json, catalog.json, run_results.json and sources.json.

All artifacts are saved in the ./target directory by default.

These JSON files provide a valuable resource when it comes to understanding our dbt project and codifying standards. To run tests on these files we use pytest, a Python-based testing framework.

The most valuable artifacts for this are catalog.json and manifest.json. Example tests include checking that every model is documented and that every model has at least one test defined; a simple sketch follows.
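
A minimal illustration of such a check against manifest.json (the marker name and path are assumptions; the real tests in ./tests/pytest may assert different standards):

# Illustrative test on manifest.json; not one of the repo's actual tests.
import json
from pathlib import Path
import pytest
@pytest.mark.manifest_json  # hypothetical marker, analogous to the catalog_json marker used below
def test_every_model_has_a_description():
    manifest = json.loads(Path("target/manifest.json").read_text())
    models = [n for n in manifest["nodes"].values() if n["resource_type"] == "model"]
    undocumented = sorted(m["name"] for m in models if not m.get("description"))
    assert not undocumented, f"Models missing a description: {undocumented}"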

These tests can (and should) be run in the CI pipeline:

# ./.github/workflows/ci_pipeline.yml
- run: pytest ./tests/pytest -m no_deps

They can also be run as a pre-commit hook:

# .pre-commit-config.yaml
- repo: local
  hooks:
    - id: pytest-catalog-json
      name: pytest-catalog-json
      entry: pytest ./tests/pytest -m catalog_json
      language: system
      pass_filenames: false
      always_run: true

Coverage reports

Some of the functionality discussed above in dbt Artifacts and Pytest can be automated using dbt-coverage, a Python package that produces coverage reports for documentation and, separately, for tests. All pull requests in this repo receive a comment with these stats, allowing PR reviewers to quickly assess whether any newly added models lack acceptable documentation or test coverage.
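
For intuition, a model-level approximation of the test-coverage statistic can be computed directly from manifest.json (a sketch only; dbt-coverage itself works at column level and also reads catalog.json):

# Rough, model-level illustration of the kind of statistic dbt-coverage reports.
import json
from pathlib import Path
nodes = json.loads(Path("target/manifest.json").read_text())["nodes"]
models = {uid for uid, n in nodes.items() if n["resource_type"] == "model"}
tested = {
    dep
    for n in nodes.values()
    if n["resource_type"] == "test"
    for dep in n["depends_on"]["nodes"]
    if dep in models
}
print(f"Models with at least one test: {len(tested) / len(models):.0%}")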

dbt commands

Any CI pipeline should run several dbt commands.

All build commands should make use of the --warn-error and --fail-fast flags, treating warnings as errors and stopping at the first failure.

An example dbt build command as part of the CI pipeline:

# ./.github/workflows/ci_pipeline.yml
- run: dbt --warn-error build --fail-fast

Using state:modified

As part of the CI pipeline the manifest.json artifact is generated for the feature branch. This can be compared to the manifest.json of the target branch using the state method to identify any nodes that have been modified. In addition, the state:modified+ selector also captures all downstream nodes. When combined with exposures and comments on the PR, this can help reviewers quickly assess the potential impact of a change.

PR comment showing modified nodes and downstream exposures
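
Conceptually, the comparison checks each node in the feature branch's manifest.json against its counterpart in the target branch's manifest.json. A simplified sketch of that idea follows (dbt's real state:modified logic also accounts for configs, macros and other factors, the + suffix walks the graph to pick up downstream nodes, and the paths here are assumptions):

# Simplified illustration of state comparison between two manifest.json files.
# dbt's actual state:modified selector is richer than a raw checksum diff.
import json
from pathlib import Path
def modified_nodes(feature_manifest: Path, target_manifest: Path) -> list[str]:
    feature = json.loads(feature_manifest.read_text())["nodes"]
    target = json.loads(target_manifest.read_text())["nodes"]
    return [
        unique_id
        for unique_id, node in feature.items()
        # a node counts as modified if it is new or its checksum changed
        if unique_id not in target or node["checksum"] != target[unique_id]["checksum"]
    ]
print(modified_nodes(Path("target/manifest.json"), Path("prod-artifacts/manifest.json")))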

Mart Monitor

A popular approach to CI for dbt is Slim CI, which runs only the modified nodes and their downstream nodes. This has the benefit of building and testing just a subset of the project, thereby reducing run times and operational costs.

In certain setups it may be desirable to run the entire dbt project in every CI pipeline run. While this sounds extreme, there are several methods that can be used to retain the benefits of Slim CI while gaining other advantages, namely the ability to provide comprehensive feedback on the impact of a PR on mart models. This can be done in several steps, resulting in the kind of PR feedback shown below.

A mart monitor that needs to be investigated further (source).

A mart monitor indicating a mart model has not changed (source).

A downside of building all models in a CI pipeline is increased run time and resource consumption. These can be kept in check with pytest tests based on the run_results.json artifact; see ./tests/pytest/run_results.py for examples of how the duration and resource consumption of dbt build in the CI pipeline can be capped at reasonable values. This provides a number of benefits, such as keeping the CI feedback loop fast, keeping warehouse costs predictable, and surfacing unexpectedly expensive models early.
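
As a rough illustration of the pattern (the threshold and path are assumptions, not the repo's actual budget):

# Illustrative guardrail on total CI build duration using run_results.json.
import json
from pathlib import Path
MAX_ELAPSED_SECONDS = 600  # assumed budget for this sketch
def test_ci_build_duration_is_within_budget():
    run_results = json.loads(Path("target/run_results.json").read_text())
    elapsed = run_results["elapsed_time"]  # total wall-clock time of the dbt invocation
    assert elapsed < MAX_ELAPSED_SECONDS, (
        f"dbt build took {elapsed:.0f}s, exceeding the {MAX_ELAPSED_SECONDS}s budget"
    )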

Continuous Deployment

TODO

Dev Containers

TODO

Others

Running dbt from Python

In version 1.5, dbt introduced programmatic invocations, a way of calling dbt commands natively from Python, including the ability to retrieve returned data. Previous approaches mostly relied on opening a new shell process and calling the dbt CLI, which was not ideal for a number of reasons, including security. This repo further abstracts programmatic invocations into a dedicated helper function; see run_dbt_command in ./scripts/utils.py.
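
A minimal sketch of the underlying API (the run_dbt_command helper wraps this pattern; its exact signature and behaviour may differ):

# Minimal programmatic invocation with dbt-core >= 1.5.
from dbt.cli.main import dbtRunner, dbtRunnerResult
dbt = dbtRunner()
# Equivalent to running `dbt ls` on the command line.
res: dbtRunnerResult = dbt.invoke(["ls"])
if not res.success:
    raise RuntimeError("dbt invocation failed") from res.exception
for node in res.result:  # for `ls`, the returned data is the list of selected nodes
    print(node)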

Conferences

This repository accompanies some conference talks.