Awesome
dbt - Dimensional Modeling, Semantic Layer, Metrics and Business Intelligence
This dbt project is database agnostic, requires no data sources, and focuses on the following concepts:
- Dimensional Modeling - dbt data transformations that highlight multi-dimensional (star schema) modeling concepts using the Kimball methodology
- Semantic Layer - leverage dbt as a semantic layer to define metrics, dimensions, aggregations, calculations, data relationships, business-friendly names and descriptions, synonyms, formatting and other data catalog attributes
- Business Intelligence - connect Business Intelligence tools to the dbt semantic layer for analysis, focusing on standards for "Semantic-free" BI integration with dbt that enables BI tools to "speak the same language"
Tested on both dbt Core and dbt Cloud versions 1.0+. Database agnostic and has been tested with the Postgres, BigQuery, Snowflake, SQL Server, and Redshift adapters.
dbt 1.7+ - this project requires dbt versions 1.7+. If you'd like to try with dbt < 1.7, then use previous releases of this project.
Create dbt Project
If you're just getting started learning dbt, then you may want to look at Getting Started with dbt Core or dbt Cloud
Choose a data warehouse
-
BigQuery - follow the dbt BigQuery instructions to set up a BigQuery account and connect dbt
-
Snowflake - follow the dbt Snowflake instructions to set up a Snowflake trial account and connect dbt
-
Redshift - follow the dbt Redshift instructions to set up a Redshift trial account and connect dbt. You can use specific dbt Redshift configurations. Pay special attention to recommended distribution styles for dimensional modeling on Redshift.
-
Postgres - follow these instructions to set up a free Postgres instance on Heroku
dbt Client
To run this project (assuming you have dbt installed):
- Clone this repo
- Set up a profile to connect to your database
- Run
dbt deps
- Run
dbt build
, which does the following: seed, test, run, snapshot - Run
dbt source freshness
- Run
dbt docs generate
- Run
dbt docs serve
(if you want to run a local docs server)
dbt Cloud
-
Login to dbt or Signup for a new dbt Account
Team Account (free trial) is required for BI tools to use the Metadata API
First time using dbt Cloud? Review the dbt Cloud Quickstart guide
-
Create a new project
a. New accounts create a new project immediately
b. Existing accounts need to go to "Account Settings > Projects" and click New Project
-
Give the project a name, e.g. "Sales Project"
-
Create a connection to either Snowflake, BigQuery, PostgreSQL, SQL Server, or Redshift
-
Connect a repository with either Git Clone or Github
a. Git Clone - enter Git URL as
https://github.com/flexanalytics/dbt-business-intelligence
b. Github - fork this repo, set up a dbt GitHub connection, and then choose the forked repository
-
Click Start developing to enter the develop portal in dbt Cloud
-
In the command line at the bottom, run
dbt deps
-
Run
dbt build
-
Run
dbt source freshness
-
If all runs OK, then Set up an Environment
Make sure it is a deployment environment
-
In Execution Settings, check Generate Docs and Run Source Freshness
-
Run the new job
Connect Business Intelligence tool
To see how a Business Intelligence tool can integrate with dbt using these standards, check out one of these:
- FlexIt Analytics
- Add more here
BI Standards
BI vendors and dbt projects should follow these standards to "speak the same language" for a semantic-free BI experience. The following sections highlight how these standards should be applied to dbt projects.
Lineage
Models must use ref
and source
for lineage to be accessible
select
{{ dbt_utils.generate_surrogate_key(['order_number','order_line_number']) }} order_key,
*
from {{ ref('stg_order') }}
select distinct
order_number,
order_line_number,
order_date,
status
from {{ source('salesforce', 'stg_sales_data') }}
Metrics
Define metrics using the meta
tag
models:
- name: fact_sales
columns:
- name: quantity
meta:
metrics:
total_quantity:
label: 'Total Quantity'
type: sum # sum, average, count, count_distinct, min, max
description: "Sum of quantity sold"
format:
type: number
decimalPlaces: 0
Business (Friendly) Names
Models and model columns use the meta
tag and define a label
attribute, as such:
models:
- name: dim_customer
meta:
label: Customer
columns:
- name: customer_name
meta:
label: Customer Name
Descriptions
The description
attribute is standard on models, columns, metrics, etc. and allow for a detailed definition.
models:
- name: fact_sales
description: "{{ doc('fact_sales') }}" # references docs block
- name: dim_order
description: "All your orders here" # standard description
columns:
- name: order_date
description: "Date of the order from {{ var('company_name') }}" # uses a var
You can store descriptions in a markdown
file and reference with {{ doc('fact_sales') }}
, which references this markdown file:
{% docs fact_sales %}
| Metric | Definition |
|----------------|--------------------------------------------------|
| quantity | Sum of quantity sold |
| price | Average price for an ordered product |
| sales | Sum of total sales revenue |
| target | Sum of target sales revenue |
{% enddocs %}
Report Dependencies
Report and dashboard dependencies are called exposures in dbt. Exposures allow you to bring together all the downstream usage of a model (BI, ML, or the many other tools an organization has).
Exposures can be part of schema.yml
or managed in separate files (e.g. exposures.yml
).
Use the
meta.tool
to define the tool
exposures:
- name: Area Chart (100% Stacked)
type: analysis
url: https://cloud.flexitanalytics.com/#analysis/area-stacked-100
description: ''
depends_on:
- ref('dim_order')
- ref('dim_product')
- ref('fact_sales')
owner:
name: Andrew Taft
email: some@email.address
meta:
tool: flexit
Joins
The preferred method uses dbt’s relationships
test, which serves two purposes:
- Tests referential integrity, ensuring your data is correct
- Allows BI tools to know relationships to joining entities
Dbt’s relationships
test works great for well modeled data (e.g. dimensional star schema), but there may be use cases for more complex joins (multi-column, expressions other than equals, etc.). In this case, we can leverage dbt’s meta property. Here’s an example of both ways to handle joins:
models:
- name: fact_sales
meta:
label: Sales Fact
joins:
- to: dim_customer
type: inner #inner, left, right, full; default 'inner'
join_on:
- from_field: customer_key
exp: '=' #optional; default '='; '=','<>'
to_field: customer_key
columns:
- name: customer_key
tests:
- relationships: #JOINS the dbt integrity way
to: ref('dim_customer')
field: customer_key
Another way for tools to derive joins from dbt models is by defining primary and foreign keys. While many columnar databases (e.g. Snowflake) do not enfore referential integrity (primary and foreign keys), they can still be defined so that tools can pick them up. For traditional relational databases, it can be important to define primary and foreign keys for performance reasons as well.
Important: to use primary and foreign key constraints, set the dbt_project.yml variable "dbt_constraints_enabled: true"
Note: dbt 1.5+ supports primary_key and foreign_key constraints. However, as of dbt-core v1.5.2, constraints require contracts to be enabled. The catch-22 is that contracts then require ALL model columns to define data_type, but data_type is database-specific, so you lose the incredible database agnostic features of dbt. For this reason, we use the
dbt_constraints
package to enable cross-database support for primary and foreign keys. The issue will be tracked here.
Data Freshness
BI tools should use dbt's built in run/job update timestamps and source freshness to display data freshness information to end users.
You can also use dbt Dashboard Tiles to place freshness “tiles” into dashboards, giving users the ability to see high-level freshness and drill into detail, if necessary.
Statistics
Information such as Number of rows
or Size
of a table/model can be very informative and build trust in the data. If the underlying data source supports these statistics, then dbt makes stats
available on models and then the BI tool can make them available to end users.
Synonyms
Synonyms (aliases) allow you to assign multiple potential names to a single column. For example, sales might be called revenue, receipts, proceeds, or other names depending on who you ask. Synonyms can be used for search and Natural Language Query (NLQ).
This is not a build-in dbt feature, so we use the meta tag, as shown here:
metrics:
- name: sales
label: Sales
model: ref('fact_sales')
description: "Sum of total sales revenue"
type: sum
sql: sales
timestamp: date_day
time_grains: [day, week, month, year]
meta:
synonyms: # for business searching and natural language processing
- Revenue
- Proceeds
- Receipts
Formatting
Define column formatting for numbers, dates, currency, and text
models:
- name: monthly_sales
columns:
- name: revenue
meta:
format:
type: number
decimalPlaces: 2
prefix: "$"
displayUnits: auto # auto, thousands, millions, billions, trillions
- name: quantity
meta:
format:
type: number
suffix: " units sold"
- name: date_day
meta:
format:
type: date
pattern: 'MMM D, YYYY'
- name: mrr_change
meta:
format:
type: percent
divideBy100: true
- name: descr
meta:
format:
type: text
showNullAs: 'No Value Entered'
Depending on type, possible attributes are:
- decimalPlaces
- prefix
- suffix
- displayUnits
- showNullAs
- pattern (date only)
- divideBy100 (percent only)
Hidden
Use the meta.hidden
attribute on models, model columns, and metrics to hide it from being displayed to end users in the BI tool.
models:
- name: stg_dummy_table
meta:
hidden: true
Display Order
Use the display_index
meta tag to place models and columns in a specific order
models:
- name: fact_sales
meta:
label: Sales Fact
display_index: 1 # display first
Concepts
Here is a list of concepts that were covered in this repo.
dbt Concepts
-
Database agnostic - how to use macros and packages to create models that work across database platforms
-
Multi-dimensional Modeling - how to create multi-dimensional models using the Kimball methodology
-
Date Dimension - build a database agnostic date dimension using the
dbt-date
package. -
Meta - using the meta config for downstream BI tools to handle joins, formatting, aliases, model and column order, hide/display columns and models, and more
-
Tests - how to use singular and generic tests, including relationships tests for referential integrity
-
Packages
-
Jinja & Macros - using Jinja & Macros to create re-usable code
-
Seeds - use Seeds to load source data into the data warehouse
-
Exposures - document downstream model dependencies, such as reports and dashboards, using Exposures
-
Custom Schemas - how to use Custom Schemas to organize models (e.g. staging tables)
-
Documentation - documenting dbt models for optimized downstream discovery in BI tools
-
Sources - defining Sources and using the source freshness features
-
Snapshots - use Snapshots to maintain a history of record changes in a table over time
-
Incremental Models - use dbt's incremental materialization where possible to reduce build time (can be overridden by passing the
--full-refresh
flag) -
Pre-hook & Post-hook - use post-hooks to run SQL after a model is built. For example, to create indexes, primary/foreign keys, grant permissions.
-
Best Practices - follow dbt Best Practices
Business Intelligence Concepts
-
Semantic-free BI - bypass defining metadata models at the BI semantic layer
-
Data Literacy - create a full data dictionary with source controlled definitions using dbt documentation
-
Lineage - instantly see upstream model and source dependencies with the ability to view and run upstream SQL
-
Joins - entity relationships defined by dbt tests
relationships
or modelmeta
-
Formatting - define universal formatting for number, string, date and other data types
-
Data Freshness - from reports and models, view model refresh date/time and source data freshness
-
Natural Language Query (NLQ) - define model
meta.synonyms
(a.k.a aliases) to be used by natural language query or generic searching -
Report dependencies - define model dependencies for reports and dashboards using Exposures
Dimensional Modeling Concepts
-
Slowly Changing Dimensions (SCD) - shows type 0, type 1 and type 2 slowly changing dimensions
-
Surrogate Keys - how to create surrogate keys using md5() hash
-
Date Dimension - build a full-featured Date Dimension
-
Fact and Dimension tables - create standard fact (measure) and dimension tables with constraints
Semantic Models
- Semantic Model Configuration - define semantic models in semantic_models.yml allow for unified analytics through standardized metrics.
- This has a few pre-requisites, including using dbt cloud on at least a Team level
-
- expr
- entities
- measures
- dimensions
- metrics
- saved_queries
- exports
- (obviously) have to push and run from cloud
[beta] Observability
This project includes observability via the dbt_observability and dbt_observability_marts packages for a robust observability environment. This is set to only run in a prod
environment by default, but can be modified by updating the dbt_observability:environments
variable. A full observability-related example config is below.
vars:
...
"dbt_observability:tracking_enabled": true # optional, create observability base tables - default is true
"dbt_observability:environments": ["prod"] # optional, default is ["prod"]
"dbt_observability:path": "models/marts/" # optional, which paths should observability monitor. must be in the form of "dbt_observability:path": "path/subpath/" - default is `None`, will run on all paths in the project
"dbt_observability:materialization": ["table","incremental"] # optional, which model materialization should observability run on. must be array of "table", "view", "incremental", "ephemeral" - default is ["table","incremental"]
"dbt_observability:track_source_rowcounts": false # optional, track source rowcounts - default is false [depending on your dbms, this can be slow and resource intensive as it may require a full table scan if the dbms does not store rowcounts in information_schema.tables]
"dbt_observability:rowcount_diff_threshold_pct": .05 # optional, set threshold for no_anomaly test
...
models:
...
dbt_observability:
+database: your_destination_database # optional, default is your target database
+schema: your_destination_schema # optional, default is `observability`
staging:
+database: your_destination_database # optional, default is your target database
+schema: your_destination_schema # optional, default is `observability`
sources:
+database: your_sources_database # optional, default is your target database
+schema: your sources_database # optional, default is `observability`
This will enable observability base tables to get metadata on your dbt models, tests, seeds, snapshots, and runs. dbt_observability_marts
then creates a set of staging tables and kimball-style dimension and fact tables as well as canned observability reports via views which can be built upon for your own observability reporting. Make sure to keep an eye on new releases from dbt_observability and dbt_observability_marts as new (and potentially breaking) changes are introduced.
Let's Collaborate
With Business Intelligence standards for integrating with dbt, we can enable all BI tools to speak the same language and offer a 100% semantic-free BI experience. Join us in creating those standards!