Date: 2024-05-07
Author: MvS
keywords: data engineering, dbt, data build tool, data warehouse, linter, code style
An enhanced working environment to develop and deploy dbt models using code quality tooling for its various moving parts.
The description from the project website sums it up perfectly:
> dbt is a transformation workflow that helps you get more work done while producing higher quality results. You can use dbt to modularize and centralize your analytics code, while also providing your data team with guardrails typically found in software engineering workflows. Collaborate on data models, version them, and test and document your queries before safely deploying them to production, with monitoring and visibility.
>
> dbt compiles and runs your analytics code against your data platform, enabling you and your team to collaborate on a single source of truth for metrics, insights, and business definitions. This single source of truth, combined with the ability to define tests for your data, reduces errors when logic changes, and alerts you when issues arise.
This project was created for:
- Ubuntu Linux, but it should be fairly easy to port to other OSes.
- Python 3.10 in a separate Python environment, which also supports Jupyter notebooks.
- A Node installation v18.18 or greater, see `nvm ls`.
- GNU Make v4.3.
- A dbt connection file called `profiles.yml` has to be set up in the developer's home directory; a minimal example is sketched below.
- We assume that either the `~/.dbt/profiles.yml` points to a pre-existing relational DB/lake-/warehouse data storage, or that the developer is familiar with setting up, e.g., a Docker container with a PostgreSQL instance on their laptop.
- We advise using admin role credentials on a dev DB instance for fast prototyping, and reducing this to minimal CRUD permissions restricted to the area of interest (AOI) on a production system.
- We encourage connecting a DB admin tool of choice, e.g., DBeaver, in parallel to the ongoing dbt coding. It can be used to test/filter new model components.
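A minimal `~/.dbt/profiles.yml` sketch, assuming the local PostgreSQL setup mentioned above; the profile name, credentials, and database/schema values are placeholders and must match your own setup as well as the `profile` entry of the dbt project:

```yaml
# ~/.dbt/profiles.yml -- illustrative sketch, all values are placeholders
my_dbt_project:            # must match the `profile` key in dbt_project.yml
  target: dev              # default target used when --target is not given
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: dbt_admin
      password: change_me
      dbname: pg_source
      schema: dbt_dev
      threads: 4
```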
- The Makefile in the repository's root is designed to help set up and manage the development environment. It includes tasks for setting up the Python/Node environment, testing the executables, running linters, formatting code, and cleaning up:
  - `make setup_env`: Install the local Python/Node environment with dbt, linters, etc. This Makefile target installs all necessary dependencies listed in `requirements-dbt.txt` and `package.json`.
  - `make verify_install`: Check the environment installation and command versions.
  - `make clean`: Clean up environment files, i.e., remove the Python virtual environment and the Node modules.

  Run `make <target>` to execute the desired target. For example, use `make setup_env` to set up the environment.
- The environment variable `DBT_PROJECT_NAME` does not need to be changed here, as each dbt project contains its own Makefile for specific use with the Python/Node environment.
- This approach was chosen to allow an individual customization of `dbt run`/`dbt test` for each dbt project's data modelling layer structure.
This example follows the first steps to set up a new project and adds another transformation using dbt macros, see `./models/loading`. For the sake of simplicity and to illustrate the main commands, it does not contain another Makefile.
- We assume that the dbt environment has been set up correctly and that a database connection has been established to a database with sufficient permissions to create schemas/tables/views.
- Using your DB admin tool, manually create a `source` schema on the database instance in the database `pg_source` and populate it with a table containing some data, see the provided script.
- (Skippable) Creating a new dbt project:
  - Create a project folder `<project_name>` to host all dbt data engineering and change into this directory.
  - Run `dbt init` and fill out the required info on DB connectivity; specify `<project_name>` and the target schema which dbt will work on, pulling data from the source. A project environment will be created in the directory, and a `~/.dbt/profiles.yml` will store the connectivity information. Both will be linked via a `dbt_project.yml` file in the project root (a sketch is shown after this list).
- Run `cd pg_source && dbt debug` to verify that the connection to the Postgres instance is working.
- Observe the fine-grained, recursive configuration options within each `*.yml` file, which can be overwritten by more deeply nested model layer definitions.
- Run `dbt run --select "models/example/my_first_dbt_model.sql"` for a "hello world" experience.
- After a run has concluded, it is appropriate to use a testing workflow to verify the integrity of the generated layers and tables, e.g., `dbt test --select "models/example/my_first_dbt_model.sql"`, which invokes generic tests on certain fields of a table. The testing criteria are specified in the `schema.yml` of the respective data model component (see the sketch after this list). There are options for singular tests, generic tests, and unit tests.
- Run `dbt deps` to install additional packages, specified in `packages.yml`, which provide advanced functionality beyond the scope of `dbt-core`, e.g., the timestamp functions of the `dbt_date` package. In this way, another model can be run using `dbt run --select "models/loading/my_first_transformation.sql"`.
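The `dbt_project.yml` generated by `dbt init` ties the project to a profile and configures the model layers. A stripped-down sketch is shown below; the project/profile names and materializations are placeholders that will differ in your setup:

```yaml
# dbt_project.yml -- illustrative sketch, names are placeholders
name: pg_source
version: "1.0.0"
profile: pg_source          # must match a profile in ~/.dbt/profiles.yml

model-paths: ["models"]
seed-paths: ["seeds"]
macro-paths: ["macros"]

models:
  pg_source:
    example:
      +materialized: view   # project-level defaults ...
    loading:
      +materialized: table  # ... can be overridden per model layer
```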
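The generic tests mentioned above are declared next to the model in a `schema.yml`. A minimal sketch for the example model, assuming an `id` column, could look like this:

```yaml
# models/example/schema.yml -- illustrative sketch
version: 2

models:
  - name: my_first_dbt_model
    description: "A starter dbt model."
    columns:
      - name: id
        description: "The primary key of this table."
        tests:              # generic tests shipped with dbt-core
          - unique
          - not_null
```

Running `dbt test --select "models/example/my_first_dbt_model.sql"` then executes one test per entry under `tests:`.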
This project has been taken from dbt's GitHub page and slightly extended:
- Note, as mentioned before, that for most commands a target database has to be specified, e.g., `dbt run --target dev`. In the `~/.dbt/profiles.yml` we can define connections `[local, dev, prod]` and default targets that would usually point to `dev` or `local` instances (see the sketch after this list). New model code should be deployed to production systems only after rigorous testing.
- Switch directory to the dbt model and confirm that the connection is working: `cd jaffle_shop && dbt debug`.
- To source a simple star schema data mart, we upload, or seed, a set of `*.csv` data files found in the `./seeds` folder to the database using `dbt seed`. This simulates the landing stage of the data lake/warehouse.
- We build the full data model by calling `dbt run`, which will try to create all subsequent tables and views based on the specification found in `./models`, consisting of `*.sql` and `*.yaml` definitions and Jinja templating configuration. We can use a command like `dbt run --select staging` to construct only parts of the model up to a certain level or to re-create a particular table.
- We can do a dry run of `dbt run`, e.g., by calling `dbt show --select "model_name.sql"`, which will run the query against the database but not write to it. We can also run ad-hoc queries against the DB: `dbt show --inline "select * from {{ ref('raw_customers') }}"`.
- A powerful feature of dbt is the automatic generation of structured documentation for each model realization, based on the technology used. In this way, the documentation for the production database will differ from the development instance, e.g., PostgreSQL (dev) and Snowflake (prod). Use `dbt docs generate --target dev` to generate/update the files. You can even run `dbt docs serve --target dev` to create a local website to be displayed in the web browser. Noteworthy files are:
  - `manifest.json`: Contains a full representation of your dbt project's resources (models, tests, macros, etc.).
  - `catalog.json`: Serves information about the tables and views produced and defined by the resources in your project.
- Quickly, also mentioning the `Makefile` in this project, which carries more weight in larger dbt projects:
  - Running `make dbt_run_refresh` will fully seed, recreate, and test each individual model stage (here they are called `marts` and `staging`), one after the other, stopping the build process upon failure.
  - Running `make dbt_run` will recreate and test the models only by their increment, which carries no weight in this configuration. See incremental loading for further details.
  - All the code quality features found under `make help` are also available.
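To illustrate the `--target` switch mentioned above, a `~/.dbt/profiles.yml` with several named outputs might be structured as follows; the names and connection details are placeholders, and the `prod` output could just as well use a different adapter type (e.g., Snowflake):

```yaml
# ~/.dbt/profiles.yml -- sketch of multiple targets, values are placeholders
jaffle_shop:
  target: dev                  # default target when --target is not given
  outputs:
    local:
      type: postgres
      host: localhost
      port: 5432
      user: dbt_admin
      password: change_me
      dbname: jaffle_shop
      schema: dbt_local
      threads: 4
    dev:
      type: postgres
      host: dev-db.internal    # hypothetical host
      port: 5432
      user: dbt_dev
      password: change_me
      dbname: jaffle_shop
      schema: dbt_dev
      threads: 4
    prod:
      type: postgres           # could also be snowflake, bigquery, etc.
      host: prod-db.internal   # hypothetical host
      port: 5432
      user: dbt_prod
      password: change_me
      dbname: jaffle_shop
      schema: analytics
      threads: 8
```

With this in place, `dbt run --target prod` deploys against the production connection, while plain `dbt run` uses the default `dev` target.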
The data set used here contains transactional data from a retail store covering:
- 12 months of transactions, approx. 500k rows,
- orders as well as partial cancellations/returns of orders,
- approximately 4,000 product items for home furnishings and decorations,
- recurring orders by approx. 4,000 customers from 30 countries.
It was kindly provided through the Kaggle community. It turned out that the data also contains stock management information, e.g., when items were lost/destroyed/damaged. To properly make use of these data, it became apparent that a certain level of data cleaning is required during dbt's loading process. The result of an explorative data analysis and wrangling can be found in `analyses/2024-07-17_preparing_seeds.ipynb`. The seed files provided in `seeds/*.csv` can be re-created using this notebook plus some feature engineering by LLMs.
In this project most things run out of the box, but some initial setup is still required:
- The Kaggle data needs to be manually imported into the DB using a DB admin tool and a scripted approach. One can opt to either use the 500k rows of data as a whole or split them into packages defined by a time interval. A script is provided to facilitate that task.
- Once the data is "manually" ingested, it can be processed by pointing `models/sources_properties.yml` to the proper tables (see the sketch after this list) and selecting the respective references in `staging/0101_transactional_data.sql` and `staging/0102_transactional_nulls.sql`. We strongly advise replicating the initial setup before making individual customizations.
- Similar to the previous project example, there are `Makefile` targets to simplify a structurally consistent deployment:
  - Running `make dbt_run_refresh` will fully seed, recreate, and test each individual model stage: `stage`, `core`, `mart`, and `stats` point to respective schemas in the DB instance's database. The build process stops upon failure.
  - Running `make dbt_incremental` will add any new data to realized models of that flavor. The common scenario is that a new batch of ingested data only contains a few lines and does not require a full rebuild of the data model. We can either simulate this by adding a new set of data to the source table via SQL `INSERT`, or by simulating an incremental condition, e.g., `make dbt_incremental REFRESH="--vars 'overlap_interval: \"2 Days\"'"`, as this parameter has been coded into the incremental models for this showcase.
- All the code quality features found under `make help` are also available. We advise employing `make sqlfluff_lint` and `make prettier_reformat` ahead of a pull/merge request in order to bring consistency to the SQL/YAML code base.
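The `models/sources_properties.yml` mentioned above declares the manually ingested tables as dbt sources. A rough sketch, assuming a `source` schema and hypothetical time-sliced transaction tables (all names are placeholders that have to match your ingest), could look like this:

```yaml
# models/sources_properties.yml -- illustrative sketch, names are placeholders
version: 2

sources:
  - name: retail_source            # referenced via {{ source('retail_source', ...) }}
    database: pg_source
    schema: source                 # the manually created ingest schema
    tables:
      - name: transactions_2010    # e.g., one time-sliced ingest table
      - name: transactions_2011
```

The staging models then select from `{{ source('retail_source', 'transactions_2010') }}` instead of hard-coded table names.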
- Install the command completion for dbt as specified here:

  ```bash
  cd ~
  curl https://raw.githubusercontent.com/fishtown-analytics/dbt-completion.bash/master/dbt-completion.bash > ~/.dbt-completion.bash
  echo 'source ~/.dbt-completion.bash' >> ~/.bash_profile
  ```
- When using VS Code, we recommend installing the dbt extension Power User for dbt Core to gain:
  - Code completion, function hinting, model references, etc.
  - The ability to run dbt model queries inside of VS Code from their respective `model.sql` file via Ctrl + Return.
  - Display of data lineage.
  - ...
- Install dbt packages like dbt-utils by adding a `packages.yml` or `dependencies.yml` to the dbt project root and pulling them into the project by running `dbt deps` (see the sketch after this list). Shortly afterwards, the Jinja templating engine will offer the respective macro functionalities.
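A minimal `packages.yml` covering the packages mentioned in this document might look as follows; the version ranges are only examples and should be checked against the dbt package hub:

```yaml
# packages.yml -- illustrative sketch, version ranges are examples only
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.1.0", "<2.0.0"]
  - package: calogica/dbt_date
    version: [">=0.10.0", "<1.0.0"]
```

After `dbt deps`, macros such as `{{ dbt_utils.generate_surrogate_key([...]) }}` or `{{ dbt_date.today() }}` can be used inside the models.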
Convenience features included in this repo are:
- A Makefile to set up the environments and call the features below.
- Automatic Python code reformatting following PEP 8 using black:
  - Invoked and applied manually via `black .`, and as a dry run using `black --check .`.
- Python code linting using flake8:
  - The flake8 linter is integrated into the pre-commit git hooks mentioned below (and configured in `setup.cfg`).
- Using sqlfluff as a linter will boost the dbt code quality, especially given the lack of enforced standardization and evolving SQL habits within a dev team:
  - Start with the package installation:

    ```bash
    source <dbt-env>
    pip install sqlfluff sqlfluff-templater-dbt
    ```

  - Add a `.sqlfluff` config file to the project root. Note: the SQL dialect and the project dir need to match the technologies used in the backend. ToDo: use Jinja templating to add flexibility:

    ```ini
    [sqlfluff]
    dialect = postgres
    templater = dbt
    runaway_limit = 10
    max_line_length = 80
    indent_unit = space

    [sqlfluff:templater:dbt]
    project_dir = <dbt_project-dir>

    [sqlfluff:indentation]
    tab_space_size = 4

    [sqlfluff:layout:type:comma]
    spacing_before = touch
    line_position = trailing

    [sqlfluff:rules:capitalisation.keywords]
    capitalisation_policy = lower

    [sqlfluff:rules:aliasing.table]
    aliasing = explicit

    [sqlfluff:rules:aliasing.column]
    aliasing = explicit

    [sqlfluff:rules:aliasing.expression]
    allow_scalar = False

    [sqlfluff:rules:capitalisation.identifiers]
    extended_capitalisation_policy = lower

    [sqlfluff:rules:capitalisation.functions]
    capitalisation_policy = lower

    [sqlfluff:rules:capitalisation.literals]
    capitalisation_policy = lower

    [sqlfluff:rules:ambiguous.column_references]
    # Number in group by
    group_by_and_order_by_style = explicit
    ```

  - Add a `.sqlfluffignore` config file to the project root:

    ```
    **/dbt_packages/
    **/target/
    <python_env>/
    ```

  - From the git project root, either invoke `sqlfluff lint` to list the code quality issues in the current SQL code, or run `sqlfluff fix` to auto-correct the issues on the fly. Note that some quality issues will still have to be fixed manually, but the linter will usually point in the right direction.
- Prettier formatting of all `*.yaml` configuration files. Prettier's own configuration is specified within the `.prettierrc` file.
  - A dry run is invoked manually via `npx prettier . --check`, and the formatting is applied using `npx prettier . --write`.
- Pre-commit hooks to check for fundamental issues with the code base before adding changes to git (a hypothetical hook configuration is sketched below).
  - Invoked manually via `pre-commit run --all-files`.
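The concrete hooks live in the repository's pre-commit configuration, which is not reproduced in this document. The following `.pre-commit-config.yaml` is therefore only a hypothetical sketch combining the tools mentioned above (black, flake8, sqlfluff, prettier); the `rev` values are examples and should be pinned to the versions actually used:

```yaml
# .pre-commit-config.yaml -- hypothetical sketch, hook revisions are examples
repos:
  - repo: https://github.com/psf/black
    rev: 24.4.2
    hooks:
      - id: black
  - repo: https://github.com/pycqa/flake8
    rev: 7.0.0
    hooks:
      - id: flake8
  - repo: https://github.com/sqlfluff/sqlfluff
    rev: 3.0.7
    hooks:
      - id: sqlfluff-lint
        additional_dependencies: ["sqlfluff-templater-dbt"]
  - repo: https://github.com/pre-commit/mirrors-prettier
    rev: v3.1.0
    hooks:
      - id: prettier
        types_or: [yaml]
```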