Whether you are new to data engineering or quite conversant with the dataverse, you have probably heard of or used the Data Build Tool (dbt).
So, what is dbt? dbt is a transformation tool that uses three languages to help data engineers transform data efficiently in ETL and ELT pipeline development. These three languages are:
- Structured Query Language (SQL): This is the main language used in dbt. It is used to transform data efficiently in databases.
- YAML: This is the language used to configure dbt in configuration files, i.e. the dbt_project.yml file, which contains configuration settings for our models, tests and other dbt essentials.
- Jinja: This is a powerful templating engine used in dbt. It is also used elsewhere, e.g. in Python web frameworks (Django, Flask), where it is combined with HTML files to create robust and functional web applications. Using Jinja in dbt allows our code to be reused across the files in our project, making it very important for users. Read more about Jinja here.
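As a small taste of Jinja in dbt, here is a hedged sketch of a model that uses a set statement and a for loop to generate one column per payment method (the payments table and its columns are made up for illustration):

```sql
-- Hypothetical model: pivot payment amounts into one column per method
{% set payment_methods = ['cash', 'mpesa', 'bank_transfer'] %}

select
    order_id,
    {% for method in payment_methods %}
    sum(case when payment_method = '{{ method }}' then amount end) as {{ method }}_amount
    {%- if not loop.last %},{% endif %}
    {% endfor %}
from payments
group by order_id
```

When dbt compiles this model, the loop expands into three sum(...) columns, so adding a new payment method becomes a one-line change.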
Instead of writing massive SQL scripts, dbt helps you:
- Write modular SQL
- Run version-controlled transformations
- Test and document your models with an easy-to-navigate UI
- Embrace software engineering best practices for data analytics
dbt (data build tool) works exceptionally well with CTEs (Common Table Expressions) because they align closely with dbt’s core principles:
- clarity
- modularity
- reusability
- testability.
That was a short introduction to what dbt is and what it entails. Now let's dive deeper into its basics: we'll install dbt-core, write our first dbt model and learn more about this powerful tool for data engineers.
1. Installing and initializing dbt
To install dbt, we can use either dbt-core or dbt Cloud. For this blog we will install dbt-core; we will cover dbt Cloud in another post.
To work with dbt, you need the following prerequisites:
- Experience installing Python packages in the terminal
- Beginner to advanced SQL knowledge
- Eagerness to learn!
Before installing dbt, make sure you have created and activated a virtual environment. This helps prevent package version mix-ups and errors.
To create and activate a virtual environment for your operating system:
python3 -m venv your_env
source your_env/bin/activate # Linux / MacOS
your_env\Scripts\activate # Windows
To install dbt, run one of the following commands, depending on the database/data warehouse you would like to connect to, e.g. postgres, redshift, snowflake or bigquery.
pip install dbt-core dbt-postgres # If you choose postgres
pip install dbt-core dbt-redshift # If you choose redshift
pip install dbt-core dbt-snowflake # If you choose snowflake
pip install dbt-core dbt-bigquery # If you choose bigquery
Run the following command to initialize our dbt project:
dbt init your_project_name
cd your_project_name
This will prompt you to enter your data source credentials, e.g. host, port, user and password. Running this command for the first time also sets up a .dbt/ folder in your HOME directory for your global configuration files, e.g. the profiles.yml file that dbt uses to connect to your data warehouse.
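For reference, a minimal profiles.yml for Postgres might look like the sketch below; the profile name must match the profile set in your dbt_project.yml, and all the credentials here are placeholders, not values from this post:

```yaml
your_project_name:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: your_user
      password: your_password
      dbname: your_database
      schema: public
      threads: 4
```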
It also creates some very important folders and files for us, like models/, which contains our models, the dbt_project.yml file, which contains our project's configuration settings, and tests/ for our tests, among others.
2. Creating our first dbt model
To create our first model, head over to the models/ folder and create a file; name it as you like. Let's name it high_cost_houses.sql. My data warehouse contains data on houses in Kenya, ranging from low-cost to high-cost, and I need to get all the high-cost houses.
By default, when dbt runs a model it materializes the result as a view, which we can go and inspect in our database. We can change that in the dbt_project.yml file; I'll show you how to do that later on, don't worry!
For our first model:
with high_cost_houses as (
select titles, locations, prices
from public.house_data
where prices >= 50000000
order by prices desc
)
select * from high_cost_houses
This query fetches the houses priced at 50 million or above. To run this model:
dbt run
This creates a view in our data warehouse, and you can go check it out under Views.
a. Modularity in dbt
dbt uses both Jinja templates and SQL so that we can use our models anywhere in our project without repeating ourselves. This keeps our models reusable and maintainable. For example, to use our high_cost_houses.sql in another model, we can use ref() to reference that model like so:
-- Let's get houses that are in Runda from our high_cost_houses
SELECT *
FROM {{ ref('high_cost_houses') }}
WHERE locations = 'Runda'
We have reused our model in our new model using {{ ref('high_cost_houses') }}.
b. Materialization in dbt
Remember when I said I would help you override dbt's default materialization from a view to a table? Let's do it.
Head over to your dbt_project.yml file and change the materialization from view to table:
models:
dbt_practice:
# change this from view to table
high_cost_houses:
+materialized: table # previously: view
Tables are faster to query but take up more storage space, while views are cheaper to store but are recomputed each time they are queried.
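As an alternative to dbt_project.yml, dbt also lets you set the materialization inside the model file itself with a config() block at the top, which overrides the project-level setting for that one model:

```sql
-- Materialize just this model as a table instead of the default view
{{ config(materialized='table') }}

with high_cost_houses as (
    select titles, locations, prices
    from public.house_data
    where prices >= 50000000
    order by prices desc
)

select * from high_cost_houses
```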
3. dbt Sources, Seeds and Analyses
Sources are the raw data tables in our data warehouse, seeds are CSV files that we would like to load for analysis, while analyses are one-off SQL queries that are not materialized into our data warehouse as tables or views. If we want to materialize them, we should move them to the models/ folder.
To reference a source in dbt, we use the source('source_name', 'table_name') function. Make sure you have declared the source in a .yml file inside your models/ folder, like so:
sources:
- name: database_loading
tables:
- name: house_data_loading
Then reference it like so:
SELECT *
FROM {{ source('database_loading', 'house_data_loading') }}
4. Testing in dbt
In dbt, there are two types of tests:
a. Singular tests - These are tests that run against one model and don't need to be repeated for other models. They validate the outcome of a specific model. For example, in our high_cost_houses.sql model, no house should have a price below 50 million, so let's write a test to validate this.
Head over to the tests/ folder, create a new test file called check_prices.sql and write the following test query:
SELECT *
FROM {{ ref('high_cost_houses') }}
WHERE prices < 50000000
After writing our test, run the following to execute it:
dbt test
A singular test passes when it returns zero rows. If our test fails, it means our data contains rows with a price below 50 million, and we should investigate them.
b. Generic tests - These are tests that can be applied to many models. They follow the Don't Repeat Yourself (DRY) principle.
dbt ships with built-in generic tests, and you can also write your own custom ones. Examples of built-in generic tests include:
- not_null
- unique
- accepted_values
- relationships
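To illustrate, here is a sketch of how these built-in tests could be attached to columns in a .yml file; the accepted location values and the locations_dim model referenced by relationships are hypothetical, just to show the shape:

```yaml
models:
  - name: high_cost_houses
    columns:
      - name: titles
        tests:
          - not_null
          - unique
      - name: locations
        tests:
          - accepted_values:
              values: ['Runda', 'Karen', 'Kilimani']
          - relationships:
              to: ref('locations_dim')
              field: location_name
```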
For example, say we want to check whether any locations are empty in the high-cost and low-cost tables in our data warehouse. Let's write a custom generic test using SQL and Jinja:
{% test empty_locations(model, column_name) %}
SELECT {{ column_name }}
FROM {{ model }}
WHERE TRIM({{ column_name }}) = ''
{% endtest %}
Save this test in the tests/generic/ folder. Then, to apply it to these tables, create a .yml file (e.g. test.yml) inside your models/ folder and declare the test like so:
models:
- name: low_cost_houses
columns:
- name: locations
tests:
- empty_locations
- name: high_cost_houses
columns:
- name: locations
tests:
- empty_locations
Then run:
dbt test
This will run our tests in our models and validate our output.
We can also use built-in generic tests in our test.yml file without writing much code. To do so:
models:
- name: low_cost_houses
columns:
- name: locations
tests:
- not_null
- name: high_cost_houses
columns:
- name: locations
tests:
- not_null
And run:
dbt test
This will work the same way!
5. dbt Macros
Macros are functions in dbt. They make code reusable, easier to use, modular, scalable and easier to maintain.
They are constructed using the Jinja templates:
{% macro macro_name(params) %}
--code
{% endmacro %}
Let's create a macro that calculates the average house price per location in Kenya.
In the macros/ folder, create a file avg_house_price.sql and write the following code:
-- Macro to get avg prices of houses per location
{% macro get_avg_prices(locations, price, table_name) %}
SELECT {{ locations }}, AVG({{ price }}) AS avg_price
FROM {{ table_name }}
WHERE {{ price }} IS NOT NULL
GROUP BY {{ locations }}
{% endmacro %}
To use our macro in a model, create a new model avg_houses_per_location.sql in our models/ folder and write the following code:
{{ get_avg_prices('locations', 'prices', 'house_data') }}
See how easy it is to create models with macros! They make work easier, as I can reuse the same macro in other models.
6. dbt packages
Imagine what languages would be like without packages, e.g. Python without requests or pandas. It would be tiring for any engineer or developer to call APIs or clean data using plain Python.
dbt has packages that make our work easier when running models, testing and building data pipelines. Two of the most widely used dbt packages are dbt_expectations and dbt_utils.
Go through the dbt Hub for more dbt packages.
Let's import dbt_utils from dbt Hub into our project.
In the packages.yml file, configure the package like so:
packages:
- package: dbt-labs/dbt_utils
version: 1.3.0
Then run the following command in your terminal to install the package so we can use it in our code:
dbt deps
Now we can use dbt_utils in our code!
{{ dbt_utils.star(from=ref('users')) }}
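For a slightly fuller sketch, here is a hypothetical model that combines two dbt_utils macros: generate_surrogate_key() to hash one or more columns into a key, and star() to select every column from a referenced model:

```sql
-- Hypothetical model: add a surrogate key in front of all existing columns
select
    {{ dbt_utils.generate_surrogate_key(['locations', 'titles']) }} as house_key,
    {{ dbt_utils.star(from=ref('high_cost_houses')) }}
from {{ ref('high_cost_houses') }}
```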
Conclusion
dbt is quite diverse and we've covered only the fundamentals in this blog. For more information on dbt check the following links:
Thank you for reading this post, like, comment and share this post. This will be much appreciated!