DEV Community

Cover image for dbt for Normies - A Complete Guide.
Denzel Kanyeki
Denzel Kanyeki

Posted on

dbt for Normies - A Complete Guide.

Whether you are new to data engineering or you are quite conversant with the dataverse, you should have heard or used Data Build Tool (dbt).

Then, what is dbt? dbt is a transformation tool that uses 3 languages to be able to help data engineers transform data efficiently in ETL and ELT pipeline development. These three languages are:

  • Structured Query Languages(SQL): This is the main language used in dbt. It is used in transforming data efficiently in databases.
  • YAML: This is the language used to configure dbt in configuration files i.e. dbt_project.yml file which contains configuration settings for our models, tests among other dbt essentials.
  • Jinja: This is a powerful templating engine used in dbt. It can also be used in other languages i.e. Python (Django, Flask) where it is combined to work with HTML files to create robust and functional web applications. Using Jinja in dbt allows for reusability for our code in various files within our project, making it very important for users. Read more about Jinja here.

Instead of writing massive SQL scripts, dbt helps you:

  • Write modular SQL
  • Run version-controlled transformations
  • Test and document your models with easy to navigate UI.
  • Embrace software engineering best practices for data analytics

dbt (data build tool) works exceptionally well with CTEs (Common Table Expressions) because they align closely with dbt’s core principles:

  • clarity
  • modularity
  • reusability
  • testability.

There is a small introduction for what dbt is and what it entails, let's dive deeper into it and get to know its basics, install dbt-core, write our first dbt model and learn more about this powerful tool for data engineers.

1. Installing and initializing dbt.

To install dbt, we need to either install dbt-core or dbt-cloud. For this blog, we will install dbt-core and we will install dbt-cloud in another post.

To work with dbt, you need to have the following prerequisites:

  • Installing Python packages in ther terminal
  • Beginner to advanced SQL knowledge
  • Eagerness to learn!

Before installing dbt make sure you have installed and activated your virtual environment. This helps with preventing package version mixups and errors.
To install and activating your virtual environment according to your operating system:


python3 -m venv your_env

source your_env/bin/activate # Linux / MacOS
your_env\Scripts\activate # Windows
Enter fullscreen mode Exit fullscreen mode

To install dbt, run the following commands, following the database/data source/data warehouse that you would like to link to e.g. postgres, redshift, snowflake, bigquery.


pip install dbt-core dbt-postgres # If you choose postgres
pip install dbt-core dbt-redshift # If you choose redshift
pip install dbt-core dbt-snowflake # If you choose snowflake
pip install dbt-core dbt-bigquery # If you choose bigquery

Enter fullscreen mode Exit fullscreen mode

Run the following command to initialize our dbt project:


dbt init your_project_name
cd your_project_name

Enter fullscreen mode Exit fullscreen mode

This will prompt you to enter some data source credentials e.g. port, host, user and password. By running this command for the first time, it will set up .dbt/ folder in your HOME directory for your global configuration files e.g. profiles.yml that dbt uses to connect to your data warehouse.

This also creates very important files for us, like models which contains our models, dbt_project which contains our project's configuration settings, tests for our tests among other folders.

2. Creating our first dbt model.

To create our first model, head our to our models/ folder and create a file, name it as you like. Let's name it high_cost_houses.sql. My data warehouse contains house data in Kenya from low range to high range cost houses. I needed to get all the high cost houses.

In dbt, when running a model, it creates a materialized view in our tables where we can go and view the results in our database. We can change that in dbt_project.yml file, I'll show you how to do that later on, don't worry!

For our first model:

with high_cost_houses as (
    select titles, locations, prices
    from public.house_data
    where prices >= 50000000
    order by prices desc
)
select * from high_cost_houses
Enter fullscreen mode Exit fullscreen mode

This query fetches the houses whose price is above 50 million. To run this model:

dbt run
Enter fullscreen mode Exit fullscreen mode

This creates a view in our data warehouse and you can go check it out under Materialized views.

a. Modularity in dbt

dbt uses both Jinja templates and SQL to ensure that we can use our models wherever in our models without repeating ourselves. This ensures that our models are reusable and maintainable. For example, to use our high_cost_houses.sql in another model, we can use ref() to reference to that model as so:


-- Let's get houses that are in Runda from our high_cost_houses

SELECT *
FROM {{ ref('high_cost_houses') }}
WHERE locations = 'Runda'
Enter fullscreen mode Exit fullscreen mode

We have reused our model in our new model using {{ ref('high_cost_houses') }}.

b. Materialization in dbt

Remember when I said I would help you override dbt's default materialization from a view to a table? Let's do it.
Head over to your dbt_project and change this from view to table

models:
  dbt_practice:
# change this from view to table
    high_cost_houses:
      +materialized: table # previously: view
Enter fullscreen mode Exit fullscreen mode

Tables are easier to access but take up more space than views which are cheaper to store and easier to run.

3. dbt Sources, Seeds and Analyses.

Sources are data sources from our data warehouse, seeds are csv files that we would like to load for analysis while analyses are SQL queries that are one off queries and are not materialized into our data warehouse as tables or views. If we would want to materialize them, we should move them to the models/ folder.

To reference our source in dbt, we use the source('db_name', 'table_name') method to do so. Make sure you have set the source in our dbt_project.yml file as so

sources:
    - name: database_loading
    tables:
        - name: house_data_loading

Enter fullscreen mode Exit fullscreen mode

Then reference it as so:

SELECT *
FROM {{ source('database_loading', 'house_data_loading') }}
Enter fullscreen mode Exit fullscreen mode

4. Testing in dbt

In dbt, there are two types of tests which include:

a. Singular tests - These are tests that are run in one model and don't need to be repeated in other models. This tests actually validate outcomes of models. For example, in our high_cost_houses.sql model, our houses should not have a price less than 50 million, so let's write a test to validate this.

Head over to the tests/ folder and create a new test file called check_prices.sql and write the following test query:


SELECT *
FROM {{ ref(''high_cost_houses'') }}
WHERE prices < 50000000

Enter fullscreen mode Exit fullscreen mode

After writing our test, run this to run our test:

dbt test
Enter fullscreen mode Exit fullscreen mode

If our test fails, it means our data has some values that have a price below 50 million and should check that.

b. Generic tests - These are tests that can be performed on many tables. These tests follow the Don't Repeat Yourself (DRY) principle.
There are built-in and pre built-in generic tests in dbt. Examples of generic tests include:

  • not_null
  • unique
  • accepted_values
  • relationships

For example, we would to check if there are any locations that are empty in the high cost and low cost tables in our data warehouse. Let's write a generic test to check for any empty values using SQL and Jinja:

{% test empty_locations(model, column_name) %}
    SELECT {{ column_name }}
    FROM {{ model }}
    WHERE TRIM{{ column_name }} = ''
{% endtest %}
Enter fullscreen mode Exit fullscreen mode

To enter the test for these tables, create a test.yml file and load the test inside as so:

models:
  - name: low_cost_houses
    columns:
      - name: locations
        tests:
          - empty_locations

  - name: high_cost_houses
    columns:
      - name: locations
        tests:
          - empty_locations
Enter fullscreen mode Exit fullscreen mode

Then run:

dbt test
Enter fullscreen mode Exit fullscreen mode

dbt Test Result

This will run our tests in our models and validate our output.

We can also load built-in generic tests in our test.yml file without writing much code. To do so:

models:
  - name: low_cost_houses
    columns:
      - name: locations
        tests:
          - not_null

  - name: high_cost_houses
    columns:
      - name: locations
        tests:
          - not_null
Enter fullscreen mode Exit fullscreen mode

And run:

dbt test
Enter fullscreen mode Exit fullscreen mode

This will work the same way!

5. dbt Macros

Macros are functions in dbt. They make code reusable, easier to use, modular, scalable and easier to maintain.

They are constructed using the Jinja templates:

{% macro macro_name(params) %}
--code
{% endmacro %}
Enter fullscreen mode Exit fullscreen mode

Let's create a macro of calculating the average house prices per location in Kenya.

In the macros/ folder, create a file avg_house_price.sql and write the following code:

-- Macro to get avg prices of houses per location
{% macro get_avg_prices(locations, price, table_name) %}
SELECT {{ locations }}, AVG({{ price }}) AS avg_price
FROM {{ table_name }}
WHERE {{ price }} IS NOT NULL
GROUP BY {{ locations }}
{% endmacro %}
Enter fullscreen mode Exit fullscreen mode

To use our macro in our model, create a new model avg_houses_per_location.sql in our models folder and write the following code:

{{ get_avg_prices('locations', 'prices', 'house_data') }}
Enter fullscreen mode Exit fullscreen mode

Check how easy it is to create models with macros! It makes work easier as I can reuse that macro in other models.

6. dbt packages.

Imagine what languages would be like without packages, e.g. Python without requests, pandas? It would be tiring for any engineer or developer to call APIs or clean data using Python.

dbt has packages that makes our work easier with running models, testing and building data pipelines. One of the widely used dbt package is dbt expectations and dbt utils.

Go through the dbt Hub for more dbt packages.

Let's import dbt utils from dbt Hub into our project.
In the packages.yml file, configure our package as so:

packages:
  - package: dbt-labs/dbt_utils
    version: 1.3.0
Enter fullscreen mode Exit fullscreen mode

Then run the following command in your terminal to install the package and use them in our code:

dbt deps
Enter fullscreen mode Exit fullscreen mode

Install dbt packages

Now we can use dbt utils in our code!

{{ dbt_utils.star(from=ref('users')) }}
Enter fullscreen mode Exit fullscreen mode

Conclusion

dbt is quite diverse and we've covered only the fundamentals in this blog. For more information on dbt check the following links:

Thank you for reading this post, like, comment and share this post. This will be much appreciated!

Top comments (0)