When I started using dbt with Databricks, I was mostly focused on writing models and building transformations. But as data volume grew, I started to see the performance and cost implications of materializations and storage formats.
In this post, I want to share some real lessons learned while choosing between `view`, `table`, and `incremental` materializations, and why switching from Parquet to Delta Lake made a huge difference.
## Views vs Tables vs Incremental: What Actually Works?
Here’s a quick breakdown of how I use each materialization in real projects:
| Materialization | Use Case | Notes |
|---|---|---|
| `view` | Lightweight models (renaming, filtering) | Great for development, but recomputes every time |
| `table` | Medium-size dimensions | Useful for reference tables that rarely change |
| `incremental` | Large facts, high-frequency data | Only processes new or changed rows (ideal for big datasets) |
## DAG and `ref()`: How dbt Manages Dependencies
Instead of hardcoding table names, I use `ref()` to link models. For example:
```sql
SELECT *
FROM {{ ref('stg_orders') }}
```
This function does more than just alias a table. Behind the scenes, `ref()` helps dbt:
- Build a Directed Acyclic Graph (DAG) of your models
- Determine the correct run order during `dbt run`
- Resolve fully qualified table names dynamically across dev/prod environments
- Track lineage for documentation and testing
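As a sketch of how that DAG comes together, a downstream model simply references the staging model shown above. The `fct_orders` model and its columns are hypothetical, but the pattern is how dbt infers that `stg_orders` must be built first.

```sql
-- models/fct_orders.sql (hypothetical downstream model)
{{ config(materialized='table') }}

SELECT
    order_id,
    customer_id,
    order_date,
    amount
FROM {{ ref('stg_orders') }}  -- dbt records this edge in the DAG and runs stg_orders first
```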
This becomes incredibly useful in Databricks where the environment, catalog, and schema may differ between workspaces.
For example, `{{ ref('stg_orders') }}` might compile into:
`dev_catalog`.`analytics_schema`.`stg_orders`
This makes your project environment-agnostic and easier to manage with Git-based workflows.
## Real Problem I Faced with Views on Databricks
In one pipeline, I had several dbt models materialized as `view`, assuming it would keep things fast and light. But as models started chaining together, one `ref()` leading to another, the final model failed due to out-of-memory errors.
That’s because views in dbt are logical; Databricks has to recompute all the upstream SQL each time. As complexity grew, so did query length and memory usage.
### Solution: Switching from `view` to `table`

I changed key intermediate models to use the `table` materialization:
```sql
{{ config(materialized='table') }}
```
This persisted the results, reduced recomputation, and stabilized the pipeline.
## Why I Use Delta Format on Databricks
If you're using Databricks and still materializing models as regular Parquet files, you're missing out.
Delta Lake adds:
- ACID compliance (safe concurrent writes)
- Time travel (query past versions)
- Efficient upserts and merges
- Z-Ordering (for faster filtering)
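The upserts-and-merges point is where Delta pairs especially well with dbt's incremental models. Here is a rough sketch of what I mean; the `order_id` unique key, the columns, and the seven-day lookback are assumptions, and `incremental_strategy='merge'` is specific to adapters like dbt-databricks.

```sql
-- models/fct_orders_incremental.sql (hypothetical example)
{{ config(
    materialized='incremental',
    file_format='delta',
    incremental_strategy='merge',
    unique_key='order_id'
) }}

SELECT
    order_id,
    customer_id,
    order_date,
    amount
FROM {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- Only scan recent source rows; the Delta merge handles updates to existing order_ids
  WHERE order_date >= DATE_SUB(CURRENT_DATE(), 7)
{% endif %}
```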
In my dbt config, I now explicitly define:
```sql
{{ config(materialized='table', file_format='delta') }}
```
And for large tables, I follow up with:
```sql
OPTIMIZE analytics.fct_orders ZORDER BY (order_date)
```
This alone made queries run significantly faster for date-filtered dashboards.
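If you would rather not run `OPTIMIZE` by hand, one option is to attach it as a dbt post-hook so it runs after each build. This is a sketch rather than what my pipeline does; whether you want `OPTIMIZE` on every run (versus a scheduled job) depends on table size and cost.

```sql
-- Hypothetical: run OPTIMIZE ... ZORDER automatically after the model builds
{{ config(
    materialized='table',
    file_format='delta',
    post_hook="OPTIMIZE {{ this }} ZORDER BY (order_date)"
) }}

SELECT * FROM {{ ref('stg_orders') }}
```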
## Summary
- Use `view` when developing, but be cautious of deep chains.
- Switch to `table` when performance matters or complexity grows.
- Always use `ref()` to link models; hardcoding paths will break your project at scale.
- On Databricks, prefer Delta over Parquet for better reliability, query speed, and flexibility.