Getting Started With DuckDB in Python: A Fast and Lightweight Analytics Database
DuckDB is an embedded SQL OLAP engine that runs analytical queries on local files and DataFrames in Python, with minimal setup and no external server.
Data professionals today are looking for tools that combine speed, ease of use, and flexibility. DuckDB is one such tool that is gaining popularity quickly. It is an embedded SQL OLAP database that lets you run fast analytical queries right inside your Python environment. Whether you are working in a Jupyter notebook or building a local pipeline, DuckDB helps you explore and analyze data efficiently without needing a full-scale database setup.
Think of DuckDB as SQLite's analytical cousin. While SQLite is designed for transactional workloads, DuckDB is optimized for analytical tasks. It reads Parquet and CSV files directly, integrates easily with pandas, and runs SQL with minimal overhead. This tutorial will walk you through everything you need to start using DuckDB with Python.
Step 1: Installing DuckDB
Getting started with DuckDB is simple. You can install it using pip. There is no server to configure, and no background process to run. Just install it like any other Python package.
pip install duckdb
Once installed, you can import it and begin writing queries right away. DuckDB runs in-process, which means you do not need to manage separate database instances. This makes it perfect for interactive data analysis and rapid prototyping.
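To confirm the installation, you can run a trivial query. This is a minimal sanity check that assumes nothing beyond the package itself:
import duckdb
# Print the installed version and run a throwaway query
print(duckdb.__version__)
print(duckdb.query("SELECT 42 AS answer").fetchall())  # [(42,)]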
Step 2: Creating and Querying a DataFrame
Let us begin by creating a basic pandas DataFrame. This will serve as our sample dataset to test DuckDB’s SQL capabilities.
import duckdb
import pandas as pd

# Sample dataset for the examples that follow
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 40],
    'salary': [70000, 80000, 90000, 100000]
})
DuckDB can query this DataFrame directly. You do not need to load the data into a separate table. You can just write SQL on it as if it were a table.
result = duckdb.query("SELECT name, salary FROM df WHERE age > 30").to_df()
print(result)
This kind of integration is extremely useful if you are more comfortable with SQL than pandas syntax. It also lets you run more complex queries without manually writing loops or lambda functions.
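As a small illustration, here is a query that ranks employees by salary using a window function, something that would take noticeably more code in plain pandas. It runs against the same df defined above:
# Rank employees by salary, highest first
ranked = duckdb.query("""
    SELECT name, salary,
           RANK() OVER (ORDER BY salary DESC) AS salary_rank
    FROM df
""").to_df()
print(ranked)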
Step 3: Aggregation and Statistics with SQL
DuckDB supports all common SQL operations, including grouping, aggregation, and filtering. For example, you can calculate average salary and maximum age from the same DataFrame:
duckdb.query("SELECT AVG(salary) AS avg_salary, MAX(age) AS oldest FROM df").to_df()
This provides a clean, declarative way to analyze your data. You do not need to switch between pandas and SQL. You can stay in SQL for most of your work, which is especially helpful when working in teams where SQL is the shared language.
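Grouping works the same way. As a sketch, the query below buckets the sample employees by an age threshold (the cutoff of 35 is purely illustrative) and aggregates within each bucket:
# Average salary per age bucket; the 35-year cutoff is just an example
duckdb.query("""
    SELECT age >= 35 AS is_35_or_older,
           COUNT(*) AS headcount,
           AVG(salary) AS avg_salary
    FROM df
    GROUP BY is_35_or_older
""").to_df()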
Step 4: Reading CSV and Parquet Files Efficiently
One of the most impressive features of DuckDB is that it can read CSV and Parquet files directly from disk without loading the entire dataset into memory. This is incredibly useful when you work with large files or want to perform quick filtering before loading data into memory.
# Read a CSV file directly
duckdb.query("SELECT * FROM 'employees.csv' WHERE department = 'Engineering'").to_df()
# Read a Parquet file
duckdb.query("SELECT name, salary FROM 'employees.parquet' WHERE salary > 90000").to_df()
These queries execute efficiently: for Parquet files, DuckDB reads only the columns and row groups a query actually needs, and CSV files are streamed rather than loaded wholesale. This gives DuckDB an edge over pandas when dealing with large datasets.
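You can also transform a file and write the result straight back to disk without ever materializing a DataFrame. The sketch below reuses the hypothetical employees.csv from above and writes a filtered copy as Parquet via DuckDB's COPY statement:
# Filter a CSV and write the result as Parquet, entirely inside DuckDB
duckdb.execute("""
    COPY (SELECT * FROM 'employees.csv' WHERE department = 'Engineering')
    TO 'engineering.parquet' (FORMAT PARQUET)
""")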
Step 5: Using DuckDB as a Persistent or In-Memory Database
DuckDB can operate in both in-memory mode and persistent mode. If you are just experimenting or doing exploratory work, an in-memory session is perfect. If you want to save tables and reuse them later, you can use a file-based database.
# In-memory session
con = duckdb.connect()
con.register('people', df)
con.execute("SELECT COUNT(*) FROM people WHERE salary > 80000").fetchall()
# Persistent session
con = duckdb.connect('my_duckdb_file.duckdb')
This flexibility makes DuckDB suitable for quick experiments as well as reproducible scripts or pipelines.
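To make the persistent mode concrete, here is a minimal sketch: it saves the sample DataFrame as a table, closes the connection, then reopens the same file and queries the saved data. The file name carries over from the example above.
con = duckdb.connect('my_duckdb_file.duckdb')
# Persist the DataFrame as a real table inside the database file
con.execute("CREATE OR REPLACE TABLE people AS SELECT * FROM df")
con.close()

# Later, or in another script: the table is still there
con = duckdb.connect('my_duckdb_file.duckdb')
print(con.execute("SELECT COUNT(*) FROM people").fetchone())
con.close()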
Step 6: Comparing Performance on Large Datasets
To demonstrate how DuckDB compares to pandas, let us run a basic aggregation over one million rows.
import numpy as np
import time
# One million rows of random data for the comparison
large_df = pd.DataFrame({
    'id': np.random.randint(1, 10000, size=1_000_000),
    'value': np.random.rand(1_000_000)
})
# Using pandas
start = time.time()
pandas_result = large_df.groupby(large_df['id'] % 10)['value'].sum()
print("Pandas time:", time.time() - start)
# Using DuckDB
duckdb.register('large', large_df)
start = time.time()
duck_result = duckdb.query("SELECT id % 10 AS group_id, SUM(value) FROM large GROUP BY group_id").to_df()
print("DuckDB time:", time.time() - start)
In most tests, DuckDB outperforms pandas, especially when aggregating or joining large datasets. It also tends to consume less memory, since its engine processes data in vectorized, streaming fashion rather than materializing full intermediate copies.
Step 7: Real-World Use Cases for DuckDB
DuckDB fits into a variety of workflows. Analysts can use it to explore large files locally. Data engineers can use it in ETL jobs for filtering and transforming data. Researchers can use it to run reproducible queries in Jupyter notebooks.
Some practical examples include:
- Generating summaries from large Parquet datasets without loading them into memory
- Combining multiple CSVs and filtering them with SQL (see the sketch after this list)
- Building local dashboards that query data in real time using DuckDB under the hood
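For the multi-CSV case, DuckDB accepts glob patterns directly in the FROM clause. The sketch below assumes a hypothetical data/ directory of CSV files that share a schema with a salary column:
# Query every CSV in a folder as if it were one table
combined = duckdb.query("""
    SELECT *
    FROM 'data/*.csv'
    WHERE salary > 50000
""").to_df()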
You can also use DuckDB with other tools such as dbt, Polars, or Apache Arrow, making it a strong option for modern data workflows.
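As one example of that interoperability, a DuckDB query result can be handed over as an Apache Arrow table. This assumes pyarrow is installed alongside duckdb:
# Fetch a query result as a pyarrow.Table instead of a pandas DataFrame
arrow_table = duckdb.query("SELECT name, salary FROM df").arrow()
print(arrow_table.schema)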
Summary
DuckDB is a powerful tool that brings fast, in-process SQL to your Python projects. It offers the ease of SQL with the performance of a native engine and fits neatly into existing workflows involving pandas, CSV, and Parquet.
If you are looking for a lightweight, modern alternative to heavy data warehouse solutions for local analysis, DuckDB is worth a try. It works out of the box, performs well on large data, and supports many real-world use cases with minimal setup.