Getting Started With DuckDB in Python: A Fast and Lightweight Analytics Database

DuckDB is an embedded SQL OLAP engine that runs analytical queries on local files and DataFrames in Python, with minimal setup and no external server.

By Praveen Kumar Myakala · Jul. 28, 25 · Tutorial

Data professionals today are looking for tools that combine speed, ease of use, and flexibility. DuckDB is one such tool that is gaining popularity quickly. It is an embedded SQL OLAP database that lets you run fast analytical queries right inside your Python environment. Whether you are working in a Jupyter notebook or building a local pipeline, DuckDB helps you explore and analyze data efficiently without needing a full-scale database setup.

Think of DuckDB as SQLite's analytical cousin. While SQLite is designed for transactional workloads, DuckDB is optimized for analytical tasks. It reads Parquet and CSV files directly, integrates easily with pandas, and runs SQL with minimal overhead. This tutorial will walk you through everything you need to start using DuckDB with Python.

Step 1: Installing DuckDB

Getting started with DuckDB is simple. You can install it using pip. There is no server to configure, and no background process to run. Just install it like any other Python package.

pip install duckdb


Once installed, you can import it and begin writing queries right away. DuckDB runs in-process, which means you do not need to manage separate database instances. This makes it perfect for interactive data analysis and rapid prototyping.

Step 2: Creating and Querying a DataFrame

Let us begin by creating a basic pandas DataFrame. This will serve as our sample dataset to test DuckDB’s SQL capabilities.

import duckdb
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 40],
    'salary': [70000, 80000, 90000, 100000]
})


DuckDB can query this DataFrame directly. You do not need to load the data into a separate table. You can just write SQL on it as if it were a table.

result = duckdb.query("SELECT name, salary FROM df WHERE age > 30").to_df()
print(result)


This kind of integration is extremely useful if you are more comfortable with SQL than pandas syntax. It also lets you run more complex queries without manually writing loops or lambda functions.

Step 3: Aggregation and Statistics with SQL

DuckDB supports all common SQL operations, including grouping, aggregation, and filtering. For example, you can calculate average salary and maximum age from the same DataFrame:

duckdb.query("SELECT AVG(salary) AS avg_salary, MAX(age) AS oldest FROM df").to_df()


This provides a clean, declarative way to analyze your data. You do not need to switch between pandas and SQL. You can stay in SQL for most of your work, which is especially helpful when working in teams where SQL is the shared language.

Step 4: Reading CSV and Parquet Files Efficiently

One of the most impressive features of DuckDB is that it can read CSV and Parquet files directly from disk without loading the entire dataset into memory. This is incredibly useful when you work with large files or want to perform quick filtering before loading data into memory.

# Read a CSV file directly
duckdb.query("SELECT * FROM 'employees.csv' WHERE department = 'Engineering'").to_df()

# Read a Parquet file
duckdb.query("SELECT name, salary FROM 'employees.parquet' WHERE salary > 90000").to_df()


These queries are executed efficiently and only scan the necessary parts of the file. This gives DuckDB an edge over pandas when dealing with large datasets.

Step 5: Using DuckDB as a Persistent or In-Memory Database

DuckDB can operate in both in-memory mode and persistent mode. If you are just experimenting or doing exploratory work, an in-memory session is perfect. If you want to save tables and reuse them later, you can use a file-based database.

# In-memory session
con = duckdb.connect()
con.register('people', df)
con.execute("SELECT COUNT(*) FROM people WHERE salary > 80000").fetchall()

# Persistent session
con = duckdb.connect('my_duckdb_file.duckdb')


This flexibility makes DuckDB suitable for quick experiments as well as reproducible scripts or pipelines.

Step 6: Comparing Performance on Large Datasets

To demonstrate how DuckDB compares to pandas, let us run a basic aggregation over one million rows.

import numpy as np
import time

large_df = pd.DataFrame({
    'id': np.random.randint(1, 10000, size=1_000_000),
    'value': np.random.rand(1_000_000)
})

# Using pandas
start = time.time()
pandas_result = large_df.groupby(large_df['id'] % 10)['value'].sum()
print("Pandas time:", time.time() - start)

# Using DuckDB
duckdb.register('large', large_df)
start = time.time()
duck_result = duckdb.query("SELECT id % 10 AS group_id, SUM(value) FROM large GROUP BY group_id").to_df()
print("DuckDB time:", time.time() - start)


In most tests, DuckDB performs faster than pandas, especially when aggregating or joining large datasets. It also consumes less memory, since it uses streaming execution rather than loading everything at once.

Step 7: Real-World Use Cases for DuckDB

DuckDB fits into a variety of workflows. Analysts can use it to explore large files locally. Data engineers can use it in ETL jobs for filtering and transforming data. Researchers can use it to run reproducible queries in Jupyter notebooks.

Some practical examples include:

  • Generating summaries from large Parquet datasets without loading them into memory
  • Combining multiple CSVs and filtering them with SQL
  • Building local dashboards that query data in real time using DuckDB under the hood

You can also use DuckDB with other tools such as dbt, Polars, or Apache Arrow, making it a strong option for modern data workflows.

Summary

DuckDB is a powerful tool that brings fast, in-process SQL to your Python projects. It offers the ease of SQL with the performance of a native engine and fits neatly into existing workflows involving pandas, CSV, and Parquet.

If you are looking for a lightweight, modern alternative to heavy data warehouse solutions for local analysis, DuckDB is worth a try. It works out of the box, performs well on large data, and supports many real-world use cases with minimal setup.

Tags: Analytics, Database, Python (language)

Opinions expressed by DZone contributors are their own.
