Hey Devs 👋,
If you're diving into data engineering and want to explore monitoring internal database metrics, this post is for you.
As an Associate Data Engineer Intern, I've been learning by building, and this time I wanted to peek under the hood of ClickHouse, a blazing-fast OLAP database.
So I built a mini-project that automates the extraction of system-level metrics from ClickHouse every hour using Airflow, Docker, and Python.
Here's what the project does, how it works, and what I learned.
What This Project Does
This mini-pipeline automates:
✅ Connecting to a running ClickHouse instance
✅ Querying the system.metrics table for real-time internal metrics
✅ Using Airflow to schedule this task hourly
✅ Appending the results to a daily CSV file
✅ Running everything inside Docker containers
It's a great way to practice how monitoring, scheduling, and data capture all come together in real-world setups.
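To make the scheduling piece concrete, here's a rough sketch of what an hourly DAG for this could look like. The DAG and task names (and the stub callable) are just illustrative, not lifted from the repo:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_metrics(**context):
    # Stub for the extraction step; the real query + CSV logic is sketched further down.
    # Airflow passes the run's logical timestamp in the context it hands to the callable.
    print(f"Would query system.metrics for run {context['ts']}")


with DAG(
    dag_id="clickhouse_metrics_hourly",   # illustrative name
    start_date=datetime(2025, 6, 1),
    schedule_interval="@hourly",          # one run per hour
    catchup=False,                        # don't backfill missed hours
) as dag:
    PythonOperator(
        task_id="extract_clickhouse_metrics",
        python_callable=extract_metrics,
    )
```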
The Tech Stack
- Python - to connect to ClickHouse using clickhouse-connect
- Airflow - to orchestrate hourly metric pulls
- ClickHouse - the OLAP database we're extracting metrics from
- Docker - to run ClickHouse + Airflow locally
- CSV files - to store hourly metric snapshots
How It Works
- The Airflow DAG runs every hour
- The DAG triggers a Python script that connects to ClickHouse
- The script runs a query on the system.metrics table
- Results are appended to a CSV file for that day (e.g., metrics-2025-06-21.csv)
- Airflow handles logging and retries in case anything breaks

You'll end up with a growing CSV file full of hourly metrics: a super simple, readable log of system behavior.
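Here's roughly what the extraction script itself could look like, assuming default local connection settings (host, port, credentials and the output folder are placeholders, so the repo's version may differ):

```python
import csv
import os

import clickhouse_connect


def extract_metrics(run_ts: str, output_dir: str = "output") -> None:
    """Query system.metrics and append one timestamped row per metric to today's CSV."""
    # Placeholder connection settings; match these to your docker-compose setup
    client = clickhouse_connect.get_client(host="localhost", port=8123, username="default")

    result = client.query("SELECT metric, value FROM system.metrics")

    day = run_ts[:10]                                    # e.g. "2025-06-21"
    csv_path = os.path.join(output_dir, f"metrics-{day}.csv")
    os.makedirs(output_dir, exist_ok=True)
    write_header = not os.path.exists(csv_path)

    # Append mode, so every hourly run adds to the same daily file
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["timestamp", "metric_name", "value"])
        for metric_name, value in result.result_rows:
            writer.writerow([run_ts, metric_name, value])


if __name__ == "__main__":
    extract_metrics("2025-06-21 14:00:00")
```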
Project Structure
extract_clickhouse_metrics/
└── airflow-docker/
    ├── dags/                  # Airflow DAG (scheduling logic)
    ├── scripts/               # Python script to connect + extract
    ├── output/                # Daily CSV metric logs
    ├── docker-compose.yaml    # Runs Airflow + ClickHouse together
    ├── requirements.txt
    └── logs/                  # Airflow logs
Full repo here:
GitHub: mohhddhassan/extract_clickhouse_metrics
Key Learnings
✅ How to use clickhouse-connect to query ClickHouse from Python
✅ Passing a dynamic execution time via the Airflow context
✅ How to append to a CSV with a proper, timestamped structure
✅ Setting up Airflow + ClickHouse in a Dockerized workflow
✅ Building habits around logging and modular DAG design
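On the "execution time via context" point specifically: in Airflow 2 the run's timestamps arrive as keyword arguments, so the callable can derive both the row timestamp and the daily filename from them. A tiny sketch with made-up names:

```python
def build_csv_target(**context) -> tuple[str, str]:
    """Derive the row timestamp and the per-day CSV path from Airflow's context."""
    run_ts = context["ts"]                      # e.g. "2025-06-21T14:00:00+00:00"
    day = context["ds"]                         # e.g. "2025-06-21"
    return run_ts, f"output/metrics-{day}.csv"  # one file per day, appended to hourly


# Airflow fills ts/ds in for real runs; here's a hand-built example:
print(build_csv_target(ts="2025-06-21T14:00:00+00:00", ds="2025-06-21"))
```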
Sample Metric Snapshot (CSV Output)
Here's what the output CSV looks like:
timestamp,metric_name,value
2025-06-21 14:00:00,Query,120
2025-06-21 14:00:00,Merge,3
...
Each run adds a fresh row for every metric in system.metrics, timestamped by Airflow's execution context.
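If you want to poke at a day's snapshot, the file loads straight into pandas (the filename below is just the example from above):

```python
import pandas as pd

# Load one day's snapshot file
df = pd.read_csv("output/metrics-2025-06-21.csv", parse_dates=["timestamp"])

# How did the number of running queries evolve over the day?
queries = df[df["metric_name"] == "Query"].sort_values("timestamp")
print(queries[["timestamp", "value"]])
```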
What's Next?
- Store the metrics in ClickHouse or PostgreSQL, not just CSV
- Push daily CSV files to S3 or Google Cloud Storage
- Use Grafana or Streamlit to visualize trends
- Extract from other tables like system.events or system.asynchronous_metrics
Why This Matters
Learning data engineering isn't just about moving business data.
Understanding the health of your systems (databases, pipelines, infra) is just as important. This project gave me insight into how ClickHouse tracks its internal activity, and how to automate capturing it for future analysis.
It's simple, but powerful.
About Me
Mohamed Hussain S
Associate Data Engineer Intern
LinkedIn | GitHub
Learning in public, one cron job at a time.