Extracting System Metrics from ClickHouse Using Airflow & Docker

Hey Devs 👋,

If you're diving into data engineering and want to explore monitoring internal database metrics, this post is for you.

As an Associate Data Engineer Intern, I've been learning by building, and this time I wanted to peek under the hood of ClickHouse, a blazing-fast OLAP database.
So I built a mini-project to automate the extraction of system-level metrics from ClickHouse every hour using Airflow, Docker, and Python.

Here's what the project does, how it works, and what I learned 👇


📊 What This Project Does

This mini-pipeline automates:

✅ Connecting to a running ClickHouse instance
✅ Querying the system.metrics table for real-time internal metrics
✅ Using Airflow to schedule this task hourly
✅ Appending the results to a daily CSV file
✅ Running everything inside Docker containers

It's a great way to see how monitoring, scheduling, and data capture come together in real-world setups.


🧰 The Tech Stack

  • Python - to connect to ClickHouse using clickhouse-connect
  • Airflow - to orchestrate hourly metric pulls
  • ClickHouse - the OLAP database we're extracting metrics from
  • Docker - to run ClickHouse + Airflow locally
  • CSV files - to store hourly metric snapshots
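For a quick taste of the clickhouse-connect piece, here's a minimal sketch of opening a client and querying system.metrics (the host and port assume a local Dockerized ClickHouse exposing the default HTTP interface; adjust for your setup):

```python
# Minimal sketch: query system.metrics with clickhouse-connect.
# host/port assume a local ClickHouse container on the default HTTP port.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)
result = client.query("SELECT metric, value FROM system.metrics")

# Print the first few metrics to sanity-check the connection
for metric, value in result.result_rows[:5]:
    print(metric, value)
```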

βš™οΈ How It Works

  1. Airflow DAG runs every hour
  2. DAG triggers a Python script that connects to ClickHouse
  3. Script runs a query on the system.metrics table
  4. Results are appended to a CSV file for that day (e.g., metrics-2025-06-21.csv)
  5. Airflow handles logging and retries in case anything breaks

You'll end up with a growing CSV file full of hourly metrics: a simple, readable log of system behavior.
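To make the flow concrete, here's a minimal sketch of what such a DAG could look like. The dag_id, task name, and the scripts.extract import are placeholders for illustration, not the repo's actual names:

```python
# Sketch of an hourly DAG wiring steps 1-5 together (names are placeholders).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from scripts.extract import extract_metrics  # hypothetical extraction function

with DAG(
    dag_id="extract_clickhouse_metrics",
    start_date=datetime(2025, 6, 1),
    schedule_interval="@hourly",  # step 1: run every hour
    catchup=False,
) as dag:
    PythonOperator(
        task_id="extract_metrics",
        python_callable=extract_metrics,  # steps 2-4: connect, query, append
        retries=2,                        # step 5: Airflow retries on failure
    )
```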



πŸ—‚οΈ Project Structure

extract_clickhouse_metrics/
└── airflow-docker/
    ├── dags/                  # Airflow DAG (scheduling logic)
    ├── scripts/               # Python script to connect + extract
    ├── output/                # Daily CSV metric logs
    ├── docker-compose.yaml    # Runs Airflow + ClickHouse together
    ├── requirements.txt       # Python dependencies
    └── logs/                  # Airflow task logs

Full repo here:
👉 GitHub: mohhddhassan/extract_clickhouse_metrics


🧠 Key Learnings

✅ How to use clickhouse-connect to query ClickHouse from Python
✅ Passing the dynamic execution time via Airflow's context (see the sketch after this list)
✅ Appending to a CSV with a proper, timestamped structure
✅ Setting up Airflow + ClickHouse in a Dockerized workflow
✅ Building habits around logging and modular DAG design
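On the second point, here's a hedged sketch of how a task can read its scheduled run time from the Airflow context. The function name and output path are assumptions for illustration; recent Airflow versions expose the run's timestamp as logical_date (older ones call it execution_date):

```python
# Sketch: Airflow 2 injects the run context into the callable's kwargs,
# so the task can derive the day's CSV file from its logical (execution) date.
def extract_metrics(**context):
    logical_date = context["logical_date"]  # the run's scheduled timestamp
    csv_path = f"output/metrics-{logical_date:%Y-%m-%d}.csv"
    print(f"Appending metrics for {logical_date} to {csv_path}")
```

Using the scheduled time (rather than datetime.now()) keeps rows consistent even when a run starts late or is retried.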


πŸ” Sample Metric Snapshot (CSV Output)

Here's what a few rows of the output CSV look like:

timestamp,metric_name,value
2025-06-21 14:00:00,Query,120
2025-06-21 14:00:00,Merge,3
...

Each run appends one fresh row per metric in system.metrics, timestamped with Airflow's execution context.
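The append step itself can be as small as this sketch. The host name "clickhouse" assumes a compose service of that name, and the function signature and output directory are illustrative:

```python
# Sketch: query system.metrics and append one timestamped row per metric
# to the day's CSV, writing the header only when the file is new.
import csv
import os
from datetime import datetime

import clickhouse_connect

def append_metrics(run_time: datetime, output_dir: str = "output") -> None:
    # "clickhouse" assumes the compose service name; adjust for your setup
    client = clickhouse_connect.get_client(host="clickhouse", port=8123)
    result = client.query("SELECT metric, value FROM system.metrics")

    path = os.path.join(output_dir, f"metrics-{run_time:%Y-%m-%d}.csv")
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["timestamp", "metric_name", "value"])
        for metric, value in result.result_rows:
            writer.writerow([f"{run_time:%Y-%m-%d %H:%M:%S}", metric, value])
```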


🔧 What's Next?

🚀 Store the metrics in ClickHouse or PostgreSQL, not just CSV
📦 Push daily CSV files to S3 or Google Cloud Storage
📊 Use Grafana or Streamlit to visualize trends
🔁 Extract from other tables like system.events or system.asynchronous_metrics


📌 Why This Matters

Learning data engineering isn't just about moving business data.

Understanding the health of your systems (like DBs, pipelines, infra) is just as important. This project gave me insight into how ClickHouse tracks internal activity, and how to automate its capture for future analysis.

It's simple, but powerful.


πŸ™‹β€β™‚οΈ About Me

Mohamed Hussain S
Associate Data Engineer Intern
LinkedIn | GitHub


⏱️ Learning in public, one cron job at a time.

