DEV Community

Mahesh Ravaji

Prometheus Concepts: A Comprehensive Guide with Definitions, Examples, and PromQL Queries

**1. Observability**
Definition: Observability is the practice of understanding and monitoring the internal state of a system by analyzing its data outputs. It enables teams to gain insights, troubleshoot effectively, and proactively manage system health.
Examples:
A microservices-based application with multiple dependencies needs observability to identify which component causes slowdowns or failures. Prometheus covers the metrics side of this picture; combined with separate logging and tracing tools, it enables quick root-cause analysis.

**2. Logging, Metrics, and Traces**
Logging: Records discrete events, often with a timestamp and context, helpful for tracking specific actions.
Example: Application logs record login attempts with time and status (success or failure).

Metrics: Quantitative data about the system, such as CPU usage or response time.
Example: A metric showing CPU usage over time helps identify when the system experiences high loads.

PromQL: node_cpu_seconds_total is a Node Exporter counter of CPU seconds consumed, broken down per core and per mode (for example idle, user, system).

Traces: Capture the flow of a single request across various services, used in distributed systems to debug issues across microservices.
Example: Tracing a user request across services to detect bottlenecks.


**3. Service Level Indicator (SLI), Service Level Objective (SLO), and Service Level Agreement (SLA)**
SLI: A metric reflecting a key aspect of service performance (e.g., latency, availability).
Example: Average response time for user requests.

SLO: A specific goal for an SLI (e.g., 99% availability).
Example: "99.9% of requests should have a latency under 100ms."

SLA: A contractual commitment based on SLOs, often with penalties.
Example: SLA guarantees uptime, and if the service goes below it, the provider incurs penalties.
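
To make the link between an SLI and an SLO concrete, here is a minimal Python sketch. The function names and request counts are hypothetical, chosen only for illustration. It computes an availability SLI from request counts and then the share of the error budget that is still unspent under a 99.9% SLO.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded (the SLI)."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic is treated as fully available
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget still unspent (negative once the SLO is breached)."""
    allowed_failure = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    if allowed_failure <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


# Hypothetical numbers: 50 failed requests out of 100,000.
sli = availability_sli(good_requests=99_950, total_requests=100_000)
budget = error_budget_remaining(sli, slo_target=0.999)
print(f"SLI={sli:.4%}, error budget remaining={budget:.0%}")
```

A negative result means the service has already failed more requests than the SLO allows for the period being measured.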

PromQL Example for SLI:
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
This calculates the 99th percentile of request duration across all series, useful for setting SLOs based on latency.
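
It can help to see roughly how a quantile is estimated from histogram buckets. The sketch below is a simplified Python approximation of the idea (linear interpolation inside the bucket that contains the target rank), not Prometheus's actual implementation. The bucket bounds and counts are made-up example data.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)           # ascending by upper bound; +Inf sorts last
    total = buckets[-1][1]              # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total                    # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0   # assumption: the lowest bucket starts at 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                return prev_bound       # cannot interpolate into +Inf; use the last finite bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical cumulative counts for request durations in seconds.
example_buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.9, example_buckets))
```

Because of the interpolation, the result is only as precise as the bucket layout: narrow buckets around the SLO threshold give a better estimate.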

**4. Prometheus and its Components**
Definition: Prometheus is an open-source monitoring and alerting toolkit that uses a time-series database to store metrics data, which can be queried and visualized.

Components:
Prometheus Server: Collects metrics by scraping configured endpoints, stores them, and evaluates alerting rules.
Alertmanager: Receives the alerts fired by those rules and handles grouping, deduplication, and routing to notification channels.
PromQL: A powerful query language for querying metric data.

Real-World Example: In a Kubernetes setup, Prometheus can monitor pod CPU and memory usage, while Alertmanager sends notifications when usage exceeds set thresholds.
Prometheus VisualMap :)

**5. Counters, Gauges, Histograms, and Summaries**
Counter: A metric type whose value only increases (or resets to zero when the process restarts), used for counting occurrences.
Example: Total HTTP requests (http_requests_total).

Gauge: A metric that can go up or down, like current temperature or memory usage.
Example: node_memory_MemAvailable_bytes to monitor the memory available to applications.

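Histogram: Counts observations (e.g., request durations) in configurable buckets, and also tracks their total sum and count.
Example: http_request_duration_seconds_bucket holds the cumulative count of requests per duration bucket.

Summary: Similar to a histogram, but the quantiles are pre-configured and calculated on the client side.
Example: http_request_duration_seconds, exposed as a summary, carries a quantile label with the percentiles of request duration.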

PromQL Example:
rate(http_requests_total[5m])
For a counter, this returns the per-second request rate, averaged over a 5-minute window.
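
To show why rate() suits counters, here is a simplified Python sketch of the idea: sum the increases between consecutive samples, treat any drop as a counter reset, and divide by the elapsed time. This is an approximation for illustration only. The real rate() also extrapolates to the edges of the window. The sample data is invented.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter over (timestamp_seconds, value) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur < prev:
            # The counter went down, so assume the process restarted from zero.
            increase += cur
        else:
            increase += cur - prev
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical scrapes every 15 seconds, with a restart before the fourth sample.
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(samples))
```

Without the reset handling, the restart would show up as a large negative jump, which is why rate() should be applied to counters and not to gauges.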

**6. PromQL (Prometheus Query Language)**
Definition: A flexible language used to query Prometheus metrics for real-time data and historical analysis.
Example:
sum(rate(cpu_usage_seconds_total[1m])) by (instance)
This query calculates the CPU usage rate per instance over a 1-minute window.

Aggregation Example:
avg_over_time(memory_usage_bytes[5m])
This shows average memory usage over a 5-minute window, often used in dashboards or alerts.

**7. Alerting and Alertmanager**
Definition: Alerting is the process of notifying teams about specific conditions in the system. Alertmanager manages alert delivery, deduplication, and routing.
Example: If CPU usage exceeds 80% on any instance for over 5 minutes, send an alert.

PromQL Alert Rule:

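    - alert: HighCPUUsage
      expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8
      for: 5m
      labels:
        severity: "warning"
      annotations:
        summary: "High CPU usage detected"

The expression subtracts the idle fraction from 1, so it measures the busy share of CPU time per instance.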
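
The for: clause is easy to misread, so here is a small Python sketch of how an alert moves from inactive to pending to firing. It is a simplified model under stated assumptions: one evaluation per minute, and invented CPU readings. The function name is hypothetical.

```python
def alert_states(evaluations: list[tuple[float, bool]], for_seconds: float) -> list[str]:
    """Return the alert state after each (timestamp_seconds, condition_is_true) evaluation."""
    states = []
    active_since = None  # time at which the condition first became true
    for ts, breached in evaluations:
        if not breached:
            active_since = None          # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        # The alert fires only once the condition has held for the whole `for` duration.
        states.append("firing" if ts - active_since >= for_seconds else "pending")
    return states


# Hypothetical CPU usage readings, one per minute, checked against the 0.8 threshold.
cpu_usage = [0.5, 0.9, 0.85, 0.9, 0.95, 0.9, 0.92, 0.4]
evaluations = [(i * 60, value > 0.8) for i, value in enumerate(cpu_usage)]
states = alert_states(evaluations, for_seconds=300)
print(states)
```

The pending phase is what keeps a short CPU spike from paging anyone: the notification goes out only after the threshold has been exceeded continuously for 5 minutes.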

**8. Exporters**
Definition: Exporters collect metrics from applications and systems, exposing them in a Prometheus-compatible format.
Example: The Node Exporter provides metrics for Linux systems, such as disk and memory usage.

Common Exporters:
Node Exporter: System metrics.
MySQL Exporter: Database performance metrics.
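
To give a feel for what "Prometheus-compatible format" means, the Python sketch below renders one metric in the text exposition format that exporters serve over HTTP. It is a minimal illustration, not a replacement for an official client library: it does not escape special characters in label values, and the metric name demo_disk_free_bytes is made up.

```python
def render_metric(name: str, help_text: str, metric_type: str,
                  samples: list[tuple[dict[str, str], float]]) -> str:
    """Render one metric family as Prometheus text exposition lines."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            # Labels are written as key="value" pairs inside curly braces.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"  # the format expects a trailing newline


text = render_metric(
    "demo_disk_free_bytes",           # hypothetical metric name
    "Free disk space in bytes.",
    "gauge",
    [({"mountpoint": "/"}, 1024)],
)
print(text, end="")
```

An exporter is essentially an HTTP endpoint that returns text like this each time Prometheus scrapes it.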

Prometheus Configuration Example:

    scrape_configs:
      - job_name: "node"
        static_configs:
          - targets: ["localhost:9100"]

This configures Prometheus to scrape the Node Exporter at port 9100.

**9. Push vs. Pull-Based Monitoring**
Pull-Based Model: Prometheus pulls metrics by scraping HTTP endpoints at a regular interval.
Example: Prometheus scrapes Node Exporter metrics every 15 seconds.

Push-Based Model: Jobs push their metrics to a Pushgateway, which Prometheus then scrapes like any other target.
Example: Short-lived batch jobs that exit before they could be scraped send their data to the Pushgateway.
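
As a rough illustration of the push side, the Python sketch below builds the HTTP request a short-lived job could send to a Pushgateway, using only the standard library. The job name nightly_backup and the metric name are hypothetical, the gateway address assumes a local instance on the default port 9091, and the sketch assumes the job name contains no slashes. In practice an official client library is the safer choice.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(gateway: str, job: str, body: str) -> urllib.request.Request:
    """Build a PUT request that replaces all metrics stored for the given job."""
    url = f"{gateway.rstrip('/')}/metrics/job/{quote(job, safe='')}"
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),  # metrics in text exposition format
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


def push(gateway: str, job: str, body: str, timeout: float = 5.0) -> int:
    """Send the metrics and return the HTTP status code (requires a running Pushgateway)."""
    with urllib.request.urlopen(build_push_request(gateway, job, body), timeout=timeout) as resp:
        return resp.status


# Hypothetical metric: when the backup job last succeeded. The body must end with a newline.
request = build_push_request(
    "http://localhost:9091",
    "nightly_backup",
    "backup_last_success_timestamp_seconds 1700000000\n",
)
print(request.get_method(), request.full_url)
```

Calling push(...) at the end of the batch job leaves the metric on the Pushgateway, where Prometheus picks it up on its next scrape.
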

Pushgateway Setup:
In Prometheus config:

    scrape_configs:
      - job_name: 'pushgateway'
        honor_labels: true
        static_configs:
          - targets: ['localhost:9091']

**10. Prometheus Configuration and Setup**
Definition: Prometheus uses a YAML file (prometheus.yml) to configure data scraping, alerting, and targets.
Example: Basic configuration to scrape local Prometheus and Node Exporter:

    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: "prometheus"
        static_configs:
          - targets: ["localhost:9090"]
      - job_name: "node"
        static_configs:
