Ivan Pesenti

Posted on May 28

Take it easy with Graphite and Docker 🐳

#go #graphite #docker #performance

I recently got stuck at work while writing an end-to-end test against a web server that exposes its capabilities via REST APIs. The system under test was supposed to write some metrics to a specific Graphite instance, the one that collects all the metrics emitted in a cloud environment, and I wanted the test to verify that. The desired workflow was:

1. Run a test that invokes an HTTP endpoint with an invalid request
2. The HTTP endpoint processes the request and sends the response back to the client
3. At the same time, the HTTP endpoint writes a metric to the Graphite server to record the failing request
4. The test, as part of its assertions, also checks whether the metric has been emitted

Nothing too fancy, you might think... But I could not get a reliable result, so I decided to spend some time digging into it 🔎.

TLDR the root cause 🌳

There were a couple of things worth noting in this approach. First and foremost, this check isn't meaningful in the context of an end-to-end test. Such a test should check the behavior exposed by the System Under Test, not internals like emitting a metric or writing a log entry.

The second problem was the query used to retrieve the desired metric. In particular, the from query parameter of the Graphite /render API played a crucial role in getting the expected result (I'll cover that later).

Another issue was the lack of knowledge of the Graphite Server's configuration.

The list could go on, but I prefer to stop here to preserve my developer reputation 😆.

Graphite

The first thing to do was to improve my Graphite knowledge to understand its internals and fully control it 🪄. There you go. Below is a bulleted list with some of the concepts you might need to be aware of:

Graphite is a time-series database or TSDB. Graphite Documentation
In case you're not familiar with TSDBs, please refer to this documentation. It's vital that you understand how they work before proceeding with this blog post.
There are three components in the Graphite infrastructure:
1. Carbon: the backend of Graphite listening for time-series data. It can ingest data via different protocols. We'll use the plaintext protocol
2. Whisper: the fixed-size database file format used to store the data points received by Carbon (one binary file per metric, not a text file)
3. Graphite-Web: UI application used to render graphs and dashboards
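There are two kinds of metrics you will typically meet when working with Graphite. Graphite itself only stores numeric data points, so these types are conventions that come from the clients feeding it, such as StatsD:
1. Counter: a metric that only increases
2. Gauge: a snapshot of a value at a specific time. This is also used to create Timers (histograms)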
The metrics are organized hierarchically, with the . character separating the levels of a metric path. Good naming is fundamental to organizing and fetching metrics properly, also because queries can use wildcard characters. Take a look at this amazing blog post for further details
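To make the wildcard idea concrete, here is a minimal sketch of how a target pattern relates to dotted metric names. It only approximates Graphite's matching: each dot-separated node is matched on its own, so * never crosses a . boundary. The matchesTarget function is a hypothetical helper of mine, not a Graphite API, and it does not handle brace lists such as {a,b}.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// matchesTarget reports whether a dotted metric name matches a Graphite-style
// target pattern. Each dot-separated node is matched independently, which
// approximates how "*" behaves in a Graphite target.
// Assumption: brace lists like {a,b} are not supported by this sketch.
func matchesTarget(pattern, name string) bool {
	patternNodes := strings.Split(pattern, ".")
	nameNodes := strings.Split(name, ".")
	if len(patternNodes) != len(nameNodes) {
		return false
	}
	for i, p := range patternNodes {
		ok, err := path.Match(p, nameNodes[i])
		if err != nil || !ok {
			return false
		}
	}
	return true
}

func main() {
	names := []string{
		"webserver.get_todo_by_id.success",
		"webserver.get_todo_by_id.errors.invalid_id",
		"webserver.get_todo_by_id.errors.missing_id",
	}
	// Print, for each name, whether it falls under the "errors" subtree.
	for _, n := range names {
		fmt.Println(n, matchesTarget("webserver.get_todo_by_id.errors.*", n))
	}
}
```

With a consistent hierarchy, a single pattern is enough to select every error metric of one handler.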

There's more to cover, but I don't want to go too far off-topic.

Docker Comes to the Rescue 🛡️

I decided to create a small project to experiment and deep dive into Graphite (yes, you can blame me).

You can already guess my saviors: Docker & testcontainers-go. These two guys saved my day.

Spoiler: it was not the first and it won't be the last 😍

Now, let's get our hands dirty.

If you get lost, you can find the full repo on my GitHub account.

I used the typical TODO application since I didn't want to spend extra cognitive load on understanding a more complex app. This allowed me to focus only on the technologies I needed to experiment with.

System Under Test 🏋

The app is a simple web server exposing two HTTP GET routes via REST. Below, I share the most significant files used.

The HTTP Handler 🕸️

The code is in the internal/todos/todos.go file. Here, I share the code for the GetTodoByID handler:

func (t *TodoHandler) GetTodoByID(w http.ResponseWriter, r *http.Request) {
 rawID := r.URL.Query().Get("id")
 if rawID == "" {
  metrics.WriteMetricWithPlaintext(t.GraphiteConn, "webserver.get_todo_by_id.errors.missing_id", 1.0)
  w.WriteHeader(http.StatusBadRequest)
  w.Write([]byte("please provide a TODO ID"))
  return
 }
 id, err := strconv.Atoi(rawID)
 if err != nil {
  metrics.WriteMetricWithPlaintext(t.GraphiteConn, "webserver.get_todo_by_id.errors.invalid_id", 1.0)
  w.WriteHeader(http.StatusBadRequest)
  w.Write([]byte("please provide a numeric TODO ID"))
  return
 }
 for _, v := range todos {
  if v.ID == id {
   data, err := json.MarshalIndent(v, "", "\t")
   if err != nil {
    metrics.WriteMetricWithPlaintext(t.GraphiteConn, "webserver.get_todo_by_id.errors.invalid_format", 1.0)
    w.WriteHeader(http.StatusInternalServerError)
    return
   }
   metrics.WriteMetricWithPlaintext(t.GraphiteConn, "webserver.get_todo_by_id.success", 1.0)
   w.Write(data)
   return
  }
 }
 metrics.WriteMetricWithPlaintext(t.GraphiteConn, "webserver.get_todo_by_id.errors.not_found", 1.0)
 w.WriteHeader(http.StatusNotFound)
 w.Write([]byte("todo not found"))
}

Please look at how I shaped the metrics naming. Keep the names consistent, and it will be much easier to retrieve them and not mess things up.
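One way to keep the names consistent is to build them from a shared prefix instead of repeating string literals. The sketch below is not part of the repo: handlerPrefix and metricName are hypothetical names I'm using only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// handlerPrefix is the shared root of every metric emitted by the handler.
// Hypothetical constant, introduced for this example only.
const handlerPrefix = "webserver.get_todo_by_id"

// metricName joins the shared prefix with the given nodes using ".",
// so every metric name follows the same hierarchy.
func metricName(nodes ...string) string {
	return strings.Join(append([]string{handlerPrefix}, nodes...), ".")
}

func main() {
	fmt.Println(metricName("success"))
	fmt.Println(metricName("errors", "invalid_id"))
}
```

A typo in one of the nodes is then confined to a single call site, and the prefix can be changed in one place.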

Metrics Sending 📩

The source code is in the file internal/metrics/manager.go. The content is:

func WriteMetricWithPlaintext(graphiteConn net.Conn, name string, value float64) {
 if _, err := fmt.Fprintf(graphiteConn, "%s %f %d\n", name, value, time.Now().Unix()); err != nil {
  fmt.Println("error while writing metrics to Graphite:", err.Error())
 }
}

We're sending metrics via the plaintext protocol. The message must adhere to the following string template %s %f %d\n where:

%s is the metric name like webserver.get_todo_by_id.success
%f is the metric value in float64 like 1.0
%d is the timestamp in Unix format like 1748413179

A sample message looks like webserver.get_todo_by_id.success 1.000000 1748413179 (the %f verb prints six decimal places by default), followed by a \n character.

The graphiteConn parameter is a plain TCP connection (a net.Conn), created in the init() function of the cmd/webserver/main.go file:
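To see exactly what ends up on the wire, here is a small sketch that builds the line without sending it. formatPlaintextLine is a hypothetical helper written for this example; it uses the same %s %f %d\n template, with the timestamp passed in so the output is predictable.

```go
package main

import "fmt"

// formatPlaintextLine builds one line of the plaintext protocol:
// "<metric path> <value> <unix timestamp>\n".
// Hypothetical helper, not part of the repo.
func formatPlaintextLine(name string, value float64, unixTs int64) string {
	return fmt.Sprintf("%s %f %d\n", name, value, unixTs)
}

func main() {
	line := formatPlaintextLine("webserver.get_todo_by_id.success", 1.0, 1748413179)
	// %q makes the trailing newline visible.
	fmt.Printf("%q\n", line)
}
```

Keeping the formatting separate from the network write also makes the message easy to unit test.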

func init() {
 graphiteHost := config.GetEnvOrDefault("GRAPHITE_HOSTNAME", "graphite")
 graphitePort := config.GetEnvOrDefault("GRAPHITE_PORT", "2003")
 conn, err := net.Dial("tcp", net.JoinHostPort(graphiteHost, graphitePort))
 if err != nil {
  panic(err)
 }
 todoHandler = todos.NewTodoHandler(conn)
 if todoHandler == nil {
  panic("could not start the application")
 }
}

graphite is the name of the Graphite container we're going to use. Let's see how we can power up our simple yet effective application.

Containerize the Web Server 🎁

The Dockerfile is pretty basic, so I won't spend time covering it.

FROM golang:1.24-alpine AS build

WORKDIR /app

COPY go.mod go.sum ./

RUN go mod download
RUN go mod verify

COPY . .

RUN go build -o webserver cmd/webserver/main.go

FROM alpine

COPY --from=build /app/webserver /webserver

EXPOSE 8080

CMD [ "./webserver" ]

What's most interesting to cover is the Docker Compose file we will be using to start the two containers at once.

Power Up 🔋

To coordinate the startup of the containers, we will use the docker-compose.yml file. The content is below:

services:
  webserver:
    build: "."
    container_name: webserver
    restart: always
    environment:
      - GRAPHITE_HOSTNAME=graphite
      - GRAPHITE_PORT=2003
    ports:
      - 8080:8080
    depends_on:
      graphite:
        condition: service_healthy
    networks:
      - todo-network

  graphite:
    image: graphiteapp/graphite-statsd
    container_name: graphite
    restart: always
    ports:
      - 80:80
      - 2003-2004:2003-2004
      - 2023-2024:2023-2024
      - 8125:8125/udp
      - 8126:8126
    healthcheck:
      test: ["CMD-SHELL", "netstat -an | grep -q 2003"]
      interval: 10s
      retries: 3
      start_period: 30s
      timeout: 10s
    networks:
      - todo-network

networks:
  todo-network:
    driver: bridge

Pay attention to the following key points:

the todo-network was created to make communication between containers possible
the environment variables set in the webserver service to point it at the graphite service
the depends_on condition with service_healthy value defined in the webserver service
the ports mapped in the graphite service (you can map only the ones you need)
the healthcheck defined in the graphite service that links to the depends_on condition defined above

Now, let's see if we can overcome the initial issue and reliably test that the Graphite metrics are emitted.

The Test Code 🪖

Here, the big player is testcontainers-go. If you're curious and want to learn more about it, take a look at the documentation.

I'm a huge fan of this package, and I believe it's something you must try in your next project. Let's engage in a discussion if you want to find out more about how I use it in my projects.

With this package, I'm able to spawn fresh webserver and graphite containers on each test run. This helps to assess metrics correctly, since it provides better isolation and control over what's happening with the Graphite container.

The code I used to interact with the Docker containers is contained in the tests/container.go file:

package tests

import (
 "context"
 "os"
 "testing"

 "github.com/stretchr/testify/require"
 tc "github.com/testcontainers/testcontainers-go/modules/compose"
)

func spawnWebServerContainer(t *testing.T) {
 t.Helper()
 os.Setenv("TESTCONTAINERS_RYUK_DISABLED", "true")
 compose, err := tc.NewDockerComposeWith(tc.WithStackFiles("../docker-compose.yml"))
 require.NoError(t, err)
 t.Cleanup(func() {
  require.NoError(t, compose.Down(context.Background(), tc.RemoveOrphans(true), tc.RemoveImagesLocal))
 })
 ctx, cancel := context.WithCancel(context.Background())
 t.Cleanup(cancel)
 err = compose.
  Up(ctx, tc.Wait(true))
 require.NoError(t, err)
}

It uses the docker-compose.yml file to spin up the containers we need in our integration tests, and it registers the cleanup code that tears them down.

Test the HTTP Handler

The test code for the GetTodoByID handler resides in the tests/get_todo_by_id_test.go file.
An extract of its content is:

func TestGetTodoByID(t *testing.T) {
 spawnWebServerContainer(t)
 client := http.Client{}
 // ... success scenario omitted for brevity
 t.Run("Invalid ID", func(t *testing.T) {
  r, err := http.NewRequestWithContext(context.Background(), http.MethodGet, "http://127.0.0.1:8080/todo?id=abc", nil)
  require.NoError(t, err)
  res, err := client.Do(r)
  require.NoError(t, err)
  // Close the body so the underlying connection can be reused.
  defer res.Body.Close()
  require.Equal(t, http.StatusBadRequest, res.StatusCode)

  // Build the Graphite /render query for the metric we expect.
  baseUrl, err := url.Parse("http://127.0.0.1:80/render")
  require.NoError(t, err)
  params := url.Values{}
  params.Add("target", "webserver.get_todo_by_id.errors.invalid_id")
  params.Add("from", "-5min")
  params.Add("format", "json")
  baseUrl.RawQuery = params.Encode()
  r, err = http.NewRequestWithContext(context.Background(), http.MethodGet, baseUrl.String(), nil)
  require.NoError(t, err)
  // Poll Graphite every 3s for up to 30s: the data point may not be queryable immediately.
  require.EventuallyWithT(t, func(collect *assert.CollectT) {
   emitted, err := isMetricEmitted(client, r, "webserver.get_todo_by_id.errors.invalid_id", 1)
   require.NoError(collect, err)
   require.True(collect, emitted)
  }, time.Second*30, time.Second*3, "metric not emitted enough times")
 })

 t.Run("Missing ID", func(t *testing.T) {
  r, err := http.NewRequestWithContext(context.Background(), http.MethodGet, "http://127.0.0.1:8080/todo?id=", nil)
  require.NoError(t, err)
  res, err := client.Do(r)
  require.NoError(t, err)
  // Close the body so the underlying connection can be reused.
  defer res.Body.Close()
  require.Equal(t, http.StatusBadRequest, res.StatusCode)

  // Build the Graphite /render query for the metric we expect.
  baseUrl, err := url.Parse("http://127.0.0.1:80/render")
  require.NoError(t, err)
  params := url.Values{}
  params.Add("target", "webserver.get_todo_by_id.errors.missing_id")
  params.Add("from", "-5min")
  params.Add("format", "json")
  baseUrl.RawQuery = params.Encode()
  r, err = http.NewRequestWithContext(context.Background(), http.MethodGet, baseUrl.String(), nil)
  require.NoError(t, err)
  // Poll Graphite every 3s for up to 30s: the data point may not be queryable immediately.
  require.EventuallyWithT(t, func(collect *assert.CollectT) {
   emitted, err := isMetricEmitted(client, r, "webserver.get_todo_by_id.errors.missing_id", 1)
   require.NoError(collect, err)
   require.True(collect, emitted)
  }, time.Second*30, time.Second*3, "metric not emitted enough times")
 })
}

This code is not production-ready. The focus is on:

the Graphite request to get back the raw metrics. We're targeting the /render API with a bunch of values:
- target is the name of the metric
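- from sets the start of the time window to query, here the last five minutes. It could have been omitted, in which case it defaults to 24 hours ago. The width of the window also affects the precision of the retrieved data points: depending on the retention configuration, Graphite may answer a wider window from a coarser, aggregated archive, while a window that is too narrow can miss the data point altogether. Either way, we could lose the data points we need
- format selects the output format. Besides json, Graphite supports several others such as csv, raw, and png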
the isMetricEmitted function is used to issue the HTTP request to Graphite. More details on it below
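Since both subtests build the same query and only the target changes, the URL construction can be pulled into a small helper. The sketch below is an assumption of mine, not code from the repo: buildRenderURL is a hypothetical name, and it always asks for JSON output.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildRenderURL returns a Graphite /render URL for a single target,
// asking for JSON output. renderEndpoint is the full URL of the /render API.
// Hypothetical helper, introduced for this example only.
func buildRenderURL(renderEndpoint, target, from string) (string, error) {
	u, err := url.Parse(renderEndpoint)
	if err != nil {
		return "", err
	}
	params := url.Values{}
	params.Set("target", target)
	params.Set("from", from)
	params.Set("format", "json")
	// Encode sorts the parameters by key and escapes them.
	u.RawQuery = params.Encode()
	return u.String(), nil
}

func main() {
	renderURL, err := buildRenderURL("http://127.0.0.1:80/render", "webserver.get_todo_by_id.errors.invalid_id", "-5min")
	if err != nil {
		fmt.Println("invalid endpoint:", err)
		return
	}
	fmt.Println(renderURL)
}
```

Having the from value as an explicit argument also makes it easy to experiment with different time windows.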

Let's see the code interacting with Graphite.

The Metrics Checker

The code is contained in the tests/metrics.go file:

type graphiteDataPoints []struct {
 Target string `json:"target"`
 Tags   struct {
  Name string `json:"name"`
 } `json:"tags"`
 Datapoints [][2]any `json:"datapoints"`
}

func isMetricEmitted(client http.Client, req *http.Request, metricName string, expectedNumberOfTimes int) (bool, error) {
 res, err := client.Do(req)
 if err != nil {
  return false, err
 }
 defer res.Body.Close()
 if res.StatusCode != http.StatusOK {
  return false, fmt.Errorf("expected status code to be 200 OK, got %d", res.StatusCode)
 }
 var graphiteDataPoints graphiteDataPoints
 err = json.NewDecoder(res.Body).Decode(&graphiteDataPoints)
 if err != nil {
  return false, err
 }
 actualNumberOfTimes := 0
 for _, v := range graphiteDataPoints {
  if v.Tags.Name == metricName {
   for _, vv := range v.Datapoints {
    if vv[0] != nil {
     actualNumberOfTimes++
    }
   }
  }
 }
 if actualNumberOfTimes >= expectedNumberOfTimes {
  return true, nil
 }
 return false, fmt.Errorf("metric: %s emitted %d time(s) out of %d", metricName, actualNumberOfTimes, expectedNumberOfTimes)
}

This is nothing more than sending a regular HTTP request and decoding the response.

The trickiest part was working out how to build the HTTP request to send.
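If the shape of the JSON response is unclear, the following sketch decodes a sample body and counts the data points that carry a value. The sample is hand-written for illustration, not captured from a real Graphite instance, and renderSeries and countEmitted are hypothetical names. To stay small, it matches on the target field rather than on tags.name as the checker above does.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sampleRenderBody is a hand-written example of a /render?format=json response.
// Each datapoint is a [value, timestamp] pair; value is null for empty buckets.
const sampleRenderBody = `[{"target":"webserver.get_todo_by_id.errors.invalid_id","datapoints":[[null,1748413140],[1.0,1748413200],[null,1748413260]]}]`

// renderSeries mirrors the fields of one series that this sketch needs.
type renderSeries struct {
	Target     string        `json:"target"`
	Datapoints [][2]*float64 `json:"datapoints"`
}

// countEmitted returns how many datapoints of the given target are non-null.
func countEmitted(body []byte, target string) (int, error) {
	var series []renderSeries
	if err := json.Unmarshal(body, &series); err != nil {
		return 0, err
	}
	count := 0
	for _, s := range series {
		if s.Target != target {
			continue
		}
		for _, dp := range s.Datapoints {
			// A nil pointer means the JSON value was null.
			if dp[0] != nil {
				count++
			}
		}
	}
	return count, nil
}

func main() {
	n, err := countEmitted([]byte(sampleRenderBody), "webserver.get_todo_by_id.errors.invalid_id")
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println("non-null datapoints:", n)
}
```

Using a pointer for the value makes the null check explicit, where the checker above relies on an untyped any for the same purpose.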

Has Everything Worked as Expected? ⁉️

To make sure we ~~wasted~~ invested our time well, let's run the tests. Use the command:

go test ./tests -tags=integration

You should get back an output like:

ok      github.com/ossan-dev/graphitepoc/tests  45.259s

Our tests have successfully run. I hope you have learned something new today!

Thanks for your attention, folks! If you've got any questions, doubts, feedback, or comments, I'm happy to listen and chat. If you want me to cover some specific concepts, please reach out.

Time to drop the pen and grab a deserved coffee ☕

Top comments (2)

Aaron Gibbs • Jun 4

Nice walkthrough! Fun fact: Graphite was originally developed by Orbitz in 2006 to help monitor their own infrastructure before it became one of the go-to open-source TSDBs for devops teams.

Ivan Pesenti • Jun 6

Thanks, @aarongibbs
It's a recurring story XD
Some company wants to address a need, creates a tool meant to be internal, and then it gets adopted by half of the world. Does it remind you of anything? :-)