I recently got stuck at work while writing an end-to-end test against a web server that exposes its capabilities via REST APIs. The trouble started when I tried to verify that the system under test wrote some metrics to a specific Graphite instance, which collects all the metrics emitted in a cloud environment. Ideally, the desired workflow was:
- Run a test invoking an HTTP endpoint with an invalid request
- The HTTP endpoint processes the request and sends back the response to the client
- At the same time, the HTTP endpoint writes a metric to the Graphite server to record the failing request
- The test, as part of its assertions, also checks whether the metric has been emitted
Nothing too fancy, you might think... But I was not able to get back a reliable result, and that bothered me enough to spend some time digging into it.
TL;DR: the root cause
There were a couple of things worth noting in this approach. First and foremost, this check isn't meaningful in the context of an end-to-end test. Such a test should verify the behavior exposed by the System Under Test, not its internals, like emitting a metric or writing a log entry.
The second problem was the query used to retrieve the desired metric. In particular, the from query parameter of the Graphite /render API played a crucial role in getting the expected result (I'll cover that later).
Another issue was my lack of knowledge of the Graphite server's configuration.
The list could go on, but I prefer to stop here to preserve my developer reputation.
Graphite
The first thing to do was to improve my Graphite knowledge, so I could understand its internals and fully control it. Below is a bulleted list with some of the concepts you might need to be aware of:
- Graphite is a time-series database (TSDB), as described in the Graphite documentation
- In case you're not familiar with TSDBs, please read up on them first. It's vital that you understand how they work before proceeding with this blog post.
- There are three components in the Graphite infrastructure:
  - Carbon: the backend of Graphite, listening for time-series data. It can ingest data via different protocols. We'll use the plaintext protocol
  - Whisper: a fixed-size, file-based database format (similar in design to RRD) used to store the data points received by Carbon
  - Graphite-Web: the web application used to render graphs and dashboards. It also serves the /render HTTP API used later in this post
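- There are two kinds of metrics you will typically deal with (Graphite itself stores plain numeric values over time; the distinction comes from how clients such as StatsD emit them):
  - Counter: an only-increasing metric
  - Gauge: a snapshot of a value at a specific time. This is also used to create Timers (histograms)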
- Metrics are organized hierarchically, with the . character separating the levels of a metric name. Good naming is fundamental to organizing and fetching metrics properly, not least because it lets you query them with wildcard characters.
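To make the naming point concrete, here is a minimal sketch of how a wildcard pattern lines up with dot-separated metric names. The matchMetric helper is hypothetical (it is not part of Graphite or of the repo), and it is a simplification: it relies on Go's path.Match for each level, so it handles *, ? and [...] but not Graphite's {a,b} alternatives.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// matchMetric reports whether a dot-separated metric name matches a
// Graphite-style pattern, comparing the two level by level.
// Simplification: each level is matched with path.Match, so "*", "?" and
// "[...]" are supported, while Graphite's "{a,b}" alternatives are not.
func matchMetric(pattern, name string) (bool, error) {
	patternParts := strings.Split(pattern, ".")
	nameParts := strings.Split(name, ".")
	// A wildcard never spans a dot, so both sides need the same depth.
	if len(patternParts) != len(nameParts) {
		return false, nil
	}
	for i, p := range patternParts {
		ok, err := path.Match(p, nameParts[i])
		if err != nil {
			return false, err
		}
		if !ok {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	names := []string{
		"webserver.get_todo_by_id.success",
		"webserver.get_todo_by_id.errors.invalid_id",
		"webserver.get_todo_by_id.errors.missing_id",
	}
	for _, n := range names {
		ok, _ := matchMetric("webserver.get_todo_by_id.errors.*", n)
		fmt.Println(n, ok)
	}
}
```

With consistent names, a single pattern such as webserver.get_todo_by_id.errors.* selects every error metric of one handler and nothing else.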
There's more to cover, but I don't want to stray too far off-topic.
Docker Comes to the Rescue
I decided to create a small project to experiment with Graphite and take a deep dive into it (yes, you can blame me).
You can already guess my saviors: Docker and testcontainers-go. These two saved my day.
Spoiler: it was not the first time, and it won't be the last.
Now, let's get our hands dirty.
If you get lost, you can find the full repo on my GitHub account.
I used the typical TODO application since I didn't want to spend extra cognitive load on understanding a more complex app. This allowed me to focus only on the technologies I needed to experiment with.
System Under Test
The app is a simple web server exposing two HTTP GET routes via REST. Below, I share the most significant files.
The HTTP Handler
The code is in the internal/todos/todos.go file. Here is the code for the GetTodoByID handler:
func (t *TodoHandler) GetTodoByID(w http.ResponseWriter, r *http.Request) {
	// The ID comes from the "id" query parameter.
	rawID := r.URL.Query().Get("id")
	if rawID == "" {
		metrics.WriteMetricWithPlaintext(t.GraphiteConn, "webserver.get_todo_by_id.errors.missing_id", 1.0)
		w.WriteHeader(http.StatusBadRequest)
		w.Write([]byte("please provide a TODO ID"))
		return
	}
	id, err := strconv.Atoi(rawID)
	if err != nil {
		metrics.WriteMetricWithPlaintext(t.GraphiteConn, "webserver.get_todo_by_id.errors.invalid_id", 1.0)
		w.WriteHeader(http.StatusBadRequest)
		w.Write([]byte("please provide a numeric TODO ID"))
		return
	}
	// Linear search over the in-memory list of TODOs.
	for _, v := range todos {
		if v.ID == id {
			data, err := json.MarshalIndent(v, "", "\t")
			if err != nil {
				metrics.WriteMetricWithPlaintext(t.GraphiteConn, "webserver.get_todo_by_id.errors.invalid_format", 1.0)
				w.WriteHeader(http.StatusInternalServerError)
				return
			}
			metrics.WriteMetricWithPlaintext(t.GraphiteConn, "webserver.get_todo_by_id.success", 1.0)
			w.Write(data)
			return
		}
	}
	// No TODO matched the requested ID.
	metrics.WriteMetricWithPlaintext(t.GraphiteConn, "webserver.get_todo_by_id.errors.not_found", 1.0)
	w.WriteHeader(http.StatusNotFound)
	w.Write([]byte("todo not found"))
}
Note how I shaped the metric names. Keep the names consistent, and it will be much easier to retrieve them without messing things up.
Metrics Sending
The source code is in the internal/metrics/manager.go file. The content is:
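// WriteMetricWithPlaintext sends one data point to Carbon as a
// "<name> <value> <unix timestamp>\n" line over the given connection.
func WriteMetricWithPlaintext(graphiteConn net.Conn, name string, value float64) {
	if _, err := fmt.Fprintf(graphiteConn, "%s %f %d\n", name, value, time.Now().Unix()); err != nil {
		fmt.Println("error while writing metrics to Graphite:", err.Error())
	}
}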
We're sending metrics via the plaintext protocol. Each message must follow the string template %s %f %d\n, where:
- %s is the metric name, like webserver.get_todo_by_id.success
- %f is the metric value as a float64, like 1.0
- %d is the timestamp in Unix format (seconds), like 1748413179
A sample message is webserver.get_todo_by_id.success 1.000000 1748413179 (Go's %f verb prints six decimal places), followed by a \n character.
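As a quick illustration of the template above, the following sketch formats a data point into a plaintext line and parses it back. The dataPoint type and the formatPlaintext and parsePlaintext helpers are hypothetical names introduced only for this example; the repo writes straight to the connection with fmt.Fprintf.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// dataPoint is a hypothetical container for the three fields of a
// plaintext-protocol message.
type dataPoint struct {
	Name      string
	Value     float64
	Timestamp int64 // Unix time, in seconds
}

// formatPlaintext renders a data point with the "%s %f %d\n" template.
func formatPlaintext(p dataPoint) string {
	return fmt.Sprintf("%s %f %d\n", p.Name, p.Value, p.Timestamp)
}

// parsePlaintext is the inverse of formatPlaintext. It expects exactly
// three whitespace-separated fields.
func parsePlaintext(line string) (dataPoint, error) {
	fields := strings.Fields(line)
	if len(fields) != 3 {
		return dataPoint{}, fmt.Errorf("expected 3 fields, got %d", len(fields))
	}
	value, err := strconv.ParseFloat(fields[1], 64)
	if err != nil {
		return dataPoint{}, fmt.Errorf("invalid value: %w", err)
	}
	ts, err := strconv.ParseInt(fields[2], 10, 64)
	if err != nil {
		return dataPoint{}, fmt.Errorf("invalid timestamp: %w", err)
	}
	return dataPoint{Name: fields[0], Value: value, Timestamp: ts}, nil
}

func main() {
	p := dataPoint{Name: "webserver.get_todo_by_id.success", Value: 1.0, Timestamp: 1748413179}
	line := formatPlaintext(p)
	fmt.Printf("%q\n", line)

	back, err := parsePlaintext(line)
	fmt.Println(back, err)
}
```

Keeping the formatting in a pure function like this makes the wire format easy to unit test without a running Carbon instance.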
The graphiteConn parameter is a plain TCP connection (a net.Conn), created in the init() function of the cmd/webserver/main.go file:
func init() {
	graphiteHost := config.GetEnvOrDefault("GRAPHITE_HOSTNAME", "graphite")
	graphitePort := config.GetEnvOrDefault("GRAPHITE_PORT", "2003")
	conn, err := net.Dial("tcp", net.JoinHostPort(graphiteHost, graphitePort))
	if err != nil {
		panic(err)
	}
	todoHandler = todos.NewTodoHandler(conn)
	if todoHandler == nil {
		panic("could not start the application")
	}
}
Here, graphite is the hostname of the Graphite container we're going to use. Let's see how we can power up our simple yet effective application.
Containerize the Web Server
The Dockerfile is pretty basic, so I won't spend time covering it.
FROM golang:1.24-alpine AS build
WORKDIR /app
COPY go.mod go.sum ./
# Only go.mod and go.sum are present at this point, so we just download the modules
RUN go mod download
RUN go mod verify
COPY . .
RUN go build -o webserver cmd/webserver/main.go
FROM alpine
COPY --from=build /app/webserver /webserver
EXPOSE 8080
CMD [ "./webserver" ]
What's most interesting to cover is the Docker Compose file we will be using to start the two containers at once.
Power Up
To coordinate the startup of the containers, we will use the docker-compose.yml file. The content is below:
services:
  webserver:
    build: "."
    container_name: webserver
    restart: always
    environment:
      - GRAPHITE_HOSTNAME=graphite
      # Same variable name that main.go reads (GRAPHITE_PORT)
      - GRAPHITE_PORT=2003
    ports:
      - 8080:8080
    depends_on:
      graphite:
        condition: service_healthy
    networks:
      - todo-network
  graphite:
    image: graphiteapp/graphite-statsd
    container_name: graphite
    restart: always
    ports:
      - 80:80
      - 2003-2004:2003-2004
      - 2023-2024:2023-2024
      - 8125:8125/udp
      - 8126:8126
    healthcheck:
      test: ["CMD-SHELL", "netstat -an | grep -q 2003"]
      interval: 10s
      retries: 3
      start_period: 30s
      timeout: 10s
    networks:
      - todo-network
networks:
  todo-network:
    driver: bridge
Pay attention to the following key points:
- the todo-network network, created to make communication between the containers possible
- the environment values used in the webserver service to refer to the graphite service
- the depends_on condition with the service_healthy value, defined in the webserver service
- the ports mapped in the graphite service (you can map only the ones you need)
- the healthcheck defined in the graphite service, which the depends_on condition above relies on
Now, let's see if we can overcome the initial issue and reliably test that the Graphite metrics are emitted.
The Test Code
Here, the big player is testcontainers-go. If you're curious and want to learn more about it, take a look at its documentation.
I'm a huge fan of this package, and I believe it's something you must try in your next project. Let's start a discussion if you want to find out more about how I use it in my projects.
With this package, I'm able to spawn fresh webserver and graphite containers on each test run. This helps to assess the metrics correctly, since it provides better isolation and control over what's happening in the Graphite container.
The code I used to interact with the Docker containers is in the tests/container.go file:
package tests

import (
	"context"
	"os"
	"testing"

	"github.com/stretchr/testify/require"
	tc "github.com/testcontainers/testcontainers-go/modules/compose"
)

func spawnWebServerContainer(t *testing.T) {
	t.Helper()
	os.Setenv("TESTCONTAINERS_RYUK_DISABLED", "true")
	compose, err := tc.NewDockerComposeWith(tc.WithStackFiles("../docker-compose.yml"))
	require.NoError(t, err)
	t.Cleanup(func() {
		require.NoError(t, compose.Down(context.Background(), tc.RemoveOrphans(true), tc.RemoveImagesLocal))
	})
	ctx, cancel := context.WithCancel(context.Background())
	t.Cleanup(cancel)
	err = compose.Up(ctx, tc.Wait(true))
	require.NoError(t, err)
}
It uses the docker-compose.yml file to spin up the containers we need in our integration tests. It also registers the cleanup code.
Test the HTTP Handler
The test code for the GetTodoByID handler resides in the tests/get_todo_by_id_test.go file.
An extract of its content is:
func TestGetTodoByID(t *testing.T) {
	spawnWebServerContainer(t)
	client := http.Client{}
	// ... success scenario omitted for brevity
	t.Run("Invalid ID", func(t *testing.T) {
		// Trigger the failure path with a non-numeric ID.
		r, err := http.NewRequestWithContext(context.Background(), http.MethodGet, "http://127.0.0.1:8080/todo?id=abc", nil)
		require.NoError(t, err)
		res, err := client.Do(r)
		require.NoError(t, err)
		defer res.Body.Close()
		require.Equal(t, http.StatusBadRequest, res.StatusCode)
		// Build the Graphite /render query for the metric we expect.
		baseURL, err := url.Parse("http://127.0.0.1:80/render")
		require.NoError(t, err)
		params := url.Values{}
		params.Add("target", "webserver.get_todo_by_id.errors.invalid_id")
		params.Add("from", "-5min")
		params.Add("format", "json")
		baseURL.RawQuery = params.Encode()
		r, err = http.NewRequestWithContext(context.Background(), http.MethodGet, baseURL.String(), nil)
		require.NoError(t, err)
		// Poll Graphite, since the data point may take a few seconds to become queryable.
		require.EventuallyWithT(t, func(collect *assert.CollectT) {
			emitted, err := isMetricEmitted(client, r, "webserver.get_todo_by_id.errors.invalid_id", 1)
			require.NoError(collect, err)
			require.True(collect, emitted)
		}, time.Second*30, time.Second*3, "metric not emitted enough times")
	})
	t.Run("Missing ID", func(t *testing.T) {
		// Trigger the failure path with an empty ID.
		r, err := http.NewRequestWithContext(context.Background(), http.MethodGet, "http://127.0.0.1:8080/todo?id=", nil)
		require.NoError(t, err)
		res, err := client.Do(r)
		require.NoError(t, err)
		defer res.Body.Close()
		require.Equal(t, http.StatusBadRequest, res.StatusCode)
		// Build the Graphite /render query for the metric we expect.
		baseURL, err := url.Parse("http://127.0.0.1:80/render")
		require.NoError(t, err)
		params := url.Values{}
		params.Add("target", "webserver.get_todo_by_id.errors.missing_id")
		params.Add("from", "-5min")
		params.Add("format", "json")
		baseURL.RawQuery = params.Encode()
		r, err = http.NewRequestWithContext(context.Background(), http.MethodGet, baseURL.String(), nil)
		require.NoError(t, err)
		// Poll Graphite, since the data point may take a few seconds to become queryable.
		require.EventuallyWithT(t, func(collect *assert.CollectT) {
			emitted, err := isMetricEmitted(client, r, "webserver.get_todo_by_id.errors.missing_id", 1)
			require.NoError(collect, err)
			require.True(collect, emitted)
		}, time.Second*30, time.Second*3, "metric not emitted enough times")
	})
}
This code is not production-ready. The focus is on:
- the Graphite request to get back the raw metrics. We're targeting the /render API with a handful of parameters:
  - target is the name of the metric
  - from is the start of the time window to query, either relative to now (like -5min) or absolute. It could have been omitted, in which case it would have defaulted to 24 hours ago. The requested window also affects the precision of the retrieved data points, since Graphite may serve wider windows from coarser retention archives. Setting it too high or too low could filter out the data points we need
  - format selects the output format, such as csv, raw, png, json, and so on
- the isMetricEmitted function, which issues the HTTP request to Graphite. More details on it below
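Since both subtests build the same kind of query, the three parameters above can be captured in a small helper. This is only a sketch: buildRenderURL is a hypothetical name that does not exist in the repo, and the base URL assumes Graphite-Web is published on port 80 of localhost, as in the Compose file.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildRenderURL returns the Graphite /render URL for a single target.
// from is a relative or absolute start time (for example "-5min") and
// format is the output format (for example "json").
func buildRenderURL(base, target, from, format string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	params := url.Values{}
	params.Set("target", target)
	params.Set("from", from)
	params.Set("format", format)
	// Encode sorts the parameters by key and escapes the values.
	u.RawQuery = params.Encode()
	return u.String(), nil
}

func main() {
	renderURL, err := buildRenderURL(
		"http://127.0.0.1:80/render",
		"webserver.get_todo_by_id.errors.invalid_id",
		"-5min",
		"json",
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(renderURL)
}
```

A helper like this keeps the from window in one place, which matters because that value decides whether the freshly written data point falls inside the query.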
Let's see the code interacting with Graphite.
The Metrics Checker
The code is in the tests/metrics.go file:
package tests

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// graphiteDataPoints mirrors the JSON returned by the /render API.
// Each data point is a [value, timestamp] pair, and the value is null
// when nothing was recorded in that interval.
type graphiteDataPoints []struct {
	Target string `json:"target"`
	Tags   struct {
		Name string `json:"name"`
	} `json:"tags"`
	Datapoints [][2]any `json:"datapoints"`
}

func isMetricEmitted(client http.Client, req *http.Request, metricName string, expectedNumberOfTimes int) (bool, error) {
	res, err := client.Do(req)
	if err != nil {
		return false, err
	}
	defer res.Body.Close()
	if res.StatusCode != http.StatusOK {
		return false, fmt.Errorf("expected status code to be 200 OK, got %d", res.StatusCode)
	}
	var series graphiteDataPoints
	err = json.NewDecoder(res.Body).Decode(&series)
	if err != nil {
		return false, err
	}
	// Count the non-null data points of the requested metric.
	actualNumberOfTimes := 0
	for _, v := range series {
		if v.Tags.Name == metricName {
			for _, vv := range v.Datapoints {
				if vv[0] != nil {
					actualNumberOfTimes++
				}
			}
		}
	}
	if actualNumberOfTimes >= expectedNumberOfTimes {
		return true, nil
	}
	return false, fmt.Errorf("metric: %s emitted %d time(s) out of %d", metricName, actualNumberOfTimes, expectedNumberOfTimes)
}
This is just a regular HTTP request.
The trickiest part was building the HTTP request correctly.
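To see the counting logic in isolation, here is a small sketch that decodes a /render-style JSON payload and counts the non-null data points. The countNonNull helper and the sample payload are assumptions made for illustration: the payload is hand-written to mimic the shape of a response (a list of series, each with [value, timestamp] pairs), not captured from a real Graphite instance.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// series models one entry of a /render JSON response. A nil value
// pointer stands for a JSON null, meaning no data in that interval.
type series struct {
	Target     string        `json:"target"`
	Datapoints [][2]*float64 `json:"datapoints"`
}

// countNonNull returns how many data points of the given target carry
// an actual value.
func countNonNull(body []byte, target string) (int, error) {
	var all []series
	if err := json.Unmarshal(body, &all); err != nil {
		return 0, err
	}
	count := 0
	for _, s := range all {
		if s.Target != target {
			continue
		}
		for _, dp := range s.Datapoints {
			if dp[0] != nil {
				count++
			}
		}
	}
	return count, nil
}

// sample is a hand-written payload in the shape of a /render response.
const sample = `[{
	"target": "webserver.get_todo_by_id.errors.invalid_id",
	"tags": {"name": "webserver.get_todo_by_id.errors.invalid_id"},
	"datapoints": [[null, 1748413140], [1.0, 1748413200], [null, 1748413260]]
}]`

func main() {
	n, err := countNonNull([]byte(sample), "webserver.get_todo_by_id.errors.invalid_id")
	if err != nil {
		panic(err)
	}
	fmt.Println("non-null data points:", n)
}
```

Most intervals in a short window come back as null, which is why the check counts values rather than the length of the datapoints list.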
Has Everything Worked as Expected?
To make sure our time was invested rather than wasted, let's run the tests. Use the command:
go test ./tests -tags=integration
You should get back an output like:
ok github.com/ossan-dev/graphitepoc/tests 45.259s
Our tests have successfully run. I hope you have learned something new today!
Thanks for your attention, folks! If you've got any questions, doubts, feedback, or comments, I'm happy to listen and chat. If you want me to cover some specific concepts, please reach out.
Time to drop the pen and grab a well-deserved coffee.