Sumit Roy

🧨 What I Broke Wednesday: The Infinite Retry Loop of Doom

This week’s episode: how I accidentally DDoS’d our own backend... from inside the backend.

🛠️ The Setup

We had a microservice (Service A) that relied on Service B to fetch customer data.

One fine morning, I decided to make our requests to Service B more “resilient.”

I added a retry block using our new HTTP client wrapper. Something like:

import time

# Naive version: up to 3 attempts, a fixed 1-second pause, and every error swallowed.
response = None
for _ in range(3):
    try:
        response = call_service_b()
        break
    except Exception:
        time.sleep(1)

Looks harmless, right?

Well…

🔥 The Fire


Here’s what I missed:

  • Service B was already failing due to a separate deployment issue.

  • The retry logic had no backoff (just a fixed one-second sleep) and ran for every incoming request.

  • Service A was deployed across all nodes, and each instance now retried every request 3 times.

Multiply that by:

  • 5K requests per second

  • 3 retries

  • 6 instances

🎉 Congratulations! You just self-DDoS’d your own service. Both A and B went down in under 90 seconds.

🧠 The Learnings

  • Retries aren’t magic. Add exponential backoff. Add limits. Add circuit breakers. And most importantly, add rate limits, yes, even for internal requests (see the sketch after this list).

  • Log retries explicitly, so you don’t have to grep through 500 MB of logs later.

  • Monitor retry rates: a spike is often hiding a deeper issue.
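
For the backoff part, here’s a minimal sketch of what “retry wisely” can look like in plain Python. call_service_b is the same call from the snippet above; everything else (names, limits, delays) is illustrative, not our actual implementation:

import logging
import random
import time

logger = logging.getLogger("service_a.retries")

MAX_ATTEMPTS = 3     # hard cap on attempts
BASE_DELAY = 0.5     # seconds
MAX_DELAY = 8.0      # never sleep longer than this

def call_with_backoff():
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return call_service_b()
        except Exception as exc:
            # Log every retry explicitly so spikes show up in dashboards.
            logger.warning("service_b call failed (attempt %d/%d): %s",
                           attempt, MAX_ATTEMPTS, exc)
            if attempt == MAX_ATTEMPTS:
                raise  # give up and surface the error instead of hiding it
            # Exponential backoff with jitter: 0.5s, 1s, 2s..., randomized
            # so all instances don't retry in lockstep.
            delay = min(MAX_DELAY, BASE_DELAY * 2 ** (attempt - 1))
            time.sleep(delay * random.uniform(0.5, 1.5))

The key differences from the original snippet: a hard cap on attempts, delays that grow instead of staying flat, jitter so instances don’t hammer Service B in sync, and a log line per retry so the monitoring actually catches the spike.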

✅ The Fix


We rolled back, then patched in exponential backoff plus error-rate logging.
We also implemented a retry budget, sketched below: no more than 10% of total requests can be retries.
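
The retry budget itself is conceptually simple. Here’s a rough sketch of the idea (a hypothetical RetryBudget class, not our production code, which tracks a sliding time window rather than all-time counters):

import threading

class RetryBudget:
    """Allow retries only while they stay under `ratio` of total requests."""

    def __init__(self, ratio=0.10):
        self.ratio = ratio
        self.total = 0
        self.retries = 0
        self._lock = threading.Lock()

    def record_request(self):
        with self._lock:
            self.total += 1

    def can_retry(self):
        with self._lock:
            if self.total == 0:
                return False
            return (self.retries + 1) / self.total <= self.ratio

    def record_retry(self):
        with self._lock:
            self.retries += 1

Before sleeping and retrying, the client calls budget.can_retry(); once retries hit 10% of total traffic, it fails fast instead of piling more load onto a dependency that’s already struggling.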

🧊 TL;DR

Trying to make things more reliable can backfire spectacularly if you forget to put guardrails in place.

Don’t just retry. Retry wisely.
