Sumit Roy

🧨 What I Broke Wednesday: The Infinite Retry Loop of Doom

This week’s episode: how I accidentally DDoS’d our own backend... from inside the backend.

🛠️ The Setup

We had a microservice (Service A) that relied on Service B to fetch customer data.

One fine morning, I decided to make our requests to Service B more “resilient.”

I added a retry block using our new HTTP client wrapper. Something like:

import time

# Naive version: up to 3 attempts, a fixed 1-second pause, and every error swallowed.
response = None
for _ in range(3):
    try:
        response = call_service_b()
        break
    except Exception:
        time.sleep(1)

Looks harmless, right?

Well…

🔥 The Fire


Here’s what I missed:

  • Service B was already failing due to a separate deployment issue.

  • The retry logic had no backoff (just a fixed one-second sleep) and ran for every incoming request.

  • Service A was deployed across all nodes, and each instance now retried every request 3 times.

Multiply that by:

  • 5K requests per second

  • 3 retries

  • 6 instances

🎉 Congratulations! You just self-DDoS’d your own service. Both A and B went down in under 90 seconds.

🧠 The Learnings

  • Retries aren’t magic. Add exponential backoff. Add limits. Add circuit breakers. And most importantly, add rate limits, yes, even for internal requests (see the sketch after this list).

  • Log retries explicitly, so you don’t have to grep through 500 MB of logs later.

  • Monitor retry rates: a spike is often hiding a deeper issue.
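
For the backoff part, here’s a minimal sketch of what “retry wisely” can look like in plain Python. call_service_b is the same call from the snippet above; everything else (names, limits, delays) is illustrative, not our actual implementation:

import logging
import random
import time

logger = logging.getLogger("service_a.retries")

MAX_ATTEMPTS = 3     # hard cap on attempts
BASE_DELAY = 0.5     # seconds
MAX_DELAY = 8.0      # never sleep longer than this

def call_with_backoff():
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return call_service_b()
        except Exception as exc:
            # Log every retry explicitly so spikes show up in dashboards.
            logger.warning("service_b call failed (attempt %d/%d): %s",
                           attempt, MAX_ATTEMPTS, exc)
            if attempt == MAX_ATTEMPTS:
                raise  # give up and surface the error instead of hiding it
            # Exponential backoff with jitter: 0.5s, 1s, 2s..., randomized
            # so all instances don't retry in lockstep.
            delay = min(MAX_DELAY, BASE_DELAY * 2 ** (attempt - 1))
            time.sleep(delay * random.uniform(0.5, 1.5))

The key differences from the original snippet: a hard cap on attempts, delays that grow instead of staying flat, jitter so instances don’t hammer Service B in sync, and a log line per retry so the monitoring actually catches the spike.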

✅ The Fix


We rolled back, then patched in exponential backoff plus error-rate logging.
We also implemented a retry budget, sketched below: no more than 10% of total requests can be retries.
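
The retry budget itself is conceptually simple. Here’s a rough sketch of the idea (a hypothetical RetryBudget class, not our production code, which tracks a sliding time window rather than all-time counters):

import threading

class RetryBudget:
    """Allow retries only while they stay under `ratio` of total requests."""

    def __init__(self, ratio=0.10):
        self.ratio = ratio
        self.total = 0
        self.retries = 0
        self._lock = threading.Lock()

    def record_request(self):
        with self._lock:
            self.total += 1

    def can_retry(self):
        with self._lock:
            if self.total == 0:
                return False
            return (self.retries + 1) / self.total <= self.ratio

    def record_retry(self):
        with self._lock:
            self.retries += 1

Before sleeping and retrying, the client calls budget.can_retry(); once retries hit 10% of total traffic, it fails fast instead of piling more load onto a dependency that’s already struggling.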

🧊 TL;DR

Trying to make things more reliable can backfire spectacularly if you forget to put guardrails in place.

Don’t just retry. Retry wisely.
