This week’s episode: how I accidentally DDoS’d our own backend... from inside the backend.
🛠️ The Setup
We had a microservice A that relied on service B for fetching customer data.
One fine morning, I decided to make our requests to service B more “resilient.”
Added a retry block using the new HTTP client wrapper. Something like:
```python
import time

# naive "resilience": three attempts, a fixed one-second sleep,
# and every exception swallowed silently
for _ in range(3):
    try:
        response = call_service_b()
        break
    except Exception:
        time.sleep(1)
```
Looks harmless, right?
Well…
🔥 The Fire
Here’s what I missed:
Service B was already failing due to a separate deployment issue.
The retry delay was a fixed one second, with no exponential backoff, and it ran for every request.
Service A was deployed to all nodes, each now retrying every request 3 times.
Multiply that by:
5K requests per second
3 retries
6 instances
That works out to something like 90K requests per second aimed at a service that was already failing. 🎉 Congratulations! You just self-DDoS’d your own service. Both A and B went down in under 90 seconds.
🧠 The Learnings
Retries aren’t magic. Add exponential backoff. Add limits. Add circuit breakers. And most importantly, add rate limits: yes, even for internal requests. (See the sketch after this list.)
Log retries explicitly, so you don’t have to grep through 500 MB of logs later.
Monitor retry rates—because a spike might be hiding a deeper issue.
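For illustration, here’s a minimal sketch of what a safer retry loop could look like, reusing the `call_service_b()` placeholder from above. The function name, logger name, and constants are all illustrative, not our actual code: capped attempts, exponential backoff with jitter, and an explicit log line per retry so spikes show up in monitoring.

```python
import logging
import random
import time

logger = logging.getLogger("service_a.client_b")

MAX_ATTEMPTS = 3
BASE_DELAY = 0.5   # seconds
MAX_DELAY = 8.0    # cap so backoff never sleeps absurdly long

def call_b_with_backoff():
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return call_service_b()
        except Exception as exc:
            if attempt == MAX_ATTEMPTS:
                # out of attempts: surface the error instead of hammering B
                raise
            # exponential backoff (0.5s, 1s, 2s, ...) plus jitter so all
            # instances don't retry in lockstep
            delay = min(MAX_DELAY, BASE_DELAY * 2 ** (attempt - 1))
            delay += random.uniform(0, delay)
            logger.warning(
                "call_service_b failed (attempt %d/%d), retrying in %.1fs: %s",
                attempt, MAX_ATTEMPTS, delay, exc,
            )
            time.sleep(delay)
```

The jitter is the easy-to-forget part: without it, a fleet of instances that failed at the same moment will also retry at the same moment.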
✅ The Fix
We rolled back, patched in exponential backoff + error rate logging.
Also implemented a retry budget: no more than 10% of total requests can be retries.
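Our real retry budget lives in the HTTP client wrapper and leans on shared metrics; the sketch below is just the idea in its simplest in-process form, with the class and method names made up for this post: track recent traffic in a sliding window and only allow a retry while retries stay under 10% of it.

```python
import threading
import time
from collections import deque

class RetryBudget:
    """Allow retries only while they stay under a fixed share of recent traffic."""

    def __init__(self, ratio=0.1, window_seconds=10):
        self.ratio = ratio
        self.window = window_seconds
        self.events = deque()      # (timestamp, is_retry) pairs
        self.lock = threading.Lock()

    def _trim(self, now):
        # drop events that fell out of the sliding window
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()

    def record_request(self):
        # call this for every outbound request, retry or not
        with self.lock:
            self.events.append((time.monotonic(), False))

    def allow_retry(self):
        with self.lock:
            now = time.monotonic()
            self._trim(now)
            total = len(self.events)
            retries = sum(1 for _, is_retry in self.events if is_retry)
            if total == 0 or retries / total >= self.ratio:
                return False       # over budget: fail fast instead of retrying
            self.events.append((now, True))
            return True
```

The caller records every outbound request with `record_request()` and only sleeps-and-retries when `allow_retry()` returns True; otherwise the error propagates immediately, which is exactly what you want when the downstream service is already on fire.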
🧊 TL;DR
Trying to make things more reliable can backfire spectacularly if you forget to put guardrails in place.
Don’t just retry. Retry wisely.