Alex Aslam
When Event Sourcing Fails: War Stories from Production

"We built the perfect event-sourced system—until it exploded in ways we never imagined."

Event sourcing promises immutable truth, time-travel debugging, and resilience. But in production? Things go wrong.

Here are real war stories—names changed to protect the guilty—and what we learned.


1. The Case of the Unbounded Event Stream

What Happened?

A financial app stored every price tick (millions/day) as events. Rebuilding account balances required replaying 3TB of data. Queries took minutes.

The Fix

  • Snapshots: Saved state every 10K events.
  • Archival: Moved older events to cold storage (S3).
  • Aggregates: Pre-computed daily summaries.
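The snapshot fix can be sketched in a few lines of Python. This is a toy reducer with hypothetical names, not the team's actual code: state is rebuilt from the latest snapshot plus only the events recorded after it, so replay cost is bounded by the snapshot interval instead of the full 3TB stream.

```python
SNAPSHOT_INTERVAL = 10_000  # snapshot every 10K events, as in the fix above

def apply(state, event):
    # Toy reducer: each event adjusts a running balance.
    return state + event["amount"]

def rebuild(snapshots, events):
    """Replay from the last snapshot instead of from event zero."""
    if snapshots:
        state, last_seq = snapshots[-1]   # (state, sequence number)
    else:
        state, last_seq = 0, -1
    for seq, event in enumerate(events):
        if seq <= last_seq:
            continue                      # already folded into the snapshot
        state = apply(state, event)
    return state

def maybe_snapshot(snapshots, state, seq):
    """Persist a snapshot every SNAPSHOT_INTERVAL events."""
    if (seq + 1) % SNAPSHOT_INTERVAL == 0:
        snapshots.append((state, seq))
```

With a snapshot at sequence 0, `rebuild` skips event 0 entirely and folds in only what came after, which is the whole point.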

Lesson: Not all events deserve equal treatment.


2. The Schema Change That Broke Time

What Happened?

A team added a discount_code field to OrderPlaced events. Old projections ignored it—until a replay applied 2024 logic to 2022 data, giving customers unintended discounts.

The Fix

  • Upcasters: Wrote migration scripts for old events.
  • Versioned Projections: Tagged each with schema compatibility.
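An upcaster is just a function that migrates an old event shape to the current one at read time. A minimal sketch, assuming a version field on events and a hypothetical OrderPlaced v1 to v2 migration: pre-discount events explicitly get discount_code = None instead of inheriting 2024 defaults.

```python
def upcast_order_placed(event):
    """Upcast OrderPlaced v1 -> v2: old events predate discount codes,
    so they get an explicit None rather than current-day defaults."""
    if event.get("version", 1) == 1:
        event = {**event, "version": 2, "discount_code": None}
    return event

def replay(events, upcasters):
    """Run every registered upcaster for an event's type before handing
    the event to projections."""
    for event in events:
        for upcast in upcasters.get(event["type"], []):
            event = upcast(event)
        yield event
```

Because upcasting happens on replay, the stored events stay immutable; only the in-memory representation is migrated.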

Lesson: Events are forever. Plan for evolution.


3. The Infinite Loop of Doom

What Happened?

A UserUpdated event triggered a ProfileUpdated event, which triggered another UserUpdated, ad infinitum. The system processed 500K events/hour until OOM killed it.

The Fix

  • Causation IDs: Tracked event chains to detect loops.
  • Idempotency Keys: Prevented duplicate processing.
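Causation tracking can act as a cheap circuit breaker. A sketch with hypothetical names: every event carries the chain of event IDs that caused it, and a guard refuses to process anything whose chain is suspiciously deep.

```python
import uuid

MAX_CHAIN_DEPTH = 10  # assumed threshold; tune for your workflows

def make_event(event_type, cause=None):
    """Every event records the causation chain back to the root trigger."""
    chain = (cause["causation_chain"] + [cause["id"]]) if cause else []
    return {"id": str(uuid.uuid4()), "type": event_type,
            "causation_chain": chain}

def guard(event):
    """Refuse events whose causal chain is too deep -- a circuit breaker
    against UserUpdated -> ProfileUpdated -> UserUpdated loops."""
    if len(event["causation_chain"]) >= MAX_CHAIN_DEPTH:
        raise RuntimeError(f"possible event loop: {event['type']}")
```

A legitimate workflow rarely needs a causal chain ten events deep, so the guard trips long before 500K events/hour does.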

Lesson: Events can be recursive. Design defensively.


4. The GDPR Request That Broke History

What Happened?

A user demanded data deletion. But "forgetting" their events broke audit trails and projections.

The Fix

  • Pseudonymization: Replaced PII with tokens.
  • Legal Hold Buckets: Segregated sensitive events.
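Pseudonymization keeps the event stream immutable while still allowing erasure. A minimal sketch (vault and field names are illustrative): PII is swapped for opaque tokens at write time, the token-to-PII mapping lives outside the event store, and "forgetting" a user means deleting their vault entries, not their events.

```python
import uuid

vault = {}  # token -> PII, stored separately from the event store

def pseudonymize(event, pii_fields=("email", "name")):
    """Swap PII for opaque tokens; the stored event stays immutable."""
    event = dict(event)
    for field in pii_fields:
        if field in event:
            token = f"pii:{uuid.uuid4()}"
            vault[token] = event[field]
            event[field] = token
    return event

def forget(tokens):
    """GDPR erasure: drop the vault entries; events keep their tokens,
    so audit trails and projections still replay cleanly."""
    for token in tokens:
        vault.pop(token, None)
```

After `forget`, replays still work because the tokens are still present; they just no longer resolve to a person.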

Lesson: Compliance isn’t retroactive. Build it in early.


5. The Slow-Motion Rollback

What Happened?

A bug corrupted projections. The team replayed 2 weeks of events—taking 18 hours and missing SLAs.

The Fix

  • Parallel Replays: Split streams by aggregate ID.
  • Blue/Green Projections: Maintained two versions during recovery.
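Splitting replays by aggregate ID works because events for different aggregates are independent, so each partition can be folded concurrently. A sketch with a toy reducer (names hypothetical):

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def replay_aggregate(events):
    """Fold one aggregate's events into its projection (toy reducer)."""
    state = 0
    for event in events:
        state += event["amount"]
    return state

def parallel_replay(events, workers=4):
    """Partition the stream by aggregate_id and replay each partition
    on its own worker; ordering is preserved within each aggregate."""
    streams = defaultdict(list)
    for event in events:
        streams[event["aggregate_id"]].append(event)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {agg: pool.submit(replay_aggregate, evs)
                   for agg, evs in streams.items()}
    return {agg: fut.result() for agg, fut in futures.items()}
```

The key property: ordering only matters within an aggregate's stream, never across aggregates, which is what makes an 18-hour replay embarrassingly parallel.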

Lesson: Disaster recovery isn’t optional.


When Event Sourcing Is the Wrong Tool

  • Low-value data: CRUD apps with no audit needs.
  • Latency-sensitive systems: Rebuilding state adds overhead.
  • Teams without DevOps: Requires monitoring, backups, and tooling.


How to Fail Gracefully

  1. Start small: One stream, one projection.
  2. Monitor event growth: Alert on abnormal volumes.
  3. Practice recovery: Regularly test replay scenarios.
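Step 2 can start as simple as an hourly volume check against a baseline. A sketch (threshold and names are assumptions, not a prescription):

```python
def volume_alert(counts, baseline, factor=3):
    """Flag hours whose event volume exceeds `factor` times the baseline --
    the kind of check that would have caught the 500K events/hour loop."""
    return [hour for hour, n in counts.items() if n > factor * baseline]
```

Even this naive check beats discovering runaway event growth from an OOM kill.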

"But Our Events Are Fine!"

Famous last words. The best event-sourced systems plan for failure.

Have your own horror story? Share it below—let’s learn together.
