Alex Aslam
When Event Sourcing Fails: War Stories from Production

"We built the perfect event-sourced system—until it exploded in ways we never imagined."

Event sourcing promises immutable truth, time-travel debugging, and resilience. But in production? Things go wrong.

Here are real war stories—names changed to protect the guilty—and what we learned.


1. The Case of the Unbounded Event Stream

What Happened?

A financial app stored every price tick (millions/day) as events. Rebuilding account balances required replaying 3TB of data. Queries took minutes.

The Fix

  • Snapshots: Saved state every 10K events.
  • Archival: Moved older events to cold storage (S3).
  • Aggregates: Pre-computed daily summaries.
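The snapshot fix can be sketched in a few lines of Python. This is a toy reducer with hypothetical names, not the team's actual code: state is rebuilt from the latest snapshot plus only the events recorded after it, so replay cost is bounded by the snapshot interval instead of the full 3TB stream.

```python
SNAPSHOT_INTERVAL = 10_000  # snapshot every 10K events, as in the fix above

def apply(state, event):
    # Toy reducer: each event adjusts a running balance.
    return state + event["amount"]

def rebuild(snapshots, events):
    """Replay from the last snapshot instead of from event zero."""
    if snapshots:
        state, last_seq = snapshots[-1]   # (state, sequence number)
    else:
        state, last_seq = 0, -1
    for seq, event in enumerate(events):
        if seq <= last_seq:
            continue                      # already folded into the snapshot
        state = apply(state, event)
    return state

def maybe_snapshot(snapshots, state, seq):
    """Persist a snapshot every SNAPSHOT_INTERVAL events."""
    if (seq + 1) % SNAPSHOT_INTERVAL == 0:
        snapshots.append((state, seq))
```

With a snapshot at sequence 0, `rebuild` skips event 0 entirely and folds in only what came after, which is the whole point.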

Lesson: Not all events deserve equal treatment.


2. The Schema Change That Broke Time

What Happened?

A team added a discount_code field to OrderPlaced events. Old projections ignored it—until a replay applied 2024 logic to 2022 data, giving customers unintended discounts.

The Fix

  • Upcasters: Wrote migration scripts for old events.
  • Versioned Projections: Tagged each with schema compatibility.
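An upcaster is just a function that migrates an old event shape to the current one at read time. A minimal sketch, assuming a version field on events and a hypothetical OrderPlaced v1 to v2 migration: pre-discount events explicitly get discount_code = None instead of inheriting 2024 defaults.

```python
def upcast_order_placed(event):
    """Upcast OrderPlaced v1 -> v2: old events predate discount codes,
    so they get an explicit None rather than current-day defaults."""
    if event.get("version", 1) == 1:
        event = {**event, "version": 2, "discount_code": None}
    return event

def replay(events, upcasters):
    """Run every registered upcaster for an event's type before handing
    the event to projections."""
    for event in events:
        for upcast in upcasters.get(event["type"], []):
            event = upcast(event)
        yield event
```

Because upcasting happens on replay, the stored events stay immutable; only the in-memory representation is migrated.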

Lesson: Events are forever. Plan for evolution.


3. The Infinite Loop of Doom

What Happened?

A UserUpdated event triggered a ProfileUpdated event, which triggered another UserUpdated, ad infinitum. The system processed 500K events/hour until OOM killed it.

The Fix

  • Causation IDs: Tracked event chains to detect loops.
  • Idempotency Keys: Prevented duplicate processing.
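Causation tracking can act as a cheap circuit breaker. A sketch with hypothetical names: every event carries the chain of event IDs that caused it, and a guard refuses to process anything whose chain is suspiciously deep.

```python
import uuid

MAX_CHAIN_DEPTH = 10  # assumed threshold; tune for your workflows

def make_event(event_type, cause=None):
    """Every event records the causation chain back to the root trigger."""
    chain = (cause["causation_chain"] + [cause["id"]]) if cause else []
    return {"id": str(uuid.uuid4()), "type": event_type,
            "causation_chain": chain}

def guard(event):
    """Refuse events whose causal chain is too deep -- a circuit breaker
    against UserUpdated -> ProfileUpdated -> UserUpdated loops."""
    if len(event["causation_chain"]) >= MAX_CHAIN_DEPTH:
        raise RuntimeError(f"possible event loop: {event['type']}")
```

A legitimate workflow rarely needs a causal chain ten events deep, so the guard trips long before 500K events/hour does.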

Lesson: Events can be recursive. Design defensively.


4. The GDPR Request That Broke History

What Happened?

A user demanded data deletion. But "forgetting" their events broke audit trails and projections.

The Fix

  • Pseudonymization: Replaced PII with tokens.
  • Legal Hold Buckets: Segregated sensitive events.
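Pseudonymization keeps the event stream immutable while still allowing erasure. A minimal sketch (vault and field names are illustrative): PII is swapped for opaque tokens at write time, the token-to-PII mapping lives outside the event store, and "forgetting" a user means deleting their vault entries, not their events.

```python
import uuid

vault = {}  # token -> PII, stored separately from the event store

def pseudonymize(event, pii_fields=("email", "name")):
    """Swap PII for opaque tokens; the stored event stays immutable."""
    event = dict(event)
    for field in pii_fields:
        if field in event:
            token = f"pii:{uuid.uuid4()}"
            vault[token] = event[field]
            event[field] = token
    return event

def forget(tokens):
    """GDPR erasure: drop the vault entries; events keep their tokens,
    so audit trails and projections still replay cleanly."""
    for token in tokens:
        vault.pop(token, None)
```

After `forget`, replays still work because the tokens are still present; they just no longer resolve to a person.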

Lesson: Compliance isn’t retroactive. Build it in early.


5. The Slow-Motion Rollback

What Happened?

A bug corrupted projections. The team replayed 2 weeks of events—taking 18 hours and missing SLAs.

The Fix

  • Parallel Replays: Split streams by aggregate ID.
  • Blue/Green Projections: Maintained two versions during recovery.
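Splitting replays by aggregate ID works because events for different aggregates are independent, so each partition can be folded concurrently. A sketch with a toy reducer (names hypothetical):

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def replay_aggregate(events):
    """Fold one aggregate's events into its projection (toy reducer)."""
    state = 0
    for event in events:
        state += event["amount"]
    return state

def parallel_replay(events, workers=4):
    """Partition the stream by aggregate_id and replay each partition
    on its own worker; ordering is preserved within each aggregate."""
    streams = defaultdict(list)
    for event in events:
        streams[event["aggregate_id"]].append(event)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {agg: pool.submit(replay_aggregate, evs)
                   for agg, evs in streams.items()}
    return {agg: fut.result() for agg, fut in futures.items()}
```

The key property: ordering only matters within an aggregate's stream, never across aggregates, which is what makes an 18-hour replay embarrassingly parallel.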

Lesson: Disaster recovery isn’t optional.


When Event Sourcing Is the Wrong Tool

  • Low-value data: CRUD apps with no audit needs.
  • Latency-sensitive systems: Rebuilding state adds overhead.
  • Teams without DevOps: Requires monitoring, backups, and tooling.


How to Fail Gracefully

  1. Start small: One stream, one projection.
  2. Monitor event growth: Alert on abnormal volumes.
  3. Practice recovery: Regularly test replay scenarios.
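Step 2 can start as simple as an hourly volume check against a baseline. A sketch (threshold and names are assumptions, not a prescription):

```python
def volume_alert(counts, baseline, factor=3):
    """Flag hours whose event volume exceeds `factor` times the baseline --
    the kind of check that would have caught the 500K events/hour loop."""
    return [hour for hour, n in counts.items() if n > factor * baseline]
```

Even this naive check beats discovering runaway event growth from an OOM kill.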

"But Our Events Are Fine!"

Famous last words. The best event-sourced systems plan for failure.

Have your own horror story? Share it below—let’s learn together.
