Alexander Ertli

Sagas to the Rescue: The Perils of Partial Success

Let’s say you're building a system that processes large files by breaking them into chunks, generating vector embeddings (for search or AI tasks), and storing metadata in a database.

"It works on my machine!" — the infamous last words before production chaos.

Picture this:

Your local environment hums along perfectly. API calls complete in milliseconds. Chunks of data ingest smoothly. Life is good.

Then you deploy to the cloud.

A network hiccup. A delayed embedding request. A context timeout. The job reschedules—only to fail again with:

ERROR: chunk 0: failed to insert vector - already exists

Now your system is stuck in an infinite retry ♻️ loop, reprocessing the same chunks, hitting duplicate-key errors ⚠️, and leaving behind orphaned entries 🧟 — data that exists in one system but not another.


"I'm sorry, it's lots of intimidating stuff here; let's tackle it piece by piece:"

  • Vector store — A database that stores high-dimensional data (like embeddings) used in AI and search systems. Think of it like a supercharged search index.
  • Embedding — A numerical representation of content (like a chunk of text) that captures its meaning, used for similarity search or AI tasks.
  • Context timeout — A deadline for an operation; if it takes too long, the system cancels it to avoid hanging.
  • Database transaction — A way to group multiple database operations into a single “all-or-nothing” step.
  • Orphaned entries — Data stuck in one system (like vectors) that doesn’t match up with metadata in the main database.
  • Compensating transaction — A follow-up action that undoes work if something fails (like deleting data you just wrote).

These terms are daily vocabulary when you develop or deploy a RAG pipeline for a GenAI agent.
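
To make the "context timeout" concrete, here is a minimal, self-contained Go sketch; embedChunk is a hypothetical stand-in for a slow embedding API call, not something from the post.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// embedChunk is a hypothetical stand-in for a slow embedding API call.
func embedChunk(ctx context.Context, chunk string) error {
	select {
	case <-time.After(10 * time.Second): // pretend the API is slow today
		return nil
	case <-ctx.Done():
		return ctx.Err() // context.DeadlineExceeded once the deadline passes
	}
}

func main() {
	// Attach a 5-second deadline to this request's context.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	if err := embedChunk(ctx, "some chunk of text"); err != nil {
		// Prints "timed out: true" -- in the story above, this is the moment
		// the job gets cancelled and rescheduled.
		fmt.Println("timed out:", errors.Is(err, context.DeadlineExceeded))
	}
}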

What Went Wrong?

The problem is partial success:

  1. Your worker ingested chunks 0-13 into the vector store (✅).
  2. Chunk 14 timed out (❌).
  3. The database transaction rolled back (no metadata recorded).
  4. But the vectors remained in the vector store (zombie data).
  5. On retry, the worker reprocessed chunks 0-13, hitting duplicate-key errors.

Result? A distributed mess.
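
For comparison, here is roughly what the naive worker looks like. This is a sketch, not the author's actual code: it reuses the Service, vectorStore, and saveMetadata names from the fix shown later, and assumes s.db exposes database/sql's BeginTx. The point is that nothing ever deletes vectors written before the failure.

// Naive ingestion: SQL work rolls back, but vector-store work does not.
func (s *Service) IngestChunksNaive(ctx context.Context, fileID string, chunks []string) error {
	tx, err := s.db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op after a successful Commit

	var ingestedIDs []string
	for i, chunk := range chunks {
		vectorID := fmt.Sprintf("%s-%d", fileID, i)
		// If this times out at chunk 14, chunks 0-13 already live in the
		// vector store and nothing ever removes them: zombie data.
		if err := s.vectorStore.Insert(ctx, vectorID, chunk); err != nil {
			return err
		}
		ingestedIDs = append(ingestedIDs, vectorID)
	}

	// Metadata is only written if every chunk succeeded; on failure the
	// rollback erases it, widening the gap between the two systems.
	if err := s.saveMetadata(tx, ingestedIDs); err != nil {
		return err
	}
	return tx.Commit()
}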

The Root Cause: Missing Atomicity

In a single database, transactions ensure all-or-nothing operations. But in distributed systems:

  • Vector store (e.g., Pinecone, Weaviate) ≠ SQL database.
  • No cross-system transactions exist.
  • Timeouts, crashes, or network issues leave systems inconsistent.

The Fix: Sagas (Compensating Transactions)

Instead of pretending we can have atomicity, we embrace failure and undo partial work explicitly.

Solution Code (Go)

func (s *Service) IngestChunks(ctx context.Context, fileID string, chunks []string) error {
    var ingestedIDs []string

    // Start DB transaction WITH rollback handler
    tx, commit, release, err := s.db.WithTransaction(ctx,
        // Register cleanup for vector store
        func() {
            for _, id := range ingestedIDs {
                s.vectorStore.Delete(ctx, id) // Compensate on failure
            }
        },
    )
    if err != nil {
        return err
    }
    defer release() // Auto-rollback + cleanup if commit not called

    // Ingest chunks
    for i, chunk := range chunks {
        vectorID := fmt.Sprintf("%s-%d", fileID, i)
        if err := s.vectorStore.Insert(ctx, vectorID, chunk); err != nil {
            return err // Triggers defer cleanup
        }
        ingestedIDs = append(ingestedIDs, vectorID)
    }

    // Record metadata in SQL (only if vectors succeeded)
    if err := s.saveMetadata(tx, ingestedIDs); err != nil {
        return err // Triggers defer cleanup
    }

    return commit(ctx) // Finalize (or rollback via defer)
}
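
The WithTransaction helper above is the service's own abstraction and isn't shown in the post. Purely as a sketch of what it could look like on top of database/sql (the DB type, field names, and signature below are assumptions chosen to match the call site):

// DB is assumed to wrap a plain *sql.DB.
type DB struct {
	sql *sql.DB
}

// WithTransaction begins a SQL transaction and returns it together with a
// commit function and a release function. release rolls back and runs the
// registered compensations unless commit succeeded first.
func (d *DB) WithTransaction(ctx context.Context, compensations ...func()) (*sql.Tx, func(context.Context) error, func(), error) {
	tx, err := d.sql.BeginTx(ctx, nil)
	if err != nil {
		return nil, nil, nil, err
	}

	committed := false

	commit := func(_ context.Context) error {
		if err := tx.Commit(); err != nil {
			return err
		}
		committed = true
		return nil
	}

	release := func() {
		if committed {
			return // success path: nothing to undo
		}
		_ = tx.Rollback() // undo the SQL side
		for _, undo := range compensations {
			undo() // undo the non-SQL side, e.g. delete inserted vectors
		}
	}

	return tx, commit, release, nil
}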

Why This Works

  1. On Success:

    • Vectors inserted ✅
    • SQL metadata saved ✅
    • Transaction commits ✅
  2. On Failure (timeout, crash, etc.):

    • defer release() runs compensation logic
    • Vectors deleted (undo partial inserts)
    • SQL transaction rolled back
  3. Retry Safety:

    • No duplicate-key errors (clean slate on retry)
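
To see the retry-safety point in code, here is a hypothetical caller (not from the post) that simply retries the whole operation; because release() compensates on every failed attempt, each retry starts from a clean slate.

// Hypothetical caller: because a failed attempt compensates before returning,
// each retry starts with no leftover vectors and no half-written metadata.
func ingestWithRetry(ctx context.Context, svc *Service, fileID string, chunks []string) error {
	const maxAttempts = 3
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = svc.IngestChunks(ctx, fileID, chunks); err == nil {
			return nil
		}
		log.Printf("ingest attempt %d/%d failed: %v", attempt, maxAttempts, err)
		time.Sleep(time.Duration(attempt) * time.Second) // crude linear backoff
	}
	return err
}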

Key Takeaways

  • Distributed systems fail partially — and they will! So plan for it.
  • Compensating transactions (Sagas) undo work explicitly.
  • Always defer cleanup to handle crashes/timeouts.
  • Idempotency is crucial for retries.
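
On that last point: compensation covers the clean failure path, but a hard crash can still skip the cleanup. A complementary, purely hypothetical sketch (the post only shows Insert and Delete on the vector store) is to treat a duplicate-key error as success, which the deterministic vector IDs above make safe:

// ErrAlreadyExists stands in for whatever duplicate-key error your vector
// store client actually returns (hypothetical; check your client's docs).
var ErrAlreadyExists = errors.New("vector already exists")

// insertIdempotent treats "already exists" as success, so a retry that runs
// after a crashed cleanup does not fail with duplicate-key errors.
func (s *Service) insertIdempotent(ctx context.Context, vectorID, chunk string) error {
	err := s.vectorStore.Insert(ctx, vectorID, chunk)
	if errors.Is(err, ErrAlreadyExists) {
		return nil // a previous attempt already wrote this vector
	}
	return err
}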

Moral of the story:

"If you can't make it atomic, make it reversible."

Next time your cloud deployment behaves oddly, ask:
"Did I handle partial failures—or just hope they wouldn’t happen?"
