Alexander Ertli

Sagas to the Rescue: The Perils of Partial Success

Let’s say you're building a system that processes large files by breaking them into chunks, generating vector embeddings (for search or AI tasks), and storing metadata in a database.

"It works on my machine!" — the infamous last words before production chaos.

Picture this:

Your local environment hums along perfectly. API calls complete in milliseconds. Chunks of data ingest smoothly. Life is good.

Then you deploy to the cloud.

A network hiccup. A delayed embedding request. A context timeout. The job reschedules—only to fail again with:

ERROR: chunk 0: failed to insert vector - already exists

Now your system is stuck in an infinite retry ♻️ loop, reprocessing the same chunks, hitting duplicate-key errors ⚠️, and leaving behind orphaned entries 🧟 — data that exists in one system but not another.


"I'm sorry, it's lots of intimidating stuff here; let's tackle it piece by piece:"

  • Vector store — A database that stores high-dimensional data (like embeddings) used in AI and search systems. Think of it like a supercharged search index.
  • Embedding — A numerical representation of content (like a chunk of text) that captures its meaning, used for similarity search or AI tasks.
  • Context timeout — A deadline for an operation; if it takes too long, the system cancels it to avoid hanging.
  • Database transaction — A way to group multiple database operations into a single “all-or-nothing” step.
  • Orphaned entries — Data stuck in one system (like vectors) that doesn’t match up with metadata in the main database.
  • Compensating transaction — A follow-up action that undoes work if something fails (like deleting data you just wrote).

These terms are daily vocabulary when you develop or deploy a RAG pipeline for a GenAI agent.
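
To make the "context timeout" concrete, here is a minimal, self-contained Go sketch; embedChunk is a hypothetical stand-in for a slow embedding API call, not something from the post.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// embedChunk is a hypothetical stand-in for a slow embedding API call.
func embedChunk(ctx context.Context, chunk string) error {
	select {
	case <-time.After(10 * time.Second): // pretend the API is slow today
		return nil
	case <-ctx.Done():
		return ctx.Err() // context.DeadlineExceeded once the deadline passes
	}
}

func main() {
	// Attach a 5-second deadline to this request's context.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	if err := embedChunk(ctx, "some chunk of text"); err != nil {
		// Prints "timed out: true" -- in the story above, this is the moment
		// the job gets cancelled and rescheduled.
		fmt.Println("timed out:", errors.Is(err, context.DeadlineExceeded))
	}
}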

What Went Wrong?

The problem is partial success:

  1. Your worker ingested chunks 0-13 into the vector store (✅).
  2. Chunk 14 timed out (❌).
  3. The database transaction rolled back (no metadata recorded).
  4. But the vectors remained in the vector store (zombie data).
  5. On retry, the worker reprocessed chunks 0-13, hitting duplicate-key errors.

Result? A distributed mess.
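
For comparison, here is roughly what the naive worker looks like. This is a sketch, not the author's actual code: it reuses the Service, vectorStore, and saveMetadata names from the fix shown later, and assumes s.db exposes database/sql's BeginTx. The point is that nothing ever deletes vectors written before the failure.

// Naive ingestion: SQL work rolls back, but vector-store work does not.
func (s *Service) IngestChunksNaive(ctx context.Context, fileID string, chunks []string) error {
	tx, err := s.db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op after a successful Commit

	var ingestedIDs []string
	for i, chunk := range chunks {
		vectorID := fmt.Sprintf("%s-%d", fileID, i)
		// If this times out at chunk 14, chunks 0-13 already live in the
		// vector store and nothing ever removes them: zombie data.
		if err := s.vectorStore.Insert(ctx, vectorID, chunk); err != nil {
			return err
		}
		ingestedIDs = append(ingestedIDs, vectorID)
	}

	// Metadata is only written if every chunk succeeded; on failure the
	// rollback erases it, widening the gap between the two systems.
	if err := s.saveMetadata(tx, ingestedIDs); err != nil {
		return err
	}
	return tx.Commit()
}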

The Root Cause: Missing Atomicity

In a single database, transactions ensure all-or-nothing operations. But in distributed systems:

  • Vector store (e.g., Pinecone, Weaviate) ≠ SQL database.
  • No cross-system transactions exist.
  • Timeouts, crashes, or network issues leave systems inconsistent.

The Fix: Sagas (Compensating Transactions)

Instead of pretending we can have atomicity, we embrace failure and undo partial work explicitly.

Solution Code (Go)

func (s *Service) IngestChunks(ctx context.Context, fileID string, chunks []string) error {
    var ingestedIDs []string

    // Start DB transaction WITH rollback handler
    tx, commit, release, err := s.db.WithTransaction(ctx,
        // Register cleanup for vector store
        func() {
            for _, id := range ingestedIDs {
                s.vectorStore.Delete(ctx, id) // Compensate on failure
            }
        },
    )
    if err != nil {
        return err
    }
    defer release() // Auto-rollback + cleanup if commit not called

    // Ingest chunks
    for i, chunk := range chunks {
        vectorID := fmt.Sprintf("%s-%d", fileID, i)
        if err := s.vectorStore.Insert(ctx, vectorID, chunk); err != nil {
            return err // Triggers defer cleanup
        }
        ingestedIDs = append(ingestedIDs, vectorID)
    }

    // Record metadata in SQL (only if vectors succeeded)
    if err := s.saveMetadata(tx, ingestedIDs); err != nil {
        return err // Triggers defer cleanup
    }

    return commit(ctx) // Finalize (or rollback via defer)
}
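
The WithTransaction helper above is the service's own abstraction and isn't shown in the post. Purely as a sketch of what it could look like on top of database/sql (the DB type, field names, and signature below are assumptions chosen to match the call site):

// DB is assumed to wrap a plain *sql.DB.
type DB struct {
	sql *sql.DB
}

// WithTransaction begins a SQL transaction and returns it together with a
// commit function and a release function. release rolls back and runs the
// registered compensations unless commit succeeded first.
func (d *DB) WithTransaction(ctx context.Context, compensations ...func()) (*sql.Tx, func(context.Context) error, func(), error) {
	tx, err := d.sql.BeginTx(ctx, nil)
	if err != nil {
		return nil, nil, nil, err
	}

	committed := false

	commit := func(_ context.Context) error {
		if err := tx.Commit(); err != nil {
			return err
		}
		committed = true
		return nil
	}

	release := func() {
		if committed {
			return // success path: nothing to undo
		}
		_ = tx.Rollback() // undo the SQL side
		for _, undo := range compensations {
			undo() // undo the non-SQL side, e.g. delete inserted vectors
		}
	}

	return tx, commit, release, nil
}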

Why This Works

  1. On Success:

    • Vectors inserted ✅
    • SQL metadata saved ✅
    • Transaction commits ✅
  2. On Failure (timeout, crash, etc.):

    • defer release() runs compensation logic
    • Vectors deleted (undo partial inserts)
    • SQL transaction rolled back
  3. Retry Safety:

    • No duplicate-key errors (clean slate on retry)
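
To see the retry-safety point in code, here is a hypothetical caller (not from the post) that simply retries the whole operation; because release() compensates on every failed attempt, each retry starts from a clean slate.

// Hypothetical caller: because a failed attempt compensates before returning,
// each retry starts with no leftover vectors and no half-written metadata.
func ingestWithRetry(ctx context.Context, svc *Service, fileID string, chunks []string) error {
	const maxAttempts = 3
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = svc.IngestChunks(ctx, fileID, chunks); err == nil {
			return nil
		}
		log.Printf("ingest attempt %d/%d failed: %v", attempt, maxAttempts, err)
		time.Sleep(time.Duration(attempt) * time.Second) // crude linear backoff
	}
	return err
}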

Key Takeaways

  • Distributed systems fail partially — and they will! So plan for it.
  • Compensating transactions (Sagas) undo work explicitly.
  • Always defer cleanup to handle crashes/timeouts.
  • Idempotency is crucial for retries.
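
On that last point: compensation covers the clean failure path, but a hard crash can still skip the cleanup. A complementary, purely hypothetical sketch (the post only shows Insert and Delete on the vector store) is to treat a duplicate-key error as success, which the deterministic vector IDs above make safe:

// ErrAlreadyExists stands in for whatever duplicate-key error your vector
// store client actually returns (hypothetical; check your client's docs).
var ErrAlreadyExists = errors.New("vector already exists")

// insertIdempotent treats "already exists" as success, so a retry that runs
// after a crashed cleanup does not fail with duplicate-key errors.
func (s *Service) insertIdempotent(ctx context.Context, vectorID, chunk string) error {
	err := s.vectorStore.Insert(ctx, vectorID, chunk)
	if errors.Is(err, ErrAlreadyExists) {
		return nil // a previous attempt already wrote this vector
	}
	return err
}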

Moral of the story:

"If you can't make it atomic, make it reversible."

Next time your cloud deployment behaves oddly, ask:
"Did I handle partial failures—or just hope they wouldn’t happen?"
