Let’s say you're building a system that processes large files by breaking them into chunks, generating vector embeddings (for search or AI tasks), and storing metadata in a database.
"It works on my machine!" — the infamous last words before production chaos.
Picture this:
Your local environment hums along perfectly. API calls complete in milliseconds. Chunks of data ingest smoothly. Life is good.
Then you deploy to the cloud.
A network hiccup. A delayed embedding request. A context timeout. The job reschedules—only to fail again with:
```
ERROR: chunk 0: failed to insert vector - already exists
```
Now your system is stuck in an infinite retry ♻️ loop, reprocessing the same chunks, hitting duplicate-key errors ⚠️, and leaving behind orphaned entries 🧟 — data that exists in one system but not another.
"I'm sorry, it's lots of intimidating stuff here; let's tackle it piece by piece:"
- Vector store — A database that stores high-dimensional data (like embeddings) used in AI and search systems. Think of it like a supercharged search index.
- Embedding — A numerical representation of content (like a chunk of text) that captures its meaning, used for similarity search or AI tasks.
- Context timeout — A deadline for an operation; if it takes too long, the system cancels it to avoid hanging.
- Database transaction — A way to group multiple database operations into a single “all-or-nothing” step.
- Orphaned entries — Data stuck in one system (like vectors) that doesn’t match up with metadata in the main database.
- Compensating transaction — A follow-up action that undoes work if something fails (like deleting data you just wrote).
These terms become everyday vocabulary once you build or deploy a RAG pipeline for a GenAI agent.
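To make "context timeout" concrete, here is a minimal, self-contained Go sketch. The `fetchEmbedding` function is a hypothetical stand-in for an embedding API call; the point is only to show how a deadline cancels a slow operation.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fetchEmbedding is a hypothetical stand-in for a call to an embedding API.
func fetchEmbedding(ctx context.Context, text string) ([]float32, error) {
	select {
	case <-time.After(2 * time.Second): // simulate a slow upstream service
		return []float32{0.1, 0.2, 0.3}, nil
	case <-ctx.Done():
		return nil, ctx.Err() // the deadline fired first: "context deadline exceeded"
	}
}

func main() {
	// Give the operation at most 500ms before the system cancels it.
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()

	if _, err := fetchEmbedding(ctx, "chunk 14"); err != nil {
		fmt.Println("embedding failed:", err)
	}
}
```

When that deadline fires halfway through an ingestion job, you get exactly the partial-success scenario described next.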
What Went Wrong?
The problem is partial success:
- Your worker ingested chunks 0-13 into the vector store (✅).
- Chunk 14 timed out (❌).
- The database transaction rolled back (no metadata recorded).
- But the vectors remained in the vector store (zombie data).
- On retry, the worker reprocessed chunks 0-13, hitting duplicate-key errors.
Result? A distributed mess.
The Root Cause: Missing Atomicity
In a single database, transactions ensure all-or-nothing operations. But in distributed systems:
- Vector store (e.g., Pinecone, Weaviate) ≠ SQL database.
- No cross-system transactions exist.
- Timeouts, crashes, or network issues leave systems inconsistent.
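For contrast, here is a minimal sketch of the naive approach, using small hypothetical `VectorStore` and `MetadataStore` interfaces (not any particular client library). The vector writes and the SQL write hit two independent systems, so any failure between them leaves orphaned vectors behind.

```go
package ingest

import (
	"context"
	"fmt"
)

// VectorStore and MetadataStore are hypothetical minimal interfaces for this sketch.
type VectorStore interface {
	Insert(ctx context.Context, id, chunk string) error
	Delete(ctx context.Context, id string) error
}

type MetadataStore interface {
	SaveMetadata(ctx context.Context, ids []string) error
}

// ingestNaive writes to two systems with no coordination and no way to undo.
func ingestNaive(ctx context.Context, vs VectorStore, meta MetadataStore, fileID string, chunks []string) error {
	var ids []string
	for i, chunk := range chunks {
		id := fmt.Sprintf("%s-%d", fileID, i)
		if err := vs.Insert(ctx, id, chunk); err != nil {
			// Chunks 0..i-1 are already in the vector store, and nothing removes them.
			return err
		}
		ids = append(ids, id)
	}
	// If this call times out or the worker crashes here, the SQL metadata is missing
	// while every vector above still exists: orphaned "zombie" data.
	return meta.SaveMetadata(ctx, ids)
}
```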
The Fix: Sagas (Compensating Transactions)
Instead of pretending we have atomicity across systems, we embrace failure and undo partial work explicitly.
Solution Code (Go)
```go
func (s *Service) IngestChunks(ctx context.Context, fileID string, chunks []string) error {
	var ingestedIDs []string

	// Start DB transaction WITH a rollback handler
	tx, commit, release, err := s.db.WithTransaction(ctx,
		// Register cleanup for the vector store
		func() {
			for _, id := range ingestedIDs {
				s.vectorStore.Delete(ctx, id) // Compensate on failure
			}
		},
	)
	if err != nil {
		return err
	}
	defer release() // Auto-rollback + cleanup if commit is never called

	// Ingest chunks into the vector store
	for i, chunk := range chunks {
		vectorID := fmt.Sprintf("%s-%d", fileID, i)
		if err := s.vectorStore.Insert(ctx, vectorID, chunk); err != nil {
			return err // Triggers the deferred cleanup
		}
		ingestedIDs = append(ingestedIDs, vectorID)
	}

	// Record metadata in SQL (only if all vectors succeeded)
	if err := s.saveMetadata(tx, ingestedIDs); err != nil {
		return err // Triggers the deferred cleanup
	}

	return commit(ctx) // Finalize (or roll back via the deferred release)
}
```
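The snippet above relies on a `WithTransaction` helper that pairs a SQL transaction with a compensation callback. That helper is not shown in the article; here is one possible sketch built on the standard `database/sql` package. The `DB` wrapper, its `sqlDB` field, and the exact signature are assumptions for illustration, not the author's actual implementation.

```go
package ingest

import (
	"context"
	"database/sql"
)

// DB is a hypothetical wrapper around *sql.DB, matching how the article's
// Service calls s.db.WithTransaction.
type DB struct {
	sqlDB *sql.DB
}

// WithTransaction opens a transaction and returns it along with a commit
// function and a release function. If commit is never called, release rolls
// the transaction back and runs the compensation callback (the saga's "undo").
func (d *DB) WithTransaction(ctx context.Context, compensate func()) (
	tx *sql.Tx, commit func(context.Context) error, release func(), err error,
) {
	tx, err = d.sqlDB.BeginTx(ctx, nil)
	if err != nil {
		return nil, nil, nil, err
	}

	committed := false

	commit = func(_ context.Context) error {
		if err := tx.Commit(); err != nil {
			return err
		}
		committed = true
		return nil
	}

	release = func() {
		if committed {
			return // success path: nothing to undo
		}
		_ = tx.Rollback() // undo the SQL side
		compensate()      // undo the vector-store side
	}

	return tx, commit, release, nil
}
```

The key design choice is that compensation is registered up front, before any work happens, so a single `defer release()` covers every early return in the ingestion function.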
Why This Works
- On success:
  - Vectors inserted ✅
  - SQL metadata saved ✅
  - Transaction commits ✅
- On failure (timeout, crash, etc.):
  - `defer release()` runs the compensation logic
  - Vectors deleted (undoing the partial inserts)
  - SQL transaction rolled back
- Retry safety:
  - No duplicate-key errors: the retry starts from a clean slate (a further idempotency sketch follows below)
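Compensation gives retries a clean slate, but defense in depth helps when the cleanup itself fails (for example, if the vector-store delete also times out). A common complement is to make the insert idempotent. The sketch below reuses the hypothetical `VectorStore` interface from earlier; `ErrAlreadyExists` is an assumed sentinel error, not something a specific vector-store client is guaranteed to expose.

```go
import (
	"context"
	"errors"
	"fmt"
)

// ErrAlreadyExists is a hypothetical sentinel error a vector-store client might return.
var ErrAlreadyExists = errors.New("vector already exists")

// insertIdempotent treats "already exists" as success: the vector ID is
// deterministic (fileID plus chunk index), so a retry that re-inserts an
// existing vector simply moves on instead of failing.
func insertIdempotent(ctx context.Context, vs VectorStore, fileID string, i int, chunk string) error {
	id := fmt.Sprintf("%s-%d", fileID, i)
	if err := vs.Insert(ctx, id, chunk); err != nil {
		if errors.Is(err, ErrAlreadyExists) {
			return nil // the data is already there; nothing to redo
		}
		return err
	}
	return nil
}
```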
Key Takeaways
✔ Distributed systems fail partially — and they will! So plan for it.
✔ Compensating transactions (Sagas) undo work explicitly.
✔ Always defer cleanup to handle crashes/timeouts.
✔ Idempotency is crucial for retries.
Moral of the story:
"If you can't make it atomic, make it reversible."
Next time your cloud deployment behaves oddly, ask:
"Did I handle partial failures—or just hope they wouldn’t happen?"