It was 3 AM when PagerDuty screamed into the void.
Our Sidekiq dashboard showed a nightmare: dozens of workers frozen mid-execution. No errors. No timeouts. Just… silence.
The logs revealed nothing.
After hours of forensic debugging, we found the culprit: an invisible database deadlock—one that didn’t even log its crime.
Here’s how we hunted it down and fixed it for good.
The Crime Scene
Symptoms
- Sidekiq jobs hung indefinitely (but only sometimes).
- PostgreSQL logs showed no deadlocks (
log_lock_waits
was off—mistake #1). - Workers stuck on trivial queries (e.g.,
UPDATE orders SET status = 'processed'
).
The Root Cause
A deadly embrace:
-
Job A locked
Order #123
(viawith_advisory_lock
). -
Job B (in the same process) tried to update
Order #123
in a transaction. - Job A then queried another table… which Job B had already locked.
Deadlock. But PostgreSQL didn’t report it because advisory locks bypass the deadlock detector.
The Investigation
1. Diagnosing the Invisible
We used:
SELECT pid, query, wait_event_type, wait_event
FROM pg_stat_activity
WHERE state = 'active';
Found: Workers waiting on Lock
events—but no blocking queries in sight.
Lesson: Always enable in postgresql.conf
:
log_lock_waits = on
deadlock_timeout = 1s
2. The with_advisory_lock
Trap
This convenience method turns deadly when:
- Used inside transactions.
- Nested with other locks.
Bad:
Order.transaction do
with_advisory_lock("order_#{id}") do
# ...
end
end
Good:
with_advisory_lock("order_#{id}") do
Order.transaction do
# ...
end
end
The Fixes
1. Shorten Transactions
- Move non-critical ops (e.g., logging) outside transactions.
- Split mega-jobs into idempotent micro-jobs.
2. Use SKIP LOCKED
for Queues
For job queues competing for the same rows:
Order.lock("FOR UPDATE SKIP LOCKED").where(status: "pending").first
3. Nuclear Option: Pessimistic Lock Timeouts
Order.transaction do
Order.lock!("FOR UPDATE NOWAIT") # Fails instantly if locked
rescue ActiveRecord::LockWaitTimeout
retry_later
end
Lessons Learned
- Advisory locks are sharp knives: Never nest them in transactions.
-
PostgreSQL won’t always save you: Enable
log_lock_waits
proactively. - Idempotency is king: Design jobs to survive retries.
Final Tip: Test concurrency locally with:
10.times.map { Thread.new { Job.perform_now } }.each(&:join)
Have you battled phantom deadlocks? Share your war stories below.
Read more? Here’s how Shopify detonates deadlocks.
Top comments (0)