RabbitMQ: automatic reconnect with topology re-assert on connection/channel loss #120

Merged
owens1127 merged 2 commits into main from copilot/fix-broker-connection-recovery on Apr 24, 2026

Conversation

Copilot AI commented Feb 22, 2026

Contributor

The broker integration had no fault tolerance — a dropped connection or externally deleted queue required a manual restart to recover. This replaces both the connection and queue channel management with resilient implementations.

Connection (connection.ts)

  • Infinite retry loop with jittered exponential backoff: min(1000ms × 2^attempt, 30s) × [0.5, 1.0] jitter prevents thundering herd
  • close/error listeners on every connection; on either event the reference is nulled and reconnect is scheduled
  • destroyed flag: $disconnect() exits the retry loop cleanly; concurrent createChannel() calls share a single in-flight connectPromise
  • Configurable via env: RABBITMQ_RETRY_INITIAL_DELAY_MS, RABBITMQ_RETRY_MAX_DELAY_MS, RABBITMQ_HEARTBEAT
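
The backoff formula in the first bullet can be sketched as below. This is a minimal illustration rather than the code in connection.ts: the tests mention a jitteredDelay helper, but its signature here (the parameter names, the defaults, and the injectable random source) is an assumption chosen so the bounds are easy to check.

```typescript
// Hypothetical sketch of the backoff formula; the helper in connection.ts may differ.
function jitteredDelay(
    attempt: number,
    initialDelayMs: number = 1000,
    maxDelayMs: number = 30000,
    random: () => number = Math.random // injectable so tests can be deterministic
): number {
    // Exponential growth, capped at the maximum delay.
    const capped = Math.min(initialDelayMs * 2 ** attempt, maxDelayMs)
    // Scale by a factor in [0.5, 1.0) so many clients do not retry in lockstep.
    return capped * (0.5 + random() * 0.5)
}
```

With these defaults, attempt 0 waits between 500ms and 1s, and any attempt past the cap waits between 15s and 30s.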

Queue channel (queue.ts)

  • Topology re-assert on every channel: ch.assertQueue(name, { durable: true }) is called on each new channel, so queues are recreated if deleted externally
  • close/error listeners on every channel; loss clears the reference so the next send/sendJson lazily recreates and re-asserts
  • Replaced isReady/isConnecting flag pair with a direct channel instance + channelPromise gate for concurrent sends
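
The lazy-recreate pattern described in these bullets could look roughly like the sketch below. It is not the code from queue.ts: ChannelLike is a minimal stand-in for the amqplib channel (an assumed shape), and ResilientQueue, openChannel, and the createChannel factory are hypothetical names used only for illustration.

```typescript
// Minimal structural stand-in for an amqplib channel (assumed shape).
interface ChannelLike {
    assertQueue(name: string, options: { durable: boolean }): Promise<unknown>
    sendToQueue(name: string, content: Uint8Array): boolean
    on(event: "close" | "error", listener: () => void): unknown
}

// Hypothetical class; queue.ts may be structured differently.
class ResilientQueue {
    private channel: ChannelLike | null = null
    private channelPromise: Promise<ChannelLike> | null = null

    constructor(
        private readonly name: string,
        private readonly createChannel: () => Promise<ChannelLike>
    ) {}

    private getChannel(): Promise<ChannelLike> {
        if (this.channel) return Promise.resolve(this.channel)
        if (!this.channelPromise) {
            // Concurrent callers share this single in-flight creation.
            const pending = this.openChannel()
            this.channelPromise = pending
            const clear = () => {
                if (this.channelPromise === pending) this.channelPromise = null
            }
            pending.then(clear, clear)
        }
        return this.channelPromise
    }

    private async openChannel(): Promise<ChannelLike> {
        const ch = await this.createChannel()
        let lost = false
        const drop = () => {
            lost = true
            // Clearing the reference makes the next send recreate the channel.
            if (this.channel === ch) this.channel = null
        }
        ch.on("close", drop)
        ch.on("error", drop)
        // Re-assert topology on every new channel so a deleted queue is recreated.
        await ch.assertQueue(this.name, { durable: true })
        if (lost) throw new Error("channel lost during setup")
        this.channel = ch
        return ch
    }

    async send(content: Uint8Array): Promise<boolean> {
        const ch = await this.getChannel()
        return ch.sendToQueue(this.name, content)
    }
}
```

Keeping the promise gate separate from the cached channel means a failed creation clears the gate, so the next send retries instead of reusing a rejected promise.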

Tests

  • connection.test.ts: jitteredDelay range bounds, concurrent connect deduplication, reconnect-after-close, $disconnect suppressing future reconnects
  • queue.test.ts: topology assertion on first send, channel reuse across sends, re-assertion after channel loss, concurrent-send deduplication
Original prompt

Goal

Fix the API so message broker connections and queues/exchanges are automatically re-created if they don’t exist or if the connection drops while the API is running. Currently a restart is required.

Repository

  • Repo: Raid-Hub/API
  • Language: TypeScript (100%)

Problem

When the API is running:

  • If the broker connection drops, the API does not recover.
  • If required queues/exchanges/bindings are missing (e.g., deleted externally) the API does not re-assert and re-create them.
  • This requires manually restarting the API to restore processing.

Requirements

  1. Implement resilient broker connectivity:
    • Detect connection/channel close and error events.
    • Reconnect automatically with retry + backoff.
    • Recreate channel(s) after reconnect.
  2. Ensure topology is (re)created:
    • Assert required exchanges/queues/bindings on (re)connect.
    • If queues/exchanges are missing at runtime, recover by re-asserting without requiring a process restart.
  3. Consumers and publishers:
    • Re-register consumers after reconnect.
    • Ensure publishers can continue after reconnect (either by lazily obtaining a healthy channel or by swapping channel references).
    • Avoid duplicate consumers when reconnect happens repeatedly.
  4. Observability:
    • Add clear logs for connect/reconnect attempts, backoff, topology re-assert, and consumer re-registration.
    • Optionally add metrics-friendly counters/hooks if a metrics system exists.
  5. Safety:
    • Handle in-flight messages safely (ack/nack behavior) and avoid message loss due to reconnect loops.
    • Avoid tight reconnect loops; use jittered exponential backoff.
  6. Tests (as feasible):
    • Add unit tests for reconnect/backoff logic and for “ensure topology” being invoked on reconnect.
    • If integration test harness exists for RabbitMQ, add or update integration tests to simulate disconnect and verify recovery.

Deliverables

  • Code changes in Raid-Hub/API implementing the above.
  • Updated documentation/config notes if needed (env vars for retry/backoff, heartbeat).
  • PR should include a concise summary and how to verify (steps to reproduce + confirm fix).

This pull request was created from Copilot chat.


barecheck Bot commented Feb 22, 2026

Barecheck - Code coverage report

Total: 92.78%

Your code coverage diff: 0.00% ▴

✅ All code changes are covered

… re-assert

Co-authored-by: owens1127 <98496129+owens1127@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix API to auto-recreate message broker connections and queues" to "RabbitMQ: automatic reconnect with topology re-assert on connection/channel loss" Feb 22, 2026
Copilot AI requested a review from owens1127 February 22, 2026 04:52
@owens1127 owens1127 marked this pull request as ready for review February 22, 2026 05:00
Copilot AI review requested due to automatic review settings February 22, 2026 05:00

Copilot AI left a comment

Pull request overview

This pull request implements automatic reconnection with topology re-assertion for RabbitMQ connections and channels, addressing fault tolerance issues where dropped connections or deleted queues required manual API restarts.

Changes:

  • Added connection-level resilience with infinite retry loop using jittered exponential backoff (1s to 30s max delay)
  • Implemented channel-level resilience with automatic topology re-assertion (queue recreation) on channel loss
  • Added comprehensive unit tests for both connection and queue resilience behaviors
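
As a rough picture of the connection-level retry loop summarized above, the sketch below retries until a connection succeeds or the owner is destroyed. The function name connectWithRetry and all of its parameters are hypothetical; the actual loop in connection.ts is not shown in this conversation and may differ.

```typescript
// Hypothetical retry loop; names and structure are assumptions for illustration.
async function connectWithRetry<T>(
    connect: () => Promise<T>,
    isDestroyed: () => boolean,
    delayFor: (attempt: number) => number,
    sleep: (ms: number) => Promise<void> = ms =>
        new Promise<void>(resolve => setTimeout(resolve, ms))
): Promise<T | null> {
    // Loops without an attempt limit; only the destroyed flag ends it early.
    for (let attempt = 0; !isDestroyed(); attempt++) {
        try {
            return await connect()
        } catch (err) {
            const delay = delayFor(attempt)
            console.warn(
                `connect attempt ${attempt + 1} failed; retrying in ${Math.round(delay)}ms`,
                err
            )
            await sleep(delay)
        }
    }
    // The owner was destroyed before a connection could be established.
    return null
}
```

A caller would pass the jittered backoff as delayFor, for example `attempt => Math.min(1000 * 2 ** attempt, 30000) * (0.5 + Math.random() * 0.5)`, and a check of the destroyed flag as isDestroyed.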

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/integrations/rabbitmq/connection.ts | Replaced simple connection logic with retry loop, connection loss handlers, and configurable backoff |
| src/integrations/rabbitmq/connection.test.ts | Added tests for jittered delay, concurrent connection deduplication, reconnection, and $disconnect |
| src/integrations/rabbitmq/queue.ts | Replaced flag-based channel management with channel instance caching and lazy recreation with topology re-assertion |
| src/integrations/rabbitmq/queue.test.ts | Added tests for topology assertion, channel reuse, re-assertion after loss, and concurrent send deduplication |

Comment on lines +97 to +99
    $disconnect(): void {
        this.destroyed = true
        this.conn?.close()

Copilot AI Feb 22, 2026

The $disconnect() method calls this.conn?.close() without waiting for the close to complete and without handling errors. If close() throws or rejects, or if operations are still in flight, the failure is not handled gracefully.

Consider awaiting this.conn?.close() if the close operation is async, and wrapping it in try-catch to handle errors during disconnection. This matters most for graceful shutdown scenarios.

Suggested change
-     $disconnect(): void {
-         this.destroyed = true
-         this.conn?.close()
+     async $disconnect(): Promise<void> {
+         this.destroyed = true
+         if (this.conn) {
+             try {
+                 await this.conn.close()
+             } catch (err) {
+                 logger.warn(
+                     "RABBITMQ_DISCONNECT_FAILED",
+                     err instanceof Error ? err : new Error(String(err))
+                 )
+             } finally {
+                 this.conn = null
+             }
+         }
Comment on lines +6 to +8
const INITIAL_RETRY_DELAY_MS = parseInt(process.env.RABBITMQ_RETRY_INITIAL_DELAY_MS ?? "1000")
const MAX_RETRY_DELAY_MS = parseInt(process.env.RABBITMQ_RETRY_MAX_DELAY_MS ?? "30000")

Copilot AI Feb 22, 2026

Missing input validation for environment variables: The parseInt() calls for RABBITMQ_RETRY_INITIAL_DELAY_MS and RABBITMQ_RETRY_MAX_DELAY_MS don't validate that the parsed values are valid numbers. If the environment variable contains non-numeric values, parseInt() will return NaN, which will cause unexpected behavior in the jitteredDelay function (resulting in NaN delays).

Consider adding validation to ensure these values are positive numbers, or at minimum check for isNaN() and provide a fallback or throw a clear error message.

Suggested change
- const INITIAL_RETRY_DELAY_MS = parseInt(process.env.RABBITMQ_RETRY_INITIAL_DELAY_MS ?? "1000")
- const MAX_RETRY_DELAY_MS = parseInt(process.env.RABBITMQ_RETRY_MAX_DELAY_MS ?? "30000")
+ function parsePositiveIntEnv(
+     rawValue: string | undefined,
+     defaultValue: number,
+     envName: string
+ ): number {
+     if (rawValue === undefined) {
+         return defaultValue
+     }
+     const parsed = Number.parseInt(rawValue, 10)
+     if (Number.isNaN(parsed) || parsed <= 0) {
+         logger.warn(
+             "RABBITMQ_INVALID_ENV",
+             new Error(`Invalid value for ${envName}: ${rawValue}`),
+             { envName, rawValue, defaultValue }
+         )
+         return defaultValue
+     }
+     return parsed
+ }
+ const INITIAL_RETRY_DELAY_MS = parsePositiveIntEnv(
+     process.env.RABBITMQ_RETRY_INITIAL_DELAY_MS,
+     1000,
+     "RABBITMQ_RETRY_INITIAL_DELAY_MS"
+ )
+ const MAX_RETRY_DELAY_MS = parsePositiveIntEnv(
+     process.env.RABBITMQ_RETRY_MAX_DELAY_MS,
+     30000,
+     "RABBITMQ_RETRY_MAX_DELAY_MS"
+ )
this.user = args.user
this.password = args.password
this.port = parseInt(args.port.toString())
this.heartbeat = args.heartbeat ?? parseInt(process.env.RABBITMQ_HEARTBEAT ?? "60")

Copilot AI Feb 22, 2026

Similar to lines 6-7, the heartbeat parsing doesn't validate for NaN. If RABBITMQ_HEARTBEAT contains a non-numeric value, this will pass NaN to the amqplib connect call, which may cause connection issues or unexpected behavior.

Add validation to ensure the heartbeat value is a positive number, with a clear error message if invalid.

Suggested change
-         this.heartbeat = args.heartbeat ?? parseInt(process.env.RABBITMQ_HEARTBEAT ?? "60")
+         const envHeartbeatStr = process.env.RABBITMQ_HEARTBEAT
+         const envHeartbeat = parseInt(envHeartbeatStr ?? "60", 10)
+         const heartbeat = args.heartbeat ?? envHeartbeat
+         if (!Number.isFinite(heartbeat) || heartbeat <= 0) {
+             const providedValue =
+                 args.heartbeat !== undefined ? String(args.heartbeat) : envHeartbeatStr ?? "60"
+             throw new Error(
+                 `Invalid RabbitMQ heartbeat value "${providedValue}". It must be a positive number.`
+             )
+         }
+         this.heartbeat = heartbeat
Comment on lines +88 to 91
await this.connectPromise
}
const conn = this.conn
if (!conn) {

Copilot AI Feb 22, 2026

Race condition in connection handling: If the connection is lost immediately after connectPromise completes (line 88) but before this.conn is read (line 90), the handleConnectionLoss event handler could set this.conn to null, causing the check at line 91 to throw "Failed to connect to RabbitMQ" even though reconnection is being initiated.

This is a narrow race window, but consider capturing the connection reference immediately after the await completes to avoid this issue, or add additional null checking with appropriate error messaging.
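
One way to make the read independent of the shared field is to have the connect promise resolve to the connection itself and use that value directly. The sketch below illustrates the idea under stated assumptions: ConnectionLike, ConnectionHolder, and ensureConnected are hypothetical names, and the real class in connection.ts carries more state (retry loop, destroyed flag) than is shown here.

```typescript
// Hypothetical shapes for illustration; not the class from connection.ts.
interface ConnectionLike {
    createChannel(): Promise<unknown>
}

class ConnectionHolder {
    private conn: ConnectionLike | null = null
    private connectPromise: Promise<ConnectionLike> | null = null

    constructor(private readonly connect: () => Promise<ConnectionLike>) {}

    private ensureConnected(): Promise<ConnectionLike> {
        if (this.conn) return Promise.resolve(this.conn)
        if (!this.connectPromise) {
            const pending = this.connect().then(conn => {
                this.conn = conn
                return conn
            })
            const clear = () => {
                if (this.connectPromise === pending) this.connectPromise = null
            }
            pending.then(clear, clear)
            this.connectPromise = pending
        }
        return this.connectPromise
    }

    // Stands in for the close/error handler that nulls the shared reference.
    handleConnectionLoss(): void {
        this.conn = null
    }

    async createChannel(): Promise<unknown> {
        // Use the value the promise resolved with instead of re-reading this.conn,
        // so a concurrent handleConnectionLoss() cannot turn it into null here.
        const conn = await this.ensureConnected()
        return conn.createChannel()
    }
}
```

If the connection drops right after this point, createChannel() on the stale connection would reject, and the caller sees that error rather than a misleading "Failed to connect" message.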

throw new Error("Failed to create channel")
}

const channel = await this.getChannel()

Copilot AI Feb 22, 2026

The getChannel() call replaces the previous error-prone connection logic, but errors from channel creation will propagate to callers. Since the sendBuffer method (lines 71-80, outside the diff) catches and logs errors without rethrowing, send operations will silently fail when channel creation fails. This could lead to message loss during connection issues.

Consider documenting this behavior or ensuring that errors from getChannel() are handled appropriately by the caller to avoid silent failures.
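
One option for surfacing these failures is to return an explicit result instead of only logging. The sketch below is an assumption-heavy illustration: sendBuffer is outside the diff, so its real shape is unknown, and trySend, SendResult, and the RABBITMQ_SEND_FAILED log key are all hypothetical.

```typescript
// Hypothetical result type so callers can tell a failed send from a successful one.
type SendResult = { ok: true } | { ok: false; error: Error }

// Minimal stand-in for the part of the channel this sketch needs (assumed shape).
interface SendChannel {
    sendToQueue(queue: string, content: Uint8Array): boolean
}

// Hypothetical wrapper; the real sendBuffer in queue.ts is not shown in the diff.
async function trySend(
    getChannel: () => Promise<SendChannel>,
    queue: string,
    content: Uint8Array
): Promise<SendResult> {
    try {
        const channel = await getChannel()
        channel.sendToQueue(queue, content)
        return { ok: true }
    } catch (err) {
        const error = err instanceof Error ? err : new Error(String(err))
        // Still logged, but the failure is also returned to the caller.
        console.error("RABBITMQ_SEND_FAILED", error) // hypothetical log key
        return { ok: false, error }
    }
}
```

A caller can then decide whether to retry, buffer the message, or fail the request when ok is false, rather than assuming the message was published.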

@owens1127 owens1127 merged commit 85e7b57 into main Apr 24, 2026
8 of 10 checks passed
@owens1127 owens1127 deleted the copilot/fix-broker-connection-recovery branch April 24, 2026 05:17
owens1127 added a commit that referenced this pull request Apr 24, 2026
3 participants