I have set up logical replication from a PostgreSQL 14 instance running on an Azure VM to a PostgreSQL 16 Flexible Server in Azure.
Here’s the setup and process I followed:
Source (VM): PostgreSQL 14, 4 cores, 16GB RAM
Replica (Flexible Server): PostgreSQL 16, 2 cores, 8GB RAM, P6 (240 IOPS) tier
Set max_replication_workers to 10 on the main DB and 8 on the replica
Created two publications on the source:
One publication includes 3 tables
The other publication includes 4 tables
Created two logical replication slots on the main DB (one for each publication)
Took a data dump of these tables and restored them on the replica to ensure identical data
Created two subscriptions on the replica, using the respective replication slots and without initial sync
Replication works fine initially — data changes from the source reflect correctly on the replica.
However, after about a week, replication breaks. At that point:
Refreshing the publication or toggling the subscriptions doesn’t help
The replication cannot be restarted
The only way to fix it is to start over from the dump and reinitialize the replication setup
All replication parameters are fine-tuned, and both servers are on stable connections.
My question:
Why does this replication consistently break after a week? Is this due to version mismatch (PostgreSQL 14 → 16), WAL retention, slot corruption, or something else specific to Azure Flexible Server?
What could be the best way to ensure long-term stability for logical replication between these two servers?