Problem Summary
My automated PostgreSQL backup script using pg_dump was working fine until a few days ago. Now it consistently fails with a connection error during the COPY operation, and the table it fails on alternates between runs.
Environment Details
- PostgreSQL Version: 9.6.24
- Client: pg_dump run via the Docker image abc/postgres:9.6.24_0.3
- Database Server: fossology.com:1234
- OS: Ubuntu Linux (backup server)
- Network: Corporate network with potential firewalls/proxies
Error Message
pg_dump: Dumping the contents of table "author" failed: PQgetCopyData() failed.
pg_dump: Error message from server: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
pg_dump: The command was: COPY public.author (author_pk, agent_fk, pfile_fk, content, hash, type, copy_startbyte, copy_endbyte, is_enabled) TO stdout;
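If you have access to the database server, its log around the failure timestamp is the most direct clue to why the backend closed the connection (OOM kill, timeout, crash, etc.). A minimal sketch, assuming a stock 9.6 log location; the path is an assumption, check SHOW log_directory / log_filename for the real one:

# Assumed server-side log path -- adjust to your installation.
grep -iE 'terminat|closed the connection|out of memory|timeout|FATAL|PANIC' \
    /var/log/postgresql/postgresql-9.6-main.log | tail -n 50
# Also worth checking whether the kernel OOM killer hit a backend:
dmesg -T | grep -i 'killed process' | tail -n 20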
Current Backup Script
#!/bin/bash
# pg_dump backup run from a throwaway Docker container; the dump and the
# error log are written to the mounted backup path.
backup_path=/fossology_backup/pg_dumps
docker_image=abc/postgres:9.6.24_0.3
backup_file=fy_03.$(date +%Y%m%d_%H%M).gz
error_file=fy_03.$(date +%Y%m%d_%H%M).err
db_host=fossology.com
db_user=fly
db_name=fy
db_port=1234

docker run --rm -v "$backup_path:$backup_path" -v /root/.pgpass:/root/.pgpass "$docker_image" \
  pg_dump -h "$db_host" -U "$db_user" -d "$db_name" -p "$db_port" -F c \
  -f "$backup_path/$backup_file" 2> "$backup_path/$error_file"
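One thing that may be worth trying, on the assumption that a corporate firewall or proxy is silently dropping the long-lived dump connection: enable libpq TCP keepalives. pg_dump accepts a full connection string in place of the separate host/port/user flags, so the keepalive parameters can be passed there. A sketch; the keepalive values are guesses to be tuned for your network:

# Same dump, but with libpq TCP keepalives enabled via a connection string.
# keepalives_idle/interval/count below are assumptions, not recommendations.
docker run --rm -v "$backup_path:$backup_path" -v /root/.pgpass:/root/.pgpass "$docker_image" \
  pg_dump "host=$db_host port=$db_port user=$db_user dbname=$db_name keepalives=1 keepalives_idle=30 keepalives_interval=10 keepalives_count=5" \
  -F c -f "$backup_path/$backup_file" 2> "$backup_path/$error_file"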
What I've Tried
- Connection testing: basic psql connections work fine (a COPY-level check is sketched after this list)
- Retry logic: added retry mechanisms with delays; it still fails consistently
- Different pg_dump options: tried other formats (-F p, -F d), verbose mode, and different timeout settings
- Database maintenance: attempted VACUUM FULL
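To help separate a corrupted table from a network/timeout problem, the failing COPY can be reproduced without pg_dump. A minimal sketch using the table from the error above; run it once over the corporate network and once from a host close to the server, discarding the output:

# Re-run the same COPY that pg_dump issues, throwing the data away.
# Failing from everywhere points at the table/server; failing only over
# the corporate network points at the network path.
psql -h fossology.com -p 1234 -U fly -d fy \
  -c '\copy public.author TO /dev/null'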
Additional Observations
- The backup worked perfectly until 3-4 days ago
- No recent changes to the backup script or environment
- The failure does not always hit the same table; it alternates between tables
- Taking the backup from inside the Docker container worked
- The backup is written to an NFS mount (see the sketch after this list)
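Because the dump lands on NFS, a stall on the mount is also worth ruling out; if pg_dump blocks on a slow write, the server-side COPY can sit idle long enough for something in between to kill the session. A sketch that writes to local disk first and only then moves the file to the share (/tmp/pg_dump_local is an assumed scratch directory, not part of the original setup):

# Dump to local disk first, then move the result onto the NFS share.
local_tmp=/tmp/pg_dump_local   # assumed local, non-NFS scratch directory
mkdir -p "$local_tmp"
docker run --rm -v "$local_tmp:$local_tmp" -v /root/.pgpass:/root/.pgpass "$docker_image" \
  pg_dump -h "$db_host" -U "$db_user" -d "$db_name" -p "$db_port" -F c \
  -f "$local_tmp/$backup_file" 2> "$local_tmp/$error_file" \
  && mv "$local_tmp/$backup_file" "$local_tmp/$error_file" "$backup_path"/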
Questions
- What could cause PostgreSQL to suddenly start dropping connections during COPY operations?
- Are there any PostgreSQL server-side settings or limits that might cause this behavior?
- How can I diagnose if this is a table corruption issue vs. a network/timeout issue?
- What alternative backup strategies would work around this specific issue?
Any insights into potential root causes or alternative approaches would be greatly appreciated!