Backup & Recovery¶
Backup Fundamentals [B]¶
Key Concepts¶
| Term | Definition |
|---|---|
| RPO | Recovery Point Objective — max acceptable data loss (e.g., 1 hour means you can lose up to 1 hour of data) |
| RTO | Recovery Time Objective — max acceptable downtime during recovery |
| PITR | Point-in-Time Recovery — restore to any specific moment |
| WAL | Write-Ahead Log — PostgreSQL transaction log, enables PITR |
| Full backup | Complete copy of the database |
| Incremental backup | Only changes since last backup |
| Logical backup | SQL dump (pg_dump), portable, slow for large DBs |
| Physical backup | Binary copy of data files, fast, DB-version specific |
Backup Strategy by Tier¶
| Tier | Example | RPO | RTO | Strategy |
|---|---|---|---|---|
| Critical | Payment DB, user accounts | Near-zero | < 1 hour | Continuous WAL archival + PITR |
| Standard | Product catalog, orders | 1-4 hours | < 4 hours | Hourly snapshots + daily full |
| Low | Analytics, logs | 24 hours | < 24 hours | Daily backup |
PostgreSQL Backup Methods [I]¶
pg_dump (Logical)¶
# Single database dump (compressed)
pg_dump -h localhost -U postgres -d mydb -F c -f mydb_backup.dump
# Plain SQL format (human-readable, larger)
pg_dump -h localhost -U postgres -d mydb -F p -f mydb_backup.sql
# Schema only
pg_dump --schema-only -d mydb -f schema.sql
# Specific table
pg_dump -t orders -d mydb -f orders_backup.dump
# Restore
pg_restore -h localhost -U postgres -d mydb_new -F c mydb_backup.dump
# For plain SQL:
psql -h localhost -U postgres -d mydb_new < mydb_backup.sql
Limitations:
- pg_dump takes a consistent snapshot but is slow for large DBs
- Restore is sequential (slow for TB-scale)
- Use pgBackRest or Barman for production
pgBackRest (Physical, Production-Grade)¶
# /etc/pgbackrest/pgbackrest.conf
[global]
repo1-path=/backup/pgbackrest
repo1-retention-full=7
repo1-retention-diff=4
[mydb]
pg1-path=/var/lib/postgresql/14/main
pg1-host=postgres-primary
# Full backup
pgbackrest --stanza=mydb backup --type=full
# Differential backup
pgbackrest --stanza=mydb backup --type=diff
# Incremental backup
pgbackrest --stanza=mydb backup --type=incr
# List backups
pgbackrest --stanza=mydb info
# Restore to specific time (PITR)
pgbackrest --stanza=mydb restore \
--target="2024-01-15 14:30:00" \
--target-action=promote
# Verify backup integrity
pgbackrest --stanza=mydb check
Continuous WAL Archival¶
# postgresql.conf — enable WAL archiving
wal_level = replica
archive_mode = on
archive_command = 'pgbackrest --stanza=mydb archive-push %p'
This enables PITR to any point in time, not just when backups were taken.
AWS RDS / Aurora Backups [I]¶
Automated Backups¶
resource "aws_db_instance" "main" {
# ...
backup_retention_period = 30 # days
backup_window = "03:00-04:00"
delete_automated_backups = false
deletion_protection = true # prevent accidental deletion
}
RDS PITR: AWS stores transaction logs continuously. Restore to any second within the retention window.
# Restore to specific time via CLI
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier prod-db \
--target-db-instance-identifier prod-db-restored \
--restore-time "2024-01-15T14:30:00Z"
Manual Snapshots¶
# Create snapshot before risky operations
aws rds create-db-snapshot \
--db-instance-identifier prod-db \
--db-snapshot-identifier pre-migration-20240115
# List snapshots
aws rds describe-db-snapshots \
--db-instance-identifier prod-db
# Copy snapshot to another region (DR)
aws rds copy-db-snapshot \
--source-db-snapshot-identifier arn:aws:rds:us-east-1:123:snapshot:pre-migration \
--target-db-snapshot-identifier pre-migration-dr \
--region us-west-2
Backup Verification [I]¶
A backup that hasn't been tested is not a backup.
Verification checklist: - [ ] Backup completes successfully (alert on failure) - [ ] Backup size is within expected range (alert on anomaly) - [ ] Weekly restore test to isolated environment - [ ] Verify data integrity after restore (row counts, checksums) - [ ] Document and test the end-to-end recovery procedure
# pgBackRest integrity check
pgbackrest --stanza=mydb check
# After restore, verify data
psql -d restored_db -c "SELECT COUNT(*) FROM orders;"
psql -d restored_db -c "SELECT MAX(created_at) FROM orders;"
# Compare with production
diff <(psql -d prod_db -c "\dt" | sort) \
<(psql -d restored_db -c "\dt" | sort)
Disaster Recovery Playbook [A]¶
Scenario 1: Accidental Data Delete / Table Drop¶
# Immediately: stop writes to the DB or put app in maintenance mode
# (Prevents more data from being written, making recovery window cleaner)
# Option A: PITR to just before the incident
pgbackrest --stanza=mydb restore \
--target="2024-01-15 13:58:00" \ # 2 minutes before drop
--target-action=promote \
--db-path=/var/lib/postgresql/recovered
# Option B: Selective restore of just the table
# 1. Restore to a separate server
# 2. pg_dump just the affected table
# 3. pg_restore into production
# Option C: Extract from logical backup
pg_restore -t orders -d prod_db orders_backup.dump
# After recovery: run postmortem, add DROP TABLE protection
REVOKE DROP ON TABLE orders FROM app_user;
Scenario 2: Primary DB Failure¶
Primary fails
↓
Replica promoted to primary (manual or automatic)
↓
Application connection string updated (DNS failover or PgBouncer reconfigure)
↓
New replica provisioned to replace old primary
↓
Postmortem on root cause
With AWS RDS Multi-AZ: automatic failover in 60-120 seconds. With Aurora: automatic failover in ~30 seconds.
Scenario 3: Corruption / Bad Migration¶
# Take immediate snapshot of current state (even if corrupted — for analysis)
aws rds create-db-snapshot --db-instance-identifier prod-db \
--db-snapshot-identifier corruption-incident-20240115
# Restore to pre-migration state
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier prod-db \
--target-db-instance-identifier prod-db-restored \
--restore-time "2024-01-15T09:00:00Z" # before migration ran
Backup Monitoring [I]¶
# Prometheus alert rules
- alert: BackupFailed
expr: pgbackrest_backup_error == 1
for: 5m
labels:
severity: critical
annotations:
summary: "Database backup failed"
runbook: "https://wiki/dbre/backup-failure"
- alert: BackupTooOld
expr: time() - pgbackrest_backup_timestamp_last_full > 86400 # 24 hours
labels:
severity: warning
annotations:
summary: "No successful full backup in 24 hours"
Related Topics¶
- Fundamentals — RTO/RPO in context of DB SLOs
- Migrations & Schema Changes — always backup before migrations
- Scaling Databases — replicas and HA
- SRE: Incident Management — recovery as incident
- Platform: Cloud Infrastructure — RDS, snapshots, EBS
- percona-toolkit — pt-table-checksum for data verification