Database Reliability Fundamentals¶
DBRE Core Responsibilities [B]¶
- Availability — databases are up and serving queries
- Performance — queries run within acceptable latency
- Durability — data is not lost
- Correctness — data is consistent and trustworthy
- Operability — DBs can be managed, monitored, migrated safely
Database SLOs [B]¶
Databases need their own SLOs, separate from application SLOs:
| SLI | Example Target |
|---|---|
| Availability | 99.95% uptime (22 min/month budget) |
| Query latency (p99) | < 100ms for OLTP queries |
| Replication lag | < 5 seconds |
| Backup success rate | 100% (every backup must succeed) |
| Recovery time | RTO < 4 hours |
→ See SRE: SLOs / SLIs / SLAs
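The availability target maps directly to a downtime budget. A quick sanity check of the table's numbers (plain Python, assuming a 30-day month):

```python
def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Minutes of allowed downtime per period for a given availability SLO."""
    return (1 - availability) * days * 24 * 60

# 99.95% over a 30-day month -> 21.6 minutes (the "22 min/month" above)
print(round(downtime_budget_minutes(0.9995), 1))
```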
Key Metrics to Monitor [B]¶
PostgreSQL¶
-- Active connections
SELECT count(*), state FROM pg_stat_activity GROUP BY state;
-- Long-running queries (exclude idle sessions, whose query_start is stale)
SELECT pid, now() - query_start AS duration, query, state
FROM pg_stat_activity
WHERE state <> 'idle'
  AND (now() - query_start) > interval '5 minutes'
ORDER BY duration DESC;
-- Table bloat / dead rows
SELECT relname, n_dead_tup, n_live_tup,
round(n_dead_tup::numeric/nullif(n_live_tup,0)*100, 2) AS dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
-- Index usage (least-scanned first; idx_scan = 0 means never used)
SELECT schemaname, tablename, indexname,
idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;
-- Replication lag (run on a replica)
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
MySQL / Aurora¶
-- Show running queries
SHOW PROCESSLIST;
SHOW FULL PROCESSLIST;
-- InnoDB status (locks, transactions)
SHOW ENGINE INNODB STATUS\G
-- Queries running longer than 30s (ignore sleeping connections)
SELECT * FROM information_schema.processlist
WHERE command <> 'Sleep' AND time > 30
ORDER BY time DESC;
Key Metrics Dashboard¶
| Metric | Alert threshold | Tool |
|---|---|---|
| Connection count | > 80% of max_connections | Prometheus pg_exporter |
| Query p99 latency | > 500ms | APM / slow query log |
| Replication lag | > 30s | pg_exporter, CloudWatch |
| CPU utilization | > 80% sustained | CloudWatch, Datadog |
| Disk usage | > 75% | Node exporter |
| Vacuum last run | > 24 hours on busy tables | pg_stat_user_tables |
| Lock waits | Any > 30 seconds | pg_locks |
Connection Pooling [B]¶
Problem: Each DB connection costs memory (~5-10MB for PostgreSQL). Applications often create hundreds of connections.
Solution: Connection pool — a small set of persistent connections shared across many app instances.
PgBouncer (PostgreSQL)¶
# pgbouncer.ini
[databases]
mydb = host=postgres-primary port=5432 dbname=mydb
[pgbouncer]
# pool modes: transaction | session | statement
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
max_db_connections = 100
Pool modes:
- transaction — connection returned to pool after each transaction (most efficient)
- session — connection held for entire session (safer, less efficient)
- statement — connection returned after each statement (breaks transactions)
Use transaction mode unless you use session-level features (temp tables, advisory locks, SET commands).
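The borrow/return mechanics can be sketched in a few lines. This is a toy stand-in for PgBouncer's transaction mode, not its API; sqlite3 stands in for PostgreSQL, and `ConnectionPool` is an illustrative name:

```python
import queue
import sqlite3
from contextlib import contextmanager

class ConnectionPool:
    """Toy fixed-size pool (illustration only, not PgBouncer's API)."""
    def __init__(self, size, factory):
        self._conns = queue.Queue(maxsize=size)
        for _ in range(size):
            self._conns.put(factory())

    @contextmanager
    def connection(self):
        conn = self._conns.get()   # blocks while all connections are checked out
        try:
            yield conn
            conn.commit()          # transaction boundary: work is committed...
        except Exception:
            conn.rollback()
            raise
        finally:
            self._conns.put(conn)  # ...and the connection goes back to the pool

# 2 pooled connections can serve any number of callers, one transaction at a time
pool = ConnectionPool(2, lambda: sqlite3.connect(":memory:", check_same_thread=False))
with pool.connection() as conn:
    print(conn.execute("SELECT 1").fetchone()[0])  # 1
```

The point of transaction mode is visible in `connection()`: the caller only owns the connection between checkout and commit/rollback, so 20 real connections can back 1000 client sessions.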
ProxySQL (MySQL)¶
- Connection multiplexing
- Query routing (read/write split)
- Query rules and caching
- Health checks and failover
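At its core, the read/write split is routing by statement type. A deliberately naive sketch (real ProxySQL matches configurable query rules/digests, and cases like SELECT ... FOR UPDATE or read-your-writes consistency make production routing more subtle):

```python
def route(sql: str) -> str:
    """Naive read/write split: reads go to a replica, everything else
    to the primary. Illustration only, not ProxySQL's rule engine."""
    verb = sql.lstrip().split(None, 1)[0].upper()
    return "replica" if verb in ("SELECT", "SHOW") else "primary"

print(route("SELECT id FROM orders"))          # replica
print(route("UPDATE orders SET status = 'x'")) # primary
```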
Locks & Deadlocks [I]¶
Detecting Locks (PostgreSQL)¶
-- Who is waiting for locks?
SELECT
blocking.pid AS blocking_pid,
blocked.pid AS blocked_pid,
blocking.query AS blocking_query,
blocked.query AS blocked_query,
blocked.wait_event_type,
blocked.wait_event
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock';
Killing a Blocking Query¶
-- Cancel the current query at a safe point (connection stays open)
SELECT pg_cancel_backend(blocking_pid);
-- Terminate the whole backend/connection (use carefully)
SELECT pg_terminate_backend(blocking_pid);
Deadlock Prevention¶
- Acquire locks in consistent order (always lock table A before table B)
- Keep transactions short
- Use NOWAIT or SKIP LOCKED for queue processing
- Avoid long transactions that hold locks
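Consistent lock ordering is the workhorse prevention technique. A minimal threading sketch, with Python locks standing in for row/table locks:

```python
import threading

def locked_pair(first, second):
    """Acquire two locks in one global order (here: by id()), so no two
    threads can ever wait on each other in a cycle."""
    a, b = sorted((first, second), key=id)
    with a, b:
        pass  # ... work touching both resources ...

x, y = threading.Lock(), threading.Lock()
# With naive acquisition, opposite argument orders would deadlock;
# sorted acquisition makes both threads grab the locks in the same order.
t1 = threading.Thread(target=lambda: [locked_pair(x, y) for _ in range(1000)])
t2 = threading.Thread(target=lambda: [locked_pair(y, x) for _ in range(1000)])
t1.start(); t2.start(); t1.join(); t2.join()
print("no deadlock")
```

The same idea in SQL is "always lock table A before table B": sort the rows or tables you intend to lock and acquire in that order everywhere.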
ACID Properties [B]¶
| Property | Meaning |
|---|---|
| Atomicity | Transaction is all-or-nothing |
| Consistency | DB moves from one valid state to another |
| Isolation | Concurrent transactions don't see each other's partial work |
| Durability | Committed data persists even after crash |
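Atomicity is easy to demonstrate with sqlite3 from the standard library (standing in for any ACID engine): a failure mid-transaction rolls back the partial debit along with everything else.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")
conn.commit()

try:
    with conn:  # transaction: commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
        raise RuntimeError("crash before the matching credit")
except RuntimeError:
    pass

# All-or-nothing: the lone debit never became visible
print(conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0])  # 100
```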
Isolation Levels¶
| Level | Prevents | Allows |
|---|---|---|
| Read Uncommitted | — | Dirty reads, non-repeatable reads, phantoms |
| Read Committed | Dirty reads | Non-repeatable reads, phantoms |
| Repeatable Read | Dirty + non-repeatable reads | Phantoms |
| Serializable | All anomalies | — (lowest throughput) |
PostgreSQL default: Read Committed. Use Repeatable Read or Serializable for financial/critical data.
Percona Toolkit [I]¶
Percona Toolkit provides battle-tested command-line tools for MySQL operations (the examples below are all MySQL-focused).
# Check for duplicate indexes
pt-duplicate-key-checker --host=db.example.com
# Analyze slow query log
pt-query-digest /var/log/mysql/slow.log
# Online schema change (zero-downtime ALTER TABLE for MySQL)
pt-online-schema-change \
--alter "ADD INDEX idx_created_at (created_at)" \
D=mydb,t=orders
# Table checksum (verify replica data matches primary)
pt-table-checksum --host=primary.db.example.com