Database Observability¶

The Monitoring Stack `[B]`¶

MySQL / PostgreSQL
       │
       ▼
  Exporter (mysqld_exporter / postgres_exporter)
       │  scrapes every 15s
       ▼
  Prometheus  ──────► Alertmanager ──────► PagerDuty / Slack
       │
       ▼
   Grafana

For RDS/Aurora replace the exporter with CloudWatch — alert structure is the same.

MySQL: mysqld_exporter `[B]`¶

docker run -d \
  -e DATA_SOURCE_NAME="monitoring:password@(localhost:3306)/" \
  -p 9104:9104 \
  prom/mysqld-exporter \
  --collect.info_schema.processlist \
  --collect.info_schema.innodb_metrics \
  --collect.slave_status \
  --collect.binlog_size \
  --no-collect.info_schema.tables   # noisy on large schemas — disable

# prometheus.yml
scrape_configs:
  - job_name: mysql
    static_configs:
      - targets: ['mysqld-exporter:9104']
        labels:
          cluster: r2d2
          role: primary        # label replicas separately
    scrape_interval: 15s

Monitoring user:

CREATE USER 'monitoring'@'%' IDENTIFIED BY 'use-secrets-manager';
GRANT PROCESS, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'monitoring'@'%';
GRANT SELECT ON performance_schema.* TO 'monitoring'@'%';
GRANT SELECT ON sys.* TO 'monitoring'@'%';
FLUSH PRIVILEGES;

Key metrics:

Metric	What it measures
`mysql_up`	Is the instance reachable
`mysql_global_status_threads_connected`	Current connections
`mysql_global_variables_max_connections`	Connection limit
`mysql_global_status_threads_running`	Active queries right now
`mysql_global_status_slow_queries`	Cumulative slow query count
`mysql_global_status_innodb_row_lock_waits`	InnoDB row lock contention
`mysql_global_status_innodb_row_lock_time_avg`	Average lock wait time (ms)
`mysql_slave_status_seconds_behind_master`	Replication lag (replica only)
`mysql_slave_status_slave_io_running`	IO thread running (0 = broken)
`mysql_slave_status_slave_sql_running`	SQL thread running (0 = broken)
`mysql_global_status_binlog_size`	Binary log disk usage
`mysql_global_status_innodb_buffer_pool_read_requests`	Buffer pool reads
`mysql_global_status_innodb_buffer_pool_reads`	Disk reads (cache miss)

MySQL: Alert Hierarchy `[B]`¶

Tier 1 — Page Immediately¶

Alert	Condition	Why
MySQL down	`mysql_up == 0` for > 1 min	Database is down
Replication IO thread stopped	`mysql_slave_status_slave_io_running == 0`	Replica not receiving binlog
Replication SQL thread stopped	`mysql_slave_status_slave_sql_running == 0`	Replica not applying events
Replication lag critical	`mysql_slave_status_seconds_behind_master > 300`	Replica 5+ min behind, unusable for failover
Disk > 90%	`node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10`	MySQL will crash when binlog fills disk
Connections exhausted	`mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.95`	New connections rejected
Backup missing	Last backup timestamp > 25h	RPO breach risk

Tier 2 — Investigate Within 1 Hour¶

Alert	Condition	Why
Replication lag high	`mysql_slave_status_seconds_behind_master > 30`	Replica diverging
Replication lag growing	`deriv(mysql_slave_status_seconds_behind_master[10m]) > 0` AND lag > 60s	Replica falling further behind
High threads running	`mysql_global_status_threads_running > 20` sustained > 5 min	Query pile-up, possible lock contention
Lock wait spike	`rate(mysql_global_status_innodb_row_lock_waits[5m]) > 10`	Competing transactions
Long lock wait	`mysql_global_status_innodb_row_lock_time_avg > 5000` (ms)	Queries blocked for > 5s on average
Slow query rate rising	`rate(mysql_global_status_slow_queries[5m]) > 1`	New slow queries appearing
CPU > 80% sustained	> 10 min	Saturation approaching
Disk > 80%		Trending toward Tier 1
Connection utilization > 80%		Trending toward Tier 1

Tier 3 — Ticket Within 24 Hours¶

Alert	Condition	Why
Buffer pool hit ratio degrading	< 95%	Working set no longer fits in memory
Binlog disk usage > 50%		binlog retention eating disk
Slow query count week-over-week +20%		Schema or data distribution change
Replica count below expected		Reduced failover coverage

Tier 4 — Weekly Review¶

Signal	Why
Backup size growth rate	Growing faster than data?
`Aborted_connects` rising	Connection errors, auth failures
`Table_locks_waited` increasing	MyISAM table lock contention (should be near zero on InnoDB)

MySQL: PromQL Reference `[I]`¶

Availability and Connections¶

# Is MySQL up?
mysql_up

# Connection utilization %
mysql_global_status_threads_connected
  / mysql_global_variables_max_connections * 100

# Active queries right now (threads_running spikes = pile-up)
mysql_global_status_threads_running

# Connections per second (connection churn)
rate(mysql_global_status_connections[5m])

Replication¶

# Replication lag in seconds (run on replica)
mysql_slave_status_seconds_behind_master

# Alert: lag > 30s
mysql_slave_status_seconds_behind_master > 30

# Alert: lag growing monotonically (replica stuck)
deriv(mysql_slave_status_seconds_behind_master[10m]) > 0
  AND mysql_slave_status_seconds_behind_master > 60

# Both threads must be running (0 = problem)
mysql_slave_status_slave_io_running == 0   # IO thread broken
mysql_slave_status_slave_sql_running == 0  # SQL thread broken

# Compare primary and replica positions (bytes behind)
mysql_slave_status_master_log_pos - mysql_slave_status_read_master_log_pos

Query Performance¶

# Queries per second
rate(mysql_global_status_queries[5m])

# Slow queries per second
rate(mysql_global_status_slow_queries[5m])

# Slow query ratio
rate(mysql_global_status_slow_queries[5m])
  / rate(mysql_global_status_queries[5m])

# Transaction commits/rollbacks per second
rate(mysql_global_status_handlers_total{handler="commit"}[5m])
rate(mysql_global_status_handlers_total{handler="rollback"}[5m])

InnoDB and Locks¶

# Buffer pool hit ratio (target > 99%)
1 - (
  rate(mysql_global_status_innodb_buffer_pool_reads[5m])
  / rate(mysql_global_status_innodb_buffer_pool_read_requests[5m])
)

# Row lock waits per second
rate(mysql_global_status_innodb_row_lock_waits[5m])

# Average row lock wait time (ms) — alert if > 1000ms
mysql_global_status_innodb_row_lock_time_avg

# InnoDB dirty pages ratio
mysql_global_status_innodb_buffer_pool_pages_dirty
  / mysql_global_status_innodb_buffer_pool_pages_total

Disk¶

# Disk usage % on /var/lib/mysql
1 - (
  node_filesystem_avail_bytes{mountpoint="/var/lib/mysql"}
  / node_filesystem_size_bytes{mountpoint="/var/lib/mysql"}
)

# Binlog size (bytes)
mysql_global_status_binlog_size

MySQL: Alerting Rules `[I]`¶

# prometheus/alerts/mysql.yml
groups:
  - name: mysql_availability
    rules:
      - alert: MySQLDown
        expr: mysql_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "MySQL down on {{ $labels.instance }}"
          runbook: "wiki.internal/dbre/runbooks/mysql-down"

      - alert: MySQLConnectionsExhausted
        expr: >
          mysql_global_status_threads_connected
          / mysql_global_variables_max_connections > 0.90
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "MySQL connections at {{ $value | humanizePercentage }} on {{ $labels.instance }}"

  - name: mysql_replication
    rules:
      - alert: MySQLReplicationIOThreadDown
        expr: mysql_slave_status_slave_io_running == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Replication IO thread stopped on {{ $labels.instance }}"
          runbook: "wiki.internal/dbre/runbooks/replication-broken"

      - alert: MySQLReplicationSQLThreadDown
        expr: mysql_slave_status_slave_sql_running == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Replication SQL thread stopped on {{ $labels.instance }}"

      - alert: MySQLReplicationLagHigh
        expr: mysql_slave_status_seconds_behind_master > 30
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag {{ $value }}s on {{ $labels.instance }}"

      - alert: MySQLReplicationLagCritical
        expr: mysql_slave_status_seconds_behind_master > 300
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Replication lag CRITICAL {{ $value }}s — replica unusable for failover"

      - alert: MySQLReplicaStuck
        expr: >
          deriv(mysql_slave_status_seconds_behind_master[10m]) > 0
          AND mysql_slave_status_seconds_behind_master > 120
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Replica falling further behind — may be stuck"

MySQL: Diagnosing Alerts `[I]`¶

MySQL down¶

systemctl status mysql
journalctl -u mysql -n 50

# Check mount — common cause after instance reset
lsblk
cat /var/lib/mysql/mysql-bin.index   # paths must show /var/lib/mysql/

Replication broken¶

-- Check both thread states and last error
SHOW SLAVE STATUS\G

-- Key fields to check:
-- Slave_IO_Running: Yes
-- Slave_SQL_Running: Yes
-- Seconds_Behind_Master: <number>
-- Last_IO_Error: (empty = good)
-- Last_SQL_Error: (empty = good)

-- If SQL thread stopped due to error, skip one event (use carefully):
STOP SLAVE; SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; START SLAVE;

High threads_running / lock waits¶

-- What is running right now?
SELECT id, user, host, db, time, state, LEFT(info, 100) AS query
FROM information_schema.processlist
WHERE command != 'Sleep'
ORDER BY time DESC;

-- Who is blocking who?
SELECT
  r.trx_id AS waiting_trx,
  r.trx_mysql_thread_id AS waiting_thread,
  b.trx_id AS blocking_trx,
  b.trx_mysql_thread_id AS blocking_thread,
  b.trx_query AS blocking_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;

-- Kill a blocking query
KILL <blocking_thread_id>;

Slow queries¶

-- Top slow queries (requires slow_query_log = ON or performance_schema)
SELECT
  digest_text,
  count_star AS calls,
  round(avg_timer_wait / 1e9, 2) AS avg_ms,
  round(sum_timer_wait / 1e9, 2) AS total_ms
FROM performance_schema.events_statements_summary_by_digest
ORDER BY sum_timer_wait DESC
LIMIT 10;

Grafana Dashboard Structure `[I]`¶

Dashboard 1: MySQL Overview¶

Row 1 — Health
  [Stat] mysql_up    [Stat] Replica lag (max)    [Stat] Threads running

Row 2 — Connections
  [Graph] threads_connected over time
  [Gauge] Connection utilization %

Row 3 — Throughput
  [Graph] Queries/sec    [Graph] Slow queries/sec
  [Graph] Commits/sec    [Graph] Rollbacks/sec

Row 4 — InnoDB
  [Graph] Buffer pool hit ratio (target > 99%)
  [Graph] Row lock waits/sec    [Stat] Avg lock wait ms

Row 5 — Replication
  [Graph] Seconds_behind_master per replica
  [Stat] IO thread    [Stat] SQL thread

Row 6 — Disk
  [Graph] /var/lib/mysql disk usage %
  [Graph] Binlog size

Dashboard 2: Replication Health¶

Row 1 — [Graph] Per-replica lag (one line per replica)
Row 2 — [Stat] IO thread running per replica
Row 3 — [Stat] SQL thread running per replica
Row 4 — [Graph] Replica queries/sec (read load distribution)

PostgreSQL: Key Differences `[B]`¶

For PostgreSQL, use postgres_exporter instead. Core differences from MySQL monitoring:

Concern	MySQL metric	PostgreSQL metric
DB up	`mysql_up`	`pg_up`
Connections	`mysql_global_status_threads_connected`	`pg_stat_activity_count`
Replication lag	`mysql_slave_status_seconds_behind_master`	`pg_replication_lag`
Lock waits	`mysql_global_status_innodb_row_lock_waits`	`pg_locks_count{granted="false"}`
Cache hit	InnoDB buffer pool hit ratio	`pg_stat_database_blks_hit / (blks_hit + blks_read)`
Slow queries	`performance_schema`	`pg_stat_statements` (extension)

PostgreSQL-specific: watch n_dead_tup (bloat) and autovacuum lag — no MySQL equivalent.

Tool Comparison `[B]`¶

Tool	Best for	Limitations
Prometheus + Grafana	Self-managed, EC2-hosted, full control	Setup overhead
CloudWatch	RDS / Aurora, zero setup	Limited granularity, cost at scale
Datadog	Unified app + DB + infra	Cost, vendor lock-in
RDS Performance Insights	Deep query analysis on RDS/Aurora	RDS-only
PMM (Percona Monitoring)	MySQL/PostgreSQL, open-source, rich DB UI	Self-hosted overhead
`pt-query-digest`	Offline slow log analysis	Not real-time
`mytop` / `innotop`	Live MySQL process monitor (like `top`)	Terminal only, no history

PMM (Percona Monitoring and Management) is worth considering for MySQL-heavy environments — it wraps mysqld_exporter + Grafana dashboards + query analytics in one self-hosted package.

Fundamentals — raw monitoring SQL
HA & Failover — replication topology and failover monitoring
Backup & Recovery — backup success alerting
SRE: Observability — application-level observability
percona-toolkit — pt-query-digest, pt-slave-delay