Skip to content

HA & Failover

← DBRE Home | ← Main


HA Architecture Patterns [B]

Three tiers — choose based on your RTO requirement:

Tier 1: No HA (single node)
  Primary ──► application
  Failure: manual restart, minutes to hours of downtime

Tier 2: Warm standby (replica + manual promotion)
  Primary ──► Replica
  Failure: DBRE promotes replica manually, 5–30 min RTO

Tier 3: Automated failover (Orchestrator / MHA / cloud-managed)
  Primary ──► Replica(s) ──► Orchestrator monitors
  Failure: auto-promotion in 30–120 seconds, app reconnects via ProxySQL/VIP

For most production MySQL workloads: Tier 3 with ProxySQL.


Connection Routing: ProxySQL [B]

Applications should never connect directly to the primary IP. ProxySQL sits between the app and MySQL — it knows the current primary and routes writes/reads automatically.

App pods ──► ProxySQL (port 6033) ──► Primary (hostgroup 10)
                                  └──► Replicas (hostgroup 20)
# /etc/proxysql.cnf — minimal config
datadir="/var/lib/proxysql"

mysql_servers =
(
  { address="db-primary", port=3306, hostgroup=10, max_connections=100 },
  { address="db-replica-1", port=3306, hostgroup=20, max_connections=100 },
  { address="db-replica-2", port=3306, hostgroup=20, max_connections=100 }
)

mysql_replication_hostgroups =
(
  { writer_hostgroup=10, reader_hostgroup=20, comment="r2d2" }
)

mysql_query_rules =
(
  { rule_id=1, active=1, match_pattern="^SELECT", destination_hostgroup=20, apply=1 },
  { rule_id=2, active=1, match_pattern=".*",       destination_hostgroup=10, apply=1 }
)

mysql_users =
(
  { username="app", password="...", default_hostgroup=10 }
)

ProxySQL monitors SHOW SLAVE STATUS on all backends every few seconds. On master change, it automatically shifts writes to the new primary — applications keep the same connection string and do not need to reconnect.


Automated Failover: MySQL Orchestrator [I]

Orchestrator is the standard open-source tool for MySQL HA. It monitors replication topology, detects primary failure, and promotes the most up-to-date replica.

# Install
curl -L https://github.com/openark/orchestrator/releases/latest/download/orchestrator-*.tar.gz | tar xz

# Key config: /etc/orchestrator/orchestrator.conf.json
{
  "MySQLTopologyUser": "orchestrator",
  "MySQLTopologyPassword": "use-secrets-manager",
  "DetectClusterAliasQuery": "SELECT 'r2d2'",
  "FailoverOnPromotionViaSlaveLagSeconds": 30,
  "RecoveryPeriodBlockSeconds": 3600,
  "OnFailureDetectionProcesses": [
    "echo 'Failure detected on {failedHost}' | slack-notify"
  ],
  "PostFailoverProcesses": [
    "echo 'Promoted {successorHost} as new primary' | slack-notify"
  ]
}
-- Orchestrator monitoring user (on each MySQL node)
CREATE USER 'orchestrator'@'%' IDENTIFIED BY 'use-secrets-manager';
GRANT SUPER, PROCESS, REPLICATION SLAVE, RELOAD ON *.* TO 'orchestrator'@'%';
GRANT SELECT ON mysql.slave_master_info TO 'orchestrator'@'%';

Orchestrator topology discovery:

# Discover a cluster
orchestrator-client -c discover -i db-primary:3306

# Show current topology
orchestrator-client -c topology -i r2d2

# Output:
# db-primary:3306 [OK,5.7.38,rw,ROW] 0s lag
#   + db-replica-1:3306 [OK,5.7.38,ro,ROW] 1s lag
#   + db-replica-2:3306 [OK,5.7.38,ro,ROW] 2s lag

Manual failover (when automated failover is disabled or you want control):

# Graceful: promote specific replica (primary still alive — maintenance)
orchestrator-client -c graceful-master-takeover-auto -i r2d2 -d db-replica-1:3306

# Forced: primary is dead, promote best replica
orchestrator-client -c recover -i db-primary:3306

Cloud-Managed Failover: RDS / Aurora [B]

# RDS Multi-AZ — automated failover in ~60s
resource "aws_db_instance" "primary" {
  multi_az            = true
  # RDS promotes standby automatically on primary failure
  # DNS endpoint stays the same — apps reconnect to same hostname
}

# Aurora — automated failover in ~30s, up to 15 replicas
resource "aws_rds_cluster" "aurora" {
  engine = "aurora-mysql"
}
resource "aws_rds_cluster_instance" "primary" {
  cluster_identifier = aws_rds_cluster.aurora.id
  instance_class     = "db.r6g.xlarge"
  promotion_tier     = 0   # lowest = highest failover priority
}
resource "aws_rds_cluster_instance" "replica" {
  cluster_identifier = aws_rds_cluster.aurora.id
  instance_class     = "db.r6g.large"
  promotion_tier     = 1
}

Aurora cluster endpoint vs. instance endpoints:

Cluster endpoint (writer):  mydb.cluster-xxx.rds.amazonaws.com  → always points to primary
Reader endpoint:             mydb.cluster-ro-xxx.rds.amazonaws.com → load-balanced replicas
Instance endpoint:           mydb-instance-1.xxx.rds.amazonaws.com → specific instance (avoid in app config)

Always use the cluster endpoint in your application — it follows failover automatically.


Failover Runbook [I]

Scenario A: Planned failover (maintenance, upgrade)

# 1. Confirm topology before starting
orchestrator-client -c topology -i r2d2

# 2. Verify replica is caught up (lag = 0)
mysql -h db-replica-1 -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master

# 3. Graceful takeover — Orchestrator handles it
orchestrator-client -c graceful-master-takeover-auto -i r2d2 -d db-replica-1:3306

# 4. Verify promotion
orchestrator-client -c topology -i r2d2
mysql -h db-replica-1 -e "SHOW MASTER STATUS\G"   # should now be primary

# 5. Verify apps reconnected (if using ProxySQL)
mysql -h proxysql -P 6032 -u admin -e "SELECT * FROM runtime_mysql_servers;"

Scenario B: Emergency failover (primary dead)

# 1. Confirm primary is actually dead (not a network blip)
ping db-primary
mysql -h db-primary -e "SELECT 1" --connect-timeout=5

# 2. Check if Orchestrator already promoted a replica
orchestrator-client -c topology -i r2d2

# 3. If not auto-promoted, force recovery
orchestrator-client -c recover -i db-primary:3306

# 4. Verify new primary is healthy
mysql -h db-replica-1 -e "SHOW MASTER STATUS\G"
mysql -h db-replica-1 -e "SELECT @@read_only;"   # must be 0

# 5. Verify ProxySQL rerouted (or update ProxySQL manually if needed)
mysql -h proxysql -P 6032 -u admin \
  -e "UPDATE mysql_servers SET hostgroup_id=10 WHERE hostname='db-replica-1';
      LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;"

# 6. Check all remaining replicas are replicating from new primary
mysql -h db-replica-2 -e "SHOW SLAVE STATUS\G" | grep -E "Master_Host|Running|Behind"

# 7. Provision replacement replica for the failed primary
# (see: provisioning a new replica below)

Scenario C: Workers didn't reconnect after failover

# Verify new primary is healthy first
mysql -h proxysql -e "SELECT @@hostname, @@read_only;"

# Restart affected namespace (faster than GitOps release)
kubectl rollout restart deployment -n <namespace>
kubectl rollout status deployment/<name> -n <namespace>

# Verify workers reconnected
kubectl logs -n <namespace> -l app=<worker> --since=2m | grep -i "connect"

Provisioning a New Replica [I]

After a failover, provision a replacement replica to restore redundancy:

# 1. Take a backup from the new primary (or existing replica — less load)
mysqldump --single-transaction --master-data=2 \
  -h db-replica-1 -u backup mydb > /tmp/backup.sql

# Or faster with xtrabackup:
xtrabackup --backup --target-dir=/tmp/xtrabackup --host=db-replica-1

# 2. Restore on new instance
mysql -h new-replica < /tmp/backup.sql

# 3. Configure replication (binlog position is in the dump header)
# --master-data=2 writes it as a comment: CHANGE MASTER TO MASTER_LOG_FILE='...', MASTER_LOG_POS=...
mysql -h new-replica -e "
  CHANGE MASTER TO
    MASTER_HOST='db-replica-1',
    MASTER_USER='replication',
    MASTER_PASSWORD='use-secrets-manager',
    MASTER_LOG_FILE='mysql-bin.000042',
    MASTER_LOG_POS=12345;
  START SLAVE;"

# 4. Verify replication running and lag closing
mysql -h new-replica -e "SHOW SLAVE STATUS\G" | grep -E "Running|Behind"

# 5. Register with Orchestrator
orchestrator-client -c discover -i new-replica:3306

Application Connection Resilience [B]

The failover incident from the r2d2 cluster happened because apps held stale connections. Two layers of defense:

Layer 1: Connection pool validation (detect stale connections before use)

# SQLAlchemy
engine = create_engine(
    "mysql+pymysql://app:pass@proxysql:6033/mydb",
    pool_pre_ping=True,       # SELECT 1 before each checkout — detects dead connections
    pool_recycle=3600,        # recycle connections every hour regardless
    pool_size=10,
    max_overflow=20,
)
# Django settings.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'HOST': 'proxysql',
        'PORT': '6033',
        'CONN_MAX_AGE': 60,        # don't hold connections forever
        'OPTIONS': {
            'connect_timeout': 10,
            'init_command': "SET sql_mode='STRICT_TRANS_TABLES'",
        }
    }
}

Layer 2: Retry on transient connection errors

from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
import pymysql

@retry(
    retry=retry_if_exception_type((pymysql.OperationalError, pymysql.InterfaceError)),
    wait=wait_exponential(multiplier=1, min=1, max=10),
    stop=stop_after_attempt(5)
)
def db_query(session, stmt):
    return session.execute(stmt)

MySQL Version Upgrades [I]

Major version upgrades (5.7 → 8.0) are high-risk operations. Use blue/green to preserve a rollback window.

# Blue/green upgrade via replication
# 1. Set up a replica running the NEW version
mysqldump --single-transaction --master-data=2 -h primary-5.7 mydb > /tmp/backup.sql
mysql -h new-8.0-instance < /tmp/backup.sql
# Configure replication: new-8.0-instance replicates from primary-5.7

# 2. Let it catch up, verify no replication errors
mysql -h new-8.0-instance -e "SHOW SLAVE STATUS\G"

# 3. Run application against new-8.0-instance in staging — check for:
#    - Deprecated SQL syntax (SHOW WARNINGS)
#    - Authentication plugin changes (caching_sha2_password vs mysql_native_password)
#    - Removed variables (query_cache_* no longer exist in 8.0)

# 4. Cutover: stop writes to old primary, let new-8.0 catch up, promote
SET GLOBAL read_only = ON;   -- on old primary
# Wait for Seconds_Behind_Master = 0 on new-8.0
orchestrator-client -c graceful-master-takeover-auto -i cluster -d new-8.0:3306

# 5. Keep old 5.7 primary as a replica for 24h — easy rollback if needed

MySQL 8.0 common upgrade gotchas:

Issue Check
query_cache_* variables removed Remove from my.cnf before upgrade
utf8mb3utf8mb4 default Test collation-sensitive queries and indexes
caching_sha2_password default auth Update client drivers and connection strings
ONLY_FULL_GROUP_BY stricter Audit GROUP BY queries in slow log
Reserved words expanded Check column/table names against new reserved words list