On-Call & Runbooks¶
On-Call Basics [B]¶
On-call = you're responsible for responding to production incidents outside business hours.
Healthy on-call:

- < 2 incidents per shift that require action
- < 25% of on-call time spent on incidents
- Runbooks exist for all common alerts
- Primary + secondary rotation (backup always available)

Unhealthy on-call:

- Alert storms, constant pages
- No runbooks → tribal knowledge only
- Single-person rotation, no backup
- Same incidents repeat (no postmortems)
Rotation Structure [B]¶
Week 1: Alice (primary) → Bob (secondary/backup)
Week 2: Bob (primary) → Carol (secondary)
Week 3: Carol (primary) → Alice (secondary)
Tools: PagerDuty, OpsGenie, VictorOps
Best practices:

- Handoff meeting/notes at rotation change
- Always have a secondary who can be escalated to
- Adjust rotations for time zones (follow-the-sun)
- Compensate on-call fairly (time off, extra pay)
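The three-person rotation above is just modular arithmetic on the week number. A minimal sketch (the names and the `rotation` helper are hypothetical, not a real tool):

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "primary secondary" for a given week number,
# following the three-person rotation in the example above.
rotation() {
  local engineers=(alice bob carol)
  local n=${#engineers[@]} week=$1
  echo "${engineers[$(( (week - 1) % n ))]} ${engineers[$(( week % n ))]}"
}

rotation 1   # → alice bob   (week 1: Alice primary, Bob secondary)
rotation 4   # → alice bob   (cycle repeats every 3 weeks)
```

In practice the scheduler (PagerDuty, OpsGenie) does this for you; the point is only that a rotation is deterministic, so handoffs are predictable.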
Escalation Policies [I]¶
Alert fires
→ Primary on-call (2 min to ack)
→ Secondary on-call (5 min)
→ Engineering Manager (10 min)
→ VP Engineering (15 min, SEV1 only)
Configure this in PagerDuty/OpsGenie. Never let an alert die in silence.
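In PagerDuty's Terraform provider, the ladder above could be sketched roughly like this (the resource names and schedule/user references are placeholder assumptions, and SEV1-only routing to the VP usually lives in a separate policy attached to a SEV1 service):

```hcl
# Sketch of the escalation ladder using the PagerDuty Terraform provider.
# Schedule and user references below are hypothetical placeholders.
resource "pagerduty_escalation_policy" "payments" {
  name      = "Payments escalation"
  num_loops = 2   # repeat the ladder if nobody acks

  rule {
    escalation_delay_in_minutes = 2            # primary has 2 min to ack
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.primary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 5            # then the secondary
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.secondary.id
    }
  }

  rule {
    escalation_delay_in_minutes = 10           # then the engineering manager
    target {
      type = "user_reference"
      id   = pagerduty_user.eng_manager.id
    }
  }
}
```

Keeping the policy in code makes escalation changes reviewable, which matters more than the exact timings.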
Runbooks [I]¶
A runbook is a step-by-step guide for responding to a specific alert.
Runbook Template¶
# Runbook: [Alert Name]
## Alert
**Name:** high_error_rate_payment_service
**Fires when:** Error rate > 1% for 5 minutes
**Severity:** SEV2
## Impact
Payment processing is degraded. Users cannot complete purchases.
## Quick Checks
1. Check service health: `kubectl get pods -n payments`
2. Check recent deployments: `kubectl rollout history deployment/payment-service`
3. Check DB connectivity: [link to DB dashboard]
4. Check upstream dependencies: [Stripe status page]
## Diagnosis Steps
1. Check error logs:
```bash
kubectl logs -n payments -l app=payment-service --since=15m | grep ERROR
```
2. Check if errors correlate with a recent deploy (check deployment annotation in Grafana)
3. Check DB slow query log → [DBRE runbook link]
## Mitigation Options
### Option A: Rollback (if recent deploy)
```bash
kubectl rollout undo deployment/payment-service -n payments
```

### Option B: Scale up (if load-related)
Increase replicas: `kubectl scale deployment/payment-service -n payments --replicas=<N>`

### Option C: Disable feature flag
Go to LaunchDarkly → disable `new_payment_flow`

## Resolution
Confirm error rate < 0.1% for 10 minutes. Close the incident. File a postmortem if SEV1/SEV2.

## Contacts
- Payment team Slack: #team-payments
- DB issues: #team-dbre or DBRE on-call
- Stripe issues: Stripe status page

## Related Runbooks
Alert Quality [I]¶
Every alert should answer:

1. What is broken?
2. Who is affected?
3. What to do (link to runbook)?

Alert review checklist:

- [ ] Has a runbook link
- [ ] Severity is correctly set
- [ ] Not a duplicate of another alert
- [ ] Has been actionable in the last 30 days
- [ ] Fires on symptoms, not causes

Kill alerts that:

- Fire repeatedly but resolve on their own
- Require no action
- Are always suppressed/silenced
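An alert that answers all three questions can encode them directly in its annotations. A Prometheus rule sketch, where the metric name, threshold expression, and runbook URL are illustrative assumptions:

```yaml
groups:
  - name: payments-alerts
    rules:
      - alert: HighErrorRatePaymentService
        # Assumed metric; substitute whatever your services actually export.
        expr: |
          sum(rate(http_requests_total{job="payment-service", code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="payment-service"}[5m])) > 0.01
        for: 5m
        labels:
          severity: sev2
        annotations:
          summary: "Payment error rate above 1%"          # what is broken
          description: "Users cannot complete purchases"  # who is affected
          runbook_url: "https://wiki.example.com/runbooks/payment-error-rate"  # what to do
```

The `runbook_url` annotation is a convention, not a Prometheus built-in, but most pagers and dashboards surface it next to the alert.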
On-Call Hygiene [I]¶
Before your shift:

- Read handoff notes from the previous on-call
- Check if there are open incidents/degradations
- Verify your pager is working (test page yourself)
- Make sure your laptop is charged and VPN accessible

During your shift:

- Keep phone nearby (silent but vibrate) during off-hours
- Log every page and action taken (even if auto-resolved)
- If unsure, escalate — don't hero-debug for 2 hours alone

After your shift:

- Write handoff notes
- File tickets for recurring issues
- Propose runbook improvements for anything that confused you
Reducing On-Call Burden [A]¶
Priority order:

1. Eliminate alert — is it even necessary?
2. Automate response — can a bot handle it?
3. Improve runbook — reduce time-to-mitigate
4. Fix root cause — so it never fires again
5. Tune threshold — reduce false positives
Auto-remediation examples:

- Auto-restart pod on OOMKill
- Auto-scale on CPU > 80%
- Auto-rollback on error rate spike post-deploy
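As a sketch of the auto-rollback idea: a guard that compares error rates before and after a deploy and only then triggers a rollback. The threshold policy here is an assumption, and the actual `kubectl` call appears only as a comment:

```bash
#!/bin/sh
# Hypothetical guard: given error rates (in %) before and after a deploy,
# succeed (exit 0) only when a rollback looks justified.
should_rollback() {
  pre=$1 post=$2
  # Assumed policy: post-deploy rate more than doubled AND exceeds 1%.
  awk -v a="$pre" -v b="$post" 'BEGIN { exit !(b > 2 * a && b > 1.0) }'
}

if should_rollback 0.2 3.5; then
  echo "rollback"   # in production: kubectl rollout undo deployment/...
fi
```

The hard part of auto-remediation is never the action itself but the decision logic: make it conservative, log every automated action, and page a human when the bot's fix doesn't stick.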
Related Topics¶
- Incident Management
- Observability: Alerting
- SLOs / SLIs / SLAs
- Platform: CI/CD — deploy safety, canary releases
- devops-exercises