Incident Management
Incident Lifecycle [B]
Severity Levels [B]
Define these clearly so everyone knows the response expectations:
| Severity | User Impact | Response Time | Example |
|---|---|---|---|
| SEV1 | Complete outage | Immediate, 24/7 | Site down, data loss |
| SEV2 | Major feature broken | < 30 min, 24/7 | Checkout failing, login broken |
| SEV3 | Partial degradation | Business hours | Slow reports, minor feature down |
| SEV4 | Minimal impact | Next sprint | Cosmetic bug, edge-case failure |
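One way to keep the matrix unambiguous is to encode it as data that tooling can read. The following is a minimal sketch, not a prescribed implementation: `Severity`, `ResponsePolicy`, `POLICIES`, and `should_page` are hypothetical names, and the `pages_on_call` flag assumes that only the rows marked "24/7" wake someone up.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class ResponsePolicy:
    user_impact: str
    response_time: str
    pages_on_call: bool  # assumption: only "24/7" rows page the on-call


# Mirrors the severity table row for row.
POLICIES = {
    Severity.SEV1: ResponsePolicy("Complete outage", "Immediate, 24/7", True),
    Severity.SEV2: ResponsePolicy("Major feature broken", "< 30 min, 24/7", True),
    Severity.SEV3: ResponsePolicy("Partial degradation", "Business hours", False),
    Severity.SEV4: ResponsePolicy("Minimal impact", "Next sprint", False),
}


def should_page(severity: Severity) -> bool:
    """Return True when this severity should page someone immediately."""
    return POLICIES[severity].pages_on_call
```

Keeping the table and the paging logic in one structure means a change to response expectations is made in a single place.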
Detection [B]
Sources of incident detection:
1. Monitoring alert fires → pager (PagerDuty, OpsGenie)
2. User/customer report → support ticket
3. Internal report → Slack/chat
4. Automated health check fails
Minimize time-to-detect:
- SLO-based burn rate alerts (not just threshold alerts)
- Synthetic monitoring / uptime checks
- RUM (Real User Monitoring)
→ See Observability: Alerting
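To make the difference between a burn rate alert and a plain threshold alert concrete, here is a rough sketch of the arithmetic. The function names are hypothetical, and the default threshold of 14.4 is an assumption: it is the rate at which 2% of a 30-day error budget is spent in one hour (0.02 × 720 h = 14.4). Checking a long and a short window together is one common way to avoid paging on an error spike that has already ended.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent.

    A burn rate of 1.0 uses up exactly the whole budget over the SLO period.
    """
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / error_budget


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumption: 2% of a 30-day budget in 1 hour
) -> bool:
    """Page only if both windows are burning budget faster than the threshold."""
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )
```

With a 99.9% SLO, a sustained 2% error ratio is a burn rate of about 20 and would page, while the same ratio in the long window alone (with the short window back to normal) would not.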
Triage [I]
First 5 minutes of an incident:
- Acknowledge the alert: stop the pager
- Assess scope: how many users affected? Which services?
- Set severity: based on impact matrix above
- Declare incident: create incident channel (#inc-YYYYMMDD-service)
- Assign roles:
  - Incident Commander (IC): coordinates, drives to resolution
  - Technical Lead: digs into the problem
  - Comms Lead: updates stakeholders
IC does NOT debug. IC coordinates.
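The channel naming pattern and the role split lend themselves to a small helper. This is a sketch under stated assumptions: `incident_channel_name` and `IncidentRoles` are hypothetical names, the slug rule (lowercase, non-alphanumerics collapsed to hyphens) is a guess at what chat tools accept, and the check that the IC and Technical Lead differ is one possible way to enforce "IC does NOT debug".

```python
import re
from dataclasses import dataclass
from datetime import date


def incident_channel_name(service: str, day: date) -> str:
    """Build a channel name following the #inc-YYYYMMDD-service pattern."""
    # Assumption: lowercase the service name and collapse anything that is
    # not a letter or digit into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", service.lower()).strip("-")
    if not slug:
        raise ValueError("service name must contain a letter or digit")
    return f"#inc-{day:%Y%m%d}-{slug}"


@dataclass(frozen=True)
class IncidentRoles:
    incident_commander: str
    technical_lead: str
    comms_lead: str

    def __post_init__(self) -> None:
        # The IC coordinates and does not debug, so the same person
        # cannot hold both roles.
        if self.incident_commander == self.technical_lead:
            raise ValueError("IC and Technical Lead must be different people")


print(incident_channel_name("Auth Service", date(2024, 1, 18)))
# prints "#inc-20240118-auth-service"
```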
Mitigation vs Resolution [I]
| | Mitigation | Resolution |
|---|---|---|
| Goal | Stop user impact | Fix root cause |
| Speed | As fast as possible | Thorough |
| Examples | Rollback, feature flag off, redirect traffic | Fix bug, patch infra, update config |
Prioritize mitigation over root cause analysis during the incident.
Common mitigations:
- `kubectl rollout undo deployment/my-service`
- Disable feature flag
- Scale up horizontally
- Failover to backup region
- Rate-limit or shed load
- Restore from backup (DBRE → see Backup & Recovery)
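During an incident it helps if the rollback command is generated rather than typed from memory. The sketch below only builds the argument list and does not run anything. `rollback_command` is a hypothetical helper; `--namespace` and `--to-revision` are existing kubectl flags, but whether a team needs them is an assumption.

```python
import shlex
from typing import List, Optional


def rollback_command(
    deployment: str,
    namespace: Optional[str] = None,
    to_revision: Optional[int] = None,
) -> List[str]:
    """Return the argv for rolling a Deployment back; nothing is executed."""
    cmd = ["kubectl", "rollout", "undo", f"deployment/{deployment}"]
    if namespace:
        cmd += ["--namespace", namespace]
    if to_revision is not None:
        # Without this flag kubectl rolls back to the previous revision.
        cmd.append(f"--to-revision={to_revision}")
    return cmd


print(shlex.join(rollback_command("my-service")))
# prints "kubectl rollout undo deployment/my-service"
```

Returning a list instead of a string keeps the command safe to hand to `subprocess.run` without shell quoting issues, and printing it first gives the Technical Lead a chance to confirm the target before anything changes.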
Communication During Incidents [I]
Internal (Slack/Teams):
- Single incident channel, named consistently
- Status updates every 15-30 min, for example: "[UPDATE 14:35] Still investigating. Auth service logs show DB timeouts. Working on mitigation."
- No blame, no speculation
External (Status Page):
- Update status page within 5-10 min of SEV1/SEV2
- Be honest but avoid technical jargon
- Post an update at least every 30 min
Escalation:
- Page the on-call expert for the affected component
- Loop in management when a SEV1 lasts longer than 30 min
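The update format, the update cadence, and the management escalation rule can all be expressed as small checks that a bot or the Comms Lead's tooling could run. This is a sketch with hypothetical function names. The 30-minute gap is taken from the upper bound of the internal cadence, and representing severity as a plain string is an assumption made for brevity.

```python
from datetime import datetime, timedelta


def format_update(now: datetime, message: str) -> str:
    """Render a status update in the "[UPDATE HH:MM] message" style."""
    return f"[UPDATE {now:%H:%M}] {message}"


def update_overdue(
    last_update: datetime,
    now: datetime,
    max_gap: timedelta = timedelta(minutes=30),
) -> bool:
    """True when more than max_gap has passed since the last update."""
    return now - last_update > max_gap


def needs_management(severity: str, started: datetime, now: datetime) -> bool:
    """True when a SEV1 has been running for longer than 30 minutes."""
    return severity == "SEV1" and now - started > timedelta(minutes=30)


print(format_update(datetime(2024, 1, 18, 14, 35), "Still investigating."))
# prints "[UPDATE 14:35] Still investigating."
```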
Postmortems [I]
A postmortem is a blameless written analysis done after every SEV1/SEV2.
Blameless Culture
People make mistakes. The system allowed the mistake to have impact. Fix the system.
"We don't punish people for making mistakes. We fix the conditions that made the mistake possible."
Postmortem Template
## Incident: [title]
**Date:** YYYY-MM-DD
**Duration:** X hours Y min
**Severity:** SEV1/SEV2
**Author(s):**
## Summary
[2-3 sentence summary of what happened, impact, and resolution]
## Timeline
| Time | Event |
|------|-------|
| 14:00 | Alert fired for elevated error rate |
| 14:03 | On-call acknowledged |
| 14:15 | Root cause identified (bad deploy) |
| 14:20 | Rollback initiated |
| 14:25 | Service restored |
## Root Cause
[Describe the technical root cause]
## Contributing Factors
- [Factor 1]
- [Factor 2]
## Impact
- Users affected: ~5,000
- Duration: 25 minutes
- Revenue impact: ~$X
## What Went Well
- Response was fast (3 min from alert to ack)
- Runbook was accurate
## What Went Poorly
- No canary deployment caught this
- Alert was too noisy, delayed response
## Action Items
| Action | Owner | Due |
|--------|-------|-----|
| Add canary deployment step | @alice | 2024-02-01 |
| Tune alert threshold | @bob | 2024-01-25 |
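The durations quoted in the sample template follow directly from its timeline, so they can be derived rather than typed by hand. A minimal sketch, assuming all timestamps fall on the same day and use the `HH:MM` format shown in the timeline table; `minutes_between` and the dictionary keys are hypothetical names.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Timestamps copied from the sample timeline in the template.
timeline = {
    "alert": "14:00",
    "ack": "14:03",
    "root_cause": "14:15",
    "rollback": "14:20",
    "restored": "14:25",
}

time_to_ack = minutes_between(timeline["alert"], timeline["ack"])
time_to_restore = minutes_between(timeline["alert"], timeline["restored"])

print(time_to_ack, time_to_restore)
# prints "3 25"
```

These match the "3 min from alert to ack" and "Duration: 25 minutes" entries in the sample, which is a quick consistency check worth doing before a postmortem is published.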
Incident Metrics to Track [A]
- MTTD — Mean Time to Detect
- MTTA — Mean Time to Acknowledge
- MTTM — Mean Time to Mitigate
- MTTR — Mean Time to Resolve
- MTBF — Mean Time Between Failures
- Incident frequency — count per week/month by severity
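Track these as trends over time rather than as single data points. Improving MTTD and MTTM has the highest user impact, because users are affected from the moment the failure starts until it is mitigated.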
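As a rough illustration of how these metrics fall out of incident records, the sketch below computes them in minutes. The `Incident` fields and the exact start and end points of each interval are assumptions, since teams define them differently. Here MTTD, MTTM, and MTTR are measured from the start of user impact, MTTA from detection, and MTBF as the gap between one incident's resolution and the next incident's start.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, Iterable, List


@dataclass
class Incident:
    started: datetime       # when user impact began
    detected: datetime      # when an alert or report surfaced it
    acknowledged: datetime  # when a human took ownership
    mitigated: datetime     # when user impact stopped
    resolved: datetime      # when the root cause was fixed


def _mean_minutes(deltas: Iterable[timedelta]) -> float:
    return mean(d.total_seconds() for d in deltas) / 60


def incident_metrics(incidents: List[Incident]) -> Dict[str, float]:
    """Return mean times in minutes; MTBF is included only with 2+ incidents."""
    ordered = sorted(incidents, key=lambda i: i.started)
    metrics = {
        "MTTD": _mean_minutes(i.detected - i.started for i in ordered),
        "MTTA": _mean_minutes(i.acknowledged - i.detected for i in ordered),
        "MTTM": _mean_minutes(i.mitigated - i.started for i in ordered),
        "MTTR": _mean_minutes(i.resolved - i.started for i in ordered),
    }
    if len(ordered) > 1:
        # Healthy time between the end of one incident and the start of the next.
        gaps = [b.started - a.resolved for a, b in zip(ordered, ordered[1:])]
        metrics["MTBF"] = _mean_minutes(gaps)
    return metrics
```

Feeding this one severity at a time, per week or month, gives the trend lines that the list above calls for.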
How Real Companies Do It [I]
From howtheysre:
- Atlassian — publishes real postmortems at statuspage.io, pioneered public incident communication
- Google — wrote the playbook: blameless postmortems, Incident Command System borrowed from firefighting
- GitHub — ChatOps incident response with Hubot, all actions logged in incident channel automatically
- Slack — built tooling to auto-create incidents from alerts and auto-populate context (recent deploys, affected services)
- PagerDuty — their own incident command system requires IC to not touch a keyboard during SEV1
- Cloudflare — publishes detailed technical postmortems publicly (worth reading for format and depth)
→ See Case Studies for more company-specific patterns.
Related Topics
- On-Call & Runbooks
- Observability
- SLOs / SLIs / SLAs
- DBRE: Backup & Recovery
- howtheysre — real incident case studies