Incident Management¶

Incident Lifecycle `[B]`¶

Detection → Triage → Mitigation → Resolution → Postmortem

Severity Levels `[B]`¶

Define these clearly so everyone knows the response expectations:

Severity	User Impact	Response Time	Example
SEV1	Complete outage	Immediate, 24/7	Site down, data loss
SEV2	Major feature broken	< 30 min, 24/7	Checkout failing, login broken
SEV3	Partial degradation	Business hours	Slow reports, minor feature down
SEV4	Minimal impact	Next sprint	Cosmetic bug, edge-case failure

Detection `[B]`¶

Sources of incident detection: 1. Monitoring alert fires → pager (PagerDuty, OpsGenie) 2. User/customer report → support ticket 3. Internal report → Slack/chat 4. Automated health check fails

Minimize time-to-detect: - SLO-based burn rate alerts (not just threshold alerts) - Synthetic monitoring / uptime checks - RUM (Real User Monitoring)

→ See Observability: Alerting

Triage `[I]`¶

First 5 minutes of an incident:

Acknowledge the alert — stop the pager
Assess scope — how many users affected? Which services?
Set severity — based on impact matrix above
Declare incident — create incident channel (#inc-YYYYMMDD-service)
Assign roles:
Incident Commander (IC) — coordinates, drives to resolution
Technical Lead — digs into the problem
Comms Lead — updates stakeholders

IC does NOT debug. IC coordinates.

Mitigation vs Resolution `[I]`¶

	Mitigation	Resolution
Goal	Stop user impact	Fix root cause
Speed	As fast as possible	Thorough
Examples	Rollback, feature flag off, redirect traffic	Fix bug, patch infra, update config

Prioritize mitigation over root cause analysis during the incident.

Common mitigations: - kubectl rollout undo deployment/my-service - Disable feature flag - Scale up horizontally - Failover to backup region - Rate-limit or shed load - Restore from backup (DBRE → see Backup & Recovery)

Communication During Incidents `[I]`¶

Internal (Slack/Teams): - Single incident channel, named consistently - Status updates every 15-30 min: [UPDATE 14:35] Still investigating. Auth service logs show DB timeouts. Working on mitigation. - No blame, no speculation

External (Status Page): - Update status page within 5-10 min of SEV1/SEV2 - Be honest but avoid technical jargon - Update every 30 min minimum

Escalation: - Page the on-call expert for the affected component - Loop in management for SEV1 > 30 min

Postmortems `[I]`¶

A postmortem is a blameless written analysis done after every SEV1/SEV2.

Blameless Culture¶

People make mistakes. The system allowed the mistake to have impact. Fix the system.

"We don't punish people for making mistakes. We fix the conditions that made the mistake possible."

Postmortem Template¶

## Incident: [title]
**Date:** YYYY-MM-DD
**Duration:** X hours Y min
**Severity:** SEV1/SEV2
**Author(s):**

## Summary
[2-3 sentence summary of what happened, impact, and resolution]

## Timeline
| Time | Event |
|------|-------|
| 14:00 | Alert fired for elevated error rate |
| 14:03 | On-call acknowledged |
| 14:15 | Root cause identified (bad deploy) |
| 14:20 | Rollback initiated |
| 14:25 | Service restored |

## Root Cause
[Describe the technical root cause]

## Contributing Factors
- [Factor 1]
- [Factor 2]

## Impact
- Users affected: ~5,000
- Duration: 25 minutes
- Revenue impact: ~$X

## What Went Well
- Detection was fast (3 min from alert to ack)
- Runbook was accurate

## What Went Poorly
- No canary deployment caught this
- Alert was too noisy, delayed response

## Action Items
| Action | Owner | Due |
|--------|-------|-----|
| Add canary deployment step | @alice | 2024-02-01 |
| Tune alert threshold | @bob | 2024-01-25 |

Incident Metrics to Track `[A]`¶

MTTD — Mean Time to Detect
MTTA — Mean Time to Acknowledge
MTTM — Mean Time to Mitigate
MTTR — Mean Time to Resolve
MTBF — Mean Time Between Failures
Incident frequency — count per week/month by severity

Track these trends over time. Improving MTTD and MTTM has the highest user impact.

How Real Companies Do It `[I]`¶

From howtheysre:

Atlassian — publishes real postmortems at statuspage.io, pioneered public incident communication
Google — wrote the playbook: blameless postmortems, Incident Command System borrowed from firefighting
GitHub — ChatOps incident response with Hubot, all actions logged in incident channel automatically
Slack — built tooling to auto-create incidents from alerts and auto-populate context (recent deploys, affected services)
PagerDuty — their own incident command system requires IC to not touch a keyboard during SEV1
Cloudflare — publishes detailed technical postmortems publicly (worth reading for format and depth)

→ See Case Studies for more company-specific patterns.

Incident Management¶

Incident Lifecycle [B]¶

Severity Levels [B]¶

Detection [B]¶

Triage [I]¶

Mitigation vs Resolution [I]¶

Communication During Incidents [I]¶

Postmortems [I]¶