Database Postmortem Template
Use for operational failures: outage, failover, replication break, backup failure. Keep it blameless: use roles, not names. → Handbook
| Field | Value |
|---|---|
| ID | PM-YYYY-NNN |
| Date | YYYY-MM-DD |
| Severity | SEV1 / SEV2 / SEV3 |
| DB / Cluster | |
| Owner | |
| Status | Draft / In Review / Approved |
Summary
What failed, duration, how detected, how resolved. 3–5 sentences.
Timeline
| Time (UTC) | Event |
|---|---|
| | Leadup (change, job, config that created the condition) |
| | Fault began |
| | Detected |
| | Mitigation started |
| | Service restored |
Detection gap: ___ Response gap: ___
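The two gaps can be derived from the timeline rows. The sketch below assumes that the detection gap is the time from "Fault began" to "Detected" and the response gap is the time from "Detected" to "Mitigation started"; the template does not define them, so adjust if your handbook says otherwise. The function names and the sample timestamps are illustrative only.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' string as a UTC timestamp."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


def timeline_gaps(fault_began: str, detected: str, mitigation_started: str) -> dict[str, timedelta]:
    """Return the detection and response gaps for one incident timeline.

    Assumed definitions (not stated in the template):
      detection gap = Detected - Fault began
      response gap  = Mitigation started - Detected
    """
    fault = parse_utc(fault_began)
    detect = parse_utc(detected)
    mitigate = parse_utc(mitigation_started)
    if not fault <= detect <= mitigate:
        raise ValueError("timeline entries must be in chronological order")
    return {
        "detection_gap": detect - fault,
        "response_gap": mitigate - detect,
    }


# Placeholder timestamps, not taken from a real incident.
gaps = timeline_gaps("2024-01-01 02:10", "2024-01-01 02:25", "2024-01-01 02:32")
print(gaps["detection_gap"])  # 0:15:00
print(gaps["response_gap"])   # 0:07:00
```

Rejecting out-of-order timestamps catches the most common data-entry mistake, a time recorded in local time instead of UTC.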
Five Whys
- Why did users/services experience impact? →
- Why? →
- Why? →
- Why? →
- Why? →
Root cause:
Category: Bug / Change / Scale / Architecture / Dependency / Unknown
DB sub-type: Connection handling / Replication / Mount/storage / Migration / Backup / Config drift / Monitoring gap
Impact
| Duration | Systems affected | Users affected | Data loss | Replication impacted |
|---|---|---|---|---|
| | | | Yes / No | Yes / No |
Corrective Actions
Actionable (verb + outcome) · Specific · Bounded (definition of done). At least one must address root cause.
| # | Category | Action | Owner | Due | Ticket |
|---|---|---|---|---|---|
| 1 | Prevent | | | | |
| 2 | Detect | | | | |
| 3 | Mitigate | | | | |
Category options: Investigate · Mitigate this incident · Repair damage · Detect future · Mitigate future · Prevent future
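A review step can check the corrective-actions table mechanically before approval. The following is a minimal sketch, assuming each row is loaded into a hypothetical `Action` record; the `addresses_root_cause` flag is an assumption added for illustration, since the template has no such column and a reviewer would have to set it by hand. It only checks that the bounded fields are filled in and that at least one action is marked as addressing the root cause.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """One row of the corrective-actions table (hypothetical record type)."""
    category: str
    action: str
    owner: str
    due: str
    ticket: str
    addresses_root_cause: bool = False  # assumed flag, not a template column


def check_actions(actions: list[Action]) -> list[str]:
    """Return a list of problems; an empty list means the table passes."""
    problems = []
    for number, row in enumerate(actions, start=1):
        # Every action needs a description, an owner, a due date, and a ticket.
        for field in ("action", "owner", "due", "ticket"):
            if not getattr(row, field).strip():
                problems.append(f"action {number}: missing {field}")
    # The template requires at least one action aimed at the root cause.
    if not any(row.addresses_root_cause for row in actions):
        problems.append("no action addresses the root cause")
    return problems


# Placeholder rows for illustration only.
rows = [
    Action("Prevent", "Add replication lag alert", "DBA on-call", "2024-02-01", "TICKET-1", True),
    Action("Detect", "Page on backup job failure", "", "2024-02-15", "TICKET-2"),
]
print(check_actions(rows))  # ['action 2: missing owner']
```

Whether an action is actionable and specific still needs human judgment; the check covers only the parts that can be verified from the table itself.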
Lessons Learned
| Went well | Could have gone better | Got lucky |
|---|---|---|
| | | |
Approval
| Approver (role) | Root cause agreed | Actions agreed | Date |
|---|---|---|---|
| | ☐ | ☐ | |