Process
Postmortem template
Canonical template for SEV1 + SEV2 postmortems.
Last updated
Postmortem template
Every SEV1 + SEV2 incident produces a postmortem within 5 business days of resolution. SEV1 + SEV2 postmortems are published externally 30 days after resolution. SEV3 + SEV4 internal-only.
The primitive that validates section completeness lives at
apps/api/lib/postmortem.ts.
Required sections per severity
| Section | SEV1 | SEV2 | SEV3 | SEV4 |
|---|---|---|---|---|
| Summary | ✓ | ✓ | ✓ | ✓ |
| Impact | ✓ | ✓ | ✓ | |
| Timeline | ✓ | ✓ | ||
| Root cause | ✓ | ✓ | ✓ | ✓ |
| Contributing factors | ✓ | |||
| What went well | ✓ | |||
| Action items | ✓ | ✓ | ✓ | ✓ |
| Lessons learned | ✓ | ✓ |
Template
# Postmortem — <incident-id>
**Severity:** SEV1
**Started:** 2026-05-15T14:23:00Z
**Detected:** 2026-05-15T14:24:00Z
**Mitigated:** 2026-05-15T14:51:00Z
**Resolved:** 2026-05-15T15:30:00Z
**Author:** <name>
**Reviewers:** <on-call lead>, <bounded-context owner>
## Summary
One paragraph: what broke, who was affected, how long, what we did.
## Impact
- Customers affected: <count, regions, tiers>
- Operations affected: <list>
- Data integrity: <verified intact / partial / breach>
- SLA credits triggered: <yes/no — see SLA engine>
## Timeline
(All times UTC. Synced with Matter-Request-Id traces — link.)
- 14:23 First customer report via support
- 14:24 Alert fires on p99 breach
- 14:28 On-call paged
- 14:35 Mitigation applied: failover to secondary region
- 14:51 Symptoms gone from customer-visible traffic
- 15:30 Root cause patched + verified
## Root cause
What went wrong, mechanically. Be specific — file path, commit, the
exact transition that broke.
## Contributing factors
(SEV1 only.) What conditions made this worse than it should have
been? Where did defense-in-depth fail?
## What went well
(SEV1 only.) What worked? What do we want to keep?
## Action items
Each action item has an owner + a due date + a tracking ID.
- [ ] **AI-1234** [P0, owner: @alice, due: 2026-05-22]: Add invariant
test catching this transition.
- [ ] **AI-1235** [P1, owner: @bob, due: 2026-06-01]: Update runbook
to include the missing failover step.
## Lessons learned
What we want every engineer to internalise from this incident.
---
**Publication:**
- [ ] Internal review complete (every reviewer signed off)
- [ ] Action items tracked in linear / GitHub issues
- [ ] External publication scheduled for: <date + 30>Process
- Day 0 — incident resolved. On-call writes a skeleton with timeline + summary.
- Day 1-3 — root cause analysis. Postmortem section completes.
- Day 5 — internal review with bounded-context owner + at least one engineer not involved in the incident.
- Day 35 — external publication if SEV1 / SEV2.
Why these rules
- 5-day deadline — postmortem quality decays the longer we wait. Fresh memory + active context = better root cause.
- 30-day publication delay — gives us time to validate the fix before publishing the failure mode.
- Required reviewer not involved — challenges blind spots.
- Lessons learned section — explicitly captured so we revisit during quarterly retros.