Last updated Jun 14, 2026

Postmortem template

Every SEV1 + SEV2 incident produces a postmortem within 5 business days of resolution. SEV1 + SEV2 postmortems are published externally 30 days after resolution. SEV3 + SEV4 postmortems are internal-only.

The primitive that validates section completeness lives at apps/api/lib/postmortem.ts.

Required sections per severity

Section	SEV1	SEV2	SEV3	SEV4
Summary	✓	✓	✓	✓
Impact	✓	✓	✓	-
Timeline	✓	✓	-	-
Root cause	✓	✓	✓	✓
Contributing factors	✓	-	-	-
What went well	✓	-	-	-
Action items	✓	✓	✓	✓
Lessons learned	✓	✓	-	-
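
The table can be read as a lookup keyed by severity. The sketch below shows one way such a completeness check might look. It is an illustration only, not the contents of apps/api/lib/postmortem.ts: the names `REQUIRED_SECTIONS` and `missingSections` are hypothetical. It assumes each table row's ticks fill from SEV1 rightward, and that a section counts as complete when its `##` heading is followed by some non-blank text.

```typescript
type Severity = "SEV1" | "SEV2" | "SEV3" | "SEV4";

// Required sections per severity, transcribed from the table
// (assumption: ticks in each row start at the SEV1 column).
const REQUIRED_SECTIONS: Record<Severity, readonly string[]> = {
  SEV1: [
    "Summary", "Impact", "Timeline", "Root cause",
    "Contributing factors", "What went well",
    "Action items", "Lessons learned",
  ],
  SEV2: [
    "Summary", "Impact", "Timeline", "Root cause",
    "Action items", "Lessons learned",
  ],
  SEV3: ["Summary", "Impact", "Root cause", "Action items"],
  SEV4: ["Summary", "Root cause", "Action items"],
};

// Hypothetical helper: lists required sections that are absent or empty.
function missingSections(markdown: string, severity: Severity): string[] {
  const bodies = new Map<string, string>();
  let current: string | null = null;

  for (const line of markdown.split("\n")) {
    // A horizontal rule ends the last section (the publication
    // checklist that follows it is not section content).
    if (line.trim() === "---") {
      current = null;
      continue;
    }
    const heading = /^##\s+(.+?)\s*$/.exec(line);
    if (heading) {
      current = heading[1].toLowerCase();
      bodies.set(current, "");
    } else if (current !== null) {
      bodies.set(current, (bodies.get(current) ?? "") + line.trim());
    }
  }

  return REQUIRED_SECTIONS[severity].filter(
    (name) => (bodies.get(name.toLowerCase()) ?? "") === "",
  );
}
```

Matching headings case-insensitively keeps the check tolerant of small edits such as "Root Cause", while still flagging a heading that was left with no body.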

Template

# Postmortem — <incident-id>

**Severity:** SEV1
**Started:** 2026-05-15T14:23:00Z
**Detected:** 2026-05-15T14:24:00Z
**Mitigated:** 2026-05-15T14:51:00Z
**Resolved:** 2026-05-15T15:30:00Z
**Author:** <name>
**Reviewers:** <on-call lead>, <bounded-context owner>

## Summary

One paragraph: what broke, who was affected, how long, what we did.

## Impact

- Customers affected: <count, regions, tiers>
- Operations affected: <list>
- Data integrity: <verified intact / partial / breach>
- SLA credits triggered: <yes/no — see SLA engine>

## Timeline

(All times UTC. Synced with Matter-Request-Id traces — link.)

- 14:23 First customer report via support
- 14:24 Alert fires on p99 breach
- 14:28 On-call paged
- 14:35 Mitigation applied: failover to secondary region
- 14:51 Symptoms gone from customer-visible traffic
- 15:30 Root cause patched + verified

## Root cause

What went wrong, mechanically. Be specific — file path, commit, the
exact transition that broke.

## Contributing factors

(SEV1 only.) What conditions made this worse than it should have
been? Where did defense-in-depth fail?

## What went well

(SEV1 only.) What worked? What do we want to keep?

## Action items

Each action item has an owner + a due date + a tracking ID.

- [ ] **AI-1234** [P0, owner: @alice, due: 2026-05-22]: Add invariant
  test catching this transition.
- [ ] **AI-1235** [P1, owner: @bob, due: 2026-06-01]: Update runbook
  to include the missing failover step.

## Lessons learned

What we want every engineer to internalise from this incident.

---

**Publication:**
- [ ] Internal review complete (every reviewer signed off)
- [ ] Action items tracked in Linear / GitHub Issues
- [ ] External publication scheduled for: <date + 30>
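
Outside the template itself: the action-item lines follow a regular shape (checkbox, tracking ID, priority, owner, due date, description), so the rule that every item carries an owner, a due date, and a tracking ID can be checked mechanically. The following is a minimal sketch under that assumption. The `ActionItem` type, `ITEM_PATTERN`, and `parseActionItems` are hypothetical names, and the pattern is inferred only from the two example items in the template.

```typescript
interface ActionItem {
  id: string;          // tracking ID, e.g. "AI-1234"
  priority: string;    // e.g. "P0"
  owner: string;       // e.g. "@alice"
  due: string;         // ISO date, YYYY-MM-DD
  done: boolean;       // true when the checkbox is ticked
  description: string;
}

// Inferred from the template's example lines; an assumption, not a spec.
const ITEM_PATTERN =
  /^- \[( |x)\] \*\*([A-Z]+-\d+)\*\* \[(P\d), owner: (@[\w-]+), due: (\d{4}-\d{2}-\d{2})\]: (.*)$/;

function parseActionItems(
  section: string,
): { items: ActionItem[]; malformed: string[] } {
  const items: ActionItem[] = [];
  const malformed: string[] = [];
  let current: ActionItem | null = null;

  for (const line of section.split("\n")) {
    const m = ITEM_PATTERN.exec(line);
    if (m) {
      current = {
        done: m[1] === "x",
        id: m[2],
        priority: m[3],
        owner: m[4],
        due: m[5],
        description: m[6].trim(),
      };
      items.push(current);
    } else if (line.startsWith("- [")) {
      // A checkbox line missing the ID, owner, or due date.
      malformed.push(line);
      current = null;
    } else if (current !== null && /^\s+\S/.test(line)) {
      // Indented continuation of a wrapped description.
      current.description += " " + line.trim();
    }
  }
  return { items, malformed };
}
```

Returning malformed lines separately, rather than dropping them, lets a reviewer see exactly which items still lack an owner, due date, or tracking ID.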

Process

Day 0 — incident resolved. On-call writes a skeleton with timeline + summary.
Day 1-3 — root cause analysis. The remaining postmortem sections are completed.
Day 5 — internal review with bounded-context owner + at least one engineer not involved in the incident.
Day 35 — external publication if SEV1 / SEV2.
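
The two dates that drive this schedule can be derived from the resolution timestamp. The sketch below is illustrative only and all function names are hypothetical. It assumes business days are Monday to Friday in UTC with no holiday calendar, and that the 30-day publication delay counts calendar days from resolution, as the introduction states. The page gives no deadline for SEV3 or SEV4, so the sketch returns `null` for those.

```typescript
type IncidentSeverity = "SEV1" | "SEV2" | "SEV3" | "SEV4";

const DAY_MS = 24 * 60 * 60 * 1000;

// Adds N business days (Mon-Fri, UTC). Holidays are NOT handled.
function addBusinessDays(start: Date, days: number): Date {
  let result = new Date(start.getTime());
  let remaining = days;
  while (remaining > 0) {
    result = new Date(result.getTime() + DAY_MS);
    const dayOfWeek = result.getUTCDay(); // 0 = Sunday, 6 = Saturday
    if (dayOfWeek !== 0 && dayOfWeek !== 6) remaining -= 1;
  }
  return result;
}

function addCalendarDays(start: Date, days: number): Date {
  return new Date(start.getTime() + days * DAY_MS);
}

// Returns ISO dates (YYYY-MM-DD), or null where no rule is stated.
function postmortemDeadlines(
  resolvedAt: string,
  severity: IncidentSeverity,
): { postmortemDue: string | null; externalPublication: string | null } {
  const resolved = new Date(resolvedAt);
  const highSeverity = severity === "SEV1" || severity === "SEV2";
  if (!highSeverity) {
    return { postmortemDue: null, externalPublication: null };
  }
  return {
    postmortemDue: addBusinessDays(resolved, 5).toISOString().slice(0, 10),
    externalPublication: addCalendarDays(resolved, 30)
      .toISOString()
      .slice(0, 10),
  };
}

// The template's example incident was resolved on Friday 2026-05-15.
console.log(postmortemDeadlines("2026-05-15T15:30:00Z", "SEV1"));
// { postmortemDue: "2026-05-22", externalPublication: "2026-06-14" }
```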

Why these rules

5-day deadline — postmortem quality decays the longer we wait. Fresh memory + active context = better root cause.
30-day publication delay — gives us time to validate the fix before publishing the failure mode.
Required reviewer not involved — challenges blind spots.
Lessons learned section — explicitly captured so we revisit during quarterly retros.
