Incident severity matrix
How Matter classifies operational incidents. Four severities — SEV1 through SEV4 — drive paging, comms cadence, and status-page surfacing. Published for enterprise transparency; same matrix internal on-call uses.
Every operational incident at Matter is classified into one of four severities. The severity drives:
- Paging behaviour. Who wakes up; when secondary and manager are paged.
- Public communications cadence. Status page entries; email to active users.
- Internal communications cadence. Incident channel updates; engineering-lead awareness.
- Post-mortem treatment. SEV1 and SEV2 are published externally with 30-day delay; SEV3 and SEV4 are internal-only.
The same matrix the customer reads here is the matrix on-call uses live.
The four severities
SEV1 — Critical
The highest severity. Any of the following:
- Full API outage. /v1/* returns 5xx for ≥ 5 % of requests across all customers, all regions, for ≥ 5 consecutive minutes.
- Irrecoverable data loss. Resource state is lost without a compensating audit trail. A successful write that subsequently disappears from reads.
- Audit-chain integrity failure. The append-only AuditEntry chain has a verified broken link, OR the Sigstore Rekor witness rejects a daily anchor, OR a tamper-detection drill flags a real entry as forged.
- Cryptographic primitive failure. A KMS rotation drill fails to recover; webhook signature verification rejects valid signatures globally; field-level decryption fails for an entire tenant.
- Genesis ceremony failure. A new entity's incorporator-protocol chain cannot be verified.
- Multi-region failure. Both US and EU regions degraded simultaneously.
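The "≥ 5 % for ≥ 5 consecutive minutes" trigger for a full API outage can be sketched as a streak check over per-minute counters. This is a minimal illustration, not Matter's alerting code: the `MinuteSample` type, the function name, the one-minute bucket size, and the choice to treat a zero-traffic minute as non-breaching are all assumptions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class MinuteSample:
    """Hypothetical one-minute bucket, aggregated across all customers and regions."""
    total: int        # requests to /v1/* in this minute
    errors_5xx: int   # of which returned 5xx


def sustained_error_breach(
    samples: Iterable[MinuteSample],
    rate_threshold: float = 0.05,   # ">= 5 % of requests"
    min_consecutive: int = 5,       # ">= 5 consecutive minutes"
) -> bool:
    """True if the 5xx rate meets the threshold for enough consecutive minutes."""
    streak = 0
    for sample in samples:
        # Assumption: a minute with no traffic does not count as breaching.
        rate = sample.errors_5xx / sample.total if sample.total else 0.0
        if rate >= rate_threshold:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            streak = 0  # a healthy minute resets the streak
    return False
```

A single healthy minute resets the count, which is what "consecutive" implies; four bad minutes, one good minute, and four more bad minutes would not meet the threshold under this reading.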
Response:
- Page primary on-call immediately.
- Page secondary on-call at 5 minutes if unacknowledged.
- Page manager on-call immediately on declaration.
- Page engineering lead on declaration.
- Status page incident declared within 15 minutes.
- Customer email within 30 minutes (to active users in the affected scope).
- Public comms cadence: every 30 minutes until resolution.
- Internal comms: real-time in #incident-<yyyymmdd>-<short-name>.
- Post-mortem completed within 14 days; published externally 30 days after resolution.
SEV2 — High
Significant impact but not total. Any of the following:
- Regional outage. US or EU region degraded for ≥ 10 consecutive minutes; the other region serving normally.
- Major feature unavailable. Any composite atomic flow (formation_packet, close_package, mfn_cascade, dissolve) returns 5xx for ≥ 5 % of attempts for ≥ 10 minutes.
- Compliance breach. PII leaked to an observability sink; redaction codegen drift causes a real customer's data to appear unredacted in Sentry / Logtail / traffic capture.
- Webhook delivery degraded. Webhook delivery p99 receiver-side exceeds 2× SLO ceiling for ≥ 30 minutes, OR ≥ 5 % delivery attempts dead-lettered.
- CDC lag breached. Read-model lag p99 exceeds 5× SLO ceiling for ≥ 15 minutes.
- Provider outage. A primary filing provider is unavailable AND the circuit-breaker has not yet routed to a backup.
- Tenant isolation breach. A red-team test or customer report confirms cross-tenant data exposure — even one row.
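Several SEV2 triggers combine a duration condition with an alternative share condition. The webhook trigger above is one example, sketched here under stated assumptions: the function and parameter names are hypothetical, p99 is sampled once per minute, "for ≥ 30 minutes" is read as 30 consecutive minutes, and the dead-letter share is computed over whatever window the caller supplies (the matrix does not specify one).

```python
from typing import Sequence


def webhook_delivery_degraded(
    p99_by_minute_ms: Sequence[float],  # receiver-side p99, one value per minute
    slo_ceiling_ms: float,
    attempts: int,                      # delivery attempts in the caller's window
    dead_lettered: int,                 # of which were dead-lettered
    multiplier: float = 2.0,            # "exceeds 2x SLO ceiling"
    min_minutes: int = 30,              # "for >= 30 minutes"
    dead_letter_share: float = 0.05,    # ">= 5 % delivery attempts dead-lettered"
) -> bool:
    """True if either webhook condition from the SEV2 list holds."""
    # Condition 1: p99 strictly above multiplier x ceiling for a long enough streak.
    streak = longest = 0
    for p99 in p99_by_minute_ms:
        streak = streak + 1 if p99 > multiplier * slo_ceiling_ms else 0
        longest = max(longest, streak)
    latency_breach = longest >= min_minutes

    # Condition 2: dead-letter share at or above the threshold.
    dead_letter_breach = attempts > 0 and dead_lettered / attempts >= dead_letter_share

    return latency_breach or dead_letter_breach
```

Either condition alone is enough, matching the OR in the criterion.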
Response:
- Page primary on-call immediately.
- Page secondary on-call at 10 minutes if unacknowledged.
- Page manager on-call on declaration.
- Status page incident declared within 30 minutes.
- Customer email within 60 minutes (to active users in the affected scope).
- Public comms cadence: every 60 minutes until resolution.
- Internal comms: real-time in #incident-<yyyymmdd>-<short-name>.
- Post-mortem completed within 21 days; published externally 30 days after resolution.
SEV3 — Moderate
Degraded performance with no broad customer impact. Any of the following:
- SLO breach without symptom. p99 latency on a single operation class exceeds budget for ≥ 30 minutes, but the error rate stays within budget.
- Single endpoint failing. One operation returns 5xx for ≥ 1 % of requests for ≥ 15 minutes; the rest of the API is healthy.
- Provider degraded but routing. A primary provider is slow but the circuit-breaker hasn't opened.
- Cache miss-storm. A cache layer's hit ratio drops below 50 % of target for ≥ 30 minutes.
- Webhook queue backpressure. A single endpoint's queue is in slowdown but not disabled.
- Anomaly detection fired. A token-usage anomaly triggered but is not yet confirmed malicious.
Response:
- Ticket created in on-call queue. No page outside business hours unless escalated.
- Status page: internal-only entry; no customer-visible update.
- Internal comms cadence: hourly in #incident-<yyyymmdd>-<short-name>.
- Post-mortem: internal-only at engineering review.
SEV4 — Minor
Anomaly without customer-visible impact. Any of the following:
- Internal alert fired. Leading indicator threshold crossed but SLO still met.
- Single-customer issue. A single customer's integration is misconfigured; no Matter-side fix required.
- Documentation or copy issue. A typo in an error message, a stale docs link, a missing example.
- Cosmetic dashboard issue. Internal dashboard rendering anomaly.
Response:
- Ticket in the standard backlog. No page.
- No status page entry.
- No comms beyond ticket resolution.
- No post-mortem unless a trend across multiple SEV4s suggests a pattern.
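The four response blocks can be read as one lookup table keyed by severity. The sketch below encodes the numbers stated above; the `ResponsePolicy` type, its field names, and the `who_to_page` helper are hypothetical and only illustrate how the matrix might be consumed by tooling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ResponsePolicy:
    """Hypothetical encoding of one severity's response block (minutes; None = not applicable)."""
    pages_primary: bool
    secondary_page_after_min: Optional[int]    # only if primary has not acknowledged
    pages_manager: bool
    pages_engineering_lead: bool
    status_page_within_min: Optional[int]      # customer-visible status entry
    customer_email_within_min: Optional[int]
    public_update_every_min: Optional[int]
    external_post_mortem: bool


POLICIES = {
    "SEV1": ResponsePolicy(True, 5, True, True, 15, 30, 30, True),
    "SEV2": ResponsePolicy(True, 10, True, False, 30, 60, 60, True),
    # SEV3 is ticket-driven; escalation paging is outside this sketch.
    "SEV3": ResponsePolicy(False, None, False, False, None, None, None, False),
    "SEV4": ResponsePolicy(False, None, False, False, None, None, None, False),
}


def who_to_page(severity: str, minutes_unacknowledged: int) -> List[str]:
    """Roles that should have been paged by now for an incident of this severity."""
    policy = POLICIES[severity]
    targets: List[str] = []
    if policy.pages_primary:
        targets.append("primary")
    if (policy.secondary_page_after_min is not None
            and minutes_unacknowledged >= policy.secondary_page_after_min):
        targets.append("secondary")
    if policy.pages_manager:
        targets.append("manager")
    if policy.pages_engineering_lead:
        targets.append("engineering_lead")
    return targets
```

For example, `who_to_page("SEV2", 10)` yields primary, secondary, and manager, while any SEV4 yields an empty list.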
Severity changes mid-incident
Severities are revised in real time as facts arrive. Common revisions:
- Up: what looked like SEV3 turns out to span multiple regions → SEV2 → SEV1 if confirmed full-outage.
- Down: a SEV2 narrows to one customer → SEV3.
A severity revision triggers the new severity's comms cadence retroactively (e.g., a SEV3 promoted to SEV2 after 45 minutes still owes a customer email within the next 15 minutes).
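The arithmetic in that example can be made explicit. The sketch below assumes each deadline is measured from the start of the incident, which is what the 45 + 15 = 60 minute example implies; the function name and table names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict

# Deadlines in minutes, taken from the SEV1 and SEV2 response blocks.
STATUS_PAGE_WITHIN_MIN = {"SEV1": 15, "SEV2": 30}
CUSTOMER_EMAIL_WITHIN_MIN = {"SEV1": 30, "SEV2": 60}


def time_left_after_revision(
    incident_start: datetime,
    revised_at: datetime,
    new_severity: str,
) -> Dict[str, timedelta]:
    """Time remaining on each comms deadline under the new severity.

    Assumption: the clock runs from incident_start, not from the revision.
    A negative value means the deadline has already passed.
    """
    elapsed = revised_at - incident_start
    remaining: Dict[str, timedelta] = {}
    for name, table in (
        ("status_page", STATUS_PAGE_WITHIN_MIN),
        ("customer_email", CUSTOMER_EMAIL_WITHIN_MIN),
    ):
        limit = table.get(new_severity)
        if limit is not None:  # SEV3/SEV4 carry no customer-facing deadlines
            remaining[name] = timedelta(minutes=limit) - elapsed
    return remaining
```

For a SEV3 promoted to SEV2 at the 45-minute mark, this gives 15 minutes left for the customer email, as in the example. Under the same reading the 30-minute status-page deadline would already be 15 minutes overdue, so the entry would be owed immediately.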
Why publish this externally
Two reasons. First, our enterprise customers want to know our internal classification matches what they see — no SEV2 quietly being treated as SEV3 to avoid the status-page hit. Second, the people most affected by an outage deserve to know the response intensity matches the impact.
Internal on-call has the same matrix bookmarked. It is the same document.
See also
- Incident communications template — exact wording for status page + customer email per severity.
- Customer SLA — what credits an outage earns and how it is measured.
- Matter API SLOs — internal performance budgets that feed the matrix.