Incident severity matrix

Last updated Jun 14, 2026

Every operational incident at Matter is classified into one of four severities. The severity drives:

Paging behaviour. Who wakes up; when secondary and manager are paged.
Public communications cadence. Status page entries; email to active users.
Internal communications cadence. Incident channel updates; engineering-lead awareness.
Post-mortem treatment. SEV1 and SEV2 post-mortems are published externally with a 30-day delay; SEV3 and SEV4 are internal-only.

The same matrix the customer reads here is the matrix on-call uses live.

The four severities

SEV1 — Critical

The highest severity. Any of the following:

Full API outage. /v1/* returns 5xx for ≥ 5 % of requests across all customers, all regions, for ≥ 5 consecutive minutes.
Irrecoverable data loss. Resource state is lost without a compensating audit trail, for example a successful write that subsequently disappears from reads.
Audit-chain integrity failure. The append-only AuditEntry chain has a verified broken link, OR the Sigstore Rekor witness rejects a daily anchor, OR a tamper-detection drill flags a real entry as forged.
Cryptographic primitive failure. A KMS rotation drill fails to recover; webhook signature verification rejects valid signatures globally; field-level decryption fails for an entire tenant.
Genesis ceremony failure. A new entity's incorporator-protocol chain cannot be verified.
Multi-region failure. Both US and EU regions degraded simultaneously.
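
The "≥ 5 % for ≥ 5 consecutive minutes" criterion is a sustained-threshold check. The sketch below shows one way to evaluate it, assuming the 5xx ratio for /v1/* is already available as one sample per minute. The function name, the input shape, and the sample values are illustrative assumptions, not Matter's alerting code.

```python
from typing import Sequence


def sustained_breach(error_rates: Sequence[float], threshold: float, min_minutes: int) -> bool:
    """Return True if at least `min_minutes` consecutive per-minute samples
    are at or above `threshold`."""
    run = 0  # length of the current streak of breaching minutes
    for rate in error_rates:
        run = run + 1 if rate >= threshold else 0
        if run >= min_minutes:
            return True
    return False


# Hypothetical per-minute 5xx ratios (made-up numbers, not real data).
per_minute_5xx = [0.01, 0.06, 0.07, 0.09, 0.05, 0.08, 0.02]
print(sustained_breach(per_minute_5xx, threshold=0.05, min_minutes=5))  # True
```

A single healthy minute resets the streak, which is what "consecutive" requires. The same check with different parameters would cover the other duration-based criteria on this page.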

Response:

Page primary on-call immediately.
Page secondary on-call at 5 minutes if unacknowledged.
Page manager on-call immediately on declaration.
Page engineering lead on declaration.
Status page incident declared within 15 minutes.
Customer email within 30 minutes (to active users in the affected scope).
Public comms cadence: every 30 minutes until resolution.
Internal comms: real-time in #incident-<yyyymmdd>-<short-name>.
Post-mortem completed within 14 days; published externally 30 days after resolution.
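
The channel pattern #incident-<yyyymmdd>-<short-name> can be produced mechanically. This is a minimal sketch. The helper name and the slug rule (lowercase, with runs of other characters collapsed to a hyphen) are assumptions, since this page does not specify how the short name is normalised.

```python
import re
from datetime import date


def incident_channel(declared_on: date, short_name: str) -> str:
    """Build a channel name of the form #incident-<yyyymmdd>-<short-name>."""
    # Assumed normalisation: lowercase, non-alphanumeric runs become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", short_name.lower()).strip("-")
    return f"#incident-{declared_on:%Y%m%d}-{slug}"


print(incident_channel(date(2026, 6, 14), "EU webhook backlog"))
# prints: #incident-20260614-eu-webhook-backlog
```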

SEV2 — High

Significant impact but not total. Any of the following:

Regional outage. US or EU region degraded for ≥ 10 consecutive minutes; the other region serving normally.
Major feature unavailable. Any composite atomic flow (formation_packet, close_package, mfn_cascade, dissolve) returns 5xx for ≥ 5 % of attempts for ≥ 10 minutes.
Compliance breach. PII leaked to an observability sink; redaction codegen drift causes a real customer's data to appear unredacted in Sentry / Logtail / traffic capture.
Webhook delivery degraded. Receiver-side webhook delivery p99 exceeds 2× the SLO ceiling for ≥ 30 minutes, OR ≥ 5 % of delivery attempts are dead-lettered.
CDC lag breached. Read-model lag p99 exceeds 5× SLO ceiling for ≥ 15 minutes.
Provider outage. A primary filing provider is unavailable AND the circuit-breaker has not yet routed to a backup.
Tenant isolation breach. A red-team test or customer report confirms cross-tenant data exposure — even one row.

Response:

Page primary on-call immediately.
Page secondary on-call at 10 minutes if unacknowledged.
Page manager on-call on declaration.
Status page incident declared within 30 minutes.
Customer email within 60 minutes (to active users in the affected scope).
Public comms cadence: every 60 minutes until resolution.
Internal comms: real-time in #incident-<yyyymmdd>-<short-name>.
Post-mortem completed within 21 days; published externally 30 days after resolution.
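
The SEV1 and SEV2 paging rules differ only in the secondary escalation delay and in whether the engineering lead is paged. The sketch below encodes them as data. The type and function names are hypothetical, and it assumes the escalation timer starts at declaration, which this page does not state explicitly.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PagingPolicy:
    secondary_after_min: int     # page secondary if still unacknowledged at this point
    page_engineering_lead: bool  # only SEV1 pages the engineering lead


POLICIES: Dict[str, PagingPolicy] = {
    "SEV1": PagingPolicy(secondary_after_min=5, page_engineering_lead=True),
    "SEV2": PagingPolicy(secondary_after_min=10, page_engineering_lead=False),
}


def pages_due(severity: str, minutes_since_declaration: float, acknowledged: bool) -> List[str]:
    """Return the roles that should have been paged by now."""
    policy = POLICIES[severity]
    due = ["primary", "manager"]  # both are paged on declaration
    if policy.page_engineering_lead:
        due.append("engineering_lead")
    if not acknowledged and minutes_since_declaration >= policy.secondary_after_min:
        due.append("secondary")
    return due


print(pages_due("SEV2", 7, acknowledged=False))  # ['primary', 'manager']
print(pages_due("SEV1", 7, acknowledged=False))  # ['primary', 'manager', 'engineering_lead', 'secondary']
```

Keeping the policy in a table rather than in branches means a change to the matrix is a one-line data change.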

SEV3 — Moderate

Degraded performance with no broad customer impact. Any of the following:

SLO breach without symptom. p99 latency on a single operation class exceeds budget for ≥ 30 minutes, but the error rate stays within budget.
Single endpoint failing. One operation returns 5xx for ≥ 1 % of requests for ≥ 15 minutes; the rest of the API is healthy.
Provider degraded but routing. A primary provider is slow but the circuit-breaker hasn't opened.
Cache miss-storm. A cache layer's hit ratio drops below 50 % of target for ≥ 30 minutes.
Webhook queue backpressure. A single endpoint's queue is in slowdown but not disabled.
Anomaly detection fired. A token-usage anomaly triggered but is not yet confirmed malicious.

Response:

Ticket created in on-call queue. No page outside business hours unless escalated.
Status page: internal-only entry; no customer-visible update.
Internal comms cadence: hourly in #incident-<yyyymmdd>-<short-name>.
Post-mortem: internal-only at engineering review.

SEV4 — Minor

Anomaly without customer-visible impact. Any of the following:

Internal alert fired. Leading indicator threshold crossed but SLO still met.
Single-customer issue. A single customer's integration is misconfigured; no Matter-side fix required.
Documentation or copy issue. A typo in an error message, a stale docs link, a missing example.
Cosmetic dashboard issue. Internal dashboard rendering anomaly.

Response:

Ticket in the standard backlog. No page.
No status page entry.
No comms beyond ticket resolution.
No post-mortem unless a trend across multiple SEV4s suggests a pattern.

Severity changes mid-incident

Severities are revised in real time as facts arrive. Common revisions:

Up: what looked like a SEV3 turns out to span multiple regions → SEV2 → SEV1 if a full outage is confirmed.
Down: a SEV2 narrows to one customer → SEV3.

A severity revision triggers the new severity's comms cadence retroactively (e.g., a SEV3 promoted to SEV2 after 45 minutes still owes a customer email within the next 15 minutes).
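
The retroactive rule is easiest to see as arithmetic. The sketch below assumes each deadline is measured from the start of the incident rather than from the moment of revision, which is what the 45-minute example implies. The function and table names are hypothetical, and the minute values are copied from the SEV1 and SEV2 Response lists.

```python
from datetime import datetime, timedelta
from typing import Dict

# Minutes from incident start, per the SEV1 and SEV2 Response lists.
DEADLINES_MIN: Dict[str, Dict[str, int]] = {
    "SEV1": {"status_page": 15, "customer_email": 30},
    "SEV2": {"status_page": 30, "customer_email": 60},
}


def time_left(started_at: datetime, revised_at: datetime, new_severity: str) -> Dict[str, timedelta]:
    """Time remaining for each obligation of `new_severity`.

    A zero duration means the obligation is already due.
    """
    elapsed = revised_at - started_at
    return {
        name: max(timedelta(minutes=minutes) - elapsed, timedelta(0))
        for name, minutes in DEADLINES_MIN.get(new_severity, {}).items()
    }


start = datetime(2026, 6, 14, 12, 0)
left = time_left(start, start + timedelta(minutes=45), "SEV2")
print(left["customer_email"])  # 0:15:00
print(left["status_page"])     # 0:00:00
```

Under this reading, the status page entry for the promoted incident is due immediately, because its 30-minute window closed before the revision.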

Why publish this externally

Two reasons. First, our enterprise customers want to know that our internal classification matches what they see: no SEV2 quietly treated as a SEV3 to avoid the status-page hit. Second, the people most affected by an outage deserve to know that the response intensity matches the impact.

Internal on-call has the same matrix bookmarked. It is the same document.
