Last updated Jun 14, 2026

Region failover (SEV1)

Triggered by: primary region health probe failed for ≥ 90 seconds, OR planned maintenance, OR observed customer impact in primary region above the failover threshold.
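
The trigger is a disjunction of three independent conditions. A minimal sketch of that decision, assuming a hypothetical `FailoverSignals` record; the field names and the units of the impact metric are illustrative and not taken from the platform's code:

```python
from dataclasses import dataclass

PROBE_FAILURE_WINDOW_S = 90  # from the runbook: probe failed for >= 90 seconds


@dataclass
class FailoverSignals:
    """Hypothetical snapshot of the inputs to the failover decision."""
    probe_failed_for_s: float   # how long the primary health probe has been failing
    planned_maintenance: bool   # a maintenance window has been declared
    customer_impact: float      # observed impact metric (units are an assumption)
    impact_threshold: float     # the failover threshold for that metric


def should_failover(s: FailoverSignals) -> bool:
    """Return True if any one of the three runbook triggers holds."""
    return (
        s.probe_failed_for_s >= PROBE_FAILURE_WINDOW_S
        or s.planned_maintenance
        or s.customer_impact > s.impact_threshold
    )
```

Impact exactly at the threshold does not trigger here, because the runbook says "above the failover threshold".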

On-call: Platform (primary), Security (secondary). Estimated MTTR: 15-30 minutes for the cutover; 1-4 hours for catch-up + customer comms.

Stop-the-bleed

Verify primary really is down: check external probes (us-east, eu-central, apac).
Verify secondary really is healthy: check the SAME probes against the secondary region.
Confirm CDC lag on secondary is < SLA window (1s p99 per cdc-lag-recovery.mdx).
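
The three checks above can be combined into a single go/no-go gate. The sketch below is illustrative only: `preflight_ok` and its inputs are hypothetical, and it assumes that "really is down" means every vantage point agrees. The runbook does not state whether a majority would be enough.

```python
PROBE_REGIONS = ("us-east", "eu-central", "apac")  # vantage points named in the runbook
CDC_LAG_SLA_S = 1.0  # 1s p99, per cdc-lag-recovery.mdx


def preflight_ok(primary_probes, secondary_probes, cdc_lag_p99_s):
    """Return (ok, reasons).

    Each probes dict maps a vantage point to True (green) or False (failing).
    A missing entry is treated as unknown and blocks the failover.
    """
    reasons = []
    for vp in PROBE_REGIONS:
        # Assumption: the primary must be failing from every vantage point.
        if primary_probes.get(vp) is not False:
            reasons.append(f"primary not confirmed down from {vp}")
        if secondary_probes.get(vp) is not True:
            reasons.append(f"secondary not confirmed healthy from {vp}")
    # The runbook requires lag strictly below the SLA window.
    if not (cdc_lag_p99_s < CDC_LAG_SLA_S):
        reasons.append(f"CDC lag p99 {cdc_lag_p99_s}s is not below {CDC_LAG_SLA_S}s")
    return (not reasons, reasons)
```

Returning the list of reasons rather than a bare boolean makes it easier to paste the outcome into the incident channel.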

Execute cutover

Step 1 — Stop writes to primary

matter ops region-failover-prepare --primary <region> --reason "<incident_id>"

This configures the WAF to return 503 for all mutations against the primary region. Reads continue to be served from the primary cache (slightly stale, by less than the CDC SLA).

Step 2 — Drain in-flight writes

Wait for the in-flight-request counter in the primary region to reach zero. This typically takes 30-60 seconds.

matter ops region-watch-drain --region <region> --max-wait 5m
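
The drain step amounts to polling a counter against a deadline. The following sketch shows that pattern; it is not the implementation of `region-watch-drain`. `read_in_flight` is a hypothetical callable returning the current in-flight count, and the 2-second polling interval is an assumption.

```python
import time
from typing import Callable


def wait_for_drain(
    read_in_flight: Callable[[], int],
    max_wait_s: float = 300.0,     # mirrors --max-wait 5m
    poll_interval_s: float = 2.0,  # assumed polling cadence
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the in-flight counter reads zero or the deadline passes.

    Returns True if the region drained, False on timeout.
    """
    deadline = clock() + max_wait_s
    while True:
        if read_in_flight() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

The `clock` and `sleep` parameters exist so the loop can be exercised in tests without real waiting. A timeout should be treated as a reason to stop and investigate, not to proceed to promotion.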

Step 3 — Promote secondary

matter ops region-promote --new-primary <region> --confirm

This:

Switches the DNS pointer (low-TTL record).
Promotes the Postgres replica to primary (Neon endpoint flip).
Re-anchors Rekor witness in the new region.
Updates the per-region KMS pointer.

Step 4 — Open writes on new primary

matter ops region-open-writes --primary <region>

WAF resumes accepting mutations. From this point, the new primary is canonical.
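
The four steps must run strictly in order, and a failed step should halt the sequence. The sketch below wraps the commands shown above under several assumptions: that steps 1 and 2 take the old primary while steps 3 and 4 take the new one (the runbook uses the same `<region>` placeholder throughout), that the CLI exits non-zero on failure, and that the region names in the usage example are placeholders.

```python
import subprocess
from typing import Callable, List, Optional


def cutover_commands(old_primary: str, new_primary: str, incident_id: str) -> List[List[str]]:
    """Build the four cutover commands in runbook order."""
    return [
        ["matter", "ops", "region-failover-prepare", "--primary", old_primary, "--reason", incident_id],
        ["matter", "ops", "region-watch-drain", "--region", old_primary, "--max-wait", "5m"],
        ["matter", "ops", "region-promote", "--new-primary", new_primary, "--confirm"],
        ["matter", "ops", "region-open-writes", "--primary", new_primary],
    ]


def run_cutover(
    commands: List[List[str]],
    run: Optional[Callable[[List[str]], int]] = None,
) -> int:
    """Run commands in order and stop at the first failure.

    Returns the number of steps that completed successfully.
    """
    if run is None:
        # Assumption: a non-zero exit code signals failure.
        run = lambda argv: subprocess.run(argv, check=False).returncode
    completed = 0
    for argv in commands:
        if run(argv) != 0:
            break
        completed += 1
    return completed
```

For example, `run_cutover(cutover_commands("us-east", "eu-central", "INC-123"))` returns 4 only if every step succeeded. Any smaller value identifies the step at which a human needs to take over.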

Validate

matter ops verify-region-state --primary <region>

Expected:

New primary accepting both reads + writes.
Old primary still reachable read-only OR fully offline (depending on cause).
CDC catch-up on the new primary's old replica: ≤ 30s.
Audit chain extended with region_failover row.
Probes from all three regions green.
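
The expected state can be expressed as a checklist that names every unmet expectation. This is a sketch only: the output format of `verify-region-state` is not documented here, so the `state` dictionary and its keys are hypothetical.

```python
from typing import List

CDC_CATCHUP_LIMIT_S = 30.0  # from the runbook: catch-up <= 30s
PROBE_REGIONS = ("us-east", "eu-central", "apac")


def validation_failures(state: dict) -> List[str]:
    """Return a list of unmet expectations; an empty list means validation passed."""
    failures = []
    if not (state.get("new_primary_reads") and state.get("new_primary_writes")):
        failures.append("new primary is not accepting both reads and writes")
    if state.get("old_primary_mode") not in ("read-only", "offline"):
        failures.append("old primary is neither read-only nor offline")
    lag = state.get("cdc_catchup_s")
    if lag is None or lag > CDC_CATCHUP_LIMIT_S:
        failures.append("CDC catch-up is missing or above 30s")
    if not state.get("audit_has_region_failover_row"):
        failures.append("audit chain has no region_failover row")
    probes = state.get("probes", {})
    if not all(probes.get(r) is True for r in PROBE_REGIONS):
        failures.append("not all external probes are green")
    return failures
```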

Communicate

SEV1 protocol:

Status page updated within 15 minutes of declaration, stating that a region failover is in progress.
Customer email sent at completion. Region-tagged customers (e.g. EU-pinned data residency) emailed first.
Internal #incidents updated every 10 minutes during cutover.

Reverse failover (later)

Once the original primary is healthy again, schedule a controlled reverse failover during off-peak hours. Use the same procedure with the regions swapped. Do not fail back automatically during an incident; let the new primary stabilise first.

TLA+ invariants verified

The split-brain protection here is one of:

The DNS TTL is short enough (60s) that lingering writes against the old primary are rare.
The old primary's connection pool is forcibly closed (per apps/api/lib/replication-slot-monitor.ts).
Postgres logical replication rejects writes to the demoted replica.

If split-brain is detected (two regions accepting writes for the same org), follow the dedicated split-brain.mdx runbook.
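
The detection condition (two regions accepting writes for the same org) can be checked by grouping accepted writes by org. A minimal sketch, assuming a hypothetical feed of `(org_id, region)` pairs for writes accepted after the new primary opened for writes; the source of that feed is not specified in this runbook:

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def find_split_brain(accepted_writes: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map org_id to its regions, for orgs whose writes landed in two or more regions."""
    regions_by_org: Dict[str, Set[str]] = defaultdict(set)
    for org_id, region in accepted_writes:
        regions_by_org[org_id].add(region)
    return {org: regions for org, regions in regions_by_org.items() if len(regions) > 1}
```

A non-empty result is the signal to switch to split-brain.mdx. The sketch only detects the condition and makes no attempt to reconcile the conflicting writes.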

