Region failover
SEV1 runbook for promoting a secondary region to primary.
Region failover (SEV1)
Triggered by: primary region health probe failed for ≥ 90 seconds, OR planned maintenance, OR observed customer impact in primary region above the failover threshold.
On-call: Platform (primary), Security (secondary). Estimated MTTR: 15-30 minutes for the cutover; 1-4 hours for catch-up + customer comms.
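The three trigger conditions can be read as a simple OR over observed signals. The sketch below is illustrative only: the `FailoverSignals` shape and its field names are hypothetical, and the customer-impact threshold is passed in as a parameter because this runbook does not state its value.

```typescript
// Hypothetical input shape; field names are illustrative, not taken from the platform.
interface FailoverSignals {
  probeFailingForSeconds: number; // how long the primary health probe has been failing
  plannedMaintenance: boolean;    // a planned maintenance window is in effect
  customerImpactRatio: number;    // observed customer impact in the primary region
}

// From the trigger condition above: probe failed for >= 90 seconds.
const PROBE_FAILURE_THRESHOLD_SECONDS = 90;

// Returns every trigger condition that currently holds; an empty list means "do not fail over".
// impactThreshold is an assumed parameter (the runbook does not give the number).
function failoverTriggers(signals: FailoverSignals, impactThreshold: number): string[] {
  const reasons: string[] = [];
  if (signals.probeFailingForSeconds >= PROBE_FAILURE_THRESHOLD_SECONDS) {
    reasons.push("probe_failure");
  }
  if (signals.plannedMaintenance) {
    reasons.push("planned_maintenance");
  }
  if (signals.customerImpactRatio > impactThreshold) {
    reasons.push("customer_impact");
  }
  return reasons;
}
```

Returning the list of reasons rather than a boolean makes it easy to paste the trigger into the incident record.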
Stop-the-bleed
- Verify primary really is down: check external probes (us-east, eu-central, apac).
- Verify secondary really is healthy: check the SAME probes against the secondary region.
- Confirm CDC lag on secondary is < SLA window (1s p99 per cdc-lag-recovery.mdx).
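The three checks above can be expressed as one go/no-go function. This is a minimal sketch for the unplanned case, assuming the primary counts as down only when all three external probes fail. The type and function names are hypothetical; only the probe origins and the 1s p99 lag limit come from this runbook.

```typescript
type ProbeOrigin = "us-east" | "eu-central" | "apac";
type ProbeResults = Record<ProbeOrigin, boolean>; // true = probe succeeded

const PROBE_ORIGINS: ProbeOrigin[] = ["us-east", "eu-central", "apac"];
const CDC_LAG_SLA_MS = 1000; // 1s p99, per the checklist above

// Returns the reasons NOT to proceed with the cutover; an empty list means go.
function preflightBlockers(
  primary: ProbeResults,
  secondary: ProbeResults,
  secondaryCdcLagP99Ms: number
): string[] {
  const blockers: string[] = [];
  for (const origin of PROBE_ORIGINS) {
    // Assumption: any successful probe means the primary is not really down.
    if (primary[origin]) {
      blockers.push("primary still answers probe from " + origin);
    }
    // The SAME probes must pass against the secondary.
    if (!secondary[origin]) {
      blockers.push("secondary fails probe from " + origin);
    }
  }
  // Lag must be strictly below the SLA window; a missing (NaN) reading also blocks.
  if (!(secondaryCdcLagP99Ms < CDC_LAG_SLA_MS)) {
    blockers.push("secondary CDC lag p99 is not below " + CDC_LAG_SLA_MS + "ms");
  }
  return blockers;
}
```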
Execute cutover
Step 1 — Stop writes to primary
matter ops region-failover-prepare --primary <region> --reason "<incident_id>"
This configures the WAF to return 503 for all mutations against the primary region. Reads continue from the primary cache (slightly stale, < CDC SLA).
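The effect of Step 1 on incoming traffic can be pictured as a per-request decision. The sketch below is not the WAF's actual rule set. It assumes that "mutation" means the HTTP methods POST, PUT, PATCH and DELETE, and `wafStatus` is a hypothetical name.

```typescript
// Assumption: these methods are what the WAF treats as mutations.
const MUTATING_METHODS: string[] = ["POST", "PUT", "PATCH", "DELETE"];

// Returns the status code the WAF answers with, or null to let the request through.
function wafStatus(method: string, writesBlocked: boolean): number | null {
  const isMutation = MUTATING_METHODS.indexOf(method.toUpperCase()) !== -1;
  if (writesBlocked && isMutation) {
    return 503; // writes are refused while the primary is being retired
  }
  return null; // reads (and all traffic when writes are open) pass through
}
```

With `writesBlocked` set, `wafStatus("POST", true)` yields 503 while `wafStatus("GET", true)` yields null, matching the "writes stopped, reads continue" behaviour described above.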
Step 2 — Drain in-flight writes
Wait for the in-flight-request counter to hit zero in the primary region. Typically 30-60 seconds.
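The wait in this step amounts to a bounded poll: stop as soon as the counter reads zero, give up once the maximum wait has passed. The sketch below evaluates a series of counter samples against that rule. It is an illustration of the decision logic only, not the implementation of the drain command, and `DrainSample` and `evaluateDrain` are hypothetical names.

```typescript
interface DrainSample {
  elapsedMs: number; // time since the drain started
  inFlight: number;  // in-flight request counter in the primary region
}

type DrainStatus = "drained" | "timed_out" | "pending";

interface DrainResult {
  status: DrainStatus;
  drainedAtMs: number | null; // set only when status is "drained"
}

// Samples must be ordered by elapsedMs. maxWaitMs mirrors the 5m limit used in this step.
function evaluateDrain(samples: DrainSample[], maxWaitMs: number): DrainResult {
  for (const sample of samples) {
    if (sample.elapsedMs > maxWaitMs) {
      // The deadline passed before the counter reached zero.
      return { status: "timed_out", drainedAtMs: null };
    }
    if (sample.inFlight === 0) {
      return { status: "drained", drainedAtMs: sample.elapsedMs };
    }
  }
  // No zero reading yet and the deadline has not passed: keep polling.
  return { status: "pending", drainedAtMs: null };
}
```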
matter ops region-watch-drain --region <region> --max-wait 5m
Step 3 — Promote secondary
matter ops region-promote --new-primary <region> --confirm
This:
- Switches the DNS pointer (low-TTL record).
- Promotes the Postgres replica to primary (Neon endpoint flip).
- Re-anchors Rekor witness in the new region.
- Updates the per-region KMS pointer.
Step 4 — Open writes on new primary
matter ops region-open-writes --primary <region>
WAF resumes accepting mutations. From this point, the new primary is canonical.
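The four steps only make sense in order: opening writes before the promotion, or promoting before the drain completes, risks lost or conflicting writes. A minimal sketch of an ordering guard follows. The step labels and the `nextStep` function are hypothetical and not part of the CLI.

```typescript
type CutoverStep = "stop_writes" | "drain" | "promote" | "open_writes";

// Steps 1-4 of the cutover, in the order this runbook prescribes.
const CUTOVER_ORDER: CutoverStep[] = ["stop_writes", "drain", "promote", "open_writes"];

// Given the steps completed so far, returns the next step to run,
// or null when the cutover is finished. Throws if the history is out of order.
function nextStep(completed: CutoverStep[]): CutoverStep | null {
  for (let i = 0; i < completed.length; i++) {
    if (completed[i] !== CUTOVER_ORDER[i]) {
      throw new Error("step " + (i + 1) + " was '" + completed[i] + "', which is out of order");
    }
  }
  const next = CUTOVER_ORDER[completed.length];
  return next === undefined ? null : next;
}
```

An operator tool could call `nextStep` before each command and refuse to continue when it throws.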
Validate
matter ops verify-region-state --primary <region>
Expected:
- New primary accepting both reads + writes.
- Old primary still reachable read-only OR fully offline (depending on cause).
- CDC catch-up on the new primary's old replica: ≤ 30s.
- Audit chain extended with region_failover row.
- Probes from all three regions green.
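The expected state above can be written as a checklist function over a state snapshot. This is a sketch under assumptions: the `RegionState` shape is hypothetical and does not describe the real output of the verify command. Only the conditions themselves come from the list above.

```typescript
// Hypothetical snapshot shape; not the real output format of the verify command.
interface RegionState {
  newPrimaryReads: boolean;
  newPrimaryWrites: boolean;
  oldPrimaryMode: "read_only" | "offline" | "read_write";
  cdcCatchUpSeconds: number;
  auditChainHasFailoverRow: boolean;
  probesGreen: { "us-east": boolean; "eu-central": boolean; apac: boolean };
}

const MAX_CDC_CATCH_UP_SECONDS = 30; // from the expected state above

// Returns one message per expectation that is not met; empty means validation passed.
function validationFailures(state: RegionState): string[] {
  const failures: string[] = [];
  if (!state.newPrimaryReads || !state.newPrimaryWrites) {
    failures.push("new primary is not accepting both reads and writes");
  }
  if (state.oldPrimaryMode === "read_write") {
    failures.push("old primary is still writable");
  }
  if (!(state.cdcCatchUpSeconds <= MAX_CDC_CATCH_UP_SECONDS)) {
    failures.push("CDC catch-up exceeds " + MAX_CDC_CATCH_UP_SECONDS + "s");
  }
  if (!state.auditChainHasFailoverRow) {
    failures.push("audit chain has no region_failover row");
  }
  const p = state.probesGreen;
  if (!(p["us-east"] && p["eu-central"] && p.apac)) {
    failures.push("not all three probes are green");
  }
  return failures;
}
```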
Communicate
SEV1 protocol:
- Status page updated within 15 minutes of declaration with the region failover in progress.
- Customer email sent at completion. Region-tagged customers (e.g. EU-pinned data residency) emailed first.
- Internal #incidents updated every 10 minutes during cutover.
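The timings and ordering in the protocol can be computed up front so nobody has to do clock arithmetic mid-incident. The sketch below uses epoch milliseconds and assumes the first internal update is due 10 minutes after the cutover starts. All function names and the `Customer` shape are hypothetical.

```typescript
interface Customer {
  email: string;
  regionPinned: boolean; // e.g. EU-pinned data residency
}

const STATUS_PAGE_DEADLINE_MS = 15 * 60 * 1000;     // within 15 minutes of declaration
const INTERNAL_UPDATE_INTERVAL_MS = 10 * 60 * 1000; // every 10 minutes during cutover

// Latest time by which the status page must show the failover.
function statusPageDeadline(declaredAtMs: number): number {
  return declaredAtMs + STATUS_PAGE_DEADLINE_MS;
}

// Times at which #incidents updates are due between cutover start and end.
function internalUpdateTimes(cutoverStartMs: number, cutoverEndMs: number): number[] {
  const times: number[] = [];
  for (let t = cutoverStartMs + INTERNAL_UPDATE_INTERVAL_MS; t <= cutoverEndMs; t += INTERNAL_UPDATE_INTERVAL_MS) {
    times.push(t);
  }
  return times;
}

// Region-tagged customers first, everyone else after; original order kept within each group.
function emailOrder(customers: Customer[]): Customer[] {
  const pinned = customers.filter((c) => c.regionPinned);
  const rest = customers.filter((c) => !c.regionPinned);
  return pinned.concat(rest);
}
```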
Reverse failover (later)
Once the original primary is healthy again, schedule a controlled reverse failover during off-peak. Same procedure with regions swapped. Don't auto-fail-back during an incident — let the new primary stabilise.
TLA+ invariants verified
The split-brain protection here is one of:
- The DNS TTL is short enough (60s) that lingering writes against the old primary are rare.
- The old primary's connection-pool is forcibly closed (per apps/api/lib/replication-slot-monitor.ts).
- Postgres logical replication rejects writes to demoted replica.
If split-brain is detected (two regions accepting writes for the same org), follow the dedicated split-brain.mdx runbook.
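The detection condition, two regions accepting writes for the same org, can be checked over a batch of accepted writes. This is a minimal sketch assuming all events come from one observation window after the cutover. `WriteEvent` and `splitBrainOrgs` are hypothetical names, and this is not the logic of the referenced monitor module.

```typescript
interface WriteEvent {
  orgId: string;
  region: string; // region that accepted the write
}

// Returns the orgs whose writes were accepted by more than one region,
// in the order each org first appears in the input.
function splitBrainOrgs(events: WriteEvent[]): string[] {
  const seen: { orgId: string; regions: string[] }[] = [];
  for (const event of events) {
    const match = seen.filter((entry) => entry.orgId === event.orgId)[0];
    if (match === undefined) {
      seen.push({ orgId: event.orgId, regions: [event.region] });
    } else if (match.regions.indexOf(event.region) === -1) {
      match.regions.push(event.region);
    }
  }
  return seen.filter((entry) => entry.regions.length > 1).map((entry) => entry.orgId);
}
```

A non-empty result is the signal to stop and switch to the split-brain runbook.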