Split brain
SEV1 runbook for two regions accepting writes for the same org.
Split brain (SEV1)
Triggered by: dual-region write detection. After a region failover, either the old primary remains writable or the new primary's promotion did not fully propagate, so both regions accept conflicting writes for the same org.
On-call: Platform (primary), Security (secondary). Pager: SEV1. Estimated MTTR: 1-4 hours (data reconciliation is the hard part).
Stop-the-bleed (immediately)
- Freeze writes on both regions for the affected org:
matter ops freeze-org --org-id <id> --reason "split brain investigation"
- Page the Postgres operator: replication state may have diverged.
- Snapshot both regions' state for forensics:
matter ops snapshot-region-pair --org-id <id> --target /forensics/<incident>/
Diagnose
Compare the two regions' state for the affected org:
matter ops region-diff --org-id <id> --regions us-east,eu-central
For each row that differs:
- Identify which region's write is "winner" by timestamp + audit chain integrity.
- A Rekor anchor (P0.C4) does not settle the winner on its own: both regions may have anchored their audit chains, so the loser's anchor can exist too.
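The per-row decision above can be sketched in Python. This is a minimal illustration, not the logic behind `matter ops region-diff`: the `AuditEntry` and `RegionWrite` shapes, the SHA-256 hash chaining, and the "later commit wins" tie-break are all assumptions made for the example. The runbook only says the winner is chosen by timestamp plus audit chain integrity.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AuditEntry:
    payload: str     # serialized write (hypothetical format)
    prev_hash: str   # hash of the previous entry, "" for the first entry
    entry_hash: str  # expected to equal sha256(prev_hash + payload)


@dataclass
class RegionWrite:
    region: str
    commit_ts: float         # commit timestamp, seconds since epoch
    chain: List[AuditEntry]  # audit chain covering this write


def entry_digest(prev_hash: str, payload: str) -> str:
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def build_chain(payloads: List[str]) -> List[AuditEntry]:
    """Helper for the example: build a well-formed chain from payloads."""
    chain, prev = [], ""
    for p in payloads:
        h = entry_digest(prev, p)
        chain.append(AuditEntry(p, prev, h))
        prev = h
    return chain


def chain_is_intact(chain: List[AuditEntry]) -> bool:
    """Every entry must link to its predecessor and match its own digest."""
    prev = ""
    for e in chain:
        if e.prev_hash != prev or e.entry_hash != entry_digest(prev, e.payload):
            return False
        prev = e.entry_hash
    return True


def pick_winner(a: RegionWrite, b: RegionWrite) -> Optional[RegionWrite]:
    """Return the winning write, or None when a human has to decide."""
    a_ok, b_ok = chain_is_intact(a.chain), chain_is_intact(b.chain)
    if a_ok != b_ok:
        return a if a_ok else b  # only one side has an intact chain
    if not a_ok:
        return None              # both chains broken: escalate
    if a.commit_ts == b.commit_ts:
        return None              # identical timestamps: escalate
    # ASSUMPTION: the later commit wins; the runbook does not state the rule.
    return a if a.commit_ts > b.commit_ts else b
```

Returning `None` rather than guessing keeps ambiguous rows out of any automated path, which matches the manual-resolution approach used later in recovery.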
Recover
This is the hardest recovery in the runbook library. Sequence:
- Designate canonical region. Typically the original primary if it's still healthy; otherwise the failover target.
- Quarantine the divergent region's writes in the quarantined_writes table. Don't delete: these are real customer actions.
- Replay quarantined writes through the canonical region's event log, in original commit-time order. Conflicts go to a manual-resolution queue.
- Customer-side notification for any write whose result materially changed during reconciliation.
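The replay step in the sequence above can be sketched as follows. This is a hedged illustration, not what `matter ops split-brain-reconcile` actually does: the `QuarantinedWrite` fields, the in-memory `canonical` dict, and the version-based conflict check are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class QuarantinedWrite:
    key: str           # row identifier (hypothetical)
    value: str         # new row contents
    base_version: int  # row version the write was originally made against
    commit_ts: float   # original commit time in the divergent region


def replay(
    canonical: Dict[str, Tuple[int, str]],
    quarantined: List[QuarantinedWrite],
) -> List[QuarantinedWrite]:
    """Apply quarantined writes to `canonical` in original commit-time order.

    `canonical` maps key -> (version, value) and is mutated in place.
    Returns the writes that conflicted and need manual resolution.
    """
    manual_queue: List[QuarantinedWrite] = []
    for w in sorted(quarantined, key=lambda w: w.commit_ts):
        current_version, _ = canonical.get(w.key, (0, ""))
        # ASSUMPTION: a write conflicts when the canonical row has moved past
        # the version the write was based on.
        if w.base_version == current_version:
            canonical[w.key] = (current_version + 1, w.value)
        else:
            manual_queue.append(w)  # never dropped, only deferred
    return manual_queue


canonical = {"doc-1": (2, "canonical edit")}
quarantined = [
    QuarantinedWrite("doc-2", "second", 1, 20.0),
    QuarantinedWrite("doc-2", "first", 0, 10.0),
    QuarantinedWrite("doc-1", "stale", 1, 15.0),
]
conflicts = replay(canonical, quarantined)
```

In this run the two `doc-2` writes apply cleanly in commit-time order, while the `doc-1` write lands in `conflicts` because the canonical row had already advanced to version 2.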
matter ops split-brain-reconcile --org-id <id> --canonical-region <region>
Validate
matter ops verify-region-sync --org-id <id>
Expected:
- Both regions converged on identical state.
- Audit chain has split-brain markers for the affected window.
- No quarantined writes remain unresolved.
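For intuition about the first expectation, one way to check that two regions hold identical state is to compare order-independent digests of their rows. The sketch below is an assumption-laden illustration, not the implementation of `matter ops verify-region-sync`; the flat key-to-value row model and the function names are hypothetical.

```python
import hashlib
from typing import Dict


def state_digest(rows: Dict[str, str]) -> str:
    """Digest of a region's rows that ignores insertion order."""
    h = hashlib.sha256()
    for key in sorted(rows):
        # Length-prefix each field so ("ab", "c") and ("a", "bc") differ.
        for field in (key, rows[key]):
            data = field.encode("utf-8")
            h.update(len(data).to_bytes(8, "big"))
            h.update(data)
    return h.hexdigest()


def regions_converged(a: Dict[str, str], b: Dict[str, str]) -> bool:
    return state_digest(a) == state_digest(b)


us_east = {"doc-1": "v2", "doc-2": "v5"}
eu_central = {"doc-2": "v5", "doc-1": "v2"}  # same rows, different order
converged = regions_converged(us_east, eu_central)
```

Comparing digests instead of full rows keeps the cross-region check cheap. A mismatch only says that the regions differ, so a row-level diff is still needed to find where.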
Communicate
- Customer email within 1 hour (SEV1).
- Status page update.
- Per-region regulator notification if data residency was breached.
- Postmortem within 5 days, externally published.
Post-recovery action items
- Improve split-brain detection latency.
- Property test the failover procedure.
- Verify Postgres replication-slot guard is tight.