Audit chain recovery
SEV1 runbook for a detected audit-chain break.
Audit chain recovery (SEV1)
Triggered by: data-integrity-check cron (P0.G19) reports a broken
link in the per-(org, mode) audit chain, OR a customer's verification
endpoint returns "chain inconsistent."
On-call: Security (primary), Platform (secondary). Pager: SEV1, page within 60 seconds. Estimated MTTR: 30-90 minutes for a synthetic anomaly, 4+ hours for real corruption.
Stop-the-bleed (5 minutes)
- Freeze writes to the affected (org, mode):
matter ops freeze-org --org-id <org_id> --mode <mode> --reason "audit-chain-break investigation"
- Pin a snapshot of the AuditEntry table for forensics:
matter ops snapshot-audit --org-id <org_id> --mode <mode> --target /forensics/<incident_id>/
- Cross-check Rekor anchor (P0.C4): every AuditEntry row from the last 24 hours should have a corresponding Rekor transparency log entry. Compare digests.
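The digest comparison in the Rekor cross-check can be sketched as a pure function. This is a minimal illustration, not part of the matter tooling: the AuditRow shape, its digest field, and the assumption that Rekor digests have already been fetched into a map keyed by AuditEntry id are all hypothetical.

```typescript
// Hypothetical row shape; the real AuditEntry schema may differ.
interface AuditRow {
  id: string;
  digest: string; // hex digest of the row as stored in the forensics snapshot
}

interface AnchorReport {
  missing: string[]; // row ids with no Rekor entry at all
  mismatched: string[]; // row ids whose Rekor digest differs from the row digest
}

// rekorDigests maps AuditEntry id -> digest recorded in the transparency log.
// It is assumed to have been fetched beforehand; no Rekor API is called here.
function crossCheckAnchors(
  rows: AuditRow[],
  rekorDigests: Map<string, string>,
): AnchorReport {
  const report: AnchorReport = { missing: [], mismatched: [] };
  for (const row of rows) {
    const anchored = rekorDigests.get(row.id);
    if (anchored === undefined) {
      report.missing.push(row.id);
    } else if (anchored.toLowerCase() !== row.digest.toLowerCase()) {
      // Hex digests are compared case-insensitively.
      report.mismatched.push(row.id);
    }
  }
  return report;
}
```

An empty report on both lists means every row in the window has a matching anchor; any id in either list is a candidate for the diagnosis below.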
Diagnose
Three failure modes:
Mode A: Stale read
The CDC consumer lagged; the verification endpoint read pre-write. Mitigation: bounce the CDC consumer; rerun the check. No data recovery needed.
Mode B: Genuine break (extremely rare)
A row was inserted out of band. Possible causes:
- Manual SQL outside packages/database/src/append-only.ts.
- Replica corruption.
- Storage-tier migration (P11.13) didn't preserve the chain.
Action: locate the bad row by comparing each row's prevHash with the hash of the row before it. Work from the forensics snapshot; DO NOT mutate live data yet.
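A minimal sketch of that adjacent-row comparison, assuming the snapshot has been exported into rows with sequence, prevHash, and hash fields. The ChainRow shape and the findBreakPoint name are hypothetical, not the runbook's tooling.

```typescript
// Hypothetical shape of a row exported from the forensics snapshot.
interface ChainRow {
  sequence: number;
  prevHash: string | null; // null only for the first row of a chain
  hash: string;
}

// Returns the sequence number of the first row whose prevHash does not
// match the hash of the row before it, or null if every link holds.
function findBreakPoint(rows: ChainRow[]): number | null {
  // Sort a copy so the caller's array is left untouched.
  const ordered = [...rows].sort((a, b) => a.sequence - b.sequence);
  for (let i = 1; i < ordered.length; i++) {
    if (ordered[i].prevHash !== ordered[i - 1].hash) {
      return ordered[i].sequence;
    }
  }
  return null;
}
```

This checks linkage only. It does not recompute each row's hash from its contents, so a row whose payload changed while its stored hash was left intact would pass this check.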
Mode C: Adversarial tampering
The chain was rewritten by someone with write access.
Action: immediately rotate the audit chain pepper (P0.C5), invalidate sessions, and page Security leadership. This is a breach-level event.
Recover
For Modes A + B, recover from the canonical event log
(P0.E5 event-sourcing-as-canonical):
matter ops rebuild-audit-chain \
--org-id <org_id> \
--mode <mode> \
--from-sequence <break_point> \
--verify-against-rekor
This walks the event log forward from the break point, recomputes hashes, and writes a new chain segment to the WORM bucket.
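The forward walk can be illustrated with a short sketch. It is not the rebuild-audit-chain implementation: the event shape, the link rule (SHA-256 over the previous hash and the payload), and the function names are assumptions, and the real scheme, including how the pepper enters the hash, is not specified in this runbook.

```typescript
import { createHash } from "node:crypto";

// Hypothetical event and chain-entry shapes.
interface LoggedEvent {
  sequence: number;
  payload: string; // canonical serialized form of the event
}

interface ChainEntry {
  sequence: number;
  prevHash: string;
  hash: string;
}

// Assumed link rule: hash = SHA-256(prevHash + "\n" + payload), hex-encoded.
function linkHash(prevHash: string, payload: string): string {
  return createHash("sha256")
    .update(prevHash)
    .update("\n")
    .update(payload)
    .digest("hex");
}

// Walks events forward from breakPoint, chaining from the last trusted hash,
// and returns the recomputed segment. Nothing is written anywhere.
function rebuildSegment(
  events: LoggedEvent[],
  breakPoint: number,
  lastGoodHash: string,
): ChainEntry[] {
  const pending = events
    .filter((e) => e.sequence >= breakPoint)
    .sort((a, b) => a.sequence - b.sequence);
  const segment: ChainEntry[] = [];
  let prevHash = lastGoodHash;
  for (const event of pending) {
    const hash = linkHash(prevHash, event.payload);
    segment.push({ sequence: event.sequence, prevHash, hash });
    prevHash = hash;
  }
  return segment;
}
```

Because each hash depends on the one before it, the walk has to start from a hash that is still trusted, which is why the command takes an explicit break point.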
For Mode C, restore from the most recent verified Rekor-anchored
state + replay events. The replayFromSnapshot path in
apps/api/lib/cqrs-read-model.ts is the canonical implementation.
Validate
Run the canonical post-recovery suite:
matter ops verify-audit-chain --org-id <org_id> --mode <mode> --since 24h
Expected outputs:
- Every AuditEntry verifies against the recomputed prevHash.
- Every entry has a Rekor anchor within SLA window.
- The data-integrity-check cron's next run reports clean.
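The anchor-within-SLA expectation can be checked with a small helper like the one below. It is a sketch under stated assumptions: the AnchoredEntry shape is hypothetical, and the SLA length is passed in as a parameter because this runbook does not state it.

```typescript
// Hypothetical view of an AuditEntry joined with its Rekor anchor time.
interface AnchoredEntry {
  id: string;
  createdAt: Date;
  anchoredAt: Date | null; // null when no Rekor anchor exists yet
}

// Returns the ids of entries that have no anchor, or whose anchor landed
// more than slaMs milliseconds after the entry was created.
function anchorsOutsideSla(entries: AnchoredEntry[], slaMs: number): string[] {
  return entries
    .filter(
      (e) =>
        e.anchoredAt === null ||
        e.anchoredAt.getTime() - e.createdAt.getTime() > slaMs,
    )
    .map((e) => e.id);
}
```

An empty result corresponds to the second expected output above; any returned id needs its anchor investigated before the incident is closed.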
Communicate
Per severity-matrix.mdx, SEV1 requires:
- Status-page update within 15 minutes of declaration.
- Customer email to the affected org within 1 hour.
- Internal #incidents channel updated every 30 minutes.
- Postmortem within 5 days, externally published within 35 days.
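To make the clock arithmetic concrete, the deadlines above can be derived from the declaration time. This sketch assumes every deadline is counted from the moment of declaration and that days are 24-hour periods; the runbook states the reference point explicitly only for the status-page update, so check severity-matrix.mdx for the others. The names are hypothetical.

```typescript
const MINUTE_MS = 60 * 1000;
const HOUR_MS = 60 * MINUTE_MS;
const DAY_MS = 24 * HOUR_MS;

interface Sev1Deadlines {
  statusPage: Date; // 15 minutes
  firstIncidentsUpdate: Date; // 30 minutes, then every 30 minutes after
  customerEmail: Date; // 1 hour
  postmortem: Date; // 5 days
  externalPostmortem: Date; // 35 days
}

// Assumes all offsets are measured from declaredAt.
function sev1Deadlines(declaredAt: Date): Sev1Deadlines {
  const at = (offsetMs: number): Date =>
    new Date(declaredAt.getTime() + offsetMs);
  return {
    statusPage: at(15 * MINUTE_MS),
    firstIncidentsUpdate: at(30 * MINUTE_MS),
    customerEmail: at(1 * HOUR_MS),
    postmortem: at(5 * DAY_MS),
    externalPostmortem: at(35 * DAY_MS),
  };
}
```

For example, sev1Deadlines(new Date("2024-01-01T00:00:00Z")).statusPage is 2024-01-01T00:15:00Z.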
Post-recovery action items
Standard template:
- Identify which control allowed the break.
- Add a regression test to apps/api/__tests__/audit.test.ts.
- Run the chaos drill on this scenario within 2 weeks to verify the regression doesn't recur.