KMS rotation recovery
SEV1 runbook for a KMS rotation that fails mid-saga.
KMS rotation recovery (SEV1)
Triggered by: the rotation saga at packages/crypto/rotation.tla failing after step 3 (switch_kek_pointer). Pre-switch failures roll back automatically via the PreSwitchFailure branch, so they need no recovery.
On-call: Security (primary). Pager: SEV1. Estimated MTTR: 1-4 hours.
What failed where
The rotation saga has two post-switch steps:
- Step 4: wait_inflight_drain. Connections holding an old-KEK reference must finish.
- Step 5: retire_old_kek. Mark the old KEK inactive.
If step 4 hangs, in-flight references are not draining, for example because a zombie process is still holding the old KEK.
If step 5 fails after it has started running, the old KEK may be partially retired but still referenced.
Stop-the-bleed
- DO NOT switch the pointer back. The TLA+ invariant RetireAfterSwitch requires the old KEK to retire after the switch. Reverting is a no-op at best and an integrity break at worst.
- Verify cipher reads still succeed. Both the old and the new KEK should still be usable for unwrapping:
matter ops kms-dual-unwrap --kek-id <old> --corpus /forensics/known-cipher-corpus
matter ops kms-dual-unwrap --kek-id <new> --corpus /forensics/known-cipher-corpus
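The two unwrap checks above can be scripted so neither KEK is skipped under pressure. This is a minimal sketch, not the project's tooling: the helper names are hypothetical, and it assumes the matter CLI exits non-zero when any unwrap in the corpus fails, which this runbook does not state.

```python
import subprocess
from typing import Callable, List, Sequence

CORPUS = "/forensics/known-cipher-corpus"  # corpus path taken from this runbook


def dual_unwrap_cmd(kek_id: str, corpus: str = CORPUS) -> List[str]:
    """Build the runbook's dual-unwrap command for one KEK."""
    return ["matter", "ops", "kms-dual-unwrap", "--kek-id", kek_id, "--corpus", corpus]


def _exit_code(cmd: Sequence[str]) -> int:
    # ASSUMPTION: the CLI signals an unwrap failure with a non-zero exit code.
    return subprocess.run(list(cmd), check=False).returncode


def failing_keks(
    old_kek: str,
    new_kek: str,
    run: Callable[[Sequence[str]], int] = _exit_code,
) -> List[str]:
    """Return the KEK ids whose unwrap check failed; an empty list means reads are healthy."""
    return [kek for kek in (old_kek, new_kek) if run(dual_unwrap_cmd(kek)) != 0]
```

A non-empty result means cipher reads are already failing for that KEK. The runner is injectable so the check can be rehearsed without touching KMS.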
Diagnose
Step 4 hang
Find the process holding the old KEK reference:
matter ops kms-reference-tracker
Common causes:
- Long-running transaction holding a decryption lock.
- Zombie consumer process.
Step 5 partial
Check KMS state directly:
matter ops kms-state --kek-id <old>
If state=retiring, KMS may still be processing the retirement request. Wait, then retry.
If state=retired but ciphers still reference it, that's an
invariant break — escalate to security leadership.
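The step 5 decision can be written down as a small triage function. This is a sketch under assumptions: the Action names and the cipher_refs_remaining input are hypothetical, and only the retiring and retired-but-referenced branches come from this runbook. The other two branches are labeled guesses.

```python
from enum import Enum


class Action(Enum):
    WAIT_AND_RETRY = "wait_and_retry"
    ESCALATE = "escalate_to_security_leadership"
    RESUME = "resume_rotation"
    INVESTIGATE = "investigate"


def triage_old_kek(state: str, cipher_refs_remaining: int) -> Action:
    """Map the old KEK's state to the next step in this runbook."""
    if state == "retiring":
        # From the runbook: KMS may still be processing the retirement request.
        return Action.WAIT_AND_RETRY
    if state == "retired":
        if cipher_refs_remaining > 0:
            # From the runbook: retired but still referenced is an invariant break.
            return Action.ESCALATE
        # ASSUMPTION: retired with no remaining references means it is safe to resume.
        return Action.RESUME
    # ASSUMPTION: any other state is not covered by the runbook, so a human looks at it.
    return Action.INVESTIGATE
```

For example, triage_old_kek("retired", 3) returns Action.ESCALATE, while triage_old_kek("retiring", 3) returns Action.WAIT_AND_RETRY.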
Recover
matter ops kms-rotation-resume --rotation-id <id>
The saga picks up from the failed step. Compensation isn't applicable post-switch (the TLA+ spec guarantees forward-only execution).
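To illustrate what forward-only resume means, here is a toy model of the post-switch half of the saga. It is not the real implementation: the resume function and handlers mapping are hypothetical, and only the two step names come from this runbook.

```python
from typing import Callable, Dict, List

# Steps 4 and 5 from this runbook, in execution order.
POST_SWITCH_STEPS = ["wait_inflight_drain", "retire_old_kek"]


def resume(failed_step: str, handlers: Dict[str, Callable[[], None]]) -> List[str]:
    """Re-run the failed step and every step after it; never run compensation.

    Raises ValueError if failed_step is not a post-switch step, because
    pre-switch failures are handled by automatic rollback instead.
    """
    start = POST_SWITCH_STEPS.index(failed_step)
    completed: List[str] = []
    for step in POST_SWITCH_STEPS[start:]:
        handlers[step]()  # an exception here stops the saga at this step again
        completed.append(step)
    return completed
```

The design point is that there is no rollback path in the loop: a step either completes or the saga stops at that step again, ready for another resume.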
Validate
matter ops verify-kms-rotation --rotation-id <id>
Expected:
- Old KEK in retired state.
- New KEK in active state.
- Every cipher row references the new KEK (NoKeyOrphans invariant).
- Rekor anchor cron processed the rotation event.
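The four expected conditions can be checked together so that a partial pass is not mistaken for success. This is a minimal sketch: RotationSnapshot and its fields are hypothetical, and how the values are collected (for example by parsing verify-kms-rotation output) is left open because this runbook does not describe that output.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RotationSnapshot:
    """Hypothetical summary of post-rotation state, gathered by the operator."""
    old_kek_state: str
    new_kek_state: str
    rows_on_old_kek: int          # cipher rows still referencing the old KEK
    rekor_anchor_processed: bool  # whether the Rekor anchor cron saw the rotation event


def failed_checks(snapshot: RotationSnapshot) -> List[str]:
    """Return the names of the expected conditions that do not hold."""
    checks = {
        "old KEK retired": snapshot.old_kek_state == "retired",
        "new KEK active": snapshot.new_kek_state == "active",
        "NoKeyOrphans": snapshot.rows_on_old_kek == 0,
        "Rekor anchor processed": snapshot.rekor_anchor_processed,
    }
    return [name for name, ok in checks.items() if not ok]
```

An empty list means all four conditions hold. Any other result names the conditions to chase before closing the incident.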
Communicate
- Customer email if encryption issues affected any customer-visible operation for more than 30 minutes.
- Status page if rotation is customer-announced (rare).
- Postmortem within 5 days, externally published.