KMS rotation
Quarterly + emergency KMS key rotation procedure.
Quarterly KEK rotation follows the state machine modelled in
packages/crypto/rotation.tla. Emergency rotation (suspected compromise)
follows the same procedure, except that the PreSwitchFailure branch is never invoked.
On-call: Security (primary), Platform (secondary). Cadence: Quarterly. Estimated duration: 2 hours (provided in-flight traffic drains on schedule).
Pre-flight (T-7 days)
- Schedule the rotation window: off-peak hours, announced on the status page at least 7 days ahead (per customer contract).
- Verify drill freshness: in apps/api/lib/runbook-registry.ts, drillStatus() for kms_rotation_failure should be fresh.
- Confirm KMS vendor health: dashboards green for 7+ days.
- Confirm Rekor + WORM bucket health: anchor cron green.
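The pre-flight checklist above can be read as a single go/no-go gate. The following is a minimal sketch of that idea; the `PreflightInputs` shape and `preflightBlockers` function are hypothetical names introduced for illustration and are not the actual API of runbook-registry.ts.

```typescript
// Hypothetical pre-flight gate. Field names are illustrative assumptions,
// not the real registry or monitoring API.
interface PreflightInputs {
  announcedDaysAhead: number;      // days between status-page notice and the window
  drillStatus: "fresh" | "stale";  // assumed result of drillStatus() for kms_rotation_failure
  vendorGreenDays: number;         // consecutive days of green KMS vendor dashboards
  anchorCronGreen: boolean;        // Rekor + WORM bucket anchor cron health
}

// Returns the list of unmet checklist items; an empty list means "go".
function preflightBlockers(p: PreflightInputs): string[] {
  const blockers: string[] = [];
  if (p.announcedDaysAhead < 7) blockers.push("status page announced less than 7 days ahead");
  if (p.drillStatus !== "fresh") blockers.push("kms_rotation_failure drill is not fresh");
  if (p.vendorGreenDays < 7) blockers.push("KMS vendor dashboards green for fewer than 7 days");
  if (!p.anchorCronGreen) blockers.push("anchor cron is not green");
  return blockers;
}
```

Returning every blocker, rather than stopping at the first, lets the on-call engineer see the full list of items to fix before rescheduling.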
Execute
The saga modelled in packages/crypto/rotation.tla:
Step 1 — mint_new_kek
    matter ops kms-mint-kek --label "Q2-2026-rotation"
Produces a new KEK in the same KMS region. Marks it inactive (no ciphers wrapped under it yet).
Step 2 — dual_unwrap_test
    matter ops kms-dual-unwrap --kek-id <new_kek_id> --corpus /forensics/known-cipher-corpus
Encrypts a known-plaintext canary under the new KEK, then decrypts it back. The canary must round-trip exactly. If this fails, run PreSwitchFailure immediately (see the PreSwitchFailure section below).
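The round-trip check in Step 2 amounts to a byte-for-byte comparison after wrap and unwrap. A minimal sketch follows, assuming a hypothetical `Kek` interface with `wrap` and `unwrap` methods; the real KMS client and the internals of kms-dual-unwrap are not described in this runbook.

```typescript
// Hypothetical KEK handle; the actual KMS client API is an assumption here.
interface Kek {
  wrap(plaintext: Uint8Array): Uint8Array;
  unwrap(ciphertext: Uint8Array): Uint8Array;
}

// Strict byte-for-byte equality: length first, then every byte.
function bytesEqual(a: Uint8Array, b: Uint8Array): boolean {
  if (a.length !== b.length) return false;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) return false;
  }
  return true;
}

// True only if every canary survives wrap followed by unwrap unchanged.
function roundtripOk(kek: Kek, canaries: Uint8Array[]): boolean {
  return canaries.every((c) => bytesEqual(kek.unwrap(kek.wrap(c)), c));
}
```

A `false` result here corresponds to the failure case that triggers PreSwitchFailure.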
Step 3 — switch_kek_pointer
The critical atomic step. Updates the pointer-of-truth to the new KEK. All new cipher writes use the new KEK; reads transparently unwrap either old or new.
    matter ops kms-switch-pointer --kek-id <new_kek_id> --confirm
After this point, the PreSwitchFailure branch is no longer available. We commit to the rotation.
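To illustrate why reads keep working across the switch, here is a small in-memory sketch. `KeyStore` and `CipherRow` are hypothetical names; the sketch only assumes, as the step describes, that writes follow the pointer while reads resolve each row by its own keyId.

```typescript
// Illustrative in-memory model; class and field names are assumptions.
interface CipherRow {
  keyId: string;
  payload: string;
}

class KeyStore {
  private activePointer: string;
  private readonly known = new Set<string>();

  constructor(initialKekId: string) {
    this.activePointer = initialKekId;
    this.known.add(initialKekId);
  }

  // Makes a newly minted KEK resolvable for reads before it becomes the write key.
  register(kekId: string): void {
    this.known.add(kekId);
  }

  // Step 3: a single assignment, so a writer sees either the old or the new KEK.
  switchPointer(newKekId: string): void {
    if (!this.known.has(newKekId)) throw new Error(`unknown KEK ${newKekId}`);
    this.activePointer = newKekId;
  }

  // New writes always use whichever KEK the pointer currently names.
  write(payload: string): CipherRow {
    return { keyId: this.activePointer, payload };
  }

  // Reads dispatch on the row's own keyId, so old and new rows both resolve.
  canRead(row: CipherRow): boolean {
    return this.known.has(row.keyId);
  }
}
```

In this model a row written before the switch keeps its old keyId and remains readable afterwards, which matches the "reads transparently unwrap either old or new" behaviour.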
Step 4 — wait_inflight_drain
In-flight requests that loaded the old KEK reference need to drain before we retire it.
    matter ops kms-watch-drain --old-kek-id <old_kek_id> --max-wait 30m
Watches the in-flight-reference counter. When it hits zero (typically within 10-15 minutes), proceeds.
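The drain wait is essentially a bounded polling loop. The sketch below shows one way such a loop could look; `readInflight`, `pollMs`, and the injected `sleep` are assumptions made for illustration, and the runbook does not state what kms-watch-drain does when the maximum wait expires, so the sketch simply reports the timeout instead of proceeding.

```typescript
// Hypothetical drain watcher; the real counter source and polling interval are assumptions.
type DrainResult =
  | { drained: true; waitedMs: number }
  | { drained: false; waitedMs: number; remaining: number };

function watchDrain(
  readInflight: () => number,   // current count of in-flight references to the old KEK
  maxWaitMs: number,            // e.g. 30 minutes for --max-wait 30m
  pollMs: number,
  sleep: (ms: number) => void,  // injected so the loop can be exercised without real waiting
): DrainResult {
  let waitedMs = 0;
  for (;;) {
    const remaining = readInflight();
    if (remaining === 0) return { drained: true, waitedMs };
    if (waitedMs >= maxWaitMs) return { drained: false, waitedMs, remaining };
    sleep(pollMs);
    waitedMs += pollMs;
  }
}
```

A `drained: false` result would be a reason to stop before Step 5, since retiring the old KEK while references remain is what the drain step exists to prevent.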
Step 5 — retire_old_kek
    matter ops kms-retire-kek --kek-id <old_kek_id> --confirm
Marks the old KEK inactive. Any future reference to it raises an "orphaned key" alarm.
PreSwitchFailure (mid-rotation rollback)
If Step 1 or Step 2 fails (TLA+ PreSwitchFailure branch), the new
KEK is retired immediately. Pointer never switched; old KEK remains
the canonical encryption key.
    matter ops kms-retire-kek --kek-id <new_kek_id> --reason "pre-switch failure" --confirm
Then debug and retry. The customer-facing SLA is unaffected (the pointer switch never happened).
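The relationship between the five steps and the rollback branch can be sketched as a small transition function. This is an illustrative reading of the runbook, not a translation of packages/crypto/rotation.tla. In particular, treating a failure at or after switch_kek_pointer as "halt and escalate" is an assumption, since the runbook only defines rollback for Steps 1 and 2.

```typescript
// Step names follow the runbook; Outcome and advance are hypothetical.
type Step =
  | "mint_new_kek"
  | "dual_unwrap_test"
  | "switch_kek_pointer"
  | "wait_inflight_drain"
  | "retire_old_kek";

const ORDER: Step[] = [
  "mint_new_kek",
  "dual_unwrap_test",
  "switch_kek_pointer",
  "wait_inflight_drain",
  "retire_old_kek",
];

type Outcome =
  | { kind: "next"; step: Step }
  | { kind: "done" }
  | { kind: "rollback" }          // PreSwitchFailure: retire the new KEK, keep the old one
  | { kind: "halt"; at: Step };   // assumed behaviour once rollback is unavailable

function advance(step: Step, succeeded: boolean): Outcome {
  const i = ORDER.indexOf(step);
  if (!succeeded) {
    // Only failures before the pointer switch may take the PreSwitchFailure branch.
    return i < ORDER.indexOf("switch_kek_pointer")
      ? { kind: "rollback" }
      : { kind: "halt", at: step };
  }
  const next = ORDER[i + 1];
  return next === undefined ? { kind: "done" } : { kind: "next", step: next };
}
```

For example, `advance("dual_unwrap_test", false)` yields a rollback, while `advance("wait_inflight_drain", false)` yields a halt because the pointer has already moved.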
Validate
After Step 5:
    matter ops verify-kms-rotation --new-kek-id <new_kek_id>
Expected:
- Every recent cipher row's keyId resolves to the new KEK.
- Old KEK marked inactive.
- Rekor anchor cron processed the rotation event.
- Audit chain extended with a kms_rotation_completed row.
TLA+ invariants verified
This procedure is modelled in packages/crypto/rotation.tla,
which verifies:
- NoKeyOrphans: every cipher row's keyId resolves to an active KEK at every moment.
- RetireAfterSwitch: the old KEK is retired only after the pointer has switched.
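As an informal aid to reading the two invariants, they can be expressed as predicates over a state snapshot. The `Snapshot` shape below is hypothetical; the variables actually used in the TLA+ spec are not shown in this runbook, so this is a sketch of the stated properties rather than of the spec itself.

```typescript
// Hypothetical state snapshot; field names are assumptions.
interface Snapshot {
  pointer: string;          // KEK id that new writes use
  activeKeks: Set<string>;  // KEKs that have not been retired
  rowKeyIds: string[];      // keyId of each cipher row
  oldKekId: string;         // the KEK being rotated out
}

// NoKeyOrphans: every row's keyId resolves to an active KEK.
const noKeyOrphans = (s: Snapshot): boolean =>
  s.rowKeyIds.every((id) => s.activeKeks.has(id));

// RetireAfterSwitch: if the old KEK is retired, the pointer must already have moved off it.
const retireAfterSwitch = (s: Snapshot): boolean =>
  s.activeKeks.has(s.oldKekId) || s.pointer !== s.oldKekId;
```

A model checker evaluates such properties in every reachable state; these functions check only a single snapshot.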