CDC lag recovery
SEV2 runbook for CDC consumer lag breaching the 1s p99 SLO.
CDC lag recovery (SEV2)
Triggered by: CDC lag p99 > 1s sustained for ≥ 5 minutes, OR a customer reporting stale reads after a recent write.
On-call: Platform (primary). Estimated MTTR: 15-45 minutes for transient lag; 1-4 hours for slot-corruption recovery.
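The lag trigger above ("p99 > 1s sustained for ≥ 5 minutes") can be sketched as a predicate over periodic readings. This is only an illustration of the rule, not the alerting code actually in use; the `LagSample` shape and the function name are hypothetical.

```typescript
// Hypothetical sample shape: one p99 lag reading per scrape interval.
interface LagSample {
  timestampMs: number; // epoch milliseconds of the reading
  p99LagMs: number;    // p99 CDC lag observed at that time
}

const SLO_MS = 1000;               // 1s p99 SLO from this runbook
const SUSTAIN_MS = 5 * 60 * 1000;  // breach must hold for at least 5 minutes

// True if any unbroken run of over-SLO samples spans at least sustainMs.
// A single reading at or below the SLO resets the run.
function isSustainedBreach(
  samples: LagSample[],
  sloMs: number = SLO_MS,
  sustainMs: number = SUSTAIN_MS,
): boolean {
  const sorted = [...samples].sort((a, b) => a.timestampMs - b.timestampMs);
  let breachStart: number | null = null;
  for (const s of sorted) {
    if (s.p99LagMs > sloMs) {
      if (breachStart === null) breachStart = s.timestampMs;
      if (s.timestampMs - breachStart >= sustainMs) return true;
    } else {
      breachStart = null;
    }
  }
  return false;
}
```

Resetting on any in-SLO reading keeps short spikes from paging; the customer-report trigger is independent of this check.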
Diagnose
Three failure modes:
Mode A: Slot lag (most common)
The logical-replication slot consumer fell behind. Cause: a long-running transaction on the primary, or consumer-side back-pressure.
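    matter ops cdc-slot-status --slot <slot-name>
If confirmed_flush_lsn is far behind the primary's current WAL position (pg_current_wal_lsn()), the consumer is lagging. Note that restart_lsn is never ahead of confirmed_flush_lsn, so it is not a useful reference point for lag. Action: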
- Increase consumer concurrency (bounce + scale up).
- Identify + commit the long-running primary transaction.
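To put a number on "far behind", one option is the byte distance between the primary's current WAL position and the slot's confirmed_flush_lsn. The sketch below assumes you already have both values as PostgreSQL LSN strings (for example from pg_current_wal_lsn() and the pg_replication_slots view); the output format of `matter ops cdc-slot-status` is not shown in this runbook, so the function names here are hypothetical helpers, not part of that tool.

```typescript
const TWO_POW_32 = 4294967296; // an LSN is a 64-bit position written as high32/low32 in hex

// Parse an LSN such as "16/B374D848" into its high and low 32-bit halves.
function parseLsn(lsn: string): { high: number; low: number } {
  const match = /^([0-9A-Fa-f]{1,8})\/([0-9A-Fa-f]{1,8})$/.exec(lsn.trim());
  if (match === null) throw new Error(`invalid LSN: ${lsn}`);
  return { high: parseInt(match[1], 16), low: parseInt(match[2], 16) };
}

// Bytes of WAL the consumer has not yet confirmed; clamped at 0.
// Subtracting the halves separately keeps the result exact as long as
// the lag itself is below 2^53 bytes, which holds for any realistic slot.
function slotLagBytes(currentWalLsn: string, confirmedFlushLsn: string): number {
  const current = parseLsn(currentWalLsn);
  const confirmed = parseLsn(confirmedFlushLsn);
  const lag = (current.high - confirmed.high) * TWO_POW_32 + (current.low - confirmed.low);
  return Math.max(lag, 0);
}
```

Lag in bytes does not translate directly into seconds; treat it as a trend indicator while applying the actions above.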
Mode B: Slot advance failed
The slot is stuck because the consumer can't checkpoint forward.
Action: see apps/api/lib/replication-slot-monitor.ts. Auto-recreate
will trigger if lag exceeds threshold; manual override:
    matter ops cdc-slot-rewind --slot <slot-name> --to <lsn>
Mode C: Schema-mismatch in projection
The read-model reducer can't handle a recent event shape. Diagnosis: read the consumer error log; look for shape-mismatch.
Action: hotfix the reducer OR replay-from-zero on the affected projection.
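When the error log is noisy, it can help to scan the event stream for the first event whose payload fails the reducer's expected shape; that sequence number is a natural replay starting point. The sketch below is purely illustrative: the `order.created` event, its fields, and every name here are hypothetical and not taken from the real reducer.

```typescript
interface RawEvent {
  sequence: number;
  type: string;
  payload: unknown;
}

// Hypothetical shape the reducer expects for an "order.created" event.
interface OrderCreated {
  orderId: string;
  totalCents: number;
}

// Runtime type guard: checks the fields the reducer would read.
function isOrderCreated(payload: unknown): payload is OrderCreated {
  if (typeof payload !== "object" || payload === null) return false;
  const p = payload as Record<string, unknown>;
  return typeof p["orderId"] === "string" && typeof p["totalCents"] === "number";
}

// Returns the sequence of the first "order.created" event whose payload
// fails the shape check, or null if every such event is well-formed.
function firstShapeMismatch(events: RawEvent[]): number | null {
  const ordered = [...events].sort((a, b) => a.sequence - b.sequence);
  for (const e of ordered) {
    if (e.type === "order.created" && !isOrderCreated(e.payload)) return e.sequence;
  }
  return null;
}
```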
Recover
    matter ops cqrs-replay --projection <name> --from-sequence <break_point>
Uses the replayFromSnapshot path in apps/api/lib/cqrs-read-model.ts.
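For orientation, a snapshot-based replay generally folds every event after the snapshot's sequence back through the reducer. The sketch below shows that general pattern only; it is not the code in apps/api/lib/cqrs-read-model.ts, and all type and function names are hypothetical.

```typescript
interface StoredEvent<E> {
  sequence: number;
  payload: E;
}

// State of the projection as of `sequence` (inclusive).
interface Snapshot<S> {
  sequence: number;
  state: S;
}

// Re-applies every event newer than the snapshot, in sequence order.
// Throws on a gap so a partial event log cannot silently produce a wrong read model.
function replayFrom<S, E>(
  snapshot: Snapshot<S>,
  events: StoredEvent<E>[],
  reduce: (state: S, event: E) => S,
): Snapshot<S> {
  let state = snapshot.state;
  let lastApplied = snapshot.sequence;
  const ordered = [...events].sort((a, b) => a.sequence - b.sequence);
  for (const event of ordered) {
    if (event.sequence <= lastApplied) continue; // already reflected in the state
    if (event.sequence !== lastApplied + 1) {
      throw new Error(`gap in event log: expected ${lastApplied + 1}, got ${event.sequence}`);
    }
    state = reduce(state, event.payload);
    lastApplied = event.sequence;
  }
  return { sequence: lastApplied, state };
}
```

Skipping events at or below the snapshot sequence makes the replay idempotent, so re-running it after a partial failure is safe under these assumptions.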
Validate
    matter ops verify-cdc-lag --window 15m
Expected: p99 < 1s within 15 minutes of recovery.
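If you need to double-check the p99 figure by hand from raw lag samples, a nearest-rank percentile is one simple definition. How `verify-cdc-lag` computes its percentile is not documented here, so treat this as an independent cross-check with hypothetical function names, not a description of the tool.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// True when the p99 of the window's lag samples is under the SLO (1s by default).
function meetsLagSlo(lagSamplesMs: number[], sloMs: number = 1000): boolean {
  return percentile(lagSamplesMs, 99) < sloMs;
}
```

With 100 samples, a single outlier above 1s still passes, while two outliers fail, which matches the intent of a p99 target.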