Last updated Jun 14, 2026

CDC lag recovery (SEV2)

Triggered by: CDC lag p99 > 1s sustained for ≥ 5 minutes, OR a customer reporting stale reads after a recent write.
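The trigger condition above can be sketched as a check over raw lag samples. This is an illustration only: the LagSample shape, the one-minute bucketing, and the nearest-rank percentile are assumptions, not a description of how the actual alert is evaluated.

```typescript
// Hypothetical sample shape: one CDC lag measurement, in milliseconds.
interface LagSample {
  timestampMs: number;
  lagMs: number;
}

// Nearest-rank p99 of a non-empty list of values.
function p99(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(Math.ceil(0.99 * sorted.length), 1);
  return sorted[rank - 1];
}

// True when every one-minute bucket in the trailing window has a p99
// above the threshold (assumed reading of "sustained for >= 5 minutes").
function lagAlertFires(
  samples: LagSample[],
  nowMs: number,
  thresholdMs: number = 1000,
  sustainMinutes: number = 5,
): boolean {
  const minuteMs = 60000;
  for (let i = 0; i < sustainMinutes; i++) {
    const end = nowMs - i * minuteMs;
    const start = end - minuteMs;
    const bucket = samples
      .filter((s) => s.timestampMs > start && s.timestampMs <= end)
      .map((s) => s.lagMs);
    // An empty or healthy bucket breaks the "sustained" condition.
    if (bucket.length === 0 || p99(bucket) <= thresholdMs) {
      return false;
    }
  }
  return true;
}
```

A single healthy minute inside the window resets the condition, which is what separates a sustained breach from a brief spike.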

On-call: Platform (primary). Estimated MTTR: 15-45 minutes for transient lag; 1-4 hours for slot-corruption recovery.

Diagnose

Three failure modes:

Mode A: Slot lag (most common)

The logical-replication slot consumer has fallen behind. Cause: a long-running transaction on the primary OR consumer-side back-pressure.

matter ops cdc-slot-status --slot <slot-name>

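If confirmed_flush_lsn is far behind the primary's current WAL position, the consumer is lagging. If restart_lsn is far behind confirmed_flush_lsn, a long-running transaction on the primary is holding the slot back. Action: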

Increase consumer concurrency (bounce + scale up).
Identify + commit the long-running primary transaction.
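LSNs are printed as two hexadecimal numbers separated by a slash (for example 16/B374D848), so judging how far one is behind another means converting them to byte offsets. A minimal sketch, assuming the status output shows LSNs in PostgreSQL's standard text format; parseLsn and lsnDiffBytes are illustrative names, not part of the matter CLI:

```typescript
// An LSN split into its high and low 32-bit halves.
interface Lsn {
  hi: number;
  lo: number;
}

// Parse PostgreSQL's text form "HI/LO" (both halves hexadecimal).
function parseLsn(lsn: string): Lsn {
  const match = /^([0-9A-Fa-f]{1,8})\/([0-9A-Fa-f]{1,8})$/.exec(lsn);
  if (!match) {
    throw new Error("invalid LSN: " + lsn);
  }
  return { hi: parseInt(match[1], 16), lo: parseInt(match[2], 16) };
}

// Byte distance from `behind` to `ahead`. Subtracting the halves first
// keeps the result exact as long as the distance itself is below 2^53.
function lsnDiffBytes(ahead: string, behind: string): number {
  const a = parseLsn(ahead);
  const b = parseLsn(behind);
  return (a.hi - b.hi) * 4294967296 + (a.lo - b.lo);
}
```

For example, lsnDiffBytes("0/2000000", "0/1000000") is 16777216 bytes (16 MiB). A negative result means the first argument is actually the older position.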

Mode B: Slot advance failed

The slot is stuck because the consumer can't checkpoint forward.

Action: see apps/api/lib/replication-slot-monitor.ts. Auto-recreate will trigger if lag exceeds threshold; manual override:

matter ops cdc-slot-rewind --slot <slot-name> --to <lsn>

Mode C: Schema-mismatch in projection

The read-model reducer can't handle a recent event shape. Diagnosis: read the consumer error log; look for shape-mismatch.
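To illustrate what a shape-mismatch looks like at the reducer level, here is a minimal sketch of a reducer that validates the payload before applying it. The event envelope, the deposit event, and the balance state are all hypothetical; the real reducer in the codebase may be structured differently.

```typescript
// Hypothetical event envelope and projection state, for illustration only.
interface EventEnvelope {
  sequence: number;
  type: string;
  payload: unknown;
}

interface BalanceState {
  balance: number;
}

type ReduceResult =
  | { ok: true; state: BalanceState }
  | { ok: false; sequence: number; reason: string };

// Runtime check that a payload has the shape the reducer expects.
function isDepositPayload(p: unknown): p is { amount: number } {
  return (
    typeof p === "object" &&
    p !== null &&
    typeof (p as { amount?: unknown }).amount === "number"
  );
}

// Apply one event; report a mismatch instead of corrupting the state.
function reduce(state: BalanceState, event: EventEnvelope): ReduceResult {
  if (event.type !== "deposit") {
    return {
      ok: false,
      sequence: event.sequence,
      reason: "shape-mismatch: unknown event type " + event.type,
    };
  }
  const payload = event.payload;
  if (!isDepositPayload(payload)) {
    return {
      ok: false,
      sequence: event.sequence,
      reason: "shape-mismatch: deposit payload lacks a numeric amount",
    };
  }
  return { ok: true, state: { balance: state.balance + payload.amount } };
}
```

Returning the failing sequence number alongside the reason gives a natural break point for a later replay.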

Action: hotfix the reducer OR replay-from-zero on the affected projection.

Recover

matter ops cqrs-replay --projection <name> --from-sequence <break_point>

Uses the replayFromSnapshot path in apps/api/lib/cqrs-read-model.ts.
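Conceptually, a replay from a break point folds every event at or after that sequence onto the last good snapshot. The sketch below shows that idea only; StoredEvent, Snapshot, and replayFrom are hypothetical names and do not reflect the actual replayFromSnapshot signature.

```typescript
// Hypothetical stored event and snapshot shapes, for illustration only.
interface StoredEvent {
  sequence: number;
  delta: number;
}

interface Snapshot {
  lastSequence: number;
  total: number;
}

// Rebuild projection state by applying events from `fromSequence` onward,
// in sequence order, on top of a snapshot that ends just before it.
function replayFrom(
  snapshot: Snapshot,
  events: StoredEvent[],
  fromSequence: number,
): Snapshot {
  if (snapshot.lastSequence !== fromSequence - 1) {
    throw new Error("snapshot must end immediately before the break point");
  }
  const pending = events
    .filter((e) => e.sequence >= fromSequence)
    .sort((a, b) => a.sequence - b.sequence);
  let state: Snapshot = { lastSequence: snapshot.lastSequence, total: snapshot.total };
  for (const e of pending) {
    state = { lastSequence: e.sequence, total: state.total + e.delta };
  }
  return state;
}
```

Requiring the snapshot to end exactly one sequence before the break point guards against both double-applying events and silently skipping them.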

Validate

matter ops verify-cdc-lag --window 15m

Expected: p99 < 1s within 15 minutes of recovery.

