Saga compensation failure
SEV2 runbook for a saga compensation that itself fails mid-rollback.
Saga compensation failure (SEV2)
Triggered by: the saga runner reports compensation_failed for one
or more steps during a rollback path. The saga has neither
completed forward nor cleanly rolled back; it's in an undefined
intermediate state until we intervene.
On-call: Platform (primary), bounded-context owner (secondary).
Estimated MTTR: 30-90 minutes per affected saga instance.
Stop-the-bleed
- Freeze new saga starts of this kind:
matter ops freeze-saga --kind <saga_kind> --reason "compensation failure investigation"
- Capture saga instance state for forensics:
matter ops snapshot-saga --instance-id <id> --target /forensics/<incident_id>/
- Identify in-flight workers; let them finish, but don't let them accept new work.
Diagnose
Read the saga instance, using the shape defined in apps/api/lib/saga-property.ts:
- Which step failed forward?
- Which compensator failed?
- What was the state when compensation began?
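The three questions above can be answered mechanically once the instance is loaded. The sketch below assumes a hypothetical instance shape: the real one is defined in apps/api/lib/saga-property.ts, and its field names and state values may differ. Only compensation_failed is taken from this runbook; the `failed` and `pending` states, the `SagaInstance` and `Diagnosis` types, and the `diagnose` helper are illustrative.

```typescript
// Hypothetical shapes for illustration only; the real definitions live in
// apps/api/lib/saga-property.ts and may use different names.
type StepState =
  | "pending"
  | "succeeded"
  | "failed"
  | "compensated"
  | "compensation_failed"
  | "skipped";

interface SagaStep {
  name: string;
  state: StepState;
}

interface SagaInstance {
  id: string;
  steps: SagaStep[]; // assumed to be in forward execution order
}

interface Diagnosis {
  failedForward: string | null; // step whose forward action failed
  failedCompensators: string[]; // steps whose compensator failed
  committedBeforeRollback: string[]; // steps with forward effects when rollback began
}

function diagnose(saga: SagaInstance): Diagnosis {
  const failedIndex = saga.steps.findIndex((s) => s.state === "failed");
  const failedStep = failedIndex === -1 ? undefined : saga.steps[failedIndex];
  // Steps ahead of the failed one are the candidates for compensation.
  const before = failedIndex === -1 ? saga.steps : saga.steps.slice(0, failedIndex);
  return {
    failedForward: failedStep ? failedStep.name : null,
    failedCompensators: saga.steps
      .filter((s) => s.state === "compensation_failed")
      .map((s) => s.name),
    committedBeforeRollback: before
      .filter((s) => s.state !== "pending" && s.state !== "skipped")
      .map((s) => s.name),
  };
}
```

Running this against the snapshot captured during stop-the-bleed, rather than the live instance, keeps the diagnosis stable while workers drain.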
Three failure modes:
Mode A: Compensator-resource unavailable
The compensator tried to contact a provider that's currently down (e.g. bank API for unwinding a transfer). Retry-with-backoff.
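For readers unfamiliar with the pattern, here is a minimal sketch of retry-with-backoff around a compensator call. It is not the saga runner's implementation: the `compensate` callback, the attempt limit, and the delay values are all assumptions chosen for illustration, and the sketch omits jitter for readability.

```typescript
// Delay before the next attempt: exponential growth, capped at maxDelayMs.
function backoffDelayMs(attempt: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
}

// Calls `compensate` until it succeeds or maxAttempts is exhausted.
// All defaults are hypothetical, not values used by the saga runner.
async function retryWithBackoff<T>(
  compensate: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 1000,
  maxDelayMs = 60000,
): Promise<T> {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await compensate();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts - 1) break; // no point sleeping after the last try
      const delay = backoffDelayMs(attempt, baseDelayMs, maxDelayMs);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
  // Surface the final error so the caller can escalate instead of looping forever.
  throw lastError;
}
```

The cap matters for a provider outage of unknown length: without it, later attempts would wait longer than the MTTR target for this runbook.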
matter ops resume-saga --instance-id <id> --strategy retry-with-backoff
Mode B: Compensator-logic bug
The compensator code itself has a defect (e.g. it tries to unwind a transfer the system never wrote because the forward path errored before commit).
Action: manually compute what state should result from the
unwinding. Patch the saga instance state via a backfill script
(reviewed by 2 engineers). Mark the instance compensated.
matter ops backfill-saga --instance-id <id> --target-state compensated --reviewed-by <reviewer-1>,<reviewer-2>
Mode C: Non-compensable step already committed
Per the TLA+ specs, non-compensable steps should halt the saga
rather than triggering compensation. If we see a non-compensable
step marked compensated, that's an invariant violation — escalate
to SEV1 and follow the audit-chain recovery flow.
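The Mode C condition is simple enough to check in a few lines. The sketch below assumes each step record carries a `compensable` flag and a string `state`; both field names and the `findInvariantViolations` helper are hypothetical and not taken from the saga runner.

```typescript
// Hypothetical step record; `compensable` is an assumed flag.
interface StepRecord {
  name: string;
  compensable: boolean;
  state: string;
}

// Returns the names of steps that violate the invariant:
// a non-compensable step must never be marked compensated.
function findInvariantViolations(steps: StepRecord[]): string[] {
  return steps
    .filter((s) => !s.compensable && s.state === "compensated")
    .map((s) => s.name);
}
```

Any non-empty result is the escalation trigger described above.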
Recover
For Modes A and B, after manual intervention:
matter ops resume-saga --instance-id <id>
Verify post-resume:
- Saga state is one of succeeded|compensated|failed_halted.
- The combined forward-effects and compensation-effects ledger balances to zero for every side effect that should have been undone.
- Customer-facing state matches (request-status endpoint).
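The ledger check in the list above can be sketched as a per-effect sum. This assumes a hypothetical entry shape with signed integer amounts in minor units, where a forward effect of +100 is undone by a compensation of -100; the real ledger schema may differ.

```typescript
// Hypothetical ledger entry. Amounts are signed integers in minor units
// (e.g. cents) so that sums are exact.
interface LedgerEntry {
  effectId: string; // identifies one side effect, e.g. a single transfer
  amountMinor: number; // forward effects positive, compensations negative
}

// Returns every effect that should have been undone but does not net to zero,
// mapped to its remaining balance.
function unbalancedEffects(
  entries: LedgerEntry[],
  shouldBeUndone: Set<string>,
): Map<string, number> {
  const totals = new Map<string, number>();
  for (const entry of entries) {
    if (!shouldBeUndone.has(entry.effectId)) continue;
    totals.set(entry.effectId, (totals.get(entry.effectId) ?? 0) + entry.amountMinor);
  }
  const unbalanced = new Map<string, number>();
  totals.forEach((total, effectId) => {
    if (total !== 0) unbalanced.set(effectId, total);
  });
  return unbalanced;
}
```

An empty map means the ledger balances; a non-empty one names the effects that still need manual attention.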
Validate
matter ops verify-saga --instance-id <id>
Expected:
- Every step's state is in {succeeded, compensated, failed_halted, skipped}.
- The dependency DAG was honoured (no succeeded step with un-succeeded predecessors).
- Per-step audit rows match observed transitions.
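To make the first two expectations concrete, here is a sketch of the checks written out by hand. It is not the implementation of verify-saga: the `VerifiedStep` shape and its `dependsOn` field are assumptions, and it reads "un-succeeded" literally, so a skipped predecessor also counts as a violation.

```typescript
const FINAL_STATES: readonly string[] = [
  "succeeded",
  "compensated",
  "failed_halted",
  "skipped",
];

// Hypothetical step shape; `dependsOn` lists the names of predecessor steps.
interface VerifiedStep {
  name: string;
  state: string;
  dependsOn: string[];
}

// Returns a human-readable list of problems; an empty list means both checks pass.
function verifySteps(steps: VerifiedStep[]): string[] {
  const problems: string[] = [];
  const byName = new Map<string, VerifiedStep>();
  steps.forEach((s) => byName.set(s.name, s));

  for (const step of steps) {
    if (FINAL_STATES.indexOf(step.state) === -1) {
      problems.push(`${step.name}: non-final state ${step.state}`);
    }
    // Only succeeded steps constrain their predecessors.
    if (step.state !== "succeeded") continue;
    for (const dep of step.dependsOn) {
      const pred = byName.get(dep);
      if (pred === undefined || pred.state !== "succeeded") {
        problems.push(`${step.name}: predecessor ${dep} not succeeded`);
      }
    }
  }
  return problems;
}
```

The audit-row comparison is left out because it depends on the audit schema, which this runbook does not describe.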
Communicate
- If the saga is customer-facing (round close, formation packet,
dissolution), email the affected customer the same day. Apologise,
state the unwind result, and give an updated timeline.
- Update the status page if the SEV2 affects more than 1% of saga starts.
Post-recovery action items
- Add a property test for the failure mode using the
apps/api/lib/saga-property.ts harness.
- Add the failure mode to chaos cadence (P11.18) if it isn't covered.
- If Mode C: full audit-chain integrity check.