Provider failover
SEV2 runbook for failing over to a secondary external provider.
Provider failover (SEV2)
Triggered by: a filing, bank, mail, or agent provider has an elevated error rate, elevated latency, or a full outage.
Customer impact: filings stuck in the submitted_to_provider state for longer than the SLA.
On-call: Platform (primary), Compliance (secondary).
Cadence: provider-outage runbooks are rehearsed quarterly.
Stop-the-bleed
- Verify the provider really is degraded: check their status page and cross-check against our circuit breaker state in packages/api-providers/src/circuit-breaker.ts.
- Confirm the secondary provider can take the load: check their capacity and last successful submission timestamp.
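The two checks above can be folded into a single go/no-go decision. The following is a minimal sketch, not the logic in circuit-breaker.ts: every name in it (`FailoverInputs`, `preFailoverCheck`, the 24-hour freshness window) is a hypothetical chosen for illustration.

```typescript
// Hypothetical sketch: none of these names come from the real codebase.
interface FailoverInputs {
  statusPageDegraded: boolean;    // what the provider's status page reports
  primaryBreakerClosed: boolean;  // true if our breaker still sees the primary as healthy
  secondaryLastSuccessMs: number; // epoch ms of the secondary's last successful submission
  secondaryCapacity: number;      // capacity the secondary can absorb (unit is an assumption)
  expectedLoad: number;           // load we would move over, in the same unit
}

interface FailoverDecision {
  proceed: boolean;
  blockers: string[];
}

// Assumed freshness window; tune to the secondary's normal traffic.
const MAX_SECONDARY_STALENESS_MS = 24 * 60 * 60 * 1000;

function preFailoverCheck(input: FailoverInputs, nowMs: number): FailoverDecision {
  const blockers: string[] = [];

  // Degradation must show up on the status page or in our own breaker.
  if (!input.statusPageDegraded && input.primaryBreakerClosed) {
    blockers.push("primary does not look degraded");
  }
  if (input.secondaryCapacity < input.expectedLoad) {
    blockers.push("secondary lacks capacity for the expected load");
  }
  if (nowMs - input.secondaryLastSuccessMs > MAX_SECONDARY_STALENESS_MS) {
    blockers.push("secondary has no recent successful submission");
  }
  return { proceed: blockers.length === 0, blockers };
}
```

Returning the list of blockers rather than a bare boolean gives on-call something to paste into the incident channel when the answer is "do not fail over yet".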
Execute
Step 1 — Open the circuit breaker
matter ops provider-failover --primary <provider> --secondary <provider> --reason "<incident>"
This sets the per-jurisdiction routing in apps/api/lib/real-submit-pipeline.ts to route new submissions to the secondary provider. In-flight submissions stay with the original provider and resolve once that provider recovers.
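The routing rule described here (new submissions follow the updated map, in-flight ones stay pinned) can be sketched as below. This is an illustration under assumptions, not the code in real-submit-pipeline.ts: `Submission`, `applyFailover`, and `resolveProvider` are hypothetical names.

```typescript
// Hypothetical sketch of per-jurisdiction failover routing.
type ProviderId = string;

interface Submission {
  id: string;
  jurisdiction: string;
  assignedProvider?: ProviderId; // set once the submission has been sent to a provider
}

// Returns a new routing map with every route to `primary` pointed at `secondary`.
function applyFailover(
  routing: Map<string, ProviderId>,
  primary: ProviderId,
  secondary: ProviderId,
): Map<string, ProviderId> {
  const next = new Map<string, ProviderId>();
  routing.forEach((provider, jurisdiction) => {
    next.set(jurisdiction, provider === primary ? secondary : provider);
  });
  return next;
}

// In-flight submissions keep their provider; new ones use the current map.
function resolveProvider(sub: Submission, routing: Map<string, ProviderId>): ProviderId {
  if (sub.assignedProvider !== undefined) {
    return sub.assignedProvider;
  }
  const provider = routing.get(sub.jurisdiction);
  if (provider === undefined) {
    throw new Error("no route for jurisdiction " + sub.jurisdiction);
  }
  return provider;
}
```

Building a fresh map instead of mutating the existing one keeps the pre-failover routing available for comparison during recovery.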
Step 2 — Monitor secondary
For the first hour, verify every submission routed to the secondary. Once the secondary is confirmed healthy, return to the normal monitoring cadence.
Step 3 — Customer communication
Email customers with active filings:
- Their existing submission continues with the original provider.
- New submissions route to secondary.
- Expected timeline impact (if any).
Recover
When the primary provider recovers:
matter ops provider-recover --provider <provider>
Gradually rebalance new traffic back to the primary. Don't dump all traffic at once: start at 10% for 30 minutes, then 25%, 50%, 100%.
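A stepped ramp like this can be modelled as a lookup from elapsed time to the primary's traffic share. The sketch below assumes a 30-minute dwell at every step (the runbook states it only for the first) and uses hypothetical names throughout; it does not describe how `provider-recover` works internally.

```typescript
// Hypothetical sketch of the 10% -> 25% -> 50% -> 100% ramp.
const RAMP_STEPS_PERCENT = [10, 25, 50, 100];
const STEP_MINUTES = 30; // assumption: same dwell time for every step

// Share of new traffic the primary should receive after `elapsedMinutes` of recovery.
function primarySharePercent(elapsedMinutes: number): number {
  if (elapsedMinutes < 0) {
    return 0; // recovery has not started
  }
  const lastIndex = RAMP_STEPS_PERCENT.length - 1;
  const index = Math.min(Math.floor(elapsedMinutes / STEP_MINUTES), lastIndex);
  return RAMP_STEPS_PERCENT[index] ?? 100;
}

// Routes one new submission. `roll` is a uniform random number in [0, 1);
// it is passed in so the decision stays deterministic under test.
function pickProvider(sharePercent: number, roll: number): "primary" | "secondary" {
  return roll * 100 < sharePercent ? "primary" : "secondary";
}
```

In production the caller would pass `Math.random()` as `roll`; holding at a step longer than planned only requires not advancing the elapsed time fed to `primarySharePercent`.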
Validate
matter ops verify-provider-routing
Expected:
- Per-jurisdiction routing matches the canonical map for non-outage providers.
- Circuit breaker state matches expected (closed for healthy, half-open for recovering).
- Latency p99 within SLO across all routed submissions.
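Two of these checks are easy to reason about in code: comparing live routing against the canonical map, and computing p99 latency. The sketch below is illustrative only. The function names are hypothetical, and the nearest-rank percentile method is an assumption, since the runbook does not say how `verify-provider-routing` computes p99.

```typescript
// Hypothetical sketch of two validation checks.

// Jurisdictions whose live route differs from the canonical one.
// Routes whose canonical provider is in outage are skipped, since
// those are expected to point at a secondary.
function routingMismatches(
  actual: Map<string, string>,
  canonical: Map<string, string>,
  outageProviders: Set<string>,
): string[] {
  const mismatches: string[] = [];
  canonical.forEach((expected, jurisdiction) => {
    if (outageProviders.has(expected)) {
      return;
    }
    if (actual.get(jurisdiction) !== expected) {
      mismatches.push(jurisdiction);
    }
  });
  return mismatches;
}

// p99 by the nearest-rank method: the smallest sample such that at
// least 99% of samples are less than or equal to it.
function p99(latenciesMs: number[]): number | undefined {
  if (latenciesMs.length === 0) {
    return undefined;
  }
  const sorted = latenciesMs.slice().sort((a, b) => a - b);
  const rank = Math.ceil((99 * sorted.length) / 100);
  return sorted[rank - 1];
}
```

An empty mismatch list plus a p99 under the SLO threshold corresponds to the first and third expectations above.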
Per-provider ACL conformance
If the failover provider hasn't been used in 30+ days, run the ACL conformance suite first to catch drift:
matter ops verify-provider-acl --provider <provider>
The harness lives at apps/api/lib/provider-acl-conformance.ts.
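The 30-day rule is simple enough to automate as a pre-flight gate. A minimal sketch, assuming the last-used timestamp is available as epoch milliseconds; `needsAclConformance` is a hypothetical helper, not part of the harness.

```typescript
// Hypothetical sketch of the "unused for 30+ days" gate.
const ACL_STALENESS_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// True when the conformance suite should run before failing over.
// A provider with no recorded use is treated as stale.
function needsAclConformance(lastUsedMs: number | undefined, nowMs: number): boolean {
  if (lastUsedMs === undefined) {
    return true;
  }
  return nowMs - lastUsedMs >= ACL_STALENESS_DAYS * MS_PER_DAY;
}
```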
Post-recovery action items
- Postmortem if customer impact > 1 hour.
- Update circuit-breaker thresholds if the trip was late.
- Schedule the next quarterly chaos drill against this provider.