Webhook delivery recovery
SEV3 runbook for webhook delivery p99 breaching the 5s SLO.
Webhook delivery recovery (SEV3)
Triggered by: webhook delivery p99 > 5s sustained for ≥ 15 minutes on ≥ 1 customer's endpoint, OR delivery success rate drops below 99%.
On-call: Platform. Estimated MTTR: 30-60 minutes.
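The trigger conditions above can be sketched as a small check over per-minute samples. This is a minimal illustration, not the alerting rule itself: the `MinuteSample` shape and the `shouldTrigger` name are hypothetical, and evaluating the success rate on the latest sample only is an assumption, since the runbook states no window for it.

```typescript
// Hypothetical per-minute measurement for one customer endpoint.
interface MinuteSample {
  p99Seconds: number;  // delivery p99 latency for that minute
  successRate: number; // fraction of successful deliveries, 0..1
}

const P99_SLO_SECONDS = 5;
const SUSTAINED_MINUTES = 15;
const MIN_SUCCESS_RATE = 0.99;

// Returns true when either trigger condition holds for the newest samples.
// Samples are assumed to be ordered oldest to newest, one per minute.
function shouldTrigger(samples: MinuteSample[]): boolean {
  if (samples.length === 0) return false;

  // Condition 2: success rate below 99% (assumed: checked on the latest sample).
  const latest = samples[samples.length - 1];
  if (latest.successRate < MIN_SUCCESS_RATE) return true;

  // Condition 1: p99 above 5s for each of the last 15 minutes.
  if (samples.length < SUSTAINED_MINUTES) return false;
  const recent = samples.slice(-SUSTAINED_MINUTES);
  return recent.every((s) => s.p99Seconds > P99_SLO_SECONDS);
}
```

A single slow minute does not trigger under this sketch; the p99 breach has to hold across the whole 15-minute span.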
Diagnose
matter ops webhook-status --customer <id> --window 1h
Identify the symptom:
- Receiver-side slow (their p99 high): apply our backoff and notify the customer. SLA credit per apps/api/lib/webhook-slo.ts if sustained.
- Our queue back-pressure (shard hot): rebalance via apps/api/lib/webhook-sharding.ts.
- Signature verification failure on receiver: check the customer's secret rotation; resync.
Recover
For queue back-pressure, scale out the affected shard:
matter ops webhook-shard-scale --shard <id> --workers <n>
For signature issues:
matter ops webhook-secret-rotate --endpoint <id>
Customer must update their receiver with the new secret within the dual-sign overlap window (P0.C6).
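When walking a customer through the overlap window, it can help to show what accepting both secrets looks like on the receiver. The sketch below is an assumption-heavy illustration: it supposes the signature is a hex-encoded HMAC-SHA256 of the raw request body, which this runbook does not confirm, and `verifyDuringOverlap` is a hypothetical helper name. Check the actual signing scheme before sharing it.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumed scheme: HMAC-SHA256 over the raw request body.
function sign(secret: string, body: string): Buffer {
  return createHmac("sha256", secret).update(body).digest();
}

// Accepts a delivery if its signature matches ANY currently valid secret.
// During the overlap window the receiver passes [newSecret, oldSecret];
// once the window closes it passes [newSecret] only.
function verifyDuringOverlap(
  body: string,
  signatureHex: string,
  secrets: string[],
): boolean {
  const received = Buffer.from(signatureHex, "hex");
  return secrets.some((secret) => {
    const expected = sign(secret, body);
    // timingSafeEqual throws on length mismatch, so compare lengths first.
    return (
      received.length === expected.length &&
      timingSafeEqual(received, expected)
    );
  });
}
```

Trying both secrets means deliveries signed before and after the rotation both verify, so the customer can deploy the new secret without dropping in-flight events.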
Replay missed deliveries
If deliveries went to failed_dead:
matter ops webhook-redeliver --event <id> --endpoint <id>
Up to 30 days post-event (P0.E8 replay window).
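Before queuing a batch of redeliveries, a quick pre-check can filter out events that are already past the 30-day window. This is a sketch only: `isWithinReplayWindow` is a hypothetical helper, and treating the boundary as inclusive and measuring from the event timestamp are assumptions about how P0.E8 is applied.

```typescript
const REPLAY_WINDOW_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns true if the event is still eligible for redelivery at time `now`.
// Events with a timestamp in the future are treated as ineligible.
function isWithinReplayWindow(eventTime: Date, now: Date = new Date()): boolean {
  const ageMs = now.getTime() - eventTime.getTime();
  return ageMs >= 0 && ageMs <= REPLAY_WINDOW_DAYS * MS_PER_DAY;
}
```

For example, an event from 29 days ago passes the check, while one from 31 days ago does not and would need a different recovery path.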
Validate
matter ops verify-webhook-slo --customer <id>
Expected: p99 < 5s, success rate ≥ 99%.