Load shed engagement
SEV3 runbook for sustained overload triggering load-shed middleware.
Load shed engagement (SEV3)
Triggered by: queue depth > threshold OR p99 > 2× SLO, sustained
for ≥ 5 minutes. When triggered, the middleware in
apps/api/lib/middleware/load-shed.ts engages and returns 503 with a
Retry-After header on non-critical requests.
On-call: Platform.
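The trigger condition can be sketched as follows. This is a minimal illustration, not the contents of load-shed.ts: the `Sample` and `TriggerConfig` types and the `shouldEngage` function are hypothetical names, the sketch assumes the 5-minute window applies to both conditions, and it assumes samples arrive sorted oldest to newest.

```typescript
// Hypothetical types; the real middleware may model this differently.
interface Sample {
  timestampMs: number; // when the sample was taken
  queueDepth: number;  // pending work at that moment
  p99Ms: number;       // p99 latency for the sampling interval
}

interface TriggerConfig {
  queueDepthThreshold: number; // value not stated in this runbook
  sloMs: number;               // p99 latency SLO
  sustainMs: number;           // 5 minutes, per the trigger definition
}

// A single sample breaches the trigger if either condition holds.
function isOverloaded(s: Sample, cfg: TriggerConfig): boolean {
  return s.queueDepth > cfg.queueDepthThreshold || s.p99Ms > 2 * cfg.sloMs;
}

// Engage only if the current unbroken run of overloaded samples
// has lasted at least sustainMs. Samples must be sorted by time.
function shouldEngage(samples: Sample[], nowMs: number, cfg: TriggerConfig): boolean {
  let runStartMs: number | null = null;
  // Walk back from the newest sample until the first healthy one.
  for (let i = samples.length - 1; i >= 0; i--) {
    if (!isOverloaded(samples[i], cfg)) break;
    runStartMs = samples[i].timestampMs;
  }
  return runStartMs !== null && nowMs - runStartMs >= cfg.sustainMs;
}
```

Requiring an unbroken run means a single healthy sample restarts the clock, which keeps short bursts from engaging load-shed.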
What load-shed preserves vs drops
Preserved (always):
- Writes (any mutation).
- Audit endpoints.
- Dissolution path.
- Essential webhook delivery.
Dropped under shed:
- List endpoints (read).
- Reports + expand-heavy reads.
- Non-essential webhooks.
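The preserve/drop split above could look roughly like this at the request level. It is a sketch under assumptions: the path prefixes `/audit` and `/dissolution`, the 30-second Retry-After default, and the names `classify` and `decide` are all hypothetical, and webhooks are left out because this runbook does not say how essential and non-essential webhooks are told apart.

```typescript
type RequestClass = "critical" | "non-critical";

// Minimal request shape for illustration only.
interface ShedRequest {
  method: string;
  path: string;
}

interface ShedResponse {
  status: number;
  headers: Record<string, string>;
}

// Any mutation counts as a write and is always preserved.
const MUTATING_METHODS = new Set(["POST", "PUT", "PATCH", "DELETE"]);

// Hypothetical prefixes standing in for the audit endpoints and dissolution path.
const CRITICAL_PREFIXES = ["/audit", "/dissolution"];

function classify(req: ShedRequest): RequestClass {
  if (MUTATING_METHODS.has(req.method.toUpperCase())) return "critical";
  if (CRITICAL_PREFIXES.some((prefix) => req.path.startsWith(prefix))) return "critical";
  return "non-critical"; // list endpoints, reports, expand-heavy reads
}

// Returns null when the request should pass through, or a 503 to send back.
function decide(req: ShedRequest, shedding: boolean, retryAfterSeconds = 30): ShedResponse | null {
  if (!shedding || classify(req) === "critical") return null;
  return { status: 503, headers: { "Retry-After": String(retryAfterSeconds) } };
}
```

Defaulting unknown reads to non-critical means a newly added critical read would be shed until it is classified, which is the case the post-recovery tuning step is meant to catch.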
Diagnose
matter ops shed-status
Identify the trigger:
- Real load spike — surge mode also engaged (P0.E15).
- Slow query — one slow op slowing the whole pool.
- Provider-side dependency lagging — provider failover may apply.
- Memory pressure — leak; bounce affected pods.
Recover
Address the underlying cause; load-shed exits automatically once queue depth and p99 have both stayed below their thresholds for 5 minutes.
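The auto-exit rule can be pictured as a small tracker that restarts its clock on any breach. This is an illustrative sketch only: `ShedExitTracker` and `ExitConfig` are hypothetical names, and it assumes the exit thresholds are the same values that trigger engagement, which this runbook does not state.

```typescript
interface ExitConfig {
  queueDepthThreshold: number; // assumed equal to the trigger threshold
  p99ThresholdMs: number;      // assumed equal to the trigger value (2 × SLO)
  recoveryMs: number;          // 5 minutes, per the auto-exit rule
}

class ShedExitTracker {
  private readonly cfg: ExitConfig;
  private healthySinceMs: number | null = null;

  constructor(cfg: ExitConfig) {
    this.cfg = cfg;
  }

  // Feed one observation; returns true once both metrics have stayed
  // below threshold for the full recovery window.
  observe(nowMs: number, queueDepth: number, p99Ms: number): boolean {
    const healthy =
      queueDepth < this.cfg.queueDepthThreshold && p99Ms < this.cfg.p99ThresholdMs;
    if (!healthy) {
      this.healthySinceMs = null; // any breach restarts the clock
      return false;
    }
    if (this.healthySinceMs === null) this.healthySinceMs = nowMs;
    return nowMs - this.healthySinceMs >= this.cfg.recoveryMs;
  }
}
```

Because a breach resets the window, a flapping system stays in load-shed rather than oscillating, which is one reason auto-exit is safer than disengaging by hand.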
For manual exit:
matter ops shed-disengage --reason "<incident_id>"
(Use sparingly; auto-exit is the safe path.)
Validate
matter ops verify-load-shed-disengaged
Expected:
- Queue depth < 50% of threshold.
- p99 < SLO.
- No 503s for the last 10 minutes.
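For reference, the three expected conditions amount to a check like the one below. This is not the implementation of the verify command; `PostShedSnapshot` and `validateDisengaged` are hypothetical names used only to make the criteria concrete.

```typescript
// Hypothetical snapshot of the metrics the validation step looks at.
interface PostShedSnapshot {
  nowMs: number;
  queueDepth: number;
  queueDepthThreshold: number;
  p99Ms: number;
  sloMs: number;
  last503AtMs: number | null; // null if no 503 has been observed
}

const TEN_MINUTES_MS = 10 * 60 * 1000;

// Returns the list of failed conditions; an empty list means all three pass.
function validateDisengaged(s: PostShedSnapshot): string[] {
  const failures: string[] = [];
  if (!(s.queueDepth < 0.5 * s.queueDepthThreshold)) {
    failures.push("queue depth is not below 50% of threshold");
  }
  if (!(s.p99Ms < s.sloMs)) {
    failures.push("p99 is not below SLO");
  }
  if (s.last503AtMs !== null && s.nowMs - s.last503AtMs < TEN_MINUTES_MS) {
    failures.push("a 503 was returned in the last 10 minutes");
  }
  return failures;
}
```

Note that the validation bar is stricter than the exit rule: queue depth must be under half the threshold and p99 under the SLO itself, not merely under the trigger level.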
Post-recovery
- Tune the trigger threshold if it engaged spuriously.
- Tune the per-op "critical vs non-critical" classification if a genuinely critical op was shed.