Capacity plan
How Matter scales. Six stages from single-region single-primary to dedicated webhook delivery cluster, each with a concrete trigger condition and a documented runbook. Leading indicators tracked from day 1; dashboards live at P11.
Matter scales in stages, not gradients. Each stage has a concrete trigger condition derived from leading indicators; entering a stage is a planning event with a documented runbook. SLOs do not change across stages — the customer sees the same latency and availability budget whether we are running a single Postgres primary or a dedicated webhook delivery cluster.
Leading indicators are tracked from P0 (packages/observability/src/leading-indicators.ts); dashboards surface them at P11.5. Capacity planning is a quarterly review owned by the Platform context.
The six stages
Stage 1 — Single region, single primary (default)
The Day-0 deployment. One Vercel project, one Postgres primary (Neon serverless adapter), one Upstash Redis, one KMS region, one Sigstore Rekor witness.
- Trigger to enter: the default. Stage 1 is where every new deployment starts.
- Trigger to exit (to Stage 2): read QPS exceeds 500 sustained, OR p99 read latency drift exceeds 20 % over the SLO baseline.
- Capacity ceiling: ~5k QPS reads, ~500 QPS writes, ~50k entities, ~10M events/month.
- Runbook: none required — this is the starting state.
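The Stage 1 exit condition can be written as a small predicate. This is a minimal sketch: the `ReadSnapshot` shape and its field names are hypothetical and are not taken from packages/observability/src/leading-indicators.ts. Only the two thresholds (500 read QPS, 20 % p99 drift) come from the trigger above.

```typescript
// Hypothetical snapshot shape; field names are illustrative only.
interface ReadSnapshot {
  sustainedReadQps: number;  // read QPS averaged over the "sustained" window
  p99ReadMs: number;         // current p99 read latency
  baselineP99ReadMs: number; // p99 read latency recorded as the SLO baseline
}

const READ_QPS_TRIGGER = 500;  // from the Stage 1 exit condition
const P99_DRIFT_TRIGGER = 0.2; // 20 % over the SLO baseline

// Relative drift of the current p99 against the baseline (0.25 means 25 % slower).
function p99Drift(s: ReadSnapshot): number {
  return (s.p99ReadMs - s.baselineP99ReadMs) / s.baselineP99ReadMs;
}

// True when either Stage 1 exit condition holds.
function shouldPlanStage2(s: ReadSnapshot): boolean {
  return s.sustainedReadQps > READ_QPS_TRIGGER || p99Drift(s) > P99_DRIFT_TRIGGER;
}
```

How long a reading must hold before it counts as "sustained" is not defined here, so the sketch expects the caller to pass an already-averaged QPS value.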
Stage 2 — Read replicas
Read traffic routes to one or more Postgres read replicas via the withReplica() helper in packages/database/src/replica-router.ts (harness shipped P0.H1). Operations declaring x-matter-consistency: eventual may use replicas; strict always routes to primary.
- Trigger to enter: read QPS > 500 sustained, OR p99 read latency drift > 20 % over baseline, OR primary CPU > 50 % sustained.
- Trigger to exit (to Stage 3): Event or AuditEntry row count exceeds 100M, OR write p99 exceeds 50 % of SLO ceiling.
- Capacity ceiling: ~50k QPS reads, write capacity unchanged.
- Runbook: apps/docs/content/docs/runbooks/region-failover.mdx (Stage 2 setup), apps/docs/content/docs/runbooks/cdc-lag-recovery.mdx (lag handling).
- SLO impact: read p99 may rise by ≤ 50 ms due to replica lag; absorbed by the read SLO budget.
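The routing rule for x-matter-consistency can be illustrated with a small decision function. This is a sketch and not the actual withReplica() helper, whose signature is not shown here. The `chooseTarget` name and the lag-based fallback to primary are assumptions; the 1 s default mirrors the replica-lag alert threshold in the leading indicators table.

```typescript
type Consistency = "strict" | "eventual";
type Target = "primary" | "replica";

// Picks a database target from an operation's declared consistency.
// Assumption: an eventual read falls back to the primary when the replica is too stale.
function chooseTarget(consistency: Consistency, replicaLagMs: number, maxLagMs = 1000): Target {
  if (consistency === "strict") return "primary"; // strict always routes to primary
  return replicaLagMs <= maxLagMs ? "replica" : "primary";
}
```

For example, `chooseTarget("eventual", 200)` selects the replica, while `chooseTarget("strict", 0)` stays on the primary regardless of lag.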
Stage 3 — Partition hot tables
Hot tables (Event, AuditEntry, IdempotencyRecord) are range-partitioned. Event by (orgId hash, week). AuditEntry by (orgId hash, month). IdempotencyRecord by created_at (daily).
Partitioning migration ships at P0.E10. Rolling partition creation is automated by a cron. Partition pruning is verified by EXPLAIN-stability tests.
- Trigger to enter: Event or AuditEntry > 100M rows, OR write p99 > 50 % SLO ceiling.
- Trigger to exit (to Stage 4): a high-trust customer requires a dedicated DB tier, OR a SOC 2 separation-of-duties finding requires it.
- Capacity ceiling: ~10B rows per partitioned table without latency degradation.
- Runbook: apps/docs/content/docs/runbooks/partition-split.mdx.
- SLO impact: none expected; partitioning preserves latency if pruning works correctly. The EXPLAIN-stability test guards against pruning regressions.
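To make the partition keys concrete, the sketch below derives a partition name from an org hash bucket and a time bucket. All of it is illustrative: the naming scheme, the 16-bucket default, the FNV-1a-style hash, and the Monday week start are assumptions and are not taken from the P0.E10 migration.

```typescript
type Granularity = "day" | "week" | "month";

// Start of the UTC bucket containing `d`, formatted as YYYY-MM-DD.
// Assumption: weeks start on Monday.
function bucketStart(d: Date, g: Granularity): string {
  const y = d.getUTCFullYear();
  const m = d.getUTCMonth();
  const day = d.getUTCDate();
  let start: Date;
  if (g === "month") {
    start = new Date(Date.UTC(y, m, 1));
  } else if (g === "week") {
    const sinceMonday = (d.getUTCDay() + 6) % 7; // getUTCDay(): 0 = Sunday
    start = new Date(Date.UTC(y, m, day - sinceMonday)); // Date.UTC handles month rollover
  } else {
    start = new Date(Date.UTC(y, m, day));
  }
  return start.toISOString().slice(0, 10);
}

// FNV-1a-style hash over UTF-16 code units, reduced to a fixed number of buckets.
function orgBucket(orgId: string, buckets: number): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < orgId.length; i++) {
    h ^= orgId.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // keep the value as an unsigned 32-bit integer
  }
  return h % buckets;
}

// Hypothetical naming scheme: <table>_h<bucket>_<YYYYMMDD of bucket start>.
function partitionName(table: string, orgId: string, at: Date, g: Granularity, buckets = 16): string {
  return `${table}_h${orgBucket(orgId, buckets)}_${bucketStart(at, g).replace(/-/g, "")}`;
}
```

With these assumptions, an Event row written on Wednesday 2024-05-15 lands in the weekly bucket starting 2024-05-13, and an AuditEntry row on the same day lands in the monthly bucket starting 2024-05-01.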
Stage 4 — Per-tenant DB tier ladder
High-trust customers move to a dedicated tier. Three tiers: shared (default), dedicated-pool (separate connection pool, same DB cluster), dedicated-cluster (separate DB cluster, separate KMS sub-region). Tier is set at portfolio level via the dashboard or via POST /v1/portfolios/{id}/tier.
- Trigger to enter: a customer requires dedicated infrastructure, OR a SOC 2 finding requires separation of duties.
- Trigger to exit (to Stage 5): EU-residency customer signed, OR APAC p99 exceeds 1.5× US p99.
- Capacity ceiling: per-customer ceilings are set by each customer's contract; aggregate cluster capacity scales horizontally.
- Runbook: apps/docs/content/docs/runbooks/tenant-tier-migration.mdx.
- SLO impact: dedicated tiers carry the same SLOs as shared. Migration between tiers is a one-time event with explicit customer comms.
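The tier ladder maps naturally onto an ordered union type. In the sketch below the three tier names and the endpoint path come from the text above; the `{ tier }` request body shape and the helper names are assumptions. The function builds the request without sending it.

```typescript
// The three tiers named above, in ascending isolation order.
const TIERS = ["shared", "dedicated-pool", "dedicated-cluster"] as const;
type DbTier = (typeof TIERS)[number];

// True when `to` is a stricter isolation level than `from`.
function isUpgrade(from: DbTier, to: DbTier): boolean {
  return TIERS.indexOf(to) > TIERS.indexOf(from);
}

interface TierRequest {
  method: "POST";
  path: string;
  body: string;
}

// Builds (does not send) the tier-change call. The body shape is an assumption.
function buildTierRequest(portfolioId: string, tier: DbTier): TierRequest {
  return {
    method: "POST",
    path: `/v1/portfolios/${encodeURIComponent(portfolioId)}/tier`,
    body: JSON.stringify({ tier }),
  };
}
```

Keeping the tiers in one ordered array means the type and the upgrade check cannot drift apart if a tier is added later.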
Stage 5 — Multi-region
US (default) plus EU. Per-region Postgres, per-region KMS, per-region Sigstore Rekor witness. Logical sharding by region column. Customer's region is pinned at organisation creation; data routing follows.
- Trigger to enter: EU-residency customer signed (data must not leave EU), OR APAC p99 > 1.5× US p99 (latency-driven expansion).
- Trigger to exit (to Stage 6): webhook deliveries > 1M/day, OR webhook p99 > 2× SLO.
- Capacity ceiling: per-region capacity unchanged from Stage 4; cross-region capacity additive.
- Runbook: apps/docs/content/docs/runbooks/region-failover.mdx.
- SLO impact: cross-region writes (rare; only for migrations) carry a relaxed p99 documented in the consistency contract. Per-region operations see SLO targets met.
Stage 6 — Dedicated webhook delivery cluster
Outbound webhook delivery moves to its own worker pool. Sharded by (endpoint_id MOD N). Independent backpressure per shard. Independent scaling from API request handling.
- Trigger to enter: webhook deliveries > 1M/day, OR webhook p99 receiver-side > 2× SLO ceiling for ≥ 30 minutes sustained.
- Trigger to exit: none anticipated. Stage 6 is horizontally scalable.
- Capacity ceiling: scales with worker pool size; per-shard ceiling ~100k deliveries/day with current sharding constants.
- Runbook: P11.25 implementation.
- SLO impact: webhook p99 maintained as event volume grows.
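The sharding rule and the per-shard ceiling translate into two short functions. This sketch assumes endpoint_id is a non-negative integer and that deliveries spread evenly across shards; neither is stated above. The 100k/day figure is the per-shard ceiling quoted in the list.

```typescript
const PER_SHARD_DAILY_CEILING = 100000; // "~100k deliveries/day" per shard

// Shard index for an endpoint: endpoint_id MOD N.
function shardFor(endpointId: number, shardCount: number): number {
  if (!Number.isInteger(endpointId) || endpointId < 0) {
    throw new RangeError("endpointId must be a non-negative integer");
  }
  if (!Number.isInteger(shardCount) || shardCount < 1) {
    throw new RangeError("shardCount must be a positive integer");
  }
  return endpointId % shardCount;
}

// Minimum shard count for a daily volume, assuming an even spread across shards.
function shardsNeeded(deliveriesPerDay: number): number {
  return Math.max(1, Math.ceil(deliveriesPerDay / PER_SHARD_DAILY_CEILING));
}
```

At the Stage 6 entry trigger of 1M deliveries/day, `shardsNeeded(1000000)` gives 10 shards under these assumptions. A skewed endpoint distribution would need more headroom.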
Leading indicators
These metrics drive stage-transition decisions. Tracked from P0; dashboards land at P11.5. Quarterly review owned by Platform context.
| Indicator | Stage trigger | Source | Alert threshold |
|---|---|---|---|
| Read replica lag p99 | Stage 2 (operational) | CDC consumer metrics | > 1 s sustained |
| AuditEntry row growth rate | Stage 3 trigger | DB metrics | > 1M / month sustained |
| Event row growth rate | Stage 3 + 6 trigger | DB metrics | > 5M / month sustained |
| Sandbox vs Live write ratio | Capacity planning | App metrics | per-org outliers > 100x |
| Webhook queue depth p99 per endpoint | Stage 6 trigger | Webhook worker metrics | > 1000 queued sustained |
| KMS API call rate | Rotation-window awareness | KMS metrics | > 80 % of plan quota |
| CDC lag p99 per read model | Stage 2 + DB health | CDC worker metrics | > 1 s for 5 min |
| Postgres connection-pool utilization per role | Capacity planning | DB metrics | > 80 % sustained |
| Per-region p99 latency | Stage 5 trigger | External probes | non-US region > 1.5× US |
| Cost-per-request per operation | Cost engineering | Cost tracker | > 110 % of budget |
| Cache hit ratio per cached operation | Cache health | Cache metrics | < 80 % of target |
| MCP tool-call rate per tier | Anomaly + capacity | MCP metrics | tier outliers |
| Deprecated-endpoint usage per customer | Deprecation comms | Endpoint metrics | > 0 monthly active |
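Several thresholds in the table are qualified with "sustained" or "for 5 min". One way to evaluate such a condition is to check that the most recent run of breaching samples spans the required window. The sketch below is illustrative only; the `Sample` shape and the `breachedFor` name are hypothetical.

```typescript
interface Sample {
  atMs: number;  // sample timestamp in milliseconds
  value: number; // metric value, e.g. CDC lag p99 in seconds
}

// True when the trailing run of samples above `threshold` spans at least `windowMs`.
// Samples must be sorted by atMs in ascending order.
function breachedFor(samples: Sample[], threshold: number, windowMs: number): boolean {
  let latest: number | undefined;
  let breachStart: number | undefined;
  for (let i = samples.length - 1; i >= 0; i--) {
    const s = samples[i];
    if (s === undefined || s.value <= threshold) break; // the run of breaches ends here
    if (latest === undefined) latest = s.atMs;
    breachStart = s.atMs;
  }
  return latest !== undefined && breachStart !== undefined && latest - breachStart >= windowMs;
}
```

For the "CDC lag p99 per read model" row, this would be called with `threshold = 1` and `windowMs = 5 * 60 * 1000`. A single dip below the threshold resets the run.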
Day-1 capacity validation
Before Matter accepts production traffic, the P0 load-test baseline (tooling/load/scenarios/) validates that Stage 1 capacity is real. Concrete drills:
- Throughput. Sustained 500 read QPS + 50 write QPS for 30 minutes against a single-primary deploy with read replicas off. All SLOs met.
- Mixed workload. 80 % reads / 15 % cached-reads / 5 % writes across operation classes. All SLOs met.
- Surge. 5× baseline for 10 minutes. Surge mode auto-engages (P0.E15); graceful degradation observed; critical paths unaffected.
- Pool exhaustion. 10× baseline. Pool sized per P0.E13; load shedding (P0.E19) engages cleanly; no cascading failures.
- Cold start. Continuous low-volume traffic from cold deploy. p999 within budget after warmup completes.
- CDC throughput. 10k writes/sec through CDC. Read-model lag p99 < 1 s.
Drill outcomes recorded at apps/api/__gates__/p0/capacity-baseline.md.
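As a sanity check on drill sizing, the request volumes can be computed from the rates and durations listed above. The `Drill` descriptor is hypothetical; the real scenario format under tooling/load/scenarios/ may differ. The sketch also assumes that "baseline" in the surge and pool-exhaustion drills means the throughput drill's rates.

```typescript
// Hypothetical drill descriptor; not the actual scenario file format.
interface Drill {
  name: string;
  readQps: number;
  writeQps: number;
  minutes: number;
}

// Rates and duration from the throughput drill.
const throughput: Drill = { name: "throughput", readQps: 500, writeQps: 50, minutes: 30 };

// Total requests a drill issues over its full duration.
function totalRequests(d: Drill): { reads: number; writes: number } {
  const seconds = d.minutes * 60;
  return { reads: d.readQps * seconds, writes: d.writeQps * seconds };
}

// Scales a drill's rates by a factor, e.g. 5 for surge or 10 for pool exhaustion.
function scaled(d: Drill, factor: number, minutes: number): Drill {
  return {
    name: `${d.name}-x${factor}`,
    readQps: d.readQps * factor,
    writeQps: d.writeQps * factor,
    minutes,
  };
}
```

Under these assumptions the throughput drill issues 900,000 reads and 90,000 writes, and a 5× surge for 10 minutes issues 1,500,000 reads and 150,000 writes.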
Failure modes guarded by the stages
Each stage transition is preceded by a failure mode the previous stage cannot handle. Understanding the failure modes is more important than memorising the numbers.
- Stage 1 → 2: read amplification from popular CapTable views. Replicas absorb.
- Stage 2 → 3: sequential scans on the audit chain or event log. Partitions prune.
- Stage 3 → 4: SOC 2 SoD findings, or a customer with regulator-required isolation. Dedicated tiers.
- Stage 4 → 5: GDPR residency, or latency from a non-US customer base. Multi-region.
- Stage 5 → 6: webhook fan-out becomes the bottleneck independent of API traffic. Dedicated cluster.
Multi-region operational notes
Once at Stage 5:
- Customer's region is pinned at organisation creation via POST /v1/portfolios. The region column on every tenant row determines routing.
- Cross-region reads (e.g., listing entities by org_id where the org spans regions, which is impossible by default policy) require an admin endpoint with elevated audit.
- Cross-region writes are forbidden for ordinary mutations. Migration between regions is a planned event with explicit customer comms and a dedicated runbook.
- Each region carries its own KMS sub-region (data and keys do not leave the region) and its own Sigstore Rekor witness (audit anchoring stays regional).
- The customer-facing SLA applies per-region. A US outage does not trigger EU credits.
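The cross-region write rule above can be expressed as a guard that runs before any mutation. This is a sketch under stated assumptions: the `WriteContext` shape, the `isPlannedMigration` flag, and the lowercase region codes are hypothetical, chosen only to illustrate "forbidden for ordinary mutations, allowed for planned migrations".

```typescript
type Region = "us" | "eu";

interface WriteContext {
  orgRegion: Region;           // region pinned at organisation creation
  servingRegion: Region;       // region handling the request
  isPlannedMigration: boolean; // hypothetical flag, set only by migration tooling
}

type Decision = { allowed: true } | { allowed: false; reason: string };

// Allows in-region writes and planned migrations; rejects every other cross-region write.
function checkWrite(ctx: WriteContext): Decision {
  if (ctx.orgRegion === ctx.servingRegion) return { allowed: true };
  if (ctx.isPlannedMigration) return { allowed: true };
  return {
    allowed: false,
    reason: `cross-region write: org pinned to ${ctx.orgRegion}, served from ${ctx.servingRegion}`,
  };
}
```

Returning a reason string with the rejection gives the caller something to record in the audit trail. How rejections are actually audited is not specified here.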
How to plan a stage transition
- Quarterly capacity review. Platform owns. Reviews leading indicators against thresholds. Identifies any indicator approaching its alert threshold.
- RFC. If a transition is forecast within 6 months, the proposer drafts an RFC at apps/docs/rfcs/ describing the trigger, the implementation work, the customer comms plan, the rollback plan.
- API Council approval. Discussed in the weekly forum.
- Production-readiness review. Before the transition completes, the four audits gate the work as if it were a feature phase.
- Customer comms. Stage transitions that change customer-visible behaviour (Stage 5 multi-region, Stage 4 dedicated tier) are pre-announced via email + status page.
- Game day. Cross-team drill against the transition's runbook before the transition is performed live.
See also
- Architecture overview — the system that scales.
- Bounded contexts — context boundaries that survive across stages.
- Matter API SLOs — what every stage must continue to meet.
- Customer SLA — what credits a customer earns if SLOs are missed.