Capacity plan

Last updated Jun 14, 2026

Matter scales in stages, not gradients. Each stage has a concrete trigger condition derived from leading indicators; entering a stage is a planning event with a documented runbook. SLOs do not change across stages — the customer sees the same latency and availability budget whether we are running a single Postgres primary or a dedicated webhook delivery cluster.

Leading indicators are tracked from P0 (packages/observability/src/leading-indicators.ts); dashboards surface them at P11.5. Capacity planning is a quarterly review owned by the Platform context.

The six stages

Stage 1 — Single region, single primary (default)

The Day-0 deployment. One Vercel project, one Postgres primary (Neon serverless adapter), one Upstash Redis, one KMS region, one Sigstore Rekor witness.

Trigger to enter: the default. Stage 1 is where every new deployment starts.
Trigger to exit (to Stage 2): read QPS exceeds 500 sustained, OR p99 read latency drift exceeds 20 % over the SLO baseline, OR primary CPU exceeds 50 % sustained.
Capacity ceiling: ~5k QPS reads, ~500 QPS writes, ~50k entities, ~10M events/month.
Runbook: none required — this is the starting state.
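
As an illustration, the Stage 1 exit check could be sketched as below. This is a minimal sketch, not the contents of leading-indicators.ts: the Stage1Snapshot shape and the function names are hypothetical, the caller is assumed to pass values already averaged over whatever "sustained" window the team uses, and the primary-CPU condition is taken from the Stage 2 entry trigger.

```typescript
// Hypothetical snapshot shape; the source does not define how indicators are sampled.
interface Stage1Snapshot {
  readQps: number;           // read QPS, already averaged over the "sustained" window
  readP99Ms: number;         // observed p99 read latency in milliseconds
  readP99BaselineMs: number; // SLO baseline p99 read latency in milliseconds
  primaryCpu: number;        // primary CPU utilisation as a fraction (0..1)
}

// Thresholds copied from the Stage 1 exit / Stage 2 entry triggers.
const READ_QPS_LIMIT = 500;
const P99_DRIFT_LIMIT = 0.2;
const PRIMARY_CPU_LIMIT = 0.5;

// Relative drift of the observed p99 over the baseline (0.2 means 20 % over).
function p99Drift(observedMs: number, baselineMs: number): number {
  if (baselineMs <= 0) throw new RangeError("baseline must be positive");
  return (observedMs - baselineMs) / baselineMs;
}

// Returns the triggers that fired; an empty list means "stay in Stage 1".
function stage1ExitReasons(s: Stage1Snapshot): string[] {
  const reasons: string[] = [];
  if (s.readQps > READ_QPS_LIMIT) reasons.push("read-qps");
  if (p99Drift(s.readP99Ms, s.readP99BaselineMs) > P99_DRIFT_LIMIT) reasons.push("p99-drift");
  if (s.primaryCpu > PRIMARY_CPU_LIMIT) reasons.push("primary-cpu");
  return reasons;
}
```

Because the triggers are joined by OR, any non-empty result is enough to open the planning event.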

Stage 2 — Read replicas

Read traffic routes to one or more Postgres read replicas via the withReplica() helper in packages/database/src/replica-router.ts (harness shipped at P0.H1). Operations declaring x-matter-consistency: eventual may use replicas; operations declaring strict always route to the primary.

Trigger to enter: read QPS > 500 sustained, OR p99 read latency drift > 20 % over baseline, OR primary CPU > 50 % sustained.
Trigger to exit (to Stage 3): Event or AuditEntry row count exceeds 100M, OR write p99 exceeds 50 % of SLO ceiling.
Capacity ceiling: ~50k QPS reads, write capacity unchanged.
Runbook: apps/docs/content/docs/runbooks/region-failover.mdx (Stage 2 setup), apps/docs/content/docs/runbooks/cdc-lag-recovery.mdx (lag handling).
SLO impact: read p99 may rise by ≤ 50 ms due to replica lag; absorbed by the read SLO budget.
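
The routing rule can be sketched as a small pure function. This is not the withReplica() helper itself, whose signature is not shown here; the type and function names are hypothetical, and the fallback to the primary when no healthy replica is available is an assumption rather than documented behaviour.

```typescript
type Consistency = "strict" | "eventual";
type Target = "primary" | "replica";

// Treats anything other than an explicit "eventual" declaration as strict,
// so an operation without a declaration never reads stale data (assumption).
function parseConsistency(declared: string | undefined): Consistency {
  return declared !== undefined && declared.trim().toLowerCase() === "eventual"
    ? "eventual"
    : "strict";
}

// Decides where a database call goes.
// - Writes always go to the primary.
// - Strict reads always go to the primary.
// - Eventual reads go to a replica when one is healthy, otherwise the primary.
function chooseTarget(consistency: Consistency, isWrite: boolean, replicaHealthy: boolean): Target {
  if (isWrite || consistency === "strict") return "primary";
  return replicaHealthy ? "replica" : "primary";
}
```

Keeping the decision separate from connection handling makes it easy to unit-test the consistency contract without a database.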

Stage 3 — Partition hot tables

Hot tables (Event, AuditEntry, IdempotencyRecord) are range-partitioned: Event by (orgId hash, week), AuditEntry by (orgId hash, month), and IdempotencyRecord by created_at (daily).

Partitioning migration ships at P0.E10. Rolling partition creation is automated by a cron. Partition pruning is verified by EXPLAIN-stability tests.

Trigger to enter: Event or AuditEntry > 100M rows, OR write p99 > 50 % SLO ceiling.
Trigger to exit (to Stage 4): a high-trust customer requires a dedicated DB tier, OR a SOC 2 separation-of-duties finding requires it.
Capacity ceiling: ~10B rows per partitioned table without latency degradation.
Runbook: apps/docs/content/docs/runbooks/partition-split.mdx.
SLO impact: none expected; partitioning preserves latency if pruning works correctly. The EXPLAIN-stability test guards against pruning regressions.
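
To make the (orgId hash, week) key for Event concrete, here is a sketch of how a row could be mapped to a partition name. Everything specific in it is an assumption for illustration: the bucket count of 16, the FNV-1a hash, the Monday-UTC week boundary, and the naming scheme are not taken from the P0.E10 migration.

```typescript
// Assumption: number of hash buckets per week; the real value is not documented here.
const HASH_BUCKETS = 16;

// 32-bit FNV-1a over UTF-16 code units (equals byte-wise FNV-1a for ASCII ids).
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Returns the Monday (UTC) of the week containing `d`, as YYYY-MM-DD.
function weekStartUtc(d: Date): string {
  const daysSinceMonday = (d.getUTCDay() + 6) % 7; // getUTCDay(): 0 = Sunday
  const monday = new Date(
    Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate() - daysSinceMonday),
  );
  return monday.toISOString().slice(0, 10);
}

// Hypothetical partition name: event_h<bucket>_w<YYYYMMDD of week start>.
function eventPartitionName(orgId: string, createdAt: Date): string {
  const bucket = fnv1a32(orgId) % HASH_BUCKETS;
  const week = weekStartUtc(createdAt).replace(/-/g, "");
  return `event_h${bucket}_w${week}`;
}
```

Two events for the same organisation in the same week land in the same partition, which is what lets a query filtered on orgId and a time range prune to a handful of partitions.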

Stage 4 — Per-tenant DB tier ladder

High-trust customers move to a dedicated tier. Three tiers: shared (default), dedicated-pool (separate connection pool, same DB cluster), dedicated-cluster (separate DB cluster, separate KMS sub-region). Tier is set at portfolio level via the dashboard or via POST /v1/portfolios/{id}/tier.

Trigger to enter: a customer requires dedicated infrastructure, OR a SOC 2 finding requires separation of duties.
Trigger to exit (to Stage 5): EU-residency customer signed, OR APAC p99 exceeds 1.5× US p99.
Capacity ceiling: per-customer ceilings are set by each customer's contract; aggregate cluster capacity scales horizontally.
Runbook: apps/docs/content/docs/runbooks/tenant-tier-migration.mdx.
SLO impact: dedicated tiers carry the same SLOs as shared. Migration between tiers is a one-time event with explicit customer comms.
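
The tier ladder can be modelled as a small union type. The sketch below is illustrative only: the request body shape ({ tier }) for POST /v1/portfolios/{id}/tier is an assumption, as are the type and function names. Only the three tier names and what each one isolates come from the description above.

```typescript
type DbTier = "shared" | "dedicated-pool" | "dedicated-cluster";

interface Isolation {
  ownConnectionPool: boolean;
  ownDbCluster: boolean;
  ownKmsSubRegion: boolean;
}

// What each tier isolates, following the tier descriptions in this section.
function isolationFor(tier: DbTier): Isolation {
  switch (tier) {
    case "shared":
      return { ownConnectionPool: false, ownDbCluster: false, ownKmsSubRegion: false };
    case "dedicated-pool":
      return { ownConnectionPool: true, ownDbCluster: false, ownKmsSubRegion: false };
    case "dedicated-cluster":
      return { ownConnectionPool: true, ownDbCluster: true, ownKmsSubRegion: true };
  }
}

// Hypothetical request descriptor; the real payload schema is not documented here.
interface TierChangeRequest {
  method: "POST";
  path: string;
  body: { tier: DbTier };
}

function buildTierChange(portfolioId: string, tier: DbTier): TierChangeRequest {
  return {
    method: "POST",
    path: `/v1/portfolios/${encodeURIComponent(portfolioId)}/tier`,
    body: { tier },
  };
}
```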

Stage 5 — Multi-region

US (default) plus EU. Per-region Postgres, per-region KMS, per-region Sigstore Rekor witness. Logical sharding by region column. Customer's region is pinned at organisation creation; data routing follows.

Trigger to enter: EU-residency customer signed (data must not leave EU), OR APAC p99 > 1.5× US p99 (latency-driven expansion).
Trigger to exit (to Stage 6): webhook deliveries > 1M/day, OR receiver-side webhook p99 > 2× SLO ceiling for ≥ 30 minutes sustained.
Capacity ceiling: per-region capacity unchanged from Stage 4; cross-region capacity additive.
Runbook: apps/docs/content/docs/runbooks/region-failover.mdx.
SLO impact: cross-region writes (rare; only for migrations) carry a relaxed p99 documented in the consistency contract. Per-region operations meet their SLO targets.

Stage 6 — Dedicated webhook delivery cluster

Outbound webhook delivery moves to its own worker pool. Sharded by (endpoint_id MOD N). Independent backpressure per shard. Independent scaling from API request handling.

Trigger to enter: webhook deliveries > 1M/day, OR webhook p99 receiver-side > 2× SLO ceiling for ≥ 30 minutes sustained.
Trigger to exit: none anticipated. Stage 6 is horizontally scalable.
Capacity ceiling: scales with worker pool size; per-shard ceiling ~100k deliveries/day with current sharding constants.
Runbook: P11.25 implementation.
SLO impact: webhook p99 maintained as event volume grows.
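
The sharding rule and the per-shard ceiling give a simple sizing calculation. The sketch below assumes numeric endpoint ids and evenly distributed traffic, neither of which is stated above; the function names are hypothetical.

```typescript
// Per-shard ceiling quoted above for the current sharding constants.
const PER_SHARD_DAILY_CEILING = 100_000;

// Shard assignment: endpoint_id MOD N. Assumes non-negative integer ids.
function shardFor(endpointId: number, shardCount: number): number {
  if (!Number.isInteger(endpointId) || endpointId < 0) {
    throw new RangeError("endpointId must be a non-negative integer");
  }
  if (!Number.isInteger(shardCount) || shardCount < 1) {
    throw new RangeError("shardCount must be a positive integer");
  }
  return endpointId % shardCount;
}

// Smallest N that keeps every shard under the ceiling, assuming even load.
function minShards(deliveriesPerDay: number): number {
  return Math.max(1, Math.ceil(deliveriesPerDay / PER_SHARD_DAILY_CEILING));
}
```

At the entry trigger of 1M deliveries/day this yields 10 shards. Real traffic is rarely uniform across endpoints, so headroom above this minimum is likely needed.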

Leading indicators

These metrics drive stage-transition decisions. Tracked from P0; dashboards land at P11.5. Quarterly review owned by Platform context.

Indicator	Stage trigger or purpose	Source	Alert threshold
Read replica lag p99	Stage 2 (operational)	CDC consumer metrics	> 1 s sustained
AuditEntry row growth rate	Stage 3 trigger	DB metrics	> 1M / month sustained
Event row growth rate	Stage 3 + 6 trigger	DB metrics	> 5M / month sustained
Sandbox vs Live write ratio	Capacity planning	App metrics	per-org outliers > 100x
Webhook queue depth p99 per endpoint	Stage 6 trigger	Webhook worker metrics	> 1000 queued sustained
KMS API call rate	Rotation-window awareness	KMS metrics	> 80 % of plan quota
CDC lag p99 per read model	Stage 2 + DB health	CDC worker metrics	> 1 s for 5 min
Postgres connection-pool utilization per role	Capacity planning	DB metrics	> 80 % sustained
Per-region p99 latency	Stage 5 trigger	External probes	non-US region > 1.5× US
Cost-per-request per operation	Cost engineering	Cost tracker	> 110 % of budget
Cache hit ratio per cached operation	Cache health	Cache metrics	< 80 % of target
MCP tool-call rate per tier	Anomaly + capacity	MCP metrics	tier outliers
Deprecated-endpoint usage per customer	Deprecation comms	Endpoint metrics	> 0 monthly active
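
A quarterly review over these thresholds might be automated along the following lines. This is a sketch under stated assumptions, not the contents of leading-indicators.ts: the rule names are hypothetical, only rows with a single numeric threshold are encoded, and the "approaching" band (within 20 % of the threshold) is an invented convention for illustration.

```typescript
type Direction = "above" | "below";
type Status = "ok" | "approaching" | "breached";

interface IndicatorRule {
  name: string;          // hypothetical metric key
  threshold: number;     // alert threshold from the table
  breachWhen: Direction; // which side of the threshold counts as a breach
}

// A subset of the table, limited to rows with a plain numeric threshold.
const RULES: IndicatorRule[] = [
  { name: "read-replica-lag-p99-seconds", threshold: 1, breachWhen: "above" },
  { name: "audit-entry-rows-per-month", threshold: 1_000_000, breachWhen: "above" },
  { name: "event-rows-per-month", threshold: 5_000_000, breachWhen: "above" },
  { name: "webhook-queue-depth-p99", threshold: 1000, breachWhen: "above" },
  { name: "pool-utilisation-fraction", threshold: 0.8, breachWhen: "above" },
  { name: "kms-quota-fraction", threshold: 0.8, breachWhen: "above" },
];

// Assumption: an indicator within 20 % of its threshold is "approaching".
const APPROACH_MARGIN = 0.2;

function classify(rule: IndicatorRule, value: number): Status {
  const breached =
    rule.breachWhen === "above" ? value > rule.threshold : value < rule.threshold;
  if (breached) return "breached";
  const distance = Math.abs(rule.threshold - value) / rule.threshold;
  return distance <= APPROACH_MARGIN ? "approaching" : "ok";
}

// Classifies every indicator for which a sample was supplied.
function review(samples: Record<string, number>): Record<string, Status> {
  const out: Record<string, Status> = {};
  for (const rule of RULES) {
    const value = samples[rule.name];
    if (value !== undefined) out[rule.name] = classify(rule, value);
  }
  return out;
}
```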

Day-1 capacity validation

Before Matter accepts production traffic, the P0 load-test baseline (tooling/load/scenarios/) validates that Stage 1 capacity is real. Concrete drills:

Throughput. Sustained 500 read QPS + 50 write QPS for 30 minutes against a single-primary deploy with read replicas off. All SLOs met.
Mixed workload. 80 % reads / 15 % cached-reads / 5 % writes across operation classes. All SLOs met.
Surge. 5× baseline for 10 minutes. Surge mode auto-engages (P0.E15); graceful degradation observed; critical paths unaffected.
Pool exhaustion. 10× baseline. Pool sized per P0.E13; load shedding (P0.E19) engages cleanly; no cascading failures.
Cold start. Continuous low-volume traffic from cold deploy. p999 within budget after warmup completes.
CDC throughput. 10k writes/sec through CDC. Read-model lag p99 < 1 s.

Drill outcomes recorded at apps/api/__gates__/p0/capacity-baseline.md.
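
For sizing the load generator, the drill parameters translate directly into request counts. The sketch below assumes that "baseline" in the surge drill means the throughput drill's 500 read QPS + 50 write QPS, which this page does not state explicitly; the types and names are hypothetical and are not the format used in tooling/load/scenarios/.

```typescript
interface Drill {
  name: string;
  readQps: number;
  writeQps: number;
  durationSec: number;
}

// Assumption: "baseline" means the throughput drill's rates.
const BASELINE_READ_QPS = 500;
const BASELINE_WRITE_QPS = 50;

// Builds a drill as a multiple of baseline for a duration given in minutes.
function drill(name: string, multiplier: number, durationMin: number): Drill {
  return {
    name,
    readQps: BASELINE_READ_QPS * multiplier,
    writeQps: BASELINE_WRITE_QPS * multiplier,
    durationSec: durationMin * 60,
  };
}

const THROUGHPUT = drill("throughput", 1, 30);
const SURGE = drill("surge", 5, 10);

// Total requests the load generator must issue over the whole drill.
function totalRequests(d: Drill): { reads: number; writes: number; total: number } {
  const reads = d.readQps * d.durationSec;
  const writes = d.writeQps * d.durationSec;
  return { reads, writes, total: reads + writes };
}
```

Under these assumptions the throughput drill issues 900,000 reads and 90,000 writes, and the surge drill issues 1,650,000 requests in total.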

Failure modes guarded by the stages

Each stage transition is preceded by a failure mode the previous stage cannot handle. Understanding the failure modes is more important than memorising the numbers.

Stage 1 → 2: read amplification from popular CapTable views. Replicas absorb.
Stage 2 → 3: sequential scans on the audit chain or event log. Partitions prune.
Stage 3 → 4: SOC 2 SoD findings, or a customer with regulator-required isolation. Dedicated tiers.
Stage 4 → 5: GDPR residency, or latency from a non-US customer base. Multi-region.
Stage 5 → 6: webhook fan-out becomes the bottleneck independent of API traffic. Dedicated cluster.

Multi-region operational notes

Once at Stage 5:

Customer's region is pinned at organisation creation via POST /v1/portfolios. The region column on every tenant row determines routing.
Cross-region reads require an admin endpoint with elevated audit. Under the default policy an org cannot span regions, so a case such as listing entities by org_id across regions does not arise in normal operation.
Cross-region writes are forbidden for ordinary mutations. Migration between regions is a planned event with explicit customer comms and a dedicated runbook.
Each region carries its own KMS sub-region (data and keys do not leave the region) and its own Sigstore Rekor witness (audit anchoring stays regional).
The customer-facing SLA applies per-region. A US outage does not trigger EU credits.
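
The pinning and write rules above can be expressed as a guard that runs before any mutation. This is a minimal sketch: the resource identifiers are placeholders, and the type and function names are hypothetical rather than taken from the codebase.

```typescript
type Region = "us" | "eu";

interface RegionResources {
  postgres: string;
  kms: string;
  rekor: string;
}

// Placeholder identifiers only; each region has its own database, KMS and witness.
const REGION_RESOURCES: Record<Region, RegionResources> = {
  us: { postgres: "pg-us", kms: "kms-us", rekor: "rekor-us" },
  eu: { postgres: "pg-eu", kms: "kms-eu", rekor: "rekor-eu" },
};

type WriteDecision =
  | { kind: "allowed"; resources: RegionResources }
  | { kind: "rejected"; reason: string };

// Ordinary mutations are only allowed in the region the tenant row is pinned to.
function routeWrite(rowRegion: Region, servingRegion: Region): WriteDecision {
  if (rowRegion !== servingRegion) {
    return {
      kind: "rejected",
      reason: `row pinned to ${rowRegion}, request served in ${servingRegion}`,
    };
  }
  return { kind: "allowed", resources: REGION_RESOURCES[rowRegion] };
}
```

Returning a decision value instead of throwing keeps the rejection reason available for the audit trail.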

How to plan a stage transition

Quarterly capacity review. Platform owns the review, compares leading indicators against their thresholds, and flags any indicator approaching its alert threshold.
RFC. If a transition is forecast within 6 months, the proposer drafts an RFC at apps/docs/rfcs/ describing the trigger, the implementation work, the customer comms plan, and the rollback plan.
API Council approval. The RFC is discussed in the weekly forum.
Production-readiness review. Before the transition completes, the four audits gate the work as if it were a feature phase.
Customer comms. Stage transitions that change customer-visible behaviour (Stage 5 multi-region, Stage 4 dedicated tier) are pre-announced via email + status page.
Game day. Cross-team drill against the transition's runbook before the transition is performed live.
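
The "forecast within 6 months" test in the RFC step can be approximated with a linear extrapolation. This is a sketch, assuming constant monthly growth, which real workloads rarely follow exactly; the function names are hypothetical.

```typescript
// RFC horizon from the planning steps above.
const RFC_HORIZON_MONTHS = 6;

// Months until `current` reaches `threshold` at a constant monthly growth rate.
// Returns 0 if already at or past the threshold, Infinity if not growing.
function monthsUntilThreshold(current: number, growthPerMonth: number, threshold: number): number {
  if (current >= threshold) return 0;
  if (growthPerMonth <= 0) return Infinity;
  return (threshold - current) / growthPerMonth;
}

// True when the linear forecast crosses the threshold inside the RFC horizon.
function needsRfc(current: number, growthPerMonth: number, threshold: number): boolean {
  return monthsUntilThreshold(current, growthPerMonth, threshold) <= RFC_HORIZON_MONTHS;
}
```

For example, an Event table at 70M rows growing by 5M rows per month reaches the 100M Stage 3 trigger in 6 months, so an RFC would be due now. At 40M rows the same growth rate gives 12 months of headroom.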
