Evaluation

Last updated Jun 14, 2026

Every MCP server should ship a stable evaluation. Matter's evaluation runs ten read-only, multi-hop questions against a versioned fixture dataset.

Running an evaluation

pip install -r scripts/requirements.txt
export ANTHROPIC_API_KEY=sk-ant-...
export MATTER_EVAL_KEY=sk_test_eval_matter_fixture_R8fQ3gN5

python scripts/evaluation.py \
  -t http \
  -u https://mcp.mattermode.com \
  -H "Authorization: Bearer $MATTER_EVAL_KEY" \
  -H "Matter-Version: 2026-05-01" \
  apps/docs/mcp/evaluation.xml

Output summary:

Summary
  accuracy: 10/10
  duration: 3m 42s
  tool calls: 47
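
If a script needs these numbers (for example, to track duration across releases), the summary block can be parsed with a few regular expressions. This is a minimal sketch that assumes the output always follows the shape shown above. The function name `parse_summary` and the returned field names are illustrative and not part of scripts/evaluation.py.

```python
import re


def parse_summary(text: str) -> dict:
    """Parse the 'Summary' block printed by the evaluation run.

    Assumption: the block always contains 'accuracy: N/M',
    'duration: Xm Ys' (minutes optional) and 'tool calls: K'.
    """
    acc = re.search(r"accuracy:\s*(\d+)\s*/\s*(\d+)", text)
    dur = re.search(r"duration:\s*(?:(\d+)m\s*)?(\d+)s", text)
    calls = re.search(r"tool calls:\s*(\d+)", text)
    if not (acc and dur and calls):
        raise ValueError("summary block is missing a field")
    minutes = int(dur.group(1) or 0)
    return {
        "correct": int(acc.group(1)),
        "total": int(acc.group(2)),
        "duration_seconds": minutes * 60 + int(dur.group(2)),
        "tool_calls": int(calls.group(1)),
    }


# The sample output from this page.
SAMPLE = """Summary
  accuracy: 10/10
  duration: 3m 42s
  tool calls: 47
"""

if __name__ == "__main__":
    print(parse_summary(SAMPLE))
```

For the sample above, the parsed duration is 222 seconds (3 × 60 + 42).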

The fixture dataset

The sk_test_eval_… key exposes a versioned fixture containing:

Waypoint Systems, Inc. — Delaware C-Corp, active, formed 2026-04-01, 2 founders (Jane Doe CEO 80%, Michael Smith CTO 20%), 10M authorized common, 2M option pool, one 409A on file.
Corestar Enterprises — Delaware C-Corp, active, formed 2024-07-15, multiple amendments filed.
Studio42 portfolio — 8 portfolio entities spanning formation through dissolution.
A pending CorporateTransaction — merger between Waypoint and Corestar, currently in definitive stage.
An entity mid-dissolution — fixture_wd_42, status winding_down, three filings submitted.

The fixture is immutable per release, so every evaluation answer is stable. See the fixture dataset for the full inventory.

Target thresholds

Model	Target
Claude Opus	10/10
Claude Sonnet	≥9/10
Claude Haiku	≥8/10

Below-threshold runs surface as regressions in CI and block Matter MCP server releases until addressed.
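
The gate itself is a simple comparison against the table above. The following is a sketch of how such a check could look, not the actual CI implementation: the names `THRESHOLDS`, `meets_target`, and `regressions` are hypothetical, and only the per-model targets come from this page.

```python
# Minimum number of correct answers (out of ten) per model,
# taken from the target table on this page.
THRESHOLDS = {
    "Claude Opus": 10,
    "Claude Sonnet": 9,
    "Claude Haiku": 8,
}


def meets_target(model: str, correct: int) -> bool:
    """Return True if a run's score reaches the model's target."""
    if not 0 <= correct <= 10:
        raise ValueError("score must be between 0 and 10")
    return correct >= THRESHOLDS[model]


def regressions(results: dict) -> list:
    """List the models whose score falls below target, in input order."""
    return [m for m, score in results.items() if not meets_target(m, score)]


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    failed = regressions({"Claude Opus": 10, "Claude Sonnet": 8, "Claude Haiku": 8})
    # A non-zero exit code is one common way to fail a CI job.
    raise SystemExit(1 if failed else 0)
```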

What failure looks like

When the evaluation fails, common causes include:

Tool descriptions too terse — agent can't find the right tool.
Pagination defaults too small — agent times out iterating.
Error messages too generic — agent can't recover from a routine 409.
Missing expand[] hints — agent doesn't know it can inflate related resources.

Every one of these surfaces as feedback in the evaluation report. The runbook is in apps/mcp/CONTRIBUTING.md.

The questions

See apps/docs/mcp/evaluation.xml for the full XML. Ten questions, all read-only, all with single verifiable answers.
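
A quick sanity check on the question file can catch a missing answer before a full run. The sketch below assumes a schema of `<qa_pair>` elements, each holding one `<question>` and one `<answer>`. That schema is an assumption, so check it against the real file. The sample question is illustrative and not taken from evaluation.xml, though its answer matches the fixture description above.

```python
import xml.etree.ElementTree as ET


def load_questions(xml_text: str) -> list:
    """Return (question, answer) pairs from an evaluation document.

    Assumed schema: <qa_pair><question>…</question><answer>…</answer></qa_pair>
    """
    root = ET.fromstring(xml_text)
    pairs = []
    for qa in root.iter("qa_pair"):
        question = (qa.findtext("question") or "").strip()
        answer = (qa.findtext("answer") or "").strip()
        if not question or not answer:
            raise ValueError("every qa_pair needs one question and one answer")
        pairs.append((question, answer))
    return pairs


# Illustrative document with a single hypothetical question.
SAMPLE_XML = """<evaluation>
  <qa_pair>
    <question>Which fixture entity has status winding_down?</question>
    <answer>fixture_wd_42</answer>
  </qa_pair>
</evaluation>"""

if __name__ == "__main__":
    for question, answer in load_questions(SAMPLE_XML):
        print(question, "->", answer)
```

Against the real file, the same loader could also assert that exactly ten pairs are present.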
