MCP
Evaluation
10 read-only questions against the fixture dataset. Target: ≥9/10 accuracy on Claude Sonnet.
Every MCP server should ship a stable evaluation. Matter's evaluation runs ten read-only, multi-hop questions against a versioned fixture dataset.
Running an evaluation
pip install -r scripts/requirements.txt
export ANTHROPIC_API_KEY=sk-ant-...
export MATTER_EVAL_KEY=sk_test_eval_matter_fixture_R8fQ3gN5
python scripts/evaluation.py \
-t http \
-u https://mcp.mattermode.com \
-H "Authorization: Bearer $MATTER_EVAL_KEY" \
-H "Matter-Version: 2026-05-01" \
apps/docs/mcp/evaluation.xml

Output summary:
Summary
accuracy: 10/10
duration: 3m 42s
tool calls: 47

The fixture dataset
The sk_test_eval_… key exposes a versioned fixture containing:
- Waypoint Systems, Inc.: Delaware C-Corp, active, formed 2026-04-01, 2 founders (Jane Doe CEO 80%, Michael Smith CTO 20%), 10M authorized common, 2M option pool, one 409A on file.
- Corestar Enterprises: Delaware C-Corp, active, formed 2024-07-15, multiple amendments filed.
- Studio42 portfolio: 8 portfolio entities spanning formation through dissolution.
- A pending `CorporateTransaction`: merger between Waypoint and Corestar, currently in `definitive` stage.
- An entity mid-dissolution: `fixture_wd_42`, status `winding_down`, three filings submitted.
The fixture is immutable per release, so every evaluation answer is stable. See the fixture dataset for the full inventory.
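To illustrate why an immutable fixture yields stable answers, here is a minimal sketch that models a slice of the fixture as frozen Python dataclasses and answers one multi-hop question against it. The class names, field names, and the question itself are hypothetical; they are not taken from `evaluation.xml` or from the Matter API. Only the values (formation dates, founders, percentages, merger stage) come from the list above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Founder:
    name: str
    title: str
    ownership_pct: int


@dataclass(frozen=True)
class Entity:
    name: str
    formed: date
    founders: tuple[Founder, ...] = ()


@dataclass(frozen=True)
class Transaction:
    kind: str
    stage: str
    parties: tuple[Entity, ...]


# Values copied from the fixture description; the shapes are assumptions.
WAYPOINT = Entity(
    "Waypoint Systems, Inc.",
    date(2026, 4, 1),
    (Founder("Jane Doe", "CEO", 80), Founder("Michael Smith", "CTO", 20)),
)
# The page does not list Corestar's founders, so none are modeled here.
CORESTAR = Entity("Corestar Enterprises", date(2024, 7, 15))
MERGER = Transaction("merger", "definitive", (WAYPOINT, CORESTAR))


def ceo_share_of_newest_party(tx: Transaction) -> int:
    """Hypothetical multi-hop question: transaction -> newest party -> CEO -> share."""
    newest = max(tx.parties, key=lambda entity: entity.formed)
    ceo = next(f for f in newest.founders if f.title == "CEO")
    return ceo.ownership_pct


if __name__ == "__main__":
    print(ceo_share_of_newest_party(MERGER))
```

Because every record is frozen and the data never changes within a release, the same question always resolves to the same value, which is what makes a fixed answer key workable.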
Target thresholds
| Model | Target |
|---|---|
| Claude Opus | 10/10 |
| Claude Sonnet | ≥9/10 |
| Claude Haiku | ≥8/10 |
Below-threshold runs surface as regressions in CI — they block Matter MCP server releases until addressed.
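A CI gate along these lines could be sketched as follows. This is an illustration, not Matter's actual CI code: the function names and the `TARGETS` dictionary are hypothetical, and the sketch assumes the summary contains an `accuracy: N/M` line like the sample output shown earlier.

```python
import re

# Targets copied from the table above; the keys are illustrative labels.
TARGETS = {"Claude Opus": 10, "Claude Sonnet": 9, "Claude Haiku": 8}


def parse_accuracy(summary: str) -> tuple[int, int]:
    """Return (correct, total) from a line such as 'accuracy: 10/10'."""
    match = re.search(r"accuracy:\s*(\d+)\s*/\s*(\d+)", summary)
    if match is None:
        raise ValueError("no 'accuracy: N/M' line found in summary")
    return int(match.group(1)), int(match.group(2))


def meets_target(model: str, summary: str) -> bool:
    """True when the run reaches the target score for the given model."""
    correct, total = parse_accuracy(summary)
    if total != 10:
        raise ValueError(f"expected 10 questions, got {total}")
    return correct >= TARGETS[model]


if __name__ == "__main__":
    sample = "Summary\naccuracy: 10/10\nduration: 3m 42s\ntool calls: 47"
    # A real gate would exit non-zero on failure so the release is blocked.
    raise SystemExit(0 if meets_target("Claude Sonnet", sample) else 1)
```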
What failure looks like
When the evaluation fails, common causes include:
- Tool descriptions too terse: the agent can't find the right tool.
- Pagination defaults too small: the agent times out iterating.
- Error messages too generic: the agent can't recover from a routine 409.
- Missing `expand[]` hints: the agent doesn't know it can inflate related resources.
Every one of these surfaces as feedback in the evaluation report. The runbook is in `apps/mcp/CONTRIBUTING.md`.
The questions
See `apps/docs/mcp/evaluation.xml` for the full XML: ten questions, all read-only, each with a single verifiable answer.
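A quick sanity check of the question file might look like the sketch below. It assumes the XML uses `<qa_pair>` elements with `<question>` and `<answer>` children, which this page does not confirm; adjust the tag names to match the real file. The sample document and helper names are placeholders, not real evaluation content.

```python
import xml.etree.ElementTree as ET

# Placeholder document; the tag names are an assumed layout, not the real file.
SAMPLE = """
<evaluation>
  <qa_pair>
    <question>Placeholder question?</question>
    <answer>placeholder answer</answer>
  </qa_pair>
</evaluation>
"""


def load_qa_pairs(xml_text: str) -> list[tuple[str, str]]:
    """Parse question/answer pairs, rejecting empty questions or answers."""
    root = ET.fromstring(xml_text)
    pairs = []
    for pair in root.iter("qa_pair"):
        question = (pair.findtext("question") or "").strip()
        answer = (pair.findtext("answer") or "").strip()
        if not question or not answer:
            raise ValueError("every qa_pair needs a non-empty question and answer")
        pairs.append((question, answer))
    return pairs


def require_count(pairs: list[tuple[str, str]], expected: int = 10) -> None:
    """Raise if the file does not contain the expected number of questions."""
    if len(pairs) != expected:
        raise ValueError(f"expected {expected} questions, found {len(pairs)}")


if __name__ == "__main__":
    pairs = load_qa_pairs(SAMPLE)
    print(len(pairs))
```

Running `require_count(load_qa_pairs(text))` on the real file's contents would confirm it still holds exactly ten questions before an evaluation run.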