mcpbr
Benchmark your MCP server against real GitHub issues. One command, hard numbers.
Model Context Protocol Benchmark Runner
Stop guessing whether your MCP server actually helps. Get hard numbers comparing tool-assisted and baseline agent performance on real GitHub issues.
What You Get
Real metrics showing whether your MCP server improves agent performance on SWE-bench tasks. No vibes, just data.
Evaluation Results
Summary
+-----------------+-----------+----------+
| Metric          | MCP Agent | Baseline |
+-----------------+-----------+----------+
| Resolved        | 8/25      | 5/25     |
| Resolution Rate | 32.0%     | 20.0%    |
+-----------------+-----------+----------+
Improvement: +60.0%
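The numbers in this sample summary are consistent with a relative improvement over the baseline rate: (32.0 - 20.0) / 20.0 = 60%. The sketch below reproduces that arithmetic. The function names are illustrative and not part of mcpbr, and the formula is an assumption inferred from the sample output rather than taken from the tool's source.

```python
def resolution_rate(resolved: int, total: int) -> float:
    """Percentage of tasks resolved."""
    return 100.0 * resolved / total


def relative_improvement(mcp_rate: float, baseline_rate: float) -> float:
    """Percentage change of the MCP rate relative to the baseline rate."""
    if baseline_rate == 0:
        raise ValueError("relative improvement is undefined for a 0% baseline")
    return 100.0 * (mcp_rate - baseline_rate) / baseline_rate


mcp_rate = resolution_rate(8, 25)        # 32.0
baseline_rate = resolution_rate(5, 25)   # 20.0
improvement = relative_improvement(mcp_rate, baseline_rate)
print(f"Improvement: {improvement:+.1f}%")  # Improvement: +60.0%
```

Note that the absolute gap here is 12 percentage points; the 60% figure describes the change relative to the baseline.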
Why mcpbr?
MCP servers promise to make LLMs better at coding tasks. But how do you prove it?
mcpbr runs controlled experiments with the same model, the same tasks, and the same environment, so the only variable is your MCP server. You get:
- Apples-to-apples comparison against a baseline agent
- Real GitHub issues from SWE-bench (not toy examples)
- Reproducible results via Docker containers with pinned dependencies
Quick Start
1. Set your API key
2. Generate a configuration file
3. Edit the configuration
Point it to your MCP server:
mcp_server:
  command: "npx"
  args:
    - "-y"
    - "@modelcontextprotocol/server-filesystem"
    - "{workdir}"
  env: {}
provider: "anthropic"
agent_harness: "claude-code"
model: "sonnet"
dataset: "SWE-bench/SWE-bench_Lite"
sample_size: 10
timeout_seconds: 300
max_concurrent: 4
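The "{workdir}" entry in args reads as a placeholder for the task's working directory. The sketch below shows one way such a placeholder could be expanded into the final server command line. It assumes simple string substitution; the helper name and the "/workspace" path are hypothetical and not taken from mcpbr.

```python
# Mirrors the mcp_server section of the sample configuration.
mcp_server = {
    "command": "npx",
    "args": ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"],
    "env": {},
}


def build_server_argv(server: dict, workdir: str) -> list[str]:
    """Return the command and arguments with {workdir} substituted."""
    args = [arg.replace("{workdir}", workdir) for arg in server["args"]]
    return [server["command"], *args]


# "/workspace" is a made-up path used only for illustration.
argv = build_server_argv(mcp_server, "/workspace")
print(argv)
```

With a placeholder like this, one configuration file can describe the server for every task, even when each task's repository lives at a different path.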
4. Run the evaluation
How It Works
mcpbr runs two parallel evaluations for each SWE-bench task:
- MCP Agent: LLM with access to tools from your MCP server
- Baseline Agent: Same LLM without MCP tools
By comparing resolution rates, you can measure the effectiveness of your MCP server for code exploration and bug fixing.
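Because both agents attempt the same tasks, the outcomes can also be compared task by task, not only as aggregate rates. The following is a minimal sketch of that paired comparison. The task IDs and pass/fail values are invented for illustration, and the result format is an assumption, not mcpbr's actual output schema.

```python
# Hypothetical per-task outcomes: True means the agent's patch resolved the issue.
mcp_results = {"task-a": True, "task-b": True, "task-c": False, "task-d": False}
baseline_results = {"task-a": True, "task-b": False, "task-c": False, "task-d": True}


def compare(mcp: dict[str, bool], baseline: dict[str, bool]) -> dict[str, list[str]]:
    """Group the tasks attempted by both agents by who resolved them."""
    tasks = sorted(mcp.keys() & baseline.keys())
    return {
        "both": [t for t in tasks if mcp[t] and baseline[t]],
        "mcp_only": [t for t in tasks if mcp[t] and not baseline[t]],
        "baseline_only": [t for t in tasks if not mcp[t] and baseline[t]],
        "neither": [t for t in tasks if not mcp[t] and not baseline[t]],
    }


breakdown = compare(mcp_results, baseline_results)
for group, tasks in breakdown.items():
    print(f"{group}: {tasks}")
```

The "mcp_only" and "baseline_only" groups show where the server changed the outcome, which two equal resolution rates could otherwise hide.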
Host Machine
+-----------------------------------------------------------+
|  mcpbr Harness (Python)                                   |
|    - Loads SWE-bench tasks from HuggingFace               |
|    - Pulls pre-built Docker images                        |
|    - Orchestrates agent runs                              |
|    - Collects results and generates reports               |
+----------------------------+------------------------------+
                             | docker exec
+----------------------------v------------------------------+
|  Docker Container (per task)                              |
|    - Repository at correct commit                         |
|    - All dependencies pre-installed                       |
|    - Claude Code CLI runs inside container                |
|    - Generates patches and runs tests                     |
+-----------------------------------------------------------+
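The orchestration step in the diagram can be pictured as a bounded pool of agent runs, two per task (with and without MCP tools). The sketch below is one plausible shape for such a loop, assuming a concurrency cap like the max_concurrent setting. The run_task function is a hypothetical stand-in for a containerized run, not mcpbr's implementation.

```python
import asyncio

MAX_CONCURRENT = 4  # assumed to play the role of max_concurrent in the config


async def run_task(task_id: str, use_mcp: bool) -> dict:
    """Hypothetical stand-in for one agent run inside a task container."""
    await asyncio.sleep(0)  # placeholder for the real work
    return {"task": task_id, "mcp": use_mcp, "resolved": False}


async def run_all(task_ids: list[str]) -> list[dict]:
    """Run both arms for every task, at most MAX_CONCURRENT at a time."""
    limit = asyncio.Semaphore(MAX_CONCURRENT)

    async def guarded(task_id: str, use_mcp: bool) -> dict:
        async with limit:
            return await run_task(task_id, use_mcp)

    jobs = [guarded(t, use_mcp) for t in task_ids for use_mcp in (True, False)]
    return await asyncio.gather(*jobs)


results = asyncio.run(run_all(["task-a", "task-b"]))
print(len(results))  # 4: two tasks, each run with and without MCP
```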
Next Steps
- Installation - Prerequisites and installation options
- Configuration - Full configuration reference
- CLI Reference - All available commands and options
- MCP Integration - How to test your MCP server