mcpbr

pip install mcpbr && mcpbr init && mcpbr run -c mcpbr.yaml -n 1 -v

Benchmark your MCP server against real GitHub issues. One command, hard numbers.



Model Context Protocol Benchmark Runner

Python 3.11+ · MIT License

Stop guessing whether your MCP server actually helps. Get hard numbers comparing tool-assisted vs. baseline agent performance on real GitHub issues.


What You Get

Real metrics showing whether your MCP server improves agent performance on SWE-bench tasks. No vibes, just data.

Evaluation Results

                 Summary
+-----------------+-----------+----------+
| Metric          | MCP Agent | Baseline |
+-----------------+-----------+----------+
| Resolved        | 8/25      | 5/25     |
| Resolution Rate | 32.0%     | 20.0%    |
+-----------------+-----------+----------+

Improvement: +60.0% (relative to baseline)
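The improvement figure is relative: 32.0% is 60% higher than 20.0%. The arithmetic, spelled out:

# Resolution counts from the summary table above.
mcp_resolved, baseline_resolved, total = 8, 5, 25

mcp_rate = mcp_resolved / total            # 0.32 -> 32.0%
baseline_rate = baseline_resolved / total  # 0.20 -> 20.0%

# Relative improvement over the baseline.
improvement = (mcp_rate - baseline_rate) / baseline_rate
print(f"Improvement: {improvement:+.1%}")  # Improvement: +60.0%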

Why mcpbr?

MCP servers promise to make LLMs better at coding tasks. But how do you prove it?

mcpbr runs controlled experiments: same model, same tasks, same environment. The only variable is your MCP server. You get:

  • Apples-to-apples comparison against a baseline agent
  • Real GitHub issues from SWE-bench (not toy examples)
  • Reproducible results via Docker containers with pinned dependencies

Quick Start

1. Set your API key

export ANTHROPIC_API_KEY="your-api-key"

2. Generate a configuration file

mcpbr init

3. Edit the configuration

Point it to your MCP server:

mcp_server:
  command: "npx"
  args:
    - "-y"
    - "@modelcontextprotocol/server-filesystem"
    - "{workdir}"
  env: {}

provider: "anthropic"
agent_harness: "claude-code"
model: "sonnet"
dataset: "SWE-bench/SWE-bench_Lite"
sample_size: 10
timeout_seconds: 300
max_concurrent: 4
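
Before launching a long evaluation, it can help to sanity-check the file. A minimal pre-flight sketch using PyYAML (mcpbr presumably validates the config itself; the key names here are taken from the example above):

import yaml  # pip install pyyaml

# Pre-flight check: assumes the config generated above is named mcpbr.yaml.
with open("mcpbr.yaml") as f:
    cfg = yaml.safe_load(f)

for key in ("mcp_server", "provider", "agent_harness", "model",
            "dataset", "sample_size", "timeout_seconds", "max_concurrent"):
    assert key in cfg, f"missing config key: {key}"

print(cfg["mcp_server"]["command"], cfg["mcp_server"]["args"])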

4. Run the evaluation

mcpbr run --config mcpbr.yaml
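
For a quick smoke test before a full run, the one-liner at the top of this page uses the short flags, which appear to cap the run at one task (-n 1) and enable verbose output (-v):

mcpbr run -c mcpbr.yaml -n 1 -v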

How It Works

mcpbr runs two parallel evaluations for each SWE-bench task:

  1. MCP Agent: LLM with access to tools from your MCP server
  2. Baseline Agent: Same LLM without MCP tools

By comparing resolution rates, you can measure the effectiveness of your MCP server for code exploration and bug fixing.
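
The paired design is easy to express in code. A sketch of the idea, where run_agent is an illustrative stand-in rather than mcpbr's actual API:

# Illustrative sketch of the paired design -- not mcpbr's actual API.
# Each task is evaluated twice under identical conditions; the only
# difference is whether the agent gets the MCP server's tools.

def run_agent(task: str, use_mcp: bool) -> bool:
    """Stub for a full agent run; returns True if the generated
    patch makes the task's tests pass."""
    ...  # real harness: container setup, agent loop, patch, test run
    return False

# Hypothetical example SWE-bench task IDs.
tasks = ["astropy__astropy-12907", "django__django-11099"]

mcp_wins = sum(run_agent(t, use_mcp=True) for t in tasks)
base_wins = sum(run_agent(t, use_mcp=False) for t in tasks)
print(f"MCP: {mcp_wins}/{len(tasks)}  Baseline: {base_wins}/{len(tasks)}")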

Host Machine
+-----------------------------------------------------------+
|                    mcpbr Harness (Python)                 |
|  - Loads SWE-bench tasks from HuggingFace                 |
|  - Pulls pre-built Docker images                          |
|  - Orchestrates agent runs                                |
|  - Collects results and generates reports                 |
+----------------------------+------------------------------+
                             | docker exec
+----------------------------v------------------------------+
|              Docker Container (per task)                  |
|  - Repository at correct commit                           |
|  - All dependencies pre-installed                         |
|  - Claude Code CLI runs inside container                  |
|  - Generates patches and runs tests                       |
+-----------------------------------------------------------+
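
The docker exec boundary in the diagram is the key isolation point: the harness stays on the host and shells into each task container. A minimal sketch of that call, with a hypothetical container name:

import subprocess

# Hypothetical container name; mcpbr manages one container per task.
container = "mcpbr-task-astropy__astropy-12907"

# Run a command inside the task container, as the harness does via docker exec.
result = subprocess.run(
    ["docker", "exec", container, "python", "-m", "pytest", "-x"],
    capture_output=True, text=True,
)
print(result.returncode, result.stdout[-500:])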

Next Steps