
Understanding Evaluation Results

This guide explains how to interpret and analyze mcpbr evaluation results.

Console Output

When running an evaluation, mcpbr displays real-time progress and a final summary.

Verbose Mode (-v)

mcpbr Evaluation
  Config: config.yaml
  Provider: anthropic
  Model: sonnet
  Agent Harness: claude-code
  Dataset: SWE-bench/SWE-bench_Lite
  Sample size: 10
  Run MCP: True, Run Baseline: True
  Pre-built images: True
  Log dir: my-logs

Loading dataset: SWE-bench/SWE-bench_Lite
Evaluating 10 tasks
Provider: anthropic, Harness: claude-code
14:23:15 [MCP] Starting mcp run for astropy-12907:mcp
14:23:22 astropy-12907:mcp    > TodoWrite
14:23:22 astropy-12907:mcp    < Todos have been modified successfully...
14:23:26 astropy-12907:mcp    > Glob
14:23:26 astropy-12907:mcp    > Grep
14:23:27 astropy-12907:mcp    < $WORKDIR/astropy/modeling/separable.py
14:27:43 astropy-12907:mcp    * done turns=31 tokens=115/6,542

Legend:

  • > Tool call started
  • < Tool result received
  • * Run completed

Summary Table

Evaluation Results

                 Summary
+-----------------+-----------+----------+
| Metric          | MCP Agent | Baseline |
+-----------------+-----------+----------+
| Resolved        | 8/25      | 5/25     |
| Resolution Rate | 32.0%     | 20.0%    |
+-----------------+-----------+----------+

Improvement: +60.0%

Per-Task Results
+------------------------+------+----------+-------+
| Instance ID            | MCP  | Baseline | Error |
+------------------------+------+----------+-------+
| astropy__astropy-12907 | PASS |   PASS   |       |
| django__django-11099   | PASS |   FAIL   |       |
| sympy__sympy-18087     | FAIL |   FAIL   |       |
+------------------------+------+----------+-------+

What "Resolved" Means

A task is considered resolved only when all four of the following hold (a programmatic check is sketched after the list):

  1. Patch Generated: The agent produced a non-empty diff
  2. Patch Applied: The diff applies cleanly to the repository
  3. FAIL_TO_PASS Tests Pass: Tests that were failing now pass
  4. PASS_TO_PASS Tests Pass: Existing tests still pass (no regressions)
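
mcpbr already records a resolved flag per run, so you never need to recompute it yourself. The following is only a minimal sketch of how the four criteria combine, assuming the per-task field names documented under Per-Task Results below:

def is_resolved(run: dict) -> bool:
    # 1-2: a non-empty diff was produced and applied cleanly
    if not (run.get("patch_generated") and run.get("patch_applied")):
        return False
    # 3-4: every FAIL_TO_PASS and PASS_TO_PASS test passed
    for key in ("fail_to_pass", "pass_to_pass"):
        tests = run.get(key, {})
        if tests.get("passed", 0) != tests.get("total", -1):
            return False
    return True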

JSON Output

Save structured results with --output:

mcpbr run -c config.yaml -o results.json

Schema

{
  "metadata": {
    "timestamp": "2026-01-17T07:23:39.871437+00:00",
    "config": {
      "model": "sonnet",
      "provider": "anthropic",
      "agent_harness": "claude-code",
      "dataset": "SWE-bench/SWE-bench_Lite",
      "sample_size": 25,
      "timeout_seconds": 600,
      "max_iterations": 30
    },
    "mcp_server": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"]
    }
  },
  "summary": {
    "mcp": {
      "resolved": 8,
      "total": 25,
      "rate": 0.32
    },
    "baseline": {
      "resolved": 5,
      "total": 25,
      "rate": 0.20
    },
    "improvement": "+60.0%"
  },
  "tasks": [...]
}
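
The summary block gives you the headline numbers directly. A short sketch that reads them back, assuming results were saved with -o results.json as above:

import json

with open("results.json") as f:
    results = json.load(f)

mcp = results["summary"]["mcp"]
baseline = results["summary"]["baseline"]
print(f"MCP:      {mcp['resolved']}/{mcp['total']} ({mcp['rate']:.1%})")
print(f"Baseline: {baseline['resolved']}/{baseline['total']} ({baseline['rate']:.1%})")
print(f"Improvement: {results['summary']['improvement']}")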

Per-Task Results

Each task includes detailed metrics:

{
  "instance_id": "astropy__astropy-12907",
  "mcp": {
    "patch_generated": true,
    "tokens": {
      "input": 115,
      "output": 6542
    },
    "iterations": 30,
    "tool_calls": 72,
    "tool_usage": {
      "TodoWrite": 4,
      "Task": 1,
      "Glob": 4,
      "Grep": 11,
      "Bash": 27,
      "Read": 22,
      "Write": 2,
      "Edit": 1
    },
    "resolved": true,
    "patch_applied": true,
    "fail_to_pass": {
      "passed": 2,
      "total": 2
    },
    "pass_to_pass": {
      "passed": 10,
      "total": 10
    }
  },
  "baseline": {
    "patch_generated": true,
    "tokens": {
      "input": 63,
      "output": 7615
    },
    "iterations": 30,
    "tool_calls": 57,
    "resolved": true,
    "patch_applied": true
  }
}

Key Metrics

+-----------------+---------------------------------------+
| Field           | Description                           |
+-----------------+---------------------------------------+
| patch_generated | Whether the agent produced a diff     |
| patch_applied   | Whether the diff applied cleanly      |
| resolved        | Whether all tests pass                |
| tokens.input    | Input tokens consumed                 |
| tokens.output   | Output tokens generated               |
| iterations      | Number of agent turns                 |
| tool_calls      | Total tool invocations                |
| tool_usage      | Breakdown by tool name                |
| fail_to_pass    | Tests that should now pass            |
| pass_to_pass    | Regression tests that must still pass |
| error           | Error message if the run failed       |
+-----------------+---------------------------------------+
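
These fields aggregate naturally across a run. A hedged sketch that averages effort metrics for the MCP agent, assuming results has been loaded as in the earlier snippet and that failed tasks may be missing some fields:

# Effort metrics across all tasks with an MCP run record
# (assumes at least one such record exists).
mcp_runs = [t["mcp"] for t in results["tasks"] if t.get("mcp")]

avg_iterations = sum(r.get("iterations", 0) for r in mcp_runs) / len(mcp_runs)
avg_tool_calls = sum(r.get("tool_calls", 0) for r in mcp_runs) / len(mcp_runs)
output_tokens = sum(r.get("tokens", {}).get("output", 0) for r in mcp_runs)

print(f"Avg iterations per task: {avg_iterations:.1f}")
print(f"Avg tool calls per task: {avg_tool_calls:.1f}")
print(f"Total output tokens:     {output_tokens:,}")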

Markdown Report

Generate a human-readable report with --report:

mcpbr run -c config.yaml -r report.md

The report includes:

  • Summary statistics
  • Per-task results table
  • Analysis of which tasks each agent solved

Per-Instance Logs

For detailed debugging, use --log-dir:

mcpbr run -c config.yaml -v --log-dir logs/

This creates timestamped JSON files:

logs/
  astropy__astropy-12907_mcp_20260117_143052.json
  astropy__astropy-12907_baseline_20260117_143156.json
  django__django-11099_mcp_20260117_144023.json
  ...
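
Because each filename encodes the instance ID, run type, and timestamp, the logs are easy to group in scripts. A small sketch, assuming the <instance_id>_<run_type>_<date>_<time>.json pattern shown above:

from pathlib import Path

# Keep the newest log file per (instance_id, run_type).
latest = {}
for path in sorted(Path("logs").glob("*.json")):
    # "<instance>_<run_type>_<YYYYMMDD>_<HHMMSS>" -> split off the last 3 parts,
    # so instance IDs containing underscores stay intact.
    instance_id, run_type, _date, _time = path.stem.rsplit("_", 3)
    latest[(instance_id, run_type)] = path  # sorted order: last one is newest

for (instance_id, run_type), path in sorted(latest.items()):
    print(f"{instance_id} [{run_type}] -> {path.name}")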

Log File Contents

{
  "instance_id": "astropy__astropy-12907",
  "run_type": "mcp",
  "events": [
    {
      "type": "system",
      "subtype": "init",
      "cwd": "/workspace",
      "tools": ["Task", "Bash", "Glob", "Grep", "Read", "Edit", "Write"],
      "model": "claude-sonnet-4-5-20250929"
    },
    {
      "type": "assistant",
      "message": {
        "content": [
          {"type": "text", "text": "I'll help you fix this bug..."}
        ]
      }
    },
    {
      "type": "assistant",
      "message": {
        "content": [
          {"type": "tool_use", "name": "Grep", "input": {"pattern": "separability"}}
        ]
      }
    },
    {
      "type": "result",
      "num_turns": 31,
      "usage": {"input_tokens": 115, "output_tokens": 6542}
    }
  ]
}
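
The event stream can be replayed to see exactly what a run did. This sketch tallies tool calls in a single log file, assuming the event shapes shown above:

import json
from collections import Counter

with open("logs/astropy__astropy-12907_mcp_20260117_143052.json") as f:
    log = json.load(f)

# Count tool_use blocks inside assistant messages.
calls = Counter()
for event in log["events"]:
    if event.get("type") != "assistant":
        continue
    for block in event["message"].get("content", []):
        if block.get("type") == "tool_use":
            calls[block["name"]] += 1

print("Tool calls in this run:", calls.most_common())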

Analyzing Results

Improvement Calculation

improvement = ((mcp_rate - baseline_rate) / baseline_rate) * 100

Example: If MCP resolves 32% and baseline resolves 20%:

improvement = ((0.32 - 0.20) / 0.20) * 100 = +60%
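
Because the improvement is relative to the baseline rate, it is undefined when the baseline resolves nothing. A defensive sketch:

def improvement(mcp_rate: float, baseline_rate: float) -> str:
    """Relative improvement of MCP over baseline as a signed percentage."""
    if baseline_rate == 0:
        return "n/a (baseline resolved no tasks)"  # avoid division by zero
    return f"{(mcp_rate - baseline_rate) / baseline_rate * 100:+.1f}%"

print(improvement(0.32, 0.20))  # +60.0%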

Comparing Configurations

To compare different MCP servers or settings:

import json

with open("results-server-a.json") as f:
    a = json.load(f)

with open("results-server-b.json") as f:
    b = json.load(f)

print(f"Server A: {a['summary']['mcp']['rate']:.1%}")
print(f"Server B: {b['summary']['mcp']['rate']:.1%}")

Finding Interesting Tasks

Identify tasks where MCP helped but baseline failed:

import json

with open("results.json") as f:
    results = json.load(f)

# Tasks where the MCP agent succeeded but the baseline did not.
mcp_only_wins = []
for task in results["tasks"]:
    mcp_resolved = task.get("mcp", {}).get("resolved", False)
    baseline_resolved = task.get("baseline", {}).get("resolved", False)
    if mcp_resolved and not baseline_resolved:
        mcp_only_wins.append(task["instance_id"])

print("MCP solved, baseline failed:", mcp_only_wins)

Tool Usage Analysis

Understand which tools are most used:

import json
from collections import Counter

with open("results.json") as f:
    results = json.load(f)

# Sum the per-task tool_usage breakdowns across all MCP runs.
tool_counts = Counter()
for task in results["tasks"]:
    usage = task.get("mcp", {}).get("tool_usage", {})
    tool_counts.update(usage)

print("Most used tools:", tool_counts.most_common(10))

Common Patterns

High Resolution Rate

If MCP significantly outperforms baseline:

  • Your MCP tools provide valuable functionality
  • Consider which specific tools drove the improvement

Low Resolution Rate (Both Agents)

If neither agent performs well:

  • Tasks may be inherently difficult
  • Consider increasing timeout_seconds and max_iterations
  • Review per-instance logs for common failure modes

Similar Rates

If MCP and baseline have similar rates:

  • MCP tools may not provide additional value for these tasks
  • Built-in tools may be sufficient
  • Review tool usage to check whether the MCP tools are actually being invoked (see the sketch below)
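
One way to check, assuming your harness prefixes MCP-provided tool names (for example with mcp__; verify this against your own tool_usage keys before relying on it):

# Split tool calls into MCP-provided vs. built-in by name prefix.
# The "mcp__" prefix is an assumption about the harness's naming scheme.
mcp_calls = builtin_calls = 0
for task in results["tasks"]:
    for name, count in task.get("mcp", {}).get("tool_usage", {}).items():
        if name.startswith("mcp__"):
            mcp_calls += count
        else:
            builtin_calls += count

print(f"MCP tool calls: {mcp_calls}, built-in: {builtin_calls}")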

Next Steps