Architecture¶
This page explains how mcpbr works internally and the design decisions behind it.
Overview¶
mcpbr is a benchmark runner that evaluates MCP (Model Context Protocol) servers by comparing agent performance with and without MCP tools on real GitHub issues from the SWE-bench dataset.
Execution Flow¶
               +-----------------+
               |    mcpbr run    |
               +--------+--------+
                        |
         +--------------v--------------+
         |  Load SWE-bench tasks from  |
         |  HuggingFace datasets       |
         +--------------+--------------+
                        |
+-----------------------v-----------------------+
|           For each task (parallel)            |
|  +----------------------------------------+   |
|  | 1. Pull pre-built Docker image         |   |
|  | 2. Create container with repo          |   |
|  | 3. Install Claude CLI                  |   |
|  | 4. Run MCP agent (if enabled)          |   |
|  | 5. Run baseline agent (if enabled)     |   |
|  | 6. Extract patches                     |   |
|  | 7. Apply patches and run tests         |   |
|  | 8. Record results                      |   |
|  +----------------------------------------+   |
+-----------------------+-----------------------+
                        |
          +-------------v-------------+
          |     Aggregate results     |
          |     Generate reports      |
          +---------------------------+
Module Structure¶
src/mcpbr/
├── cli.py # Command-line interface (Click-based)
├── config.py # Configuration models (Pydantic)
├── models.py # Supported model registry
├── providers.py # LLM provider abstractions
├── harnesses.py # Agent harness implementations
├── harness.py # Main orchestrator
├── agent.py # Legacy baseline agent
├── docker_env.py # Docker environment management
├── evaluation.py # Patch application and testing
├── log_formatter.py # Streaming output formatting
└── reporting.py # Results formatting (JSON, Markdown)
Key Components¶
harness.py - Orchestrator¶
The main entry point that:
- Loads SWE-bench tasks from HuggingFace
- Creates Docker environments for each task
- Runs agents (MCP and baseline) in parallel
- Collects and aggregates results
async def run_evaluation(
    config: HarnessConfig,
    run_mcp: bool = True,
    run_baseline: bool = True,
    ...
) -> EvaluationResults:
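The fan-out over tasks can be pictured as follows. This is a simplified sketch, with create_task_environment and solve_and_evaluate as hypothetical stand-ins for the real orchestration steps:

import asyncio

async def _run_tasks_in_parallel(tasks, max_concurrency: int = 4):
    # Bound parallelism so Docker and API rate limits aren't exhausted.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(task):
        async with semaphore:
            env = await create_task_environment(task)  # hypothetical
            try:
                return await solve_and_evaluate(task, env)  # hypothetical
            finally:
                await env.cleanup()

    # Exceptions are returned rather than raised, so one failing
    # task cannot abort the whole run.
    return await asyncio.gather(*(run_one(t) for t in tasks),
                                return_exceptions=True)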
docker_env.py - Container Management¶
Manages Docker containers for isolated task execution:
- DockerEnvironmentManager: Creates and manages containers
- TaskEnvironment: Represents a single task's environment
- Handles pre-built image pulling from Epoch AI's registry
- Installs Node.js and Claude CLI inside containers
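In outline, the pull-and-create step looks something like this (a sketch using the Docker SDK for Python; the function name is illustrative):

import docker

def create_container(image: str, task_id: str):
    client = docker.from_env()
    # Pull the pre-built SWE-bench image for this task.
    client.images.pull(image)
    # Keep the container alive so the harness can exec into it.
    return client.containers.run(
        image,
        command="sleep infinity",
        name=f"mcpbr-{task_id}",
        detach=True,
    )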
harnesses.py - Agent Implementation¶
Contains the ClaudeCodeHarness class that:
- Shells out to Claude Code CLI
- Registers MCP servers with claude mcp add
- Streams agent output for real-time logging
- Extracts git diffs for patches
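Conceptually, a single agent run inside the container reduces to the following sketch. The server name my-server is a placeholder, the repository is assumed to live at /testbed as in the pre-built images, and the real harness also streams output and enforces timeouts:

def run_agent(container, prompt: str, mcp_command: str | None = None):
    if mcp_command is not None:
        # Register the MCP server with the Claude CLI in the container.
        container.exec_run(f"claude mcp add my-server {mcp_command}")
    # Run the agent non-interactively against the task prompt.
    exit_code, output = container.exec_run(
        ["claude", "-p", prompt], workdir="/testbed"
    )
    # The patch is whatever the agent changed in the working tree.
    _, diff = container.exec_run(["git", "diff"], workdir="/testbed")
    return output, diff.decode()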
evaluation.py - Patch Testing¶
Handles the evaluation phase:
- apply_patch(): Applies unified diff patches via git
- run_tests(): Executes pytest with SWE-bench test specs
- evaluate_patch(): Full evaluation workflow
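A stripped-down version of this workflow (illustrative; the real module derives the test IDs from the task's SWE-bench test spec):

import subprocess

def apply_patch(repo_dir: str, patch: str) -> bool:
    # Apply the unified diff with git; returncode 0 means a clean apply.
    result = subprocess.run(
        ["git", "apply", "-"], cwd=repo_dir,
        input=patch, text=True, capture_output=True,
    )
    return result.returncode == 0

def run_tests(repo_dir: str, test_ids: list[str]) -> bool:
    # Run the selected tests inside the conda testbed environment.
    result = subprocess.run(
        ["conda", "run", "-n", "testbed", "pytest", *test_ids],
        cwd=repo_dir, capture_output=True,
    )
    return result.returncode == 0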
Container Architecture¶
+------------------------------------------------------------------+
|                           Host Machine                           |
|  +------------------------------------------------------------+  |
|  |                   mcpbr Harness (Python)                   |  |
|  |  - Loads SWE-bench tasks from HuggingFace                  |  |
|  |  - Pulls pre-built Docker images                           |  |
|  |  - Orchestrates agent runs                                 |  |
|  |  - Collects results and generates reports                  |  |
|  +-----------------------------+------------------------------+  |
|                                | docker exec                      |
|  +-----------------------------v------------------------------+  |
|  |                Docker Container (per task)                 |  |
|  |  +------------------------------------------------------+  |  |
|  |  |              Pre-built SWE-bench Image               |  |  |
|  |  |  - Repository at correct commit                      |  |  |
|  |  |  - All dependencies installed (astropy, django...)   |  |  |
|  |  |  - Conda environment with testbed                    |  |  |
|  |  +------------------------------------------------------+  |  |
|  |                                                            |  |
|  |  Runtime Setup:                                            |  |
|  |  - Node.js installed                                       |  |
|  |  - Claude CLI installed globally                           |  |
|  |  - Non-root user (mcpbr) created                           |  |
|  |                                                            |  |
|  |  Agent Execution:                                          |  |
|  |  - Claude CLI runs as mcpbr user                           |  |
|  |  - Makes API calls to Anthropic                            |  |
|  |  - Executes Bash commands (imports work!)                  |  |
|  |  - Reads/writes files                                      |  |
|  |  - Generates patches                                       |  |
|  |                                                            |  |
|  |  Evaluation:                                               |  |
|  |  - Patches applied via git                                 |  |
|  |  - pytest runs in conda testbed environment                |  |
|  +------------------------------------------------------------+  |
+------------------------------------------------------------------+
Why Run Inside Docker?¶
The agent (Claude Code CLI) runs inside the Docker container rather than on the host. This design choice provides:
- Working Imports: Python imports work correctly (e.g., from astropy import ...); see the sketch after this list
- Test Execution: The agent can run tests and verify fixes
- No Conflicts: No dependency conflicts with the host machine
- Reproducibility: Identical environment across runs
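For example, a quick import check run through the Docker SDK succeeds inside the task container where it would typically fail on a bare host (a sketch; container is the object created by the environment manager):

# Project imports resolve against the pre-installed dependencies.
exit_code, _ = container.exec_run(
    ["conda", "run", "-n", "testbed", "python", "-c", "import astropy"]
)
assert exit_code == 0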
Pre-built Images¶
mcpbr uses pre-built SWE-bench Docker images from Epoch AI's registry. These images contain:
- Repository checked out at the correct (buggy) commit
- All project dependencies pre-installed and validated
- Conda environment named testbed with the correct Python version
Fallback Path¶
If a pre-built image isn't available, mcpbr falls back to (sketched below):
1. Using a generic Python 3.11 image
2. Cloning the repository at the correct commit
3. Attempting dependency installation (less reliable)
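In code, that fallback might look like this (a sketch; only the image selection is shown, with the clone and install steps happening at container startup):

import docker
from docker.errors import ImageNotFound

def get_task_image(client: docker.DockerClient, image: str) -> str:
    try:
        # Prefer the pre-built, validated SWE-bench image.
        client.images.pull(image)
        return image
    except ImageNotFound:
        # Fall back to a generic interpreter image; the repository is
        # cloned and dependencies installed inside the container instead.
        client.images.pull("python:3.11")
        return "python:3.11"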
Protocol-Based Design¶
mcpbr uses Python Protocols for extensibility:
@runtime_checkable
class AgentHarness(Protocol):
    async def solve(
        self,
        task: dict[str, Any],
        workdir: str,
        timeout: int = 300,
        verbose: bool = False,
        task_id: str | None = None,
        env: TaskEnvironment | None = None,
    ) -> AgentResult:
        ...
This allows future addition of:
- New agent backends (e.g., other coding agents)
- New LLM providers
- Custom evaluation pipelines
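For example, a new backend satisfies the protocol structurally, with no inheritance required (a hypothetical class, not part of mcpbr today):

from typing import Any

class MyAgentHarness:
    async def solve(
        self,
        task: dict[str, Any],
        workdir: str,
        timeout: int = 300,
        verbose: bool = False,
        task_id: str | None = None,
        env: TaskEnvironment | None = None,
    ) -> AgentResult:
        ...  # shell out to some other coding agent here

# isinstance works because the protocol is @runtime_checkable.
assert isinstance(MyAgentHarness(), AgentHarness)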
Signal Handling¶
mcpbr registers signal handlers for graceful cleanup:
def register_signal_handlers() -> None:
    signal.signal(signal.SIGINT, _signal_handler)
    signal.signal(signal.SIGTERM, _signal_handler)
On interrupt:
- Running agents are terminated
- Docker containers are stopped and removed
- Temporary directories are cleaned up
This prevents orphaned containers from accumulating.
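The handler itself only needs to trigger that cleanup and exit (a sketch; cleanup_all_environments is a hypothetical stand-in for the real teardown logic):

import signal
import sys

def _signal_handler(signum: int, frame) -> None:
    cleanup_all_environments()  # hypothetical: stop containers, remove temp dirs
    sys.exit(128 + signum)      # conventional exit code for fatal signals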
Next Steps¶
- MCP Integration - How to test your MCP server
- API Reference - Detailed module documentation
- Contributing - How to extend mcpbr