API Reference¶

Q: What is the mcpbr Python API?

mcpbr provides a Python API for programmatic access to evaluation functionality, including configuration management, harness creation, Docker environment management, and result processing.

Q: How do I run mcpbr evaluations programmatically?

Import the run_evaluation function from mcpbr.harness, create a HarnessConfig object, and call await run_evaluation(config). This returns an EvaluationResults object with all task results.

Q: What protocols does mcpbr define for extensibility?

mcpbr defines AgentHarness (for agent backends) and ModelProvider (for LLM providers) protocols, allowing you to add custom implementations by creating classes that implement these interfaces.

This page documents the mcpbr Python API for programmatic usage.

Quick Example¶

import asyncio
from mcpbr.config import HarnessConfig, MCPServerConfig, load_config
from mcpbr.harness import run_evaluation

async def main():
    # Load config from file
    config = load_config("mcpbr.yaml")

    # Or create programmatically
    config = HarnessConfig(
        mcp_server=MCPServerConfig(
            command="npx",
            args=["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"],
        ),
        model="sonnet",
        sample_size=5,
    )

    # Run evaluation
    results = await run_evaluation(
        config=config,
        run_mcp=True,
        run_baseline=True,
        verbose=True,
    )

    # Process results
    print(f"MCP resolved: {results.summary['mcp']['resolved']}")
    print(f"Baseline resolved: {results.summary['baseline']['resolved']}")

asyncio.run(main())

Configuration¶

MCPServerConfig¶

`MCPServerConfig` ¶

Bases: BaseModel

Configuration for an MCP server.

`name = Field(default='mcpbr', description='Name to register the MCP server as (appears in tool names)')` `class-attribute` `instance-attribute` ¶

`command = Field(description="Command to start the MCP server (e.g., 'npx', 'uvx', 'python')")` `class-attribute` `instance-attribute` ¶

`args = Field(default_factory=list, description='Arguments to pass to the command. Use {workdir} as placeholder.')` `class-attribute` `instance-attribute` ¶

`env = Field(default_factory=dict, description='Environment variables for the MCP server')` `class-attribute` `instance-attribute` ¶

`get_args_for_workdir(workdir)` ¶

Replace {workdir} placeholder in args with actual path.

`get_expanded_env()` ¶

Expand ${VAR} references in env values using os.environ.

Returns:

Type	Description
`dict[str, str]`	Dictionary with environment variables expanded.

HarnessConfig¶

`HarnessConfig` ¶

Bases: BaseModel

Main configuration for the test harness.

Supports multiple model providers and agent harnesses.

`validate_model_for_provider()` ¶

Validate model ID based on the provider.

Anthropic provider accepts any model ID (direct API).

load_config¶

`load_config(config_path)` ¶

Load configuration from a YAML file.

Parameters:

Name	Type	Description	Default
`config_path`	`str \| Path`	Path to the YAML configuration file.	required

Returns:

Type	Description
`HarnessConfig`	Validated HarnessConfig instance.

Raises:

Type	Description
`FileNotFoundError`	If config file doesn't exist.
`ValueError`	If config is invalid.

Harness¶

run_evaluation¶

`run_evaluation(config, run_mcp=True, run_baseline=True, verbose=False, verbosity=1, log_file=None, log_dir=None, task_ids=None)` `async` ¶

Run the full evaluation.

Parameters:

Name	Type	Description	Default
`config`	`HarnessConfig`	Harness configuration.	required
`run_mcp`	`bool`	Whether to run MCP evaluation.	`True`
`run_baseline`	`bool`	Whether to run baseline evaluation.	`True`
`verbose`	`bool`	Enable verbose output.	`False`
`verbosity`	`int`	Verbosity level (0=silent, 1=summary, 2=detailed).	`1`
`log_file`	`TextIO \| None`	Optional file handle for writing raw JSON logs.	`None`
`log_dir`	`Path \| None`	Optional directory for per-instance JSON log files.	`None`
`task_ids`	`list[str] \| None`	Specific task IDs to run (None for all).	`None`

Returns:

Type	Description
`EvaluationResults`	EvaluationResults with all results.

EvaluationResults¶

`EvaluationResults` `dataclass` ¶

Complete evaluation results.

TaskResult¶

`TaskResult` `dataclass` ¶

Result for a single task.

Agent Harnesses¶

AgentHarness Protocol¶

`AgentHarness` ¶

Bases: Protocol

Protocol for agent harnesses that solve SWE-bench tasks.

To add a new harness: 1. Create a class implementing this protocol 2. Add it to HARNESS_REGISTRY 3. Add the harness name to VALID_HARNESSES in config.py

`solve(task, workdir, timeout=300, verbose=False, task_id=None, env=None)` `async` ¶

Solve a SWE-bench task and return the patch.

Parameters:

Name	Type	Description	Default
`task`	`dict[str, Any]`	SWE-bench task dictionary with problem_statement, etc.	required
`workdir`	`str`	Path to the repository working directory.	required
`timeout`	`int`	Timeout in seconds.	`300`
`verbose`	`bool`	If True, stream output to console.	`False`
`task_id`	`str \| None`	Task identifier for prefixing output.	`None`
`env`	`TaskEnvironment \| None`	Optional Docker environment to run inside.	`None`

Returns:

Type	Description
`AgentResult`	AgentResult with the generated patch.

AgentResult¶

`AgentResult` `dataclass` ¶

Result from an agent run.

create_harness¶

`create_harness(harness_name, model=None, mcp_server=None, prompt=None, max_iterations=10, verbosity=1, log_file=None)` ¶

Factory function to create an agent harness.

Parameters:

Name	Type	Description	Default
`harness_name`	`str`	Name of the harness (currently only 'claude-code').	required
`model`	`str \| None`	Optional model override.	`None`
`mcp_server`	`MCPServerConfig \| None`	MCP server configuration (used by claude-code harness).	`None`
`prompt`	`str \| None`	Custom prompt template. Use {problem_statement} placeholder.	`None`
`max_iterations`	`int`	Maximum agent iterations (used by claude-code harness).	`10`
`verbosity`	`int`	Verbosity level for logging (0=silent, 1=summary, 2=detailed).	`1`
`log_file`	`TextIO \| InstanceLogWriter \| None`	Optional file handle for writing raw JSON logs.	`None`

Returns:

Type	Description
`AgentHarness`	AgentHarness instance.

Raises:

Type	Description
`ValueError`	If harness_name is not recognized.

ClaudeCodeHarness¶

`ClaudeCodeHarness` ¶

Harness that uses Claude Code CLI (claude) for solving tasks.

`init(model=None, mcp_server=None, prompt=None, max_iterations=10, verbosity=1, log_file=None)` ¶

Initialize Claude Code harness.

Parameters:

Name	Type	Description	Default
`model`	`str \| None`	Optional model override.	`None`
`mcp_server`	`MCPServerConfig \| None`	MCP server configuration to use.	`None`
`prompt`	`str \| None`	Custom prompt template. Use {problem_statement} placeholder.	`None`
`max_iterations`	`int`	Maximum number of agentic turns.	`10`
`verbosity`	`int`	Verbosity level (0=silent, 1=summary, 2=detailed).	`1`
`log_file`	`TextIO \| InstanceLogWriter \| None`	Optional file handle for writing raw JSON logs.	`None`

`solve(task, workdir, timeout=300, verbose=False, task_id=None, env=None)` `async` ¶

Solve task using Claude Code CLI.

If env is provided and has claude_cli_installed=True, runs inside Docker. Otherwise runs locally on the host.

Docker Environment¶

DockerEnvironmentManager¶

`DockerEnvironmentManager` ¶

Manages Docker environments for SWE-bench tasks.

`init(use_prebuilt=True)` ¶

Initialize the Docker environment manager.

Parameters:

Name	Type	Description	Default
`use_prebuilt`	`bool`	If True, try to use pre-built SWE-bench images first.	`True`

`create_environment(task)` `async` ¶

Create an isolated environment for a SWE-bench task.

Parameters:

Name	Type	Description	Default
`task`	`dict[str, Any]`	SWE-bench task dictionary with repo, base_commit, etc.	required

Returns:

Type	Description
`TaskEnvironment`	TaskEnvironment instance.

`cleanup_all_sync()` ¶

Synchronously clean up all containers and temporary directories.

Used by signal handlers and atexit.

`cleanup_all()` `async` ¶

Clean up all containers and temporary directories.

TaskEnvironment¶

`TaskEnvironment` `dataclass` ¶

Represents an isolated environment for a SWE-bench task.

`container` `instance-attribute` ¶

`workdir` `instance-attribute` ¶

`host_workdir` `instance-attribute` ¶

`instance_id` `instance-attribute` ¶

`uses_prebuilt = field(default=False)` `class-attribute` `instance-attribute` ¶

`claude_cli_installed = field(default=False)` `class-attribute` `instance-attribute` ¶

`exec_command(command, timeout=60, workdir=None, environment=None)` `async` ¶

Execute a command in the container.

Parameters:

Name	Type	Description	Default
`command`	`str \| list[str]`	Command to execute (string or list).	required
`timeout`	`int`	Timeout in seconds.	`60`
`workdir`	`str \| None`	Working directory (defaults to /workspace).	`None`
`environment`	`dict[str, str] \| None`	Optional environment variables to set.	`None`

Returns:

Type	Description
`tuple[int, str, str]`	Tuple of (exit_code, stdout, stderr).

`exec_command_streaming(command, workdir=None, environment=None, timeout=300, on_stdout=None, on_stderr=None)` `async` ¶

Execute a command in the container with streaming output.

Parameters:

Name	Type	Description	Default
`command`	`list[str]`	Command to execute as list.	required
`workdir`	`str \| None`	Working directory (defaults to self.workdir).	`None`
`environment`	`dict[str, str] \| None`	Optional environment variables to set.	`None`
`timeout`	`int`	Timeout in seconds.	`300`
`on_stdout`	`Any`	Optional callback for stdout lines (receives str).	`None`
`on_stderr`	`Any`	Optional callback for stderr lines (receives str).	`None`

Returns:

Type	Description
`tuple[int, str, str]`	Tuple of (exit_code, stdout, stderr).

`write_file(path, content, workdir=None)` `async` ¶

Write content to a file in the container.

Parameters:

Name	Type	Description	Default
`path`	`str`	File path (relative to workdir).	required
`content`	`str`	Content to write.	required
`workdir`	`str \| None`	Working directory. If different from /workspace (the host mount), writes directly into container via docker exec.	`None`

`read_file(path)` `async` ¶

Read content from a file in the container.

`cleanup()` `async` ¶

Stop and remove the container.

Evaluation¶

evaluate_patch¶

`evaluate_patch(env, task, patch, test_timeout=120)` `async` ¶

Evaluate a patch against a SWE-bench task.

Parameters:

Name	Type	Description	Default
`env`	`TaskEnvironment`	Docker environment.	required
`task`	`dict[str, Any]`	SWE-bench task dictionary.	required
`patch`	`str`	Unified diff patch to evaluate.	required
`test_timeout`	`int`	Timeout for each test.	`120`

Returns:

Type	Description
`EvaluationResult`	EvaluationResult with full evaluation details.

EvaluationResult¶

`EvaluationResult` `dataclass` ¶

Complete evaluation result for a task.

TestResults¶

`TestResults` `dataclass` ¶

Results from running tests.

Models¶

ModelInfo¶

`ModelInfo` `dataclass` ¶

Information about a supported model.

list_supported_models¶

`list_supported_models()` ¶

Get a list of all supported models.

Returns:

Type	Description
`list[ModelInfo]`	List of ModelInfo objects.

get_model_info¶

`get_model_info(model_id)` ¶

Get information about a model.

Parameters:

Name	Type	Description	Default
`model_id`	`str`	Anthropic model ID.	required

Returns:

Type	Description
`ModelInfo \| None`	ModelInfo if found, None otherwise.

is_model_supported¶

`is_model_supported(model_id)` ¶

Check if a model is in the supported list.

Parameters:

Name	Type	Description	Default
`model_id`	`str`	Anthropic model ID.	required

Returns:

Type	Description
`bool`	True if the model is supported.

Constants¶

Default Values¶

from mcpbr.models import DEFAULT_MODEL
from mcpbr.config import VALID_PROVIDERS, VALID_HARNESSES

print(DEFAULT_MODEL)       # "sonnet"
print(VALID_PROVIDERS)     # ("anthropic",)
print(VALID_HARNESSES)     # ("claude-code",)

Docker Registry¶

from mcpbr.docker_env import SWEBENCH_IMAGE_REGISTRY

# Pre-built images from Epoch AI
print(SWEBENCH_IMAGE_REGISTRY)
# "ghcr.io/epoch-research/swe-bench.eval"

API Reference¶

Quick Example¶

Configuration¶

MCPServerConfig¶

MCPServerConfig ¶

name = Field(default='mcpbr', description='Name to register the MCP server as (appears in tool names)') class-attribute instance-attribute ¶

command = Field(description="Command to start the MCP server (e.g., 'npx', 'uvx', 'python')") class-attribute instance-attribute ¶

args = Field(default_factory=list, description='Arguments to pass to the command. Use {workdir} as placeholder.') class-attribute instance-attribute ¶

env = Field(default_factory=dict, description='Environment variables for the MCP server') class-attribute instance-attribute ¶

get_args_for_workdir(workdir) ¶

get_expanded_env() ¶

HarnessConfig¶

HarnessConfig ¶

validate_model_for_provider() ¶

load_config¶

load_config(config_path) ¶

Harness¶

run_evaluation¶

run_evaluation(config, run_mcp=True, run_baseline=True, verbose=False, verbosity=1, log_file=None, log_dir=None, task_ids=None) async ¶

EvaluationResults¶

EvaluationResults dataclass ¶

TaskResult¶

TaskResult dataclass ¶

Agent Harnesses¶

AgentHarness Protocol¶

AgentHarness ¶

solve(task, workdir, timeout=300, verbose=False, task_id=None, env=None) async ¶

AgentResult¶

AgentResult dataclass ¶

create_harness¶

create_harness(harness_name, model=None, mcp_server=None, prompt=None, max_iterations=10, verbosity=1, log_file=None) ¶

ClaudeCodeHarness¶

ClaudeCodeHarness ¶

__init__(model=None, mcp_server=None, prompt=None, max_iterations=10, verbosity=1, log_file=None) ¶

solve(task, workdir, timeout=300, verbose=False, task_id=None, env=None) async ¶

Docker Environment¶

DockerEnvironmentManager¶

DockerEnvironmentManager ¶

__init__(use_prebuilt=True) ¶

create_environment(task) async ¶

cleanup_all_sync() ¶

cleanup_all() async ¶

TaskEnvironment¶

TaskEnvironment dataclass ¶

container instance-attribute ¶

workdir instance-attribute ¶

host_workdir instance-attribute ¶

instance_id instance-attribute ¶

uses_prebuilt = field(default=False) class-attribute instance-attribute ¶

claude_cli_installed = field(default=False) class-attribute instance-attribute ¶

exec_command(command, timeout=60, workdir=None, environment=None) async ¶

exec_command_streaming(command, workdir=None, environment=None, timeout=300, on_stdout=None, on_stderr=None) async ¶

write_file(path, content, workdir=None) async ¶

read_file(path) async ¶

cleanup() async ¶

Evaluation¶

evaluate_patch¶

evaluate_patch(env, task, patch, test_timeout=120) async ¶

EvaluationResult¶

EvaluationResult dataclass ¶

TestResults¶

TestResults dataclass ¶

Models¶

ModelInfo¶

ModelInfo dataclass ¶

list_supported_models¶

list_supported_models() ¶

get_model_info¶

get_model_info(model_id) ¶

is_model_supported¶

is_model_supported(model_id) ¶

Constants¶

Default Values¶

Docker Registry¶

`MCPServerConfig` ¶

`name = Field(default='mcpbr', description='Name to register the MCP server as (appears in tool names)')` `class-attribute` `instance-attribute` ¶

`command = Field(description="Command to start the MCP server (e.g., 'npx', 'uvx', 'python')")` `class-attribute` `instance-attribute` ¶

`args = Field(default_factory=list, description='Arguments to pass to the command. Use {workdir} as placeholder.')` `class-attribute` `instance-attribute` ¶

`env = Field(default_factory=dict, description='Environment variables for the MCP server')` `class-attribute` `instance-attribute` ¶

`get_args_for_workdir(workdir)` ¶

`get_expanded_env()` ¶

`HarnessConfig` ¶

`validate_model_for_provider()` ¶

`load_config(config_path)` ¶

`run_evaluation(config, run_mcp=True, run_baseline=True, verbose=False, verbosity=1, log_file=None, log_dir=None, task_ids=None)` `async` ¶

`EvaluationResults` `dataclass` ¶

`TaskResult` `dataclass` ¶

`AgentHarness` ¶

`solve(task, workdir, timeout=300, verbose=False, task_id=None, env=None)` `async` ¶

`AgentResult` `dataclass` ¶

`create_harness(harness_name, model=None, mcp_server=None, prompt=None, max_iterations=10, verbosity=1, log_file=None)` ¶

`ClaudeCodeHarness` ¶

`init(model=None, mcp_server=None, prompt=None, max_iterations=10, verbosity=1, log_file=None)` ¶

`solve(task, workdir, timeout=300, verbose=False, task_id=None, env=None)` `async` ¶

`DockerEnvironmentManager` ¶

`init(use_prebuilt=True)` ¶

`create_environment(task)` `async` ¶

`cleanup_all_sync()` ¶

`cleanup_all()` `async` ¶

`TaskEnvironment` `dataclass` ¶

`container` `instance-attribute` ¶

`workdir` `instance-attribute` ¶

`host_workdir` `instance-attribute` ¶

`instance_id` `instance-attribute` ¶

`uses_prebuilt = field(default=False)` `class-attribute` `instance-attribute` ¶

`claude_cli_installed = field(default=False)` `class-attribute` `instance-attribute` ¶

`exec_command(command, timeout=60, workdir=None, environment=None)` `async` ¶

`exec_command_streaming(command, workdir=None, environment=None, timeout=300, on_stdout=None, on_stderr=None)` `async` ¶

`write_file(path, content, workdir=None)` `async` ¶

`read_file(path)` `async` ¶

`cleanup()` `async` ¶

`evaluate_patch(env, task, patch, test_timeout=120)` `async` ¶

`EvaluationResult` `dataclass` ¶

`TestResults` `dataclass` ¶

`ModelInfo` `dataclass` ¶

`list_supported_models()` ¶

`get_model_info(model_id)` ¶

`is_model_supported(model_id)` ¶