API Reference

This page documents the mcpbr Python API for programmatic usage.

Quick Example

import asyncio
from mcpbr.config import HarnessConfig, MCPServerConfig, load_config
from mcpbr.harness import run_evaluation

async def main():
    # Load config from file
    config = load_config("mcpbr.yaml")

    # Or create programmatically
    config = HarnessConfig(
        mcp_server=MCPServerConfig(
            command="npx",
            args=["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"],
        ),
        model="sonnet",
        sample_size=5,
    )

    # Run evaluation
    results = await run_evaluation(
        config=config,
        run_mcp=True,
        run_baseline=True,
        verbose=True,
    )

    # Process results
    print(f"MCP resolved: {results.summary['mcp']['resolved']}")
    print(f"Baseline resolved: {results.summary['baseline']['resolved']}")

asyncio.run(main())

Configuration

MCPServerConfig

Bases: BaseModel

Configuration for an MCP server.

Attributes:

name: Name to register the MCP server as (appears in tool names). Default: 'mcpbr'.
command: Command to start the MCP server (e.g., 'npx', 'uvx', 'python'). Required.
args: Arguments to pass to the command. Use {workdir} as placeholder. Default: [].
env: Environment variables for the MCP server. Default: {}.

get_args_for_workdir(workdir)

Replace the {workdir} placeholder in args with the actual path.

get_expanded_env()

Expand ${VAR} references in env values using os.environ.

Returns:

dict[str, str]: Dictionary with environment variables expanded.
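The two helper methods above can be illustrated with plain-Python equivalents. These are stand-ins that only mirror the documented behavior, not mcpbr's actual implementations (in particular, expanding unset variables to an empty string is an assumption here):

```python
import os
import re

def args_for_workdir(args: list[str], workdir: str) -> list[str]:
    # Mirrors get_args_for_workdir: substitute the {workdir} placeholder.
    return [a.replace("{workdir}", workdir) for a in args]

def expanded_env(env: dict[str, str]) -> dict[str, str]:
    # Mirrors get_expanded_env: expand ${VAR} references from os.environ.
    # Unset variables become "" in this sketch (an assumption).
    return {
        k: re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), v)
        for k, v in env.items()
    }

args = ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"]
print(args_for_workdir(args, "/tmp/repo"))
# ['-y', '@modelcontextprotocol/server-filesystem', '/tmp/repo']

os.environ["MY_TOKEN"] = "s3cret"
print(expanded_env({"API_TOKEN": "${MY_TOKEN}"}))
# {'API_TOKEN': 's3cret'}
```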

HarnessConfig

Bases: BaseModel

Main configuration for the test harness.

Supports multiple model providers and agent harnesses.

validate_model_for_provider()

Validate model ID based on the provider.

Anthropic provider accepts any model ID (direct API).

load_config

load_config(config_path)

Load configuration from a YAML file.

Parameters:

config_path (str | Path, required): Path to the YAML configuration file.

Returns:

HarnessConfig: Validated HarnessConfig instance.

Raises:

FileNotFoundError: If config file doesn't exist.
ValueError: If config is invalid.
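Since load_config raises two distinct exceptions, callers usually guard both. A minimal stand-in (not mcpbr's actual loader; the emptiness check is purely illustrative) shows the handling pattern:

```python
from pathlib import Path

def load_config_sketch(config_path: str) -> str:
    # Hypothetical stand-in for load_config, mirroring its documented
    # error contract: missing file -> FileNotFoundError,
    # invalid content -> ValueError.
    p = Path(config_path)
    if not p.exists():
        raise FileNotFoundError(f"Config file not found: {p}")
    raw = p.read_text()
    if not raw.strip():
        raise ValueError("Config file is empty")
    return raw

try:
    load_config_sketch("does-not-exist.yaml")
except FileNotFoundError as exc:
    print(f"missing config: {exc}")
except ValueError as exc:
    print(f"invalid config: {exc}")
```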


Harness

run_evaluation

run_evaluation(config, run_mcp=True, run_baseline=True, verbose=False, verbosity=1, log_file=None, log_dir=None, task_ids=None) async

Run the full evaluation.

Parameters:

config (HarnessConfig, required): Harness configuration.
run_mcp (bool, default True): Whether to run MCP evaluation.
run_baseline (bool, default True): Whether to run baseline evaluation.
verbose (bool, default False): Enable verbose output.
verbosity (int, default 1): Verbosity level (0=silent, 1=summary, 2=detailed).
log_file (TextIO | None, default None): Optional file handle for writing raw JSON logs.
log_dir (Path | None, default None): Optional directory for per-instance JSON log files.
task_ids (list[str] | None, default None): Specific task IDs to run (None for all).

Returns:

EvaluationResults: EvaluationResults with all results.

EvaluationResults dataclass

Complete evaluation results.

TaskResult dataclass

Result for a single task.


Agent Harnesses

AgentHarness Protocol

AgentHarness

Bases: Protocol

Protocol for agent harnesses that solve SWE-bench tasks.

To add a new harness:

1. Create a class implementing this protocol.
2. Add it to HARNESS_REGISTRY.
3. Add the harness name to VALID_HARNESSES in config.py.

solve(task, workdir, timeout=300, verbose=False, task_id=None, env=None) async

Solve a SWE-bench task and return the patch.

Parameters:

task (dict[str, Any], required): SWE-bench task dictionary with problem_statement, etc.
workdir (str, required): Path to the repository working directory.
timeout (int, default 300): Timeout in seconds.
verbose (bool, default False): If True, stream output to console.
task_id (str | None, default None): Task identifier for prefixing output.
env (TaskEnvironment | None, default None): Optional Docker environment to run inside.

Returns:

AgentResult: AgentResult with the generated patch.
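A new harness only needs a class with this solve signature (step 1 of the list above). The skeleton below sketches that; the result type is a local stand-in, since AgentResult's fields are defined inside mcpbr:

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from typing import Any

@dataclass
class StubResult:
    # Local stand-in for mcpbr's AgentResult dataclass.
    patch: str

class EchoHarness:
    """Skeletal harness matching the AgentHarness protocol's solve signature."""

    async def solve(
        self,
        task: dict[str, Any],
        workdir: str,
        timeout: int = 300,
        verbose: bool = False,
        task_id: str | None = None,
        env: Any | None = None,
    ) -> StubResult:
        # A real harness would read task["problem_statement"] and produce
        # a unified diff; this skeleton returns an empty patch.
        return StubResult(patch="")

result = asyncio.run(EchoHarness().solve({"problem_statement": "fix bug"}, "/tmp"))
print(result.patch == "")  # True
```

Steps 2 and 3 (registering in HARNESS_REGISTRY and VALID_HARNESSES) happen inside mcpbr itself.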

AgentResult dataclass

Result from an agent run.

create_harness

create_harness(harness_name, model=None, mcp_server=None, prompt=None, max_iterations=10, verbosity=1, log_file=None)

Factory function to create an agent harness.

Parameters:

harness_name (str, required): Name of the harness (currently only 'claude-code').
model (str | None, default None): Optional model override.
mcp_server (MCPServerConfig | None, default None): MCP server configuration (used by claude-code harness).
prompt (str | None, default None): Custom prompt template. Use {problem_statement} placeholder.
max_iterations (int, default 10): Maximum agent iterations (used by claude-code harness).
verbosity (int, default 1): Verbosity level for logging (0=silent, 1=summary, 2=detailed).
log_file (TextIO | InstanceLogWriter | None, default None): Optional file handle for writing raw JSON logs.

Returns:

AgentHarness: AgentHarness instance.

Raises:

ValueError: If harness_name is not recognized.
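The factory's lookup-and-raise behavior can be sketched with a toy registry. The class and names below are stand-ins, not mcpbr internals (the real registry maps 'claude-code' to ClaudeCodeHarness):

```python
from __future__ import annotations

class StubHarness:
    # Stand-in harness class for illustration only.
    def __init__(self, model: str | None = None):
        self.model = model

STUB_REGISTRY = {"claude-code": StubHarness}

def create_harness_sketch(harness_name: str, model: str | None = None):
    # Mirrors create_harness's documented contract:
    # unrecognized names raise ValueError.
    if harness_name not in STUB_REGISTRY:
        raise ValueError(f"Unknown harness: {harness_name!r}")
    return STUB_REGISTRY[harness_name](model=model)

harness = create_harness_sketch("claude-code", model="sonnet")
print(harness.model)  # sonnet
```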

ClaudeCodeHarness

Harness that uses Claude Code CLI (claude) for solving tasks.

__init__(model=None, mcp_server=None, prompt=None, max_iterations=10, verbosity=1, log_file=None)

Initialize Claude Code harness.

Parameters:

model (str | None, default None): Optional model override.
mcp_server (MCPServerConfig | None, default None): MCP server configuration to use.
prompt (str | None, default None): Custom prompt template. Use {problem_statement} placeholder.
max_iterations (int, default 10): Maximum number of agentic turns.
verbosity (int, default 1): Verbosity level (0=silent, 1=summary, 2=detailed).
log_file (TextIO | InstanceLogWriter | None, default None): Optional file handle for writing raw JSON logs.

solve(task, workdir, timeout=300, verbose=False, task_id=None, env=None) async

Solve task using Claude Code CLI.

If env is provided and has claude_cli_installed=True, runs inside Docker. Otherwise runs locally on the host.


Docker Environment

DockerEnvironmentManager

Manages Docker environments for SWE-bench tasks.

__init__(use_prebuilt=True)

Initialize the Docker environment manager.

Parameters:

use_prebuilt (bool, default True): If True, try to use pre-built SWE-bench images first.

create_environment(task) async

Create an isolated environment for a SWE-bench task.

Parameters:

task (dict[str, Any], required): SWE-bench task dictionary with repo, base_commit, etc.

Returns:

TaskEnvironment: TaskEnvironment instance.

cleanup_all_sync()

Synchronously clean up all containers and temporary directories.

Used by signal handlers and atexit.

cleanup_all() async

Clean up all containers and temporary directories.

TaskEnvironment dataclass

Represents an isolated environment for a SWE-bench task.

Attributes:

container
workdir
host_workdir
instance_id
uses_prebuilt (default False)
claude_cli_installed (default False)

exec_command(command, timeout=60, workdir=None, environment=None) async

Execute a command in the container.

Parameters:

command (str | list[str], required): Command to execute (string or list).
timeout (int, default 60): Timeout in seconds.
workdir (str | None, default None): Working directory (defaults to /workspace).
environment (dict[str, str] | None, default None): Optional environment variables to set.

Returns:

tuple[int, str, str]: Tuple of (exit_code, stdout, stderr).

exec_command_streaming(command, workdir=None, environment=None, timeout=300, on_stdout=None, on_stderr=None) async

Execute a command in the container with streaming output.

Parameters:

command (list[str], required): Command to execute as a list.
workdir (str | None, default None): Working directory (defaults to self.workdir).
environment (dict[str, str] | None, default None): Optional environment variables to set.
timeout (int, default 300): Timeout in seconds.
on_stdout (Any, default None): Optional callback for stdout lines (receives str).
on_stderr (Any, default None): Optional callback for stderr lines (receives str).

Returns:

tuple[int, str, str]: Tuple of (exit_code, stdout, stderr).
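Both exec methods return the same (exit_code, stdout, stderr) tuple, and the streaming variant additionally invokes the callbacks once per output line. A stand-in (no container involved; the function name is illustrative) shows the callback contract:

```python
import asyncio

async def exec_streaming_stub(command, on_stdout=None, on_stderr=None):
    # Stand-in mimicking exec_command_streaming's contract: invoke the
    # per-line callbacks, then return (exit_code, stdout, stderr).
    lines = ["collected", "output"]
    for line in lines:
        if on_stdout is not None:
            on_stdout(line)
    return 0, "\n".join(lines), ""

captured: list[str] = []
code, out, err = asyncio.run(
    exec_streaming_stub(["echo", "hi"], on_stdout=captured.append)
)
print(code, captured)  # 0 ['collected', 'output']
```

Passing a bound method like list.append as the callback is a convenient way to collect streamed lines.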

write_file(path, content, workdir=None) async

Write content to a file in the container.

Parameters:

path (str, required): File path (relative to workdir).
content (str, required): Content to write.
workdir (str | None, default None): Working directory. If different from /workspace (the host mount), writes directly into container via docker exec.

read_file(path) async

Read content from a file in the container.

cleanup() async

Stop and remove the container.


Evaluation

evaluate_patch

evaluate_patch(env, task, patch, test_timeout=120) async

Evaluate a patch against a SWE-bench task.

Parameters:

env (TaskEnvironment, required): Docker environment.
task (dict[str, Any], required): SWE-bench task dictionary.
patch (str, required): Unified diff patch to evaluate.
test_timeout (int, default 120): Timeout for each test.

Returns:

EvaluationResult: EvaluationResult with full evaluation details.
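The patch argument is a standard unified diff, for example (file and function names here are illustrative):

```diff
--- a/src/example.py
+++ b/src/example.py
@@ -1,2 +1,2 @@
 def add(a, b):
-    return a - b
+    return a + b
```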

EvaluationResult dataclass

Complete evaluation result for a task.

TestResults dataclass

Results from running tests.


Models

ModelInfo dataclass

Information about a supported model.

list_supported_models

list_supported_models()

Get a list of all supported models.

Returns:

list[ModelInfo]: List of ModelInfo objects.

get_model_info

get_model_info(model_id)

Get information about a model.

Parameters:

model_id (str, required): Anthropic model ID.

Returns:

ModelInfo | None: ModelInfo if found, None otherwise.

is_model_supported

is_model_supported(model_id)

Check if a model is in the supported list.

Parameters:

model_id (str, required): Anthropic model ID.

Returns:

bool: True if the model is supported.


Constants

Default Values

from mcpbr.models import DEFAULT_MODEL
from mcpbr.config import VALID_PROVIDERS, VALID_HARNESSES

print(DEFAULT_MODEL)       # "sonnet"
print(VALID_PROVIDERS)     # ("anthropic",)
print(VALID_HARNESSES)     # ("claude-code",)

Docker Registry

from mcpbr.docker_env import SWEBENCH_IMAGE_REGISTRY

# Pre-built images from Epoch AI
print(SWEBENCH_IMAGE_REGISTRY)
# "ghcr.io/epoch-research/swe-bench.eval"