API Reference
This page documents the mcpbr Python API for programmatic usage.
Quick Example
import asyncio

from mcpbr.config import HarnessConfig, MCPServerConfig, load_config
from mcpbr.harness import run_evaluation


async def main():
    # Load config from file
    config = load_config("mcpbr.yaml")

    # Or create it programmatically (this replaces the config loaded above)
    config = HarnessConfig(
        mcp_server=MCPServerConfig(
            command="npx",
            args=["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"],
        ),
        model="sonnet",
        sample_size=5,
    )

    # Run evaluation
    results = await run_evaluation(
        config=config,
        run_mcp=True,
        run_baseline=True,
        verbose=True,
    )

    # Process results
    print(f"MCP resolved: {results.summary['mcp']['resolved']}")
    print(f"Baseline resolved: {results.summary['baseline']['resolved']}")


asyncio.run(main())
Configuration
MCPServerConfig
Bases: BaseModel
Configuration for an MCP server.
name = Field(default='mcpbr', description='Name to register the MCP server as (appears in tool names)') class-attribute instance-attribute
command = Field(description="Command to start the MCP server (e.g., 'npx', 'uvx', 'python')") class-attribute instance-attribute
args = Field(default_factory=list, description='Arguments to pass to the command. Use {workdir} as placeholder.') class-attribute instance-attribute
env = Field(default_factory=dict, description='Environment variables for the MCP server') class-attribute instance-attribute
get_args_for_workdir(workdir)
Replace the {workdir} placeholder in args with the actual path.
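The placeholder substitution described above can be pictured with a minimal standalone sketch. It does not import mcpbr; `substitute_workdir` is a hypothetical helper, and plain substring replacement is an assumption about how the real method behaves.

```python
def substitute_workdir(args: list[str], workdir: str) -> list[str]:
    """Return a copy of args with every {workdir} placeholder replaced."""
    # Assumption: plain substring replacement; the real method may differ.
    return [arg.replace("{workdir}", workdir) for arg in args]


# Same argument list as in the Quick Example.
args = ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"]
resolved = substitute_workdir(args, "/tmp/repo")
print(resolved)
```

The original list is left untouched, so one configuration can be reused across tasks that each have a different working directory.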
get_expanded_env()
Expand ${VAR} references in env values using os.environ.
Returns:
| Type | Description |
|---|---|
| dict[str, str] | Dictionary with environment variables expanded. |
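To illustrate what `${VAR}` expansion means here, the following standalone sketch expands references against `os.environ`. The helper name `expand_env`, the variable `EXAMPLE_API_KEY`, and the choice to expand unset variables to an empty string are all assumptions made for the example; the library's actual behavior for unset variables is not documented on this page.

```python
import os
import re

# Matches ${NAME} where NAME looks like a shell variable name.
_VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(env: dict[str, str]) -> dict[str, str]:
    """Expand ${VAR} references in each value using os.environ."""

    def lookup(match: re.Match) -> str:
        # Assumption: unset variables expand to an empty string.
        return os.environ.get(match.group(1), "")

    return {key: _VAR_PATTERN.sub(lookup, value) for key, value in env.items()}


os.environ["EXAMPLE_API_KEY"] = "secret-123"  # hypothetical variable for the demo
expanded = expand_env({"API_KEY": "${EXAMPLE_API_KEY}", "MODE": "test"})
print(expanded)
```

Values without a `${...}` reference pass through unchanged, which is why `MODE` keeps its literal value.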
HarnessConfig
Bases: BaseModel
Main configuration for the test harness.
Supports multiple model providers and agent harnesses.
validate_model_for_provider()
Validate the model ID based on the provider.
The Anthropic provider accepts any model ID (direct API).
load_config
load_config(config_path)
Load configuration from a YAML file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config_path | str \| Path | Path to the YAML configuration file. | required |
Returns:
| Type | Description |
|---|---|
| HarnessConfig | Validated HarnessConfig instance. |
Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If config file doesn't exist. |
| ValueError | If config is invalid. |
Harness
run_evaluation
run_evaluation(config, run_mcp=True, run_baseline=True, verbose=False, verbosity=1, log_file=None, log_dir=None, task_ids=None) async
Run the full evaluation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | HarnessConfig | Harness configuration. | required |
| run_mcp | bool | Whether to run MCP evaluation. | True |
| run_baseline | bool | Whether to run baseline evaluation. | True |
| verbose | bool | Enable verbose output. | False |
| verbosity | int | Verbosity level (0=silent, 1=summary, 2=detailed). | 1 |
| log_file | TextIO \| None | Optional file handle for writing raw JSON logs. | None |
| log_dir | Path \| None | Optional directory for per-instance JSON log files. | None |
| task_ids | list[str] \| None | Specific task IDs to run (None for all). | None |
Returns:
| Type | Description |
|---|---|
| EvaluationResults | EvaluationResults with all results. |
EvaluationResults
EvaluationResults dataclass
Complete evaluation results.
TaskResult
TaskResult dataclass
Result for a single task.
Agent Harnesses
AgentHarness Protocol
AgentHarness
Bases: Protocol
Protocol for agent harnesses that solve SWE-bench tasks.
To add a new harness:
1. Create a class implementing this protocol.
2. Add it to HARNESS_REGISTRY.
3. Add the harness name to VALID_HARNESSES in config.py.
solve(task, workdir, timeout=300, verbose=False, task_id=None, env=None) async
Solve a SWE-bench task and return the patch.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| task | dict[str, Any] | SWE-bench task dictionary with problem_statement, etc. | required |
| workdir | str | Path to the repository working directory. | required |
| timeout | int | Timeout in seconds. | 300 |
| verbose | bool | If True, stream output to console. | False |
| task_id | str \| None | Task identifier for prefixing output. | None |
| env | TaskEnvironment \| None | Optional Docker environment to run inside. | None |
Returns:
| Type | Description |
|---|---|
| AgentResult | AgentResult with the generated patch. |
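As a rough illustration of step 1 (a class implementing the protocol), the sketch below mirrors the documented `solve` signature with local stand-ins instead of importing mcpbr. `SketchAgentResult`, `SketchAgentHarness`, and `EchoHarness` are hypothetical names, and the single `patch` field is an assumption: the real `AgentResult` fields are not listed on this page.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from typing import Any, Protocol, runtime_checkable


@dataclass
class SketchAgentResult:
    """Stand-in for AgentResult; the `patch` field is an assumption."""

    patch: str


@runtime_checkable
class SketchAgentHarness(Protocol):
    """Local mirror of the documented solve() signature."""

    async def solve(
        self,
        task: dict[str, Any],
        workdir: str,
        timeout: int = 300,
        verbose: bool = False,
        task_id: str | None = None,
        env: Any = None,
    ) -> SketchAgentResult: ...


class EchoHarness:
    """Toy harness that returns an empty patch without calling any model."""

    async def solve(self, task, workdir, timeout=300, verbose=False, task_id=None, env=None):
        if verbose:
            print(f"[{task_id}] would solve in {workdir}")
        return SketchAgentResult(patch="")


harness = EchoHarness()
result = asyncio.run(harness.solve({"problem_statement": "Fix the bug"}, "/tmp/repo"))
print(isinstance(harness, SketchAgentHarness), repr(result.patch))
```

Because the protocol is structural, `EchoHarness` does not inherit from it; having a matching `solve` method is enough. Note that the `isinstance` check only verifies that the method exists, not that its signature matches.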
AgentResult
AgentResult dataclass
Result from an agent run.
create_harness
create_harness(harness_name, model=None, mcp_server=None, prompt=None, max_iterations=10, verbosity=1, log_file=None)
Factory function to create an agent harness.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| harness_name | str | Name of the harness (currently only 'claude-code'). | required |
| model | str \| None | Optional model override. | None |
| mcp_server | MCPServerConfig \| None | MCP server configuration (used by claude-code harness). | None |
| prompt | str \| None | Custom prompt template. Use {problem_statement} placeholder. | None |
| max_iterations | int | Maximum agent iterations (used by claude-code harness). | 10 |
| verbosity | int | Verbosity level for logging (0=silent, 1=summary, 2=detailed). | 1 |
| log_file | TextIO \| InstanceLogWriter \| None | Optional file handle for writing raw JSON logs. | None |
Returns:
| Type | Description |
|---|---|
| AgentHarness | AgentHarness instance. |
Raises:
| Type | Description |
|---|---|
| ValueError | If harness_name is not recognized. |
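A registry-backed factory of this kind can be sketched as follows. This is not mcpbr's implementation: `SKETCH_REGISTRY`, `DummyHarness`, and `create_sketch_harness` are hypothetical stand-ins, used only to show the name lookup and the `ValueError` raised for an unrecognized harness name.

```python
from typing import Callable


class DummyHarness:
    """Hypothetical harness used only to populate the sketch registry."""

    def __init__(self, model=None, max_iterations=10):
        self.model = model
        self.max_iterations = max_iterations


# Maps harness names to constructors (stand-in for HARNESS_REGISTRY).
SKETCH_REGISTRY: dict[str, Callable[..., object]] = {"dummy": DummyHarness}


def create_sketch_harness(harness_name: str, **kwargs):
    """Look up a harness by name and construct it with the given options."""
    try:
        harness_cls = SKETCH_REGISTRY[harness_name]
    except KeyError:
        valid = ", ".join(sorted(SKETCH_REGISTRY))
        raise ValueError(f"Unknown harness {harness_name!r}; valid: {valid}") from None
    return harness_cls(**kwargs)


harness = create_sketch_harness("dummy", model="sonnet")

try:
    create_sketch_harness("nope")
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

Listing the valid names in the error message is a design choice of this sketch; the wording of the real error is not documented here.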
ClaudeCodeHarness
Harness that uses Claude Code CLI (claude) for solving tasks.
__init__(model=None, mcp_server=None, prompt=None, max_iterations=10, verbosity=1, log_file=None)
Initialize Claude Code harness.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str \| None | Optional model override. | None |
| mcp_server | MCPServerConfig \| None | MCP server configuration to use. | None |
| prompt | str \| None | Custom prompt template. Use {problem_statement} placeholder. | None |
| max_iterations | int | Maximum number of agentic turns. | 10 |
| verbosity | int | Verbosity level (0=silent, 1=summary, 2=detailed). | 1 |
| log_file | TextIO \| InstanceLogWriter \| None | Optional file handle for writing raw JSON logs. | None |
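A custom prompt template with the `{problem_statement}` placeholder might be rendered roughly like this. The helper `render_prompt` and the template text are hypothetical, and whether the harness uses substring replacement or `str.format` internally is an assumption; the sketch uses replacement so that other literal braces in a template are left alone.

```python
def render_prompt(template: str, problem_statement: str) -> str:
    """Fill the {problem_statement} placeholder in a prompt template."""
    # Assumption: substring replacement, so braces elsewhere in the
    # template (for example in a JSON snippet) are not interpreted.
    return template.replace("{problem_statement}", problem_statement)


# Hypothetical template; only the placeholder name comes from the docs.
template = (
    "Fix the following issue and reply with a patch.\n"
    'Config in use: {"strict": true}\n\n'
    "{problem_statement}"
)
prompt = render_prompt(template, "TypeError when calling parse() with None")
print(prompt)
```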
solve(task, workdir, timeout=300, verbose=False, task_id=None, env=None) async
Solve task using Claude Code CLI.
If env is provided and has claude_cli_installed=True, the task runs inside Docker; otherwise it runs locally on the host.
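The Docker-or-host decision described above reduces to a small predicate. The sketch below uses a stand-in `SketchEnv` dataclass with only the `claude_cli_installed` flag; `choose_runner` is a hypothetical helper, not part of mcpbr.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchEnv:
    """Stand-in for TaskEnvironment with only the flag that matters here."""

    claude_cli_installed: bool = False


def choose_runner(env: Optional[SketchEnv]) -> str:
    """Return "docker" only when an environment with the CLI installed is given."""
    if env is not None and env.claude_cli_installed:
        return "docker"
    return "local"


print(choose_runner(None))
print(choose_runner(SketchEnv(claude_cli_installed=False)))
print(choose_runner(SketchEnv(claude_cli_installed=True)))
```

Under this reading, passing an environment is not sufficient on its own: an environment without the CLI installed still falls back to running on the host.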
Docker Environment
DockerEnvironmentManager
Manages Docker environments for SWE-bench tasks.
__init__(use_prebuilt=True)
Initialize the Docker environment manager.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| use_prebuilt | bool | If True, try to use pre-built SWE-bench images first. | True |
create_environment(task) async
Create an isolated environment for a SWE-bench task.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| task | dict[str, Any] | SWE-bench task dictionary with repo, base_commit, etc. | required |
Returns:
| Type | Description |
|---|---|
| TaskEnvironment | TaskEnvironment instance. |
cleanup_all_sync()
Synchronously clean up all containers and temporary directories.
Used by signal handlers and atexit.
cleanup_all() async
Clean up all containers and temporary directories.
TaskEnvironment
TaskEnvironment dataclass
Represents an isolated environment for a SWE-bench task.
container instance-attribute
workdir instance-attribute
host_workdir instance-attribute
instance_id instance-attribute
uses_prebuilt = field(default=False) class-attribute instance-attribute
claude_cli_installed = field(default=False) class-attribute instance-attribute
exec_command(command, timeout=60, workdir=None, environment=None) async
Execute a command in the container.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| command | str \| list[str] | Command to execute (string or list). | required |
| timeout | int | Timeout in seconds. | 60 |
| workdir | str \| None | Working directory (defaults to /workspace). | None |
| environment | dict[str, str] \| None | Optional environment variables to set. | None |
Returns:
| Type | Description |
|---|---|
| tuple[int, str, str] | Tuple of (exit_code, stdout, stderr). |
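The `(exit_code, stdout, stderr)` return shape and the timeout semantics can be tried out without Docker. The sketch below runs a command as a local subprocess with asyncio; `run_local` is a hypothetical stand-in, it accepts only the list form of `command`, and it says nothing about how the real method talks to the container.

```python
import asyncio
import sys


async def run_local(command: list[str], timeout: int = 60) -> tuple[int, str, str]:
    """Run a command locally and return (exit_code, stdout, stderr)."""
    proc = await asyncio.create_subprocess_exec(
        *command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        out, err = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        # Do not leave the child running once the deadline has passed.
        proc.kill()
        await proc.wait()
        raise
    return proc.returncode, out.decode(), err.decode()


exit_code, stdout, stderr = asyncio.run(
    run_local([sys.executable, "-c", "print('hello')"])
)
print(exit_code, stdout.strip())
```

Callers would typically check `exit_code` first and fall back to `stderr` for diagnostics when it is non-zero.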
exec_command_streaming(command, workdir=None, environment=None, timeout=300, on_stdout=None, on_stderr=None) async
Execute a command in the container with streaming output.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| command | list[str] | Command to execute as list. | required |
| workdir | str \| None | Working directory (defaults to self.workdir). | None |
| environment | dict[str, str] \| None | Optional environment variables to set. | None |
| timeout | int | Timeout in seconds. | 300 |
| on_stdout | Any | Optional callback for stdout lines (receives str). | None |
| on_stderr | Any | Optional callback for stderr lines (receives str). | None |
Returns:
| Type | Description |
|---|---|
| tuple[int, str, str] | Tuple of (exit_code, stdout, stderr). |
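The callback contract (each callback receives one line as `str`, while the full output is still returned at the end) can be illustrated with a local subprocess. `run_streaming` and `_pump` are hypothetical helpers for this sketch, and stripping the trailing newline before invoking the callback is an assumption; the real method runs inside the container.

```python
import asyncio
import sys


async def _pump(stream, callback, sink):
    """Read a stream line by line, recording each line and notifying the callback."""
    while True:
        line = await stream.readline()
        if not line:  # EOF
            break
        text = line.decode()
        sink.append(text)
        if callback is not None:
            callback(text.rstrip("\r\n"))


async def run_streaming(command, timeout=300, on_stdout=None, on_stderr=None):
    """Run a command locally, streaming lines to callbacks as they arrive."""
    proc = await asyncio.create_subprocess_exec(
        *command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    out_lines, err_lines = [], []
    try:
        # Drain both pipes concurrently so neither can fill up and block the child.
        await asyncio.wait_for(
            asyncio.gather(
                _pump(proc.stdout, on_stdout, out_lines),
                _pump(proc.stderr, on_stderr, err_lines),
                proc.wait(),
            ),
            timeout=timeout,
        )
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()
        raise
    return proc.returncode, "".join(out_lines), "".join(err_lines)


seen = []
exit_code, stdout, stderr = asyncio.run(
    run_streaming([sys.executable, "-c", "print('a'); print('b')"], on_stdout=seen.append)
)
print(exit_code, seen)
```

Passing `seen.append` as the callback shows that any callable accepting a single string works, including a bound method that collects lines for later inspection.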
write_file(path, content, workdir=None) async
Write content to a file in the container.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | File path (relative to workdir). | required |
| content | str | Content to write. | required |
| workdir | str \| None | Working directory. If different from /workspace (the host mount), writes directly into container via docker exec. | None |
read_file(path) async
Read content from a file in the container.
cleanup() async
Stop and remove the container.
Evaluation
evaluate_patch
evaluate_patch(env, task, patch, test_timeout=120) async
Evaluate a patch against a SWE-bench task.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| env | TaskEnvironment | Docker environment. | required |
| task | dict[str, Any] | SWE-bench task dictionary. | required |
| patch | str | Unified diff patch to evaluate. | required |
| test_timeout | int | Timeout for each test. | 120 |
Returns:
| Type | Description |
|---|---|
| EvaluationResult | EvaluationResult with full evaluation details. |
EvaluationResult
EvaluationResult dataclass
Complete evaluation result for a task.
TestResults
TestResults dataclass
Results from running tests.
Models
ModelInfo
ModelInfo dataclass
Information about a supported model.
list_supported_models
list_supported_models()
Get a list of all supported models.
Returns:
| Type | Description |
|---|---|
| list[ModelInfo] | List of ModelInfo objects. |
get_model_info
get_model_info(model_id)
Get information about a model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_id | str | Anthropic model ID. | required |
Returns:
| Type | Description |
|---|---|
| ModelInfo \| None | ModelInfo if found, None otherwise. |
is_model_supported
is_model_supported(model_id)
Check if a model is in the supported list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_id | str | Anthropic model ID. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if the model is supported. |
Constants
Default Values
from mcpbr.models import DEFAULT_MODEL
from mcpbr.config import VALID_PROVIDERS, VALID_HARNESSES

print(DEFAULT_MODEL)    # "sonnet"
print(VALID_PROVIDERS)  # ("anthropic",)
print(VALID_HARNESSES)  # ("claude-code",)