Frequently Asked Questions (FAQ)¶
Quick answers to common questions about mcpbr. For detailed information, follow the links to the full documentation.
New to mcpbr? Check out the Best Practices Guide for tips on getting the most value from your evaluations.
Getting Started¶
What is mcpbr?¶
mcpbr (Model Context Protocol Benchmark Runner) is a tool for evaluating MCP servers against real GitHub issues from benchmarks like SWE-bench and CyberGym. It provides quantitative comparison between tool-assisted and baseline agent performance, helping you prove whether your MCP server actually improves AI coding capabilities.
How do I get started with mcpbr?¶
- Install mcpbr: pip install mcpbr
- Set your API key: export ANTHROPIC_API_KEY="sk-ant-..."
- Generate a config: mcpbr init
- Run an evaluation: mcpbr run -c mcpbr.yaml -n 1 -v
See the Installation Guide for detailed setup instructions.
What are the prerequisites?¶
- Python 3.11+ - Required for mcpbr
- Docker - Must be running (verify with docker info)
- Claude Code CLI - Install with npm install -g @anthropic-ai/claude-code
- Anthropic API key - Get one at console.anthropic.com
- Network access - For pulling Docker images and API calls
See Prerequisites for more details.
Which models does mcpbr support?¶
mcpbr supports all Claude models from Anthropic:
- Claude Opus 4.5 - Alias: opus or claude-opus-4-5-20251101
- Claude Sonnet 4.5 - Alias: sonnet or claude-sonnet-4-5-20250929 (recommended)
- Claude Haiku 4.5 - Alias: haiku or claude-haiku-4-5-20251001
Run mcpbr models to see the full list. See Supported Models.
Does mcpbr work on Apple Silicon Macs?¶
Yes! mcpbr works on M1/M2/M3 Macs using x86_64 Docker images via emulation. This may be slower than native ARM64 but ensures compatibility with all SWE-bench tasks. Install Rosetta 2 for best performance:
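On macOS:

softwareupdate --install-rosetta --agree-to-license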
See Apple Silicon Notes.
How do I install the Claude Code CLI?¶
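Install it globally with npm (the same command listed under the prerequisites):

npm install -g @anthropic-ai/claude-code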
Verify the installation:
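# Should print the installed CLI version
claude --version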
See Claude Code CLI Installation.
Installation & Setup¶
How do I install mcpbr?¶
From PyPI (recommended):
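pip install mcpbr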
From source:
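A typical from-source install (repository URL as listed under Additional Resources; the editable install is an assumption):

git clone https://github.com/greynewell/mcpbr.git
cd mcpbr
pip install -e .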
See Installation Methods.
How do I verify my installation?¶
# Check mcpbr version
mcpbr --version
# List supported models
mcpbr models
# Generate a test config
mcpbr init -o test-config.yaml
See Verify Installation.
How do I set my API key?¶
Set the ANTHROPIC_API_KEY environment variable:
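export ANTHROPIC_API_KEY="sk-ant-..."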
Add to your shell profile (.bashrc, .zshrc) for persistence:
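# Example for zsh; use ~/.bashrc for bash
echo 'export ANTHROPIC_API_KEY="sk-ant-..."' >> ~/.zshrc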
Configuration¶
How do I create a configuration file?¶
Use the init command to generate a starter config:
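mcpbr init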
For specific MCP servers, use templates:
# List available templates
mcpbr templates
# Use a specific template
mcpbr init -t filesystem
# Interactive template selection
mcpbr init -i
How do I configure my MCP server?¶
Edit the mcp_server section in your config file:
mcp_server:
command: "npx"
args:
- "-y"
- "@modelcontextprotocol/server-filesystem"
- "{workdir}"
env: {}
- command: Executable to run (e.g., npx, python, node)
- args: Command arguments. Use {workdir} as a placeholder for the task repository path
- env: Environment variables for the server
What is the {workdir} placeholder?¶
{workdir} is replaced at runtime with the path to the task repository inside the Docker container (typically /workspace). This allows your MCP server to access the codebase.
See The {workdir} Placeholder.
How do I use environment variables in config?¶
Reference environment variables using ${VAR_NAME} syntax:
mcp_server:
command: "npx"
args: ["-y", "@supermodeltools/mcp-server"]
env:
SUPERMODEL_API_KEY: "${SUPERMODEL_API_KEY}"
The variable will be expanded from your shell environment at runtime.
What configuration parameters are available?¶
Key parameters:
- mcp_server - MCP server command, args, and environment
- provider - LLM provider (anthropic)
- agent_harness - Agent backend (claude-code)
- model - Model alias or full ID (sonnet, opus, haiku)
- benchmark - Benchmark to run (swe-bench or cybergym)
- dataset - HuggingFace dataset (optional, benchmark provides default)
- sample_size - Number of tasks (null = full dataset)
- timeout_seconds - Timeout per task (default: 300)
- max_concurrent - Parallel task limit (default: 4)
- max_iterations - Max agent turns per task (default: 10)
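A minimal config sketch that pulls these fields together (values are illustrative, and the layout assumes these are all top-level keys; adjust to your setup):

provider: anthropic
agent_harness: claude-code
model: sonnet
benchmark: swe-bench
sample_size: 25
timeout_seconds: 300
max_concurrent: 4
max_iterations: 10
mcp_server:
  command: "npx"
  args: ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"]
  env: {}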
How do I customize the agent prompt?¶
Use the agent_prompt field:
agent_prompt: |
Fix the following bug in this repository:
{problem_statement}
Make the minimal changes necessary to fix the issue.
Use {problem_statement} as a placeholder for the task description. You can also override via CLI:
See Custom Agent Prompt.
Benchmark Selection¶
What benchmarks does mcpbr support?¶
mcpbr supports two benchmarks:
- SWE-bench - Bug fixing in Python repositories, evaluated with test suites
- CyberGym - Security exploit generation in C/C++ projects, evaluated by crash detection
Run mcpbr benchmarks to list available benchmarks.
See Benchmarks Guide.
How do I choose between SWE-bench and CyberGym?¶
- SWE-bench: Use for testing code exploration, bug fixing, and general software engineering tasks
- CyberGym: Use for security research, vulnerability analysis, and exploit generation
# Run SWE-bench (default)
mcpbr run -c config.yaml
# Run CyberGym
mcpbr run -c config.yaml --benchmark cybergym --level 2
See Comparing Benchmarks.
What are CyberGym difficulty levels?¶
CyberGym supports 4 difficulty levels (0-3) that control context given to the agent:
- Level 0: Minimal context (project name and bug ID only) - hardest
- Level 1: Adds vulnerability type information
- Level 2: Includes vulnerability type and description
- Level 3: Maximum context with detailed instructions - easiest
See Difficulty Levels.
How many tasks should I run?¶
Start small, then scale up:
- Development/Testing: 1-5 tasks
- Validation: 10-25 tasks
- Comprehensive: 50-100 tasks or full dataset (null sample_size)
# Quick test
mcpbr run -c config.yaml -n 1 -v
# Validation run
mcpbr run -c config.yaml -n 25
# Full dataset
mcpbr run -c config.yaml
Can I run specific tasks?¶
Yes, use the -t or --task flag:
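For example, using a SWE-bench task ID:

mcpbr run -c config.yaml -t astropy__astropy-12907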
You can specify multiple task IDs.
Running Evaluations¶
How do I run my first evaluation?¶
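The short version, using the same commands as the getting-started steps above:

mcpbr init
mcpbr run -c mcpbr.yaml -n 1 -v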
See Quick Start.
What do the run flags mean?¶
Common flags:
- -c, --config PATH - Path to YAML configuration file (required)
- -n, --sample INTEGER - Number of tasks to run
- -v, --verbose - Verbose output (-vv for very verbose)
- -M, --mcp-only - Run only MCP evaluation (skip baseline)
- -B, --baseline-only - Run only baseline evaluation (skip MCP)
- -o, --output PATH - Save JSON results
- -r, --report PATH - Save Markdown report
- --log-dir PATH - Directory for per-instance logs
See CLI Reference.
How long does an evaluation take?¶
Depends on several factors:
- Task complexity: 2-10 minutes per task on average
- Sample size: 1 task vs. 300 tasks
- Timeout setting: Default 300s (5 min) per task
- Concurrency: 4 parallel tasks (default)
- Platform: Slower on Apple Silicon (emulation)
Example: 25 tasks with 4 concurrent = ~30-60 minutes total
How do I speed up evaluations?¶
- Increase concurrency
- Use a faster model
- Reduce sample size for testing
- Use pre-built Docker images (enabled by default)
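For example, in your config (illustrative values):

max_concurrent: 8
model: haiku

And from the CLI, a reduced sample for testing:

mcpbr run -c config.yaml -n 5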
See Performance Issues.
Can I run only the MCP agent or baseline?¶
Yes:
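Using the -M / -B flags from the CLI reference:

# Run only the MCP-assisted agent
mcpbr run -c config.yaml -M

# Run only the baseline agent
mcpbr run -c config.yaml -B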
This is useful for testing your MCP server without waiting for baseline results.
How do I pause or resume evaluations?¶
mcpbr doesn't currently support pause/resume natively, but you can:
- Use --task to run specific tasks
- Save results with --output results.json
- Run remaining tasks separately
- Manually combine results
Future versions may include automatic checkpoint/resume.
MCP Server Setup¶
How do I test my MCP server with mcpbr?¶
- Configure your server in the config file
- Test standalone first
- Run a quick mcpbr test
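For example (the filesystem server and /tmp path are placeholders; substitute your own server command):

# Standalone check: run the server directly and confirm it starts
npx -y @modelcontextprotocol/server-filesystem /tmp

# Quick mcpbr test: one task, MCP agent only, verbose
mcpbr run -c config.yaml -n 1 -M -v

The config itself is the mcp_server block shown earlier in this FAQ.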
See Testing Your Server.
What MCP servers work with mcpbr?¶
Any MCP server that exposes tools for file operations, code search, or codebase analysis can be tested. Common examples:
- Anthropic filesystem server - Basic file operations
- Custom Python servers - Domain-specific tools
- Supermodel - Codebase analysis and semantic search
- Custom Node.js servers - API integrations
How does mcpbr register MCP servers?¶
mcpbr uses the Claude Code CLI's claude mcp add command to register your MCP server before each agent run. Tools from your server appear with the mcp__ prefix (e.g., mcp__read_file).
See How mcpbr Uses MCP.
Why would I use an MCP server vs. built-in tools?¶
MCP servers can provide specialized capabilities:
- Semantic code search - Beyond simple text grep
- Codebase indexing - Fast symbol lookup and references
- AST analysis - Parse and analyze code structure
- Domain-specific operations - Custom tools for your use case
- API integrations - External data sources
These can improve the agent's ability to understand and fix bugs compared to basic file operations.
My MCP server isn't starting - how do I debug it?¶
- Test the server independently
- Check that environment variables are set
- Verify the command exists
- Check mcpbr logs with verbose output
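For example (replace the server command and variable name with your own):

# Run the server by itself to surface startup errors
npx -y @modelcontextprotocol/server-filesystem /tmp

# Confirm required environment variables are set
echo "$SUPERMODEL_API_KEY"

# Verify the executable exists on PATH
which npx

# Re-run a single task with very verbose mcpbr output
mcpbr run -c config.yaml -n 1 -vv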
See Server Not Starting.
How do I check if my MCP tools are being used?¶
- Run with verbose output
- Look for tool calls with the mcp__ prefix (e.g., mcp__read_file)
- Check tool usage in the JSON results
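For example:

# Verbose run; MCP tool calls appear with the mcp__ prefix
mcpbr run -c config.yaml -n 1 -v -o results.json

# Inspect the tool_usage field in the saved results
grep -A 5 '"tool_usage"' results.json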
See Check Tool Usage.
Result Interpretation¶
What does "resolved" mean?¶
A task is resolved when:
- The agent generated a patch/solution
- The patch applied cleanly to the repository
- All FAIL_TO_PASS tests pass (tests that should pass after the fix)
- All PASS_TO_PASS tests pass (regression tests that should remain passing)
For CyberGym: PoC crashes pre-patch AND doesn't crash post-patch.
How do I interpret the improvement percentage?¶
The improvement shows how much better the MCP agent performed relative to baseline:
Example: If MCP resolves 32% and baseline resolves 20%, the relative improvement is (32 - 20) / 20 = 60%.
- Positive: MCP agent performed better
- Negative: Baseline performed better
- ~0%: Similar performance
What output formats are available?¶
- Console - Real-time progress and summary tables
- JSON (--output) - Structured data for programmatic analysis
- YAML (--output-yaml) - Human-readable structured format
- Markdown (--report) - Report for team reviews
- JUnit XML (--output-junit) - For CI/CD integration
- Per-instance logs (--log-dir) - Detailed debugging information
See Understanding Evaluation Results.
Where can I find detailed logs?¶
Use the --log-dir flag to save per-instance logs:
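# Save per-instance logs for every task
mcpbr run -c config.yaml --log-dir logs/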
This creates timestamped JSON files with full tool call traces for each task:
logs/
astropy__astropy-12907_mcp_20260117_143052.json
astropy__astropy-12907_baseline_20260117_143156.json
See Per-Instance Logs.
How do I analyze tool usage?¶
Check the tool_usage field in JSON results:
{
"tool_usage": {
"mcp__mcpbr__read_file": 15,
"mcp__mcpbr__search_files": 8,
"Bash": 27,
"Read": 22
}
}
Low MCP tool usage may indicate:

- Tools not helpful for the task
- Better built-in alternatives available
- Tool discovery or registration issues
See Tool Usage Analysis.
What if MCP and baseline have similar rates?¶
If both agents perform similarly:
- MCP tools may not provide additional value for these specific tasks
- Built-in tools may be sufficient
- Review tool usage to see if MCP tools are actually being used
- Consider testing on different tasks or benchmarks
See Common Patterns.
Troubleshooting¶
Docker is not running - how do I fix this?¶
Start Docker:
- macOS: open -a Docker
- Linux: sudo systemctl start docker
- Windows: Start Docker Desktop from the Start menu
Verify with:
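docker info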
See Docker Not Running.
Claude CLI not found - what should I do?¶
Install the Claude Code CLI:
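npm install -g @anthropic-ai/claude-code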
Verify installation:
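claude --version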
If installed but not found, add npm globals to PATH:
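One common fix (bash/zsh):

export PATH="$(npm config get prefix)/bin:$PATH"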
See Claude CLI Not Found.
Why is mcpbr slow on my Apple Silicon Mac?¶
mcpbr uses x86_64 Docker images for compatibility, which run via emulation on ARM64 Macs. This is normal and expected behavior.
To optimize:
- Install Rosetta 2
- Reduce concurrency
- Increase timeouts
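The Rosetta install command is shown earlier in this FAQ; the config knobs look like this (illustrative values):

max_concurrent: 2
timeout_seconds: 600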
Tasks are timing out - what should I do?¶
Increase the timeout in your config:
Or reduce iterations if the agent is looping:
For testing, use a faster model:
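For example, in your config (illustrative values):

timeout_seconds: 600   # raise the per-task timeout from the 300s default
max_iterations: 5      # fewer agent turns if the agent is looping
model: haiku           # faster model for test runs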
See Task Timeouts.
Pre-built Docker image not found - is this a problem?¶
mcpbr will fall back to building from scratch, which is less reliable but usually works. This warning is informational.
You can:
- Manually pull the image
- Or disable pre-built images
See Pre-built Image Not Found.
How do I clean up Docker containers?¶
Use the cleanup command:
# Preview what would be removed
mcpbr cleanup --dry-run
# Remove orphaned containers
mcpbr cleanup
# Skip confirmation
mcpbr cleanup -f
See Orphaned Docker Resources.
API key is not working - how do I check?¶
Verify the key is set correctly:
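# Print the first few characters to confirm it is set (avoid echoing the full key)
echo "${ANTHROPIC_API_KEY:0:10}"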
The key should:

- Start with sk-ant-
- Have no extra whitespace
- Be exported in your current shell session
Re-export if needed:
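export ANTHROPIC_API_KEY="sk-ant-..."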
See API Key Not Set.
Performance & Optimization¶
How do I optimize evaluation performance?¶
- Increase concurrency (if you have resources)
- Use pre-built images (enabled by default)
- Use faster models for testing
- Reduce sample size during development
- On Apple Silicon, reduce concurrency to avoid resource contention
What's the difference between models?¶
- Opus 4.5 - Most capable, highest cost, slowest
- Sonnet 4.5 - Balanced performance and cost (recommended)
- Haiku 4.5 - Fastest and cheapest, good for testing
For production evaluations, use Sonnet or Opus. For development, Haiku is sufficient.
How do I reduce costs?¶
- Use Haiku for development
- Test with smaller samples
- Reduce max iterations
- Use shorter timeouts
- Run only MCP or baseline
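For example (illustrative values; flags as listed in the CLI reference):

# In your config
model: haiku
sample_size: 5
max_iterations: 5
timeout_seconds: 180

# Or run only one side
mcpbr run -c config.yaml -M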
How much does an evaluation cost?¶
Costs depend on:

- Model used (Haiku < Sonnet < Opus)
- Number of tasks (25 vs. 300)
- Task complexity (tokens per task)
- Iterations (max_iterations setting)

Rough estimates for SWE-bench Lite (300 tasks) with Sonnet:

- Full evaluation: ~$50-150
- 25-task sample: ~$5-15
- Single task: ~$0.20-0.60
Use --output-yaml to track token usage and calculate exact costs.
Docker & Environment¶
What Docker images does mcpbr use?¶
For SWE-bench: Pre-built images from Epoch AI's registry are used when available.
For CyberGym: mcpbr builds custom images with compilation tools and sanitizers.
How does mcpbr use Docker?¶
- Creates an isolated container per task
- Sets up the repository at the correct commit
- Installs dependencies (or uses pre-built image)
- Runs Claude Code CLI inside the container
- Evaluates the solution
- Cleans up the container
The agent runs inside the container so Python imports and tests work correctly.
Can I run mcpbr without Docker?¶
No, Docker is required for:

- Isolated task environments
- Reproducible evaluations
- Consistent dependencies
- Safe code execution
Make sure Docker is running before starting evaluations.
How do I see Docker container logs?¶
While running:
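# List running task containers, then follow one container's logs
docker ps
docker logs -f <container-id>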
Enable verbose mcpbr output:
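mcpbr run -c config.yaml -n 1 -vv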
Cost & Billing¶
How can I estimate costs before running?¶
Rough cost estimates per task with Sonnet:

- Simple tasks: $0.20-0.40
- Average tasks: $0.40-0.80
- Complex tasks: $0.80-1.50

For a 25-task sample: ~$10-30 total. For full SWE-bench Lite (300 tasks): ~$100-300 total.
Start with 1-5 tasks to gauge costs for your specific configuration.
Does mcpbr track token usage?¶
Yes, the JSON output includes detailed token usage:
Save results with --output to analyze token consumption.
Can I set a budget or limit?¶
mcpbr doesn't have built-in budget limits, but you can:
- Use sample_size to limit tasks
- Use timeout_seconds to limit runtime per task
- Use max_iterations to limit agent turns
Monitor costs through Anthropic Console.
Are there any free tiers?¶
mcpbr itself is free and open-source. However, you need:
- Anthropic API credits - Check console.anthropic.com for current pricing
- Docker - Free for personal use
Infrastructure costs are minimal (local Docker execution).
Advanced Usage¶
Can I use mcpbr in CI/CD?¶
Yes! mcpbr supports CI/CD integration with:
- JUnit XML output for test reporting
- Exit codes for pass/fail status
- Regression detection with thresholds
- Notifications via Slack/Discord/Email
mcpbr run -c config.yaml \
--output-junit junit.xml \
--baseline-results baseline.json \
--regression-threshold 0.1 \
--slack-webhook https://hooks.slack.com/...
See CI/CD Integration for more details.
How do I compare two MCP servers?¶
- Create separate configs
- Run evaluations
- Compare resolution rates
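For example (config and output file names are illustrative):

mcpbr run -c server-a.yaml -o results-server-a.json
mcpbr run -c server-b.yaml -o results-server-b.json

Then compare the resolution rates reported in each summary.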
See Comparing Servers.
How do I track regressions between versions?¶
Use regression detection:
# Run baseline with version 1.0
mcpbr run -c config.yaml -o baseline-v1.json
# Later, compare version 2.0
mcpbr run -c config.yaml \
--baseline-results baseline-v1.json \
--regression-threshold 0.1
This exits with code 1 if regression rate exceeds 10%, perfect for CI/CD.
See Regression Detection for more details.
Can I customize the task selection?¶
Yes, several ways:
- Use specific tasks (-t)
- Use sample size (-n)
- Filter by repository (requires code modification currently)
How do I contribute to mcpbr?¶
We welcome contributions! Check out the GitHub repository: github.com/greynewell/mcpbr
Key areas:

- Output formats (CSV, XML)
- Configuration templates
- Documentation improvements
- Bug fixes and performance optimizations
Additional Resources¶
Where can I get help?¶
- Documentation: greynewell.github.io/mcpbr
- GitHub Issues: github.com/greynewell/mcpbr/issues
- GitHub Discussions: github.com/greynewell/mcpbr/discussions
When reporting issues, include:
See Getting Help.
How do I stay updated?¶
- Star the repo: github.com/greynewell/mcpbr
- Watch releases: Get notifications for new versions
- Follow the roadmap: Project Board
- Join discussions: Share feedback and ideas
Where can I find examples?¶
- Examples: Check the examples/ directory for sample configurations
- Templates: Template guide
- Documentation: Each guide includes examples
- Tests: Check the tests/ directory for code examples
What's on the roadmap?¶
Major upcoming features:
- More benchmarks - HumanEval, MBPP, GAIA, SWE-bench Verified
- Better UX - Real-time dashboard, interactive wizard
- Platform expansion - NPM package, GitHub Action, Homebrew
- MCP testing suite - Coverage analysis, performance profiling
See the full roadmap for details.
Quick Reference¶
Essential Commands¶
# List available commands
mcpbr --help
# Initialize config
mcpbr init
mcpbr init -t filesystem # Use template
mcpbr init -i # Interactive
# List options
mcpbr models # Available models
mcpbr benchmarks # Available benchmarks
mcpbr templates # Configuration templates
# Run evaluation
mcpbr run -c config.yaml # Full run
mcpbr run -c config.yaml -n 5 -v # 5 tasks, verbose
mcpbr run -c config.yaml -M # MCP only
mcpbr run -c config.yaml -o results.json # Save results
# Cleanup
mcpbr cleanup --dry-run # Preview
mcpbr cleanup # Remove orphaned containers
Common Workflows¶
Quick test:
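mcpbr run -c config.yaml -n 1 -v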
Full evaluation:
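# Full run, saving JSON results and a Markdown report (file names are examples)
mcpbr run -c config.yaml -o results.json -r report.md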
Debug MCP server:
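# One task, MCP agent only, very verbose, with per-instance logs
mcpbr run -c config.yaml -n 1 -vv -M --log-dir logs/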
CI/CD integration:
mcpbr run -c config.yaml --output-junit junit.xml --baseline-results baseline.json --regression-threshold 0.1
Still have questions? Check the full documentation or open an issue.