Claude Code Plugin¶
The mcpbr Claude Code plugin makes Claude an expert at running benchmarks correctly. When you work with mcpbr in Claude Code, the plugin automatically provides specialized knowledge about commands, configuration, and best practices.
Overview¶
The plugin consists of two components:
- Plugin manifest (`.claude-plugin/plugin.json`) - Registers mcpbr with Claude Code
- Skills directory (`skills/`) - Contains specialized instruction sets for specific tasks
When Claude Code detects the plugin in a repository, it automatically:
- Validates prerequisites before running commands
- Generates correct configuration files with required placeholders
- Uses appropriate CLI flags and options
- Provides helpful troubleshooting when issues occur
- Follows best practices without being explicitly instructed
Installation¶
The plugin is bundled with mcpbr and activated automatically when you work in a cloned repository.
Option 1: Clone the Repository (Recommended)¶
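A minimal sketch of this option; the repository URL below is a placeholder, not the actual address:

```bash
# Substitute the real mcpbr repository URL for the placeholder.
git clone <mcpbr-repository-url>
cd mcpbr
```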
That's it! Claude Code will automatically detect the .claude-plugin/plugin.json manifest and load all skills.
Option 2: Install as a Standalone Plugin¶
If you want to use the plugin without cloning the full repository:
- Copy the plugin files to your project (a sketch of the copy commands follows this list)
- Claude Code will detect the plugin next time you open the project
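A sketch of the copy step, assuming you already have a local clone of mcpbr in a sibling directory (the `../mcpbr` path is illustrative):

```bash
# Copy the plugin manifest and skills from a local mcpbr clone.
cp -r ../mcpbr/.claude-plugin ./
cp -r ../mcpbr/skills ./
```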
Option 3: Manual Installation (Advanced)¶
For custom setups, you can manually configure the plugin:
- Create a `.claude-plugin` directory in your project root
- Create `plugin.json` with the structure sketched after this list
- Create a `skills/` directory with skill subdirectories (see How It Works for details)
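A sketch of the manifest for step 2; only the `name`, `version`, and `description` fields described in this guide are shown, and the values are illustrative:

```json
{
  "name": "mcpbr",
  "version": "0.1.0",
  "description": "Teaches Claude Code how to run mcpbr benchmarks correctly"
}
```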
Skills Reference¶
The plugin includes three specialized skills for common mcpbr tasks:
1. mcpbr-eval (run-benchmark)¶
Expert at running evaluations with proper validation.
Purpose: Execute benchmark evaluations with mcpbr while validating all prerequisites and avoiding common mistakes.
Key Features:
- Checks Docker is running before starting
- Verifies API keys are set
- Validates configuration files exist and are correct
- Supports all benchmarks (SWE-bench, CyberGym, MCPToolBench++)
- Provides actionable troubleshooting for errors
When to Use: Anytime you want to run a benchmark evaluation.
Example Prompts:
"Run the SWE-bench benchmark with 10 tasks"
"Evaluate my MCP server on CyberGym level 2"
"Run a quick test with 1 task"
What the Skill Does:
- Verifies Docker is running with `docker ps`
- Checks for the `ANTHROPIC_API_KEY` environment variable
- Ensures a config file exists (runs `mcpbr init` if needed)
- Validates the config has the required `{workdir}` placeholder
- Constructs the correct `mcpbr run` command with appropriate flags
- Monitors execution and provides troubleshooting if errors occur
Common Validations:
- Docker daemon is running
- API key is set in environment
- Config file exists and is valid YAML
- MCP server command is available (`npx`, `uvx`, `python`, etc.)
- `{workdir}` placeholder is present in server args
- Model and dataset names are valid
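A sketch of how these validations might be reproduced by hand from a shell; the commands mirror the checklist above rather than the skill's exact implementation:

```bash
docker ps > /dev/null || echo "Docker daemon is not running"            # Docker check
[ -n "$ANTHROPIC_API_KEY" ] || echo "ANTHROPIC_API_KEY is not set"      # API key check
[ -f mcpbr.yaml ] || mcpbr init                                          # create config if missing
grep -q '{workdir}' mcpbr.yaml || echo "missing {workdir} placeholder"  # placeholder check
```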
2. mcpbr-config (generate-config)¶
Generates valid mcpbr configuration files.
Purpose: Create correct YAML configuration files for MCP server benchmarking with all required fields and placeholders.
Key Features:
- Ensures the critical `{workdir}` placeholder is included
- Validates that MCP server commands exist
- Provides templates for common MCP servers
- Supports all benchmark types
- Prevents common configuration mistakes
When to Use: When creating or modifying mcpbr configuration files.
Example Prompts:
"Generate a config for my Python MCP server"
"Create a config using the filesystem server"
"Help me configure my custom MCP server"
What the Skill Does:
- Asks about your MCP server (command, args, env vars)
- Selects appropriate template (npx, uvx, python, etc.)
- Ensures the `{workdir}` placeholder is in the args array
- Validates that the YAML syntax is correct
- Saves the config to `mcpbr.yaml` or a specified path
- Optionally tests the config with a single task
Configuration Templates:
The skill provides pre-built templates for:
- Anthropic filesystem server (`@modelcontextprotocol/server-filesystem`)
- Python MCP servers via uvx
- Custom Node.js servers via npx
- Direct Python execution
- Servers requiring environment variables
Critical Requirements:
- `{workdir}` placeholder MUST be in the `args` array
- Command must be an executable available in PATH
- YAML indentation must use spaces (not tabs)
- Environment variable references need quotes
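A sketch of a config that satisfies these requirements, based on the filesystem-server template listed above. The top-level key names are assumptions rather than the verified mcpbr schema; the essentials are the `{workdir}` placeholder in the args array, space indentation, and quoted environment-variable references:

```yaml
# Sketch only -- key names are illustrative assumptions.
mcp_server:
  command: npx
  args:
    - "-y"
    - "@modelcontextprotocol/server-filesystem"
    - "{workdir}"                        # required placeholder
  env:
    EXAMPLE_TOKEN: "${EXAMPLE_TOKEN}"    # quoted env var reference
```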
3. benchmark-swe-lite (swe-bench-lite)¶
Quick-start command for SWE-bench Lite evaluation.
Purpose: Streamlined way to run SWE-bench Lite with sensible defaults for quick testing and demonstrations.
Key Features:
- Pre-configured for 5-task evaluation
- Includes default output files (results.json, report.md)
- Provides runtime and cost estimates
- Perfect for testing and demos
When to Use: For quick validation or demonstrations of mcpbr functionality.
Example Prompts:
What the Skill Does:
- Checks prerequisites (Docker, API key, config)
- Runs `mcpbr run` with 5 tasks from SWE-bench Lite
- Saves results to `results.json` and `report.md`
- Uses verbose output for visibility
- Provides expected runtime/cost estimates
Default Command:
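A sketch of what the default command looks like, using only the flags documented on this page (`-n` for sample size, `-v` for verbose); how `results.json` and `report.md` are specified is handled by the skill and not shown here:

```bash
# 5 tasks from SWE-bench Lite with verbose output.
mcpbr run -n 5 -v
```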
Expected Performance:
- Runtime: 15-30 minutes (depends on task complexity)
- Cost: $2-5 (depends on task complexity and model)
Customization Options:
- Change sample size: `-n 1` (quick test) or `-n 10` (more thorough)
- MCP-only evaluation: add the `-M` flag
- Very verbose output: use `-vv` instead of `-v`
- Specific tasks: use the `-t <instance_id>` flag
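For example, combining several of the options above into one invocation (a sketch, not the skill's exact command):

```bash
# MCP-only evaluation of a single task with very verbose output.
mcpbr run -M -vv -t <instance_id>
```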
How It Works¶
Plugin Architecture¶
.claude-plugin/
└── plugin.json # Manifest that registers the plugin
skills/
├── mcpbr-eval/
│ └── SKILL.md # Instructions for running evaluations
├── mcpbr-config/
│ └── SKILL.md # Instructions for config generation
└── benchmark-swe-lite/
└── SKILL.md # Quick-start instructions
Skill File Format¶
Each skill is defined by a SKILL.md file with the following structure:
---
name: skill-name
description: Brief description of what this skill does
---
# Instructions
[Main skill content with detailed instructions]
## Critical Constraints
[Non-negotiable requirements that MUST be followed]
## Common Pitfalls
[Mistakes to avoid]
## Examples
[Usage examples and code snippets]
## Troubleshooting
[Common issues and solutions]
How Claude Uses Skills¶
When you ask Claude to perform a task in a repository with the plugin:
- Detection: Claude Code detects `.claude-plugin/plugin.json`
- Loading: All skills in `skills/` are loaded into Claude's context
- Selection: Claude identifies which skill(s) are relevant to your request
- Execution: Claude follows the skill's instructions and constraints
- Validation: Critical requirements are checked before and during execution
- Troubleshooting: If errors occur, the skill provides actionable feedback
Example Flow¶
Without Plugin:
User: "Run the benchmark"
Claude: *tries `mcpbr run` without config, fails*
Claude: *forgets to check Docker, fails*
Claude: *uses wrong flags, gets errors*
With Plugin:
User: "Run the benchmark"
Claude: *checks Docker with `docker ps`*
Claude: *verifies config exists*
Claude: *validates `{workdir}` placeholder*
Claude: *constructs correct command*
Claude: *evaluation succeeds*
Troubleshooting¶
Plugin Not Detected¶
Symptom: Claude doesn't seem to know about mcpbr commands or best practices.
Solutions:
- Verify `.claude-plugin/plugin.json` exists
- Check `plugin.json` is valid JSON
- Ensure the `skills/` directory exists (the first three checks are sketched after this list)
- Restart Claude Code or reload the workspace
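A sketch of those checks using standard shell tools:

```bash
ls .claude-plugin/plugin.json                    # manifest exists?
python -m json.tool .claude-plugin/plugin.json   # valid JSON?
ls skills/                                       # skills directory exists?
```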
Skills Not Working¶
Symptom: Claude makes mistakes that the skills should prevent.
Solutions:
- Verify skill files exist
- Check skill files have valid frontmatter (both checks are sketched after this list)
- Ensure the frontmatter has `name` and `description` fields
- Verify there are no syntax errors in the skill content
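A sketch of the first two checks:

```bash
ls skills/*/SKILL.md                 # skill files exist?
head -5 skills/mcpbr-eval/SKILL.md   # frontmatter should open with --- and include name/description
```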
Version Mismatch¶
Symptom: Plugin version doesn't match mcpbr version.
Solutions:
- Check the versions
- Sync the versions automatically (both steps are sketched after this list)
- Manually update the `plugin.json` version to match `pyproject.toml`
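A sketch of the first two steps, assuming the version appears as a `version = "..."` line in `pyproject.toml`:

```bash
grep '"version"' .claude-plugin/plugin.json   # plugin version
grep 'version' pyproject.toml                 # package version
make sync-version                             # sync them automatically
```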
Custom Skills Not Loading¶
Symptom: New custom skills aren't recognized by Claude.
Solutions:
- Verify the skill directory structure (see the layout sketched after this list)
- Check `SKILL.md` has valid frontmatter with `name` and `description`
- Ensure there are no YAML syntax errors in the frontmatter
- Restart Claude Code after adding new skills
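The layout expected by the first step, with `my-skill` as a placeholder name:

```text
skills/
└── my-skill/
    └── SKILL.md
```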
FAQ¶
How do I create custom skills?¶
- Create a new directory in `skills/`
- Create `SKILL.md` with frontmatter
- Add tests in `tests/test_claude_plugin.py`
- Run the tests to validate (the full flow is sketched after this list)
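A sketch of the full flow; `my-skill` and its description are placeholders:

```bash
mkdir -p skills/my-skill
cat > skills/my-skill/SKILL.md <<'EOF'
---
name: my-skill
description: Brief description of what this skill does
---
# Instructions
[Detailed instructions, constraints, examples, and troubleshooting]
EOF
pytest tests/test_claude_plugin.py -v
```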
Can I use the plugin with other projects?¶
Yes! The plugin is designed for mcpbr but you can adapt the pattern:
- Copy `.claude-plugin/plugin.json` to your project
- Update the `name`, `version`, and `description` fields
- Create custom skills in the `skills/` directory
- Each skill teaches Claude about your project's specific commands and workflows
How do I update the plugin?¶
The plugin files are versioned with the repository, so pulling new mcpbr updates (for example, with `git pull`) updates the plugin automatically.
For standalone installations, manually copy the updated files:
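For example (assuming the updated clone lives at `../mcpbr`, which is an illustrative path):

```bash
cp -r ../mcpbr/.claude-plugin ./
cp -r ../mcpbr/skills ./
```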
Does the plugin work offline?¶
The plugin files work offline, but mcpbr itself requires:
- Network access for Docker image pulls
- API access to Anthropic's servers
The plugin instructions are embedded in the repository and don't require external resources.
How do I disable the plugin?¶
To temporarily disable the plugin, and to re-enable it later, one approach is to move the plugin files out of the way and back again (sketched below).
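Renaming the plugin directories is an assumption rather than an officially documented mechanism; restart Claude Code or reload the workspace after either step:

```bash
mv .claude-plugin .claude-plugin.disabled && mv skills skills.disabled   # disable
mv .claude-plugin.disabled .claude-plugin && mv skills.disabled skills   # re-enable
```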
Can I contribute new skills?¶
Yes! Contributions are welcome. To add a new skill:
- Create the skill directory and SKILL.md file
- Add comprehensive tests in `tests/test_claude_plugin.py`
- Update `skills/README.md` to document the new skill
- Run pre-commit hooks: `pre-commit run --all-files`
- Submit a pull request
See the contributing guide for detailed guidelines.
What's the difference between skills and documentation?¶
Documentation (like this page) is for human readers to understand how things work.
Skills are instruction sets that Claude Code reads and follows when performing tasks. They include:
- Specific validation steps
- Common pitfalls to avoid
- Exact command formats
- Troubleshooting procedures
Think of skills as "executable documentation" that guides Claude's actions.
How do I test if the plugin is working?¶
Ask Claude to perform a task that requires domain knowledge:
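For example, reuse one of the prompts from the skills above, such as "Run a quick test with 1 task" or "Run the SWE-bench benchmark with 10 tasks".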
If the plugin is working, Claude should:
- Check Docker is running
- Verify API key is set
- Ensure config exists
- Construct a valid command
- Execute without errors
If Claude skips these steps or makes mistakes, the plugin may not be loaded.
Are there performance implications?¶
The plugin files are small (a few KB total) and have minimal impact on performance:
- Load time: Negligible (files are read once on workspace load)
- Memory: Skills are loaded into Claude's context but don't significantly impact token usage
- Execution: Skills improve efficiency by preventing errors and reducing back-and-forth
How is version sync maintained?¶
The plugin version in .claude-plugin/plugin.json is automatically synced with pyproject.toml:
- Pre-commit hook: runs `sync_version.py` before each commit
- Make target: `make sync-version` syncs versions manually
- CI checks: GitHub Actions verify versions match
- Build process: `make build` automatically syncs versions
This ensures the plugin version always matches the mcpbr package version.
Version Management¶
Automatic Version Sync¶
The plugin version is kept in sync with mcpbr through automated processes:
# Manual sync
make sync-version
# Automatic sync during build
make build
# CI verification
pytest tests/test_claude_plugin.py::TestPluginManifest::test_plugin_version_matches_pyproject
Version Sync Script¶
Location: scripts/sync_version.py
The script:
- Reads the version from `pyproject.toml`
- Updates `.claude-plugin/plugin.json`
- Exits with an error if the sync fails (for CI)
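A minimal sketch of that behavior; this is not the actual contents of `scripts/sync_version.py`, and the `[project].version` location in `pyproject.toml` is an assumption:

```python
# Sketch of the sync behavior described above (not the real script).
import json
import sys
import tomllib  # assumes Python 3.11+

with open("pyproject.toml", "rb") as f:
    version = tomllib.load(f)["project"]["version"]

manifest_path = ".claude-plugin/plugin.json"
with open(manifest_path, encoding="utf-8") as f:
    manifest = json.load(f)

if manifest.get("version") != version:
    manifest["version"] = version
    try:
        with open(manifest_path, "w", encoding="utf-8") as f:
            json.dump(manifest, f, indent=2)
            f.write("\n")
    except OSError as exc:
        print(f"Version sync failed: {exc}", file=sys.stderr)
        sys.exit(1)
```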
Pre-commit Hook¶
The .pre-commit-config.yaml includes a hook that automatically syncs versions:
- repo: local
  hooks:
    - id: sync-version
      name: Sync plugin version
      entry: python scripts/sync_version.py
      language: system
      pass_filenames: false
Testing¶
The plugin includes comprehensive tests to ensure quality:
Run All Plugin Tests¶
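This is the same command used elsewhere on this page:

```bash
pytest tests/test_claude_plugin.py -v
```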
Test Categories¶
- Manifest Tests: Validate `plugin.json` structure and content
- Skill Tests: Ensure skills have proper format and required content
- Version Tests: Verify version sync script and automation
- Documentation Tests: Check README mentions all skills
- Integration Tests: Validate pre-commit hooks and Makefile targets
Example Test Output¶
tests/test_claude_plugin.py::TestPluginManifest::test_plugin_json_exists PASSED
tests/test_claude_plugin.py::TestPluginManifest::test_plugin_json_valid PASSED
tests/test_claude_plugin.py::TestPluginManifest::test_plugin_version_matches_pyproject PASSED
tests/test_claude_plugin.py::TestSkills::test_mcpbr_eval_mentions_docker PASSED
tests/test_claude_plugin.py::TestSkills::test_mcpbr_config_mentions_workdir PASSED
Adding Tests for Custom Skills¶
When creating a custom skill, add tests to verify:
- Skill directory and SKILL.md exist
- Frontmatter is valid and complete
- Critical keywords are present (Docker, {workdir}, etc.)
- Instructions section exists
- Examples are included
Example test:
def test_my_skill_mentions_critical_concept(skills_dir: Path) -> None:
    """Test that my-skill mentions critical concept."""
    skill_path = skills_dir / "my-skill" / "SKILL.md"
    content = skill_path.read_text()
    assert "critical_concept" in content, "my-skill should mention critical_concept"
Related Resources¶
- Skills README - Detailed skill development guide
- Plugin Tests - Test suite for validation
- Contributing Guide - How to contribute skills
- CLI Reference - Complete mcpbr command documentation
- Configuration Guide - Config file reference
Support¶
If you encounter issues with the plugin:
- Check the Troubleshooting section above
- Review FAQ for common questions
- Run plugin tests: `pytest tests/test_claude_plugin.py -v`
- Open an issue on GitHub
- Join discussions in the repository
When reporting issues, include:
- Claude Code version
- mcpbr version
- Plugin version (from `.claude-plugin/plugin.json`)
- Error messages or unexpected behavior
- Steps to reproduce