Open-source benchmark methodology for evaluating AI coding agents on WordPress management tasks via MCP (Model Context Protocol) tools.
This repository contains the complete benchmark harness, scenario definitions, fairness protocol, and execution scripts used to evaluate WP Navigator against competitor MCP tools. It is published for independent verification — anyone can run these benchmarks to validate our claims.
WP Navigator created and runs these benchmarks, which makes it both referee and player. Publishing the full methodology addresses that conflict of interest: anyone can inspect the harness, scenarios, and scripts, then reproduce the results independently.
| Path | Contents |
|---|---|
| harness/ | TypeScript benchmark harness (runner, scoring, LLM judge) |
| scenarios/ | 51 YAML scenario definitions (Modes A, B, and E) |
| configs/mcp/ | 5 competitor MCP configurations |
| fixtures/ | Deterministic test data and template install scripts |
| scripts/ | Batch runners and orchestration scripts |
| MODE_B_FAIRNESS.md | Fairness protocol and publication checklist |
| docker-compose.yml | WordPress staging environment (Podman/Docker) |
| Mode | What It Tests | Scenarios | Method |
|---|---|---|---|
| A | Initial tool-definition context (token count) | 18 | Scripted — no agent needed |
| B | Real agent task completion | 25 | Live agent sessions against WordPress |
| C | Cookbook effectiveness (with/without) | 8 variants | Paired comparison |
| D | Safety features (kill switch, guardrails) | 3 | WP Navigator-specific |
| E | Template-based quality (LLM-as-a-Judge) | 8 | 5x5 matrix, automated scoring |
| Competitor | Tool Count | Config |
|---|---|---|
| WP Navigator | 42 (5 meta + 37 underlying) | configs/mcp/wpnav.mcp.json |
| Claudeus | 145 | configs/mcp/claudeus.mcp.json |
| Respira | 88 | configs/mcp/respira.mcp.json |
| InstaWP | 36 | configs/mcp/instawp.mcp.json |
| Raw REST API | 0 (baseline) | configs/mcp/raw-rest.mcp.json |
Claude Code, Codex CLI, Gemini CLI, OpenCode, Pi
- Node.js 20+
- Podman or Docker (for WordPress staging)
- AI agent CLI tools (OpenCode, Codex CLI, etc.)
- WordPress app password
podman-compose up -d
# or: docker-compose up -d
# Waits for WordPress at http://localhost:8080

cd harness
npm install
npx tsc # Build TypeScript

bash fixtures/seed-realworld.sh
bash fixtures/templates/install-template.sh

node harness/dist/index.js --mode A --competitor all

# Set WordPress credentials
export WPNAV_URL=http://localhost:8080
export WPNAV_USERNAME=admin
export WPNAV_APP_PASSWORD=your_app_password
# Run with one agent
bash scripts/run-batch.sh --agent opencode --model openai/gpt-5.2 --mode B
# Run with a specific competitor's MCP tools
bash scripts/run-batch.sh --agent opencode --model openai/gpt-5.2 --mode B --competitor claudeus

export ANTHROPIC_API_KEY=your_key # For the LLM judge
bash scripts/run-batch.sh --agent opencode --model openai/gpt-5.2 --mode E --scenarios f01-homepage-hero-update

See MODE_B_FAIRNESS.md for the full protocol. Key rules:
- Neutral directory: agents run from `/tmp/wpnav-bench/`, not the project directory
- Identical prompts: every agent receives the exact prompt from the scenario YAML
- Clean state: WordPress reset between scenarios
- Report everything: all agents, all results, even unflattering ones
- No insider knowledge: agents have no pre-loaded context about tool names
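
As an illustration of the first two rules, the sketch below plans one run per agent with a shared working directory and an unmodified prompt, then checks that nothing diverged. The `Scenario` and `AgentRun` types and the `planRuns` and `isFair` helpers are hypothetical names invented for this example; the real harness may organize this differently.

```typescript
// Hypothetical types; not taken from the actual harness.
interface Scenario {
  name: string;
  prompt: string;
}

interface AgentRun {
  agent: string;
  scenario: string;
  cwd: string;    // working directory the agent process would start in
  prompt: string; // text handed to the agent, unmodified
}

// Neutral directory named in the fairness rules.
const NEUTRAL_DIR = "/tmp/wpnav-bench/";

// Plan one run per agent. Every agent gets the same cwd and the scenario
// prompt exactly as written, with no per-agent hints appended.
function planRuns(scenario: Scenario, agents: string[]): AgentRun[] {
  return agents.map((agent) => ({
    agent,
    scenario: scenario.name,
    cwd: NEUTRAL_DIR,
    prompt: scenario.prompt,
  }));
}

// True when all planned runs share one prompt and use the neutral directory.
function isFair(runs: AgentRun[]): boolean {
  const first = runs[0];
  if (first === undefined) return true;
  return runs.every((r) => r.prompt === first.prompt && r.cwd === NEUTRAL_DIR);
}
```

A check like `isFair` could run before a batch starts, so that an accidental per-agent prompt tweak fails early instead of skewing results.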
- Create `configs/mcp/your-tool.mcp.json` with the MCP server configuration
- Run: `bash scripts/run-batch.sh --agent opencode --mode B --competitor your-tool`
- Results are logged to `results/progress.json`
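
For the first step, a rough sketch of generating such a file is shown here. It assumes the configs follow the `mcpServers` layout used by several MCP clients, which this README does not confirm; compare against an existing file in `configs/mcp/` before relying on it. The package name `your-tool-mcp-server` and the `WP_URL` variable are placeholders.

```typescript
// Assumed shape, based on the common `mcpServers` convention.
// The repository's real schema may differ.
interface McpServerEntry {
  command: string;
  args: string[];
  env?: Record<string, string>;
}

interface McpConfig {
  mcpServers: Record<string, McpServerEntry>;
}

// Serialize a single-server config as pretty-printed JSON.
function buildConfig(toolName: string, entry: McpServerEntry): string {
  const config: McpConfig = { mcpServers: { [toolName]: entry } };
  return JSON.stringify(config, null, 2) + "\n";
}

// Placeholder server package and environment variable names.
const configJson = buildConfig("your-tool", {
  command: "npx",
  args: ["-y", "your-tool-mcp-server"],
  env: { WP_URL: "http://localhost:8080" },
});

// Save the printed text as configs/mcp/your-tool.mcp.json.
console.log(configJson);
```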
- Add the agent's CLI command to `scripts/run-batch.sh` (see the `build_agent_cmd()` function)
- Add config switching support in `scripts/switch-config.sh`
- Run benchmarks with `--agent your-agent`
mode: B
name: a01-create-post
description: Create a new blog post with title, content, and category
category: content
prompt: |
  Create a new WordPress blog post with...
validate:
  - endpoint: /wp/v2/posts?search=...
    contains: "expected text"
    description: "Post was created"

Mode E uses Claude Sonnet 4 as an automated quality judge:
- Blinded: judge doesn't know which agent or competitor produced the result
- Structured output: `tool_use` (function calling) for reliable JSON scores
- Evidence hashing: SHA-256 of captured WordPress state for reproducibility
- 4 weighted dimensions per scenario (e.g., correctness, preservation, quality, completeness)
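
The evidence hashing and weighted scoring described above could look roughly like the following. This is a sketch under assumptions, not the harness's actual code: the canonical-JSON step, the function names, and the example weights are all invented for illustration, and the captured state is assumed to be JSON-compatible.

```typescript
import { createHash } from "node:crypto";

// Serialize with sorted object keys so equal states always hash equally,
// regardless of property insertion order. Assumes JSON-compatible values.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonicalize(obj[k])).join(",") + "}";
  }
  return JSON.stringify(value);
}

// SHA-256 hex digest of the captured state.
function hashEvidence(state: unknown): string {
  return createHash("sha256").update(canonicalize(state), "utf8").digest("hex");
}

// Weighted mean of per-dimension scores; weights need not sum to 1.
function weightedScore(
  scores: Record<string, number>,
  weights: Record<string, number>,
): number {
  let total = 0;
  let weightSum = 0;
  for (const dim of Object.keys(weights)) {
    const weight = weights[dim] ?? 0;
    const score = scores[dim];
    if (score === undefined) throw new Error(`missing score for ${dim}`);
    total += weight * score;
    weightSum += weight;
  }
  if (weightSum <= 0) throw new Error("weights must sum to a positive number");
  return total / weightSum;
}

// Hypothetical weights and judge scores, for illustration only.
const exampleWeights = { correctness: 4, preservation: 2, quality: 2, completeness: 2 };
const exampleScores = { correctness: 5, preservation: 4, quality: 3, completeness: 4 };
console.log(weightedScore(exampleScores, exampleWeights));
console.log(hashEvidence({ title: "Home", status: "publish" }));
```

Hashing a canonical form means two captures of the same WordPress state produce the same digest even if the API returns fields in a different order.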
When citing results from these benchmarks:
- Token claims: "97.2% reduction in initial tool-definition context" (not total session cost)
- Completion claims: Specify which mode and competitor set
- Quality claims: Specify judge model and temperature
- Always disclose: localhost testing, N=1 per scenario (unless N>=3 runs completed)
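
To make the token-claim wording concrete, this sketch computes a percent reduction and formats it with the disclosures listed above. The `ClaimContext` type, the function names, and the token counts are placeholders chosen for the example, not measured results.

```typescript
// Hypothetical structure for the context a cited claim should carry.
interface ClaimContext {
  mode: string;
  competitors: string[];
  runsPerScenario: number;
  environment: string;
}

// Percent reduction of the candidate relative to the baseline. Both values
// are token counts for the initial tool definitions only, not session totals.
function percentReduction(baselineTokens: number, candidateTokens: number): number {
  if (baselineTokens <= 0) throw new Error("baseline must be positive");
  return ((baselineTokens - candidateTokens) / baselineTokens) * 100;
}

// Format a claim that names the mode, competitor set, environment, and N.
function formatTokenClaim(reductionPct: number, ctx: ClaimContext): string {
  return (
    `${reductionPct.toFixed(1)}% reduction in initial tool-definition context ` +
    `(Mode ${ctx.mode}; vs ${ctx.competitors.join(", ")}; ` +
    `${ctx.environment} testing, N=${ctx.runsPerScenario} per scenario)`
  );
}

// Placeholder token counts for illustration only.
const exampleReduction = percentReduction(10000, 2500);
console.log(
  formatTokenClaim(exampleReduction, {
    mode: "A",
    competitors: ["claudeus"],
    runsPerScenario: 1,
    environment: "localhost",
  }),
);
```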
MIT. See LICENSE.
- WP Navigator — the WordPress MCP tool being benchmarked
- WP Navigator GitHub — main repository (private during development)