WP Navigator Benchmark Harness

Open-source benchmark methodology for evaluating AI coding agents on WordPress management tasks via MCP (Model Context Protocol) tools.

This repository contains the complete benchmark harness, scenario definitions, fairness protocol, and execution scripts used to evaluate WP Navigator against competitor MCP tools. It is published for independent verification — anyone can run these benchmarks to validate our claims.

Why Open-Source?

WP Navigator both created and runs these benchmarks, which makes it referee and player at once. Publishing the full methodology addresses that conflict: anyone can inspect the harness, rerun the scenarios, and verify the results independently.

What's Inside

harness/           TypeScript benchmark harness (runner, scoring, LLM judge)
scenarios/         51 YAML scenario definitions (Modes A, B, and E)
configs/mcp/       5 competitor MCP configurations
fixtures/          Deterministic test data and template install scripts
scripts/           Batch runners and orchestration scripts
MODE_B_FAIRNESS.md Fairness protocol and publication checklist
docker-compose.yml WordPress staging environment (Podman/Docker)

Benchmark Modes

Mode	What It Tests	Scenarios	Method
A	Initial tool-definition context (token count)	18	Scripted — no agent needed
B	Real agent task completion	25	Live agent sessions against WordPress
C	Cookbook effectiveness (with/without)	8 variants	Paired comparison
D	Safety features (kill switch, guardrails)	3	WP Navigator-specific
E	Template-based quality (LLM-as-a-Judge)	8	5x5 matrix, automated scoring

Competitors Tested

Competitor	Tool Count	Config
WP Navigator	42 (5 meta + 37 underlying)	`configs/mcp/wpnav.mcp.json`
Claudeus	145	`configs/mcp/claudeus.mcp.json`
Respira	88	`configs/mcp/respira.mcp.json`
InstaWP	36	`configs/mcp/instawp.mcp.json`
Raw REST API	0 (baseline)	`configs/mcp/raw-rest.mcp.json`

Agents Tested (Tier 1)

Claude Code, Codex CLI, Gemini CLI, OpenCode, Pi
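
As a rough illustration of the scale involved, the sketch below enumerates every agent/competitor pairing. It assumes that the "5x5 matrix" listed for Mode E means the five agents above crossed with the five competitor configs; the README does not state this outright. The competitor ids are taken from the config file names, and `buildMatrix` and `MatrixCell` are hypothetical names, not part of the harness.

```typescript
// Agent display names from "Agents Tested (Tier 1)".
const agents: readonly string[] = ["Claude Code", "Codex CLI", "Gemini CLI", "OpenCode", "Pi"];

// Competitor ids inferred from the file names under configs/mcp/.
const competitors: readonly string[] = ["wpnav", "claudeus", "respira", "instawp", "raw-rest"];

// Hypothetical shape for one cell of the run matrix.
interface MatrixCell {
  agent: string;
  competitor: string;
  configPath: string;
}

// Cross every agent with every competitor config.
export function buildMatrix(
  agentNames: readonly string[],
  competitorIds: readonly string[],
): MatrixCell[] {
  const cells: MatrixCell[] = [];
  for (const agent of agentNames) {
    for (const competitor of competitorIds) {
      cells.push({ agent, competitor, configPath: `configs/mcp/${competitor}.mcp.json` });
    }
  }
  return cells;
}

const matrix = buildMatrix(agents, competitors);
console.log(`${matrix.length} agent/competitor pairings`);
```

Under that assumption each scenario is run once per cell, so the cost of a full pass grows with the product of agents and competitors rather than their sum.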

Quick Start

Prerequisites

Node.js 20+
Podman or Docker (for WordPress staging)
AI agent CLI tools (OpenCode, Codex CLI, etc.)
WordPress application password (exported as WPNAV_APP_PASSWORD in step 5)

1. Start WordPress staging

podman-compose up -d
# or: docker-compose up -d
# WordPress should then come up at http://localhost:8080 (allow a moment after the containers start)

2. Install the benchmark harness

cd harness
npm install
npx tsc  # Build TypeScript
cd ..    # Return to the repository root; the remaining steps use root-relative paths

3. Seed test data

bash fixtures/seed-realworld.sh
bash fixtures/templates/install-template.sh

4. Run Mode A (token counting — no agent needed)

node harness/dist/index.js --mode A --competitor all

5. Run Mode B (real agent sessions)

# Set WordPress credentials
export WPNAV_URL=http://localhost:8080
export WPNAV_USERNAME=admin
export WPNAV_APP_PASSWORD=your_app_password

# Run with one agent
bash scripts/run-batch.sh --agent opencode --model openai/gpt-5.2 --mode B

# Run with a specific competitor's MCP tools
bash scripts/run-batch.sh --agent opencode --model openai/gpt-5.2 --mode B --competitor claudeus
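
Before starting a long batch, it can help to confirm that the credentials exported above are accepted. WordPress application passwords are sent as HTTP Basic auth, so a minimal pre-flight check might look like the sketch below. The helper names `basicAuthHeader` and `credentialsWork` are hypothetical and not part of the harness, and the `/wp-json/` path assumes pretty permalinks are enabled on the staging site.

```typescript
import { Buffer } from "node:buffer";

// Build the HTTP Basic Authorization header used with WordPress application passwords.
export function basicAuthHeader(username: string, appPassword: string): string {
  const token = Buffer.from(`${username}:${appPassword}`, "utf8").toString("base64");
  return `Basic ${token}`;
}

// Hypothetical pre-flight check: resolves to true when WordPress accepts the credentials.
// Uses the core REST route /wp/v2/users/me, which requires an authenticated user.
export async function credentialsWork(
  baseUrl: string,
  username: string,
  appPassword: string,
): Promise<boolean> {
  const url = new URL("/wp-json/wp/v2/users/me", baseUrl);
  const res = await fetch(url, {
    headers: { Authorization: basicAuthHeader(username, appPassword) },
  });
  return res.ok;
}
```

A caller would pass the values of WPNAV_URL, WPNAV_USERNAME, and WPNAV_APP_PASSWORD from the environment and abort the batch if the check resolves to false.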

6. Run Mode E (quality scoring with LLM judge)

export ANTHROPIC_API_KEY=your_key  # For the LLM judge

bash scripts/run-batch.sh --agent opencode --model openai/gpt-5.2 --mode E --scenarios f01-homepage-hero-update

Fairness Protocol

See MODE_B_FAIRNESS.md for the full protocol. Key rules:

Neutral directory — agents run from /tmp/wpnav-bench/, not the project directory
Identical prompts — every agent receives the exact prompt from the scenario YAML
Clean state — WordPress reset between scenarios
Report everything — all agents, all results, even unflattering ones
No insider knowledge — agents have no pre-loaded context about tool names

Adding a New Competitor

1. Create configs/mcp/your-tool.mcp.json with the MCP server configuration
2. Run: bash scripts/run-batch.sh --agent opencode --mode B --competitor your-tool
3. Results are logged to results/progress.json

Adding a New Agent

1. Add the agent's CLI command to scripts/run-batch.sh (see build_agent_cmd() function)
2. Add config switching support in scripts/switch-config.sh
3. Run benchmarks with --agent your-agent

Scenario Format (YAML)

mode: B
name: a01-create-post
description: Create a new blog post with title, content, and category
category: content
prompt: |
  Create a new WordPress blog post with...
validate:
  - endpoint: /wp/v2/posts?search=...
    contains: "expected text"
    description: "Post was created"

LLM-as-a-Judge (Mode E)

Mode E uses Claude Sonnet 4 as an automated quality judge:

Blinded — judge doesn't know which agent or competitor produced the result
Structured output — tool_use (function calling) for reliable JSON scores
Evidence hashing — SHA-256 of captured WordPress state for reproducibility
4 weighted dimensions per scenario (e.g., correctness, preservation, quality, completeness)
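
To make the scoring and hashing ideas concrete, the sketch below combines per-dimension scores into a weighted mean and hashes a captured state with SHA-256. It is an illustration under stated assumptions, not the harness's implementation: the weights, the score scale, and the key-sorted JSON serialisation are all assumptions, and the names `weightedScore`, `canonicalJson`, and `evidenceHash` are hypothetical.

```typescript
import { createHash } from "node:crypto";

interface DimensionScore {
  name: string;   // e.g. "correctness"
  weight: number; // relative importance; need not sum to 1
  score: number;  // judge's score on whatever scale the rubric uses
}

// Weighted mean of the dimension scores, normalised by the total weight.
export function weightedScore(dims: DimensionScore[]): number {
  const totalWeight = dims.reduce((sum, d) => sum + d.weight, 0);
  if (totalWeight <= 0) throw new RangeError("weights must sum to a positive number");
  return dims.reduce((sum, d) => sum + d.weight * d.score, 0) / totalWeight;
}

// Serialise JSON-compatible data with sorted object keys, so that two equal
// states always produce the same string regardless of key order.
export function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    return `{${keys.map((k) => `${JSON.stringify(k)}:${canonicalJson(obj[k])}`).join(",")}}`;
  }
  return JSON.stringify(value);
}

// SHA-256 hex digest of the canonical serialisation of a captured state.
export function evidenceHash(state: unknown): string {
  return createHash("sha256").update(canonicalJson(state), "utf8").digest("hex");
}
```

Sorting keys before hashing is one simple way to keep the digest stable across runs; without some canonical form, two captures of the same WordPress state could hash differently purely because of field ordering.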

Claim Scoping

When citing results from these benchmarks:

Token claims: "97.2% reduction in initial tool-definition context" (not total session cost)
Completion claims: Specify which mode and competitor set
Quality claims: Specify judge model and temperature
Always disclose: localhost testing and the number of runs per scenario (N=1 unless N>=3 runs were completed)
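
The token claim is a relative reduction, so its meaning depends on which baseline it is measured against. The sketch below shows the arithmetic and one way to keep the scope attached to the number when quoting it. The token counts in the usage example are invented for illustration and are not benchmark results; `reductionPercent`, `scopedTokenClaim`, and `ClaimContext` are hypothetical names.

```typescript
// Percentage reduction of `candidateTokens` relative to `baselineTokens`.
export function reductionPercent(baselineTokens: number, candidateTokens: number): number {
  if (baselineTokens <= 0) throw new RangeError("baselineTokens must be positive");
  return ((baselineTokens - candidateTokens) / baselineTokens) * 100;
}

// Context that should travel with any quoted figure.
interface ClaimContext {
  mode: string;            // benchmark mode the figure comes from
  environment: string;     // e.g. "localhost"
  runsPerScenario: number; // N
}

// Format a token claim with its scope spelled out.
export function scopedTokenClaim(pct: number, ctx: ClaimContext): string {
  return (
    `${pct.toFixed(1)}% reduction in initial tool-definition context ` +
    `(Mode ${ctx.mode}, ${ctx.environment}, N=${ctx.runsPerScenario} per scenario)`
  );
}

// Usage with invented numbers: 20000 baseline tokens versus 5000 candidate tokens.
const pct = reductionPercent(20000, 5000);
console.log(scopedTokenClaim(pct, { mode: "A", environment: "localhost", runsPerScenario: 1 }));
```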

Licence

MIT. See LICENSE.

Links

WP Navigator — the WordPress MCP tool being benchmarked
WP Navigator GitHub — main repository (private during development)
