littlebearapps/wpnav-benchmarks

WP Navigator Benchmark Harness

Open-source benchmark methodology for evaluating AI coding agents on WordPress management tasks via MCP (Model Context Protocol) tools.

This repository contains the complete benchmark harness, scenario definitions, fairness protocol, and execution scripts used to evaluate WP Navigator against competitor MCP tools. It is published for independent verification — anyone can run these benchmarks to validate our claims.

Why Open-Source?

WP Navigator both created and runs these benchmarks, which makes it referee and player at once. Publishing the full methodology addresses that problem: anyone can inspect the harness, rerun the scenarios, and verify the results independently.

What's Inside

harness/           TypeScript benchmark harness (runner, scoring, LLM judge)
scenarios/         51 YAML scenario definitions (Modes A, B, and E)
configs/mcp/       5 competitor MCP configurations
fixtures/          Deterministic test data and template install scripts
scripts/           Batch runners and orchestration scripts
MODE_B_FAIRNESS.md Fairness protocol and publication checklist
docker-compose.yml WordPress staging environment (Podman/Docker)

Benchmark Modes

| Mode | What It Tests | Scenarios | Method |
| --- | --- | --- | --- |
| A | Initial tool-definition context (token count) | 18 | Scripted (no agent needed) |
| B | Real agent task completion | 25 | Live agent sessions against WordPress |
| C | Cookbook effectiveness (with/without) | 8 variants | Paired comparison |
| D | Safety features (kill switch, guardrails) | 3 | WP Navigator-specific |
| E | Template-based quality (LLM-as-a-Judge) | 8 | 5x5 matrix, automated scoring |

Competitors Tested

| Competitor | Tool Count | Config |
| --- | --- | --- |
| WP Navigator | 42 (5 meta + 37 underlying) | configs/mcp/wpnav.mcp.json |
| Claudeus | 145 | configs/mcp/claudeus.mcp.json |
| Respira | 88 | configs/mcp/respira.mcp.json |
| InstaWP | 36 | configs/mcp/instawp.mcp.json |
| Raw REST API | 0 (baseline) | configs/mcp/raw-rest.mcp.json |

Agents Tested (Tier 1)

Claude Code, Codex CLI, Gemini CLI, OpenCode, Pi

Quick Start

Prerequisites

  • Node.js 20+
  • Podman or Docker (for WordPress staging)
  • AI agent CLI tools (OpenCode, Codex CLI, etc.)
  • WordPress app password

1. Start WordPress staging

podman-compose up -d
# or: docker-compose up -d
# Waits for WordPress at http://localhost:8080

2. Install the benchmark harness

cd harness
npm install
npx tsc  # Build TypeScript
cd ..    # Return to the repository root; the following steps use root-relative paths

3. Seed test data

bash fixtures/seed-realworld.sh
bash fixtures/templates/install-template.sh

4. Run Mode A (token counting — no agent needed)

node harness/dist/index.js --mode A --competitor all

5. Run Mode B (real agent sessions)

# Set WordPress credentials
export WPNAV_URL=http://localhost:8080
export WPNAV_USERNAME=admin
export WPNAV_APP_PASSWORD=your_app_password

# Run with one agent
bash scripts/run-batch.sh --agent opencode --model openai/gpt-5.2 --mode B

# Run with a specific competitor's MCP tools
bash scripts/run-batch.sh --agent opencode --model openai/gpt-5.2 --mode B --competitor claudeus
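
WordPress application passwords are sent with HTTP Basic authentication, so the credentials exported above can be sanity-checked before starting a long batch run. The following is a minimal sketch, not part of the harness: `buildAuthHeader` and `checkCredentials` are hypothetical helper names, and it assumes Node.js 20+ (global `fetch`) and that the site exposes the core `/wp-json/wp/v2/users/me` route at its root.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical pre-flight helpers; not part of the published harness.

// Builds the HTTP Basic header used with WordPress application passwords.
function buildAuthHeader(username: string, appPassword: string): string {
  const token = Buffer.from(`${username}:${appPassword}`, "utf8").toString("base64");
  return `Basic ${token}`;
}

// Requests the core "current user" route and resolves to true when
// WordPress answers with a 2xx status for the supplied credentials.
async function checkCredentials(
  baseUrl: string,
  username: string,
  appPassword: string,
): Promise<boolean> {
  // Assumes WordPress is installed at the root of baseUrl.
  const url = new URL("/wp-json/wp/v2/users/me", baseUrl);
  const response = await fetch(url, {
    headers: { Authorization: buildAuthHeader(username, appPassword) },
  });
  return response.ok;
}
```

A call such as `checkCredentials(process.env.WPNAV_URL!, process.env.WPNAV_USERNAME!, process.env.WPNAV_APP_PASSWORD!)` would then confirm the three variables before any agent session starts.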

6. Run Mode E (quality scoring with LLM judge)

export ANTHROPIC_API_KEY=your_key  # For the LLM judge

bash scripts/run-batch.sh --agent opencode --model openai/gpt-5.2 --mode E --scenarios f01-homepage-hero-update

Fairness Protocol

See MODE_B_FAIRNESS.md for the full protocol. Key rules:

  • Neutral directory — agents run from /tmp/wpnav-bench/, not the project directory
  • Identical prompts — every agent receives the exact prompt from the scenario YAML
  • Clean state — WordPress reset between scenarios
  • Report everything — all agents, all results, even unflattering ones
  • No insider knowledge — agents have no pre-loaded context about tool names

Adding a New Competitor

  1. Create configs/mcp/your-tool.mcp.json with the MCP server configuration
  2. Run: bash scripts/run-batch.sh --agent opencode --mode B --competitor your-tool
  3. Results are logged to results/progress.json
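
As a rough illustration of step 1, the sketch below generates a config file body using the `mcpServers` layout that many MCP clients read. This layout is an assumption: the harness may expect different keys, so compare the output with the existing files in configs/mcp/ before relying on it. The package name `your-tool-mcp-server` and the `WP_URL` variable are placeholders, and `buildConfig` is a hypothetical helper.

```typescript
// Assumed config shape (the common `mcpServers` layout); verify against configs/mcp/*.json.
interface McpServerEntry {
  command: string;               // executable that starts the MCP server
  args: string[];                // arguments passed to the executable
  env?: Record<string, string>;  // environment variables for the server process
}

interface McpConfig {
  mcpServers: Record<string, McpServerEntry>;
}

// Serialises a single-server config as pretty-printed JSON with a trailing newline.
function buildConfig(name: string, entry: McpServerEntry): string {
  const config: McpConfig = { mcpServers: { [name]: entry } };
  return JSON.stringify(config, null, 2) + "\n";
}

// Placeholder values only; replace them with the real server command for your tool.
const yourTool = buildConfig("your-tool", {
  command: "npx",
  args: ["-y", "your-tool-mcp-server"],
  env: { WP_URL: "http://localhost:8080" },
});

console.log(yourTool);
```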

Adding a New Agent

  1. Add the agent's CLI command to scripts/run-batch.sh (see build_agent_cmd() function)
  2. Add config switching support in scripts/switch-config.sh
  3. Run benchmarks with --agent your-agent

Scenario Format (YAML)

mode: B
name: a01-create-post
description: Create a new blog post with title, content, and category
category: content
prompt: |
  Create a new WordPress blog post with...
validate:
  - endpoint: /wp/v2/posts?search=...
    contains: "expected text"
    description: "Post was created"
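
To make the `validate` block concrete, here is a minimal sketch of how such checks could be evaluated. It is not the harness's implementation: the interfaces simply mirror the fields shown above, `evaluateChecks` is a hypothetical function, and it assumes `contains` means a plain substring match on the response body returned for `endpoint`.

```typescript
// Hypothetical types mirroring the YAML fields shown above.
interface ValidationCheck {
  endpoint: string;    // REST route queried after the agent finishes
  contains: string;    // text expected somewhere in the response body
  description: string; // human-readable label for the check
}

interface Scenario {
  mode: string;
  name: string;
  description: string;
  category: string;
  prompt: string;
  validate: ValidationCheck[];
}

interface CheckResult {
  description: string;
  passed: boolean;
}

// Applies each check to an already-fetched response body.
// `bodies` maps an endpoint to the raw response text; a missing body fails the check.
function evaluateChecks(scenario: Scenario, bodies: Map<string, string>): CheckResult[] {
  return scenario.validate.map((check) => {
    const body = bodies.get(check.endpoint);
    return {
      description: check.description,
      passed: body !== undefined && body.includes(check.contains),
    };
  });
}
```

Keeping the fetch step separate from the evaluation step means the same function can be exercised against recorded responses, without a live WordPress instance.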

LLM-as-a-Judge (Mode E)

Mode E uses Claude Sonnet 4 as an automated quality judge:

  • Blinded — judge doesn't know which agent or competitor produced the result
  • Structured output: tool_use (function calling) for reliable JSON scores
  • Evidence hashing — SHA-256 of captured WordPress state for reproducibility
  • 4 weighted dimensions per scenario (e.g., correctness, preservation, quality, completeness)
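
Two of the points above, weighted dimensions and evidence hashing, can be sketched in a few lines. This is an illustration under stated assumptions rather than the harness's code: `weightedScore`, `canonicalJson`, and `evidenceHash` are hypothetical names, the combined score is assumed to be a weighted mean, and the captured WordPress state is assumed to be JSON-compatible data serialised with sorted keys before hashing.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape of one judged dimension; names and weights are illustrative.
interface DimensionScore {
  name: string;
  weight: number; // relative weight, any positive scale
  score: number;  // judge's score for this dimension
}

// Weighted mean of the dimension scores, normalised by the total weight.
function weightedScore(dimensions: DimensionScore[]): number {
  const totalWeight = dimensions.reduce((sum, d) => sum + d.weight, 0);
  if (totalWeight <= 0) throw new Error("weights must sum to a positive number");
  return dimensions.reduce((sum, d) => sum + d.weight * d.score, 0) / totalWeight;
}

// Serialises JSON-compatible data with object keys sorted, so that key order
// in the captured state does not change the resulting hash.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    const entries = Object.keys(record)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalJson(record[key])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// SHA-256 hex digest of the canonical serialisation.
function evidenceHash(state: unknown): string {
  return createHash("sha256").update(canonicalJson(state), "utf8").digest("hex");
}
```

Hashing a canonical form matters for reproducibility: two captures of the same state should produce the same digest even if the REST API returns fields in a different order.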

Claim Scoping

When citing results from these benchmarks:

  • Token claims: "97.2% reduction in initial tool-definition context" (not total session cost)
  • Completion claims: Specify which mode and competitor set
  • Quality claims: Specify judge model and temperature
  • Always disclose: localhost testing, N=1 per scenario (unless N>=3 runs completed)
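
For token claims, a percentage reduction is conventionally computed as (1 − measured / baseline) × 100. The sketch below assumes that convention; `percentReduction` and `formatReduction` are hypothetical helpers, and the numbers in the usage note are placeholders, not benchmark results.

```typescript
// Percentage reduction of a measured value relative to a baseline.
function percentReduction(baseline: number, measured: number): number {
  if (baseline <= 0) throw new Error("baseline must be positive");
  return (1 - measured / baseline) * 100;
}

// Formats the reduction to one decimal place, matching the precision used in claims.
function formatReduction(baseline: number, measured: number): string {
  return `${percentReduction(baseline, measured).toFixed(1)}%`;
}
```

With placeholder inputs, `formatReduction(2000, 500)` yields `"75.0%"`. Whatever the real figures are, the scoping rule above still applies: the percentage describes initial tool-definition context only, not total session cost.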

Licence

MIT. See LICENSE.
