Benchmarking harness for evaluating AI agents. Extracted from ironclaw.
| Suite | Description |
|---|---|
| spot | End-to-end spot checks: conversation, tool use, chaining, robustness |
| custom | Custom JSONL tasks with flexible scoring (exact, contains, regex, LLM) |
| gaia | GAIA benchmark (knowledge and reasoning) |
| tau_bench | Tau-bench (multi-turn tool-calling dialog) |
| swe_bench | SWE-bench Pro (real-world software engineering) |
```shell
# List available suites
nearai-bench list

# Run spot checks with a config
nearai-bench run --suite spot --config suites/spot.toml

# Run with a specific model
nearai-bench run --suite spot --config suites/spot.toml --model openai/gpt-4o

# View latest results
nearai-bench results latest

# Compare two runs
nearai-bench compare <baseline-uuid> <comparison-uuid>
```

```
benchmarks/
  datasets/               Versioned benchmark datasets
    spot/v1/              21 spot-check tasks
    swe-bench-lite/v1/    SWE-bench Lite dataset (astropy subset)
  suites/                 Suite configuration files (TOML)
  baselines/              Curated reference results by suite
  results/                Run output, namespaced by harness
    ironclaw/             Results from the ironclaw harness
  src/                    Harness source code
    adapters/             Suite adapter implementations
```
Datasets live under `datasets/{suite-name}/v{N}/tasks.jsonl`. The versioning scheme lets datasets evolve without invalidating older results that reference a prior version.
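For example, a revised dataset would land as a sibling version directory rather than overwriting the old one (the `v2` path below is hypothetical, shown only to illustrate the scheme):

```
datasets/spot/v1/tasks.jsonl    # frozen; existing results keep referencing v1
datasets/spot/v2/tasks.jsonl    # hypothetical revision; new suite configs point here
```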
To add a new suite:

- Create `datasets/{name}/v1/tasks.jsonl` in the appropriate JSONL format.
- Create `suites/{name}.toml` pointing `suite_config.dataset_path` at the new file.
- If the suite type doesn't exist, implement a new adapter in `src/adapters/`.
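As an illustration, a line in a custom suite's `tasks.jsonl` might look like the sketch below. The field names here are assumptions chosen to match the scoring modes listed above (exact, contains, regex, LLM); consult an existing dataset such as `datasets/spot/v1/tasks.jsonl` for the authoritative schema.

```json
{"id": "demo-001", "prompt": "What is 2 + 2?", "expected": "4", "scoring": "exact"}
```

Each task is one JSON object per line, so datasets can be appended to and diffed easily.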
Results are written to `results/{harness}/{run-uuid}/`, containing:

- `run.json`: aggregate metrics (pass rate, cost, timing, model, harness)
- `tasks.jsonl`: per-task results with scores, traces, and responses
The `harness` field in `run.json` identifies which agent implementation produced the results, allowing multiple harnesses to share the same results directory structure.
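A `run.json` along these lines would carry the aggregate metrics named above. The exact field names below are illustrative assumptions, not the harness's confirmed schema:

```json
{
  "harness": "ironclaw",
  "model": "openai/gpt-4o",
  "pass_rate": 0.81,
  "total_cost_usd": 1.42,
  "duration_s": 640
}
```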
Suite configs are TOML files with this structure:
```toml
task_timeout = "120s"
parallelism = 1

[[matrix]]
label = "default"
# model = "openai/gpt-4o"  # optional model override

[suite_config]
dataset_path = "datasets/spot/v1/tasks.jsonl"
```

MIT OR Apache-2.0