nearai/benchmarks


# nearai-bench

Benchmarking harness for evaluating AI agents. Extracted from ironclaw.

## Available Suites

| Suite | Description |
| --- | --- |
| `spot` | End-to-end spot checks: conversation, tool use, chaining, robustness |
| `custom` | Custom JSONL tasks with flexible scoring (exact, contains, regex, LLM) |
| `gaia` | GAIA benchmark (knowledge and reasoning) |
| `tau_bench` | Tau-bench (multi-turn tool-calling dialog) |
| `swe_bench` | SWE-bench Pro (real-world software engineering) |
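As a rough illustration, a `custom` suite task could pair a prompt with one of the listed scoring modes. The exact field names below are assumptions, not the harness's documented schema:

```json
{"id": "greet-fr-1", "prompt": "Say hello in French.", "scoring": "contains", "expected": "Bonjour"}
```

Each line of the JSONL file would be one such task object.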

## Quick Start

```bash
# List available suites
nearai-bench list

# Run spot checks with a config
nearai-bench run --suite spot --config suites/spot.toml

# Run with a specific model
nearai-bench run --suite spot --config suites/spot.toml --model openai/gpt-4o

# View latest results
nearai-bench results latest

# Compare two runs
nearai-bench compare <baseline-uuid> <comparison-uuid>
```

## Project Structure

```text
benchmarks/
  datasets/            Versioned benchmark datasets
    spot/v1/             21 spot-check tasks
    swe-bench-lite/v1/   SWE-bench Lite dataset (astropy subset)
  suites/              Suite configuration files (TOML)
  baselines/           Curated reference results by suite
  results/             Run output, namespaced by harness
    ironclaw/            Results from the ironclaw harness
  src/                 Harness source code
    adapters/            Suite adapter implementations
```

## Datasets

Datasets live under `datasets/{suite-name}/v{N}/tasks.jsonl`. The versioning scheme lets datasets evolve without invalidating older results that reference a prior version.

### Adding a New Dataset

1. Create `datasets/{name}/v1/tasks.jsonl` in the appropriate JSONL format.
2. Create `suites/{name}.toml` pointing `suite_config.dataset_path` at the new file.
3. If the suite type doesn't exist, implement a new adapter in `src/adapters/`.
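For a hypothetical suite named `mytasks`, steps 1 and 2 might produce a config like the following. Only `suite_config.dataset_path` is documented here; the remaining keys mirror the structure shown in the Configuration section:

```toml
# suites/mytasks.toml — sketch for an assumed suite name "mytasks"
task_timeout = "120s"
parallelism = 1

[[matrix]]
label = "default"

[suite_config]
dataset_path = "datasets/mytasks/v1/tasks.jsonl"
```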

## Results

Results are written to `results/{harness}/{run-uuid}/`, containing:

- `run.json`: aggregate metrics (pass rate, cost, timing, model, harness)
- `tasks.jsonl`: per-task results with scores, traces, and responses

The harness field in run.json identifies which agent implementation produced the results, allowing multiple harnesses to share the same results directory structure.
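A `run.json` might look roughly like this. Only the fields named above (pass rate, cost, timing, model, harness) come from the source; the exact key names and sample values are assumptions for illustration:

```json
{
  "harness": "ironclaw",
  "model": "openai/gpt-4o",
  "pass_rate": 0.81,
  "total_cost_usd": 1.42,
  "duration_seconds": 930
}
```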

## Configuration

Suite configs are TOML files with this structure:

```toml
task_timeout = "120s"
parallelism = 1

[[matrix]]
label = "default"
# model = "openai/gpt-4o"  # optional model override

[suite_config]
dataset_path = "datasets/spot/v1/tasks.jsonl"
```
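Since `[[matrix]]` is a TOML array of tables, it can presumably hold multiple entries; a sketch of a two-model comparison, assuming each matrix entry produces its own run:

```toml
# Assumed: each [[matrix]] entry is executed as a separate run
[[matrix]]
label = "gpt-4o"
model = "openai/gpt-4o"

[[matrix]]
label = "gpt-4o-mini"
model = "openai/gpt-4o-mini"
```

The resulting run UUIDs could then be passed to `nearai-bench compare`.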

## License

MIT OR Apache-2.0
