Add real benchmark data with ReAct agent by alepot55 · Pull Request #10 · alepot55/agentrial

alepot55 · 2026-02-06T17:20:55Z

Summary

ReAct-style agent with 4 deterministic tools, using raw Anthropic API calls
7 test cases (easy → adversarial), 20 trials each on claude-3-haiku-20240307
Results: 94.3% overall (89.1%-97.1% CI), ARS 79.7/100, total cost $0.11
Key finding: simple-math at 70% due to comma formatting variance ("4,446" vs "4446")
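The comma formatting variance noted above ("4,446" vs "4446") can be neutralized by normalizing numeric strings before comparing them. Below is a minimal sketch of one way to do that; `normalize_numeric` and `answers_match` are hypothetical helper names for illustration and are not part of agentrial or of this PR's `agent.py`.

```python
import re


def normalize_numeric(text: str) -> str:
    """Remove thousands separators so "4,446" compares equal to "4446".

    Hypothetical helper: a comma is dropped only when it sits between a
    digit and a group of exactly three digits, so lists such as "1, 2, 3"
    are left untouched.
    """
    return re.sub(r"(?<=\d),(?=\d{3}(?!\d))", "", text)


def answers_match(expected: str, actual: str) -> bool:
    """Compare two answers after trimming whitespace and normalizing commas."""
    return normalize_numeric(expected.strip()) == normalize_numeric(actual.strip())
```

With a check like this, `answers_match("4446", "4,446")` would be true, so an output that differs only in thousands separators would no longer count as a failure. Whether that is the desired behavior depends on whether the test is meant to measure formatting consistency as well as correctness.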

Files

agent.py — ReAct agent implementation (Anthropic + Google support)
agentrial.yml — Test suite with 7 cases
results_haiku.json — Full JSON results (166KB)
flamegraph_haiku.html — Interactive trajectory flame graph
baseline.json — Baseline for regression detection
terminal_pretty.txt / ars_output.txt — Terminal outputs for screenshots

Test plan

Agent tested manually with single queries
Full 20-trial benchmark run completed successfully
ARS score computed: 79.7/100
Baseline saved for future regression detection
No API keys in committed files
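As a sanity check on the reported pass rate, the sketch below computes a Wilson score interval for a binomial proportion. Two assumptions are made here that the PR does not state: that 94.3% corresponds to 132 passes out of 7 × 20 = 140 trials, and that the interval is a 95% Wilson interval. `wilson_interval` is an illustrative function, not agentrial's API.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Return the (low, high) Wilson score interval for a pass rate.

    z = 1.96 corresponds to a 95% confidence level (an assumption; the PR
    does not state the level used).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# Assumed counts: 132 passes out of 140 trials (7 cases x 20 trials).
low, high = wilson_interval(132, 140)
print(f"{132 / 140:.1%} pass rate, interval {low:.1%} to {high:.1%}")
```

Under these assumptions the interval comes out at roughly 89.1% to 97.1%, which is consistent with the figures in the summary.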

🤖 Generated with Claude Code

- ReAct agent with 4 tools (calculator, search_knowledge, unit_converter, date_info)
- Uses raw Anthropic API calls (no framework dependencies)
- 7 test cases from easy to adversarial, 20 trials each
- Results: 94.3% overall pass rate (89.1%-97.1% CI), ARS 79.7/100
- Key finding: simple-math at 70% due to comma formatting variance
- Includes flamegraph HTML, baseline JSON, and terminal output

Model: claude-3-haiku-20240307 | Total cost: $0.11

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

alepot55 merged commit a69e26b into main Feb 6, 2026
2 checks passed

alepot55 merged 1 commit into main from feat/benchmark-data
