
Add real benchmark data with ReAct agent #10

Merged
alepot55 merged 1 commit into main from feat/benchmark-data
Feb 6, 2026

alepot55 (Owner) commented Feb 6, 2026

Summary

  • ReAct-style agent with 4 deterministic tools, using raw Anthropic API calls
  • 7 test cases (easy → adversarial), 20 trials each on claude-3-haiku-20240307
  • Results: 94.3% overall pass rate (89.1%-97.1% CI), ARS 79.7/100, total cost $0.11
  • Key finding: simple-math passes only 70% of trials because of comma formatting variance ("4,446" vs "4446")
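
The comma variance behind the simple-math failures can be illustrated with a small comparison helper. This is only a sketch: `normalize_number` and `answers_match` are hypothetical names, not part of agent.py or agentrial, and it assumes commas appear only as thousands separators.

```python
import re

# Hypothetical helper, not taken from agent.py or agentrial.
# Assumes commas appear only as thousands separators (e.g. "4,446").
_THOUSANDS = re.compile(r"(?<=\d),(?=\d{3}\b)")


def normalize_number(text: str) -> str:
    """Remove thousands-separator commas so "4,446" and "4446" compare equal."""
    return _THOUSANDS.sub("", text)


def answers_match(expected: str, actual: str) -> bool:
    """Return True if the expected string occurs in the output after normalization."""
    return normalize_number(expected) in normalize_number(actual)
```

With this, `answers_match("4446", "The answer is 4,446.")` is true, whereas a plain substring check on the raw strings is false. Whether the fix belongs in the checker or in the agent's prompt is a separate design choice that this PR does not settle.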

Files

  • agent.py — ReAct agent implementation (Anthropic + Google support)
  • agentrial.yml — Test suite with 7 cases
  • results_haiku.json — Full JSON results (166KB)
  • flamegraph_haiku.html — Interactive trajectory flame graph
  • baseline.json — Baseline for regression detection
  • terminal_pretty.txt / ars_output.txt — Terminal outputs for screenshots

Test plan

  • Agent tested manually with single queries
  • Full 20-trial benchmark run completed successfully
  • ARS score computed: 79.7/100
  • Baseline saved for future regression detection
  • No API keys in committed files
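
The headline numbers in the Summary can be sanity-checked by hand. The sketch below assumes the interval is a 95% Wilson score interval over all 7 × 20 = 140 trials and that 132 of them passed. Neither the method nor the pass count is stated in this PR, so both are inferences from the reported 94.3%. `wilson_interval` is a hypothetical helper, not an agentrial function.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = passes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


# Assumed counts: 7 cases x 20 trials = 140, with 132 passes (about 94.3%).
low, high = wilson_interval(132, 140)
print(f"{132 / 140:.1%} ({low:.1%}-{high:.1%})")
```

Under those assumptions the result agrees with the reported 89.1%-97.1% to one decimal place, which suggests, but does not confirm, that a Wilson interval was used.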

🤖 Generated with Claude Code

- ReAct agent with 4 tools (calculator, search_knowledge, unit_converter, date_info)
- Uses raw Anthropic API calls (no framework dependencies)
- 7 test cases from easy to adversarial, 20 trials each
- Results: 94.3% overall pass rate (89.1%-97.1% CI), ARS 79.7/100
- Key finding: simple-math at 70% due to comma formatting variance
- Includes flamegraph HTML, baseline JSON, and terminal output

Model: claude-3-haiku-20240307 | Total cost: $0.11

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
alepot55 merged commit a69e26b into main on Feb 6, 2026
2 checks passed
