Add benchmark workflow and database integration #65
- Introduced a new GitHub Actions workflow for benchmarking, triggered on pushes, pull requests, and scheduled runs.
- Added a SQLite database for storing benchmark history and regression tracking, including functionality for saving and retrieving benchmark runs.
- Implemented scripts for importing existing benchmark results into the new database format.
- Created a CLI for comparing benchmark runs and evaluating performance across multiple repositories.
- Enhanced quality scoring functions for better evaluation of benchmark outputs.

These changes establish a comprehensive benchmarking framework, facilitating regression detection and performance analysis across various repositories.
Pull request overview
This PR introduces a benchmarking + evaluation framework: a GitHub Actions benchmark workflow, a SQLite-backed history store for benchmark runs, and a new SWE-bench Lite evaluation CLI/tooling to run, grade, compare, and report results.
Changes:
- Added a Benchmark GitHub Actions workflow and a baseline JSON artifact for regression tracking.
- Implemented a SQLite benchmark history DB with import + compare CLIs.
- Added a SWE-bench Lite evaluation package (dataset loader, prompt builder, attoswarm adapter, grading, efficiency + reporting, CLI) and extended the eval harness to persist per-instance metadata.
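As a rough illustration of the SQLite-backed history store listed above, the sketch below saves benchmark runs and reads the most recent ones back for comparison. The table name `benchmark_runs`, its columns, and the helpers `open_db`, `save_run`, and `latest_runs` are hypothetical and are not taken from `eval/benchmark_db.py`.

```python
from __future__ import annotations

import json
import sqlite3
import time
from typing import Any


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the history database with a hypothetical schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS benchmark_runs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            repo TEXT NOT NULL,
            commit_sha TEXT NOT NULL,
            created_at REAL NOT NULL,
            metrics_json TEXT NOT NULL
        )
        """
    )
    return conn


def save_run(conn: sqlite3.Connection, repo: str, commit_sha: str,
             metrics: dict[str, Any]) -> int:
    """Persist one run; metrics are stored as a JSON blob for flexibility."""
    cur = conn.execute(
        "INSERT INTO benchmark_runs (repo, commit_sha, created_at, metrics_json) "
        "VALUES (?, ?, ?, ?)",
        (repo, commit_sha, time.time(), json.dumps(metrics)),
    )
    conn.commit()
    return int(cur.lastrowid)


def latest_runs(conn: sqlite3.Connection, repo: str,
                limit: int = 2) -> list[dict[str, Any]]:
    """Return the newest runs for a repo, most recent first."""
    rows = conn.execute(
        "SELECT commit_sha, metrics_json FROM benchmark_runs "
        "WHERE repo = ? ORDER BY id DESC LIMIT ?",
        (repo, limit),
    ).fetchall()
    return [{"commit_sha": sha, "metrics": json.loads(blob)} for sha, blob in rows]
```

Fetching the two newest rows for a repository gives a current run and a baseline to diff, which is the shape a regression check needs.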
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| .github/workflows/benchmark.yml | Adds CI workflow to run benchmarks, comment on PRs, and update baseline on main. |
| eval/benchmark_baseline.json | Adds an initial benchmark baseline snapshot used for regression comparison. |
| eval/benchmark_compare.py | Adds CLI to compare benchmark runs from DB or JSON files. |
| eval/benchmark_db.py | Adds SQLite schema + persistence APIs for benchmark run history. |
| eval/harness.py | Extends results DB retrieval and stores richer per-instance metadata in RunResult. |
| eval/import_benchmark_history.py | Adds migration/import script to ingest existing JSON benchmark results into SQLite. |
| eval/polyglot_bench.py | Adds a separate polyglot benchmark runner/report generator. |
| eval/quality_scorers.py | Extracts reusable deterministic quality scoring functions for benchmark outputs. |
| eval/swebench/__init__.py | Introduces the eval.swebench package. |
| eval/swebench/__main__.py | Adds python -m eval.swebench entrypoint wiring. |
| eval/swebench/adapter.py | Adds an AgentFactory adapter that runs attoswarm as a subprocess for SWE-bench. |
| eval/swebench/cli.py | Adds SWE-bench CLI commands: run, grade, compare, efficiency, leaderboard. |
| eval/swebench/config.py | Adds SWE-bench evaluation config and builder for an attoswarm-compatible config dict. |
| eval/swebench/dataset.py | Adds SWE-bench Lite dataset loaders (JSONL + HuggingFace). |
| eval/swebench/efficiency.py | Adds efficiency metric extraction from run artifacts plus report formatting. |
| eval/swebench/grader.py | Adds local + official SWE-bench grading utilities. |
| eval/swebench/prompt.py | Adds structured goal/custom instruction prompt builders for SWE-bench instances. |
| eval/swebench/report.py | Adds leaderboard/comparison/per-repo report formatting for SWE-bench runs. |

    from dataclasses import dataclass, field
    from typing import Any

field is imported from dataclasses but not used in this file. Removing unused imports helps keep the config builder focused and avoids potential lint noise if this directory is added to ruff checks.

    - name: Run benchmarks
      run: |
        uv run python scripts/benchmark_ci.py \
          --repos attocode \
          --num-runs 3 \
          --no-fail-on-regression \
          --output-json benchmark_results.json \
          --output-comment benchmark_comment.md

The workflow runs python scripts/benchmark_ci.py, but that file does not exist in the repository (only scripts/setup.sh is present). As written, the workflow will fail immediately; either add the missing benchmark script (and ensure it produces benchmark_results.json/benchmark_comment.md) or update the workflow to call the correct existing entrypoint.

        "orchestration": {
            "decomposition": cfg.decomposition_mode,
            "max_tasks": cfg.max_tasks,
            "max_depth": cfg.max_depth,
            "custom_instructions": cfg.custom_instructions,
        },
        "retry": {
            "max_task_attempts": cfg.max_task_attempts,
        },
        "workspace": {
            "mode": cfg.workspace_mode,
        },
    }

build_swarm_yaml_dict emits a retry section, but attoswarm’s config loader reads retries (see src/attoswarm/config/loader.py). This means max_task_attempts from SWEBenchEvalConfig will be silently ignored and defaults will apply. Rename the section to retries (and consider mapping max_concurrent_agents to workspace.max_concurrent_writers so concurrency is applied consistently).
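A minimal sketch of the rename suggested above, assuming the loader reads a top-level `retries` key and a `workspace.max_concurrent_writers` field as the comment describes. `SketchConfig` is a hypothetical stand-in for `SWEBenchEvalConfig`, its default values are made up for illustration, and `build_retry_and_workspace` covers only the two affected sections.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class SketchConfig:
    """Hypothetical stand-in for SWEBenchEvalConfig; defaults are illustrative."""

    max_task_attempts: int = 2
    max_concurrent_agents: int = 4
    workspace_mode: str = "shared"


def build_retry_and_workspace(cfg: SketchConfig) -> dict[str, Any]:
    """Emit the sections under the key names the review comment says are read."""
    return {
        # "retries" (plural), so max_task_attempts is not silently dropped.
        "retries": {"max_task_attempts": cfg.max_task_attempts},
        "workspace": {
            "mode": cfg.workspace_mode,
            # Assumed mapping: reuse the agent concurrency limit for writers.
            "max_concurrent_writers": cfg.max_concurrent_agents,
        },
    }
```

A unit test that asserts the emitted keys against the loader's expected names would catch this class of silent mismatch early.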

    # Try to load state.json from run directory
    state_path = os.path.join(working_dir, ".swarm-run", "state.json")
    state: dict[str, Any] = {}
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)

    return {
        "completed": proc.returncode == 0,
        "summary": output[:5000],
        "tokens_used": state.get("budget", {}).get("tokens_used", 0),
        "cost_usd": state.get("budget", {}).get("cost_usd", 0.0),
        "tasks_completed": state.get("tasks_completed", 0),
        "tasks_total": state.get("tasks_total", 0),
        "tool_calls": 0,
        "phase": state.get("phase", "unknown"),

The adapter tries to load attoswarm state from .swarm-run/state.json, but attoswarm persists swarm.state.json in config.run.run_dir (and the persisted events file is swarm.events.jsonl). With the current path, state will usually stay empty and tokens_used/cost_usd/task counts will report as 0 even on successful runs. Update the filenames/paths to match attoswarm’s run layout.
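One way to handle the layout described above is to read `swarm.state.json` from the configured run directory and fall back to an empty dict when the file is missing or unreadable. The helper name `load_swarm_state` is hypothetical; the file name comes from the review comment and should be confirmed against attoswarm's actual run layout.

```python
from __future__ import annotations

import json
import os
from typing import Any


def load_swarm_state(run_dir: str,
                     filename: str = "swarm.state.json") -> dict[str, Any]:
    """Load the persisted state file, returning {} if it is absent or invalid."""
    path = os.path.join(run_dir, filename)
    try:
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
    except (OSError, json.JSONDecodeError):
        # Missing file, unreadable file, or partially written JSON.
        return {}
    # Guard against a state file whose top level is not an object.
    return data if isinstance(data, dict) else {}
```

Keeping the file name as a parameter makes the lookup easy to adjust if the run layout changes again.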

    def extract_efficiency(run_dir: str) -> EfficiencyMetrics:
        """Extract efficiency metrics from a swarm run directory.

        Expected structure:
            run_dir/
                state.json — final orchestrator state
                events.jsonl — event log
                manifest.json — task manifest
                tasks/ — per-task results
        """
        metrics = EfficiencyMetrics()

        state = _load_json(os.path.join(run_dir, "state.json"))
        if not state:
            return metrics

        metrics.run_id = state.get("run_id", "")

        # Task counts
        tasks = state.get("tasks", {})
        if isinstance(tasks, dict):
            task_list = list(tasks.values())
        elif isinstance(tasks, list):
            task_list = tasks
        else:
            task_list = []

        metrics.total_tasks = len(task_list)
        metrics.completed_tasks = sum(
            1 for t in task_list
            if t.get("status") in ("done", "completed")
        )
        metrics.failed_tasks = sum(
            1 for t in task_list
            if t.get("status") in ("failed", "error", "skipped")
        )

        # Task completion rate
        if metrics.total_tasks > 0:
            metrics.task_completion_rate = metrics.completed_tasks / metrics.total_tasks

        # Budget
        budget = state.get("budget", {})
        metrics.tokens_budgeted = budget.get("max_tokens", 0)
        metrics.tokens_used = budget.get("tokens_used", 0)
        if metrics.tokens_budgeted > 0:
            metrics.budget_accuracy = metrics.tokens_used / metrics.tokens_budgeted

        # Wall time
        metrics.wall_time_seconds = state.get("elapsed_s", 0.0)

        # Max concurrency from DAG summary
        dag = state.get("dag_summary", {})
        metrics.max_concurrency = dag.get("max_parallelism", 1) or 1

        # Process events for parallelism and retry metrics
        events = _load_events(os.path.join(run_dir, "events.jsonl"))
        if events:
            _process_events(metrics, events)

extract_efficiency expects state.json and events.jsonl with fields like type and agent_id, but attoswarm’s persisted files are swarm.state.json and swarm.events.jsonl and use different keys (e.g. event_type, agent_id, task_id, data). As a result this will commonly return all-zero metrics. Align the expected filenames and parsing logic with attoswarm’s default_run_layout() and EventBus JSONL format.
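A sketch of parsing a JSONL event log keyed on `event_type`, as the comment describes. The event names used in the usage example (`task.started`, `task.retried`) are assumptions for illustration only; the comment names the keys `event_type`, `agent_id`, `task_id`, and `data` but not the event vocabulary.

```python
from __future__ import annotations

import json
from collections import Counter
from typing import Any, Iterable


def parse_events(lines: Iterable[str]) -> list[dict[str, Any]]:
    """Parse JSONL lines into event dicts, skipping blank or corrupt lines."""
    events: list[dict[str, Any]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a truncated final line from an interrupted run
        if isinstance(record, dict):
            events.append(record)
    return events


def count_by_type(events: Iterable[dict[str, Any]]) -> Counter:
    """Count events per `event_type`; records without the key count as 'unknown'."""
    return Counter(e.get("event_type", "unknown") for e in events)
```

For example, `count_by_type(parse_events(open(path)))` would give per-type totals from which retry or parallelism metrics could be derived, once the real event names are known.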

    def cmd_efficiency(args: argparse.Namespace) -> None:
        """Analyze swarm efficiency for a run."""
        run_dir = args.run_dir
        if not run_dir:
            # Try to find run directory from default locations
            candidates = [
                os.path.join(".agent", args.run_id),
                os.path.join("/tmp", f"attocode-eval-{args.run_id}"),
            ]
            for c in candidates:
                if os.path.isdir(c):
                    run_dir = c
                    break

        if not run_dir or not os.path.isdir(run_dir):
            print(f"Run directory not found for: {args.run_id}")
            print("Use --run-dir to specify the path")
            sys.exit(1)

cmd_efficiency tries default locations like .agent/<run_id> and /tmp/attocode-eval-<run_id>, but SWE-bench runs via EvalHarness create per-instance working dirs under a temp attocode-eval-* root, and the attoswarm run dir is configured as <instance_dir>/.swarm-run. With the current logic, --run-id is unlikely to resolve to an efficiency report without manually passing --run-dir. Consider resolving the run dirs from ResultsDB.get_run_results(...)[].metadata['working_dir'] and aggregating via extract_efficiency_batch.
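The resolution suggested above could look roughly like this. It assumes each stored result exposes a `metadata` dict with a `working_dir` entry and that the swarm run dir lives at `<working_dir>/.swarm-run`, as the comment states. `SketchResult` and `resolve_run_dirs` are hypothetical names, not the harness API.

```python
from __future__ import annotations

import os
from dataclasses import dataclass, field
from typing import Any, Iterable


@dataclass
class SketchResult:
    """Hypothetical stand-in for a stored per-instance result."""

    instance_id: str
    metadata: dict[str, Any] = field(default_factory=dict)


def resolve_run_dirs(results: Iterable[SketchResult],
                     subdir: str = ".swarm-run") -> list[str]:
    """Collect existing swarm run dirs from each result's recorded working_dir."""
    run_dirs: list[str] = []
    for result in results:
        working_dir = result.metadata.get("working_dir")
        if not working_dir:
            continue  # result predates metadata capture or never started
        candidate = os.path.join(working_dir, subdir)
        if os.path.isdir(candidate):
            run_dirs.append(candidate)
    return run_dirs
```

The returned list could then be passed to a batch extractor such as the `extract_efficiency_batch` mentioned in the comment.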

    total = len(repo_results)
    passed = sum(1 for r in repo_results if r.status == InstanceStatus.PASSED)
    failed = sum(1 for r in repo_results if r.status == InstanceStatus.FAILED)
    rate = passed / total if total > 0 else 0
    lines.append(f"| {repo} | {total} | {passed} | {failed} | {rate:.1%} |")

The per-repo breakdown counts failed only when status == InstanceStatus.FAILED, so instances with ERROR, TIMEOUT, or SKIPPED won’t be reflected in the table (and passed + failed may be less than total). Consider either adding columns for other statuses or treating any non-PASSED status as failed for reporting purposes.
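A small sketch of the second option, treating every non-PASSED status as failed so that the passed and failed columns always sum to the total. `Status` is a hypothetical mirror of `InstanceStatus` with member names taken from the comment, and `repo_row` is an illustrative helper.

```python
from __future__ import annotations

from collections import Counter
from enum import Enum


class Status(Enum):
    """Hypothetical mirror of InstanceStatus; names follow the review comment."""

    PASSED = "passed"
    FAILED = "failed"
    ERROR = "error"
    TIMEOUT = "timeout"
    SKIPPED = "skipped"


def repo_row(repo: str, statuses: list[Status]) -> str:
    """Format one per-repo table row, counting every non-PASSED status as failed."""
    counts = Counter(statuses)
    total = len(statuses)
    passed = counts[Status.PASSED]
    not_passed = total - passed  # FAILED, ERROR, TIMEOUT, and SKIPPED together
    rate = passed / total if total else 0.0
    return f"| {repo} | {total} | {passed} | {not_passed} | {rate:.1%} |"
```

Deriving the failed count by subtraction keeps the row consistent even if new statuses are added later.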

    import json
    import os
    from dataclasses import dataclass, field
    from typing import Any

    from eval.harness import ResultsDB, RunResult, InstanceStatus
    from eval.metrics import compute_metrics, compare_runs, format_comparison, EvalMetrics
    from eval.swebench.efficiency import EfficiencyMetrics

This module has several unused imports (json, os, field, ResultsDB, EvalMetrics), which can obscure dependencies and may fail local linting if eval/ is added to ruff checks later. Please remove unused imports (or use them if intended).
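For a quick local check without a linter, unused imports can be approximated with the standard `ast` module. This is a rough sketch, not a replacement for ruff: the helper `unused_imports` is hypothetical, and it ignores `__all__`, re-exports, and names that appear only inside string annotations.

```python
from __future__ import annotations

import ast


def unused_imports(source: str) -> list[str]:
    """Return imported names that are never referenced as a bare name."""
    tree = ast.parse(source)
    imported: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds the top-level name `a`.
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)
    # Attribute access such as `os.path` still contains a Name node for `os`.
    used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    return sorted(imported - used)
```

Running it over a module's source text returns a sorted list of candidate names to remove, which should still be reviewed by hand.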
- Added optional instance binding in EvalHarness for agent factories requiring full context.
- Updated AttoswarmSWEBenchFactory to support run directory configuration and improved environment variable handling.
- Enhanced efficiency metrics extraction to accommodate new swarm state file structure and improved task counting logic.
- Modified CLI to allow specifying a database for efficiency analysis and improved run directory resolution.
- Refactored report generation to treat non-passed terminal states as failures for clearer reporting.

These changes improve the flexibility and accuracy of the SWE-bench evaluation process, enhancing user experience and reporting clarity.
- Updated exception handling in multiple files to catch `TimeoutError` directly instead of `asyncio.TimeoutError`.
- Enhanced readability by simplifying conditional checks and variable assignments in various functions.
- Improved import organization and removed unnecessary lines to streamline code structure.
- Added type hints for better clarity and maintainability.

These changes enhance code clarity and maintainability while ensuring consistent error handling across the codebase.
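Catching the builtin `TimeoutError` in place of `asyncio.TimeoutError` is equivalent on Python 3.11 and later, where the asyncio name is an alias of the builtin. The sketch below illustrates this with the hypothetical helpers `_slow` and `run_with_timeout`. On older interpreters the two classes are distinct and this handler would not catch an `asyncio.wait_for` timeout, so the minimum supported Python version matters here.

```python
import asyncio
import sys


async def _slow() -> str:
    """Simulate work that takes longer than the caller is willing to wait."""
    await asyncio.sleep(1.0)
    return "finished"


async def run_with_timeout(timeout_s: float) -> str:
    """Await _slow() with a deadline, catching the builtin TimeoutError."""
    try:
        return await asyncio.wait_for(_slow(), timeout=timeout_s)
    except TimeoutError:
        return "timed out"


# True only where asyncio.TimeoutError is the builtin (Python 3.11+).
ALIASED = sys.version_info >= (3, 11) and asyncio.TimeoutError is TimeoutError
```

Checking `ALIASED` (or the project's declared minimum Python version) before relying on this pattern avoids a silent behavior change on older runtimes.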