Add benchmark workflow and database integration by eren23 · Pull Request #65 · eren23/attocode

eren23 · 2026-03-10T13:01:44Z

Introduced a new GitHub Actions workflow for benchmarking, triggered on pushes, pull requests, and scheduled runs.
Added a SQLite database for storing benchmark history and regression tracking, including functionality for saving and retrieving benchmark runs.
Implemented scripts for importing existing benchmark results into the new database format.
Created a CLI for comparing benchmark runs and evaluating performance across multiple repositories.
Enhanced quality scoring functions for better evaluation of benchmark outputs.

These changes establish a comprehensive benchmarking framework, facilitating regression detection and performance analysis across various repositories.
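To make the "save and retrieve benchmark runs" idea concrete, here is a minimal sketch of a SQLite-backed history store. The table name, columns, and function names are assumptions for illustration only; the schema actually defined in `eval/benchmark_db.py` is not shown in this PR page and may differ.

```python
import json
import sqlite3

# Hypothetical schema for illustration; not the schema from eval/benchmark_db.py.
SCHEMA = """
CREATE TABLE IF NOT EXISTS benchmark_runs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    repo TEXT NOT NULL,
    commit_sha TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    metrics_json TEXT NOT NULL
)
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the history database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_run(conn: sqlite3.Connection, repo: str, commit_sha: str, metrics: dict) -> int:
    """Insert one benchmark run and return its row id."""
    cur = conn.execute(
        "INSERT INTO benchmark_runs (repo, commit_sha, metrics_json) VALUES (?, ?, ?)",
        (repo, commit_sha, json.dumps(metrics)),
    )
    conn.commit()
    return cur.lastrowid


def latest_runs(conn: sqlite3.Connection, repo: str, limit: int = 2) -> list[dict]:
    """Return the most recent runs for a repo, newest first."""
    rows = conn.execute(
        "SELECT commit_sha, metrics_json FROM benchmark_runs "
        "WHERE repo = ? ORDER BY id DESC LIMIT ?",
        (repo, limit),
    ).fetchall()
    return [{"commit_sha": sha, **json.loads(metrics)} for sha, metrics in rows]
```

Fetching the two newest runs for a repository is enough to compute a simple before/after delta for regression detection.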

Copilot

Pull request overview

This PR introduces a benchmarking + evaluation framework: a GitHub Actions benchmark workflow, a SQLite-backed history store for benchmark runs, and a new SWE-bench Lite evaluation CLI/tooling to run, grade, compare, and report results.

Changes:

Added a Benchmark GitHub Actions workflow and a baseline JSON artifact for regression tracking.
Implemented a SQLite benchmark history DB with import + compare CLIs.
Added a SWE-bench Lite evaluation package (dataset loader, prompt builder, attoswarm adapter, grading, efficiency + reporting, CLI) and extended the eval harness to persist per-instance metadata.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| `.github/workflows/benchmark.yml` | Adds CI workflow to run benchmarks, comment on PRs, and update baseline on main. |
| `eval/benchmark_baseline.json` | Adds an initial benchmark baseline snapshot used for regression comparison. |
| `eval/benchmark_compare.py` | Adds CLI to compare benchmark runs from DB or JSON files. |
| `eval/benchmark_db.py` | Adds SQLite schema + persistence APIs for benchmark run history. |
| `eval/harness.py` | Extends results DB retrieval and stores richer per-instance metadata in `RunResult`. |
| `eval/import_benchmark_history.py` | Adds migration/import script to ingest existing JSON benchmark results into SQLite. |
| `eval/polyglot_bench.py` | Adds a separate polyglot benchmark runner/report generator. |
| `eval/quality_scorers.py` | Extracts reusable deterministic quality scoring functions for benchmark outputs. |
| `eval/swebench/__init__.py` | Introduces the `eval.swebench` package. |
| `eval/swebench/__main__.py` | Adds `python -m eval.swebench` entrypoint wiring. |
| `eval/swebench/adapter.py` | Adds an `AgentFactory` adapter that runs `attoswarm` as a subprocess for SWE-bench. |
| `eval/swebench/cli.py` | Adds SWE-bench CLI commands: run, grade, compare, efficiency, leaderboard. |
| `eval/swebench/config.py` | Adds SWE-bench evaluation config and builder for an attoswarm-compatible config dict. |
| `eval/swebench/dataset.py` | Adds SWE-bench Lite dataset loaders (JSONL + HuggingFace). |
| `eval/swebench/efficiency.py` | Adds efficiency metric extraction from run artifacts plus report formatting. |
| `eval/swebench/grader.py` | Adds local + official SWE-bench grading utilities. |
| `eval/swebench/prompt.py` | Adds structured goal/custom instruction prompt builders for SWE-bench instances. |
| `eval/swebench/report.py` | Adds leaderboard/comparison/per-repo report formatting for SWE-bench runs. |

Copilot · 2026-03-10T13:10:52Z

+from dataclasses import dataclass, field
+from typing import Any


field is imported from dataclasses but not used in this file. Removing unused imports helps keep the config builder focused and avoids potential lint noise if this directory is added to ruff checks.

Copilot · 2026-03-10T13:10:52Z

+      - name: Run benchmarks
+        run: |
+          uv run python scripts/benchmark_ci.py \
+            --repos attocode \
+            --num-runs 3 \
+            --no-fail-on-regression \
+            --output-json benchmark_results.json \
+            --output-comment benchmark_comment.md


The workflow runs python scripts/benchmark_ci.py, but that file does not exist in the repository (only scripts/setup.sh is present). As written, the workflow will fail immediately; either add the missing benchmark script (and ensure it produces benchmark_results.json/benchmark_comment.md) or update the workflow to call the correct existing entrypoint.

Copilot · 2026-03-10T13:10:53Z

+        "orchestration": {
+            "decomposition": cfg.decomposition_mode,
+            "max_tasks": cfg.max_tasks,
+            "max_depth": cfg.max_depth,
+            "custom_instructions": cfg.custom_instructions,
+        },
+        "retry": {
+            "max_task_attempts": cfg.max_task_attempts,
+        },
+        "workspace": {
+            "mode": cfg.workspace_mode,
+        },
+    }


build_swarm_yaml_dict emits a retry section, but attoswarm’s config loader reads retries (see src/attoswarm/config/loader.py). This means max_task_attempts from SWEBenchEvalConfig will be silently ignored and defaults will apply. Rename the section to retries (and consider mapping max_concurrent_agents to workspace.max_concurrent_writers so concurrency is applied consistently).
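A rough sketch of the suggested fix follows, assuming the loader reads a `retries` section and a `workspace.max_concurrent_writers` key as the comment states. `EvalConfigSketch` is a hypothetical stand-in for `SWEBenchEvalConfig`, its default values are invented, and only the two affected sections are shown.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class EvalConfigSketch:
    """Hypothetical stand-in for SWEBenchEvalConfig; defaults are illustrative."""

    max_task_attempts: int = 2
    workspace_mode: str = "shared"
    max_concurrent_agents: int = 4


def build_swarm_sections(cfg: EvalConfigSketch) -> dict[str, Any]:
    """Build only the sections touched by the review comment."""
    return {
        # Key renamed from "retry" to "retries" so the loader picks it up.
        "retries": {"max_task_attempts": cfg.max_task_attempts},
        "workspace": {
            "mode": cfg.workspace_mode,
            # Assumed mapping suggested in the review, not confirmed by the source.
            "max_concurrent_writers": cfg.max_concurrent_agents,
        },
    }
```

A unit test that asserts on the emitted key names would catch this kind of silent mismatch earlier than a full evaluation run.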

Copilot · 2026-03-10T13:10:53Z

+            # Try to load state.json from run directory
+            state_path = os.path.join(working_dir, ".swarm-run", "state.json")
+            state: dict[str, Any] = {}
+            if os.path.exists(state_path):
+                with open(state_path) as f:
+                    state = json.load(f)
+
+            return {
+                "completed": proc.returncode == 0,
+                "summary": output[:5000],
+                "tokens_used": state.get("budget", {}).get("tokens_used", 0),
+                "cost_usd": state.get("budget", {}).get("cost_usd", 0.0),
+                "tasks_completed": state.get("tasks_completed", 0),
+                "tasks_total": state.get("tasks_total", 0),
+                "tool_calls": 0,
+                "phase": state.get("phase", "unknown"),


The adapter tries to load attoswarm state from .swarm-run/state.json, but attoswarm persists swarm.state.json in config.run.run_dir (and the persisted events file is swarm.events.jsonl). With the current path, state will usually stay empty and tokens_used/cost_usd/task counts will report as 0 even on successful runs. Update the filenames/paths to match attoswarm’s run layout.
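One way to make the state lookup tolerant of the naming difference is to try the reported filename first and fall back to the old one. The helper name below is hypothetical, and `swarm.state.json` is taken from the review comment rather than verified against attoswarm.

```python
import json
import os
from typing import Any


def load_swarm_state(run_dir: str) -> dict[str, Any]:
    """Return the parsed state dict, or {} if no readable state file exists."""
    # "swarm.state.json" is the name given in the review; "state.json" is kept
    # as a fallback for the layout the adapter originally assumed.
    for name in ("swarm.state.json", "state.json"):
        path = os.path.join(run_dir, name)
        if not os.path.isfile(path):
            continue
        try:
            with open(path, encoding="utf-8") as f:
                data = json.load(f)
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or partially written file; try the next name
        if isinstance(data, dict):
            return data
    return {}
```

Returning an empty dict keeps the existing `state.get("budget", {})` call sites working unchanged when nothing is found.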

Copilot · 2026-03-10T13:10:53Z

+def extract_efficiency(run_dir: str) -> EfficiencyMetrics:
+    """Extract efficiency metrics from a swarm run directory.
+
+    Expected structure:
+        run_dir/
+            state.json          — final orchestrator state
+            events.jsonl        — event log
+            manifest.json       — task manifest
+            tasks/              — per-task results
+    """
+    metrics = EfficiencyMetrics()
+
+    state = _load_json(os.path.join(run_dir, "state.json"))
+    if not state:
+        return metrics
+
+    metrics.run_id = state.get("run_id", "")
+
+    # Task counts
+    tasks = state.get("tasks", {})
+    if isinstance(tasks, dict):
+        task_list = list(tasks.values())
+    elif isinstance(tasks, list):
+        task_list = tasks
+    else:
+        task_list = []
+
+    metrics.total_tasks = len(task_list)
+    metrics.completed_tasks = sum(
+        1 for t in task_list
+        if t.get("status") in ("done", "completed")
+    )
+    metrics.failed_tasks = sum(
+        1 for t in task_list
+        if t.get("status") in ("failed", "error", "skipped")
+    )
+
+    # Task completion rate
+    if metrics.total_tasks > 0:
+        metrics.task_completion_rate = metrics.completed_tasks / metrics.total_tasks
+
+    # Budget
+    budget = state.get("budget", {})
+    metrics.tokens_budgeted = budget.get("max_tokens", 0)
+    metrics.tokens_used = budget.get("tokens_used", 0)
+    if metrics.tokens_budgeted > 0:
+        metrics.budget_accuracy = metrics.tokens_used / metrics.tokens_budgeted
+
+    # Wall time
+    metrics.wall_time_seconds = state.get("elapsed_s", 0.0)
+
+    # Max concurrency from DAG summary
+    dag = state.get("dag_summary", {})
+    metrics.max_concurrency = dag.get("max_parallelism", 1) or 1
+
+    # Process events for parallelism and retry metrics
+    events = _load_events(os.path.join(run_dir, "events.jsonl"))
+    if events:
+        _process_events(metrics, events)
+


extract_efficiency expects state.json and events.jsonl with fields like type and agent_id, but attoswarm’s persisted files are swarm.state.json and swarm.events.jsonl and use different keys (e.g. event_type, agent_id, task_id, data). As a result this will commonly return all-zero metrics. Align the expected filenames and parsing logic with attoswarm’s default_run_layout() and EventBus JSONL format.
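The event-parsing side could be sketched as below, assuming each JSONL line is an object keyed by `event_type` as the comment describes. The function names and the event type strings used in the usage note are hypothetical; the fallback to `type` only preserves compatibility with the originally assumed format.

```python
import json
from collections import Counter
from typing import Any, Iterable


def parse_events(lines: Iterable[str]) -> list[dict[str, Any]]:
    """Parse JSONL lines into event dicts, skipping blank or corrupt lines."""
    events: list[dict[str, Any]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a truncated final line from an interrupted run
        if isinstance(record, dict):
            events.append(record)
    return events


def count_event_types(events: Iterable[dict[str, Any]]) -> Counter:
    """Count events by type, preferring "event_type" over the legacy "type" key."""
    return Counter(
        e.get("event_type") or e.get("type") or "unknown" for e in events
    )
```

For example, `count_event_types(parse_events(open(path)))` gives a quick sanity check that the file is being read with the right keys before any metrics are derived from it.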

Copilot · 2026-03-10T13:10:53Z

+def cmd_efficiency(args: argparse.Namespace) -> None:
+    """Analyze swarm efficiency for a run."""
+    run_dir = args.run_dir
+    if not run_dir:
+        # Try to find run directory from default locations
+        candidates = [
+            os.path.join(".agent", args.run_id),
+            os.path.join("/tmp", f"attocode-eval-{args.run_id}"),
+        ]
+        for c in candidates:
+            if os.path.isdir(c):
+                run_dir = c
+                break
+
+    if not run_dir or not os.path.isdir(run_dir):
+        print(f"Run directory not found for: {args.run_id}")
+        print("Use --run-dir to specify the path")
+        sys.exit(1)
+


cmd_efficiency tries default locations like .agent/<run_id> and /tmp/attocode-eval-<run_id>, but SWE-bench runs via EvalHarness create per-instance working dirs under a temp attocode-eval-* root, and the attoswarm run dir is configured as <instance_dir>/.swarm-run. With the current logic, --run-id is unlikely to resolve to an efficiency report without manually passing --run-dir. Consider resolving the run dirs from ResultsDB.get_run_results(...)[].metadata['working_dir'] and aggregating via extract_efficiency_batch.
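The resolution step suggested here might look like the following sketch. It assumes each result exposes a `metadata` mapping with a `working_dir` entry, represented as plain dicts for illustration; the real `RunResult` shape and the `ResultsDB` API are not shown in this PR page, and the function name is hypothetical.

```python
import os
from typing import Any, Iterable, Mapping


def resolve_swarm_run_dirs(
    results: Iterable[Mapping[str, Any]], subdir: str = ".swarm-run"
) -> list[str]:
    """Collect existing <working_dir>/<subdir> paths from per-instance results."""
    run_dirs: list[str] = []
    for result in results:
        metadata = result.get("metadata") or {}
        working_dir = metadata.get("working_dir")
        if not working_dir:
            continue  # instance recorded without a working directory
        candidate = os.path.join(working_dir, subdir)
        if os.path.isdir(candidate):
            run_dirs.append(candidate)
    return run_dirs
```

The resulting list could then be passed to a batch extractor such as the `extract_efficiency_batch` function mentioned above.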

Copilot · 2026-03-10T13:10:54Z

+        total = len(repo_results)
+        passed = sum(1 for r in repo_results if r.status == InstanceStatus.PASSED)
+        failed = sum(1 for r in repo_results if r.status == InstanceStatus.FAILED)
+        rate = passed / total if total > 0 else 0
+        lines.append(f"| {repo} | {total} | {passed} | {failed} | {rate:.1%} |")


The per-repo breakdown counts failed only when status == InstanceStatus.FAILED, so instances with ERROR, TIMEOUT, or SKIPPED won’t be reflected in the table (and passed + failed may be less than total). Consider either adding columns for other statuses or treating any non-PASSED status as failed for reporting purposes.
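A minimal sketch of the second option, treating every non-PASSED status as failed so that passed + failed always equals total. `Status` is a hypothetical stand-in for `InstanceStatus`, with member names taken from the comment.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical stand-in for InstanceStatus."""

    PASSED = "passed"
    FAILED = "failed"
    ERROR = "error"
    TIMEOUT = "timeout"
    SKIPPED = "skipped"


def repo_row(repo: str, statuses: list[Status]) -> str:
    """Format one table row, counting anything that did not pass as failed."""
    total = len(statuses)
    passed = sum(1 for s in statuses if s is Status.PASSED)
    failed = total - passed  # includes ERROR, TIMEOUT, and SKIPPED
    rate = passed / total if total else 0.0
    return f"| {repo} | {total} | {passed} | {failed} | {rate:.1%} |"
```

Deriving `failed` from `total - passed` rather than counting a fixed set of statuses means a newly added status cannot fall out of the table unnoticed.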

Copilot · 2026-03-10T13:10:54Z

+import json
+import os
+from dataclasses import dataclass, field
+from typing import Any
+
+from eval.harness import ResultsDB, RunResult, InstanceStatus
+from eval.metrics import compute_metrics, compare_runs, format_comparison, EvalMetrics
+from eval.swebench.efficiency import EfficiencyMetrics


This module has several unused imports (json, os, field, ResultsDB, EvalMetrics), which can obscure dependencies and may fail local linting if eval/ is added to ruff checks later. Please remove unused imports (or use them if intended).

- Added optional instance binding in EvalHarness for agent factories requiring full context.
- Updated AttoswarmSWEBenchFactory to support run directory configuration and improved environment variable handling.
- Enhanced efficiency metrics extraction to accommodate new swarm state file structure and improved task counting logic.
- Modified CLI to allow specifying a database for efficiency analysis and improved run directory resolution.
- Refactored report generation to treat non-passed terminal states as failures for clearer reporting.

These changes improve the flexibility and accuracy of the SWE-bench evaluation process, enhancing user experience and reporting clarity.

- Updated exception handling in multiple files to catch `TimeoutError` directly instead of `asyncio.TimeoutError`.
- Enhanced readability by simplifying conditional checks and variable assignments in various functions.
- Improved import organization and removed unnecessary lines to streamline code structure.
- Added type hints for better clarity and maintainability.

These changes enhance code clarity and maintainability while ensuring consistent error handling across the codebase.

eren23 requested a review from Copilot March 10, 2026 13:03

Copilot started reviewing on behalf of eren23 March 10, 2026 13:03

eren23 added 3 commits March 10, 2026 20:03

disable ci, to be fixed and delivered later

4bc0dcd

eren23 merged commit 5cf32e2 into main Mar 10, 2026
1 check passed

eren23 merged 4 commits into main from feat/update-59