Add benchmark workflow and database integration #65

Merged: eren23 merged 4 commits into main from feat/update-59 on Mar 10, 2026
Conversation

eren23 (Owner) commented on Mar 10, 2026

  • Introduced a new GitHub Actions workflow for benchmarking, triggered on pushes, pull requests, and scheduled runs.
  • Added a SQLite database for storing benchmark history and regression tracking, including functionality for saving and retrieving benchmark runs.
  • Implemented scripts for importing existing benchmark results into the new database format.
  • Created a CLI for comparing benchmark runs and evaluating performance across multiple repositories.
  • Enhanced quality scoring functions for better evaluation of benchmark outputs.

These changes establish a comprehensive benchmarking framework, facilitating regression detection and performance analysis across various repositories.
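
As a rough illustration of the history-plus-regression idea described above, here is a minimal sketch using only the standard-library `sqlite3` module. The table name, columns, the 10% tolerance, and every function name are hypothetical; this is not the schema or API of `eval/benchmark_db.py`, and it assumes a metric where higher values are worse (for example, latency).

```python
import sqlite3

# Hypothetical schema: one row per (repo, commit, metric) measurement.
SCHEMA = """
CREATE TABLE IF NOT EXISTS benchmark_runs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    repo TEXT NOT NULL,
    commit_sha TEXT NOT NULL,
    metric TEXT NOT NULL,
    value REAL NOT NULL
)
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the history database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_run(conn: sqlite3.Connection, repo: str, commit_sha: str,
             metric: str, value: float) -> None:
    """Append one measurement to the history."""
    conn.execute(
        "INSERT INTO benchmark_runs (repo, commit_sha, metric, value) "
        "VALUES (?, ?, ?, ?)",
        (repo, commit_sha, metric, value),
    )
    conn.commit()


def latest_two(conn: sqlite3.Connection, repo: str, metric: str) -> list[float]:
    """Return the newest and the previous value, newest first."""
    rows = conn.execute(
        "SELECT value FROM benchmark_runs WHERE repo = ? AND metric = ? "
        "ORDER BY id DESC LIMIT 2",
        (repo, metric),
    ).fetchall()
    return [row[0] for row in rows]


def is_regression(conn: sqlite3.Connection, repo: str, metric: str,
                  tolerance: float = 0.10) -> bool:
    """True when the newest value exceeds the previous one by more than `tolerance`."""
    values = latest_two(conn, repo, metric)
    if len(values) < 2:
        return False  # not enough history to compare
    newest, previous = values
    return newest > previous * (1 + tolerance)
```

With an in-memory database, saving two runs for the same repo and metric and then calling `is_regression` compares the newest run against the one before it.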

Copilot AI left a comment

Pull request overview

This PR introduces a benchmarking + evaluation framework: a GitHub Actions benchmark workflow, a SQLite-backed history store for benchmark runs, and a new SWE-bench Lite evaluation CLI/tooling to run, grade, compare, and report results.

Changes:

  • Added a Benchmark GitHub Actions workflow and a baseline JSON artifact for regression tracking.
  • Implemented a SQLite benchmark history DB with import + compare CLIs.
  • Added a SWE-bench Lite evaluation package (dataset loader, prompt builder, attoswarm adapter, grading, efficiency + reporting, CLI) and extended the eval harness to persist per-instance metadata.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| `.github/workflows/benchmark.yml` | Adds CI workflow to run benchmarks, comment on PRs, and update baseline on main. |
| `eval/benchmark_baseline.json` | Adds an initial benchmark baseline snapshot used for regression comparison. |
| `eval/benchmark_compare.py` | Adds CLI to compare benchmark runs from DB or JSON files. |
| `eval/benchmark_db.py` | Adds SQLite schema + persistence APIs for benchmark run history. |
| `eval/harness.py` | Extends results DB retrieval and stores richer per-instance metadata in RunResult. |
| `eval/import_benchmark_history.py` | Adds migration/import script to ingest existing JSON benchmark results into SQLite. |
| `eval/polyglot_bench.py` | Adds a separate polyglot benchmark runner/report generator. |
| `eval/quality_scorers.py` | Extracts reusable deterministic quality scoring functions for benchmark outputs. |
| `eval/swebench/__init__.py` | Introduces the eval.swebench package. |
| `eval/swebench/__main__.py` | Adds python -m eval.swebench entrypoint wiring. |
| `eval/swebench/adapter.py` | Adds an AgentFactory adapter that runs attoswarm as a subprocess for SWE-bench. |
| `eval/swebench/cli.py` | Adds SWE-bench CLI commands: run, grade, compare, efficiency, leaderboard. |
| `eval/swebench/config.py` | Adds SWE-bench evaluation config and builder for an attoswarm-compatible config dict. |
| `eval/swebench/dataset.py` | Adds SWE-bench Lite dataset loaders (JSONL + HuggingFace). |
| `eval/swebench/efficiency.py` | Adds efficiency metric extraction from run artifacts plus report formatting. |
| `eval/swebench/grader.py` | Adds local + official SWE-bench grading utilities. |
| `eval/swebench/prompt.py` | Adds structured goal/custom instruction prompt builders for SWE-bench instances. |
| `eval/swebench/report.py` | Adds leaderboard/comparison/per-repo report formatting for SWE-bench runs. |

Comment thread eval/swebench/config.py Outdated
Comment on lines +12 to +13
from dataclasses import dataclass, field
from typing import Any

Copilot AI Mar 10, 2026

field is imported from dataclasses but not used in this file. Removing unused imports helps keep the config builder focused and avoids potential lint noise if this directory is added to ruff checks.

Copilot uses AI. Check for mistakes.
Comment thread .github/workflows/benchmark.yml
Comment on lines +32 to +39
      - name: Run benchmarks
        run: |
          uv run python scripts/benchmark_ci.py \
            --repos attocode \
            --num-runs 3 \
            --no-fail-on-regression \
            --output-json benchmark_results.json \
            --output-comment benchmark_comment.md

Copilot AI Mar 10, 2026

The workflow runs python scripts/benchmark_ci.py, but that file does not exist in the repository (only scripts/setup.sh is present). As written, the workflow will fail immediately; either add the missing benchmark script (and ensure it produces benchmark_results.json/benchmark_comment.md) or update the workflow to call the correct existing entrypoint.

Comment thread eval/swebench/config.py
Comment on lines +76 to +88
        "orchestration": {
            "decomposition": cfg.decomposition_mode,
            "max_tasks": cfg.max_tasks,
            "max_depth": cfg.max_depth,
            "custom_instructions": cfg.custom_instructions,
        },
        "retry": {
            "max_task_attempts": cfg.max_task_attempts,
        },
        "workspace": {
            "mode": cfg.workspace_mode,
        },
    }

Copilot AI Mar 10, 2026

build_swarm_yaml_dict emits a retry section, but attoswarm’s config loader reads retries (see src/attoswarm/config/loader.py). This means max_task_attempts from SWEBenchEvalConfig will be silently ignored and defaults will apply. Rename the section to retries (and consider mapping max_concurrent_agents to workspace.max_concurrent_writers so concurrency is applied consistently).
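
A minimal sketch of the suggested rename, assuming the loader really does read a `retries` section and a `workspace.max_concurrent_writers` key as stated above. `SketchEvalConfig` and `build_sections` are hypothetical stand-ins for `SWEBenchEvalConfig` and `build_swarm_yaml_dict`, and the default values are invented for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class SketchEvalConfig:
    """Hypothetical stand-in for SWEBenchEvalConfig (only the fields used here)."""

    max_task_attempts: int = 2
    max_concurrent_agents: int = 4
    workspace_mode: str = "shared"


def build_sections(cfg: SketchEvalConfig) -> dict[str, Any]:
    """Build the retry and workspace sections with the keys named in the review."""
    return {
        # "retries" (plural), so max_task_attempts is no longer silently dropped.
        "retries": {
            "max_task_attempts": cfg.max_task_attempts,
        },
        "workspace": {
            "mode": cfg.workspace_mode,
            # Assumed mapping suggested in the review comment.
            "max_concurrent_writers": cfg.max_concurrent_agents,
        },
    }
```

A unit test that asserts on the emitted key names would catch a silent mismatch like this one before it reaches the loader.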

Comment thread eval/swebench/adapter.py Outdated
Comment on lines +161 to +176
        # Try to load state.json from run directory
        state_path = os.path.join(working_dir, ".swarm-run", "state.json")
        state: dict[str, Any] = {}
        if os.path.exists(state_path):
            with open(state_path) as f:
                state = json.load(f)

        return {
            "completed": proc.returncode == 0,
            "summary": output[:5000],
            "tokens_used": state.get("budget", {}).get("tokens_used", 0),
            "cost_usd": state.get("budget", {}).get("cost_usd", 0.0),
            "tasks_completed": state.get("tasks_completed", 0),
            "tasks_total": state.get("tasks_total", 0),
            "tool_calls": 0,
            "phase": state.get("phase", "unknown"),

Copilot AI Mar 10, 2026

The adapter tries to load attoswarm state from .swarm-run/state.json, but attoswarm persists swarm.state.json in config.run.run_dir (and the persisted events file is swarm.events.jsonl). With the current path, state will usually stay empty and tokens_used/cost_usd/task counts will report as 0 even on successful runs. Update the filenames/paths to match attoswarm’s run layout.
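
One way the lookup could be adjusted, as a sketch only: the `swarm.state.json` filename comes from the comment above, while the `budget` keys are copied from the quoted adapter code and have not been checked against attoswarm's actual state format. The helper names are hypothetical.

```python
import json
import os
from typing import Any

# Filename taken from the review comment; treat it as an assumption.
STATE_FILENAME = "swarm.state.json"


def load_swarm_state(run_dir: str) -> dict[str, Any]:
    """Load the persisted swarm state from run_dir, or {} if it is missing."""
    state_path = os.path.join(run_dir, STATE_FILENAME)
    if not os.path.exists(state_path):
        return {}
    with open(state_path, encoding="utf-8") as f:
        data = json.load(f)
    # Guard against a state file that is valid JSON but not an object.
    return data if isinstance(data, dict) else {}


def summarize_budget(state: dict[str, Any]) -> dict[str, Any]:
    """Extract the budget figures the adapter reports, defaulting to zero."""
    budget = state.get("budget", {})
    return {
        "tokens_used": budget.get("tokens_used", 0),
        "cost_usd": budget.get("cost_usd", 0.0),
    }
```

Keeping the filename in one constant means the adapter and the efficiency extractor cannot drift apart on the run layout.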

Comment thread eval/swebench/efficiency.py
Comment on lines +53 to +112
def extract_efficiency(run_dir: str) -> EfficiencyMetrics:
    """Extract efficiency metrics from a swarm run directory.

    Expected structure:
        run_dir/
            state.json — final orchestrator state
            events.jsonl — event log
            manifest.json — task manifest
            tasks/ — per-task results
    """
    metrics = EfficiencyMetrics()

    state = _load_json(os.path.join(run_dir, "state.json"))
    if not state:
        return metrics

    metrics.run_id = state.get("run_id", "")

    # Task counts
    tasks = state.get("tasks", {})
    if isinstance(tasks, dict):
        task_list = list(tasks.values())
    elif isinstance(tasks, list):
        task_list = tasks
    else:
        task_list = []

    metrics.total_tasks = len(task_list)
    metrics.completed_tasks = sum(
        1 for t in task_list
        if t.get("status") in ("done", "completed")
    )
    metrics.failed_tasks = sum(
        1 for t in task_list
        if t.get("status") in ("failed", "error", "skipped")
    )

    # Task completion rate
    if metrics.total_tasks > 0:
        metrics.task_completion_rate = metrics.completed_tasks / metrics.total_tasks

    # Budget
    budget = state.get("budget", {})
    metrics.tokens_budgeted = budget.get("max_tokens", 0)
    metrics.tokens_used = budget.get("tokens_used", 0)
    if metrics.tokens_budgeted > 0:
        metrics.budget_accuracy = metrics.tokens_used / metrics.tokens_budgeted

    # Wall time
    metrics.wall_time_seconds = state.get("elapsed_s", 0.0)

    # Max concurrency from DAG summary
    dag = state.get("dag_summary", {})
    metrics.max_concurrency = dag.get("max_parallelism", 1) or 1

    # Process events for parallelism and retry metrics
    events = _load_events(os.path.join(run_dir, "events.jsonl"))
    if events:
        _process_events(metrics, events)

Copilot AI Mar 10, 2026

extract_efficiency expects state.json and events.jsonl with fields like type and agent_id, but attoswarm’s persisted files are swarm.state.json and swarm.events.jsonl and use different keys (e.g. event_type, agent_id, task_id, data). As a result this will commonly return all-zero metrics. Align the expected filenames and parsing logic with attoswarm’s default_run_layout() and EventBus JSONL format.
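
A sketch of how the event parsing could tolerate both key styles, assuming each line of `swarm.events.jsonl` is a JSON object with the `event_type`, `agent_id`, `task_id`, and `data` keys listed above. The function names and the event type strings used in any example input are hypothetical; the real EventBus format should be checked before relying on this.

```python
from __future__ import annotations

import json
from typing import Any, Iterable


def parse_event_line(line: str) -> dict[str, Any] | None:
    """Parse one JSONL line into a normalized event, or None if unusable."""
    line = line.strip()
    if not line:
        return None
    try:
        raw = json.loads(line)
    except json.JSONDecodeError:
        return None  # skip corrupt or partially written lines
    if not isinstance(raw, dict):
        return None
    return {
        # Prefer the key named in the review, fall back to the older one.
        "type": raw.get("event_type") or raw.get("type", ""),
        "agent_id": raw.get("agent_id", ""),
        "task_id": raw.get("task_id", ""),
        "data": raw.get("data") or {},
    }


def count_event_types(lines: Iterable[str]) -> dict[str, int]:
    """Count normalized event types across an iterable of JSONL lines."""
    counts: dict[str, int] = {}
    for line in lines:
        event = parse_event_line(line)
        if event is None:
            continue
        counts[event["type"]] = counts.get(event["type"], 0) + 1
    return counts
```

Normalizing at the parsing boundary keeps the metric logic in `_process_events` independent of which key style a given run directory uses.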

Comment thread eval/swebench/cli.py
Comment on lines +248 to +266
def cmd_efficiency(args: argparse.Namespace) -> None:
    """Analyze swarm efficiency for a run."""
    run_dir = args.run_dir
    if not run_dir:
        # Try to find run directory from default locations
        candidates = [
            os.path.join(".agent", args.run_id),
            os.path.join("/tmp", f"attocode-eval-{args.run_id}"),
        ]
        for c in candidates:
            if os.path.isdir(c):
                run_dir = c
                break

    if not run_dir or not os.path.isdir(run_dir):
        print(f"Run directory not found for: {args.run_id}")
        print("Use --run-dir to specify the path")
        sys.exit(1)

Copilot AI Mar 10, 2026

cmd_efficiency tries default locations like .agent/<run_id> and /tmp/attocode-eval-<run_id>, but SWE-bench runs via EvalHarness create per-instance working dirs under a temp attocode-eval-* root, and the attoswarm run dir is configured as <instance_dir>/.swarm-run. With the current logic, --run-id is unlikely to resolve to an efficiency report without manually passing --run-dir. Consider resolving the run dirs from ResultsDB.get_run_results(...)[].metadata['working_dir'] and aggregating via extract_efficiency_batch.
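
The resolution step suggested above might look roughly like this. `SketchRunResult` is a hypothetical stand-in for the harness's `RunResult`; the `metadata['working_dir']` key and the `.swarm-run` subdirectory are taken from this comment and are assumptions about the real layout.

```python
import os
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SketchRunResult:
    """Hypothetical stand-in for eval.harness.RunResult."""

    instance_id: str
    metadata: dict[str, Any] = field(default_factory=dict)


def resolve_swarm_run_dirs(results: list[SketchRunResult]) -> list[str]:
    """Collect <working_dir>/.swarm-run for every result that has one on disk."""
    run_dirs: list[str] = []
    for result in results:
        working_dir = result.metadata.get("working_dir")
        if not working_dir:
            continue  # older rows may lack the metadata entirely
        candidate = os.path.join(working_dir, ".swarm-run")
        if os.path.isdir(candidate):
            run_dirs.append(candidate)
    return run_dirs
```

The resulting list could then be handed to a batch extractor such as `extract_efficiency_batch`, with `--run-dir` kept as an explicit override.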

Comment thread eval/swebench/report.py
Comment on lines +160 to +164
total = len(repo_results)
passed = sum(1 for r in repo_results if r.status == InstanceStatus.PASSED)
failed = sum(1 for r in repo_results if r.status == InstanceStatus.FAILED)
rate = passed / total if total > 0 else 0
lines.append(f"| {repo} | {total} | {passed} | {failed} | {rate:.1%} |")

Copilot AI Mar 10, 2026

The per-repo breakdown counts failed only when status == InstanceStatus.FAILED, so instances with ERROR, TIMEOUT, or SKIPPED won’t be reflected in the table (and passed + failed may be less than total). Consider either adding columns for other statuses or treating any non-PASSED status as failed for reporting purposes.
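
A small sketch of the second option (treat every non-PASSED status as failed), so that passed + failed always equals total. `SketchStatus` is a hypothetical enum mirroring the status names mentioned above; the member values are invented.

```python
from enum import Enum


class SketchStatus(Enum):
    """Hypothetical mirror of InstanceStatus, using the names from the review."""

    PASSED = "passed"
    FAILED = "failed"
    ERROR = "error"
    TIMEOUT = "timeout"
    SKIPPED = "skipped"


def repo_row(repo: str, statuses: list[SketchStatus]) -> str:
    """Format one per-repo table row, counting every non-PASSED status as failed."""
    total = len(statuses)
    passed = sum(1 for s in statuses if s is SketchStatus.PASSED)
    failed = total - passed  # ERROR, TIMEOUT, and SKIPPED are included here
    rate = passed / total if total > 0 else 0.0
    return f"| {repo} | {total} | {passed} | {failed} | {rate:.1%} |"
```

For one pass, one failure, one error, and one timeout, `repo_row("django", ...)` yields `| django | 4 | 1 | 3 | 25.0% |`.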

Comment thread eval/swebench/report.py Outdated
Comment on lines +9 to +16
import json
import os
from dataclasses import dataclass, field
from typing import Any

from eval.harness import ResultsDB, RunResult, InstanceStatus
from eval.metrics import compute_metrics, compare_runs, format_comparison, EvalMetrics
from eval.swebench.efficiency import EfficiencyMetrics

Copilot AI Mar 10, 2026

This module has several unused imports (json, os, field, ResultsDB, EvalMetrics), which can obscure dependencies and may fail local linting if eval/ is added to ruff checks later. Please remove unused imports (or use them if intended).

eren23 added 3 commits on March 10, 2026, 20:03

- Added optional instance binding in EvalHarness for agent factories requiring full context.
- Updated AttoswarmSWEBenchFactory to support run directory configuration and improved environment variable handling.
- Enhanced efficiency metrics extraction to accommodate new swarm state file structure and improved task counting logic.
- Modified CLI to allow specifying a database for efficiency analysis and improved run directory resolution.
- Refactored report generation to treat non-passed terminal states as failures for clearer reporting.

These changes improve the flexibility and accuracy of the SWE-bench evaluation process, enhancing user experience and reporting clarity.

- Updated exception handling in multiple files to catch `TimeoutError` directly instead of `asyncio.TimeoutError`.
- Enhanced readability by simplifying conditional checks and variable assignments in various functions.
- Improved import organization and removed unnecessary lines to streamline code structure.
- Added type hints for better clarity and maintainability.

These changes enhance code clarity and maintainability while ensuring consistent error handling across the codebase.
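
For context on the `TimeoutError` change: on Python 3.11 and later, `asyncio.TimeoutError` is an alias of the built-in `TimeoutError`, so catching the built-in covers `asyncio.wait_for` timeouts. The sketch below assumes such a Python version (the version this repository targets is not stated here), and the coroutine names are hypothetical.

```python
import asyncio
import sys


async def slow_operation() -> str:
    """Hypothetical coroutine that takes a little while to finish."""
    await asyncio.sleep(0.2)
    return "done"


async def run_with_timeout(timeout_s: float) -> str:
    """Run slow_operation, catching the built-in TimeoutError on expiry."""
    try:
        return await asyncio.wait_for(slow_operation(), timeout=timeout_s)
    except TimeoutError:
        # On Python >= 3.11 this also matches asyncio.TimeoutError.
        return "timed out"


# True on Python 3.11+, where the asyncio name is an alias of the built-in.
ALIASED = asyncio.TimeoutError is TimeoutError
```

On interpreters older than 3.11 the two exception classes are distinct, so the built-in `except TimeoutError` would not catch an `asyncio.wait_for` timeout there.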
eren23 merged commit 5cf32e2 into main on Mar 10, 2026
1 check passed