- Overview
- Motivation
- Architecture
- Installation
- Quick Start
- Core Concepts
- Code Examples
- Configuration File Reference
- Documentation
- Contributing
- Citing
DikeBenchmarker is a Python framework for conducting sustainable solver competitions. Instead of evaluating every solver on every benchmark instance—a combinatorial bottleneck in large-scale competitions—DikeBenchmarker intelligently selects which instances to run and when to stop, while preserving statistical confidence in the final solver ranking.
The framework is designed around the following principles:
- Modularity: All components (instance selectors, stopping criteria, data adaptors, infrastructure runners) are interchangeable via clean abstract interfaces.
- Sustainability: Dramatically reduce total CPU-hours consumed by solver competitions without sacrificing ranking accuracy.
- Statistical rigor: Stopping decisions are grounded in non-parametric statistical tests (Wilcoxon signed-rank) and information-theoretic criteria.
- Scalability: Native support for both local multi-process execution and distributed SLURM cluster execution via Parsl.
The project is developed by the ADA Research Group and targets the SAT competition community, though the architecture generalises to any solver benchmarking domain.
Modern solver competitions (e.g., the SAT Competition) evaluate dozens of solvers across thousands of benchmark instances, consuming millions of CPU-hours. The key observation motivating DikeBenchmarker is:
> Not all instances are equally informative. Many instances are trivially solved by every solver, while a small subset of discriminating instances carries the majority of the information needed to rank solvers.
DikeBenchmarker exploits historical performance data to:
- Rank instances by their expected discriminating power (discrimination-based or variance-based selection).
- Adapt stopping once sufficient statistical confidence in the ranking has been achieved (Wilcoxon test, minimum accuracy, or a fixed percentage).
- Reduce cost by skipping redundant evaluations without distorting the competition ranking.
The result is a benchmarking loop that runs only as many solver–instance pairs as are necessary to produce a reliable ranking, potentially reducing competition cost by an order of magnitude.
Job lifecycle:

```
CREATED ──► SUBMITTED ──► RUNNING ──► FINISHED
   │                         └──────► FAILED
   └────────────────────────────────► CANCELLED
```
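The lifecycle can be read as a small state machine. The sketch below is only an illustration of the transitions in the diagram above, not the framework's actual implementation:

```python
from enum import Enum, auto

class JobState(Enum):
    """Illustrative job states mirroring the lifecycle diagram above."""
    CREATED = auto()
    SUBMITTED = auto()
    RUNNING = auto()
    FINISHED = auto()
    FAILED = auto()
    CANCELLED = auto()

# Legal transitions in this reading of the diagram: a job fails only
# from RUNNING, and is cancelled before it has been submitted.
TRANSITIONS = {
    JobState.CREATED:   {JobState.SUBMITTED, JobState.CANCELLED},
    JobState.SUBMITTED: {JobState.RUNNING},
    JobState.RUNNING:   {JobState.FINISHED, JobState.FAILED},
}

def advance(state: JobState, new_state: JobState) -> JobState:
    """Move a job to new_state, rejecting transitions the diagram forbids."""
    if new_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state.name} -> {new_state.name}")
    return new_state
```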
The repository ships a bundled SQLite database containing historical SAT competition performance data. To restore it:

1. Clone the repository with submodules:

   ```bash
   git clone --recurse-submodules https://github.com/ADA-research/DikeBenchmarker.git
   cd DikeBenchmarker
   ```

2. Ensure `sqlite3` is available on your system:

   ```bash
   sqlite3 --version
   ```

3. Run the setup script to configure git filters and restore the database:

   ```bash
   cd src/DikeBenchmarker/data/db
   ./setup.sh
   cd ../../../..
   ```

4. The database will be automatically restored from the SQL dump.

5. Install the package (using uv or pip):

   ```bash
   # with uv (recommended)
   uv sync

   # or with pip
   pip install -e .
   ```
If you prefer to work from your own data files or from data originating from GBD, submodules are not required:

```bash
git clone https://github.com/ADA-research/DikeBenchmarker.git
cd DikeBenchmarker
pip install -e .
```

For a development setup, clone with submodules and install the dev dependency group:

```bash
git clone --recurse-submodules https://github.com/ADA-research/DikeBenchmarker.git
cd DikeBenchmarker
uv sync --group dev
```

This installs additional dependencies for testing (pytest, pytest-dependency) and linting (ruff).
The minimal example below benchmarks a new solver against a SAT competition field using random instance selection and a Wilcoxon-based stopping criterion:
```python
from DikeBenchmarker.benchmarkadaptors.satinstance import SATInstanceAdaptor
from DikeBenchmarker.benchmarkingmethods.benchmarker import Benchmarker
from DikeBenchmarker.benchmarkingmethods.instance_selectors.random_instance_selector import RandomInstanceSelector
from DikeBenchmarker.benchmarkingmethods.stopping_criterion.wilcoxon_stopping_criterion import WilcoxonStoppingCriterion
from DikeBenchmarker.dataadaptors.sqlite_dataadaptor import SqlDataAdaptor
from DikeBenchmarker.infrastructureadaptors.local_runner import LocalRunner
from DikeBenchmarker.solveradaptors.solveradaptor import SolverAdaptor

competition = "main2024"
solver_id = "my_solver_id"
confidence = 0.95

# Load prior data from an existing database (here: a past competition)
adap = SqlDataAdaptor("my_db.db")
solvers_challenged = adap.get_competition_solver_id(competition)
benchmark_ids = adap.get_competition_instance_id(competition)

# Create the benchmarking method
benchmarker = Benchmarker(
    RandomInstanceSelector(benchmark_ids, solver_id, seed=1),
    WilcoxonStoppingCriterion(benchmark_ids, solver_id,
                              solvers_challenged, confidence, adap),
    benchmark_ids,
    solver_id)

# Load solver and instance files locally
solver_adaptor = SolverAdaptor()
solver_adaptor.read_registry("./solvers/sat/")
instance_adaptor = SATInstanceAdaptor("./instances/sat/",
                                      "./instances/cnf_data.db")

# Create a runner so that the jobs are actually executed
runner = LocalRunner(solver_adaptor, instance_adaptor, parallel=1)
runner.run([benchmarker])
```

An `InstanceSelector` determines the order in which benchmark instances are presented to the solver. All selectors implement the `InstanceSelector` abstract base class and expose:
- `next_benchmark_id() → str` — return the next instance hash to run
- `handle_result(result: Result) → None` — update internal state after a result arrives
| Selector | Key Idea | When to Use |
|---|---|---|
| `TrivialInstanceSelector` | Sequential, all instances | Baseline / allpairs mode |
| `RandomInstanceSelector` | Randomly shuffled (optional seed) | Unbiased baselines |
| `DiscriminationInstanceSelector` | Prioritises instances with high discrimination score ρ | Historical data available |
| `VarianceInstanceSelector` | Prioritises instances with high variance-to-mean runtime ratio | Historical data available |
Discrimination score for instance i:

```
score(i) = |{s : perf(s,i) ≥ ρ · best_perf(i)}| / mean_perf(i)
```

Lower scores surface the instances that are hardest and most differentiating across solvers.

Variance score for instance i:

```
score(i) = Var{perf(s,i) : s ∈ S} / mean{perf(s,i) : s ∈ S}
```

Instances on which solvers disagree most (high coefficient of variation) are prioritised.
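As a rough illustration of these two formulas (not the framework's internal code), both scores can be computed from a dictionary `perf` mapping `(solver_id, instance_hash)` to a performance value. The default `rho` and the assumption that higher performance values are better are illustrative choices:

```python
from statistics import fmean, pvariance

def discrimination_score(perf, solvers, instance, rho=0.9):
    """score(i) = |{s : perf(s,i) >= rho * best_perf(i)}| / mean_perf(i).

    Assumes perf(s, i) is a value where higher is better, so best_perf(i)
    is the maximum over all solvers. Lower scores mark more discriminating
    instances.
    """
    values = [perf[(s, instance)] for s in solvers]
    best = max(values)
    near_best = sum(1 for v in values if v >= rho * best)
    return near_best / fmean(values)

def variance_score(perf, solvers, instance):
    """score(i) = Var / mean of perf(s,i) over all solvers s.

    Higher scores mark instances on which the solvers disagree most.
    """
    values = [perf[(s, instance)] for s in solvers]
    return pvariance(values) / fmean(values)
```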
A StoppingCriteria object decides when to halt the benchmarking loop. Criteria implement:
- `should_stop() → bool`
- `handle_result(result: Result) → None`
- `job_submitted(job: Job) → None`
Criteria can be composed with boolean operators:
```python
# Stop when EITHER criterion is satisfied
combined = WilcoxonStoppingCriterion(...) | PercentageStoppingCriterion(benchmark_ids, 0.5)

# Stop only when BOTH are satisfied
combined = MinimumAccuracyStoppingCriterion(...) & PercentageStoppingCriterion(benchmark_ids, 0.1)
```

| Criterion | Behaviour | Threshold Meaning |
|---|---|---|
| `NoStoppingCriteria` | Never stops; runs all selected instances | — |
| `PercentageStoppingCriterion` | Stops after p% of instances are evaluated | e.g., 0.20 → 20% |
| `MinimumAccuracyStoppingCriterion` | Stops when ranking accuracy vs. ground truth exceeds threshold | e.g., 0.95 → 95% |
| `WilcoxonStoppingCriterion` | Uses a two-sided Wilcoxon signed-rank test to confirm statistical separation | e.g., 0.95 → confidence level |
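For intuition, the core of the Wilcoxon-based decision is a paired signed-rank test over the runtimes two solvers achieved on the same instances. The sketch below uses `scipy.stats.wilcoxon` and simplifies whatever decision logic `WilcoxonStoppingCriterion` actually applies:

```python
from scipy.stats import wilcoxon

def solvers_separated(runtimes_a, runtimes_b, confidence=0.95):
    """Two-sided Wilcoxon signed-rank test on paired per-instance runtimes.

    Returns True when the test rejects 'no difference between the solvers'
    at the requested confidence level.
    """
    result = wilcoxon(runtimes_a, runtimes_b)
    return result.pvalue < 1 - confidence

# Runtimes (seconds) of two solvers on the same eight instances
a = [12.3, 450.0, 3.1, 99.0, 870.2, 41.5, 5.9, 300.4]
b = [30.1, 460.5, 8.2, 150.7, 900.9, 80.3, 14.6, 320.0]
print(solvers_separated(a, b))
```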
Data adaptors provide historical solver performance data used by benchmarking methods.
| Adaptor | Description |
|---|---|
| `SqlDataAdaptor` | Queries the bundled SQLite database (recommended) |
| `CSVDataAdaptor` | Reads performance data from CSV files |
| `InMemoryDataAdaptor` | In-memory dict-based adaptor for testing and simulation |
| `MergeDataAdaptor` | Merges results from multiple adaptors |
| `CompetitionDataAdaptor` | Filters a parent adaptor to a specific competition scope |
Fetching the bundled database path:
```python
import importlib.resources

db_path = str(
    importlib.resources.files("DikeBenchmarker.data.db")
    .joinpath("sustainablecompetition.db")
)
```

Key `SqlDataAdaptor` methods:

```python
adap = SqlDataAdaptor(db_path)

# Retrieve all solver IDs that participated in a given competition
solvers = adap.get_competition_solver_id("main2024")

# Retrieve all instance hashes used in a competition
instances = adap.get_competition_instance_id("main2024")

# Query the raw performance table (returns a Polars DataFrame)
df = adap.get_performances(instance_hashes=instances, solver_ids=solvers)

# Bulk-load all (instance, solver) → runtime pairs at once (fast)
cost_lookup = adap.bulk_load_performances(instances, solvers)
```

Runners orchestrate the submission, monitoring, and result collection of benchmark jobs.
| Runner | Backend | Use Case |
|---|---|---|
| `LocalRunner` | `ProcessPoolExecutor` | Development, small experiments |
| `ParslRunner` | Parsl (SLURM / cloud) | Large-scale cluster competitions |
| `VirtualRunner` | In-memory simulation | Testing, offline analysis |
`LocalRunner`:

```python
from DikeBenchmarker.infrastructureadaptors.local_runner import LocalRunner

runner = LocalRunner(solver_adaptor, instance_adaptor, parallel=8)
runner.run([benchmarker])
```

`ParslRunner` (SLURM): configured via the YAML `scheduling` block (see Configuration).
`VirtualRunner` replays historical performance data without executing any process:

```python
from DikeBenchmarker.infrastructureadaptors.virtual_runner import VirtualRunner
from DikeBenchmarker.dataadaptors.inmemory_dataadaptor import InMemoryDataAdaptor

adaptor = InMemoryDataAdaptor(cost_lookup)
runner = VirtualRunner(adaptor)
runner.run([benchmarker])
```

Result consumers process completed `Result` objects asynchronously in a dedicated daemon thread, decoupled from job submission.
```python
from DikeBenchmarker.resultconsumers.csv_consumer import CSVConsumer
from DikeBenchmarker.resultconsumers.lambda_consumer import LambdaConsumer

# Write every result to a CSV file
csv_consumer = CSVConsumer("./results/output.csv")
benchmarker.register_consumer(csv_consumer)

# Or apply a custom callback
lambda_consumer = LambdaConsumer(lambda r: print(r.get_job(), r.has_failed()))
benchmarker.register_consumer(lambda_consumer)

# Chain multiple consumers
benchmarker.register_consumer(csv_consumer.then(lambda_consumer))
```

DikeBenchmarker can be driven via a YAML configuration file (used by `dike.py`). Full documentation is available at `docs/source/configuration.rst`.
```yaml
# paths are relative to the config file directory
benchmarks:
  file: ./instances.csv                    # CSV with a 'hash' column per row
  selection_method: discrimination-based  # allpairs | random | discrimination-based | variance-based
  stopping_criterion: wilcoxon             # none | minimum-accuracy | percentage | wilcoxon
  stopping_threshold: 0.95                 # meaning depends on criterion (see the criteria table above)
solvers: ./solvers.csv                     # semicolon-delimited solver registry
results: ./results/                        # output directory
solver_cputime: 5000                       # CPU time limit for solver (seconds)
solver_walltime: 7000                      # Wall time limit for solver (seconds)
solver_memory: 65536                       # Memory limit for solver (KB)
checker_cputime: 45000
checker_walltime: 70000
checker_memory: 65536
scheduling:
  scheduler: slurm                         # slurm | local
  # SLURM options
  machine: cpu-partition
  account: myaccount
  tasks_per_node: 32
  jobname: benchmark
  workerinit: "module load gcc/12"
  queuelimit: 100
  # Local options (when scheduler: local)
  # parallel: 8
```
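`dike.py` consumes this file directly; for quick inspection or scripting, the same structure can also be loaded yourself, for example with PyYAML (an assumption here — the framework may parse the file differently):

```python
import yaml  # PyYAML

with open("config.yaml") as fh:
    cfg = yaml.safe_load(fh)

# Keys mirror the reference block above
print(cfg["benchmarks"]["selection_method"])   # discrimination-based
print(cfg["benchmarks"]["stopping_threshold"])  # 0.95
print(cfg["scheduling"]["scheduler"])           # slurm
```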
Solver registry format (`solvers.csv`):

```
id;bin;fmt;checker
kissat-sc2025;./solvers/kissat;$BIN0 "$INST" "$CERT";gratbin
cadical-sc2025;./solvers/cadical;$BIN0 "$INST" "$CERT";gratbin
```
Placeholders in `fmt` (expanded as in the example below):

- `$BIN0`, `$BIN1`, … — paths to solver binaries (in registry order)
- `$INST` — path to the benchmark instance file
- `$CERT` — path for proof/certificate output
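For illustration, with the first registry row above, the `fmt` string `$BIN0 "$INST" "$CERT"` would expand to a command along these lines (both instance and certificate paths are made up for the example):

```bash
./solvers/kissat "./instances/sat/00d5a43a481477fa4d56a2ce152a6cfb.cnf" \
                 "./results/00d5a43a481477fa4d56a2ce152a6cfb.proof"
```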
Benchmark instance CSV format (`instances.csv`):

```
hash
00d5a43a481477fa4d56a2ce152a6cfb
00fd8ac9acd186a7a78a2c4d92f90de1
0205e2dffaef93a90c239df31755f2e1
```
Full API documentation is built with Sphinx and hosted on GitHub Pages.
To build the docs locally:
```bash
cd docs
make html
# Open docs/build/html/index.html in your browser
```

The documentation covers:
- Full API reference for all public classes and methods
- Configuration file schema
- Data adaptor interface specification
- Infrastructure runner extension guide
Contributions are welcome. Please follow the workflow below.
Open an issue on GitHub Issues and include:
- A minimal reproducible example
- Python version (`python --version`)
- Package version (`pip show DikeBenchmarker`)
- Full traceback if applicable
```bash
# 1. Fork and clone
git clone --recurse-submodules https://github.com/<your-fork>/DikeBenchmarker.git
cd DikeBenchmarker

# 2. Create a feature branch
git checkout -b feature/my-new-selector

# 3. Install in editable mode with dev dependencies
uv sync --group dev

# 4. Restore the database
cd src/DikeBenchmarker/data/db && ./setup.sh && cd ../../../..

# 5. Make your changes, then run the test suite
pytest tests/

# 6. Check code style
ruff check src/ tests/

# 7. Open a pull request against main
```

The framework is designed for extensibility. Common extension points:
Custom instance selector:
```python
from DikeBenchmarker.benchmarkingmethods.instance_selectors.instance_selector import InstanceSelector
from DikeBenchmarker.benchmarkatoms import Result

class MySelector(InstanceSelector):
    def next_benchmark_id(self) -> str:
        # return the hash of the next instance to evaluate
        ...

    def handle_result(self, result: Result) -> None:
        # update selector state after a result arrives
        ...
```
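Filling in that skeleton, a complete (hypothetical) selector that serves instances in a caller-supplied priority order might look as follows; note that the real `InstanceSelector` base class may expect additional constructor arguments, as `RandomInstanceSelector(benchmark_ids, solver_id, seed=1)` in the Quick Start suggests:

```python
from DikeBenchmarker.benchmarkingmethods.instance_selectors.instance_selector import InstanceSelector
from DikeBenchmarker.benchmarkatoms import Result

class PriorityListSelector(InstanceSelector):
    """Serves instance hashes in a fixed, caller-supplied priority order."""

    def __init__(self, ordered_ids: list[str]):
        # Hypothetical: the base class may also require benchmark_ids/solver_id
        self._queue = list(ordered_ids)

    def next_benchmark_id(self) -> str:
        # Hand out the highest-priority instance that has not been run yet
        return self._queue.pop(0)

    def handle_result(self, result: Result) -> None:
        # The order is fixed up front, so no state needs updating
        pass
```

Custom stopping criterion: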
```python
from DikeBenchmarker.benchmarkingmethods.stopping_criterion.stopping_criterion import StoppingCriteria
from DikeBenchmarker.benchmarkatoms import Result, Job

class MyStoppingCriterion(StoppingCriteria):
    def should_stop(self) -> bool:
        ...

    def handle_result(self, result: Result) -> None:
        ...

    def job_submitted(self, job: Job) -> None:
        ...
```
Custom result consumer:

```python
from DikeBenchmarker.resultconsumers.abstract_consumer import AbstractConsumer
from DikeBenchmarker.benchmarkatoms import Result

class MyConsumer(AbstractConsumer):
    def consume_result(self, result: Result) -> None:
        ...
```

All new components should include unit tests under `tests/` following the existing conventions.
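For example, a minimal pytest sketch for the hypothetical `PriorityListSelector` above (file location and import path are assumptions):

```python
# tests/test_priority_list_selector.py (hypothetical)
from priority_list_selector import PriorityListSelector  # wherever you placed it

def test_serves_instances_in_priority_order():
    selector = PriorityListSelector(["hash_a", "hash_b", "hash_c"])
    assert selector.next_benchmark_id() == "hash_a"
    assert selector.next_benchmark_id() == "hash_b"
    assert selector.next_benchmark_id() == "hash_c"
```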
The project uses Ruff for linting and formatting, with Google-style docstrings and a maximum line length of 160 characters. Run `ruff check src/ tests/` before opening a pull request.
If you use DikeBenchmarker in your research, please cite the following:
```bibtex
@software{iser2026dike,
  title  = {{DIKE}: Data-Informed Knowledge-Driven Evaluation for Sustainable Solver Competitions},
  author = {Iser, Ashlin and Anastacio, Marie and Matricon, Th\'{e}o},
  year   = {2026},
  url    = {https://github.com/ADA-research/DikeBenchmarker},
}
```

This project is licensed under the terms of the LICENSE file included in this repository.
Developed by the ADA Research Group — Ashlin Iser, Marie Anastacio, Théo Matricon.