kadubon/split-inference-bench

split-inference-bench

Fixed-budget multi-agent inference benchmark harness for Takahashi (2026), with matched strong single-agent baselines, local context ceilings, and topology-sensitive diagnostics. It runs locally with Ollama gemma3:1b on CPU to test when split inference helps, hurts, or fails under fair budgets; the TeX source in papers/when_should_inference_be_split.tex remains the canonical specification.

The default execution target is local-only and CPU-feasible:

  • local Ollama only
  • default model gemma3:1b
  • default base URL http://localhost:11434/api
  • small synthetic pilots first
  • no cloud provider assumptions
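For orientation, a minimal non-streaming call against that default endpoint can be sketched as below. The actual adapter lives in src/llm/ollama_client.py; the helper names here are illustrative, not the repo's API.

```python
import json
from urllib import request

OLLAMA_BASE_URL = "http://localhost:11434/api"  # default base URL from this README
DEFAULT_MODEL = "gemma3:1b"                     # default local model

def build_generate_payload(prompt, model=DEFAULT_MODEL):
    """Build the JSON body for Ollama's /generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt, base_url=OLLAMA_BASE_URL, model=DEFAULT_MODEL, timeout=120):
    """POST the prompt to a locally running Ollama server and return the text."""
    body = json.dumps(build_generate_payload(prompt, model)).encode("utf-8")
    req = request.Request(f"{base_url}/generate", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```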

What This Repo Tests

The first public version focuses on low-compute, falsifiable pilot experiments that preserve the paper's distinctions:

  • q_x(A) and u_x^{dep}(A) are reported separately.
  • Structural claims, tractable-surrogate claims, proxy-certified diagnostics, and design rules are tagged separately.
  • Metrics are tagged as observable, proxy-certifiable, or latent-unidentified.
  • Matched-budget fairness is enforced against a strong single-workspace baseline.
  • The local context ceiling W_max is separate from the additive priced budget R_add = (C, M, L, V, H).

The runnable pilots currently emphasize synthetic interaction tasks, because these let the harness probe decomposition, topology sensitivity, relay loss, shared-memory recovery, replication failure, verification bottlenecks, and tool-vs-no-tool effects under exact instrumentation. Lightweight math-style and code-style adapters are included as scaffolds for tiny pilot subsets, not as official benchmark claims.

The local Ollama adapter is implemented in src/llm/ollama_client.py. The upgraded live path now uses structure-dependent prompts, a deterministic local calculator, finer-grained verification effort, task-level failure labels, and machine-generated task diff reports.
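The repo's deterministic calculator is not reproduced here; one conventional way to build such a tool, shown purely as an assumed sketch, is a whitelisted AST evaluator that accepts plain arithmetic and rejects everything else:

```python
import ast
import operator

# Whitelisted operators for a deterministic, side-effect-free calculator.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def calc(expr: str):
    """Evaluate a plain arithmetic expression without eval();
    raises ValueError on any non-arithmetic node."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError(f"unsupported expression: {expr!r}")
    return ev(ast.parse(expr, mode="eval"))
```

Determinism matters for the harness: unlike a model's mental arithmetic, the same expression always yields the same answer, so arithmetic-sensitive failures can be attributed cleanly.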

Current Live Snapshot

The upgraded live pilots were re-run on March 10, 2026 with outputs under:

  • results/runs/20260310_170449_live_synthetic_v3
  • results/runs/20260310_170449_live_verification_v3
  • results/runs/20260310_170449_live_topology_v3
  • results/runs/20260310_170449_live_tool_v1
  • results/runs/20260310_170449_live_engineer_v2

High-level takeaways from those runs:

  • strong single-workspace remained highly competitive and tied or led in the main synthetic and topology pilots,
  • balanced heterogeneous verification beat low, medium-low, and high verification on the current verification-heavy live pack,
  • explicit calculator access was decisive on arithmetic-sensitive live tasks,
  • the current shared-memory live implementation failed badly on the interaction-heavy topology pilot, and repeated stale-reuse and prompt-bloat labels now make that failure inspectable,
  • the recalibrated engineer-style structured tasks are now partially solvable, but only in a narrow heterogeneous regime.

See result_summary.md for the live-only interpretation.

What This Repo Does Not Yet Test

  • Real frontier-model provider integrations.
  • Full empirical validation of the paper's latent structural quantities on open benchmarks.
  • Universal theorems from synthetic pilots.
  • Official GSM8K, MBPP, or HumanEval asset redistribution.
  • Large or expensive gemma3:1b sweeps.
  • Any cloud-only orchestration workflow.

Repo Layout

Fairness Rules

Every comparative experiment is intended to hold fixed:

  • the same priced additive budget,
  • the same local context ceiling,
  • the same tool access,
  • the same note or scratchpad privilege,
  • the same deadline or latency constraints when applicable.

The single-workspace comparator is strong by design: it can use retries, self-consistency, and self-critique within the same priced budget.

The only intentional exception is the explicit tool-ablation pilot, where tool access is the manipulated variable and the config marks same_tool_access: false.
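Only the same_tool_access flag itself is documented above; a hypothetical config fragment illustrating where such a flag might sit (all surrounding keys are invented for illustration) could look like:

```yaml
# Hypothetical shape of a tool-ablation pilot config.
# Only same_tool_access: false is documented in this README.
pilot: live_tool_ablation
fairness:
  same_priced_budget: true
  same_context_ceiling: true
  same_tool_access: false   # the manipulated variable in this pilot
```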

Running A Pilot

Install:

python -m venv .venv
source .venv/bin/activate
pip install -e .[dev]

Run the smoke test:

bash scripts/check_ollama.sh
bash scripts/run_smoke_test.sh

Run the minimal live Ollama smoke experiment:

bash scripts/run_live_smoke_test.sh

Run the live synthetic pilot:

bash scripts/run_live_synthetic_pilot.sh

Run the live verification-budget pilot:

bash scripts/run_live_verification_pilot.sh

Run the live topology pilot:

bash scripts/run_live_topology_pilot.sh

Run the live tool-ablation pilot:

bash scripts/run_live_tool_pilot.sh

Run the live engineer-task pilot:

bash scripts/run_live_engineer_pilot.sh

Run the synthetic pilot:

bash scripts/run_synthetic_pilot.sh

Run the verification-budget pilot:

bash scripts/run_verification_pilot.sh

Run the topology pilot:

bash scripts/run_topology_pilot.sh

Run the tiny benchmark-style scaffold:

bash scripts/run_small_benchmark_pilot.sh

Create summaries:

bash scripts/make_summary_tables.sh results/runs/synthetic_pilot
bash scripts/make_figures.sh results/runs/synthetic_pilot

Each run directory now includes:

  • summary.json
  • summary_recomputed.json
  • family_summary.json
  • failure_summary.json
  • task_diffs.jsonl
  • task_diffs.csv
  • task_diffs.md
  • summary.png
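A minimal sketch for consuming the per-task diff log follows; the record schema is not specified in this README, so the loader treats each line as an opaque JSON object:

```python
import json
from pathlib import Path

def load_task_diffs(run_dir):
    """Read one task-level diff record per line from task_diffs.jsonl
    in the given run directory, skipping blank lines."""
    path = Path(run_dir) / "task_diffs.jsonl"
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```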

Citation

Please cite the paper:

Takahashi, K. (2026). When Should Inference Be Split? A Fixed-Budget Theory of Predictable Multi-Agent Advantage under Local Context Ceilings. Zenodo. https://doi.org/10.5281/zenodo.18932509
