Fixed-budget multi-agent inference benchmark harness for Takahashi (2026), with matched strong single-agent baselines, local context ceilings, and topology-sensitive diagnostics.
Runs locally with Ollama gemma3:1b on CPU to test when split inference helps, hurts, or fails under fair budgets; the TeX source in papers/when_should_inference_be_split.tex remains the canonical specification.
The default execution target is local-only and CPU-feasible:
- local Ollama only
- default model: gemma3:1b
- default base URL: http://localhost:11434/api
- small synthetic pilots first
- no cloud provider assumptions
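Assuming the standard Ollama HTTP API (a non-streaming POST to `/api/generate`), a minimal local client sketch for the defaults above might look like this; only `build_generate_request` is exercised without a running server:

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434/api"  # default base URL from above
DEFAULT_MODEL = "gemma3:1b"                     # default local model


def build_generate_request(prompt: str, model: str = DEFAULT_MODEL) -> urllib.request.Request:
    """Build a non-streaming /api/generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_BASE_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = DEFAULT_MODEL) -> str:
    """Send the request and return the completion text (requires Ollama running)."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

The actual adapter lives in src/llm/ollama_client.py; this sketch only illustrates the request shape the local-only default implies.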
The first public version focuses on low-compute, falsifiable pilot experiments that preserve the paper's distinctions:
- q_x(A) versus u_x^{dep}(A) are reported separately.
- Structural claims, tractable-surrogate claims, proxy-certified diagnostics, and design rules are tagged separately.
- Metrics are tagged as observable, proxy-certifiable, or latent-unidentified.
- Matched-budget fairness is enforced against a strong single-workspace baseline.
- The local context ceiling W_max is separate from the additive priced budget R_add = (C, M, L, V, H).
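One way to keep the W_max/R_add distinction explicit in harness code is to model them as separate checks. The field names below are hypothetical conveniences, not the harness's actual types; the component semantics of C, M, L, V, H are defined in the paper:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdditiveBudget:
    """The five priced components of R_add = (C, M, L, V, H).

    Component semantics follow the paper; the lowercase field names
    here are purely illustrative."""
    c: int
    m: int
    l: int
    v: int
    h: int

    def covers(self, spent: "AdditiveBudget") -> bool:
        """True if the spent amounts fit component-wise within this budget."""
        return (self.c >= spent.c and self.m >= spent.m and self.l >= spent.l
                and self.v >= spent.v and self.h >= spent.h)


def within_context_ceiling(prompt_tokens: int, w_max: int) -> bool:
    """W_max is a hard per-call ceiling, checked separately from R_add."""
    return prompt_tokens <= w_max
```

A run can exhaust R_add without ever touching W_max, and vice versa; keeping the two checks separate is what makes that distinction auditable.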
The runnable pilots currently emphasize synthetic interaction tasks because they let the harness probe decomposition, topology sensitivity, relay loss, shared-memory recovery, replication failure, verification bottlenecks, and tool-vs-no-tool effects under exact instrumentation. Lightweight math-style and code-style adapters are included as scaffolds for tiny pilot subsets, not as official benchmark claims.
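As a toy illustration of the relay-loss effect those synthetic tasks probe, consider a fact passed through a chain of relays, each of which independently drops it with some probability; survival decays roughly as (1 - p)^k. This simulation is illustrative only and is not one of the harness's tasks:

```python
import random


def relay_survival_rate(chain_length: int, drop_prob: float,
                        trials: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the chance a fact survives a relay chain
    in which every hop independently drops it with probability drop_prob."""
    rng = random.Random(seed)  # seeded for reproducibility
    survived = 0
    for _ in range(trials):
        if all(rng.random() >= drop_prob for _ in range(chain_length)):
            survived += 1
    return survived / trials
```

With a 10% per-hop drop rate, a four-hop chain already loses roughly a third of facts, which is why topology sensitivity shows up even at small scales.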
The local Ollama adapter is implemented in src/llm/ollama_client.py. The upgraded live path now uses structure-dependent prompts, a deterministic local calculator, finer-grained verification effort, task-level failure labels, and machine-generated task diff reports.
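A deterministic local calculator of the kind the live path uses can be built on the standard-library `ast` module rather than `eval`, so arithmetic tool calls are reproducible and sandboxed. This is a minimal sketch under that assumption, not the adapter's actual code:

```python
import ast
import operator

# Whitelisted arithmetic operators: the tool evaluates pure arithmetic only.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}


def calculate(expr: str) -> float:
    """Deterministically evaluate an arithmetic expression without eval()."""
    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(expr, mode="eval").body)
```

Anything outside the whitelist (names, calls, comparisons) raises, which keeps the tool's behavior fully specified for instrumentation.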
The upgraded live pilots were re-run on March 10, 2026 with outputs under:
- results/runs/20260310_170449_live_synthetic_v3
- results/runs/20260310_170449_live_verification_v3
- results/runs/20260310_170449_live_topology_v3
- results/runs/20260310_170449_live_tool_v1
- results/runs/20260310_170449_live_engineer_v2
High-level takeaways from those runs:
- the strong single-workspace baseline remained highly competitive, tying or leading in the main synthetic and topology pilots,
- balanced heterogeneous verification beat low, medium-low, and high verification on the current verification-heavy live pack,
- explicit calculator access was decisive on arithmetic-sensitive live tasks,
- the current shared-memory live implementation failed badly on the interaction-heavy topology pilot; repeated stale-reuse and prompt-bloat labels now make that failure inspectable,
- the recalibrated engineer-style structured tasks are now partially solvable, but only in a narrow heterogeneous regime.
See result_summary.md for the live-only interpretation.
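The stale-reuse and prompt-bloat labels above come from the task-level failure labeling. A sketch of how such labels might be tallied into a failure_summary-style count follows; the record and key shapes here are assumed, not the harness's actual schema:

```python
from collections import Counter


def summarize_failures(task_records: list[dict]) -> dict[str, int]:
    """Tally failure labels across task records, ignoring successful tasks."""
    counts = Counter(
        rec["failure_label"]
        for rec in task_records
        if not rec.get("success") and rec.get("failure_label")
    )
    return dict(counts.most_common())
```

Repeated labels surfacing at the top of such a tally is what makes a failure mode like stale reuse inspectable rather than anecdotal.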
Out of scope for this version:
- Real frontier-model provider integrations.
- Full empirical validation of the paper's latent structural quantities on open benchmarks.
- Universal theorems from synthetic pilots.
- Official GSM8K, MBPP, or HumanEval asset redistribution.
- Large or expensive gemma3:1b sweeps.
- Any cloud-only orchestration workflow.
Every comparative experiment is intended to hold fixed:
- the same priced additive budget,
- the same local context ceiling,
- the same tool access,
- the same note or scratchpad privilege,
- the same deadline or latency constraints when applicable.
The single-workspace comparator is strong by design: it can use retries, self-consistency, and self-critique within the same priced budget.
The only intentional exception is the explicit tool-ablation pilot, where tool access is the manipulated variable and the config marks same_tool_access: false.
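The fairness contract above can be enforced mechanically by diffing two run configs on the matched fields, with tool access as the one sanctioned exception. The field key names below are hypothetical; only `same_tool_access` is taken from the config convention described above:

```python
# Fields every comparative experiment must hold fixed (key names hypothetical).
MATCHED_FIELDS = ("budget", "context_ceiling", "tool_access",
                  "scratchpad", "deadline")


def check_matched_budget(cfg_a: dict, cfg_b: dict) -> list[str]:
    """Return the matched fields on which two run configs disagree.

    The tool-ablation pilot is the one sanctioned exception: when either
    config sets same_tool_access to False, tool_access may differ."""
    allow_tool_diff = (cfg_a.get("same_tool_access") is False
                       or cfg_b.get("same_tool_access") is False)
    mismatches = []
    for field in MATCHED_FIELDS:
        if field == "tool_access" and allow_tool_diff:
            continue
        if cfg_a.get(field) != cfg_b.get(field):
            mismatches.append(field)
    return mismatches
```

An empty return means the comparison is fair under the contract; any non-empty return names exactly which fairness condition was violated.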
Install:
python -m venv .venv
source .venv/bin/activate
pip install -e .[dev]

Run the smoke test:
bash scripts/check_ollama.sh
bash scripts/run_smoke_test.sh

Run the minimal live Ollama smoke experiment:
bash scripts/run_live_smoke_test.sh

Run the live synthetic pilot:
bash scripts/run_live_synthetic_pilot.sh

Run the live verification-budget pilot:
bash scripts/run_live_verification_pilot.sh

Run the live topology pilot:
bash scripts/run_live_topology_pilot.sh

Run the live tool-ablation pilot:
bash scripts/run_live_tool_pilot.sh

Run the live engineer-task pilot:
bash scripts/run_live_engineer_pilot.sh

Run the synthetic pilot:
bash scripts/run_synthetic_pilot.sh

Run the verification-budget pilot:
bash scripts/run_verification_pilot.sh

Run the topology pilot:
bash scripts/run_topology_pilot.sh

Run the tiny benchmark-style scaffold:
bash scripts/run_small_benchmark_pilot.sh

Create summaries:
bash scripts/make_summary_tables.sh results/runs/synthetic_pilot
bash scripts/make_figures.sh results/runs/synthetic_pilot

Each run directory now includes:
- summary.json
- summary_recomputed.json
- family_summary.json
- failure_summary.json
- task_diffs.jsonl
- task_diffs.csv
- task_diffs.md
- summary.png
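A quick completeness check over a run directory can verify those artifacts were all produced; this helper is illustrative and not part of the harness:

```python
from pathlib import Path

# Artifacts every run directory is expected to contain (per the list above).
EXPECTED_ARTIFACTS = (
    "summary.json", "summary_recomputed.json", "family_summary.json",
    "failure_summary.json", "task_diffs.jsonl", "task_diffs.csv",
    "task_diffs.md", "summary.png",
)


def missing_artifacts(run_dir: str) -> list[str]:
    """Return the expected artifacts absent from a run directory."""
    root = Path(run_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]
```

Running it against a fresh run directory (e.g. results/runs/synthetic_pilot) should return an empty list.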
Please cite the paper:
Takahashi, K. (2026). When Should Inference Be Split? A Fixed-Budget Theory of Predictable Multi-Agent Advantage under Local Context Ceilings. Zenodo. https://doi.org/10.5281/zenodo.18932509