kadubon/split-inference-bench

split-inference-bench

Fixed-budget multi-agent inference benchmark harness for Takahashi (2026), with matched strong single-agent baselines, local context ceilings, and topology-sensitive diagnostics. It runs locally with Ollama gemma3:1b on CPU to test when split inference helps, hurts, or fails under fair budgets; the TeX source in papers/when_should_inference_be_split.tex remains the canonical specification.

The default execution target is local-only and CPU-feasible:

  • local Ollama only
  • default model gemma3:1b
  • default base URL http://localhost:11434/api
  • small synthetic pilots first
  • no cloud provider assumptions
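For orientation, a minimal non-streaming call against that default endpoint can be sketched as below. The actual adapter lives in src/llm/ollama_client.py; the helper names here are illustrative, not the repo's API.

```python
import json
from urllib import request

OLLAMA_BASE_URL = "http://localhost:11434/api"  # default base URL from this README
DEFAULT_MODEL = "gemma3:1b"                     # default local model

def build_generate_payload(prompt, model=DEFAULT_MODEL):
    """Build the JSON body for Ollama's /generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt, base_url=OLLAMA_BASE_URL, model=DEFAULT_MODEL, timeout=120):
    """POST the prompt to a locally running Ollama server and return the text."""
    body = json.dumps(build_generate_payload(prompt, model)).encode("utf-8")
    req = request.Request(f"{base_url}/generate", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```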

What This Repo Tests

The first public version focuses on low-compute, falsifiable pilot experiments that preserve the paper's distinctions:

  • q_x(A) and u_x^{dep}(A) are reported separately.
  • Structural claims, tractable-surrogate claims, proxy-certified diagnostics, and design rules are tagged separately.
  • Metrics are tagged as observable, proxy-certifiable, or latent-unidentified.
  • Matched-budget fairness is enforced against a strong single-workspace baseline.
  • The local context ceiling W_max is separate from the additive priced budget R_add = (C, M, L, V, H).

The runnable pilots currently emphasize synthetic interaction tasks, because these let the harness probe decomposition, topology sensitivity, relay loss, shared-memory recovery, replication failure, verification bottlenecks, and tool-vs-no-tool effects under exact instrumentation. Lightweight math-style and code-style adapters are included as scaffolds for tiny pilot subsets, not as official benchmark claims.

The local Ollama adapter is implemented in src/llm/ollama_client.py. The upgraded live path now uses structure-dependent prompts, a deterministic local calculator, finer-grained verification effort, task-level failure labels, and machine-generated task diff reports.
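The repo's deterministic calculator is not reproduced here; one conventional way to build such a tool, shown purely as an assumed sketch, is a whitelisted AST evaluator that accepts plain arithmetic and rejects everything else:

```python
import ast
import operator

# Whitelisted operators for a deterministic, side-effect-free calculator.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def calc(expr: str):
    """Evaluate a plain arithmetic expression without eval();
    raises ValueError on any non-arithmetic node."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError(f"unsupported expression: {expr!r}")
    return ev(ast.parse(expr, mode="eval"))
```

Determinism matters for the harness: unlike a model's mental arithmetic, the same expression always yields the same answer, so arithmetic-sensitive failures can be attributed cleanly.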

Current Live Snapshot

The upgraded live pilots were re-run on March 10, 2026 with outputs under:

  • results/runs/20260310_170449_live_synthetic_v3
  • results/runs/20260310_170449_live_verification_v3
  • results/runs/20260310_170449_live_topology_v3
  • results/runs/20260310_170449_live_tool_v1
  • results/runs/20260310_170449_live_engineer_v2

High-level takeaways from those runs:

  • strong single-workspace remained highly competitive and tied or led in the main synthetic and topology pilots,
  • balanced heterogeneous verification beat low, medium-low, and high verification on the current verification-heavy live pack,
  • explicit calculator access was decisive on arithmetic-sensitive live tasks,
  • the current shared-memory live implementation failed badly on the interaction-heavy topology pilot, and repeated stale-reuse and prompt-bloat labels now make that failure inspectable,
  • the recalibrated engineer-style structured tasks are now partially solvable, but only in a narrow heterogeneous regime.

See result_summary.md for the live-only interpretation.

What This Repo Does Not Yet Test

  • Real frontier-model provider integrations.
  • Full empirical validation of the paper's latent structural quantities on open benchmarks.
  • Universal theorems from synthetic pilots.
  • Official GSM8K, MBPP, or HumanEval asset redistribution.
  • Large or expensive gemma3:1b sweeps.
  • Any cloud-only orchestration workflow.

Repo Layout

Fairness Rules

Every comparative experiment is intended to hold fixed:

  • the same priced additive budget,
  • the same local context ceiling,
  • the same tool access,
  • the same note or scratchpad privilege,
  • the same deadline or latency constraints when applicable.

The single-workspace comparator is strong by design: it can use retries, self-consistency, and self-critique within the same priced budget.

The only intentional exception is the explicit tool-ablation pilot, where tool access is the manipulated variable and the config marks same_tool_access: false.
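Only the same_tool_access flag itself is documented above; a hypothetical config fragment illustrating where such a flag might sit (all surrounding keys are invented for illustration) could look like:

```yaml
# Hypothetical shape of a tool-ablation pilot config.
# Only same_tool_access: false is documented in this README.
pilot: live_tool_ablation
fairness:
  same_priced_budget: true
  same_context_ceiling: true
  same_tool_access: false   # the manipulated variable in this pilot
```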

Running A Pilot

Install:

python -m venv .venv
source .venv/bin/activate
pip install -e .[dev]

Run the smoke test:

bash scripts/check_ollama.sh
bash scripts/run_smoke_test.sh

Run the minimal live Ollama smoke experiment:

bash scripts/run_live_smoke_test.sh

Run the live synthetic pilot:

bash scripts/run_live_synthetic_pilot.sh

Run the live verification-budget pilot:

bash scripts/run_live_verification_pilot.sh

Run the live topology pilot:

bash scripts/run_live_topology_pilot.sh

Run the live tool-ablation pilot:

bash scripts/run_live_tool_pilot.sh

Run the live engineer-task pilot:

bash scripts/run_live_engineer_pilot.sh

Run the synthetic pilot:

bash scripts/run_synthetic_pilot.sh

Run the verification-budget pilot:

bash scripts/run_verification_pilot.sh

Run the topology pilot:

bash scripts/run_topology_pilot.sh

Run the tiny benchmark-style scaffold:

bash scripts/run_small_benchmark_pilot.sh

Create summaries:

bash scripts/make_summary_tables.sh results/runs/synthetic_pilot
bash scripts/make_figures.sh results/runs/synthetic_pilot

Each run directory now includes:

  • summary.json
  • summary_recomputed.json
  • family_summary.json
  • failure_summary.json
  • task_diffs.jsonl
  • task_diffs.csv
  • task_diffs.md
  • summary.png
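A minimal sketch for consuming the per-task diff log follows; the record schema is not specified in this README, so the loader treats each line as an opaque JSON object:

```python
import json
from pathlib import Path

def load_task_diffs(run_dir):
    """Read one task-level diff record per line from task_diffs.jsonl
    in the given run directory, skipping blank lines."""
    path = Path(run_dir) / "task_diffs.jsonl"
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```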

Citation

Please cite the paper:

Takahashi, K. (2026). When Should Inference Be Split? A Fixed-Budget Theory of Predictable Multi-Agent Advantage under Local Context Ceilings. Zenodo. https://doi.org/10.5281/zenodo.18932509
