Token economics in genesis v0.3.6: real-telemetry validation across 3 scenarios × 4 cells by danielmeppiel · Pull Request #12 · danielmeppiel/genesis

danielmeppiel · 2026-05-29T10:48:08Z

Headline

This PR makes token economics a first-class design dimension in the genesis skill (v0.2 → v0.3.6). A new vocabulary of patterns lets the architect reason about which model each sub-agent runs on, which tools it can call, and how big each prompt needs to be — not just whether the workflow is correct.

We validated this on three workloads against four execution shapes each (twelve cells total), with every dollar harvested from the Copilot cloud events table — no estimates, no models. The headline is honest:

v0.3.6 is insurance, not optimisation. It pays back on workloads where missing a finding is expensive — supply-chain security, anti-pattern rejection, verifier-confirmed audits. It does not pay back on simple workloads where a single Sonnet pass already produces good output.

v0.3.7 update (this PR's headline addition). The B14b CAVEMAN BRIEF + B14c AUDIENCE BOUNDARY substrate is now empirically load-bearing. A cold-start fidelity test (Opus architect → fresh Sonnet executor, packet-only, no skill loaded) produces 5/5 CAVEMAN_FULL spawn briefs on the wire for S1's 5-lens panel (vs v0.3.6's 0/9), and the cost effect at quality parity is −24.6% on S1, −27.5% on S2N with real telemetry. S3 substrate path is byte-identical to v0.3.6; its +37.5% cell number is documented fixture-rebuild overhead. Detail in § v0.3.7 cold-start packet fidelity test.

The next three sections explain why that conclusion is the right one, what the experiment actually measured, and how to read the per-cell numbers.

TL;DR — when to pick which

If you need...	Pick	Why (one-line evidence)
A deterministic batch op (rename, codemod, regex refactor)	zero-sonnet with shell access	Sonnet + one `sed` call cost $4.81 on S3 — architecture is dead weight here
A one-off doc audit, mistakes are cheap to recover	zero-sonnet	$6.20 for a 7/10 audit on S2; architected cells charge $3.60 more for one quality point
Verifier-confirmed precision on a structured-finding job	v0.3.6	S2-v0.3.6 is the only cell with an A7 verifier pass — 6/6 HIGH confirmed, 0 downgrades, $9.80
To catch supply-chain BLOCKERs on a security-sensitive PR review	v0.3.6	S1-v0.3.6 caught both arbitrary-cmd-exec and validation-bypass BLOCKERs at $24.59; zero-sonnet missed both at $8.89
A workflow you can run nightly in CI, audit by reviewer, hand to a junior operator	v0.3.6	The architected packet is a committed artifact with per-lens trace; zero-x is per-run improvisation
To avoid the worst-case cost trap (over $30 on a small job)	v0.3.6, never v0.2	v0.2's per-file loop cost $33.79 on the same rename v0.3.6 did for $10.40 — bounded variance is the real win

Per-cell scorecard (raw $, quality, weighted ROI):

Cell	Cost	Quality	ROI_raw	ROI_weighted	What it bought
S3-v0.3.6	$10.40	10/10	0.96	0.96	Bounded-variance rename via S7 BRIDGE
S3-zero-sonnet	$4.81	10/10	2.08	2.08	Cheapest correct path on a deterministic batch
S3-v0.2	$33.79	9/10	0.27	0.27	Cautionary tale — per-file loop
S3-zero-opus	$41.01	2/10	0.05	0.02	Fuel-burned-on-a-wrong-symbol failure
S2-v0.3.6	$9.80	8/10	0.82	2.45	Verifier-confirmed precision (only cell)
S2-zero-sonnet	$6.20	7/10	1.13	2.90	Cheapest reasonable audit
S2-v0.2	$9.42	7/10	0.74	2.34	Five lenses, no verifier
S2-zero-opus	$33.01	8/10	0.24	0.73	Pays Opus premium without measurable return
S1-v0.3.6	$24.59	9/10	0.37	2.28	Only cell that caught both supply-chain BLOCKERs
S1-zero-sonnet	$8.89	6/10	0.67	2.59	Reasonable review, missed BLOCKERs
S1-v0.2	$3.94	3/10	0.76	2.79	Under-completed (27-line stub)
S1-zero-opus	$19.82	7/10	0.35	1.36	Opus premium does not pay back vs. Sonnet
S3-v0.3.7	$14.30	10/10	0.70	0.70	Substrate path identical to v0.3.6; cell overhead from fixture rebuild
S2-v0.3.7	$7.10	8/10	1.13	3.38	Best ROI in the matrix; 0 spawns (AUDIENCE BOUNDARY refusal)
S1-v0.3.7	$18.53	8/10	0.43	2.27	5/5 CAVEMAN_FULL on the wire; −24.6% vs v0.3.6 at near-parity

Higher ROI is better. ROI_weighted uses BLOCKER × 5, HIGH × 3, MEDIUM × 1, LOW × 0.3. Full ROI definition in REPORT.md § The ROI function.

Why this experiment exists

Genesis is a skill an LLM-driven agent loads to architect other agentic workflows. When a team says "design us a PR-review panel" or "audit our docs corpus", the genesis architect decides how many sub-agents to spawn, which model each runs on, what tools each can call, and how to wire their outputs.

Before this PR, the architect optimised for correctness only. Cost was implicit, and that made the architect prone to two failure modes we kept observing in practice:

Naive fan-out — spawning a 5-lens panel of Opus sub-agents for a task that a single Sonnet shell call could finish. Quality unchanged, cost 10–50× higher.
Wrong-primitive binding — designing a per-file view + edit loop for a 19-file rename instead of a single perl -i shell invocation. The agent does the work; it just does it through the wrong tool surface.

The patterns added in v0.3.6 give the architect explicit vocabulary for both: B12 SELECTION RULE (how to pick the right primitive), B15 TOOL SUBSET + S7 DETERMINISTIC TOOL BRIDGE (when a shell call replaces N file edits), A12 GRADIENT WORKFLOW + benchmark-grounded model catalog (Sonnet for the bulk, Opus only for narrow arbiter roles), and A1 PANEL + A7 ADVERSARIAL VERIFIER (multi-stream reviewability with precision evidence). The question this PR answers is: do those patterns actually move the needle in production-grade conditions, on real workloads, against honest baselines?

How we measured

Three scenarios, each run four ways:

S3 — Bulk rename (19 JS files, 60–90 reference sites, npm test must pass)
S2 — Doc audit (11-page CLI documentation corpus, drift + link + schema lenses)
S1 — PR review (feat(lsp): add first-class LSP server support to install pipeline microsoft/apm#1424, +2363/-114 across 24 files)

The four cells per scenario are zero-opus and zero-sonnet (single-shot prompt to a frontier model, no architecture), v0.2 architected (workflow designed by the previous-version architect), and v0.3.6 architected (workflow designed by this PR's architect with the new pattern vocabulary).

Cost is the real-telemetry sum of usage_input_tokens, usage_output_tokens, usage_cache_read_tokens, and usage_cache_write_tokens for the executor session and any sub-agents it spawned, priced at Anthropic public rates. Per-cell cost-report.json files are committed alongside chat-session ids for replay. The architect session is not counted — architecting is amortised, infrequent infrastructure.
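The cost rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PR's own tooling: `Rates`, `session_cost`, and `cell_cost` are hypothetical names, and any rate values passed in are placeholders rather than Anthropic's published prices. Only the four `usage_*` counter names come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per million tokens for one model (values are supplied by the caller)."""
    input: float
    output: float
    cache_read: float
    cache_write: float


def session_cost(usage: dict[str, int], rates: Rates) -> float:
    """Price one session's four token counters at the given per-million rates."""
    total = (
        usage.get("usage_input_tokens", 0) * rates.input
        + usage.get("usage_output_tokens", 0) * rates.output
        + usage.get("usage_cache_read_tokens", 0) * rates.cache_read
        + usage.get("usage_cache_write_tokens", 0) * rates.cache_write
    )
    return total / 1_000_000


def cell_cost(sessions: list[tuple[dict[str, int], Rates]]) -> float:
    """Sum the executor session and its sub-agent sessions.

    The architect session is simply not passed in, mirroring the rule
    that architecting is amortised and not counted per run.
    """
    return sum(session_cost(usage, rates) for usage, rates in sessions)
```

Pairing each session with its own `Rates` is a design choice made here so that a cell mixing several models prices each session at that model's rate.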

Quality was graded by the orchestrator against scenario-specific rubrics: binary correctness plus tool-mechanic cleanliness for S3, real structural drift caught + verifier precision for S2, actionable bugs caught + severity calibration for S1. Full grading detail with the per-finding evidence is in REAL-TELEMETRY-RESULTS.md.

A buyer's decision is return on each dollar, not absolute spend, so we score every cell on three ROI axes — raw QualityScore / Cost, severity-weighted (BLOCKER × 5, HIGH × 3, MEDIUM × 1, LOW × 0.3), and tail-risk-adjusted (QualityScore / (Cost + P(failure) × C_failure)). The first answers "what does the dollar buy?"; the second weights findings by impact-class; the third dominates when missing one finding costs orders of magnitude more than the run itself. Higher is better on all three. The full definition with inputs and caveats is in REPORT.md § The ROI function.
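The three axes can be written down directly from the formulas in this paragraph. The sketch below is an assumption-laden reading, not the definition in REPORT.md: the function names are hypothetical, and `roi_weighted` assumes the weighted axis is simply severity-weighted points divided by cost, which REPORT.md may refine.

```python
# Severity weights as stated in the PR description.
SEVERITY_WEIGHTS = {"BLOCKER": 5.0, "HIGH": 3.0, "MEDIUM": 1.0, "LOW": 0.3}


def _check_cost(cost: float) -> None:
    if cost <= 0:
        raise ValueError("cost must be positive")


def roi_raw(quality: float, cost: float) -> float:
    """QualityScore / Cost."""
    _check_cost(cost)
    return quality / cost


def weighted_points(findings: dict[str, int]) -> float:
    """Sum of finding counts, each multiplied by its severity weight."""
    return sum(SEVERITY_WEIGHTS[sev] * n for sev, n in findings.items())


def roi_weighted(findings: dict[str, int], cost: float) -> float:
    """Assumed form: severity-weighted points per dollar."""
    _check_cost(cost)
    return weighted_points(findings) / cost


def roi_tail(quality: float, cost: float, p_failure: float, c_failure: float) -> float:
    """QualityScore / (Cost + P(failure) * C_failure)."""
    _check_cost(cost)
    return quality / (cost + p_failure * c_failure)
```

As a sanity check against the scorecard, `roi_raw(9, 24.59)` rounds to 0.37 for S1-v0.3.6, and 56 weighted points at $24.59 gives about 2.28 points per dollar.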

What we found — three findings, in plain language

1. Bad architecture costs more than no architecture

The S3 rename cell where the v0.2 architect designed a per-file view + edit loop cost $33.79 in real telemetry. The same rename, executed by Sonnet directly with one sed shell call, cost $4.81. Same outcome, 7× the cost, because every tool call round-trips the entire conversation context back to the model and a 19-file rename done one file at a time pays that overhead 19 times.

The v0.3.6 architect, given the same brief, rejected the per-file design and chose a single grep | xargs perl -i shell call (the S7 DETERMINISTIC TOOL BRIDGE pattern). It cost $10.40, which is 3.2× cheaper than v0.2 and about 2.2× the cost of zero-sonnet ($4.81). The architecture overhead bought bounded variance: no risk of the per-file loop trap, no risk of substituting the wrong symbol the way zero-opus did. This is the load-bearing claim of the PR: the new pattern vocabulary makes the architect reject wrong-tool-surface designs, and the gap is measured, not modelled.
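The round-trip effect behind the per-file loop can be illustrated with a toy model. The sketch below assumes every turn re-sends the full conversation and that each tool call appends a fixed number of tokens; `rolling_input_tokens` is a hypothetical helper, and the `BASE` and `GROWTH` figures are invented for illustration rather than taken from the telemetry. It ignores cache pricing, so it shows the shape of the problem, not the measured dollar ratio.

```python
def rolling_input_tokens(turns: int, base_context: int, growth_per_turn: int) -> int:
    """Total input tokens when each turn re-sends the whole conversation so far."""
    total = 0
    context = base_context
    for _ in range(turns):
        total += context            # the model re-reads everything sent so far
        context += growth_per_turn  # the tool call and its result are appended
    return total


# HYPOTHETICAL numbers, for illustration only.
BASE = 50_000    # tool descriptions + system prompt + brief
GROWTH = 2_000   # tokens appended by one tool call and its result

# Per-file design: 19 files x (view + edit) = 38 turns.
per_file = rolling_input_tokens(turns=19 * 2, base_context=BASE, growth_per_turn=GROWTH)
# Single shell call: one turn to issue it, one turn to read the result.
single = rolling_input_tokens(turns=2, base_context=BASE, growth_per_turn=GROWTH)

print(per_file, single)
```

Because the context grows on every turn, total input scales roughly with the square of the turn count, which is why collapsing 38 turns into 2 matters far more than the size of any individual edit.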

2. On simple workloads, single-shot Sonnet is the cost-rational baseline

On S2 doc-audit and S1 PR review, zero-sonnet wins on raw $/quality. Sonnet at $6.20 produced a 7/10 audit. The v0.3.6 architecture at $9.80 produced an 8/10 audit. Sonnet at $8.89 produced a 6/10 PR review. The v0.3.6 architecture at $24.59 produced a 9/10 review.

Architecture has a fixed overhead. Every sub-agent dispatch reloads the Copilot CLI tool descriptions, the system prompt, and the genesis skill bundle — roughly 80K input tokens per turn that the single-shot path does not pay. On workloads where one Sonnet pass already produces good output, that overhead is dead weight. The v0.3.6 architect knows this: its S2 doc-audit handoff explicitly chose monolithic single-session over fan-out, citing the B12 SELECTION RULE wrong-primitive-binding warning.

What architecture buys on these workloads is not lower mean cost. It is multi-stream reviewability (five independent lens verdicts instead of one prose blob), class-routed quality (Sonnet on reviewer lenses, Haiku on trivial ones), and — uniquely on v0.3.6 — adversarial verification. The S2-v0.3.6 cell is the only one with a verifier pass that confirmed 6/6 HIGH findings with zero downgrades. That is the kind of precision evidence a buyer cannot get from a single-shot prompt at any cost.

3. On production-critical workloads, severity weighting flips the verdict

The S1 PR review is the test case where ROI_raw and ROI_tail diverge. Three of the four cells produced reasonable reviews. Only v0.3.6 caught the two supply-chain security BLOCKERs — arbitrary command execution via plugin-controlled LSP config, and validation bypass via the raw-dict install path. zero-opus and zero-sonnet both missed both. v0.2 under-completed the workflow entirely (a 27-line YAML stub with three bullet points).

On raw $/quality, zero-sonnet still beats v0.3.6 (0.67 vs 0.37). On severity-weighted ROI, the gap narrows to 2.59 vs 2.28 weighted points per dollar, because the BLOCKER × 5 weighting lifts v0.3.6 to 56 weighted points, nearly twice the next cell's total. On tail-risk-adjusted ROI, v0.3.6 dominates: the cost of merging a supply-chain CVE into a published CLI is at minimum thousands of dollars in remediation and reputation, against an architecture premium of about $15.70 over zero-sonnet. The math says: pay the premium when C_failure ≫ Cost, and do not pay it otherwise. That is precisely the conclusion the v0.3.6 architect's catalog now encodes.
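The "pay the premium when C_failure ≫ Cost" rule can be turned into a break-even figure. The sketch below is a derivation from the ROI_tail formula stated earlier, not something the PR reports: it assumes P(failure) = 0 for the architected cell and P(failure) = 1 for the baseline (the empirical 0/1 of a single run, not a calibrated probability), and `break_even_failure_cost` is a hypothetical name.

```python
def break_even_failure_cost(
    q_arch: float,
    cost_arch: float,
    q_base: float,
    cost_base: float,
    p_fail_base: float = 1.0,
) -> float:
    """Smallest C_failure at which the architected cell's tail-ROI matches the baseline's.

    Solves  q_arch / cost_arch = q_base / (cost_base + p_fail_base * C)
    for C, assuming the architected cell has P(failure) = 0.
    """
    if not 0 < p_fail_base <= 1:
        raise ValueError("p_fail_base must be in (0, 1]")
    c = (q_base * cost_arch / q_arch - cost_base) / p_fail_base
    return max(0.0, c)


# S1 scorecard values: v0.3.6 (9/10, $24.59) vs zero-sonnet (6/10, $8.89).
print(round(break_even_failure_cost(9, 24.59, 6, 8.89), 2))  # 7.5
```

Under those assumptions the premium pays for itself once a missed finding costs more than roughly $7.50, far below the "thousands of dollars" scale cited for a merged supply-chain CVE. With a smaller baseline failure probability the break-even rises proportionally.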

The full per-cell numbers, including ROI_raw, ROI_weighted, and the failure-mode column for each cell, are in the scorecard at the top of REPORT.md. Cost methodology is in PROFILING-PROTOCOL.md. Full grading detail with the per-finding evidence is in REAL-TELEMETRY-RESULTS.md.

Per-pattern attribution — which patterns moved which dollars

Pattern	Empirical effect	Confidence
S7 DETERMINISTIC TOOL BRIDGE	$23.39 saved on S3 (v0.2 $33.79 → v0.3.6 $10.40) — same brief, same fixture	Strong
B12 SELECTION RULE + explicit `model:` declarations	~$3–5 saved on S1 by routing 63 trivial-lens calls to Haiku ($2.01) instead of Sonnet	Strong — per-model breakdown in `cost-report.json`
A12 GRADIENT WORKFLOW	Applied in S1; refused by the architect in S2 and S3 (gradient-free workflows). The refusal is itself the win on simple workloads	Strong-by-omission
B15 TOOL SUBSET	Visible at the Haiku tier (6K input floor vs. 35K Sonnet sub-agent)	Partial — only effective on tier-aware agent types
A1 PANEL + A7 ADVERSARIAL VERIFIER	Quality patterns, not cost patterns. Spend $3.60 (S2) and $15.70 (S1) over zero-sonnet for verifier-confirmed precision and BLOCKER-catching	Strong — only S2-v0.3.6 has a verifier pass; only S1-v0.3.6 caught the BLOCKERs
B13 CACHE-AWARE PREFIX	No measurable differentiation — all cells already cache at 93–99%	None — currently a guideline, not a lever
B14b CAVEMAN BRIEF	Empirically load-bearing in v0.3.7 (this PR). 5/5 CAVEMAN_FULL on the wire on S1; cost delta −24.6% on S1, −27.5% on S2N at quality parity. Receipt audit at `S1-v037-real/caveman-classification.md`	Strong — receipts queryable, cost delta in real telemetry

The honest summary: S7 + B12 + B15 + A12 + B14b (new in v0.3.7) are doing the work; B13 + B16 are catalogued but not yet pulling weight. Full attribution and v0.4 frontier (5 quick wins) in REAL-TELEMETRY-RESULTS.md § Per-pattern attribution and § v0.3.6 flaws and quick wins.

Reliability and predictability — the dimension ROI does not capture

Raw $/quality scores a single cell on a single run. It misses three properties that matter once the question shifts from "which is cheapest for this one task?" to "which workflow do I want in a CI pipeline run nightly, or handed to a junior engineer?":

Variance bounds. Single-shot Sonnet has no worst-case floor. The architected workflow does — every lens runs, every verifier confirms, the worst case is the worst lens. n=1 here so we cannot measure variance directly, but the structural argument holds.
Auditability. S1-v0.3.6 produced 5 lens artifacts + 1 synthesizer + 1 verifier, all committed. Zero-sonnet produced one prose blob. For workloads that gate production decisions (security review, compliance audit, release sign-off), the multi-stream artifact is a working audit trail; the blob is not.
Repeatability. The architected workflow is a committed packet plus N committed sub-agents. A different operator, on a different day, running the same packet against the same brief, gets a comparable workflow shape. The single-shot prompt is improvisational — the prompt typed once in a chat session is not a reusable artifact unless a human turns it into a .agent.md and adds tests, which is exactly what the architect already does.

This shifts the buyer's calculation. For one-off use, zero-sonnet is fine. For productionised, repeatable, automatable use, v0.3.6 is the only thing on the table — because it is the only one that produces a workflow artifact at all. Detail in REAL-TELEMETRY-RESULTS.md § Reliability and predictability.

What changed in the corpus

New patterns in skills/genesis/assets/design-patterns.md: B12 SELECTION RULE, B13 CACHE-AWARE PREFIX, B14b CAVEMAN BRIEF, B15 TOOL SUBSET, B16 EFFORT GOVERNOR, S7 DETERMINISTIC TOOL BRIDGE
New architectural pattern A12 GRADIENT WORKFLOW
Model catalog with benchmark-grounding reference (vals.ai SWE-bench data)
Empirical-proof corpus at dev/empirical-proof/ — twelve runtime cells with real-telemetry validation, formal ROI definition, per-cell artifacts and chat-session ids for replay

What this PR does NOT prove

This is a small, opportunistic study. We have n=3 scenarios and n=1 per cell. The conclusions above are about the shape of the cost-quality curve under v0.3.6 patterns, not a SWE-bench-grade benchmark. Specifically:

It is not a claim that v0.3.6 is always cheaper. It is not, on raw $, for S1 and S2.
It is not a claim that the catalog of patterns is complete. Telemetry shows every sub-agent dispatch pays a fixed entry tax of roughly 6K tokens (Haiku/explore), 35K (Sonnet sub-agent), or 54K (full orchestrator) before the prompt is read, and naive per-file loops amplify it via rolling re-send: S3-v0.2 burned a 290:1 input/output ratio for that reason. No pattern in this PR compresses that surface; it is the visible frontier for the next PR. The full breakdown, with per-bucket turn counts and per-cell I/O ratios, is in REAL-TELEMETRY-RESULTS.md.
Quality grading is orchestrator-judged, not held against a human gold standard. We are confident about the ordinal rankings and the BLOCKER findings on S1; the absolute /10 scores are best-effort.
P(failure|cell) in ROI_tail is the empirical 0/1 of this single run, not a calibrated probability. The tail-risk argument relies on the structural claim that C_failure ≫ Cost for security-critical workloads, not on a fitted failure-rate model.
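The entry-tax figures in the list above lend themselves to a quick back-of-envelope estimate. The sketch below is illustrative only: the dictionary keys and function names are hypothetical, and the token figures are the rounded values quoted in this section.

```python
# Approximate fixed tokens paid per dispatch before the prompt is read,
# as quoted in this section. Key names are hypothetical labels.
ENTRY_TAX_TOKENS = {
    "haiku_explore": 6_000,
    "sonnet_subagent": 35_000,
    "orchestrator": 54_000,
}


def entry_tax(dispatches: dict[str, int]) -> int:
    """Total fixed-overhead tokens for a given number of dispatches per tier."""
    return sum(ENTRY_TAX_TOKENS[tier] * count for tier, count in dispatches.items())


def io_ratio(input_tokens: int, output_tokens: int) -> float:
    """Input/output token ratio; a high value signals context re-send overhead."""
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return input_tokens / output_tokens
```

For example, a hypothetical panel of five Haiku lenses plus one Sonnet synthesizer would pay `entry_tax({"haiku_explore": 5, "sonnet_subagent": 1})`, or 65,000 tokens, before any lens reads its brief.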

Replaces the analytical projection in scenario-pr-review-panel.md with ground-truth measurement from Copilot CLI per-turn telemetry.
Result: same panel job, same PR, same harness:
- Executor A (v0.2.0 design): 164 turns, 8.71M tokens, $5.01
- Executor B (v0.3.0 design): 89 turns, 4.07M tokens, $3.02
- 1.66x cheaper, no critical-finding regression
Adds:
- tools/profile-tokens.py (Copilot CLI usage-block parser)
- measurements/ (7 prior sessions of this work-stream)
- ab-experiment-apm-1424/ (the controlled A/B + REPORT.md)
The 15x headline from the prior projection is explicitly retracted: real reductions come mostly from B13 PROMPT THRIFT + tool-subset discipline (turn-count drop), not from model routing or aggressive cache tricks. Cache discipline (B12) is real and measurable at 90%+ in every session profiled, but it is an enabler not a multiplier.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Mirrors the PR #12 description so the ground-truth report is version-controlled alongside the data it cites. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Triggered by the empirical A/B in PR #12: Executor B declared role classes per agentic element but bound every element to Sonnet 4 (the session default), missing B12 MODEL ROUTER. Root cause: the v0.3.0 corpus did not say loudly enough that on Copilot, the per-element binding site for model/tools is .agent.md custom-agent frontmatter -- SKILL.md does NOT accept those fields and silently ignores them.
Fixes across three files:
skills/genesis/assets/runtime-affordances/per-harness/copilot.md
- Cite canonical custom-agents-configuration docs URL
- PERSONA SCOPING FILE section: enumerate full .agent.md frontmatter (model, tools, target, disable-model-invocation, user-invocable, mcp-servers, metadata) with field-by-field explanation pointing each cost lever at its binding site
- MODULE ENTRYPOINT (SKILL) section: explicit IMPORTANT block that SKILL.md does NOT support model: or tools:; spell out the architectural consequence (restructure as .agent.md if per-element binding is needed)
- Section 9 Cost-pattern bindings: rewrite B12, B15, B16 to name .agent.md as the BINDING SITE with an explicit SKILL-LEVEL ROUTING ATTEMPT anti-pattern
skills/genesis/assets/runtime-affordances/model-catalog.md
- 'What this file does NOT do': add bullet noting binding site is harness-specific and the architect must consult the per-harness adapter, citing PR #12 as the failure mode if missed
- 'How per-harness adapters extend this file': require adapters to NAME THE PER-ELEMENT BINDING SITE explicitly
skills/genesis/assets/design-patterns.md
- B12 MODEL ROUTER: new WRONG-PRIMITIVE BINDING anti-pattern with PR #12 as the worked example
- B15 TOOL SUBSET: same anti-pattern (mirrors B12 failure mode)
Companion PR description updated to flag the architect-B miss explicitly and to propose the B12-firing rerun as the next iteration.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Third controlled run, fixed-corpus rerun:
- Architect C (v0.3.1, fixed binding-site corpus) produced a 339-line handoff with explicit per-element MODEL BINDING TABLE: 6 of 7 sites bound to non-default SKUs (5 lenses + arbiter, planner=Opus, trivial=Haiku, others=Sonnet).
- Executor C honored the binding by passing model: to each task sub-agent dispatch. Telemetry confirms 3 distinct models in real billing: 15 turns on claude-haiku-4.5, 37 on claude-sonnet-4.6, 29 on claude-opus-4.7.
- Output quality parity with Exec A and B: catches the LSP env-RCE CRITICAL plus 1 HIGH + 9 MEDIUM + 9 LOW.
Key empirical finding: B12 MODEL ROUTER fires as designed at the sub-agent level (~10x cheaper per-turn on Haiku-bound trivial lenses) but the orchestrator's session-default model dominates total cost when it is more expensive than the routed sub-agents. In this run the executor session ran on Opus 4.7 by default; the 29 orchestrator turns alone cost $6.14, swamping the $2 saved by Haiku routing. Counterfactual with orchestrator bound to Sonnet (matching handoff intent): $2.73, which would be 10% cheaper than Exec B's $3.02. This generates a new corpus lesson for v0.3.2: B12 must include the orchestrator thread as a binding site.
PR description (and REPORT.md mirror) updated with full 3-way FinOps analysis and the proposed v0.3.2 corpus addition.
Artifacts:
- architect-C-v0.3.1-handoff.md
- executor-C-v0.3.1-review.md
- executor-C-tokens.json
- executor-C-process.log.gz (3MB, 19MB uncompressed)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…lled orchestrator
The earlier 3-cell A/B/C runs all had Opus session-default orchestrators that masked the corpus-level signal. This commit adds a clean 2-cell A/B with the only variable being the genesis corpus version:
- Both architects: claude-opus-4.7 (design tier)
- Both executor orchestrators: claude-sonnet-4.6 (pinned)
- Same target PR: microsoft/apm#1424
- Same lens count: 5 (correctness, security, performance, style, test-coverage)
Result:
- Cell D (v0.1 baseline, no cost-aware corpus): $11.77 total, 52 findings, 6 BLOCKER
- Cell E (v0.3.1 treatment, B12 fires at 7 sites): $14.68 total (+25%), 61 findings, 4 CRITICAL + 10 more above threshold
Cell E catches 2 additional CRITICAL security findings Cell D missed (SEC-001 TOCTOU symlink race, SEC-002 validated-object discarded).
Honest reframing: in Copilot CLI where task(explore) defaults to Haiku, B12 promotes lenses UP from Haiku to Sonnet, which is a quality-routing knob, NOT a cost-reduction knob. The v0.3.1 corpus over-applies B12 by encouraging explicit binding at every .agent.md primitive. v0.3.2 must add a B12 SELECTION RULE: bind explicitly only when stakes, portability, or operator economic preference justifies it; otherwise trust the harness default.
Artifacts added:
- architect-D-v0.1-handoff.md, architect-E-v0.3.1-handoff.md
- executor-D-v0.1-review.md, executor-E-v0.3.1-review.md
- executor-D-findings.json, executor-E-findings.json
- architect-D-process.log.gz, executor-D-process.log.gz
- architect-E-process.log.gz, executor-E-process.log.gz
- tools/profile-per-model.py (new per-model attribution profiler)
- REPORT.md updated to v4 with the clean 2-cell story
PR #12 body updated to match.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

After Cell F (v0.3.2 SELECTION RULE) only got to +7% vs v0.1 baseline, RCA #2 named the residual cost driver: synth-heavy dispatched to Opus by default for cross-lens adjudication = HEAVY ADJUDICATOR anti-pattern.
v0.3.2.1 corpus edit (architectural-patterns.md §A12):
- HEAVY ADJUDICATOR anti-pattern: synthesis that reconciles already-produced lens findings is reviewer-class, not planner-class.
- "WHERE THE HEAVY ROLE BELONGS" cure paragraph: bind planner only on rare, narrow triggers (>=2 BLOCKERs + contradictory + same diff hunk; expected firing rate ~2-4%).
- Cell F named as the empirical canonical case.
Cell G result on microsoft/apm#1424:
- Architect G (Opus, 20 turns): $7.34
- Executor G (Sonnet orch, 179 turns): $2.85 -> 115 Haiku ($0.91) + 64 Sonnet ($1.93) + 0 Opus ($0)
- TOTAL: $10.19 (-13.4% vs Cell D baseline $11.77)
- Opus arbiter correctly stayed dark (narrow trigger not met)
- Bug-finding parity with Cell D (same class of issues)
- 2 false-positive BLOCKERs caught and downgraded via inline gh-api verification (Sonnet orchestrator), avoiding wasteful Opus dispatch
Final cell table:
| Cell | Corpus | Total | Delta vs v0.1 |
| D | v0.1 | $11.77 | baseline |
| E | v0.3.1 | $14.68 | +24.7% (FAIL) |
| F | v0.3.2 | $12.63 | +7.3% (PARTIAL) |
| G | v0.3.2.1 | $10.19 | -13.4% (WIN) |
This commit ships:
- The v0.3.2.1 corpus edit (HEAVY ADJUDICATOR anti-pattern in A12)
- Cell F + G architect handoffs, executor reviews, gzipped process logs, per-lens findings JSONs
- REPORT.md rewritten to v5 with full D/E/F/G arc, iteration narrative, mermaid diagrams for Cell G (winner) and Cell E (over-bound failure), and named load-bearing corpus edits
Corpus claim now empirically grounded: cost-aware genesis corpus produces designs neatly cheaper than the unconscious v0.1 baseline, with parity on bug-finding quality.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
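The narrow arbiter trigger named in this commit (>=2 BLOCKERs + contradictory + same diff hunk) could be expressed as a small predicate. The sketch below is one possible reading, not the corpus's actual rule: the `Finding` fields, the `verdict` encoding of what counts as "contradictory", and the function name are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Finding:
    lens: str
    severity: str  # e.g. "BLOCKER", "HIGH", "MEDIUM", "LOW"
    hunk: str      # hypothetical identifier of the diff hunk the finding targets
    verdict: str   # hypothetical encoding of the lens's position on that hunk


def should_dispatch_arbiter(findings: list[Finding]) -> bool:
    """True only if two BLOCKER findings disagree about the same diff hunk."""
    blockers = [f for f in findings if f.severity == "BLOCKER"]
    return any(
        a.hunk == b.hunk and a.verdict != b.verdict
        for a, b in combinations(blockers, 2)
    )
```

With fewer than two BLOCKERs `combinations` yields no pairs, so the predicate is false and the expensive planner-class model stays dark, which matches the Cell G behaviour described above.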

…iscipline
Drift correction surfaced in PR review iteration. Cell G (v0.3.2.1) landed on a -45% executor-cost shape by OMITTING model: and relying on the Copilot CLI harness default for task(agent_type='explore'). That works on Copilot CLI today but is NOT engineering; it is hope-based: not portable across harnesses, not predictable across harness versions, no audit trail. The actual cost anti-pattern is BIND-UP-WITHOUT-JUSTIFICATION (forcing IMPLEMENTER/PLANNER on role classes that TRIVIAL/REVIEWER work would meet; Cell E v0.3.1, +25% cost vs baseline). Explicit binding that matches the harness default is PREDICTABILITY DISCIPLINE, not ceremony.
Changes:
- skills/genesis/assets/design-patterns.md §B12: rule 3 of SELECTION RULE rewritten: DEFAULT == REQUIRED case is now BIND EXPLICITLY (predictability / portability / audit-trail); OMIT only as exception. CONSEQUENCE paragraph rewritten: well-designed B12 has MOST elements bound. CEREMONIAL BINDING anti-pattern replaced by BIND-UP-WITHOUT-JUSTIFICATION (still cites Cell E as in-corpus case); CEREMONIAL BINDING narrowed to copy-pasted bulk bindings without per-element role-class distinction.
- skills/genesis/assets/runtime-affordances/per-harness/copilot.md §9 B12 site: reframed 'CEREMONY' bullet to 'PREDICTABILITY DISCIPLINE', listing portability + predictability + audit-trail as reasons to bind explicitly even when it matches the default.
- skills/genesis/examples/06-cost-aware-panel.md: PROVENANCE WARNING rewritten: explains the Cell E vs F/G distinction and steers architects toward explicit model: declarations even on Copilot CLI.
Predictability probe (3 explore dispatches, $0.19 Haiku): task(agent_type='explore') fired claude-haiku-4.5 reliably across trivial/medium/complex prompts on Copilot CLI. Harness default IS stable for complexity TODAY. BUT portability across harnesses (claude-code, opencode, codex, cursor) is NOT verified; explicit binding is the only portable discipline.
NOT in this commit:
- examples/04, references/cost-economics-process.md, token-economics.md (no stale CEREMONIAL BINDING references found).
- Genesis-audits-genesis pass (running async).
- Re-run with v0.3.3 corpus.
- REPORT.md v6 with executor-only framing.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ty probe, surgical bloat removals (-248 lines)
Three concurrent corrections shipped together:
1. REPORT v6 -- EXECUTOR-ONLY cost framing.
- Headline corrected from -13% (architect+executor bundled) to -45% (executor-only), the number the operator actually pays per run. Architect amortizes once across many runs of the designed workflow.
- v5 was wrong by ~3x in the conservative direction; this correction tightens the empirical claim.
- Predictability probe ($0.19, 3 explore dispatches, all fired Haiku) documented and validates Cell G's OMIT tactic as currently-not-broken on Copilot CLI -- not as recommended.
- v0.3.3 reframe rationale explained: 'bind explicitly for PREDICTABILITY + PORTABILITY + AUDIT TRAIL' is the actual discipline; OMIT was a Copilot-CLI-only tactic that worked by accident. No re-run required because routing is identical (TRIVIAL -> claude-haiku-4.5 on Copilot CLI today, declared explicitly vs inherited).
- Per-technique attribution and multi-scenario variance explicitly DEFERRED to follow-up PRs.
2. Genesis-audits-genesis pass -- surgical bloat removals.
- Auditor (Opus 4.7, single architect cell) applied the genesis skill to audit the genesis corpus.
- Recommendations: -720 to -930 lines projected; -248 lines applied this PR (other consolidations deferred for safety).
- Removals: stance prose compression, step 3.2 sub-block collapse, copilot.md cost-pattern bindings -> table form, scaffolding removal across 3 files, B12 CONSEQUENCE block (pure restatement), war-story citation overcount.
- Full audit at dev/empirical-proof/audit-v0.3.3/removal-list.md for follow-up consideration.
3. Continued v0.3.3 reframe propagation.
- examples/06-cost-aware-panel.md PROVENANCE WARNING and dollar arithmetic removed (worked example reduced to qualitative pattern citations; concrete numbers belong in cost-economics-process.md step 6 template, not duplicated inline per example).
Net corpus delta (this commit): -468 +311 = -157 lines (corpus itself, not counting REPORT/probe/audit artifacts). Cumulative PR corpus delta: now ~+1700 net (was +1946 before audit).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

… appendix Per operator feedback: the PR story is baseline vs head; intermediate corpus iterations (v0.3.1, v0.3.2) belong in an appendix, not the headline. Learnings preserved (RCA #1 lens fan-out leak, RCA #2 synth-heavy adjudicator leak, Cell E architecture diagram) — they are how the BIND-UP-WITHOUT-JUSTIFICATION and HEAVY ADJUDICATOR anti-patterns were discovered empirically rather than from first principles. Confounded 3-cell A/B/C history moved to Appendix B. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Corpus additions:
- assets/architectural-patterns.md §A1 PANEL: add UNDIFFERENTIATED LENS BINDING anti-pattern. Forces per-lens CAPABILITY PROFILE enumeration before binding (cross-file reasoning? STAKES-weighted output? multi-step proof?).
- assets/design-patterns.md §B12 BULK IDENTICAL BINDING variant: strengthened to fire in BOTH directions (bulk-UP and bulk-DOWN). Cost direction is not the anti-pattern; lack of per-element reasoning is. Per-element CAPABILITY PROFILE template added; cross-references A1 PANEL.
REPORT additions:
- New Per-technique attribution section: isolates B12 (-$2.16 preventative), A12 (-$3.95 active, dominant win), B13 ($14 defensive) from existing 4-cell D/E/F/G data via pairwise cell deltas. Honest framing: D->G -45% is mostly A12, not B12 (D accidentally inherited Haiku via harness default).
- New v0.3.4 PER-LENS DIFFERENTIATION subsection: explains the corpus change and why uniform Haiku binding on the 5 advisory lenses remains correct for this skill (after enumeration) but would differ on a verdict-emitting skill.
- Updated 'What this PR proves' (7 numbered points; per-technique + PER-LENS now included) and 'What this PR does NOT prove' (per-technique removed; B14/B15/B16 ablations + multi-scenario + cross-harness retained as explicit deferrals).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ED bindings
Dispatched Opus 4.7 architect on a deliberately different skill type (release-notes-generator: A2 PIPELINE, 50-commit input, mixed sub-tasks). Same persona, same v0.3.4 corpus, different problem shape.
Result: role-class distribution DIFFERENTIATED, not uniform.
- 2 TRIVIAL (classifier E1, batched bug-fix one-liners E4)
- 2 IMPLEMENTER (orchestrator E0, feature prose E2)
- 2 REVIEWER (breaking-change prose E3, consistency pass E5)
- 0 PLANNER
Architect explicitly flagged BIND-UP-WITHOUT-JUSTIFICATION risk on E4: 'wrongly slap-binding [30 bug-fix one-liners] to sonnet under release-notes is user-facing would be BIND-UP-WITHOUT-JUSTIFICATION and would inflate the per-run cost by roughly 30 premium requests.' The v0.3.4 corpus discipline working in the wild.
Predicted L-scenario cost: ~$0.18-0.25 per run. A12 GRADIENT savings vs flat-sonnet hypothetical: ~29 premium requests per run.
This validates that the PR-review panel's uniform Haiku binding is NOT rubber-stamping; it is the discipline working correctly on uniform-profile inputs. When profiles are heterogeneous (this run), the discipline produces heterogeneous bindings.
Handoff packet persisted at dev/empirical-proof/scenario-release-notes/.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…S discipline

Dispatched v0.3+ panel against microsoft/apm#1541 (+41/-2, 2 files, small CLI fix).

Results:
- Executor cost: ~$0.21 (vs $2.85 on PR #1424)
- Per-kLoC: $5.12 (vs $1.15 baseline); fixed Sonnet-executor overhead dominates at small scale (~93% of total cost)
- Panel cost shape holds in dollar terms; per-kLoC ratio inverted
- Arbiter trigger correctly did NOT fire (0 BLOCKERs)

KEY EMPIRICAL VALIDATION of v0.3.4 PER-LENS DIFFERENTIATION: The executor reflected per-lens against the CAPABILITY PROFILE template and concluded 4/5 lenses genuinely TRIVIAL, but security lens was INADEQUATE on TRIVIAL/Haiku: it surfaced a real MEDIUM bypass concern but could not validate it without out-of-diff function body access. This empirically generates the per-element justification the v0.3.4 corpus requires architects to record at design time.

Recommended carve-out: 'Security lens uses Haiku when all referenced functions are in-diff; escalates to REVIEWER with tool access when it must reason about out-of-diff internals.'

Implication for PR #1424: security lens was likely mis-bound to TRIVIAL. The blocker false-positive (_substitute_plugin_root alleged undefined, refuted only via out-of-diff gh api lookup) is consistent with TRIVIAL-class inadequacy on cross-file reasoning. A v0.3.4 re-architect would correctly bind security to REVIEWER, expected cost delta +$0.50-1.00 per run with measurable security finding fidelity improvement.

REPORT updated with multi-scenario section (small-PR + different-skill). Deferral list narrowed: full S1-S5 × {v0.2,v0.3+} matrix remains follow-up.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
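The recommended carve-out above reads naturally as a small routing rule. The following is a minimal sketch under stated assumptions, not genesis code: `LensBinding`, `bind_security_lens`, and the `read` tool name are hypothetical, and the caller is assumed to already know which functions the diff references and which it defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LensBinding:
    role_class: str          # "TRIVIAL" (Haiku in the commit message) or "REVIEWER"
    tools: tuple[str, ...]   # tools the lens may call; names are illustrative


def bind_security_lens(referenced: set[str], defined_in_diff: set[str]) -> LensBinding:
    """Keep the lens on TRIVIAL only when every referenced function is in-diff."""
    out_of_diff = referenced - defined_in_diff
    if not out_of_diff:
        return LensBinding("TRIVIAL", ())
    # Out-of-diff internals must be looked up, so escalate and grant read access.
    return LensBinding("REVIEWER", ("read",))
```

For example, a diff that references `_substitute_plugin_root` without defining it would escalate, which matches the false-positive described for PR #1424.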

…Foundry docs

- model-catalog.md: add 'researcher' as a sixth role class for open-ended exploration (distinct from planner which optimises a stated goal, and from long-context-retriever which extracts from a known corpus). Capability profile: open-ended exploration with discovery-phase success criteria, multi-source synthesis, multi-hypothesis reasoning. Cost profile: highest tier + reasoning multiplier. Requires narrow-trigger discipline (cited STAKES) to bind, mirroring A12 arbiter discipline. Pattern matching != research; if a rubric exists -> reviewer; if a plan exists -> planner.
- model-catalog.md: add 'OpenAI / GPT-5 family specifics' section grounded against Microsoft Learn Azure Foundry reasoning docs. Documents reasoning_effort values (none|minimal|low|medium|high|xhigh), defaults per SKU (gpt-5.1 defaults to none; gpt-5-pro defaults to and only supports high; xhigh only on models after gpt-5.1-codex-max), and the per-role-class binding table. Architect MUST declare reasoning_effort alongside SKU for OpenAI bindings (same SKU spans 2-3 role classes by effort).
- model-catalog.md: refresh all EXAMPLES sections with current GPT-5 lineup (gpt-5.1, gpt-5-mini, gpt-5-pro, gpt-5-codex, gpt-5.1-codex-max) per Azure Foundry docs.
- design-patterns.md §B12: add researcher + long-context-retriever to the role-class enumeration in the SELECTION RULE.
- design-patterns.md §B16 EFFORT GOVERNOR: extend role-class to effort-level mapping table with researcher (high to xhigh) and long-context-retriever (low). Explicit anti-pattern: suppressing effort on researcher defeats the binding.

Grounding source: https://learn.microsoft.com/azure/foundry/openai/how-to/reasoning

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
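The rule that an architect must declare `reasoning_effort` alongside the SKU can be checked mechanically. This is a hedged sketch, not the skill's implementation: `OpenAIBinding` and `ROLE_EFFORT_RANGE` are hypothetical names, and only the constraints stated in the commit message are encoded (the six effort values, the gpt-5-pro restriction, and the researcher and long-context-retriever ranges). Other role classes and the xhigh availability rule are deliberately left out rather than guessed.

```python
from dataclasses import dataclass

# Effort values in ascending order, as listed in the commit message.
EFFORT_LEVELS = ("none", "minimal", "low", "medium", "high", "xhigh")

# Only the two role-class ranges the commit message names.
ROLE_EFFORT_RANGE = {
    "researcher": ("high", "xhigh"),
    "long-context-retriever": ("low", "low"),
}


@dataclass(frozen=True)
class OpenAIBinding:
    sku: str
    role_class: str
    reasoning_effort: str  # mandatory: there is no default on purpose

    def __post_init__(self) -> None:
        if self.reasoning_effort not in EFFORT_LEVELS:
            raise ValueError(f"unknown reasoning_effort: {self.reasoning_effort}")
        if self.sku == "gpt-5-pro" and self.reasoning_effort != "high":
            raise ValueError("gpt-5-pro only supports reasoning_effort=high")
        bounds = ROLE_EFFORT_RANGE.get(self.role_class)
        if bounds is not None:
            lo, hi = (EFFORT_LEVELS.index(b) for b in bounds)
            if not lo <= EFFORT_LEVELS.index(self.reasoning_effort) <= hi:
                raise ValueError(
                    f"{self.role_class} expects effort in {bounds}, "
                    f"got {self.reasoning_effort}"
                )
```

Binding a researcher at `low` fails here, which is the "suppressing effort on researcher defeats the binding" anti-pattern expressed as a validation error.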

…ve-audit) × v0.2 vs v0.3.5

Six background Opus 4.7 dispatches, ~$30 architect spend. No executor runs in this probe.

Scenarios chosen to stress different patterns:
- S1 apm-triage-panel: PANEL with arbiter -> B12 PER-LENS, A12 discipline
- S2 bulk-api-rename: STAFFED PLAN + batched edits -> S7 + B15 (CodeAct)
- S3 dependency-cve-audit: PIPELINE + per-CVE fan-out -> RESEARCHER class (v0.3.5), LONG-CONTEXT-RETRIEVER, multi-class binding

Three findings:
1. v0.2 architects identify the gap, can't fill it. All three v0.2 cells enumerated cost-aware patterns they wanted but couldn't cite. The taxonomy gap is real and felt by the architect persona.
2. v0.3.5 discipline produces heterogeneous outputs on heterogeneous inputs. S2 binds 1 element (IMPLEMENTER, BIND DOWN). S3 exercises 5 of 6 role classes. S1 differentiates 3 TRIVIAL + 3 REVIEWER after enumerating 6 capability profiles. 'All-Haiku' from PR #1424 review was correct for that specific input, not the discipline's default.
3. RESEARCHER class fires once across three scenarios, exactly as designed. Two of three v0.3.5 cells explicitly REJECT it with cited rule ('rubric exists -> REVIEWER'). Only S3 binds it with full STAKES citation. Narrow-trigger discipline working.

S2 carries a concrete per-technique number: tools: [read, execute] structurally excludes the edit tool so the naive 50+-edit-turn anti-pattern is impossible by construction. apply-rename.sh script batches in one turn. Projected $0.05-0.10 per L run vs $0.50-1.20 naive (~10x saving from S7 + B15).

Artifacts: dev/empirical-proof/cross-scenario/{S1-triage,S2-rename,S3-cve}-{v02,v035}/handoff.md (6 files, ~3700 lines total). Each contains 3 mermaid diagrams, interface sketches, module composition table, per-element model bindings, patterns cited, cost projections.

REPORT updated with new 'Cross-scenario architect A/B' section + the 'What this PR does NOT prove' deferrals narrowed (B15 attribution now in scope, executor matrix still deferred).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
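The claim that `tools: [read, execute]` makes the per-file-edit anti-pattern "impossible by construction" can be illustrated with an allowlist dispatcher. This is a sketch under assumptions: `make_dispatcher`, `ToolNotAllowed`, and the stand-in tool functions are hypothetical and do not reflect how the Copilot harness actually enforces tool subsets.

```python
from typing import Callable


class ToolNotAllowed(Exception):
    """Raised when a sub-agent asks for a tool outside its declared subset."""


def make_dispatcher(
    allowed: set[str], registry: dict[str, Callable[..., str]]
) -> Callable[..., str]:
    """Return a dispatch function that can only reach tools in `allowed`."""

    def dispatch(tool: str, *args: str) -> str:
        if tool not in allowed:
            # The edit tool is never reachable, so a 50-turn edit loop cannot start.
            raise ToolNotAllowed(tool)
        return registry[tool](*args)

    return dispatch


# Stand-in tools for illustration only.
registry = {
    "read": lambda path: f"contents of {path}",
    "execute": lambda cmd: f"ran {cmd}",
    "edit": lambda path: f"edited {path}",
}
dispatch = make_dispatcher({"read", "execute"}, registry)
```

With this shape, the only way to change files is a single `execute` call on a batching script such as the `apply-rename.sh` named above.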

7 cells dispatched on real fixtures (S1 triage panel; S2 bulk rename). Each cell measured / modeled cost from sub-agent transcripts against Anthropic published $/Mtok rates.

Per-pattern measured savings:
- B15 + S7 TOOL SUBSET + CodeAct: 75x on S2 ($3.97 -> $0.053)
- B12 PER-LENS ROUTING: 2.27x on S1 ($0.540 -> $0.238)
- B14 CAVEMAN BRIEF (proposed): 1.81x on severity lens, 75% verdict agreement, only upward escalations
- B16 EFFORT GOVERNOR: 1.72x + quality control (prevents TRIVIAL-lens severity inflation)
- B13 CACHE-AWARE PREFIX: modeled ~2.5x on cached portion

REPORT rewritten in Minto-pyramid form (headline -> three takeaways -> per-pattern table -> why -> per-pattern attribution -> proposed sub-pattern -> caveats). The bloated chronological REPORT is preserved as APPENDIX-iterative-history.md.

PR body mirrors the new REPORT structure: someone who has not worked on the experiment can read it top-to-bottom and get the measurement, the why, the per-pattern numbers, and the caveats.

CAVEMAN sub-experiment surfaces a candidate B14b CAVEMAN BRIEF sub-pattern for TRIVIAL-class classifiers: 45% input-token saving, zero downward severity errors, two defensible upward escalations.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…idated)

Promoted from the cost-attribution experiment's caveman cell: 44.6% input-token saving on TRIVIAL severity classification, 75% agreement with verbose brief, zero downward errors. Gated to TRIVIAL class only (REVIEWER+ keep prose).

Includes:
- WHEN gate (TRIVIAL + fixed schema output)
- mechanism + anchoring rule (extreme-bucket grounding)
- measured effect citation (cost-report.json reference)
- CAVEMAN ON REVIEWER anti-pattern

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add 'FinOps view: scenarios at a glance' table near top with per-scenario v0.2 vs v0.3.5 cost and links to underlying evidence
- Add ablations sub-table linking each B-pat cell artifact
- Add side-by-side mermaid diagrams for S1 (triage) and S2 (rename) showing v0.2 vs v0.3.5 architectures with pattern annotations on the exact nodes/edges where each pattern applies
- Hyperlink every pattern citation to its definition in design-patterns.md / architectural-patterns.md
- Hyperlink every measurement claim to its cost-report.json
- Self-contained: reader can verify any number without leaving this document

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Changes:
- Fix mermaid render error: quote all node labels containing brackets, pipes, parens (previous version used unquoted [read, execute] inside cylinder shape, parser failed)
- Add 'For a reader new to this work' prose section explaining what genesis is, the three cost drivers, and what models cost
- Refine all 4 architecture diagrams:
  * Annotate architect model + reasoning effort + role
  * Annotate every sub-agent with model AND why (e.g. why synthesizer is Sonnet not Haiku, why executor is Sonnet not Haiku)
  * Show exact sub-agent spawn count and tool call count per cell
  * Show TRIVIAL/REVIEWER routing as labelled edges with dispatch counts (24 TRIVIAL + 24 REVIEWER) instead of abstract symbols
  * Show context-growth as explicit annotation on v0.2 S2 diagram
- Add per-scenario explanatory prose: what the workflow does, why the architecture matters, why each model choice was made
- Hyperlink every pattern and every claim to its source

The report now reads top-to-bottom for someone with no prior context on genesis, the patterns, or the experiment.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds dated empirical anchoring for Haiku / Sonnet / Opus role-class choices on coding and autonomous-loop boundaries.

- model-catalog.md: two 3-line load triggers pointing to the new reference (file load + Routing-axes B12 citation).
- references/benchmark-grounding.md: cross-benchmark table (SWE-bench Verified, Terminal-Bench 2.1, Vals Index), SWE-bench task-length bucket view, and SONNET-AVERSION / HAIKU-PROMOTION cited as WRONG-PRIMITIVE BINDING instances at B12.

Architected by the genesis skill on Opus 4.7 (verified 2026-05-29).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…nding refs

- S2 replaced with doc-audit (4 cells: zero-Opus, zero-Sonnet, v0.2, v0.3.5)
- Surface the +27% fan-out tax v0.3.5 pays on doc-audit honestly
- Add Why-we-observe subsection explaining the TRIVIAL-surface threshold
- Per-pattern table now shows B12 negative-net on doc-audit
- NOT-proven section calls out missing fan-out threshold derivation
- Appendix links new S2N cells + benchmark-grounding reference
- Old bulk-rename relabeled Scenario 3 (extreme floor case)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ion)

Adds top-level REAL-TELEMETRY-RESULTS.md as ground-truth absolute-cost report. All numbers harvested from cloud session_store events table (real usage_input_tokens / output_tokens / cache_read_tokens / cache_write_tokens), priced at Anthropic public rates.

Matrix: 3 scenarios x 4 cells (zero-opus, zero-sonnet, v0.2 architected, v0.3.6 architected). Architect sessions (Opus 4.7) NOT counted per profiling protocol -- architecting is amortised across runs.

Headline findings (all measured, see file for cell-level cost-reports):
- S3 bulk rename: v0.2 per-file-edit anti-pattern ($33.79) is ~3.2x the v0.3.6 S7 TOOL BRIDGE design ($10.40). v0.3.6 pattern WORKS.
- S2 doc-audit: v0.3.6 architect chose MONOLITHIC (rejecting fan-out via B12 SELECTION RULE); ties v0.2 ($9.80 vs $9.42).
- S1 PR review: v0.3.6 5-lens fan-out ($24.59) costs more than zero-sonnet ($8.89) -- architecture buys reviewability and class-routed quality, not absolute cost win.
- zero-opus is the real anti-pattern: 3-8x zero-sonnet cost with marginal quality justification on these workloads.

Files added:
- dev/empirical-proof/REAL-TELEMETRY-RESULTS.md (new)
- dev/empirical-proof/PROFILING-PROTOCOL.md (new)
- dev/empirical-proof/scenario-runs/results/{S1,S2N,S3}-*-real/cost-report.json (12 new)
- dev/empirical-proof/scenario-runs/results/REAL-COSTS-summary.csv (new)

REPORT.md banner now points readers to the new file first, retains the size-modeled study as a pattern-isolation ablation (ratios valid, absolute $ understated 50-200x due to ignored tool-surface overhead).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
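Pricing a session from the four token counters named above is a weighted sum. The sketch below is illustrative only: `Rates`, `event_cost`, and `session_cost` are hypothetical names, the events are assumed to be plain dictionaries with disjoint (non-overlapping) token counters, and `PLACEHOLDER_RATES` holds made-up example values that must be replaced with the published $/Mtok rates for the model actually billed.

```python
from dataclasses import dataclass
from typing import Iterable, Mapping


@dataclass(frozen=True)
class Rates:
    """USD per million tokens for one model."""
    input: float
    output: float
    cache_read: float
    cache_write: float


# Placeholder values for illustration only; not taken from the PR.
PLACEHOLDER_RATES = Rates(input=3.0, output=15.0, cache_read=0.3, cache_write=3.75)


def event_cost(event: Mapping[str, int], rates: Rates) -> float:
    """Cost in USD of one usage event, assuming the four counters do not overlap."""
    micro_usd = (
        event.get("usage_input_tokens", 0) * rates.input
        + event.get("output_tokens", 0) * rates.output
        + event.get("cache_read_tokens", 0) * rates.cache_read
        + event.get("cache_write_tokens", 0) * rates.cache_write
    )
    return micro_usd / 1_000_000


def session_cost(events: Iterable[Mapping[str, int]], rates: Rates) -> float:
    """Sum event costs for one executor chat session."""
    return sum(event_cost(e, rates) for e in events)
```

Because architect sessions are excluded by the profiling protocol, a caller would pass only executor-session events to `session_cost`.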

…RESULTS

Grade all 12 cell outputs against scenario-specific rubrics and compute $/quality-unit. Adds the buyer-side ROI framing the cost-only matrix was missing.

Headline takeaways:
- S3: zero-sonnet has best raw $/quality; v0.3.6 is 2x cost but eliminates the failure-mode tail (zero-opus failed task at $41, v0.2 burned 40 calls on what should be 1 shell call).
- S2: zero-sonnet best raw $/quality; v0.3.6 ships verifier-confirmed precision (6/6 HIGH confirmed, 0 downgrades).
- S1: v0.3.6 caught 2 supply-chain BLOCKER security findings the other 3 cells missed. On severity-weighted ROI it ties zero-sonnet; on raw $/quality it is 1.85x more expensive.

Verdict reframe: v0.3.6 is insurance, not optimisation. Pays 1.5-2.5x on workloads where it does not produce unique findings; on workloads where it does (S1 supply-chain BLOCKERs, S3 anti-pattern rejection), avoided cost is 10-100x the architecture premium.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add formal ROI definition (ROI_raw, ROI_weighted, ROI_tail) with severity weights (BLOCKER=5, HIGH=3, MED=1, LOW=0.3) and a per-cell 12-row scorecard at the top of both REPORT.md and REAL-TELEMETRY-RESULTS.md. Restructure prose grading to live below the scorecard.

Buyer-facing math:
- ROI_raw = QualityScore / Cost
- ROI_weighted = (Sum severity_weight x findings) / Cost
- ROI_tail = QualityScore / (Cost + P(failure) x C_failure)

Per-cell scorecard surfaces:
- S3-zero-opus ROI_raw 0.05 (failed task at $41)
- S3-zero-sonnet best ROI_raw at 2.08
- S1-v0.3.6 weighted=56pts (2x next best); only cell catching 2 supply-chain BLOCKERs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
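The three ROI definitions above translate directly into code. This is a minimal sketch with hypothetical function names; the severity weights come from the commit message, while the finding counts, failure probability, and failure cost in the usage note are invented purely to exercise the formulas.

```python
# Severity weights as stated in the commit message.
SEVERITY_WEIGHT = {"BLOCKER": 5, "HIGH": 3, "MED": 1, "LOW": 0.3}


def roi_raw(quality_score: float, cost: float) -> float:
    """ROI_raw = QualityScore / Cost."""
    return quality_score / cost


def roi_weighted(findings: dict[str, int], cost: float) -> float:
    """ROI_weighted = (sum of severity_weight x finding count) / Cost."""
    points = sum(SEVERITY_WEIGHT[sev] * count for sev, count in findings.items())
    return points / cost


def roi_tail(quality_score: float, cost: float, p_failure: float, c_failure: float) -> float:
    """ROI_tail = QualityScore / (Cost + P(failure) x C_failure)."""
    return quality_score / (cost + p_failure * c_failure)
```

As a sanity check, `roi_raw(10, 4.81)` rounds to 2.08, matching the S3-zero-sonnet scorecard entry if that cell's quality score is 10; the score is back-computed from the reported ratio and the $4.81 cost, not read from the scorecard. With made-up inputs, `roi_weighted({"BLOCKER": 2, "HIGH": 1}, 13.0)` gives 1.0, and `roi_tail(10, 5.0, 0.5, 10.0)` gives 1.0, showing how an expected failure cost halves the raw figure of 2.0.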

Adds concrete per-cell evidence for the dispatch-overhead claim:
- First-turn input by harness: 54K (Sonnet) / 75K (Opus)
- Per-spawn entry tax by agent type: 6K Haiku, 35K Sonnet sub, 54K orch
- Rolling re-send pattern (S3-v0.2): 54K→130K input growth, 290:1 I/O ratio
- I/O ratio table across all 10 cells with non-trivial telemetry

Mining done from the events table assistant.usage stream of each cell chat session. Numbers are real, real-billed, and reproducible by re-querying with the chat_session_id committed in each cost-report.json.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
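The rolling re-send pattern explains why input tokens dwarf output tokens: every turn re-sends the whole accumulated context. The sketch below models that with a deliberately simple assumption, linear growth between the first and last turn, and a hypothetical fixed output size per turn. `rolling_resend` is an illustrative name, and the function is not expected to reproduce the measured 290:1 ratio.

```python
def rolling_resend(
    first_turn_input: int,
    last_turn_input: int,
    turns: int,
    output_per_turn: int,
) -> tuple[int, int, float]:
    """Return (total input tokens, total output tokens, I/O ratio).

    Assumes per-turn input grows linearly from first to last turn, because the
    full conversation is re-sent on every call. Requires turns >= 2.
    """
    if turns < 2:
        raise ValueError("need at least two turns to model growth")
    step = (last_turn_input - first_turn_input) / (turns - 1)
    per_turn_inputs = [round(first_turn_input + i * step) for i in range(turns)]
    total_in = sum(per_turn_inputs)
    total_out = output_per_turn * turns
    return total_in, total_out, total_in / total_out
```

Under this model the billed input is roughly the average context size times the number of turns, so collapsing many edit turns into one shell call removes most of the input bill even when the output is unchanged.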

Three new sections answering buyer questions:
- Per-pattern attribution: which patterns moved which dollars (S7+B12+B15+A12 load-bearing; B13/B14b/B16 catalogued but not pulling weight yet)
- v0.3.6 flaws and quick wins (v0.4 frontier: entry-tax visibility, dispatch coalescing A13, B14b CAVEMAN, B15 tier-aware Sonnet, loop-detector for B12)
- Reliability and predictability, the dimension raw ROI misses (variance bounds, auditability, repeatability): argues v0.3.6 is the only thing on the table for productionised/automated use

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Refactors the genesis skill so the B14b CAVEMAN BRIEF pattern actually operates inside the agentic workflows genesis designs. Telemetry showed v0.3.6 had 0/9 caveman briefs in S1 task() dispatches; caveman was a catalogue stub, not a runtime contributor.

Core idea: AUDIENCE BOUNDARY (composition-substrate §7): every artifact named INTERNAL or EXTERNAL; default INTERNAL=caveman, EXTERNAL=normal prose; boundary is artifact-audience, not agent-tier. Caveman was designed for human↔assistant prose; genesis spawns subagents the user never sees. INTERNAL traffic is safe and ideal for caveman; EXTERNAL traffic stays prose.

Changes:
- composition-substrate.md §7 AUDIENCE BOUNDARY (substrate-level rule)
- design-patterns.md B14b expanded to canonical fidelity (drop list, preservation contract for code/paths/URLs/numbers/error strings, LITE/FULL/ULTRA intensities, auto-clarity exceptions, role-mode persistence, output-mode contract)
- design-patterns.md B14c CAVEMAN CHANNEL (orchestrates the boundary across multi-spawn workflows)
- SKILL.md: per-spawn declaration discipline + audience preamble
- references/audience-boundary.md NEW (load-on-demand audience matrix)
- assets/caveman-templates.md NEW (five brief templates + receipt schemas: severity, dup-oracle, label-picker, missing-info, style)
- pattern-tradeoffs.md: caveman-on-external anti-pattern
- refactor-patterns.md R6 AUDIENCE-BOUNDARY ENFORCE checklist
- model-catalog.md TRIVIAL row → B14b/B14c cross-link

ROI: concentrated on PANEL-shape workflows. S1-style ~43K input + ~14K output saved/run (~$0.34 uncached). S2/S3-shape workflows correctly gain 0; v0.3.7 does NOT pressure architects to invent spawns.

Audit: files/caveman-audit.md
Design plan: files/v037-design-plan.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
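The AUDIENCE BOUNDARY rule keys the brief style on who reads the artifact, not on which model tier produces it. A minimal sketch of that default and of a check for the caveman-on-external anti-pattern follows. `Audience`, `Artifact`, `resolve_style`, and `boundary_violations` are hypothetical names, and using FULL as the default caveman intensity is an assumption based on the S1 briefs, not a stated rule.

```python
from dataclasses import dataclass
from enum import Enum


class Audience(Enum):
    INTERNAL = "internal"   # agent-to-agent traffic the user never sees
    EXTERNAL = "external"   # anything a human reads


@dataclass(frozen=True)
class Artifact:
    name: str
    audience: Audience
    style: str = ""  # empty string means "use the default for this audience"


def resolve_style(artifact: Artifact) -> str:
    """Default INTERNAL to caveman and EXTERNAL to prose unless overridden."""
    if artifact.style:
        return artifact.style
    # Assumption: FULL is used as the default intensity here.
    return "CAVEMAN_FULL" if artifact.audience is Audience.INTERNAL else "PROSE"


def boundary_violations(artifacts: list[Artifact]) -> list[str]:
    """Names of EXTERNAL artifacts that would be written in any caveman intensity."""
    return [
        a.name
        for a in artifacts
        if a.audience is Audience.EXTERNAL and resolve_style(a).startswith("CAVEMAN")
    ]
```

A workflow with zero spawns has no INTERNAL artifacts to compress, which is consistent with the statement that S2/S3-shape workflows gain nothing from this change.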

Three handoff packets produced by Opus + genesis v0.3.7 architect runs. These are the design-only outputs; execution will be done by separate Sonnet executor sessions to replicate the v0.3.6 two-session pattern (architect on opus, executor on sonnet), giving an apples-to-apples cost A/B.

Packets:
- S1-triage-v037/handoff.md: 5-spawn panel (CAVEMAN_FULL briefs)
- S2N-v037/handoff.md: monolithic A9+A7 (0 spawns, refusal validated)
- S3-rename-v037/handoff.md: S7 short-circuit (0 spawns)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Three cold-start executor cells (S1, S2N, S3) committed with REAL TELEMETRY harvested from the cloud events table for executor chat sessions 82be8bfe / 41405aae / 99c365c2. Cost-report.json files are authoritative (executor-only, architect-cost out-of-scope per the established v0.3.6 methodology).

Headline numbers (executor-only, real telemetry):
  S1-v0.3.7   $18.53 (vs v0.3.6 $24.59, -24.6%)  quality 8/10
  S2N-v0.3.7  $7.10  (vs v0.3.6 $9.80,  -27.5%)  quality 8/10 (parity)
  S3-v0.3.7   $14.30 (vs v0.3.6 $10.40, +37.5%)  quality 10/10 (fixture-rebuild overhead)

Caveman receipt fidelity (load-bearing v0.3.7 substrate validation):
  S1-v0.3.7   5/5 CAVEMAN_FULL on the wire (vs v0.3.6's 0/9 PROSE_LEAK)
  S2N/S3      0 spawns by design (AUDIENCE BOUNDARY refusal validated)

RTR.md updated:
- matrix table: v0.3.7 column added per scenario
- per-cell scorecard: 3 v0.3.7 rows + summary
- per-pattern attribution: B14b promoted to empirically-load-bearing
- flaw 3 (B14b not invoked): marked RESOLVED
- new section: v0.3.7 cold-start packet fidelity test

PR body updated separately via gh pr edit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
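The percentage deltas in the headline numbers can be re-derived from the two dollar figures per scenario. A small sketch, with `pct_delta` and `CELLS` as illustrative names and the costs copied from the commit message above:

```python
def pct_delta(new_cost: float, old_cost: float) -> float:
    """Relative change of new_cost versus old_cost, in percent."""
    return (new_cost - old_cost) / old_cost * 100


# (v0.3.7 cost, v0.3.6 cost) in USD, as reported in the commit message.
CELLS = {
    "S1": (18.53, 24.59),
    "S2N": (7.10, 9.80),
    "S3": (14.30, 10.40),
}

deltas = {name: pct_delta(new, old) for name, (new, old) in CELLS.items()}
```

Recomputing from the rounded dollar figures can differ from the reported percentage in the last digit (S2N comes out near -27.55), presumably because the reported value was derived from unrounded costs in the cost-report.json files.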

danielmeppiel changed the title ~~Empirical proof: real A/B measurement of v0.2.0 vs v0.3.0 panel cost on microsoft/apm#1424~~ Empirical proof: v0.2.0 vs v0.3.0 panel — measured 1.66x cost reduction on microsoft/apm#1424 (FinOps report) May 29, 2026

danielmeppiel and others added 19 commits May 29, 2026 14:03

Replace REPORT.md with FinOps PR description body

b23377e

Mirrors the PR #12 description so the ground-truth report is version-controlled alongside the data it cites. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

danielmeppiel changed the title ~~Empirical proof: v0.2.0 vs v0.3.0 panel — measured 1.66x cost reduction on microsoft/apm#1424 (FinOps report)~~ Token economics in genesis: measured cost patterns with one honest counter-example May 29, 2026

danielmeppiel changed the title ~~Token economics in genesis: measured cost patterns with one honest counter-example~~ Token economics in genesis v0.3.6: real-telemetry validation across 3 scenarios × 4 cells May 29, 2026

danielmeppiel and others added 6 commits May 29, 2026 23:13

danielmeppiel merged commit 0abf6f0 into main May 30, 2026
4 checks passed

danielmeppiel merged 28 commits into main from empirical-proof-real-ab

danielmeppiel commented May 29, 2026 • edited

Headline

TL;DR — when to pick which

Why this experiment exists

How we measured

What we found — three findings, in plain language

1. Bad architecture costs more than no architecture

2. On simple workloads, single-shot Sonnet is the cost-rational baseline

3. On production-critical workloads, severity weighting flips the verdict

Per-pattern attribution — which patterns moved which dollars

Reliability and predictability — the dimension ROI does not capture

What changed in the corpus

What this PR does NOT prove
