Token economics in genesis v0.3.6: real-telemetry validation across 3 scenarios × 4 cells#12

Merged
danielmeppiel merged 28 commits into
mainfrom
empirical-proof-real-ab
May 30, 2026
@danielmeppiel danielmeppiel commented May 29, 2026

Owner

Headline

This PR makes token economics a first-class design dimension in the genesis skill (v0.2 → v0.3.6). A new vocabulary of patterns lets the architect reason about which model each sub-agent runs on, which tools it can call, and how big each prompt needs to be — not just whether the workflow is correct.

We validated this on three workloads against four execution shapes each (twelve cells total), with every dollar harvested from the Copilot cloud events table — no estimates, no models. The headline is honest:

v0.3.6 is insurance, not optimisation. It pays back on workloads where missing a finding is expensive — supply-chain security, anti-pattern rejection, verifier-confirmed audits. It does not pay back on simple workloads where a single Sonnet pass already produces good output.

v0.3.7 update (this PR's headline addition). The B14b CAVEMAN BRIEF + B14c AUDIENCE BOUNDARY substrate is now empirically load-bearing. A cold-start fidelity test (Opus architect → fresh Sonnet executor, packet-only, no skill loaded) produces 5/5 CAVEMAN_FULL spawn briefs on the wire for S1's 5-lens panel (vs v0.3.6's 0/9), and the cost effect at quality parity is −24.6% on S1, −27.5% on S2N with real telemetry. S3 substrate path is byte-identical to v0.3.6; its +37.5% cell number is documented fixture-rebuild overhead. Detail in § v0.3.7 cold-start packet fidelity test.

The next three sections explain why that conclusion is the right one, what the experiment actually measured, and how to read the per-cell numbers.


TL;DR — when to pick which

| If you need... | Pick | Why (one-line evidence) |
| --- | --- | --- |
| A deterministic batch op (rename, codemod, regex refactor) | zero-sonnet with shell access | Sonnet + one sed call cost $4.81 on S3; architecture is dead weight here |
| A one-off doc audit, mistakes are cheap to recover | zero-sonnet | $6.20 for a 7/10 audit on S2; architected cells charge $3.60 more for one quality point |
| Verifier-confirmed precision on a structured-finding job | v0.3.6 | S2-v0.3.6 is the only cell with an A7 verifier pass: 6/6 HIGH confirmed, 0 downgrades, $9.80 |
| To catch supply-chain BLOCKERs on a security-sensitive PR review | v0.3.6 | S1-v0.3.6 caught both arbitrary-cmd-exec and validation-bypass BLOCKERs at $24.59; zero-sonnet missed both at $8.89 |
| A workflow you can run nightly in CI, audit by reviewer, hand to a junior operator | v0.3.6 | The architected packet is a committed artifact with per-lens trace; zero-x is per-run improvisation |
| To avoid the worst-case cost trap (over $30 on a small job) | v0.3.6, never v0.2 | v0.2's per-file loop cost $33.79 on the same rename v0.3.6 did for $10.40; bounded variance is the real win |

Per-cell scorecard (raw $, quality, weighted ROI):

| Cell | Cost | Quality | ROI_raw | ROI_weighted | What it bought |
| --- | --- | --- | --- | --- | --- |
| S3-v0.3.6 | $10.40 | 10/10 | 0.96 | 0.96 | Bounded-variance rename via S7 BRIDGE |
| S3-zero-sonnet | $4.81 | 10/10 | 2.08 | 2.08 | Cheapest correct path on a deterministic batch |
| S3-v0.2 | $33.79 | 9/10 | 0.27 | 0.27 | Cautionary tale: per-file loop |
| S3-zero-opus | $41.01 | 2/10 | 0.05 | 0.02 | Fuel-burned-on-a-wrong-symbol failure |
| S2-v0.3.6 | $9.80 | 8/10 | 0.82 | 2.45 | Verifier-confirmed precision (only cell) |
| S2-zero-sonnet | $6.20 | 7/10 | 1.13 | 2.90 | Cheapest reasonable audit |
| S2-v0.2 | $9.42 | 7/10 | 0.74 | 2.34 | Five lenses, no verifier |
| S2-zero-opus | $33.01 | 8/10 | 0.24 | 0.73 | Pays Opus premium without measurable return |
| S1-v0.3.6 | $24.59 | 9/10 | 0.37 | 2.28 | Only cell that caught both supply-chain BLOCKERs |
| S1-zero-sonnet | $8.89 | 6/10 | 0.67 | 2.59 | Reasonable review, missed BLOCKERs |
| S1-v0.2 | $3.94 | 3/10 | 0.76 | 2.79 | Under-completed (27-line stub) |
| S1-zero-opus | $19.82 | 7/10 | 0.35 | 1.36 | Opus premium does not pay back vs. Sonnet |
| S3-v0.3.7 | $14.30 | 10/10 | 0.70 | 0.70 | Substrate path identical to v0.3.6; cell overhead from fixture rebuild |
| S2-v0.3.7 | $7.10 | 8/10 | 1.13 | 3.38 | Best ROI in the matrix; 0 spawns (AUDIENCE BOUNDARY refusal) |
| S1-v0.3.7 | $18.53 | 8/10 | 0.43 | 2.27 | 5/5 CAVEMAN_FULL on the wire; −24.6% vs v0.3.6 at near-parity |

Higher ROI is better. ROI_weighted uses BLOCKER × 5, HIGH × 3, MEDIUM × 1, LOW × 0.3. Full ROI definition in REPORT.md § The ROI function.


Why this experiment exists

Genesis is a skill an LLM-driven agent loads to architect other agentic workflows. When a team says "design us a PR-review panel" or "audit our docs corpus", the genesis architect decides how many sub-agents to spawn, which model each runs on, what tools each can call, and how to wire their outputs.

Before this PR, the architect optimised for correctness only. Cost was implicit, and that made the architect prone to two failure modes we kept observing in practice:

  1. Naive fan-out — spawning a 5-lens panel of Opus sub-agents for a task that a single Sonnet shell call could finish. Quality unchanged, cost 10–50× higher.
  2. Wrong-primitive binding — designing a per-file view + edit loop for a 19-file rename instead of a single perl -i shell invocation. The agent does the work; it just does it through the wrong tool surface.

The patterns added in v0.3.6 give the architect explicit vocabulary for both: B12 SELECTION RULE (how to pick the right primitive), B15 TOOL SUBSET + S7 DETERMINISTIC TOOL BRIDGE (when a shell call replaces N file edits), A12 GRADIENT WORKFLOW + benchmark-grounded model catalog (Sonnet for the bulk, Opus only for narrow arbiter roles), and A1 PANEL + A7 ADVERSARIAL VERIFIER (multi-stream reviewability with precision evidence). The question this PR answers is: do those patterns actually move the needle in production-grade conditions, on real workloads, against honest baselines?


How we measured

Three scenarios, each run four ways:

The four cells per scenario are zero-opus and zero-sonnet (single-shot prompt to a frontier model, no architecture), v0.2 architected (workflow designed by the previous-version architect), and v0.3.6 architected (workflow designed by this PR's architect with the new pattern vocabulary).

Cost is the real-telemetry sum of usage_input_tokens, usage_output_tokens, usage_cache_read_tokens, and usage_cache_write_tokens for the executor session and any sub-agents it spawned, priced at Anthropic public rates. Per-cell cost-report.json files are committed alongside chat-session ids for replay. The architect session is not counted — architecting is amortised, infrequent infrastructure.
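The pricing step described above can be sketched as follows. This is a minimal illustration, not the PR's actual `tools/profile-tokens.py`: the four `usage_*` field names come from the description, while the per-million-token rates, the `role` field, and the function names are hypothetical placeholders chosen for the example.

```python
# Hypothetical USD prices per million tokens. These are placeholders only,
# NOT Anthropic's published rates; substitute the real rates per model.
PLACEHOLDER_RATES = {
    "usage_input_tokens": 1.00,
    "usage_output_tokens": 5.00,
    "usage_cache_read_tokens": 0.10,
    "usage_cache_write_tokens": 1.25,
}


def turn_cost(turn: dict, rates: dict = PLACEHOLDER_RATES) -> float:
    """Price one telemetry row: tokens times rate, summed over the four usage fields."""
    return sum(turn.get(field, 0) * rate / 1_000_000 for field, rate in rates.items())


def cell_cost(turns: list[dict], rates: dict = PLACEHOLDER_RATES) -> float:
    """Sum executor and sub-agent turns; architect turns are excluded (amortised)."""
    # The "role" key is a hypothetical way to tag which session a row came from.
    return sum(turn_cost(t, rates) for t in turns if t.get("role") != "architect")


example_turns = [
    {"role": "executor", "usage_input_tokens": 1_000_000, "usage_output_tokens": 200_000},
    {"role": "subagent", "usage_cache_read_tokens": 10_000_000},
    {"role": "architect", "usage_input_tokens": 5_000_000},  # not counted
]
example_total = cell_cost(example_turns)
print(f"${example_total:.2f}")
```

Keeping the rates in a single table makes it straightforward to re-price the same telemetry if the published rates change.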

Quality was graded by the orchestrator against scenario-specific rubrics: binary correctness plus tool-mechanic cleanliness for S3, real structural drift caught + verifier precision for S2, actionable bugs caught + severity calibration for S1. Full grading detail with the per-finding evidence is in REAL-TELEMETRY-RESULTS.md.

A buyer's decision is return on each dollar, not absolute spend, so we score every cell on three ROI axes — raw QualityScore / Cost, severity-weighted (BLOCKER × 5, HIGH × 3, MEDIUM × 1, LOW × 0.3), and tail-risk-adjusted (QualityScore / (Cost + P(failure) × C_failure)). The first answers "what does the dollar buy?"; the second weights findings by impact-class; the third dominates when missing one finding costs orders of magnitude more than the run itself. Higher is better on all three. The full definition with inputs and caveats is in REPORT.md § The ROI function.
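As a rough sketch of the three axes, the definitions above translate to the functions below. The severity weights and the formulas are taken from this description; the function names and the `c_failure` value in the usage lines are illustrative assumptions, and REPORT.md remains the authoritative definition.

```python
SEVERITY_WEIGHTS = {"BLOCKER": 5.0, "HIGH": 3.0, "MEDIUM": 1.0, "LOW": 0.3}


def roi_raw(quality: float, cost: float) -> float:
    """Quality points bought per dollar."""
    return quality / cost


def weighted_points(findings: dict[str, int]) -> float:
    """Severity-weighted total for a {severity: count} mapping."""
    return sum(SEVERITY_WEIGHTS[sev] * count for sev, count in findings.items())


def roi_weighted(findings: dict[str, int], cost: float) -> float:
    """Severity-weighted points per dollar."""
    return weighted_points(findings) / cost


def roi_tail(quality: float, cost: float, p_failure: float, c_failure: float) -> float:
    """Quality per dollar once the expected cost of a missed finding is added in."""
    return quality / (cost + p_failure * c_failure)


# Scorecard inputs: S3-zero-sonnet is 10/10 at $4.81, S1-v0.3.6 is 9/10 at $24.59.
print(round(roi_raw(10, 4.81), 2))   # 2.08
print(round(roi_raw(9, 24.59), 2))   # 0.37

# Tail-risk comparison on S1 with a hypothetical C_failure of $1,000.
print(roi_tail(9, 24.59, p_failure=0.0, c_failure=1000.0) >
      roi_tail(6, 8.89, p_failure=1.0, c_failure=1000.0))   # True
```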


What we found — three findings, in plain language

1. Bad architecture costs more than no architecture

The S3 rename cell where the v0.2 architect designed a per-file view + edit loop cost $33.79 in real telemetry. The same rename, executed by Sonnet directly with one sed shell call, cost $4.81. Same outcome, 7× the cost, because every tool call round-trips the entire conversation context back to the model and a 19-file rename done one file at a time pays that overhead 19 times.

The v0.3.6 architect, given the same brief, rejected the per-file design and chose a single grep | xargs perl -i shell call (the S7 DETERMINISTIC TOOL BRIDGE pattern). It cost $10.40, which is 3.2× cheaper than v0.2 but still about 2.2× the zero-sonnet cost. The architecture overhead bought bounded variance: no risk of the per-file loop trap, no risk of substituting the wrong symbol the way zero-opus did. This is the load-bearing claim of the PR: the new pattern vocabulary makes the architect reject wrong-tool-surface designs, and the gap is measured, not modelled.
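Why the per-file loop is so much more expensive can be illustrated with a toy model of rolling re-send: each turn pays a fixed entry tax and re-sends all history accumulated so far. The 35K entry tax is the Sonnet sub-agent floor quoted elsewhere in this description; the per-turn history growth of 2,000 tokens is a hypothetical figure, so the output shows the shape of the effect and does not reproduce the measured $33.79.

```python
def loop_input_tokens(turns: int, entry_tax: int, growth_per_turn: int) -> int:
    """Total input tokens when every turn re-sends the entry tax plus all prior history."""
    return sum(entry_tax + i * growth_per_turn for i in range(turns))


ENTRY_TAX = 35_000  # Sonnet sub-agent floor quoted in this PR description
GROWTH = 2_000      # hypothetical tokens of history added by each tool call

# One turn per file for a 19-file rename versus a single shell invocation.
per_file = loop_input_tokens(turns=19, entry_tax=ENTRY_TAX, growth_per_turn=GROWTH)
one_shot = loop_input_tokens(turns=1, entry_tax=ENTRY_TAX, growth_per_turn=GROWTH)

print(per_file)   # 1007000
print(one_shot)   # 35000
```

Under these assumptions the loop pays the fixed tax 19 times plus a quadratic history term, which is the overhead a single shell call avoids.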

2. On simple workloads, single-shot Sonnet is the cost-rational baseline

On S2 doc-audit and S1 PR review, zero-sonnet wins on raw $/quality. Sonnet at $6.20 produced a 7/10 audit. The v0.3.6 architecture at $9.80 produced an 8/10 audit. Sonnet at $8.89 produced a 6/10 PR review. The v0.3.6 architecture at $24.59 produced a 9/10 review.

Architecture has a fixed overhead. Every sub-agent dispatch reloads the Copilot CLI tool descriptions, the system prompt, and the genesis skill bundle — roughly 80K input tokens per turn that the single-shot path does not pay. On workloads where one Sonnet pass already produces good output, that overhead is dead weight. The v0.3.6 architect knows this: its S2 doc-audit handoff explicitly chose monolithic single-session over fan-out, citing the B12 SELECTION RULE wrong-primitive-binding warning.

What architecture buys on these workloads is not lower mean cost. It is multi-stream reviewability (five independent lens verdicts instead of one prose blob), class-routed quality (Sonnet on reviewer lenses, Haiku on trivial ones), and — uniquely on v0.3.6 — adversarial verification. The S2-v0.3.6 cell is the only one with a verifier pass that confirmed 6/6 HIGH findings with zero downgrades. That is the kind of precision evidence a buyer cannot get from a single-shot prompt at any cost.

3. On production-critical workloads, severity weighting flips the verdict

The S1 PR review is the test case where ROI_raw and ROI_tail diverge. Three of the four cells produced reasonable reviews. Only v0.3.6 caught the two supply-chain security BLOCKERs — arbitrary command execution via plugin-controlled LSP config, and validation bypass via the raw-dict install path. zero-opus and zero-sonnet both missed both. v0.2 under-completed the workflow entirely (a 27-line YAML stub with three bullet points).

On raw $/quality, zero-sonnet still wins (0.67 vs v0.3.6's 0.37). On severity-weighted ROI, the gap narrows (zero-sonnet 2.59 vs v0.3.6 2.28 weighted points per dollar): the BLOCKER × 5 weighting lifts v0.3.6 to 56 weighted points, roughly twice the next cell's total. On tail-risk-adjusted ROI, v0.3.6 dominates: the cost of merging a supply-chain CVE into a published CLI is at minimum thousands of dollars in remediation and reputation, against an architecture premium of ~$15. The math says: pay the premium when C_failure ≫ Cost, do not pay it when it does not. That is precisely the conclusion the v0.3.6 architect's catalog now encodes.
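The "pay the premium when C_failure ≫ Cost" rule can be made concrete by solving for the expected loss at which the two S1 cells tie on tail-risk-adjusted ROI. The sketch below uses the scorecard values for S1-zero-sonnet (6/10, $8.89) and S1-v0.3.6 (9/10, $24.59) and assumes P(failure) = 0 for v0.3.6, matching the single-run empirical 0/1 used in this study. The function name is illustrative.

```python
def breakeven_expected_loss(q_cheap: float, cost_cheap: float,
                            q_safe: float, cost_safe: float) -> float:
    """Expected loss P(failure) * C_failure on the cheap cell at which both cells tie.

    Assumes the safe cell never fails, so its ROI_tail is q_safe / cost_safe.
    Solving q_safe / cost_safe = q_cheap / (cost_cheap + loss) for loss gives:
    """
    return q_cheap * cost_safe / q_safe - cost_cheap


loss = breakeven_expected_loss(q_cheap=6, cost_cheap=8.89, q_safe=9, cost_safe=24.59)
print(round(loss, 2))  # 7.5
```

Under these assumptions, any expected failure cost above roughly $7.50 per run already favours v0.3.6 on this axis, which is why a failure cost in the thousands of dollars makes the comparison one-sided.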

The full per-cell numbers, including ROI_raw, ROI_weighted, and the failure-mode column for each cell, are in the scorecard at the top of REPORT.md. Cost methodology is in PROFILING-PROTOCOL.md. Full grading detail with the per-finding evidence is in REAL-TELEMETRY-RESULTS.md.



Per-pattern attribution — which patterns moved which dollars

| Pattern | Empirical effect | Confidence |
| --- | --- | --- |
| S7 DETERMINISTIC TOOL BRIDGE | $23.39 saved on S3 (v0.2 $33.79 → v0.3.6 $10.40); same brief, same fixture | Strong |
| B12 SELECTION RULE + explicit model: declarations | ~$3–5 saved on S1 by routing 63 trivial-lens calls to Haiku ($2.01) instead of Sonnet | Strong; per-model breakdown in cost-report.json |
| A12 GRADIENT WORKFLOW | Applied in S1; refused by the architect in S2 and S3 (gradient-free workflows). The refusal is itself the win on simple workloads | Strong-by-omission |
| B15 TOOL SUBSET | Visible at the Haiku tier (6K input floor vs. 35K Sonnet sub-agent) | Partial; only effective on tier-aware agent types |
| A1 PANEL + A7 ADVERSARIAL VERIFIER | Quality patterns, not cost patterns. Spend $3.60 (S2) and $15.70 (S1) over zero-sonnet for verifier-confirmed precision and BLOCKER-catching | Strong; only S2-v0.3.6 has a verifier pass, only S1-v0.3.6 caught the BLOCKERs |
| B13 CACHE-AWARE PREFIX | No measurable differentiation; all cells already cache at 93–99% | None; currently a guideline, not a lever |
| B14b CAVEMAN BRIEF | Empirically load-bearing in v0.3.7 (this PR). 5/5 CAVEMAN_FULL on the wire on S1; cost delta −24.6% on S1, −27.5% on S2N at quality parity. Receipt audit at S1-v037-real/caveman-classification.md | Strong; receipts queryable, cost delta in real telemetry |

The honest summary: S7 + B12 + B15 + A12 + B14b (new in v0.3.7) are doing the work; B13 + B16 are catalogued but not yet pulling weight. Full attribution and v0.4 frontier (5 quick wins) in REAL-TELEMETRY-RESULTS.md § Per-pattern attribution and § v0.3.6 flaws and quick wins.


Reliability and predictability — the dimension ROI does not capture

Raw $/quality scores a single cell on a single run. It misses three properties that matter once the question shifts from "which is cheapest for this one task?" to "which workflow do I want in a CI pipeline run nightly, or handed to a junior engineer?":

  • Variance bounds. Single-shot Sonnet has no worst-case floor. The architected workflow does — every lens runs, every verifier confirms, the worst case is the worst lens. n=1 here so we cannot measure variance directly, but the structural argument holds.
  • Auditability. S1-v0.3.6 produced 5 lens artifacts + 1 synthesizer + 1 verifier, all committed. Zero-sonnet produced one prose blob. For workloads that gate production decisions (security review, compliance audit, release sign-off), the multi-stream artifact is a working audit trail; the blob is not.
  • Repeatability. The architected workflow is a committed packet plus N committed sub-agents. A different operator, on a different day, running the same packet against the same brief, gets a comparable workflow shape. The single-shot prompt is improvisational — the prompt typed once in a chat session is not a reusable artifact unless a human turns it into a .agent.md and adds tests, which is exactly what the architect already does.

This shifts the buyer's calculation. For one-off use, zero-sonnet is fine. For productionised, repeatable, automatable use, v0.3.6 is the only thing on the table — because it is the only one that produces a workflow artifact at all. Detail in REAL-TELEMETRY-RESULTS.md § Reliability and predictability.

What changed in the corpus

  • New patterns in skills/genesis/assets/design-patterns.md: B12 SELECTION RULE, B13 CACHE-AWARE PREFIX, B14b CAVEMAN BRIEF, B15 TOOL SUBSET, B16 EFFORT GOVERNOR, S7 DETERMINISTIC TOOL BRIDGE
  • New architectural pattern A12 GRADIENT WORKFLOW
  • Model catalog with benchmark-grounding reference (vals.ai SWE-bench data)
  • Empirical-proof corpus at dev/empirical-proof/ — twelve runtime cells with real-telemetry validation, formal ROI definition, per-cell artifacts and chat-session ids for replay

What this PR does NOT prove

This is a small, opportunistic study. We have n=3 scenarios and n=1 per cell. The conclusions above are about the shape of the cost-quality curve under v0.3.6 patterns, not a SWE-bench-grade benchmark. Specifically:

  • It is not a claim that v0.3.6 is always cheaper. It is not, on raw $, for S1 and S2.
  • It is not a claim the catalogue of patterns is complete. Telemetry shows every sub-agent dispatch pays a fixed entry tax of roughly 6K tokens (Haiku/explore), 35K (Sonnet sub-agent), or 54K (full orchestrator) before the prompt is read, and naive per-file loops amplify it via rolling re-send — S3-v0.2 burned a 290:1 input/output ratio for that reason. No pattern in this PR compresses that surface; it is the visible frontier for the next PR. The full breakdown, with per-bucket turn counts and per-cell I/O ratios, is in REAL-TELEMETRY-RESULTS.md.
  • Quality grading is orchestrator-judged, not held against a human gold standard. We are confident about the ordinal rankings and the BLOCKER findings on S1; the absolute /10 scores are best-effort.
  • P(failure|cell) in ROI_tail is the empirical 0/1 of this single run, not a calibrated probability. The tail-risk argument relies on the structural claim that C_failure ≫ Cost for security-critical workloads, not on a fitted failure-rate model.

Replaces the analytical projection in scenario-pr-review-panel.md with
ground-truth measurement from Copilot CLI per-turn telemetry.

Result: same panel job, same PR, same harness:
  - Executor A (v0.2.0 design):  164 turns, 8.71M tokens, $5.01
  - Executor B (v0.3.0 design):   89 turns, 4.07M tokens, $3.02
  - 1.66x cheaper, no critical-finding regression

Adds:
  - tools/profile-tokens.py  (Copilot CLI usage-block parser)
  - measurements/            (7 prior sessions of this work-stream)
  - ab-experiment-apm-1424/  (the controlled A/B + REPORT.md)

The 15x headline from the prior projection is explicitly retracted:
real reductions come mostly from B13 PROMPT THRIFT + tool-subset
discipline (turn-count drop), not from model routing or aggressive
cache tricks. Cache discipline (B12) is real and measurable at 90%+
in every session profiled, but it is an enabler not a multiplier.
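A quick check of the figures above, using only the three numbers reported per executor, suggests why the saving is attributed to the turn-count drop: cost per turn is nearly the same in both runs, so the reduction tracks the number of turns. The dictionary layout and variable names are illustrative.

```python
runs = {
    "A": {"turns": 164, "tokens_m": 8.71, "usd": 5.01},  # v0.2.0 design
    "B": {"turns": 89, "tokens_m": 4.07, "usd": 3.02},   # v0.3.0 design
}

cost_ratio = runs["A"]["usd"] / runs["B"]["usd"]
turn_ratio = runs["A"]["turns"] / runs["B"]["turns"]
token_ratio = runs["A"]["tokens_m"] / runs["B"]["tokens_m"]
usd_per_turn = {name: r["usd"] / r["turns"] for name, r in runs.items()}

print(round(cost_ratio, 2))   # 1.66
print(round(turn_ratio, 2))   # 1.84
print(round(token_ratio, 2))  # 2.14
print({name: round(v, 3) for name, v in usd_per_turn.items()})
```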

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danielmeppiel danielmeppiel changed the title from "Empirical proof: real A/B measurement of v0.2.0 vs v0.3.0 panel cost on microsoft/apm#1424" to "Empirical proof: v0.2.0 vs v0.3.0 panel — measured 1.66x cost reduction on microsoft/apm#1424 (FinOps report)" May 29, 2026
danielmeppiel and others added 19 commits May 29, 2026 14:03
Mirrors the PR #12 description so the ground-truth report is
version-controlled alongside the data it cites.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Triggered by the empirical A/B in PR #12: Executor B declared
role classes per agentic element but bound every element to
Sonnet 4 (the session default), missing B12 MODEL ROUTER.
Root cause: the v0.3.0 corpus did not say loudly enough that
on Copilot, the per-element binding site for model/tools is
.agent.md custom-agent frontmatter -- SKILL.md does NOT accept
those fields and silently ignores them.

Fixes across three files:

skills/genesis/assets/runtime-affordances/per-harness/copilot.md
- Cite canonical custom-agents-configuration docs URL
- PERSONA SCOPING FILE section: enumerate full .agent.md
  frontmatter (model, tools, target, disable-model-invocation,
  user-invocable, mcp-servers, metadata) with field-by-field
  explanation pointing each cost lever at its binding site
- MODULE ENTRYPOINT (SKILL) section: explicit IMPORTANT block
  that SKILL.md does NOT support model: or tools:; spell out
  the architectural consequence (restructure as .agent.md if
  per-element binding is needed)
- Section 9 Cost-pattern bindings: rewrite B12, B15, B16 to
  name .agent.md as the BINDING SITE with an explicit
  SKILL-LEVEL ROUTING ATTEMPT anti-pattern

skills/genesis/assets/runtime-affordances/model-catalog.md
- 'What this file does NOT do': add bullet noting binding
  site is harness-specific and the architect must consult
  the per-harness adapter, citing PR #12 as the failure
  mode if missed
- 'How per-harness adapters extend this file': require
  adapters to NAME THE PER-ELEMENT BINDING SITE explicitly

skills/genesis/assets/design-patterns.md
- B12 MODEL ROUTER: new WRONG-PRIMITIVE BINDING anti-pattern
  with PR #12 as the worked example
- B15 TOOL SUBSET: same anti-pattern (mirrors B12 failure mode)

Companion PR description updated to flag the architect-B miss
explicitly and to propose the B12-firing rerun as the next
iteration.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Third controlled run, fixed-corpus rerun:

- Architect C (v0.3.1, fixed binding-site corpus) produced a 339-line
  handoff with explicit per-element MODEL BINDING TABLE: 6 of 7 sites
  bound to non-default SKUs (5 lenses + arbiter, planner=Opus,
  trivial=Haiku, others=Sonnet).
- Executor C honored the binding by passing model: to each task
  sub-agent dispatch. Telemetry confirms 3 distinct models in real
  billing: 15 turns on claude-haiku-4.5, 37 on claude-sonnet-4.6,
  29 on claude-opus-4.7.
- Output quality parity with Exec A and B: catches the LSP env-RCE
  CRITICAL plus 1 HIGH + 9 MEDIUM + 9 LOW.

Key empirical finding: B12 MODEL ROUTER fires as designed at the
sub-agent level (~10x cheaper per-turn on Haiku-bound trivial lenses)
but the orchestrator's session-default model dominates total cost
when it is more expensive than the routed sub-agents. In this run
the executor session ran on Opus 4.7 by default; the 29 orchestrator
turns alone cost $6.14, swamping the $2 saved by Haiku routing.

Counterfactual with orchestrator bound to Sonnet (matching handoff
intent): $2.73, which would be 10% cheaper than Exec B's $3.02.

This generates a new corpus lesson for v0.3.2: B12 must include the
orchestrator thread as a binding site. PR description (and REPORT.md
mirror) updated with full 3-way FinOps analysis and the proposed
v0.3.2 corpus addition.

Artifacts:
- architect-C-v0.3.1-handoff.md
- executor-C-v0.3.1-review.md
- executor-C-tokens.json
- executor-C-process.log.gz (3MB, 19MB uncompressed)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…lled orchestrator

The earlier 3-cell A/B/C runs all had Opus session-default orchestrators
that masked the corpus-level signal. This commit adds a clean 2-cell
A/B with the only variable being the genesis corpus version:

- Both architects: claude-opus-4.7 (design tier)
- Both executor orchestrators: claude-sonnet-4.6 (pinned)
- Same target PR: microsoft/apm#1424
- Same lens count: 5 (correctness, security, performance, style, test-coverage)

Result:
- Cell D (v0.1 baseline, no cost-aware corpus): $11.77 total, 52 findings, 6 BLOCKER
- Cell E (v0.3.1 treatment, B12 fires at 7 sites): $14.68 total (+25%), 61 findings, 4 CRITICAL + 10 more above threshold

Cell E catches 2 additional CRITICAL security findings Cell D missed
(SEC-001 TOCTOU symlink race, SEC-002 validated-object discarded).

Honest reframing: in Copilot CLI where task(explore) defaults to Haiku,
B12 promotes lenses UP from Haiku to Sonnet, which is a quality-routing
knob, NOT a cost-reduction knob. The v0.3.1 corpus over-applies B12 by
encouraging explicit binding at every .agent.md primitive. v0.3.2 must
add a B12 SELECTION RULE: bind explicitly only when stakes, portability,
or operator economic preference justifies it; otherwise trust the
harness default.

Artifacts added:
- architect-D-v0.1-handoff.md, architect-E-v0.3.1-handoff.md
- executor-D-v0.1-review.md, executor-E-v0.3.1-review.md
- executor-D-findings.json, executor-E-findings.json
- architect-D-process.log.gz, executor-D-process.log.gz
- architect-E-process.log.gz, executor-E-process.log.gz
- tools/profile-per-model.py (new per-model attribution profiler)
- REPORT.md updated to v4 with the clean 2-cell story

PR #12 body updated to match.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After Cell F (v0.3.2 SELECTION RULE) only got to +7% vs v0.1 baseline,
RCA #2 named the residual cost driver: synth-heavy dispatched to Opus
by default for cross-lens adjudication = HEAVY ADJUDICATOR anti-pattern.

v0.3.2.1 corpus edit (architectural-patterns.md §A12):
- HEAVY ADJUDICATOR anti-pattern: synthesis that reconciles already-
  produced lens findings is reviewer-class, not planner-class.
- "WHERE THE HEAVY ROLE BELONGS" cure paragraph: bind planner only on
  rare, narrow triggers (>=2 BLOCKERs + contradictory + same diff hunk;
  expected firing rate ~2-4%).
- Cell F named as the empirical canonical case.
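One way to read the narrow trigger above is as a predicate over lens findings. The sketch below is an interpretation, not the corpus text: the `Finding` fields are hypothetical, and "contradictory" is assumed here to mean that two BLOCKER findings on the same diff hunk carry different verdicts.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    lens: str      # which lens produced the finding
    severity: str  # e.g. "BLOCKER", "HIGH", "MEDIUM", "LOW"
    hunk: str      # identifier of the diff hunk the finding refers to
    verdict: str   # hypothetical field: the action the lens recommends


def planner_trigger(findings: list[Finding]) -> bool:
    """True only when >=2 BLOCKERs on the same hunk disagree with each other."""
    blockers_by_hunk: dict[str, list[Finding]] = defaultdict(list)
    for f in findings:
        if f.severity == "BLOCKER":
            blockers_by_hunk[f.hunk].append(f)
    return any(
        len(group) >= 2 and len({f.verdict for f in group}) > 1
        for group in blockers_by_hunk.values()
    )


agreeing = [
    Finding("security", "BLOCKER", "lsp.py:40", "reject"),
    Finding("correctness", "BLOCKER", "lsp.py:40", "reject"),
]
conflicting = agreeing[:1] + [Finding("style", "BLOCKER", "lsp.py:40", "accept")]
print(planner_trigger(agreeing), planner_trigger(conflicting))  # False True
```

Everything that does not meet the predicate stays with the reviewer-class synthesizer, which is consistent with the arbiter staying dark in most runs.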

Cell G result on microsoft/apm#1424:
- Architect G (Opus, 20 turns): $7.34
- Executor G (Sonnet orch, 179 turns): $2.85
  -> 115 Haiku ($0.91) + 64 Sonnet ($1.93) + 0 Opus ($0)
- TOTAL: $10.19 (-13.4% vs Cell D baseline $11.77)
- Opus arbiter correctly stayed dark (narrow trigger not met)
- Bug-finding parity with Cell D (same class of issues)
- 2 false-positive BLOCKERs caught and downgraded via inline gh-api
  verification (Sonnet orchestrator), avoiding wasteful Opus dispatch

Final cell table:
| Cell | Corpus  | Total   | Delta vs v0.1 |
| D    | v0.1    | $11.77  | baseline      |
| E    | v0.3.1  | $14.68  | +24.7%  (FAIL)|
| F    | v0.3.2  | $12.63  | +7.3%   (PARTIAL)|
| G    | v0.3.2.1| $10.19  | -13.4%  (WIN) |
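The delta column can be reproduced from the totals alone. This is a small check using the four totals in the table, with Cell D as the baseline; the names are illustrative.

```python
totals = {"D": 11.77, "E": 14.68, "F": 12.63, "G": 10.19}


def delta_pct(total: float, baseline: float) -> float:
    """Percentage change of a cell total relative to the baseline total."""
    return (total / baseline - 1) * 100


deltas = {cell: round(delta_pct(total, totals["D"]), 1) for cell, total in totals.items()}
print(deltas)  # {'D': 0.0, 'E': 24.7, 'F': 7.3, 'G': -13.4}

# Cell G's total is the architect plus executor spend reported above.
cell_g_total = round(7.34 + 2.85, 2)
print(cell_g_total)  # 10.19
```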

This commit ships:
- The v0.3.2.1 corpus edit (HEAVY ADJUDICATOR anti-pattern in A12)
- Cell F + G architect handoffs, executor reviews, gzipped process
  logs, per-lens findings JSONs
- REPORT.md rewritten to v5 with full D/E/F/G arc, iteration narrative,
  mermaid diagrams for Cell G (winner) and Cell E (over-bound failure),
  and named load-bearing corpus edits

Corpus claim now empirically grounded: the cost-aware genesis corpus
produces designs measurably cheaper than the cost-unaware v0.1 baseline,
with parity on bug-finding quality.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…iscipline

Drift correction surfaced in PR review iteration. Cell G (v0.3.2.1)
landed on a -45% executor-cost shape by OMITTING model: and relying
on the Copilot CLI harness default for task(agent_type='explore').
That works on Copilot CLI today but is NOT engineering — it is
hope-based: not portable across harnesses, not predictable across
harness versions, no audit trail.

The actual cost anti-pattern is BIND-UP-WITHOUT-JUSTIFICATION
(forcing IMPLEMENTER/PLANNER on role classes that TRIVIAL/REVIEWER
work would meet — Cell E v0.3.1, +25% cost vs baseline). Explicit
binding that matches the harness default is PREDICTABILITY
DISCIPLINE, not ceremony.

Changes:
- skills/genesis/assets/design-patterns.md §B12: rule 3 of
  SELECTION RULE rewritten — DEFAULT == REQUIRED case is now BIND
  EXPLICITLY (predictability / portability / audit-trail); OMIT
  only as exception. CONSEQUENCE paragraph rewritten — well-
  designed B12 has MOST elements bound. CEREMONIAL BINDING anti-
  pattern replaced by BIND-UP-WITHOUT-JUSTIFICATION (still cites
  Cell E as in-corpus case); CEREMONIAL BINDING narrowed to copy-
  pasted bulk bindings without per-element role-class distinction.
- skills/genesis/assets/runtime-affordances/per-harness/copilot.md
  §9 B12 site: reframed 'CEREMONY' bullet to 'PREDICTABILITY
  DISCIPLINE', listing portability + predictability + audit-trail
  as reasons to bind explicitly even when it matches the default.
- skills/genesis/examples/06-cost-aware-panel.md: PROVENANCE
  WARNING rewritten — explains the Cell E vs F/G distinction and
  steers architects toward explicit model: declarations even on
  Copilot CLI.

Predictability probe (3 explore dispatches, $0.19 Haiku):
task(agent_type='explore') fired claude-haiku-4.5 reliably across
trivial/medium/complex prompts on Copilot CLI. Harness default IS
stable for complexity TODAY. BUT portability across harnesses
(claude-code, opencode, codex, cursor) is NOT verified — explicit
binding is the only portable discipline.

NOT in this commit:
- examples/04, references/cost-economics-process.md, token-
  economics.md (no stale CEREMONIAL BINDING references found).
- Genesis-audits-genesis pass (running async).
- Re-run with v0.3.3 corpus.
- REPORT.md v6 with executor-only framing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ty probe, surgical bloat removals (-248 lines)

Three concurrent corrections shipped together:

1. REPORT v6 -- EXECUTOR-ONLY cost framing.
   - Headline corrected from -13% (architect+executor bundled) to
     -45% (executor-only), the number the operator actually pays
     per run. Architect amortizes once across many runs of the
     designed workflow.
   - v5 was wrong by ~3x in the conservative direction; this
     correction tightens the empirical claim.
   - Predictability probe ($0.19, 3 explore dispatches, all fired
     Haiku) documented and validates Cell G's OMIT tactic as
     currently-not-broken on Copilot CLI -- not as recommended.
   - v0.3.3 reframe rationale explained: 'bind explicitly for
     PREDICTABILITY + PORTABILITY + AUDIT TRAIL' is the actual
     discipline; OMIT was a Copilot-CLI-only tactic that worked
     by accident. No re-run required because routing is identical
     (TRIVIAL -> claude-haiku-4.5 on Copilot CLI today, declared
     explicitly vs inherited).
   - Per-technique attribution and multi-scenario variance
     explicitly DEFERRED to follow-up PRs.

2. Genesis-audits-genesis pass -- surgical bloat removals.
   - Auditor (Opus 4.7, single architect cell) applied the
     genesis skill to audit the genesis corpus.
   - Recommendations: -720 to -930 lines projected; -248 lines
     applied this PR (other consolidations deferred for safety).
   - Removals: stance prose compression, step 3.2 sub-block
     collapse, copilot.md cost-pattern bindings -> table form,
     scaffolding removal across 3 files, B12 CONSEQUENCE block
     (pure restatement), war-story citation overcount.
   - Full audit at dev/empirical-proof/audit-v0.3.3/removal-list.md
     for follow-up consideration.

3. Continued v0.3.3 reframe propagation.
   - examples/06-cost-aware-panel.md PROVENANCE WARNING and
     dollar arithmetic removed (worked example reduced to
     qualitative pattern citations; concrete numbers belong in
     cost-economics-process.md step 6 template, not duplicated
     inline per example).

Net corpus delta (this commit): -468 +311 = -157 lines (corpus
itself, not counting REPORT/probe/audit artifacts).
Cumulative PR corpus delta: now ~+1700 net (was +1946 before
audit).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… appendix

Per operator feedback: the PR story is baseline vs head; intermediate
corpus iterations (v0.3.1, v0.3.2) belong in an appendix, not the
headline. Learnings preserved (RCA #1 lens fan-out leak, RCA #2
synth-heavy adjudicator leak, Cell E architecture diagram) — they
are how the BIND-UP-WITHOUT-JUSTIFICATION and HEAVY ADJUDICATOR
anti-patterns were discovered empirically rather than from first
principles. Confounded 3-cell A/B/C history moved to Appendix B.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Corpus additions:
- assets/architectural-patterns.md §A1 PANEL: add UNDIFFERENTIATED LENS BINDING
  anti-pattern. Forces per-lens CAPABILITY PROFILE enumeration before binding
  (cross-file reasoning? STAKES-weighted output? multi-step proof?).
- assets/design-patterns.md §B12 BULK IDENTICAL BINDING variant: strengthened
  to fire in BOTH directions (bulk-UP and bulk-DOWN). Cost direction is not
  the anti-pattern; lack of per-element reasoning is. Per-element CAPABILITY
  PROFILE template added; cross-references A1 PANEL.

REPORT additions:
- New Per-technique attribution section: isolates B12 (-$2.16 preventative),
  A12 (-$3.95 active, dominant win), B13 ($14 defensive) from existing
  4-cell D/E/F/G data via pairwise cell deltas. Honest framing: D->G -45%
  is mostly A12, not B12 (D accidentally inherited Haiku via harness default).
- New v0.3.4 PER-LENS DIFFERENTIATION subsection: explains the corpus change
  and why uniform Haiku binding on the 5 advisory lenses remains correct
  for this skill (after enumeration) but would differ on a verdict-emitting
  skill.
- Updated 'What this PR proves' (7 numbered points; per-technique +
  PER-LENS now included) and 'What this PR does NOT prove' (per-technique
  removed; B14/B15/B16 ablations + multi-scenario + cross-harness retained
  as explicit deferrals).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ED bindings

Dispatched Opus 4.7 architect on a deliberately different skill type
(release-notes-generator: A2 PIPELINE, 50-commit input, mixed sub-tasks).
Same persona, same v0.3.4 corpus, different problem shape.

Result: role-class distribution DIFFERENTIATED, not uniform.
- 2 TRIVIAL (classifier E1, batched bug-fix one-liners E4)
- 2 IMPLEMENTER (orchestrator E0, feature prose E2)
- 2 REVIEWER (breaking-change prose E3, consistency pass E5)
- 0 PLANNER

Architect explicitly flagged BIND-UP-WITHOUT-JUSTIFICATION risk on E4:
'wrongly slap-binding [30 bug-fix one-liners] to sonnet under
release-notes is user-facing would be BIND-UP-WITHOUT-JUSTIFICATION
and would inflate the per-run cost by roughly 30 premium requests.'
The v0.3.4 corpus discipline working in the wild.

Predicted L-scenario cost: ~$0.18-0.25 per run. A12 GRADIENT savings
vs flat-sonnet hypothetical: ~29 premium requests per run.

This validates that the PR-review panel's uniform Haiku binding is
NOT rubber-stamping; it is the discipline working correctly on
uniform-profile inputs. When profiles are heterogeneous (this run),
the discipline produces heterogeneous bindings.

Handoff packet persisted at dev/empirical-proof/scenario-release-notes/.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…S discipline

Dispatched v0.3+ panel against microsoft/apm#1541 (+41/-2, 2 files, small CLI fix).

Results:
- Executor cost: ~$0.21 (vs $2.85 on PR #1424)
- Per-kLoC: $5.12 (vs $1.15 baseline) — fixed Sonnet-executor overhead
  dominates at small scale (~93% of total cost)
- Panel cost shape holds in dollar terms; per-kLoC ratio inverted
- Arbiter trigger correctly did NOT fire (0 BLOCKERs)
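
The per-kLoC figure in the results above can be reproduced by dividing executor cost by diff size. This is a minimal sketch that assumes only the 41 added lines count toward kLoC (the commit does not state the basis); `cost_per_kloc` is a hypothetical helper, not part of the harness.

```python
def cost_per_kloc(cost_usd: float, lines_changed: int) -> float:
    """Return cost normalised to 1,000 changed lines."""
    if lines_changed <= 0:
        raise ValueError("lines_changed must be positive")
    return cost_usd / lines_changed * 1000


# microsoft/apm#1541: ~$0.21 executor cost over +41 added lines (assumed basis).
small_pr = cost_per_kloc(0.21, 41)
print(f"${small_pr:.2f} per kLoC")  # prints "$5.12 per kLoC"
```

Because the executor overhead is roughly fixed, the same function applied to a larger diff yields a much lower per-kLoC figure, which is the inversion noted above.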

KEY EMPIRICAL VALIDATION of v0.3.4 PER-LENS DIFFERENTIATION:
The executor reflected per-lens against the CAPABILITY PROFILE template
and concluded 4/5 lenses genuinely TRIVIAL, but security lens was
INADEQUATE on TRIVIAL/Haiku — it surfaced a real MEDIUM bypass concern
but could not validate it without out-of-diff function body access.

This empirically generates the per-element justification the v0.3.4
corpus requires architects to record at design time. Recommended carve-out:
'Security lens uses Haiku when all referenced functions are in-diff;
escalates to REVIEWER with tool access when it must reason about
out-of-diff internals.'
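
The recommended carve-out reads as a simple set-difference rule. The sketch below is illustrative only: `bind_security_lens` and its arguments are hypothetical names, and it assumes the architect can enumerate which functions the lens must reason about.

```python
def bind_security_lens(referenced: set[str], in_diff: set[str]) -> str:
    """Pick a role class for the security lens per the carve-out."""
    out_of_diff = referenced - in_diff
    # Every referenced function body is visible in the diff: TRIVIAL (Haiku) suffices.
    if not out_of_diff:
        return "TRIVIAL"
    # The lens must read bodies outside the diff: escalate to REVIEWER with tool access.
    return "REVIEWER"


# The PR #1424 false positive involved a function defined outside the diff.
print(bind_security_lens({"_substitute_plugin_root"}, set()))  # prints "REVIEWER"
```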

Implication for PR #1424: security lens was likely mis-bound to TRIVIAL.
The blocker false-positive (_substitute_plugin_root alleged undefined,
refuted only via out-of-diff gh api lookup) is consistent with TRIVIAL-
class inadequacy on cross-file reasoning. A v0.3.4 re-architect would
correctly bind security to REVIEWER, expected cost delta +$0.50-1.00
per run with measurable security finding fidelity improvement.

REPORT updated with multi-scenario section (small-PR + different-skill).
Deferral list narrowed: full S1-S5 × {v0.2,v0.3+} matrix remains follow-up.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…Foundry docs

- model-catalog.md: add 'researcher' as a sixth role class for open-ended
  exploration (distinct from planner which optimises a stated goal, and from
  long-context-retriever which extracts from a known corpus). Capability
  profile: open-ended exploration with discovery-phase success criteria,
  multi-source synthesis, multi-hypothesis reasoning. Cost profile: highest
  tier + reasoning multiplier. Requires narrow-trigger discipline (cited
  STAKES) to bind, mirroring A12 arbiter discipline. Pattern matching !=
  research; if a rubric exists -> reviewer; if a plan exists -> planner.

- model-catalog.md: add 'OpenAI / GPT-5 family specifics' section grounded
  against Microsoft Learn Azure Foundry reasoning docs. Documents
  reasoning_effort values (none|minimal|low|medium|high|xhigh), defaults
  per SKU (gpt-5.1 defaults to none; gpt-5-pro defaults to and only supports
  high; xhigh only on models after gpt-5.1-codex-max), and the per-role-
  class binding table. Architect MUST declare reasoning_effort alongside
  SKU for OpenAI bindings (same SKU spans 2-3 role classes by effort).

- model-catalog.md: refresh all EXAMPLES sections with current GPT-5
  lineup (gpt-5.1, gpt-5-mini, gpt-5-pro, gpt-5-codex, gpt-5.1-codex-max)
  per Azure Foundry docs.

- design-patterns.md §B12: add researcher + long-context-retriever to the
  role-class enumeration in the SELECTION RULE.

- design-patterns.md §B16 EFFORT GOVERNOR: extend role-class to effort-
  level mapping table with researcher (high to xhigh) and long-context-
  retriever (low). Explicit anti-pattern: suppressing effort on researcher
  defeats the binding.
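
The rule that an architect must declare reasoning_effort alongside the SKU can be pictured as a validated record. This is a sketch under assumptions: `OpenAIBinding` is a hypothetical type, the allowed values and the gpt-5-pro restriction are taken from the commit message above, and the sample SKU/role/effort pairing is illustrative rather than a row from the binding table.

```python
from dataclasses import dataclass

# Values listed in the commit message above.
REASONING_EFFORTS = ("none", "minimal", "low", "medium", "high", "xhigh")


@dataclass(frozen=True)
class OpenAIBinding:
    sku: str
    role_class: str
    reasoning_effort: str  # declared explicitly, never left to the SKU default

    def __post_init__(self) -> None:
        if self.reasoning_effort not in REASONING_EFFORTS:
            raise ValueError(f"unknown reasoning_effort: {self.reasoning_effort!r}")
        # Per the commit message, gpt-5-pro defaults to and only supports "high".
        if self.sku == "gpt-5-pro" and self.reasoning_effort != "high":
            raise ValueError("gpt-5-pro only supports reasoning_effort='high'")


# Hypothetical pairing, for illustration only.
binding = OpenAIBinding(sku="gpt-5.1", role_class="reviewer", reasoning_effort="medium")
```

Making the field mandatory means the same SKU bound at two effort levels shows up as two distinct bindings, which matches the note that one SKU can span several role classes.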

Grounding source: https://learn.microsoft.com/azure/foundry/openai/how-to/reasoning

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ve-audit) × v0.2 vs v0.3.5

Six background Opus 4.7 dispatches, ~$30 architect spend. No executor
runs in this probe.

Scenarios chosen to stress different patterns:
- S1 apm-triage-panel: PANEL with arbiter -> B12 PER-LENS, A12 discipline
- S2 bulk-api-rename: STAFFED PLAN + batched edits -> S7 + B15 (CodeAct)
- S3 dependency-cve-audit: PIPELINE + per-CVE fan-out -> RESEARCHER class
  (v0.3.5), LONG-CONTEXT-RETRIEVER, multi-class binding

Three findings:

1. v0.2 architects identify the gap but can't fill it. All three v0.2 cells
   enumerated cost-aware patterns they wanted but couldn't cite. The
   taxonomy gap is real and felt by the architect persona.

2. v0.3.5 discipline produces heterogeneous outputs on heterogeneous
   inputs. S2 binds 1 element (IMPLEMENTER, BIND DOWN). S3 exercises 5
   of 6 role classes. S1 differentiates 3 TRIVIAL + 3 REVIEWER after
   enumerating 6 capability profiles. 'All-Haiku' from PR #1424 review
   was correct for that specific input, not the discipline's default.

3. RESEARCHER class fires once across three scenarios, exactly as
   designed. Two of three v0.3.5 cells explicitly REJECT it with cited
   rule ('rubric exists -> REVIEWER'). Only S3 binds it with full
   STAKES citation. Narrow-trigger discipline working.

S2 carries a concrete per-technique number: tools: [read, execute]
structurally excludes the edit tool, so the naive 50+-edit-turn
anti-pattern is impossible by construction. The apply-rename.sh script
batches the edits in one turn. Projected $0.05-0.10 per L run vs $0.50-1.20
naive (~10x saving from S7 + B15).
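
"Impossible by construction" means the edit tool is never offered to the agent, so there is nothing to misuse. The sketch below models that with a hypothetical allowlist check; the real harness enforces the subset through the agent's tools declaration, and `dispatch` is an invented name used only to illustrate the idea.

```python
ALLOWED_TOOLS = frozenset({"read", "execute"})  # mirrors tools: [read, execute]


def dispatch(tool: str, allowed: frozenset[str] = ALLOWED_TOOLS) -> str:
    """Reject any tool call outside the declared subset."""
    if tool not in allowed:
        raise PermissionError(f"tool {tool!r} is not in the agent's tool subset")
    return f"dispatched {tool}"


# The batched rename goes through a single execute call.
print(dispatch("execute"))  # prints "dispatched execute"
```

A call such as `dispatch("edit")` raises `PermissionError`, so a per-file edit loop cannot start.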

Artifacts: dev/empirical-proof/cross-scenario/{S1-triage,S2-rename,
S3-cve}-{v02,v035}/handoff.md (6 files, ~3700 lines total). Each
contains 3 mermaid diagrams, interface sketches, module composition
table, per-element model bindings, patterns cited, cost projections.

REPORT updated with new 'Cross-scenario architect A/B' section + the
'What this PR does NOT prove' deferrals narrowed (B15 attribution now
in scope, executor matrix still deferred).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
7 cells dispatched on real fixtures (S1 triage panel; S2 bulk
rename). Each cell's cost was measured or modeled from sub-agent
transcripts against Anthropic's published $/Mtok rates.

Per-pattern measured savings:
- B15 + S7 TOOL SUBSET + CodeAct: 75x on S2 ($3.97 -> $0.053)
- B12 PER-LENS ROUTING:           2.27x on S1 ($0.540 -> $0.238)
- B14 CAVEMAN BRIEF (proposed):   1.81x on severity lens, 75% verdict
                                  agreement, only upward escalations
- B16 EFFORT GOVERNOR:            1.72x + quality control (prevents
                                  TRIVIAL-lens severity inflation)
- B13 CACHE-AWARE PREFIX:         modeled ~2.5x on cached portion
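
The multipliers in the list above are plain before/after ratios. A short check of the two rows that quote both dollar figures (`saving_factor` is a hypothetical helper):

```python
def saving_factor(before_usd: float, after_usd: float) -> float:
    """Return how many times cheaper the 'after' cell is."""
    if after_usd <= 0:
        raise ValueError("after_usd must be positive")
    return before_usd / after_usd


s2 = saving_factor(3.97, 0.053)   # B15 + S7 on S2
s1 = saving_factor(0.540, 0.238)  # B12 PER-LENS ROUTING on S1
print(f"S2: {s2:.0f}x, S1: {s1:.2f}x")  # prints "S2: 75x, S1: 2.27x"
```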

REPORT rewritten in Minto-pyramid form (headline -> three takeaways
-> per-pattern table -> why -> per-pattern attribution -> proposed
sub-pattern -> caveats). The bloated chronological REPORT is
preserved as APPENDIX-iterative-history.md.

PR body mirrors the new REPORT structure: someone who has not
worked on the experiment can read it top-to-bottom and get the
measurement, the why, the per-pattern numbers, and the caveats.

CAVEMAN sub-experiment surfaces a candidate B14b CAVEMAN BRIEF
sub-pattern for TRIVIAL-class classifiers: 45% input-token saving,
zero downward severity errors, two defensible upward escalations.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…idated)

Promoted from the cost-attribution experiment's caveman cell:
44.6% input-token saving on TRIVIAL severity classification, 75%
agreement with verbose brief, zero downward errors. Gated to
TRIVIAL class only (REVIEWER+ keep prose).

Includes:
- WHEN gate (TRIVIAL + fixed schema output)
- mechanism + anchoring rule (extreme-bucket grounding)
- measured effect citation (cost-report.json reference)
- CAVEMAN ON REVIEWER anti-pattern

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add 'FinOps view: scenarios at a glance' table near top with
  per-scenario v0.2 vs v0.3.5 cost and links to underlying evidence
- Add ablations sub-table linking each B-pat cell artifact
- Add side-by-side mermaid diagrams for S1 (triage) and S2 (rename)
  showing v0.2 vs v0.3.5 architectures with pattern annotations
  on the exact nodes/edges where each pattern applies
- Hyperlink every pattern citation to its definition in
  design-patterns.md / architectural-patterns.md
- Hyperlink every measurement claim to its cost-report.json
- Self-contained: reader can verify any number without leaving
  this document

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Changes:
- Fix mermaid render error: quote all node labels containing brackets,
  pipes, parens (previous version used unquoted [read, execute] inside
  cylinder shape, parser failed)
- Add 'For a reader new to this work' prose section explaining what
  genesis is, the three cost drivers, and what models cost
- Refine all 4 architecture diagrams:
  * Annotate architect model + reasoning effort + role
  * Annotate every sub-agent with model AND why (e.g. why synthesizer
    is Sonnet not Haiku, why executor is Sonnet not Haiku)
  * Show exact sub-agent spawn count and tool call count per cell
  * Show TRIVIAL/REVIEWER routing as labelled edges with dispatch
    counts (24 TRIVIAL + 24 REVIEWER) instead of abstract symbols
  * Show context-growth as explicit annotation on v0.2 S2 diagram
- Add per-scenario explanatory prose: what the workflow does, why
  the architecture matters, why each model choice was made
- Hyperlink every pattern and every claim to its source

The report now reads top-to-bottom for someone with no prior context
on genesis, the patterns, or the experiment.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds dated empirical anchoring for Haiku / Sonnet / Opus role-class
choices on coding and autonomous-loop boundaries.

- model-catalog.md: two 3-line load triggers pointing to the new
  reference (file load + Routing-axes B12 citation).
- references/benchmark-grounding.md: cross-benchmark table
  (SWE-bench Verified, Terminal-Bench 2.1, Vals Index), SWE-bench
  task-length bucket view, and SONNET-AVERSION / HAIKU-PROMOTION
  cited as WRONG-PRIMITIVE BINDING instances at B12.

Architected by the genesis skill on Opus 4.7 (verified 2026-05-29).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nding refs

- S2 replaced with doc-audit (4 cells: zero-Opus, zero-Sonnet, v0.2, v0.3.5)
- Surface the +27% fan-out tax v0.3.5 pays on doc-audit honestly
- Add Why-we-observe subsection explaining the TRIVIAL-surface threshold
- Per-pattern table now shows B12 negative-net on doc-audit
- NOT-proven section calls out missing fan-out threshold derivation
- Appendix links new S2N cells + benchmark-grounding reference
- Old bulk-rename relabeled Scenario 3 (extreme floor case)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danielmeppiel danielmeppiel changed the title Empirical proof: v0.2.0 vs v0.3.0 panel — measured 1.66x cost reduction on microsoft/apm#1424 (FinOps report) Token economics in genesis: measured cost patterns with one honest counter-example May 29, 2026
…ion)

Adds top-level REAL-TELEMETRY-RESULTS.md as ground-truth absolute-cost
report. All numbers harvested from cloud session_store events table
(real usage_input_tokens / output_tokens / cache_read_tokens /
cache_write_tokens), priced at Anthropic public rates.

Matrix: 3 scenarios x 4 cells (zero-opus, zero-sonnet, v0.2 architected,
v0.3.6 architected). Architect sessions (Opus 4.7) NOT counted per
profiling protocol -- architecting is amortised across runs.

Headline findings (all measured, see file for cell-level cost-reports):
- S3 bulk rename: v0.2 per-file-edit anti-pattern ($33.79) costs about 3.2x
  the v0.3.6 S7 TOOL BRIDGE design ($10.40). v0.3.6 pattern WORKS.
- S2 doc-audit: v0.3.6 architect chose MONOLITHIC (rejecting fan-out
  via B12 SELECTION RULE); ties v0.2 ($9.80 vs $9.42).
- S1 PR review: v0.3.6 5-lens fan-out ($24.59) costs more than
  zero-sonnet ($8.89) -- architecture buys reviewability and
  class-routed quality, not absolute cost win.
- zero-opus is the real anti-pattern: 3-8x zero-sonnet cost with
  marginal quality justification on these workloads.

Files added:
- dev/empirical-proof/REAL-TELEMETRY-RESULTS.md (new)
- dev/empirical-proof/PROFILING-PROTOCOL.md (new)
- dev/empirical-proof/scenario-runs/results/{S1,S2N,S3}-*-real/cost-report.json (12 new)
- dev/empirical-proof/scenario-runs/results/REAL-COSTS-summary.csv (new)

REPORT.md banner now points readers to the new file first, retains
the size-modeled study as a pattern-isolation ablation (ratios valid,
absolute $ understated 50-200x due to ignored tool-surface overhead).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danielmeppiel danielmeppiel changed the title Token economics in genesis: measured cost patterns with one honest counter-example Token economics in genesis v0.3.6: real-telemetry validation across 3 scenarios × 4 cells May 29, 2026
danielmeppiel and others added 6 commits May 29, 2026 23:13
…RESULTS

Grade all 12 cell outputs against scenario-specific rubrics and compute
$/quality-unit. Adds the buyer-side ROI framing the cost-only matrix
was missing.

Headline takeaways:
- S3: zero-sonnet has best raw $/quality; v0.3.6 is 2x cost but
  eliminates the failure-mode tail (zero-opus failed task at $41,
  v0.2 burned 40 calls on what should be 1 shell call).
- S2: zero-sonnet best raw $/quality; v0.3.6 ships verifier-confirmed
  precision (6/6 HIGH confirmed, 0 downgrades).
- S1: v0.3.6 caught 2 supply-chain BLOCKER security findings the other
  3 cells missed. On severity-weighted ROI it ties zero-sonnet; on
  raw $/quality it is 1.85x more expensive.

Verdict reframe: v0.3.6 is insurance, not optimisation. Pays 1.5-2.5x
on workloads where it does not produce unique findings; on workloads
where it does (S1 supply-chain BLOCKERs, S3 anti-pattern rejection),
avoided cost is 10-100x the architecture premium.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add formal ROI definition (ROI_raw, ROI_weighted, ROI_tail) with
severity weights (BLOCKER=5, HIGH=3, MED=1, LOW=0.3) and a per-cell
12-row scorecard at the top of both REPORT.md and
REAL-TELEMETRY-RESULTS.md. Restructure prose grading to live below
the scorecard.

Buyer-facing math:
- ROI_raw = QualityScore / Cost
- ROI_weighted = (Sum severity_weight x findings) / Cost
- ROI_tail = QualityScore / (Cost + P(failure) x C_failure)
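
The three definitions translate directly into code. This is a minimal sketch using the severity weights stated above; the function names and the sample inputs are hypothetical and do not correspond to any cell in the scorecard.

```python
SEVERITY_WEIGHTS = {"BLOCKER": 5.0, "HIGH": 3.0, "MED": 1.0, "LOW": 0.3}


def roi_raw(quality_score: float, cost_usd: float) -> float:
    return quality_score / cost_usd


def roi_weighted(findings: dict[str, int], cost_usd: float) -> float:
    """findings maps severity name -> number of findings at that severity."""
    points = sum(SEVERITY_WEIGHTS[sev] * count for sev, count in findings.items())
    return points / cost_usd


def roi_tail(quality_score: float, cost_usd: float,
             p_failure: float, failure_cost_usd: float) -> float:
    # Expected failure cost is added to the denominator.
    return quality_score / (cost_usd + p_failure * failure_cost_usd)


# Hypothetical inputs, for illustration only.
print(roi_weighted({"BLOCKER": 2, "HIGH": 1}, 10.0))
print(roi_tail(8.0, 10.0, 0.1, 100.0))
```

ROI_tail equals ROI_raw whenever P(failure) is zero, which is why the two diverge only on cells with a visible failure-mode tail.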

Per-cell scorecard surfaces:
- S3-zero-opus ROI_raw 0.05 (failed task at $41)
- S3-zero-sonnet best ROI_raw at 2.08
- S1-v0.3.6 weighted=56pts (2x next best) — only cell catching
  2 supply-chain BLOCKERs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds concrete per-cell evidence for the dispatch-overhead claim:
- First-turn input by harness: 54K (Sonnet) / 75K (Opus)
- Per-spawn entry tax by agent type: 6K Haiku, 35K Sonnet sub, 54K orch
- Rolling re-send pattern (S3-v0.2): 54K→130K input growth, 290:1 I/O ratio
- I/O ratio table across all 10 cells with non-trivial telemetry

Mined from the assistant.usage stream in the events table for each cell's
chat session. The numbers are real billed usage and are reproducible by
re-querying with the chat_session_id committed in each cost-report.json.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three new sections answering buyer questions:
- Per-pattern attribution: which patterns moved which dollars
  (S7 + B12 + B15 + A12 load-bearing; B13/B14b/B16 catalogued but not
  pulling weight yet)
- v0.3.6 flaws and quick wins (v0.4 frontier — entry-tax visibility,
  dispatch coalescing A13, B14b CAVEMAN, B15 tier-aware Sonnet,
  loop-detector for B12)
- Reliability and predictability — the dimension raw ROI misses
  (variance bounds, auditability, repeatability) — argues v0.3.6
  is the only thing on the table for productionised/automated use

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Refactors the genesis skill so the B14b CAVEMAN BRIEF pattern actually
operates inside the agentic workflows genesis designs. Telemetry showed
v0.3.6 had 0/9 caveman briefs in S1 task() dispatches — caveman was a
catalogue stub, not a runtime contributor.

Core idea: AUDIENCE BOUNDARY (composition-substrate §7). Every artifact is
labelled INTERNAL or EXTERNAL; INTERNAL defaults to caveman, EXTERNAL to
normal prose; the boundary is artifact audience, not agent tier. Caveman was
designed for human↔assistant prose, but genesis spawns subagents the user
never sees. INTERNAL traffic is safe and ideal for caveman; EXTERNAL
traffic stays prose.
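
The boundary rule is a function of audience alone. A minimal sketch, with hypothetical names (`Audience`, `brief_style`) that are not taken from the skill's files:

```python
from enum import Enum


class Audience(Enum):
    INTERNAL = "internal"  # agent-to-agent traffic the user never sees
    EXTERNAL = "external"  # anything a human reads


def brief_style(audience: Audience) -> str:
    # Keyed on artifact audience only; agent tier is deliberately not an input.
    return "caveman" if audience is Audience.INTERNAL else "prose"


print(brief_style(Audience.INTERNAL))  # prints "caveman"
print(brief_style(Audience.EXTERNAL))  # prints "prose"
```

Leaving model tier out of the signature mirrors the statement that the boundary is artifact audience, not agent tier.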

Changes:
- composition-substrate.md §7 AUDIENCE BOUNDARY (substrate-level rule)
- design-patterns.md B14b expanded to canonical fidelity (drop list,
  preservation contract for code/paths/URLs/numbers/error strings,
  LITE/FULL/ULTRA intensities, auto-clarity exceptions, role-mode
  persistence, output-mode contract)
- design-patterns.md B14c CAVEMAN CHANNEL (orchestrates the boundary
  across multi-spawn workflows)
- SKILL.md: per-spawn declaration discipline + audience preamble
- references/audience-boundary.md NEW (load-on-demand audience matrix)
- assets/caveman-templates.md NEW (five brief templates + receipt
  schemas: severity, dup-oracle, label-picker, missing-info, style)
- pattern-tradeoffs.md: caveman-on-external anti-pattern
- refactor-patterns.md R6 AUDIENCE-BOUNDARY ENFORCE checklist
- model-catalog.md TRIVIAL row → B14b/B14c cross-link

ROI: concentrated on PANEL-shape workflows. S1-style ~43K input + ~14K
output saved/run (~$0.34 uncached). S2/S3-shape workflows correctly
gain 0; v0.3.7 does NOT pressure architects to invent spawns.
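
The ~$0.34 figure can be sanity-checked from the token counts. The rates below are an assumption (3 USD per million input tokens and 15 USD per million output tokens, with no caching), chosen because they reproduce the quoted number; the commit does not state which rates were used, and `Rates` and `cost_usd` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per million tokens. Values passed below are assumed, not from the source."""
    input: float
    output: float


def cost_usd(input_tokens: int, output_tokens: int, rates: Rates) -> float:
    return (input_tokens * rates.input + output_tokens * rates.output) / 1_000_000


assumed = Rates(input=3.0, output=15.0)
saved = cost_usd(43_000, 14_000, assumed)
print(f"${saved:.2f} saved per run")  # prints "$0.34 saved per run"
```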

Audit: files/caveman-audit.md
Design plan: files/v037-design-plan.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three handoff packets produced by Opus + genesis v0.3.7 architect runs.
These are the design-only outputs; execution will be done by separate
Sonnet executor sessions to replicate the v0.3.6 two-session pattern
(architect on opus, executor on sonnet) — apples-to-apples cost A/B.

Packets:
- S1-triage-v037/handoff.md: 5-spawn panel (CAVEMAN_FULL briefs)
- S2N-v037/handoff.md: monolithic A9+A7 (0 spawns, refusal validated)
- S3-rename-v037/handoff.md: S7 short-circuit (0 spawns)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three cold-start executor cells (S1, S2N, S3) committed with REAL
TELEMETRY harvested from the cloud events table for executor chat
sessions 82be8bfe / 41405aae / 99c365c2. Cost-report.json files are
authoritative (executor-only, architect-cost out-of-scope per the
established v0.3.6 methodology).

Headline numbers (executor-only, real telemetry):
  S1-v0.3.7  $18.53  (vs v0.3.6 $24.59,  -24.6%) quality 8/10
  S2N-v0.3.7 $7.10   (vs v0.3.6 $9.80,   -27.5%) quality 8/10 (parity)
  S3-v0.3.7  $14.30  (vs v0.3.6 $10.40,  +37.5%) quality 10/10 (fixture-rebuild overhead)

Caveman receipt fidelity (load-bearing v0.3.7 substrate validation):
  S1-v0.3.7  5/5 CAVEMAN_FULL on the wire (vs v0.3.6's 0/9 PROSE_LEAK)
  S2N/S3     0 spawns by design (AUDIENCE BOUNDARY refusal validated)

RTR.md updated:
  - matrix table: v0.3.7 column added per scenario
  - per-cell scorecard: 3 v0.3.7 rows + summary
  - per-pattern attribution: B14b promoted to empirically-load-bearing
  - flaw 3 (B14b not invoked): marked RESOLVED
  - new section: v0.3.7 cold-start packet fidelity test

PR body updated separately via gh pr edit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@danielmeppiel danielmeppiel merged commit 0abf6f0 into main May 30, 2026
4 checks passed