Token economics in genesis v0.3.6: real-telemetry validation across 3 scenarios × 4 cells #12
Merged
Replaces the analytical projection in scenario-pr-review-panel.md with ground-truth measurement from Copilot CLI per-turn telemetry.

Result (same panel job, same PR, same harness):
- Executor A (v0.2.0 design): 164 turns, 8.71M tokens, $5.01
- Executor B (v0.3.0 design): 89 turns, 4.07M tokens, $3.02
- 1.66x cheaper, no critical-finding regression

Adds:
- tools/profile-tokens.py (Copilot CLI usage-block parser)
- measurements/ (7 prior sessions of this work-stream)
- ab-experiment-apm-1424/ (the controlled A/B + REPORT.md)

The 15x headline from the prior projection is explicitly retracted: real reductions come mostly from B13 PROMPT THRIFT + tool-subset discipline (turn-count drop), not from model routing or aggressive cache tricks. Cache discipline (B12) is real and measurable at 90%+ in every session profiled, but it is an enabler, not a multiplier.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
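The headline ratio can be re-derived from the per-executor figures quoted in this commit. A minimal sketch follows; the `Run` dataclass and `ratio` helper are illustrative names only and are not part of tools/profile-tokens.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Totals for one executor session, as quoted in the commit message."""
    turns: int
    tokens_m: float   # total tokens, in millions
    cost_usd: float


def ratio(a: float, b: float) -> float:
    """How many times larger a is than b."""
    return a / b


exec_a = Run(turns=164, tokens_m=8.71, cost_usd=5.01)  # v0.2.0 design
exec_b = Run(turns=89, tokens_m=4.07, cost_usd=3.02)   # v0.3.0 design

cost_ratio = ratio(exec_a.cost_usd, exec_b.cost_usd)
turn_ratio = ratio(exec_a.turns, exec_b.turns)
token_ratio = ratio(exec_a.tokens_m, exec_b.tokens_m)

print(f"cost   {cost_ratio:.2f}x")
print(f"turns  {turn_ratio:.2f}x")
print(f"tokens {token_ratio:.2f}x")
```

On these figures the turn count and token volume shrink by more than the dollar cost does.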
Mirrors the PR #12 description so the ground-truth report is version-controlled alongside the data it cites. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Triggered by the empirical A/B in PR #12: Executor B declared role classes per agentic element but bound every element to Sonnet 4 (the session default), missing B12 MODEL ROUTER. Root cause: the v0.3.0 corpus did not say loudly enough that on Copilot, the per-element binding site for model/tools is .agent.md custom-agent frontmatter -- SKILL.md does NOT accept those fields and silently ignores them. Fixes across three files: skills/genesis/assets/runtime-affordances/per-harness/copilot.md - Cite canonical custom-agents-configuration docs URL - PERSONA SCOPING FILE section: enumerate full .agent.md frontmatter (model, tools, target, disable-model-invocation, user-invocable, mcp-servers, metadata) with field-by-field explanation pointing each cost lever at its binding site - MODULE ENTRYPOINT (SKILL) section: explicit IMPORTANT block that SKILL.md does NOT support model: or tools:; spell out the architectural consequence (restructure as .agent.md if per-element binding is needed) - Section 9 Cost-pattern bindings: rewrite B12, B15, B16 to name .agent.md as the BINDING SITE with an explicit SKILL-LEVEL ROUTING ATTEMPT anti-pattern skills/genesis/assets/runtime-affordances/model-catalog.md - 'What this file does NOT do': add bullet noting binding site is harness-specific and the architect must consult the per-harness adapter, citing PR #12 as the failure mode if missed - 'How per-harness adapters extend this file': require adapters to NAME THE PER-ELEMENT BINDING SITE explicitly skills/genesis/assets/design-patterns.md - B12 MODEL ROUTER: new WRONG-PRIMITIVE BINDING anti-pattern with PR #12 as the worked example - B15 TOOL SUBSET: same anti-pattern (mirrors B12 failure mode) Companion PR description updated to flag the architect-B miss explicitly and to propose the B12-firing rerun as the next iteration. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Third controlled run, fixed-corpus rerun: - Architect C (v0.3.1, fixed binding-site corpus) produced a 339-line handoff with explicit per-element MODEL BINDING TABLE: 6 of 7 sites bound to non-default SKUs (5 lenses + arbiter, planner=Opus, trivial=Haiku, others=Sonnet). - Executor C honored the binding by passing model: to each task sub-agent dispatch. Telemetry confirms 3 distinct models in real billing: 15 turns on claude-haiku-4.5, 37 on claude-sonnet-4.6, 29 on claude-opus-4.7. - Output quality parity with Exec A and B: catches the LSP env-RCE CRITICAL plus 1 HIGH + 9 MEDIUM + 9 LOW. Key empirical finding: B12 MODEL ROUTER fires as designed at the sub-agent level (~10x cheaper per-turn on Haiku-bound trivial lenses) but the orchestrator's session-default model dominates total cost when it is more expensive than the routed sub-agents. In this run the executor session ran on Opus 4.7 by default; the 29 orchestrator turns alone cost $6.14, swamping the $2 saved by Haiku routing. Counterfactual with orchestrator bound to Sonnet (matching handoff intent): $2.73, which would be 10% cheaper than Exec B's $3.02. This generates a new corpus lesson for v0.3.2: B12 must include the orchestrator thread as a binding site. PR description (and REPORT.md mirror) updated with full 3-way FinOps analysis and the proposed v0.3.2 corpus addition. Artifacts: - architect-C-v0.3.1-handoff.md - executor-C-v0.3.1-review.md - executor-C-tokens.json - executor-C-process.log.gz (3MB, 19MB uncompressed) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…lled orchestrator The earlier 3-cell A/B/C runs all had Opus session-default orchestrators that masked the corpus-level signal. This commit adds a clean 2-cell A/B with the only variable being the genesis corpus version: - Both architects: claude-opus-4.7 (design tier) - Both executor orchestrators: claude-sonnet-4.6 (pinned) - Same target PR: microsoft/apm#1424 - Same lens count: 5 (correctness, security, performance, style, test-coverage) Result: - Cell D (v0.1 baseline, no cost-aware corpus): $11.77 total, 52 findings, 6 BLOCKER - Cell E (v0.3.1 treatment, B12 fires at 7 sites): $14.68 total (+25%), 61 findings, 4 CRITICAL + 10 more above threshold Cell E catches 2 additional CRITICAL security findings Cell D missed (SEC-001 TOCTOU symlink race, SEC-002 validated-object discarded). Honest reframing: in Copilot CLI where task(explore) defaults to Haiku, B12 promotes lenses UP from Haiku to Sonnet, which is a quality-routing knob, NOT a cost-reduction knob. The v0.3.1 corpus over-applies B12 by encouraging explicit binding at every .agent.md primitive. v0.3.2 must add a B12 SELECTION RULE: bind explicitly only when stakes, portability, or operator economic preference justifies it; otherwise trust the harness default. Artifacts added: - architect-D-v0.1-handoff.md, architect-E-v0.3.1-handoff.md - executor-D-v0.1-review.md, executor-E-v0.3.1-review.md - executor-D-findings.json, executor-E-findings.json - architect-D-process.log.gz, executor-D-process.log.gz - architect-E-process.log.gz, executor-E-process.log.gz - tools/profile-per-model.py (new per-model attribution profiler) - REPORT.md updated to v4 with the clean 2-cell story PR #12 body updated to match. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After Cell F (v0.3.2 SELECTION RULE) only got to +7% vs v0.1 baseline, RCA #2 named the residual cost driver: synth-heavy dispatched to Opus by default for cross-lens adjudication = HEAVY ADJUDICATOR anti-pattern.

v0.3.2.1 corpus edit (architectural-patterns.md §A12):
- HEAVY ADJUDICATOR anti-pattern: synthesis that reconciles already-produced lens findings is reviewer-class, not planner-class.
- "WHERE THE HEAVY ROLE BELONGS" cure paragraph: bind planner only on rare, narrow triggers (>=2 BLOCKERs + contradictory + same diff hunk; expected firing rate ~2-4%).
- Cell F named as the empirical canonical case.

Cell G result on microsoft/apm#1424:
- Architect G (Opus, 20 turns): $7.34
- Executor G (Sonnet orch, 179 turns): $2.85 -> 115 Haiku ($0.91) + 64 Sonnet ($1.93) + 0 Opus ($0)
- TOTAL: $10.19 (-13.4% vs Cell D baseline $11.77)
- Opus arbiter correctly stayed dark (narrow trigger not met)
- Bug-finding parity with Cell D (same class of issues)
- 2 false-positive BLOCKERs caught and downgraded via inline gh-api verification (Sonnet orchestrator), avoiding wasteful Opus dispatch

Final cell table:
| Cell | Corpus   | Total  | Delta vs v0.1   |
| D    | v0.1     | $11.77 | baseline        |
| E    | v0.3.1   | $14.68 | +24.7% (FAIL)   |
| F    | v0.3.2   | $12.63 | +7.3% (PARTIAL) |
| G    | v0.3.2.1 | $10.19 | -13.4% (WIN)    |

This commit ships:
- The v0.3.2.1 corpus edit (HEAVY ADJUDICATOR anti-pattern in A12)
- Cell F + G architect handoffs, executor reviews, gzipped process logs, per-lens findings JSONs
- REPORT.md rewritten to v5 with full D/E/F/G arc, iteration narrative, mermaid diagrams for Cell G (winner) and Cell E (over-bound failure), and named load-bearing corpus edits

Corpus claim now empirically grounded: the cost-aware genesis corpus produces designs measurably cheaper than the cost-unaware v0.1 baseline, with parity on bug-finding quality.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
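The "Delta vs v0.1" column can be reproduced from the four totals alone. A small sketch, assuming the delta is a plain percentage change against Cell D; the function and variable names are illustrative.

```python
def delta_vs_baseline(total: float, baseline: float) -> float:
    """Signed percentage change of total relative to baseline."""
    return (total - baseline) / baseline * 100.0


# Totals (architect + executor) as listed in the final cell table.
cell_totals = {"D": 11.77, "E": 14.68, "F": 12.63, "G": 10.19}
baseline = cell_totals["D"]

deltas = {
    cell: round(delta_vs_baseline(total, baseline), 1)
    for cell, total in cell_totals.items()
}

for cell, pct in deltas.items():
    print(f"{cell}: {pct:+.1f}%")
```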
…iscipline

Drift correction surfaced in PR review iteration. Cell G (v0.3.2.1) landed on a -45% executor-cost shape by OMITTING model: and relying on the Copilot CLI harness default for task(agent_type='explore'). That works on Copilot CLI today but is NOT engineering -- it is hope-based: not portable across harnesses, not predictable across harness versions, no audit trail.

The actual cost anti-pattern is BIND-UP-WITHOUT-JUSTIFICATION (forcing IMPLEMENTER/PLANNER on role classes whose work TRIVIAL/REVIEWER would meet -- Cell E v0.3.1, +25% cost vs baseline). Explicit binding that matches the harness default is PREDICTABILITY DISCIPLINE, not ceremony.

Changes:
- skills/genesis/assets/design-patterns.md §B12: rule 3 of SELECTION RULE rewritten -- DEFAULT == REQUIRED case is now BIND EXPLICITLY (predictability / portability / audit-trail); OMIT only as exception. CONSEQUENCE paragraph rewritten -- well-designed B12 has MOST elements bound. CEREMONIAL BINDING anti-pattern replaced by BIND-UP-WITHOUT-JUSTIFICATION (still cites Cell E as in-corpus case); CEREMONIAL BINDING narrowed to copy-pasted bulk bindings without per-element role-class distinction.
- skills/genesis/assets/runtime-affordances/per-harness/copilot.md §9 B12 site: reframed 'CEREMONY' bullet to 'PREDICTABILITY DISCIPLINE', listing portability + predictability + audit-trail as reasons to bind explicitly even when it matches the default.
- skills/genesis/examples/06-cost-aware-panel.md: PROVENANCE WARNING rewritten -- explains the Cell E vs F/G distinction and steers architects toward explicit model: declarations even on Copilot CLI.

Predictability probe (3 explore dispatches, $0.19 Haiku): task(agent_type='explore') fired claude-haiku-4.5 reliably across trivial/medium/complex prompts on Copilot CLI. Harness default IS stable for complexity TODAY. BUT portability across harnesses (claude-code, opencode, codex, cursor) is NOT verified -- explicit binding is the only portable discipline.

NOT in this commit: - examples/04, references/cost-economics-process.md, token- economics.md (no stale CEREMONIAL BINDING references found). - Genesis-audits-genesis pass (running async). - Re-run with v0.3.3 corpus. - REPORT.md v6 with executor-only framing. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ty probe, surgical bloat removals (-248 lines)
Three concurrent corrections shipped together:
1. REPORT v6 -- EXECUTOR-ONLY cost framing.
- Headline corrected from -13% (architect+executor bundled) to
-45% (executor-only), the number the operator actually pays
per run. Architect amortizes once across many runs of the
designed workflow.
- v5 was wrong by ~3x in the conservative direction; this
correction tightens the empirical claim.
- Predictability probe ($0.19, 3 explore dispatches, all fired
Haiku) is documented and validates Cell G's OMIT tactic as
currently-not-broken on Copilot CLI -- not as recommended.
- v0.3.3 reframe rationale explained: 'bind explicitly for
PREDICTABILITY + PORTABILITY + AUDIT TRAIL' is the actual
discipline; OMIT was a Copilot-CLI-only tactic that worked
by accident. No re-run required because routing is identical
(TRIVIAL -> claude-haiku-4.5 on Copilot CLI today, declared
explicitly vs inherited).
- Per-technique attribution and multi-scenario variance
explicitly DEFERRED to follow-up PRs.
2. Genesis-audits-genesis pass -- surgical bloat removals.
- Auditor (Opus 4.7, single architect cell) applied the
genesis skill to audit the genesis corpus.
- Recommendations: -720 to -930 lines projected; -248 lines
applied this PR (other consolidations deferred for safety).
- Removals: stance prose compression, step 3.2 sub-block
collapse, copilot.md cost-pattern bindings -> table form,
scaffolding removal across 3 files, B12 CONSEQUENCE block
(pure restatement), war-story citation overcount.
- Full audit at dev/empirical-proof/audit-v0.3.3/removal-list.md
for follow-up consideration.
3. Continued v0.3.3 reframe propagation.
- examples/06-cost-aware-panel.md PROVENANCE WARNING and
dollar arithmetic removed (worked example reduced to
qualitative pattern citations; concrete numbers belong in
cost-economics-process.md step 6 template, not duplicated
inline per example).
Net corpus delta (this commit): -468 +311 = -157 lines (corpus
itself, not counting REPORT/probe/audit artifacts).
Cumulative PR corpus delta: now ~+1700 net (was +1946 before
audit).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… appendix Per operator feedback: the PR story is baseline vs head; intermediate corpus iterations (v0.3.1, v0.3.2) belong in an appendix, not the headline. Learnings preserved (RCA #1 lens fan-out leak, RCA #2 synth-heavy adjudicator leak, Cell E architecture diagram) — they are how the BIND-UP-WITHOUT-JUSTIFICATION and HEAVY ADJUDICATOR anti-patterns were discovered empirically rather than from first principles. Confounded 3-cell A/B/C history moved to Appendix B. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Corpus additions: - assets/architectural-patterns.md §A1 PANEL: add UNDIFFERENTIATED LENS BINDING anti-pattern. Forces per-lens CAPABILITY PROFILE enumeration before binding (cross-file reasoning? STAKES-weighted output? multi-step proof?). - assets/design-patterns.md §B12 BULK IDENTICAL BINDING variant: strengthened to fire in BOTH directions (bulk-UP and bulk-DOWN). Cost direction is not the anti-pattern; lack of per-element reasoning is. Per-element CAPABILITY PROFILE template added; cross-references A1 PANEL. REPORT additions: - New Per-technique attribution section: isolates B12 (-$2.16 preventative), A12 (-$3.95 active, dominant win), B13 ($14 defensive) from existing 4-cell D/E/F/G data via pairwise cell deltas. Honest framing: D->G -45% is mostly A12, not B12 (D accidentally inherited Haiku via harness default). - New v0.3.4 PER-LENS DIFFERENTIATION subsection: explains the corpus change and why uniform Haiku binding on the 5 advisory lenses remains correct for this skill (after enumeration) but would differ on a verdict-emitting skill. - Updated 'What this PR proves' (7 numbered points; per-technique + PER-LENS now included) and 'What this PR does NOT prove' (per-technique removed; B14/B15/B16 ablations + multi-scenario + cross-harness retained as explicit deferrals). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ED bindings Dispatched Opus 4.7 architect on a deliberately different skill type (release-notes-generator: A2 PIPELINE, 50-commit input, mixed sub-tasks). Same persona, same v0.3.4 corpus, different problem shape. Result: role-class distribution DIFFERENTIATED, not uniform. - 2 TRIVIAL (classifier E1, batched bug-fix one-liners E4) - 2 IMPLEMENTER (orchestrator E0, feature prose E2) - 2 REVIEWER (breaking-change prose E3, consistency pass E5) - 0 PLANNER Architect explicitly flagged BIND-UP-WITHOUT-JUSTIFICATION risk on E4: 'wrongly slap-binding [30 bug-fix one-liners] to sonnet under release-notes is user-facing would be BIND-UP-WITHOUT-JUSTIFICATION and would inflate the per-run cost by roughly 30 premium requests.' The v0.3.4 corpus discipline working in the wild. Predicted L-scenario cost: ~$0.18-0.25 per run. A12 GRADIENT savings vs flat-sonnet hypothetical: ~29 premium requests per run. This validates that the PR-review panel's uniform Haiku binding is NOT rubber-stamping; it is the discipline working correctly on uniform-profile inputs. When profiles are heterogeneous (this run), the discipline produces heterogeneous bindings. Handoff packet persisted at dev/empirical-proof/scenario-release-notes/. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…S discipline Dispatched v0.3+ panel against microsoft/apm#1541 (+41/-2, 2 files, small CLI fix). Results: - Executor cost: ~$0.21 (vs $2.85 on PR #1424) - Per-kLoC: $5.12 (vs $1.15 baseline) — fixed Sonnet-executor overhead dominates at small scale (~93% of total cost) - Panel cost shape holds in dollar terms; per-kLoC ratio inverted - Arbiter trigger correctly did NOT fire (0 BLOCKERs) KEY EMPIRICAL VALIDATION of v0.3.4 PER-LENS DIFFERENTIATION: The executor reflected per-lens against the CAPABILITY PROFILE template and concluded 4/5 lenses genuinely TRIVIAL, but security lens was INADEQUATE on TRIVIAL/Haiku — it surfaced a real MEDIUM bypass concern but could not validate it without out-of-diff function body access. This empirically generates the per-element justification the v0.3.4 corpus requires architects to record at design time. Recommended carve-out: 'Security lens uses Haiku when all referenced functions are in-diff; escalates to REVIEWER with tool access when it must reason about out-of-diff internals.' Implication for PR #1424: security lens was likely mis-bound to TRIVIAL. The blocker false-positive (_substitute_plugin_root alleged undefined, refuted only via out-of-diff gh api lookup) is consistent with TRIVIAL- class inadequacy on cross-file reasoning. A v0.3.4 re-architect would correctly bind security to REVIEWER, expected cost delta +$0.50-1.00 per run with measurable security finding fidelity improvement. REPORT updated with multi-scenario section (small-PR + different-skill). Deferral list narrowed: full S1-S5 × {v0.2,v0.3+} matrix remains follow-up. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
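The recommended security-lens carve-out can be written as a small binding rule. This is a sketch under stated assumptions: the `RoleClass` enum, `bind_security_lens`, and the idea of passing in sets of function names are hypothetical illustrations, not genesis corpus APIs.

```python
from enum import Enum


class RoleClass(Enum):
    TRIVIAL = "trivial"      # Haiku-bound in the runs described above
    REVIEWER = "reviewer"    # escalation target, with tool access


def bind_security_lens(referenced_functions: set[str],
                       in_diff_functions: set[str]) -> RoleClass:
    """Stay TRIVIAL only when every function the lens must reason about
    has its body inside the diff; otherwise escalate to REVIEWER."""
    out_of_diff = referenced_functions - in_diff_functions
    return RoleClass.REVIEWER if out_of_diff else RoleClass.TRIVIAL


# Hypothetical inputs: one helper lives in the diff, one does not.
small_fix = bind_security_lens({"parse_args"}, {"parse_args", "main"})
cross_file = bind_security_lens(
    {"parse_args", "_substitute_plugin_root"}, {"parse_args"}
)
print(small_fix.name, cross_file.name)
```

Keeping the rule as an explicit predicate also gives the architect something concrete to record as the per-element justification at design time.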
…Foundry docs - model-catalog.md: add 'researcher' as a sixth role class for open-ended exploration (distinct from planner which optimises a stated goal, and from long-context-retriever which extracts from a known corpus). Capability profile: open-ended exploration with discovery-phase success criteria, multi-source synthesis, multi-hypothesis reasoning. Cost profile: highest tier + reasoning multiplier. Requires narrow-trigger discipline (cited STAKES) to bind, mirroring A12 arbiter discipline. Pattern matching != research; if a rubric exists -> reviewer; if a plan exists -> planner. - model-catalog.md: add 'OpenAI / GPT-5 family specifics' section grounded against Microsoft Learn Azure Foundry reasoning docs. Documents reasoning_effort values (none|minimal|low|medium|high|xhigh), defaults per SKU (gpt-5.1 defaults to none; gpt-5-pro defaults to and only supports high; xhigh only on models after gpt-5.1-codex-max), and the per-role- class binding table. Architect MUST declare reasoning_effort alongside SKU for OpenAI bindings (same SKU spans 2-3 role classes by effort). - model-catalog.md: refresh all EXAMPLES sections with current GPT-5 lineup (gpt-5.1, gpt-5-mini, gpt-5-pro, gpt-5-codex, gpt-5.1-codex-max) per Azure Foundry docs. - design-patterns.md §B12: add researcher + long-context-retriever to the role-class enumeration in the SELECTION RULE. - design-patterns.md §B16 EFFORT GOVERNOR: extend role-class to effort- level mapping table with researcher (high to xhigh) and long-context- retriever (low). Explicit anti-pattern: suppressing effort on researcher defeats the binding. Grounding source: https://learn.microsoft.com/azure/foundry/openai/how-to/reasoning Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
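The rule that an architect must declare reasoning_effort alongside the SKU lends itself to a simple lint. A minimal sketch, assuming only the facts stated in this commit message (the six effort values, the gpt-5.1 default, and the gpt-5-pro restriction); `check_binding` and the two lookup tables are hypothetical names, and the xhigh availability rule is deliberately not modelled.

```python
from typing import Optional

EFFORT_VALUES = ("none", "minimal", "low", "medium", "high", "xhigh")

# Per-SKU facts stated in the commit message; other SKUs are not modelled.
SKU_DEFAULT_EFFORT = {"gpt-5.1": "none", "gpt-5-pro": "high"}
SKU_ONLY_EFFORT = {"gpt-5-pro": "high"}


def check_binding(sku: str, reasoning_effort: Optional[str]) -> list[str]:
    """Return a list of problems with an OpenAI binding (empty means OK)."""
    problems: list[str] = []
    if reasoning_effort is None:
        default = SKU_DEFAULT_EFFORT.get(sku, "unknown")
        problems.append(
            f"{sku}: reasoning_effort not declared (would fall back to {default})"
        )
        return problems
    if reasoning_effort not in EFFORT_VALUES:
        problems.append(f"{sku}: unknown reasoning_effort {reasoning_effort!r}")
    only = SKU_ONLY_EFFORT.get(sku)
    if only is not None and reasoning_effort != only:
        problems.append(f"{sku}: only supports reasoning_effort={only!r}")
    return problems


print(check_binding("gpt-5.1", None))
print(check_binding("gpt-5-pro", "low"))
```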
…ve-audit) × v0.2 vs v0.3.5
Six background Opus 4.7 dispatches, ~$30 architect spend. No executor
runs in this probe.
Scenarios chosen to stress different patterns:
- S1 apm-triage-panel: PANEL with arbiter -> B12 PER-LENS, A12 discipline
- S2 bulk-api-rename: STAFFED PLAN + batched edits -> S7 + B15 (CodeAct)
- S3 dependency-cve-audit: PIPELINE + per-CVE fan-out -> RESEARCHER class
(v0.3.5), LONG-CONTEXT-RETRIEVER, multi-class binding
Three findings:
1. v0.2 architects identify the gap, can't fill it. All three v0.2 cells
enumerated cost-aware patterns they wanted but couldn't cite. The
taxonomy gap is real and felt by the architect persona.
2. v0.3.5 discipline produces heterogeneous outputs on heterogeneous
inputs. S2 binds 1 element (IMPLEMENTER, BIND DOWN). S3 exercises 5
of 6 role classes. S1 differentiates 3 TRIVIAL + 3 REVIEWER after
enumerating 6 capability profiles. 'All-Haiku' from PR #1424 review
was correct for that specific input, not the discipline's default.
3. RESEARCHER class fires once across three scenarios, exactly as
designed. Two of three v0.3.5 cells explicitly REJECT it with cited
rule ('rubric exists -> REVIEWER'). Only S3 binds it with full
STAKES citation. Narrow-trigger discipline working.
S2 carries a concrete per-technique number: tools: [read, execute]
structurally excludes the edit tool so the naive 50+-edit-turn
anti-pattern is impossible by construction. apply-rename.sh script
batches in one turn. Projected $0.05-0.10 per L run vs $0.50-1.20
naive (~10x saving from S7 + B15).
Artifacts: dev/empirical-proof/cross-scenario/{S1-triage,S2-rename,
S3-cve}-{v02,v035}/handoff.md (6 files, ~3700 lines total). Each
contains 3 mermaid diagrams, interface sketches, module composition
table, per-element model bindings, patterns cited, cost projections.
REPORT updated with new 'Cross-scenario architect A/B' section + the
'What this PR does NOT prove' deferrals narrowed (B15 attribution now
in scope, executor matrix still deferred).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
7 cells dispatched on real fixtures (S1 triage panel; S2 bulk
rename). Each cell measured / modeled cost from sub-agent transcripts
against Anthropic published $/Mtok rates.
Per-pattern measured savings:
- B15 + S7 TOOL SUBSET + CodeAct: 75x on S2 ($3.97 -> $0.053)
- B12 PER-LENS ROUTING: 2.27x on S1 ($0.540 -> $0.238)
- B14 CAVEMAN BRIEF (proposed): 1.81x on severity lens, 75% verdict
agreement, only upward escalations
- B16 EFFORT GOVERNOR: 1.72x + quality control (prevents
TRIVIAL-lens severity inflation)
- B13 CACHE-AWARE PREFIX: modeled ~2.5x on cached portion
REPORT rewritten in Minto-pyramid form (headline -> three takeaways
-> per-pattern table -> why -> per-pattern attribution -> proposed
sub-pattern -> caveats). The bloated chronological REPORT is
preserved as APPENDIX-iterative-history.md.
PR body mirrors the new REPORT structure: someone who has not
worked on the experiment can read it top-to-bottom and get the
measurement, the why, the per-pattern numbers, and the caveats.
CAVEMAN sub-experiment surfaces a candidate B14b CAVEMAN BRIEF
sub-pattern for TRIVIAL-class classifiers: 45% input-token saving,
zero downward severity errors, two defensible upward escalations.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…idated) Promoted from the cost-attribution experiment's caveman cell: 44.6% input-token saving on TRIVIAL severity classification, 75% agreement with verbose brief, zero downward errors. Gated to TRIVIAL class only (REVIEWER+ keep prose). Includes: - WHEN gate (TRIVIAL + fixed schema output) - mechanism + anchoring rule (extreme-bucket grounding) - measured effect citation (cost-report.json reference) - CAVEMAN ON REVIEWER anti-pattern Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add 'FinOps view: scenarios at a glance' table near top with per-scenario v0.2 vs v0.3.5 cost and links to underlying evidence - Add ablations sub-table linking each B-pat cell artifact - Add side-by-side mermaid diagrams for S1 (triage) and S2 (rename) showing v0.2 vs v0.3.5 architectures with pattern annotations on the exact nodes/edges where each pattern applies - Hyperlink every pattern citation to its definition in design-patterns.md / architectural-patterns.md - Hyperlink every measurement claim to its cost-report.json - Self-contained: reader can verify any number without leaving this document Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Changes:
- Fix mermaid render error: quote all node labels containing brackets,
pipes, parens (previous version used unquoted [read, execute] inside
cylinder shape, parser failed)
- Add 'For a reader new to this work' prose section explaining what
genesis is, the three cost drivers, and what models cost
- Refine all 4 architecture diagrams:
* Annotate architect model + reasoning effort + role
* Annotate every sub-agent with model AND why (e.g. why synthesizer
is Sonnet not Haiku, why executor is Sonnet not Haiku)
* Show exact sub-agent spawn count and tool call count per cell
* Show TRIVIAL/REVIEWER routing as labelled edges with dispatch
counts (24 TRIVIAL + 24 REVIEWER) instead of abstract symbols
* Show context-growth as explicit annotation on v0.2 S2 diagram
- Add per-scenario explanatory prose: what the workflow does, why
the architecture matters, why each model choice was made
- Hyperlink every pattern and every claim to its source
The report now reads top-to-bottom for someone with no prior context
on genesis, the patterns, or the experiment.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds dated empirical anchoring for Haiku / Sonnet / Opus role-class choices on coding and autonomous-loop boundaries. - model-catalog.md: two 3-line load triggers pointing to the new reference (file load + Routing-axes B12 citation). - references/benchmark-grounding.md: cross-benchmark table (SWE-bench Verified, Terminal-Bench 2.1, Vals Index), SWE-bench task-length bucket view, and SONNET-AVERSION / HAIKU-PROMOTION cited as WRONG-PRIMITIVE BINDING instances at B12. Architected by the genesis skill on Opus 4.7 (verified 2026-05-29). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nding refs - S2 replaced with doc-audit (4 cells: zero-Opus, zero-Sonnet, v0.2, v0.3.5) - Surface the +27% fan-out tax v0.3.5 pays on doc-audit honestly - Add Why-we-observe subsection explaining the TRIVIAL-surface threshold - Per-pattern table now shows B12 negative-net on doc-audit - NOT-proven section calls out missing fan-out threshold derivation - Appendix links new S2N cells + benchmark-grounding reference - Old bulk-rename relabeled Scenario 3 (extreme floor case) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ion)
Adds top-level REAL-TELEMETRY-RESULTS.md as ground-truth absolute-cost
report. All numbers harvested from cloud session_store events table
(real usage_input_tokens / output_tokens / cache_read_tokens /
cache_write_tokens), priced at Anthropic public rates.
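Pricing a session from those four counters is a weighted sum. The sketch below is illustrative only: the rates are placeholders rather than Anthropic's published prices, the events are made up, and it assumes the four counters are disjoint (the source does not say whether usage_input_tokens already includes cache reads).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per million tokens. Values used below are placeholders."""
    input: float
    output: float
    cache_read: float
    cache_write: float


def price_events(events: list[dict], rates: Rates) -> float:
    """Sum the cost of usage events, treating the four counters as disjoint."""
    total = 0.0
    for e in events:
        total += e.get("usage_input_tokens", 0) * rates.input
        total += e.get("output_tokens", 0) * rates.output
        total += e.get("cache_read_tokens", 0) * rates.cache_read
        total += e.get("cache_write_tokens", 0) * rates.cache_write
    return total / 1_000_000


# Placeholder rates and made-up events, for illustration only.
demo_rates = Rates(input=1.0, output=5.0, cache_read=0.1, cache_write=1.25)
demo_events = [
    {"usage_input_tokens": 2_000, "output_tokens": 500,
     "cache_read_tokens": 50_000, "cache_write_tokens": 4_000},
    {"usage_input_tokens": 1_000, "output_tokens": 300,
     "cache_read_tokens": 56_000, "cache_write_tokens": 0},
]
demo_cost = price_events(demo_events, demo_rates)
print(f"${demo_cost:.4f}")
```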
Matrix: 3 scenarios x 4 cells (zero-opus, zero-sonnet, v0.2 architected,
v0.3.6 architected). Architect sessions (Opus 4.7) NOT counted per
profiling protocol -- architecting is amortised across runs.
Headline findings (all measured, see file for cell-level cost-reports):
- S3 bulk rename: v0.2 per-file-edit anti-pattern ($33.79) costs ~3.2x
the v0.3.6 S7 TOOL BRIDGE design ($10.40). v0.3.6 pattern WORKS.
- S2 doc-audit: v0.3.6 architect chose MONOLITHIC (rejecting fan-out
via B12 SELECTION RULE); ties v0.2 ($9.80 vs $9.42).
- S1 PR review: v0.3.6 5-lens fan-out ($24.59) costs more than
zero-sonnet ($8.89) -- architecture buys reviewability and
class-routed quality, not absolute cost win.
- zero-opus is the real anti-pattern: 3-8x zero-sonnet cost with
marginal quality justification on these workloads.
Files added:
- dev/empirical-proof/REAL-TELEMETRY-RESULTS.md (new)
- dev/empirical-proof/PROFILING-PROTOCOL.md (new)
- dev/empirical-proof/scenario-runs/results/{S1,S2N,S3}-*-real/cost-report.json (12 new)
- dev/empirical-proof/scenario-runs/results/REAL-COSTS-summary.csv (new)
REPORT.md banner now points readers to the new file first, retains
the size-modeled study as a pattern-isolation ablation (ratios valid,
absolute $ understated 50-200x due to ignored tool-surface overhead).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…RESULTS Grade all 12 cell outputs against scenario-specific rubrics and compute $/quality-unit. Adds the buyer-side ROI framing the cost-only matrix was missing. Headline takeaways: - S3: zero-sonnet has best raw $/quality; v0.3.6 is 2x cost but eliminates the failure-mode tail (zero-opus failed task at $41, v0.2 burned 40 calls on what should be 1 shell call). - S2: zero-sonnet best raw $/quality; v0.3.6 ships verifier-confirmed precision (6/6 HIGH confirmed, 0 downgrades). - S1: v0.3.6 caught 2 supply-chain BLOCKER security findings the other 3 cells missed. On severity-weighted ROI it ties zero-sonnet; on raw $/quality it is 1.85x more expensive. Verdict reframe: v0.3.6 is insurance, not optimisation. Pays 1.5-2.5x on workloads where it does not produce unique findings; on workloads where it does (S1 supply-chain BLOCKERs, S3 anti-pattern rejection), avoided cost is 10-100x the architecture premium. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add formal ROI definition (ROI_raw, ROI_weighted, ROI_tail) with severity weights (BLOCKER=5, HIGH=3, MED=1, LOW=0.3) and a per-cell 12-row scorecard at the top of both REPORT.md and REAL-TELEMETRY-RESULTS.md. Restructure prose grading to live below the scorecard.

Buyer-facing math:
- ROI_raw = QualityScore / Cost
- ROI_weighted = (Sum severity_weight x findings) / Cost
- ROI_tail = QualityScore / (Cost + P(failure) x C_failure)

Per-cell scorecard surfaces:
- S3-zero-opus ROI_raw 0.05 (failed task at $41)
- S3-zero-sonnet best ROI_raw at 2.08
- S1-v0.3.6 weighted=56pts (2x next best): the only cell catching 2 supply-chain BLOCKERs

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
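The three ROI definitions translate directly into code. A minimal sketch using the severity weights stated in this commit; the function names and all demo inputs (finding counts, quality score, cost, failure probability) are hypothetical and do not correspond to any scorecard row.

```python
SEVERITY_WEIGHTS = {"BLOCKER": 5.0, "HIGH": 3.0, "MED": 1.0, "LOW": 0.3}


def roi_raw(quality_score: float, cost: float) -> float:
    """ROI_raw = QualityScore / Cost."""
    return quality_score / cost


def weighted_points(findings: dict[str, int]) -> float:
    """Sum of severity_weight x number of findings at that severity."""
    return sum(SEVERITY_WEIGHTS[sev] * n for sev, n in findings.items())


def roi_weighted(findings: dict[str, int], cost: float) -> float:
    """ROI_weighted = weighted points / Cost."""
    return weighted_points(findings) / cost


def roi_tail(quality_score: float, cost: float,
             p_failure: float, c_failure: float) -> float:
    """ROI_tail = QualityScore / (Cost + P(failure) x C_failure)."""
    return quality_score / (cost + p_failure * c_failure)


# Hypothetical cell, for illustration only.
demo_findings = {"BLOCKER": 2, "HIGH": 1, "MED": 4, "LOW": 10}
demo_points = weighted_points(demo_findings)
demo_raw = roi_raw(quality_score=8.0, cost=4.0)
demo_tail = roi_tail(quality_score=8.0, cost=4.0, p_failure=0.1, c_failure=40.0)
print(demo_points, demo_raw, demo_tail)
```

ROI_tail penalises a cell whose expected failure cost is large relative to its run cost, which is the dimension ROI_raw cannot see.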
Adds concrete per-cell evidence for the dispatch-overhead claim: - First-turn input by harness: 54K (Sonnet) / 75K (Opus) - Per-spawn entry tax by agent type: 6K Haiku, 35K Sonnet sub, 54K orch - Rolling re-send pattern (S3-v0.2): 54K→130K input growth, 290:1 I/O ratio - I/O ratio table across all 10 cells with non-trivial telemetry Mining done from the events table assistant.usage stream of each cell chat session. Numbers are real, real-billed, and reproducible by re-querying with the chat_session_id committed in each cost-report.json. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three new sections answering buyer questions: - Per-pattern attribution: which patterns moved which dollars (S7 +B12+B15+A12 load-bearing; B13/B14b/B16 catalogued but not pulling weight yet) - v0.3.6 flaws and quick wins (v0.4 frontier — entry-tax visibility, dispatch coalescing A13, B14b CAVEMAN, B15 tier-aware Sonnet, loop-detector for B12) - Reliability and predictability — the dimension raw ROI misses (variance bounds, auditability, repeatability) — argues v0.3.6 is the only thing on the table for productionised/automated use Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Refactors the genesis skill so the B14b CAVEMAN BRIEF pattern actually operates inside the agentic workflows genesis designs. Telemetry showed v0.3.6 had 0/9 caveman briefs in S1 task() dispatches — caveman was a catalogue stub, not a runtime contributor. Core idea: AUDIENCE BOUNDARY (composition-substrate §7) — every artifact named INTERNAL or EXTERNAL; default INTERNAL=caveman, EXTERNAL=normal prose; boundary is artifact-audience, not agent-tier. Caveman was designed for human↔assistant prose; genesis spawns subagents the user never sees. INTERNAL traffic is safe and ideal for caveman; EXTERNAL traffic stays prose. Changes: - composition-substrate.md §7 AUDIENCE BOUNDARY (substrate-level rule) - design-patterns.md B14b expanded to canonical fidelity (drop list, preservation contract for code/paths/URLs/numbers/error strings, LITE/FULL/ULTRA intensities, auto-clarity exceptions, role-mode persistence, output-mode contract) - design-patterns.md B14c CAVEMAN CHANNEL (orchestrates the boundary across multi-spawn workflows) - SKILL.md: per-spawn declaration discipline + audience preamble - references/audience-boundary.md NEW (load-on-demand audience matrix) - assets/caveman-templates.md NEW (five brief templates + receipt schemas: severity, dup-oracle, label-picker, missing-info, style) - pattern-tradeoffs.md: caveman-on-external anti-pattern - refactor-patterns.md R6 AUDIENCE-BOUNDARY ENFORCE checklist - model-catalog.md TRIVIAL row → B14b/B14c cross-link ROI: concentrated on PANEL-shape workflows. S1-style ~43K input + ~14K output saved/run (~$0.34 uncached). S2/S3-shape workflows correctly gain 0; v0.3.7 does NOT pressure architects to invent spawns. Audit: files/caveman-audit.md Design plan: files/v037-design-plan.md Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
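The AUDIENCE BOUNDARY default described above reduces to a lookup on the artifact's audience. A sketch under stated assumptions: the enums and helper names are hypothetical, and the auto-clarity exceptions and LITE/FULL/ULTRA intensities mentioned in the commit are not modelled.

```python
from enum import Enum
from typing import Optional


class Audience(Enum):
    INTERNAL = "internal"   # read only by other agents in the workflow
    EXTERNAL = "external"   # read by the user or another human


class Style(Enum):
    CAVEMAN = "caveman"
    PROSE = "prose"


def style_for(audience: Audience) -> Style:
    """Default style: the audience of the artifact decides, not the tier
    of the agent that writes it."""
    return Style.CAVEMAN if audience is Audience.INTERNAL else Style.PROSE


def undeclared(artifacts: dict[str, Optional[Audience]]) -> list[str]:
    """Names of artifacts that were not declared INTERNAL or EXTERNAL."""
    return sorted(name for name, aud in artifacts.items() if aud is None)


# Hypothetical artifact inventory for a panel-shaped workflow.
demo_artifacts = {
    "lens-brief": Audience.INTERNAL,
    "final-review": Audience.EXTERNAL,
    "scratch-notes": None,
}
print(undeclared(demo_artifacts))
```

Flagging undeclared artifacts makes the per-spawn declaration discipline checkable rather than advisory.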
Three handoff packets produced by Opus + genesis v0.3.7 architect runs. These are the design-only outputs; execution will be done by separate Sonnet executor sessions to replicate the v0.3.6 two-session pattern (architect on Opus, executor on Sonnet) for an apples-to-apples cost A/B.

Packets:

- S1-triage-v037/handoff.md: 5-spawn panel (CAVEMAN_FULL briefs)
- S2N-v037/handoff.md: monolithic A9+A7 (0 spawns, refusal validated)
- S3-rename-v037/handoff.md: S7 short-circuit (0 spawns)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three cold-start executor cells (S1, S2N, S3) committed with REAL TELEMETRY harvested from the cloud events table for executor chat sessions 82be8bfe / 41405aae / 99c365c2. Cost-report.json files are authoritative (executor-only, architect-cost out-of-scope per the established v0.3.6 methodology).

Headline numbers (executor-only, real telemetry):

- S1-v0.3.7: $18.53 (vs v0.3.6 $24.59, -24.6%), quality 8/10
- S2N-v0.3.7: $7.10 (vs v0.3.6 $9.80, -27.5%), quality 8/10 (parity)
- S3-v0.3.7: $14.30 (vs v0.3.6 $10.40, +37.5%), quality 10/10 (fixture-rebuild overhead)

Caveman receipt fidelity (load-bearing v0.3.7 substrate validation):

- S1-v0.3.7: 5/5 CAVEMAN_FULL on the wire (vs v0.3.6's 0/9 PROSE_LEAK)
- S2N/S3: 0 spawns by design (AUDIENCE BOUNDARY refusal validated)

RTR.md updated:

- matrix table: v0.3.7 column added per scenario
- per-cell scorecard: 3 v0.3.7 rows + summary
- per-pattern attribution: B14b promoted to empirically-load-bearing
- flaw 3 (B14b not invoked): marked RESOLVED
- new section: v0.3.7 cold-start packet fidelity test

PR body updated separately via gh pr edit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Headline
This PR makes token economics a first-class design dimension in the genesis skill (v0.2 → v0.3.6). A new vocabulary of patterns lets the architect reason about which model each sub-agent runs on, which tools it can call, and how big each prompt needs to be — not just whether the workflow is correct.
We validated this on three workloads against four execution shapes each (twelve cells total), with every dollar harvested from the Copilot cloud `events` table: no estimates, no models. The headline is honest:

v0.3.7 update (this PR's headline addition). The B14b CAVEMAN BRIEF + B14c AUDIENCE BOUNDARY substrate is now empirically load-bearing. A cold-start fidelity test (Opus architect → fresh Sonnet executor, packet-only, no skill loaded) produces 5/5 CAVEMAN_FULL spawn briefs on the wire for S1's 5-lens panel (vs v0.3.6's 0/9), and the cost effect at quality parity is −24.6% on S1, −27.5% on S2N with real telemetry. S3 substrate path is byte-identical to v0.3.6; its +37.5% cell number is documented fixture-rebuild overhead. Detail in § v0.3.7 cold-start packet fidelity test.
The next three sections explain why that conclusion is the right one, what the experiment actually measured, and how to read the per-cell numbers.
TL;DR — when to pick which
`sed` call cost $4.81 on S3; architecture is dead weight here

Per-cell scorecard (raw $, quality, weighted ROI):

Higher ROI is better. ROI_weighted uses BLOCKER × 5, HIGH × 3, MEDIUM × 1, LOW × 0.3. Full ROI definition in `REPORT.md` § The ROI function.

Why this experiment exists
Genesis is a skill an LLM-driven agent loads to architect other agentic workflows. When a team says "design us a PR-review panel" or "audit our docs corpus", the genesis architect decides how many sub-agents to spawn, which model each runs on, what tools each can call, and how to wire their outputs.
Before this PR, the architect optimised for correctness only. Cost was implicit, and that made the architect prone to two failure modes we kept observing in practice:
`view + edit` loop for a 19-file rename instead of a single `perl -i` shell invocation. The agent does the work; it just does it through the wrong tool surface.

The patterns added in v0.3.6 give the architect explicit vocabulary for both: B12 SELECTION RULE (how to pick the right primitive), B15 TOOL SUBSET + S7 DETERMINISTIC TOOL BRIDGE (when a shell call replaces N file edits), A12 GRADIENT WORKFLOW + benchmark-grounded model catalog (Sonnet for the bulk, Opus only for narrow arbiter roles), and A1 PANEL + A7 ADVERSARIAL VERIFIER (multi-stream reviewability with precision evidence). The question this PR answers is: do those patterns actually move the needle in production-grade conditions, on real workloads, against honest baselines?
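To make the "one deterministic call instead of N file edits" idea concrete, here is a minimal Python sketch of a single-pass, whole-word rename across a directory tree. It is an illustration only, not the workflow the architect emitted: the function name `rename_symbol`, its parameters, and the glob argument are all hypothetical.

```python
import re
from pathlib import Path


def rename_symbol(root: Path, old: str, new: str, glob: str) -> list[Path]:
    """Replace whole-word occurrences of `old` with `new` in every file
    under `root` matching `glob`, and return the files that changed.

    Hypothetical helper: one deterministic pass over the tree, so no
    per-file model round-trip is involved.
    """
    # \b keeps `old` from matching inside longer identifiers.
    word = re.compile(rf"\b{re.escape(old)}\b")
    changed: list[Path] = []
    for path in sorted(root.rglob(glob)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8")
        updated, count = word.subn(new, text)
        if count:
            path.write_text(updated, encoding="utf-8")
            changed.append(path)
    return changed
```

A caller would invoke it once, for example `rename_symbol(Path("src"), "oldName", "newName", "*.js")`, and then run the project's test suite to confirm the rename.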
How we measured
Three scenarios, each run four ways:
`npm test` must pass)

The four cells per scenario are zero-opus and zero-sonnet (single-shot prompt to a frontier model, no architecture), v0.2 architected (workflow designed by the previous-version architect), and v0.3.6 architected (workflow designed by this PR's architect with the new pattern vocabulary).
Cost is the real-telemetry sum of `usage_input_tokens`, `usage_output_tokens`, `usage_cache_read_tokens`, and `usage_cache_write_tokens` for the executor session and any sub-agents it spawned, priced at Anthropic public rates. Per-cell `cost-report.json` files are committed alongside chat-session ids for replay. The architect session is not counted: architecting is amortised, infrequent infrastructure.

Quality was graded by the orchestrator against scenario-specific rubrics: binary correctness plus tool-mechanic cleanliness for S3, real structural drift caught + verifier precision for S2, actionable bugs caught + severity calibration for S1. Full grading detail with the per-finding evidence is in `REAL-TELEMETRY-RESULTS.md`.

A buyer's decision is return on each dollar, not absolute spend, so we score every cell on three ROI axes: raw `QualityScore / Cost`, severity-weighted (BLOCKER × 5, HIGH × 3, MEDIUM × 1, LOW × 0.3), and tail-risk-adjusted (`QualityScore / (Cost + P(failure) × C_failure)`). The first answers "what does the dollar buy?"; the second weights findings by impact-class; the third dominates when missing one finding costs orders of magnitude more than the run itself. Higher is better on all three. The full definition with inputs and caveats is in `REPORT.md` § The ROI function.

What we found: three findings, in plain language
1. Bad architecture costs more than no architecture
The S3 rename cell where the v0.2 architect designed a per-file `view + edit` loop cost $33.79 in real telemetry. The same rename, executed by Sonnet directly with one `sed` shell call, cost $4.81. Same outcome, 7× the cost, because every tool call round-trips the entire conversation context back to the model and a 19-file rename done one file at a time pays that overhead 19 times.

The v0.3.6 architect, given the same brief, rejected the per-file design and chose a single `grep | xargs perl -i` shell call (the S7 DETERMINISTIC TOOL BRIDGE pattern). It cost $10.40, which is 3.2× cheaper than v0.2 and about 2.2× the zero-sonnet cost, and the architecture overhead bought bounded variance: no risk of the per-file loop trap, no risk of substituting the wrong symbol the way zero-opus did. This is the load-bearing claim of the PR: the new pattern vocabulary makes the architect reject wrong-tool-surface designs, and the gap is measured, not modelled.

2. On simple workloads, single-shot Sonnet is the cost-rational baseline
On S2 doc-audit and S1 PR review, zero-sonnet wins on raw $/quality. Sonnet at $6.20 produced a 7/10 audit. The v0.3.6 architecture at $9.80 produced an 8/10 audit. Sonnet at $8.89 produced a 6/10 PR review. The v0.3.6 architecture at $24.59 produced a 9/10 review.
Architecture has a fixed overhead. Every sub-agent dispatch reloads the Copilot CLI tool descriptions, the system prompt, and the genesis skill bundle — roughly 80K input tokens per turn that the single-shot path does not pay. On workloads where one Sonnet pass already produces good output, that overhead is dead weight. The v0.3.6 architect knows this: its S2 doc-audit handoff explicitly chose monolithic single-session over fan-out, citing the B12 SELECTION RULE wrong-primitive-binding warning.
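The fixed-overhead argument can be sketched as simple arithmetic. The snippet below is a rough illustration under stated assumptions: the 80K figure is the approximate per-turn reload quoted above, while the turn count and both per-million-token prices are placeholders chosen for the example, not measured values or quoted rates. The helper name `reload_overhead_usd` is hypothetical.

```python
def reload_overhead_usd(turns: int, overhead_tokens: int, usd_per_mtok: float) -> float:
    """Cost of re-sending a fixed context prefix on every turn.

    Hypothetical helper: `usd_per_mtok` is the price per million input
    tokens that applies to the reloaded prefix (cached or uncached).
    """
    return turns * overhead_tokens * usd_per_mtok / 1_000_000


# Placeholder inputs: 5 turns, ~80K reloaded tokens per turn, and two
# illustrative prices (an uncached rate and a much cheaper cache-read rate).
uncached = reload_overhead_usd(turns=5, overhead_tokens=80_000, usd_per_mtok=3.00)
cached = reload_overhead_usd(turns=5, overhead_tokens=80_000, usd_per_mtok=0.30)
print(f"uncached: ${uncached:.2f}  cached: ${cached:.2f}")
```

The overhead scales linearly with the number of turns, which is why a workload that one Sonnet pass already handles well gains nothing from fan-out.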
What architecture buys on these workloads is not lower mean cost. It is multi-stream reviewability (five independent lens verdicts instead of one prose blob), class-routed quality (Sonnet on reviewer lenses, Haiku on trivial ones), and — uniquely on v0.3.6 — adversarial verification. The S2-v0.3.6 cell is the only one with a verifier pass that confirmed 6/6 HIGH findings with zero downgrades. That is the kind of precision evidence a buyer cannot get from a single-shot prompt at any cost.
3. On production-critical workloads, severity weighting flips the verdict
The S1 PR review is the test case where ROI_raw and ROI_tail diverge. Three of the four cells produced reasonable reviews. Only v0.3.6 caught the two supply-chain security BLOCKERs — arbitrary command execution via plugin-controlled LSP config, and validation bypass via the raw-dict install path. zero-opus and zero-sonnet both missed both. v0.2 under-completed the workflow entirely (a 27-line YAML stub with three bullet points).
On raw $/quality, zero-sonnet still wins (0.67 vs v0.3.6's 0.37). On severity-weighted ROI, the gap closes (zero-sonnet 2.59 vs v0.3.6 2.28 weighted-points-per-dollar) — the BLOCKER × 5 weighting moves v0.3.6's 56 weighted points to nearly twice the next cell's. On tail-risk-adjusted ROI, v0.3.6 dominates: the cost of merging a supply-chain CVE into a published CLI is at minimum thousands of dollars in remediation and reputation, against an architecture premium of ~$15. The math says: pay the premium when C_failure ≫ Cost, do not pay it when it does not. That is precisely the conclusion the v0.3.6 architect's catalog now encodes.
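The three ROI axes discussed above can be written down in a few lines. This is a minimal sketch, assuming the formulas as stated in this description; the function names are hypothetical, the finding mix passed to `weighted_points` is invented for illustration, and the `c_failure` value is a placeholder, since the text only says the failure cost is "at minimum thousands of dollars". The authoritative definition remains the one in `REPORT.md`.

```python
# Severity weights as stated in this description.
SEVERITY_WEIGHTS = {"BLOCKER": 5.0, "HIGH": 3.0, "MEDIUM": 1.0, "LOW": 0.3}


def roi_raw(quality: float, cost: float) -> float:
    """Quality score per dollar."""
    return quality / cost


def weighted_points(findings: dict[str, int]) -> float:
    """Sum of finding counts weighted by severity class."""
    return sum(SEVERITY_WEIGHTS[sev] * n for sev, n in findings.items())


def roi_weighted(points: float, cost: float) -> float:
    """Severity-weighted points per dollar."""
    return points / cost


def roi_tail(quality: float, cost: float, p_failure: float, c_failure: float) -> float:
    """Quality per dollar once expected failure cost is added to spend."""
    return quality / (cost + p_failure * c_failure)


# S1 figures quoted in the text: zero-sonnet 6/10 at $8.89, v0.3.6 9/10 at $24.59.
print(round(roi_raw(6, 8.89), 2), round(roi_raw(9, 24.59), 2))
# v0.3.6's 56 weighted points, as quoted in the text.
print(round(roi_weighted(56, 24.59), 2))
# Invented finding mix, only to show how the weights combine.
print(weighted_points({"BLOCKER": 2, "HIGH": 1}))
# Placeholder C_failure of $5,000; P(failure) is the 0/1 outcome of one run.
print(roi_tail(6, 8.89, p_failure=1, c_failure=5_000))
print(roi_tail(9, 24.59, p_failure=0, c_failure=5_000))
```

With any failure cost in the thousands, the cell that missed the BLOCKERs falls by orders of magnitude on the tail-adjusted axis, while the cell that caught them keeps its raw ratio.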
The full per-cell numbers, including ROI_raw, ROI_weighted, and the failure-mode column for each cell, are in the scorecard at the top of `REPORT.md`. Cost methodology is in `PROFILING-PROTOCOL.md`. Full grading detail with the per-finding evidence is in `REAL-TELEMETRY-RESULTS.md`.

Per-pattern attribution: which patterns moved which dollars
`model:` declarations, `cost-report.json`, `S1-v037-real/caveman-classification.md`

The honest summary: S7 + B12 + B15 + A12 + B14b (new in v0.3.7) are doing the work; B13 + B16 are catalogued but not yet pulling weight. Full attribution and v0.4 frontier (5 quick wins) in `REAL-TELEMETRY-RESULTS.md` § Per-pattern attribution and § v0.3.6 flaws and quick wins.

Reliability and predictability: the dimension ROI does not capture
Raw $/quality scores a single cell on a single run. It misses three properties that matter once the question shifts from "which is cheapest for this one task?" to "which workflow do I want in a CI pipeline run nightly, or handed to a junior engineer?":
`.agent.md` and adds tests, which is exactly what the architect already does.

This shifts the buyer's calculation. For one-off use, zero-sonnet is fine. For productionised, repeatable, automatable use, v0.3.6 is the only thing on the table, because it is the only one that produces a workflow artifact at all. Detail in `REAL-TELEMETRY-RESULTS.md` § Reliability and predictability.

What changed in the corpus
- `skills/genesis/assets/design-patterns.md`: B12 SELECTION RULE, B13 CACHE-AWARE PREFIX, B14b CAVEMAN BRIEF, B15 TOOL SUBSET, B16 EFFORT GOVERNOR, S7 DETERMINISTIC TOOL BRIDGE
- `dev/empirical-proof/`: twelve runtime cells with real-telemetry validation, formal ROI definition, per-cell artifacts and chat-session ids for replay

What this PR does NOT prove
This is a small, opportunistic study. We have n=3 scenarios and n=1 per cell. The conclusions above are about the shape of the cost-quality curve under v0.3.6 patterns, not a SWE-bench-grade benchmark. Specifically:
`REAL-TELEMETRY-RESULTS.md`.

- `P(failure|cell)` in ROI_tail is the empirical 0/1 of this single run, not a calibrated probability. The tail-risk argument relies on the structural claim that `C_failure ≫ Cost` for security-critical workloads, not on a fitted failure-rate model.
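One way to see how little a single run constrains `P(failure|cell)` is to put a confidence interval around the observed 0/1 outcome. The sketch below uses the Wilson score interval, a standard statistical technique that this study does not itself apply; the function name and the 95% `z` value are choices made for the example.

```python
import math


def wilson_interval(failures: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a failure rate estimated from `runs` trials.

    Illustrative only: z = 1.96 corresponds to roughly 95% confidence.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    p_hat = failures / runs
    denom = 1 + z * z / runs
    centre = (p_hat + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / runs + z * z / (4 * runs * runs)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, centre - half), min(1.0, centre + half)


# One clean run (0 failures out of 1) still leaves a very wide interval.
low, high = wilson_interval(failures=0, runs=1)
print(f"{low:.2f} to {high:.2f}")
```

For a single clean run the upper bound is close to 0.8, which is why the tail-risk argument has to rest on the structural `C_failure ≫ Cost` claim instead of on the observed rate.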