total_input_tokens count is implausibly low (~4 tokens/call); cost meter ledger is misleading #50

## Symptom

After the agent-brain Run #1: `total_input_tokens: 812` over **220 invocations** = 3.7 tokens per call. After Run #5: `total_input_tokens: 846` over **318 invocations** = 2.7 tokens per call.

Across all eval runs, output tokens scale plausibly (~94K-100K total, roughly 315-430 tokens per call), but input tokens always come out to ~3-4 per call, which is implausible given the 200+ token system prompts.
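The per-call averages follow directly from the ledger totals. A quick check of the arithmetic, using only figures quoted in this issue (the helper name `per_call` is just for illustration):

```python
def per_call(total_tokens: int, invocations: int) -> float:
    """Average tokens per invocation, rounded to one decimal place."""
    return round(total_tokens / invocations, 1)


run1_input = per_call(812, 220)      # Run #1 input tokens
run5_input = per_call(846, 318)      # Run #5 input tokens
run5_output = per_call(100135, 318)  # Run #5 output tokens, from the ledger

print(run1_input, run5_input, run5_output)  # 3.7 2.7 314.9
```

The output average is two orders of magnitude larger than the input average, which is the opposite of what a prompt-heavy workload should produce.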

## Where it shows up

`.designdoc-budget.json` (note `total_input_tokens`, which is implausibly low for 318 invocations):

```json
{
  "cap_usd": 30.0,
  "total_cost_usd": 29.97,
  "total_input_tokens": 846,
  "total_output_tokens": 100135,
  "invocations": 318
}
```

`designdoc status` currently surfaces the ledger only indirectly, via the cost line, but any future "tokens used / tokens remaining" feature built on `total_input_tokens` would be wrong.

## Likely cause

`runner.py:_DefaultSDK.query`:

```python
elif isinstance(msg, ResultMessage):
    total_cost = msg.total_cost_usd or 0.0
    usage = msg.usage or {}
    input_tokens = usage.get("input_tokens", 0) if isinstance(usage, dict) else 0
```

The SDK's `ResultMessage.usage` (from `claude_agent_sdk`) likely separates **fresh input tokens** from **cached input tokens** (prompt-cache hits are billed at ~10% of the per-token rate). The current code reads only `usage["input_tokens"]`, which may be the *non-cached* count. Cached tokens would then land in separate fields such as `cache_read_input_tokens` (cache hits) and `cache_creation_input_tokens` (cache writes).

If the SDK caches prompts aggressively (which is good for cost), then 95%+ of each call's input would be cache reads, leaving only the dynamic delta in `input_tokens`. That matches what we see.
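One way to check this hypothesis is to split a single call's usage dict into fresh and cached input and look at the cached share. A minimal sketch, assuming the usage dict carries `cache_read_input_tokens` and `cache_creation_input_tokens` keys (not yet verified against the SDK); the helper `input_breakdown` and the sample numbers are made up for illustration:

```python
from typing import Any, Dict, Optional

# Assumed key names; verify against the real usage dict before relying on them.
CACHE_KEYS = ("cache_read_input_tokens", "cache_creation_input_tokens")


def input_breakdown(usage: Optional[Dict[str, Any]]) -> Dict[str, float]:
    """Split input tokens into fresh and cached parts and report the cached share."""
    usage = usage if isinstance(usage, dict) else {}
    fresh = usage.get("input_tokens") or 0
    cached = sum(usage.get(key) or 0 for key in CACHE_KEYS)
    total = fresh + cached
    share = cached / total if total else 0.0
    return {"fresh": fresh, "cached": cached, "total": total, "cached_share": share}


# Illustrative numbers only, not taken from a real run.
sample = {
    "input_tokens": 4,
    "cache_read_input_tokens": 230,
    "cache_creation_input_tokens": 0,
}
print(input_breakdown(sample))
```

If a real `ResultMessage.usage` shows a cached share near 1.0 alongside a single-digit `input_tokens`, that would confirm the cause.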

## Proposed fix

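```python
# Count fresh, cache-read, and cache-creation tokens as input.
# `or 0` also covers keys that are present but set to None.
usage = usage if isinstance(usage, dict) else {}
input_tokens = (
    (usage.get("input_tokens") or 0)
    + (usage.get("cache_read_input_tokens") or 0)
    + (usage.get("cache_creation_input_tokens") or 0)
)
```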

(Verify exact field names from the claude_agent_sdk usage dict.)

`total_cost_usd` is correct because the SDK reports billed cost directly. So this is purely a metering / display fix; the budget cap is enforced correctly.

## Test

Add a unit test that constructs a fake usage dict including cache fields and asserts the runner counts the sum.
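A rough sketch of what that test could look like, in pytest style. The helper `count_input_tokens` is a hypothetical stand-in for the summing logic; the real test should drive `_DefaultSDK.query` (or whatever helper the fix extracts) instead, and the cache field names are still assumptions:

```python
def count_input_tokens(usage):
    """Stand-in for the runner's summing logic (hypothetical helper name)."""
    if not isinstance(usage, dict):
        return 0
    keys = ("input_tokens", "cache_read_input_tokens", "cache_creation_input_tokens")
    return sum(usage.get(key) or 0 for key in keys)


def test_cache_fields_are_summed():
    fake_usage = {
        "input_tokens": 5,
        "cache_read_input_tokens": 300,
        "cache_creation_input_tokens": 40,
        "output_tokens": 120,  # must not leak into the input count
    }
    assert count_input_tokens(fake_usage) == 345


def test_missing_or_malformed_usage_counts_zero():
    assert count_input_tokens(None) == 0
    assert count_input_tokens({}) == 0
    assert count_input_tokens({"input_tokens": 7}) == 7
```

The second test keeps the existing behaviour for a missing or non-dict `usage`, so the fix does not start raising on messages without usage data.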

## Closing criteria

- A real run shows input tokens scaling to ~hundreds-to-thousands per call (matching system prompt + user prompt size).
- Unit test verifies cache fields are summed.
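
The first criterion can be checked mechanically against the ledger. A small sketch, assuming a floor of 200 input tokens per call (taken from the "200+ token system prompts" figure above); the function names and the threshold constant are illustrative, not existing code:

```python
import json

# Assumed floor: every call carries at least the 200+ token system prompt.
MIN_INPUT_TOKENS_PER_CALL = 200


def input_tokens_per_call(ledger: dict) -> float:
    """Average input tokens per invocation; 0.0 when nothing has run yet."""
    invocations = ledger.get("invocations") or 0
    if invocations == 0:
        return 0.0
    return (ledger.get("total_input_tokens") or 0) / invocations


def ledger_looks_plausible(ledger: dict, floor: int = MIN_INPUT_TOKENS_PER_CALL) -> bool:
    """True when the per-call input average is at least the assumed floor."""
    return input_tokens_per_call(ledger) >= floor


# The ledger values reported in this issue fail the check.
current = json.loads('{"total_input_tokens": 846, "invocations": 318}')
print(ledger_looks_plausible(current))  # False
```

After the fix, the same check run on a fresh `.designdoc-budget.json` (loaded with `json.load`) should return `True`.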
