Symptom
After the agent-brain Run #1: `total_input_tokens: 812` over 220 invocations = 3.7 tokens per call. After Run #5: `total_input_tokens: 846` over 318 invocations = 2.7 tokens per call.
Across all eval runs, output tokens scale plausibly (~94K-100K in total, roughly 300-430 tokens per call), but input tokens always average only ~3-4 per call, which is implausible given that the system prompts alone are 200+ tokens.
Where it shows up
`.designdoc-budget.json` (note the implausibly low `total_input_tokens`):
```json
{
  "cap_usd": 30.0,
  "total_cost_usd": 29.97,
  "total_input_tokens": 846,
  "total_output_tokens": 100135,
  "invocations": 318
}
```
`designdoc status` indirectly displays this via the cost line, but a future "tokens used / tokens remaining" feature would be wrong.
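As a quick sanity check, the per-call averages can be recomputed from the budget file contents. The sketch below is only illustrative: the helper name `per_call_averages` and the 200-token threshold (taken from the system prompt size mentioned above) are assumptions, not part of designdoc.

```python
# Hypothetical threshold: system prompts are said to be 200+ tokens, so a
# lower per-call input average suggests input tokens are under-counted.
MIN_PLAUSIBLE_INPUT_PER_CALL = 200


def per_call_averages(budget: dict) -> dict:
    """Return average input/output tokens per invocation from a budget dict."""
    calls = budget.get("invocations", 0)
    if calls <= 0:
        return {"input": 0.0, "output": 0.0, "suspicious": False}
    avg_in = budget.get("total_input_tokens", 0) / calls
    avg_out = budget.get("total_output_tokens", 0) / calls
    return {
        "input": avg_in,
        "output": avg_out,
        "suspicious": avg_in < MIN_PLAUSIBLE_INPUT_PER_CALL,
    }


# Values copied from the budget file shown in this issue.
budget = {
    "cap_usd": 30.0,
    "total_cost_usd": 29.97,
    "total_input_tokens": 846,
    "total_output_tokens": 100135,
    "invocations": 318,
}
result = per_call_averages(budget)
print(result)
```

With the Run #5 numbers, the input average falls far below the threshold while the output average is in the hundreds, which is the mismatch described under Symptom.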
Likely cause
`runner.py:_DefaultSDK.query`:
```python
elif isinstance(msg, ResultMessage):
    total_cost = msg.total_cost_usd or 0.0
    usage = msg.usage or {}
    input_tokens = usage.get("input_tokens", 0) if isinstance(usage, dict) else 0
```
The SDK's (`claude_agent_sdk`) `ResultMessage.usage` likely separates fresh input tokens from cached input tokens (prompt-cache hits are billed at ~10% of the normal per-token input rate). The current code reads only `usage["input_tokens"]`, which may be the non-cached count. Cached tokens would then land in separate fields such as `cache_read_input_tokens` (cache hits) and `cache_creation_input_tokens` (cache writes).
If the SDK is using prompt caching aggressively (which is good for cost), then 95%+ of every call's input is cache reads, leaving only the dynamic delta in `input_tokens`. That matches what we see.
Proposed fix
```python
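# Sum fresh, cache-read, and cache-write input tokens.
# Missing or null fields count as 0; a non-dict usage counts as empty.
usage = usage if isinstance(usage, dict) else {}
input_tokens = (
    (usage.get("input_tokens") or 0)
    + (usage.get("cache_read_input_tokens") or 0)
    + (usage.get("cache_creation_input_tokens") or 0)
)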
```
(Verify exact field names from the claude_agent_sdk usage dict.)
`total_cost_usd` is correct — the SDK reports billed cost directly. So this is purely a metering / display fix; the budget cap is enforced correctly.
Test
Add a unit test that constructs a fake usage dict including cache fields and asserts the runner counts the sum.
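A minimal sketch of what such a test could look like. It assumes the summing logic is factored into a small helper; `count_input_tokens` is a hypothetical name used here for illustration, not an existing function in `runner.py`, and the cache field names still need to be verified against the SDK.

```python
def count_input_tokens(usage):
    """Hypothetical helper mirroring the proposed fix: sum fresh and cached input tokens."""
    if not isinstance(usage, dict):
        return 0
    # Field names are assumed; verify them against the actual usage dict.
    fields = (
        "input_tokens",
        "cache_read_input_tokens",
        "cache_creation_input_tokens",
    )
    return sum(usage.get(name) or 0 for name in fields)


def test_cache_fields_are_summed():
    # Fake usage dict: a small fresh delta plus cache reads and cache writes.
    usage = {
        "input_tokens": 4,
        "cache_read_input_tokens": 900,
        "cache_creation_input_tokens": 96,
    }
    assert count_input_tokens(usage) == 1000


def test_missing_or_null_fields_count_as_zero():
    assert count_input_tokens({"input_tokens": 4}) == 4
    assert count_input_tokens({"input_tokens": 4, "cache_read_input_tokens": None}) == 4
    assert count_input_tokens(None) == 0
```

The second test covers the case where the SDK omits the cache fields or returns them as null, so the fix does not regress when caching is disabled.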
Closing criteria
- A real run shows input tokens scaling to ~hundreds-to-thousands per call (matching system prompt + user prompt size).
- Unit test verifies cache fields are summed.