total_input_tokens count is implausibly low (~4 tokens/call); cost meter ledger is misleading #50

@RichardHightower

Symptom

After the agent-brain Run #1: `total_input_tokens: 812` over 220 invocations = 3.7 tokens per call. After Run #5: `total_input_tokens: 846` over 318 invocations = 2.7 tokens per call.

Across all eval runs, output tokens scale plausibly (~94K-100K total, i.e. roughly 300-430 tokens per call), but input tokens always come out at ~3-4 per call. That is implausible given that the system prompts alone are 200+ tokens.
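The per-call figures can be reproduced directly from the reported totals. A quick sketch, using only numbers quoted in this issue (the helper name `per_call` is made up for illustration):

```python
def per_call(total_tokens: int, invocations: int) -> float:
    """Average tokens per invocation, rounded to one decimal place."""
    return round(total_tokens / invocations, 1)


# Input totals reported after Run #1 and Run #5.
run1_input = per_call(812, 220)
run5_input = per_call(846, 318)

# Output total reported after Run #5, for comparison.
run5_output = per_call(100135, 318)

print(run1_input, run5_input, run5_output)
```

For Run #5 the output side works out to about 315 tokens per call, two orders of magnitude above the input figure from the same ledger.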

Where it shows up

`.designdoc-budget.json`:

```json
{
  "cap_usd": 30.0,
  "total_cost_usd": 29.97,
  "total_input_tokens": 846,
  "total_output_tokens": 100135,
  "invocations": 318
}
```

Here `total_input_tokens` (846) is the implausibly low value.

`designdoc status` only surfaces this indirectly, via the cost line, so nothing looks wrong today. A future "tokens used / tokens remaining" feature built on this field would report wrong numbers.

Likely cause

`runner.py:_DefaultSDK.query`:

```python
elif isinstance(msg, ResultMessage):
    total_cost = msg.total_cost_usd or 0.0
    usage = msg.usage or {}
    # Only "input_tokens" is read; any cache_* fields in usage are ignored.
    input_tokens = usage.get("input_tokens", 0) if isinstance(usage, dict) else 0
```

`ResultMessage.usage` from the claude_agent_sdk likely separates fresh input tokens from cached input tokens (prompt-cache reads are billed at roughly 10% of the normal input rate). The current code reads only `usage["input_tokens"]`, which is probably the non-cached count. Cached tokens would then land in separate fields such as `cache_read_input_tokens` (cache hits) and `cache_creation_input_tokens` (cache writes).

If the SDK is using prompt caching aggressively (which is good for cost), then 95%+ of every call's input is cache reads, leaving only the dynamic delta in `input_tokens`. That matches what we see.
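To make the hypothesis concrete, here is a small worked example of what one call's usage dict might look like under heavy caching. The numbers are invented for illustration, and the cache field names are an assumption that still needs to be checked against what claude_agent_sdk actually returns:

```python
# Hypothetical usage dict for one call. The numbers are invented and the
# cache field names are assumed, not confirmed against claude_agent_sdk.
usage = {
    "input_tokens": 4,                 # fresh (uncached) tokens only
    "cache_read_input_tokens": 250,    # prompt tokens served from the cache
    "cache_creation_input_tokens": 0,  # prompt tokens written to the cache
}

what_runner_counts = usage["input_tokens"]  # the only field read today
full_prompt = sum(usage.values())           # fresh + cached tokens
cached_share = usage["cache_read_input_tokens"] / full_prompt

print(what_runner_counts, full_prompt, f"{cached_share:.0%}")
```

With a split like this, the ledger would record 4 input tokens for a call whose prompt was 254 tokens long, which is the pattern seen in the runs above.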

Proposed fix

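```python
# Count every prompt token, cached or not (field names still to be verified).
if isinstance(usage, dict):
    input_tokens = (
        usage.get("input_tokens", 0)
        + usage.get("cache_read_input_tokens", 0)
        + usage.get("cache_creation_input_tokens", 0)
    )
else:
    input_tokens = 0
```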

(Verify exact field names from the claude_agent_sdk usage dict.)

`total_cost_usd` is correct, since the SDK reports billed cost directly. So this is purely a metering/display fix; the budget cap is enforced correctly.

Test

Add a unit test that constructs a fake usage dict including cache fields and asserts the runner counts the sum.
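A rough sketch of what such a test could look like. `count_input_tokens` is a hypothetical helper written here so the example is self-contained; it does not exist in `runner.py`, and the real test would more likely drive `_DefaultSDK.query` with a fake `ResultMessage`. The cache field names are likewise assumed:

```python
from typing import Any


def count_input_tokens(usage: Any) -> int:
    """Sum fresh and cached input tokens; non-dict usage counts as 0."""
    if not isinstance(usage, dict):
        return 0
    # Assumed field names; verify against the claude_agent_sdk usage dict.
    fields = (
        "input_tokens",
        "cache_read_input_tokens",
        "cache_creation_input_tokens",
    )
    # "or 0" treats both a missing key and an explicit None as zero.
    return sum(usage.get(name) or 0 for name in fields)


def test_cache_fields_are_summed() -> None:
    usage = {
        "input_tokens": 4,
        "cache_read_input_tokens": 900,
        "cache_creation_input_tokens": 120,
    }
    assert count_input_tokens(usage) == 1024


def test_missing_or_malformed_usage() -> None:
    assert count_input_tokens({"input_tokens": 7}) == 7
    assert count_input_tokens(None) == 0
```

Pulling the summation into a small pure function is one way to make it testable without mocking the SDK; whether that refactor is worth it is a judgment call for the fix PR.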

Closing criteria

  • A real run shows input tokens scaling to ~hundreds-to-thousands per call (matching system prompt + user prompt size).
  • Unit test verifies cache fields are summed.

Labels: bug
