feat(tokens): propagate cached + thoughts tokens through return contexts by cdbartholomew · Pull Request #2356 · vectorize-io/hindsight

cdbartholomew · 2026-06-22T15:22:18Z

Summary

The Gemini 2.5+ family (and any future provider that combines prompt caching with reasoning tokens) reports four distinct token counts on every response:

prompt_token_count — total input
candidates_token_count — visible output
cached_content_token_count — subset of input served from prompt cache
thoughts_token_count — reasoning tokens, billed at the output rate by the provider

The provider already records the last two on the Prometheus counters hindsight.llm.tokens.cached_input and hindsight.llm.tokens.thoughts, but the values stop at the metrics layer. Every downstream return context (TokenUsage, LLMToolCallResult, TokenUsageSummary, RetainResult) only carries the top-level input/output split. That means:

A metering extension can only attribute prompt-cache hit-rate globally (via Prometheus aggregates), not per-operation.
Reasoning-token spend is invisible at the application layer because the provider keeps it out of candidates_token_count. A workload that "looks cheap" by visible-output volume can be silently expensive if the model is doing long reasoning chains.
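Both gaps can be made concrete with a little arithmetic. The sketch below is illustrative only: the field names mirror the four counts listed above, but the `GeminiUsage` class and the two helper functions are hypothetical and are not part of the provider SDK or of this codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeminiUsage:
    """Hypothetical container for the four counts reported per response."""

    prompt_token_count: int          # total input
    candidates_token_count: int      # visible output
    cached_content_token_count: int  # subset of input served from prompt cache
    thoughts_token_count: int        # reasoning tokens, billed at the output rate


def cache_hit_rate(u: GeminiUsage) -> float:
    """Fraction of input tokens served from the prompt cache."""
    if u.prompt_token_count == 0:
        return 0.0
    return u.cached_content_token_count / u.prompt_token_count


def billed_output_tokens(u: GeminiUsage) -> int:
    """Tokens charged at the output rate: visible output plus reasoning."""
    return u.candidates_token_count + u.thoughts_token_count


usage = GeminiUsage(
    prompt_token_count=2_000,
    candidates_token_count=100,
    cached_content_token_count=1_500,
    thoughts_token_count=900,
)

print(cache_hit_rate(usage))        # 0.75
print(billed_output_tokens(usage))  # 1000
```

In this made-up response the visible output is 100 tokens, yet 1,000 tokens are billed at the output rate. An input/output-only view cannot show that difference for a single operation.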

This PR threads the two fields through end-to-end. Pure plumbing — no metrics renames, no gating, no flags.

Changes

TokenUsage: adds thoughts_tokens (cached_tokens already existed). __add__ sums the new field so multi-iteration agentic-loop aggregation works.
LLMToolCallResult: adds cached_tokens and thoughts_tokens.
TokenUsageSummary (returned by run_reflect_agent): adds both fields. run_reflect_agent accumulates them at every call site: the main tool loop, structured-output extraction, and 4 edge-case completion branches.
_generate_structured_output: the return tuple grows from 3 to 5 elements (output, in, out, cached, thoughts); all 6 callers in the reflect agent are updated together.
RetainResult: adds optional llm_cached_input_tokens and llm_thoughts_tokens. memory_engine populates them from the aggregated TokenUsage. Defaults stay None so existing metering extensions are unaffected.
Gemini provider: already reads the four token counts from the SDK response; this change makes thoughts_tokens reach TokenUsage on the call path, and cached_input_tokens and thoughts_tokens reach LLMToolCallResult on the call_with_tools path.
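The aggregation described for TokenUsage and the widened return tuple can be sketched roughly as follows. This is a simplified stand-in, not the project's actual class: only cached_tokens and thoughts_tokens are named in this PR, input_tokens and output_tokens are assumed names for the top-level split, and `fake_structured_output` is a hypothetical stub that only mimics the 5-tuple shape of _generate_structured_output.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    """Simplified stand-in; input_tokens/output_tokens are assumed names."""

    input_tokens: int = 0
    output_tokens: int = 0
    cached_tokens: int = 0
    thoughts_tokens: int = 0  # defaults to 0 so older callers keep working

    def __add__(self, other: "TokenUsage") -> "TokenUsage":
        # Sum every counter so usage from several loop iterations combines.
        return TokenUsage(
            input_tokens=self.input_tokens + other.input_tokens,
            output_tokens=self.output_tokens + other.output_tokens,
            cached_tokens=self.cached_tokens + other.cached_tokens,
            thoughts_tokens=self.thoughts_tokens + other.thoughts_tokens,
        )


def fake_structured_output() -> tuple[str, int, int, int, int]:
    """Hypothetical stub returning (output, in, out, cached, thoughts)."""
    return "{}", 300, 40, 200, 120


# Two tool-loop iterations; the second comes from a provider that
# reports neither cached nor reasoning tokens, so those stay at 0.
iterations = [
    TokenUsage(input_tokens=1000, output_tokens=50, cached_tokens=800, thoughts_tokens=400),
    TokenUsage(input_tokens=1200, output_tokens=70),
]
total = sum(iterations, TokenUsage())

# Structured-output extraction step: unpack all five values, then accumulate.
_output, in_tok, out_tok, cached, thoughts = fake_structured_output()
total = total + TokenUsage(in_tok, out_tok, cached, thoughts)

print(total)
```

Passing `TokenUsage()` as the start value lets the built-in `sum` rely on `__add__` alone. The final total carries 1000 cached and 520 reasoning tokens alongside the input/output split.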

Backward compatibility

Every new field defaults to 0 (or None for the RetainResult dataclass). Callers built before this change keep working. Provider impls that don't surface these counts simply propagate zeros — the structured Prometheus counters were already optional in record_llm_call.
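A consumer could lean on those defaults along the lines below. `RetainResult` here is a cut-down illustration carrying only the two fields named in this PR, and `record_usage` with its dictionary output is hypothetical; neither represents the project's real extension API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetainResult:
    """Cut-down illustration; the real dataclass carries other fields too."""

    llm_cached_input_tokens: Optional[int] = None
    llm_thoughts_tokens: Optional[int] = None


def record_usage(result: RetainResult) -> dict:
    """Hypothetical metering hook: None means 'not reported', not zero."""
    record = {}
    if result.llm_cached_input_tokens is not None:
        record["cached_input_tokens"] = result.llm_cached_input_tokens
    if result.llm_thoughts_tokens is not None:
        record["thoughts_tokens"] = result.llm_thoughts_tokens
    return record


# An engine that does not surface the data leaves both fields at None.
print(record_usage(RetainResult()))

# An engine that does surface it fills in both fields.
print(record_usage(RetainResult(llm_cached_input_tokens=1500, llm_thoughts_tokens=900)))
```

Keeping None distinct from 0 lets a metering extension tell "the engine did not report this" apart from "the engine reported no cached or reasoning tokens".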

Test plan

New focused test file test_token_usage_cached_thoughts.py (6 cases) pinning propagation through every return type and the __add__ aggregation behavior
All existing Gemini provider tests pass unchanged (test_gemini_batch.py, test_gemini_cache.py, test_gemini_service_tier.py — 50 tests)
All existing reflect tests pass unchanged (test_reflect_prompt_builder.py, test_reflect_internal_billing.py, test_reflect_empty_based_on.py, test_reflect_source_facts_config.py — 37 tests)
CI green

Why now

Without these fields exposed at the application layer, the only way to attribute prompt-cache hit-rate or reasoning spend per operation is to JOIN Prometheus aggregates against application-level usage records by timestamp + tenant — which is lossy and only works for global trends, not per-customer / per-operation insight. Propagating the values through the existing return contexts removes that gap with a small, fully-backward-compatible change.

The Gemini 2.5+ family (and any future provider that combines prompt caching with reasoning tokens) reports four distinct token counts on every response:

- prompt_token_count (total input)
- candidates_token_count (visible output)
- cached_content_token_count (subset of input served from prompt cache)
- thoughts_token_count (reasoning tokens, billed at output rate)

The provider already records the last two on the Prometheus ``hindsight.llm.tokens.{cached_input,thoughts}`` counters, but the values stop at the metrics layer: every return context (TokenUsage, LLMToolCallResult, TokenUsageSummary, RetainResult) only exposes the top-level input/output split. As a result:

* a downstream metering extension can't attribute prompt-cache hit-rate per operation (only globally via Prometheus aggregates), and
* reasoning-token spend is invisible to ``output_tokens`` because the provider keeps it out of candidates_token_count. A workload that "looks cheap" by visible output can be silently expensive if the model is doing long reasoning chains.

This change threads the two fields through end-to-end:

- ``TokenUsage`` gains ``thoughts_tokens`` (cached_tokens already existed); ``__add__`` sums it so multi-iteration agentic-loop aggregation works.
- ``LLMToolCallResult`` gains ``cached_tokens`` + ``thoughts_tokens``.
- ``TokenUsageSummary`` (returned by ``run_reflect_agent``) gains both fields and ``run_reflect_agent`` accumulates them at every call site (main tool loop + structured-output extraction + 4 edge-case completion branches).
- ``_generate_structured_output`` now returns a 5-tuple ``(output, in, out, cached, thoughts)``; the 6 unpack sites in the reflect agent are updated together.
- ``RetainResult`` gains optional ``llm_cached_input_tokens`` and ``llm_thoughts_tokens`` fields; ``memory_engine`` populates them from the aggregated ``TokenUsage``. Defaults stay ``None`` for engines that don't surface the data so existing metering extensions are unaffected.
- The Gemini provider, which was already reading the four token counts from the SDK response, now returns ``thoughts_tokens`` on both the ``call`` and ``call_with_tools`` paths, and the existing ``cached_input_tokens`` value reaches ``LLMToolCallResult``.

Backward compatibility: every new field defaults to 0 (or None for the RetainResult dataclass), so any caller built before this change keeps working. Provider impls that don't surface these counts simply propagate zeros; the structured Prometheus counters were already optional in ``record_llm_call``.

Adds focused tests (``test_token_usage_cached_thoughts.py``, 6 cases) pinning the propagation through every return type and the aggregation behavior. Existing reflect-agent + Gemini provider tests (87 cases) pass unchanged.

This is a pure plumbing change: no metrics are renamed, no behavior is gated, no flags are added.

Picks up the new TokenUsage.thoughts_tokens field added in the parent commit.

Generated by:

    ./scripts/generate-openapi.sh
    ./scripts/generate-clients.sh

Plus ``ruff format`` over the two reflect/ source files to match the project's enforced formatting style. No hand edits in any generated file.

cdbartholomew added 3 commits June 22, 2026 11:21

chore: regenerate skills/hindsight-docs/references/openapi.json

e240a73

cdbartholomew wants to merge 3 commits into main from feat/propagate-cached-thoughts-tokens
