feat(tokens): propagate cached + thoughts tokens through return contexts #2356
Open
cdbartholomew wants to merge 3 commits into
The Gemini 2.5+ family (and any future provider that combines prompt caching
with reasoning tokens) reports four distinct token counts on every response:
- prompt_token_count (total input)
- candidates_token_count (visible output)
- cached_content_token_count (subset of input served from prompt cache)
- thoughts_token_count (reasoning tokens, billed at output rate)
The provider already records the last two on the Prometheus
``hindsight.llm.tokens.{cached_input,thoughts}`` counters, but the values
stop at the metrics layer — every return context (TokenUsage,
LLMToolCallResult, TokenUsageSummary, RetainResult) only exposes the
top-level input/output split. As a result:
* a downstream metering extension can't attribute prompt-cache hit-rate
per operation (only globally via Prometheus aggregates), and
* reasoning-token spend is invisible to ``output_tokens`` because the
provider keeps it out of candidates_token_count. A workload that
"looks cheap" by visible output can be silently expensive if the
model is doing long reasoning chains.
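To make the second point concrete, here is a minimal sketch of the arithmetic. The prices and the ``estimate_cost`` helper are hypothetical and chosen only for illustration, and the prompt-cache discount is ignored for simplicity. The one fact taken from the description above is that reasoning tokens are billed at the output rate but are not counted in the visible output.

```python
# Hypothetical per-million-token prices, for illustration only.
INPUT_PRICE_PER_M = 1.00
OUTPUT_PRICE_PER_M = 10.00


def estimate_cost(input_tokens: int, output_tokens: int, thoughts_tokens: int = 0) -> float:
    """Estimate spend in dollars, billing reasoning tokens at the output rate."""
    billable_output = output_tokens + thoughts_tokens
    return (input_tokens * INPUT_PRICE_PER_M + billable_output * OUTPUT_PRICE_PER_M) / 1_000_000


# One call with short visible output but a long reasoning chain.
visible_only = estimate_cost(input_tokens=2_000, output_tokens=200)
with_thoughts = estimate_cost(input_tokens=2_000, output_tokens=200, thoughts_tokens=4_000)

print(f"visible output only: ${visible_only:.4f}")
print(f"including thoughts:  ${with_thoughts:.4f}")
```

Under these assumed numbers, the estimate that includes reasoning tokens is roughly eleven times the estimate based on visible output alone.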
This change threads the two fields through end-to-end:
- ``TokenUsage`` gains ``thoughts_tokens`` (cached_tokens already
existed); ``__add__`` sums it so multi-iteration agentic-loop
aggregation works.
- ``LLMToolCallResult`` gains ``cached_tokens`` + ``thoughts_tokens``.
- ``TokenUsageSummary`` (returned by ``run_reflect_agent``) gains
both fields and ``run_reflect_agent`` accumulates them at every
call site (main tool loop + structured-output extraction + 4
edge-case completion branches).
- ``_generate_structured_output`` now returns a 5-tuple
``(output, in, out, cached, thoughts)``; the 6 unpack sites in the
reflect agent are updated together.
- ``RetainResult`` gains optional ``llm_cached_input_tokens`` and
``llm_thoughts_tokens`` fields; ``memory_engine`` populates them
from the aggregated ``TokenUsage``. Defaults stay ``None`` for
engines that don't surface the data so existing metering extensions
are unaffected.
- The Gemini provider — which was already reading the four token
counts from the SDK response — now returns ``thoughts_tokens`` on
both the ``call`` and ``call_with_tools`` paths, and the existing
``cached_input_tokens`` value reaches ``LLMToolCallResult``.
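As a rough illustration of the aggregation described in the first bullet, the sketch below uses a simplified stand-in for ``TokenUsage``. The ``input_tokens`` and ``output_tokens`` field names and the use of a plain dataclass are assumptions; the real class may be shaped differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    # Simplified stand-in; input_tokens/output_tokens are assumed names.
    input_tokens: int = 0
    output_tokens: int = 0
    cached_tokens: int = 0    # subset of input served from the prompt cache
    thoughts_tokens: int = 0  # reasoning tokens, not part of output_tokens

    def __add__(self, other: "TokenUsage") -> "TokenUsage":
        # Sum every counter so per-iteration usage rolls up into one total.
        return TokenUsage(
            input_tokens=self.input_tokens + other.input_tokens,
            output_tokens=self.output_tokens + other.output_tokens,
            cached_tokens=self.cached_tokens + other.cached_tokens,
            thoughts_tokens=self.thoughts_tokens + other.thoughts_tokens,
        )


# Three iterations of a hypothetical agentic loop.
iterations = [
    TokenUsage(1_000, 50, cached_tokens=0, thoughts_tokens=300),
    TokenUsage(1_200, 80, cached_tokens=900, thoughts_tokens=450),
    TokenUsage(1_400, 120, cached_tokens=1_100, thoughts_tokens=0),
]

# Passing an empty TokenUsage as the start value lets sum() use __add__.
total = sum(iterations, TokenUsage())
print(total)
```

Because every field defaults to 0, a provider that reports neither count contributes nothing to the cached or thoughts totals.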
Backward compatibility: every new field defaults to 0 (or None for the
RetainResult dataclass), so any caller built before this change keeps
working. Provider impls that don't surface these counts simply propagate
zeros — the structured Prometheus counters were already optional in
``record_llm_call``.
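The ``None`` default on ``RetainResult`` lets a consumer tell "not reported" apart from "reported as zero". The sketch below assumes a simplified dataclass: ``llm_input_tokens`` and ``llm_output_tokens`` are assumed names for pre-existing fields, and ``describe_thoughts`` is a hypothetical metering helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetainResult:
    # Simplified stand-in; only the two Optional fields are named in this change.
    llm_input_tokens: int = 0   # assumed name for an existing field
    llm_output_tokens: int = 0  # assumed name for an existing field
    llm_cached_input_tokens: Optional[int] = None
    llm_thoughts_tokens: Optional[int] = None


def describe_thoughts(result: RetainResult) -> str:
    """Hypothetical helper: separate 'not reported' from 'reported as zero'."""
    if result.llm_thoughts_tokens is None:
        return "not reported"
    return f"{result.llm_thoughts_tokens} reasoning tokens"


# A caller written before this change constructs the result the old way.
legacy = RetainResult(llm_input_tokens=500, llm_output_tokens=40)
# An engine that surfaces the data fills in the new fields.
reporting = RetainResult(500, 40, llm_cached_input_tokens=300, llm_thoughts_tokens=0)

print(describe_thoughts(legacy))
print(describe_thoughts(reporting))
```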
Adds focused tests (``test_token_usage_cached_thoughts.py``, 6 cases)
pinning the propagation through every return type and the aggregation
behavior. Existing reflect-agent + Gemini provider tests (87 cases) pass
unchanged.
This is a pure plumbing change — no metrics are renamed, no behavior is
gated, no flags are added.
Picks up the new TokenUsage.thoughts_tokens field added in the parent commit.
Generated by:
  ./scripts/generate-openapi.sh
  ./scripts/generate-clients.sh
Plus ``ruff format`` over the two reflect/ source files to match the
project's enforced formatting style. No hand edits in any generated file.
Summary
The Gemini 2.5+ family (and any future provider that combines prompt caching with reasoning tokens) reports four distinct token counts on every response:
- ``prompt_token_count``: total input
- ``candidates_token_count``: visible output
- ``cached_content_token_count``: subset of input served from prompt cache
- ``thoughts_token_count``: reasoning tokens, billed at the output rate by the provider
The provider already records the last two on the Prometheus counters ``hindsight.llm.tokens.cached_input`` and ``hindsight.llm.tokens.thoughts``, but the values stop at the metrics layer. Every downstream return context (``TokenUsage``, ``LLMToolCallResult``, ``TokenUsageSummary``, ``RetainResult``) only carries the top-level input/output split. That means:
- A downstream metering extension can't attribute prompt-cache hit-rate per operation, only globally via Prometheus aggregates.
- Reasoning-token spend is invisible to ``output_tokens`` because the provider keeps it out of ``candidates_token_count``. A workload that "looks cheap" by visible-output volume can be silently expensive if the model is doing long reasoning chains.
This PR threads the two fields through end-to-end. Pure plumbing: no metrics renames, no gating, no flags.
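For orientation, the four counts listed above could be read off a response's usage metadata roughly as follows. This is a sketch, not the provider's actual code: ``read_usage`` is a hypothetical helper, the ``SimpleNamespace`` stands in for the SDK object so the example runs offline, and the ``or 0`` guard assumes a count may be missing or ``None`` when caching or reasoning was not used.

```python
from types import SimpleNamespace


def read_usage(usage_metadata) -> dict:
    """Hypothetical helper: normalise the four provider counts to plain ints."""

    def count(name: str) -> int:
        # Treat a missing attribute or a None value as zero.
        return getattr(usage_metadata, name, None) or 0

    return {
        "input_tokens": count("prompt_token_count"),
        "output_tokens": count("candidates_token_count"),
        "cached_tokens": count("cached_content_token_count"),
        "thoughts_tokens": count("thoughts_token_count"),
    }


# Stand-in for the usage metadata on a provider response.
fake_metadata = SimpleNamespace(
    prompt_token_count=1_500,
    candidates_token_count=120,
    cached_content_token_count=None,  # no cache hit on this call
    thoughts_token_count=800,
)

usage = read_usage(fake_metadata)
print(usage)
```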
Changes
- ``TokenUsage``: adds ``thoughts_tokens`` (``cached_tokens`` already existed). ``__add__`` sums both fields so multi-iteration agentic-loop aggregation works.
- ``LLMToolCallResult``: adds ``cached_tokens`` + ``thoughts_tokens``.
- ``TokenUsageSummary`` (returned by ``run_reflect_agent``): adds both fields. ``run_reflect_agent`` accumulates them at every call site: main tool loop + structured-output extraction + 4 edge-case completion branches.
- ``_generate_structured_output``: return tuple grows from 3 to 5 elements, ``(output, in, out, cached, thoughts)``; all 6 callers in the reflect agent are updated together.
- ``RetainResult``: adds optional ``llm_cached_input_tokens`` + ``llm_thoughts_tokens``. ``memory_engine`` populates them from the aggregated ``TokenUsage``. Defaults stay ``None`` so existing metering extensions are unaffected.
- Gemini provider: ``thoughts_tokens`` reach ``TokenUsage`` on the ``call`` path, and ``cached_input_tokens`` + ``thoughts_tokens`` reach ``LLMToolCallResult`` on the ``call_with_tools`` path.
Backward compatibility
Every new field defaults to 0 (or ``None`` for the ``RetainResult`` dataclass). Callers built before this change keep working. Provider impls that don't surface these counts simply propagate zeros; the structured Prometheus counters were already optional in ``record_llm_call``.
Test plan
- New ``test_token_usage_cached_thoughts.py`` (6 cases) pinning propagation through every return type and the ``__add__`` aggregation behavior
- Existing Gemini provider tests pass unchanged (``test_gemini_batch.py``, ``test_gemini_cache.py``, ``test_gemini_service_tier.py``: 50 tests)
- Existing reflect-agent tests pass unchanged (``test_reflect_prompt_builder.py``, ``test_reflect_internal_billing.py``, ``test_reflect_empty_based_on.py``, ``test_reflect_source_facts_config.py``: 37 tests)
Why now
Without these fields exposed at the application layer, the only way to attribute prompt-cache hit-rate or reasoning spend per operation is to JOIN Prometheus aggregates against application-level usage records by timestamp + tenant — which is lossy and only works for global trends, not per-customer / per-operation insight. Propagating the values through the existing return contexts removes that gap with a small, fully-backward-compatible change.
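With the counts available on each return context, a per-operation hit-rate becomes a simple local computation. The sketch below is illustrative only: the record layout and the sample numbers are hypothetical, and the hit-rate is taken as cached input tokens divided by total input tokens.

```python
from collections import defaultdict

# Hypothetical usage records: (operation, input_tokens, cached_tokens).
records = [
    ("retain", 1_000, 800),
    ("retain", 1_000, 600),
    ("reflect", 2_000, 500),
]

# Accumulate [input_tokens, cached_tokens] per operation.
totals = defaultdict(lambda: [0, 0])
for operation, input_tokens, cached_tokens in records:
    totals[operation][0] += input_tokens
    totals[operation][1] += cached_tokens

# Cache hit-rate = cached input / total input, guarding against empty input.
hit_rates = {
    operation: (cached / total_input if total_input else 0.0)
    for operation, (total_input, cached) in totals.items()
}

for operation, rate in hit_rates.items():
    print(f"{operation}: {rate:.0%} of input served from cache")
```

The same grouping key could be a tenant or customer identifier instead of an operation name, which is the per-customer view that timestamp joins against aggregates cannot give reliably.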