feat(tokens): propagate cached + thoughts tokens through return contexts #2356

Open
cdbartholomew wants to merge 3 commits into main from feat/propagate-cached-thoughts-tokens
@cdbartholomew

Summary

The Gemini 2.5+ family (and any future provider that combines prompt caching with reasoning tokens) reports four distinct token counts on every response:

  • prompt_token_count — total input
  • candidates_token_count — visible output
  • cached_content_token_count — subset of input served from prompt cache
  • thoughts_token_count — reasoning tokens, billed at the output rate by the provider

The provider already records the last two on the Prometheus counters hindsight.llm.tokens.cached_input and hindsight.llm.tokens.thoughts, but the values stop at the metrics layer. Every downstream return context (TokenUsage, LLMToolCallResult, TokenUsageSummary, RetainResult) only carries the top-level input/output split. That means:

  1. A metering extension can only attribute prompt-cache hit-rate globally (via Prometheus aggregates), not per-operation.
  2. Reasoning-token spend is invisible at the application layer because the provider keeps it out of candidates_token_count. A workload that "looks cheap" by visible-output volume can be silently expensive if the model is doing long reasoning chains.
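
As a rough illustration of what becomes computable per operation once all four counts are available, here is a minimal sketch. The `OperationTokens` class and its property names are hypothetical and not part of this PR. Treating billed output as candidates plus thoughts is an assumption drawn from the description above (thoughts are billed at the output rate and kept out of candidates_token_count).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationTokens:
    """Hypothetical per-operation record of the four provider counts."""

    prompt_tokens: int      # prompt_token_count: total input
    candidates_tokens: int  # candidates_token_count: visible output
    cached_tokens: int      # cached_content_token_count: subset of input
    thoughts_tokens: int    # thoughts_token_count: reasoning tokens

    @property
    def cache_hit_rate(self) -> float:
        """Fraction of input served from the prompt cache (0.0 if no input)."""
        if self.prompt_tokens == 0:
            return 0.0
        return self.cached_tokens / self.prompt_tokens

    @property
    def billed_output_tokens(self) -> int:
        """Visible output plus reasoning tokens (assumed same output rate)."""
        return self.candidates_tokens + self.thoughts_tokens


# A call that "looks cheap" by visible output (50 tokens) but is not.
op = OperationTokens(
    prompt_tokens=1000,
    candidates_tokens=50,
    cached_tokens=800,
    thoughts_tokens=450,
)
print(op.cache_hit_rate)        # 0.8
print(op.billed_output_tokens)  # 500
```

In this made-up example the visible output is 50 tokens, while the output-rate spend is ten times that, which is the case item 2 describes.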

This PR threads the two fields through end-to-end. Pure plumbing — no metrics renames, no gating, no flags.

Changes

  • TokenUsage — adds thoughts_tokens (cached_tokens already existed). __add__ sums both cached_tokens and thoughts_tokens so multi-iteration agentic-loop aggregation works.
  • LLMToolCallResult — adds cached_tokens + thoughts_tokens.
  • TokenUsageSummary (returned by run_reflect_agent) — adds both fields. run_reflect_agent accumulates them at every call site: main tool loop + structured-output extraction + 4 edge-case completion branches.
  • _generate_structured_output — return tuple grows from 3 to 5 (output, in, out, cached, thoughts); all 6 callers in the reflect agent updated together.
  • RetainResult — adds optional llm_cached_input_tokens + llm_thoughts_tokens. memory_engine populates them from aggregated TokenUsage. Defaults stay None so existing metering extensions are unaffected.
  • Gemini provider — already reads the four token counts from the SDK response; this change makes thoughts_tokens reach TokenUsage on the call path, and cached_input_tokens + thoughts_tokens reach LLMToolCallResult on the call_with_tools path.
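
The aggregation described in the list above can be sketched roughly as follows. This is an illustrative stand-in, not the code in this PR: the `input_tokens` / `output_tokens` field names, the dataclass layout, and the `fake_generate_structured_output` helper are assumptions. Only `cached_tokens`, `thoughts_tokens`, and the `(output, in, out, cached, thoughts)` tuple order come from the description.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    """Illustrative stand-in; the real class may carry more fields."""

    input_tokens: int = 0
    output_tokens: int = 0
    cached_tokens: int = 0
    thoughts_tokens: int = 0  # new field; defaults to 0 for older callers

    def __add__(self, other: "TokenUsage") -> "TokenUsage":
        # Sum every counter so usage can be accumulated across iterations.
        return TokenUsage(
            input_tokens=self.input_tokens + other.input_tokens,
            output_tokens=self.output_tokens + other.output_tokens,
            cached_tokens=self.cached_tokens + other.cached_tokens,
            thoughts_tokens=self.thoughts_tokens + other.thoughts_tokens,
        )


def fake_generate_structured_output() -> tuple:
    """Hypothetical stand-in returning (output, in, out, cached, thoughts)."""
    return {"answer": "ok"}, 120, 30, 80, 200


# Accumulate usage over three iterations of an agentic loop.
total = TokenUsage()
for _ in range(3):
    _output, in_tok, out_tok, cached, thoughts = fake_generate_structured_output()
    total = total + TokenUsage(in_tok, out_tok, cached, thoughts)

print(total)
# TokenUsage(input_tokens=360, output_tokens=90, cached_tokens=240, thoughts_tokens=600)
```

Because the tuple grows from 3 to 5 elements, any unpack site left at three names would raise `ValueError`, which is presumably why all callers are updated in the same change.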

Backward compatibility

Every new field defaults to 0 (or None for the RetainResult dataclass). Callers built before this change keep working. Provider impls that don't surface these counts simply propagate zeros — the structured Prometheus counters were already optional in record_llm_call.
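
For a metering extension, the `None` default means "the engine did not report this count", which is distinct from a reported zero. A minimal sketch of how a consumer might handle that, assuming a trimmed-down `RetainResult` (only the two new field names come from this PR) and a hypothetical `thoughts_tokens_for_billing` helper:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetainResult:
    """Illustrative subset; the real dataclass has other fields."""

    llm_cached_input_tokens: Optional[int] = None
    llm_thoughts_tokens: Optional[int] = None


def thoughts_tokens_for_billing(result: RetainResult) -> int:
    """Hypothetical metering helper: treat an unreported count as zero."""
    if result.llm_thoughts_tokens is None:
        return 0
    return result.llm_thoughts_tokens


old_style = RetainResult()  # built without the new fields
new_style = RetainResult(llm_cached_input_tokens=800, llm_thoughts_tokens=450)

print(thoughts_tokens_for_billing(old_style))  # 0
print(thoughts_tokens_for_billing(new_style))  # 450
```

An extension that wants to tell "not reported" apart from "reported as zero" can check for `None` directly instead of collapsing it to 0.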

Test plan

  • New focused test file test_token_usage_cached_thoughts.py (6 cases) pinning propagation through every return type and the __add__ aggregation behavior
  • All existing Gemini provider tests pass unchanged (test_gemini_batch.py, test_gemini_cache.py, test_gemini_service_tier.py — 50 tests)
  • All existing reflect tests pass unchanged (test_reflect_prompt_builder.py, test_reflect_internal_billing.py, test_reflect_empty_based_on.py, test_reflect_source_facts_config.py — 37 tests)
  • CI green

Why now

Without these fields exposed at the application layer, the only way to attribute prompt-cache hit-rate or reasoning spend per operation is to JOIN Prometheus aggregates against application-level usage records by timestamp + tenant — which is lossy and only works for global trends, not per-customer / per-operation insight. Propagating the values through the existing return contexts removes that gap with a small, fully-backward-compatible change.

The Gemini 2.5+ family (and any future provider that combines prompt caching
with reasoning tokens) reports four distinct token counts on every response:

  - prompt_token_count        (total input)
  - candidates_token_count    (visible output)
  - cached_content_token_count (subset of input served from prompt cache)
  - thoughts_token_count      (reasoning tokens, billed at output rate)

The provider already records the last two on the Prometheus
``hindsight.llm.tokens.{cached_input,thoughts}`` counters, but the values
stop at the metrics layer — every return context (TokenUsage,
LLMToolCallResult, TokenUsageSummary, RetainResult) only exposes the
top-level input/output split. As a result:

  * a downstream metering extension can't attribute prompt-cache hit-rate
    per operation (only globally via Prometheus aggregates), and
  * reasoning-token spend is not reflected in ``output_tokens`` because the
    provider keeps it out of candidates_token_count. A workload that
    "looks cheap" by visible output can be silently expensive if the
    model is doing long reasoning chains.

This change threads the two fields through end-to-end:

  - ``TokenUsage`` gains ``thoughts_tokens`` (cached_tokens already
    existed); ``__add__`` sums it so multi-iteration agentic-loop
    aggregation works.
  - ``LLMToolCallResult`` gains ``cached_tokens`` + ``thoughts_tokens``.
  - ``TokenUsageSummary`` (returned by ``run_reflect_agent``) gains
    both fields and ``run_reflect_agent`` accumulates them at every
    call site (main tool loop + structured-output extraction + 4
    edge-case completion branches).
  - ``_generate_structured_output`` now returns a 5-tuple
    ``(output, in, out, cached, thoughts)``; the 6 unpack sites in the
    reflect agent are updated together.
  - ``RetainResult`` gains optional ``llm_cached_input_tokens`` and
    ``llm_thoughts_tokens`` fields; ``memory_engine`` populates them
    from the aggregated ``TokenUsage``. Defaults stay ``None`` for
    engines that don't surface the data so existing metering extensions
    are unaffected.
  - The Gemini provider — which was already reading the four token
    counts from the SDK response — now returns ``thoughts_tokens`` on
    both the ``call`` and ``call_with_tools`` paths, and the existing
    ``cached_input_tokens`` value reaches ``LLMToolCallResult``.

Backward compatibility: every new field defaults to 0 (or None for the
RetainResult dataclass), so any caller built before this change keeps
working. Provider impls that don't surface these counts simply propagate
zeros — the structured Prometheus counters were already optional in
``record_llm_call``.

Adds focused tests (``test_token_usage_cached_thoughts.py``, 6 cases)
pinning the propagation through every return type and the aggregation
behavior. Existing reflect-agent + Gemini provider tests (87 cases) pass
unchanged.

This is a pure plumbing change — no metrics are renamed, no behavior is
gated, no flags are added.

Picks up the new TokenUsage.thoughts_tokens field added in the parent
commit. Generated by:

  ./scripts/generate-openapi.sh
  ./scripts/generate-clients.sh

Plus ``ruff format`` over the two reflect/ source files to match the
project's enforced formatting style.

No hand edits in any generated file.