fix(openai): propagate reasoning_tokens into TokenUsage for OpenAI-compatible providers by r266-tech · Pull Request #2378 · vectorize-io/hindsight

r266-tech · 2026-06-23T17:09:00Z

Follow-up to merged #2356 (feat(tokens): propagate cached + thoughts tokens through return contexts).

Problem

#2356 shipped TokenUsage.thoughts_tokens "for cost attribution and prompt-cache tuning" and threaded it through retain_batch_async → operation_validator, but only the gemini provider actually populates it. The OpenAI-compatible backend — the most-used class (OpenAI o-series / gpt-5, groq, deepseek-r1, plus the NousLLM / FireworksLLM subclasses) — never reads completion_tokens_details.reasoning_tokens and never passes thoughts_tokens when building its return values:

call() extracts cached_tokens from usage.prompt_tokens_details but constructs TokenUsage(...) without thoughts_tokens → defaults to 0.
call_with_tools() builds LLMToolCallResult(...) with neither cached_tokens nor thoughts_tokens → both default to 0.

So for every OpenAI-compatible reasoning model, the brand-new cost-attribution field silently reports thoughts_tokens=0, and the gap widens as more reasoning models are used.

Fix

Mirror the existing cached_tokens extraction (and the gemini wiring) with a 0-safe getattr chain:

thoughts_tokens = 0
if usage and getattr(usage, "completion_tokens_details", None):
    thoughts_tokens = getattr(usage.completion_tokens_details, "reasoning_tokens", 0) or 0

call(): extract and pass thoughts_tokens into TokenUsage(...).
call_with_tools(): extract both cached_tokens and thoughts_tokens and pass them into LLMToolCallResult(...).

The getattr chain is 0-safe for providers that don't report completion_tokens_details (non-reasoning models, the Ollama native path), so they keep thoughts_tokens=0. NousLLM / FireworksLLM inherit the fix via subclassing.
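The 0-safe behavior can be checked in isolation. The following is a minimal sketch, not the code from this PR: extract_thoughts_tokens is a hypothetical helper that wraps the snippet above, and SimpleNamespace objects stand in for the OpenAI client's usage object.

```python
from types import SimpleNamespace


def extract_thoughts_tokens(usage):
    """Return reasoning_tokens from an OpenAI-style usage object, or 0.

    Hypothetical helper: it only wraps the getattr chain quoted in this PR.
    """
    thoughts_tokens = 0
    if usage and getattr(usage, "completion_tokens_details", None):
        thoughts_tokens = getattr(usage.completion_tokens_details, "reasoning_tokens", 0) or 0
    return thoughts_tokens


# Reasoning model: the details object is present and reports a count.
reasoning_usage = SimpleNamespace(
    completion_tokens_details=SimpleNamespace(reasoning_tokens=64)
)
# Non-reasoning provider: the attribute is missing entirely.
plain_usage = SimpleNamespace(completion_tokens=20)
# Provider that sends the details object with a null count.
null_usage = SimpleNamespace(
    completion_tokens_details=SimpleNamespace(reasoning_tokens=None)
)

print(extract_thoughts_tokens(reasoning_usage))  # 64
print(extract_thoughts_tokens(plain_usage))      # 0
print(extract_thoughts_tokens(null_usage))       # 0
print(extract_thoughts_tokens(None))             # 0
```

The trailing "or 0" is what turns an explicit None count into 0, which a bare getattr default would not do.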

Scoped to the OpenAI-compatible provider. anthropic_llm.py also omits thoughts_tokens, but Anthropic folds thinking into output_tokens with no separate reasoning sub-count, so it's left as an optional follow-up rather than bloating this PR.

Tests

tests/test_token_usage_cached_thoughts.py gains provider-level regression tests (mocking the OpenAI client usage object):

call() and call_with_tools() surface reasoning_tokens → thoughts_tokens (and cached_tokens).
Providers that report no completion_tokens_details keep thoughts_tokens=0 (0-safe).
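A regression test of this shape might look like the sketch below. It is an illustration under assumptions: FakeToolCallResult and build_tool_call_result are hypothetical stand-ins for LLMToolCallResult and the usage handling in call_with_tools(), whereas the real tests mock the OpenAI client itself.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class FakeToolCallResult:
    """Hypothetical stand-in for LLMToolCallResult; both fields default to 0."""
    cached_tokens: int = 0
    thoughts_tokens: int = 0


def build_tool_call_result(usage):
    """Hypothetical stand-in for the usage handling in call_with_tools()."""
    cached = 0
    if usage and getattr(usage, "prompt_tokens_details", None):
        cached = getattr(usage.prompt_tokens_details, "cached_tokens", 0) or 0
    thoughts = 0
    if usage and getattr(usage, "completion_tokens_details", None):
        thoughts = getattr(usage.completion_tokens_details, "reasoning_tokens", 0) or 0
    return FakeToolCallResult(cached_tokens=cached, thoughts_tokens=thoughts)


def test_surfaces_reasoning_and_cached_tokens():
    usage = SimpleNamespace(
        prompt_tokens_details=SimpleNamespace(cached_tokens=128),
        completion_tokens_details=SimpleNamespace(reasoning_tokens=64),
    )
    result = build_tool_call_result(usage)
    assert result.cached_tokens == 128
    assert result.thoughts_tokens == 64


def test_missing_details_stay_zero():
    # Explicit None mirrors a provider that reports no detail objects.
    usage = SimpleNamespace(prompt_tokens_details=None, completion_tokens_details=None)
    assert build_tool_call_result(usage) == FakeToolCallResult()
```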

The existing model-level propagation tests are unchanged.

…mpatible providers

Follow-up to merged vectorize-io#2356, which shipped TokenUsage.thoughts_tokens but only wired the gemini provider. The OpenAI-compatible backend (the most-used class: OpenAI o-series/gpt-5, groq, deepseek-r1, plus NousLLM/FireworksLLM subclasses) never read completion_tokens_details.reasoning_tokens and never passed thoughts_tokens, so it reported 0 for every OpenAI-compatible reasoning model.

Extract reasoning_tokens with a 0-safe getattr chain (mirroring the existing cached_tokens extraction and the gemini wiring) in both call() and call_with_tools(), and pass thoughts_tokens (plus cached_tokens for call_with_tools) into TokenUsage / LLMToolCallResult. Providers without completion_tokens_details (non-reasoning models, Ollama native) keep 0.

Scoped to the OpenAI-compatible provider; anthropic_llm.py folds thinking into output_tokens with no separate reasoning sub-count, left as optional follow-up.

Adds provider-level regression tests for call() and call_with_tools().

…nt reasoning

OpenAI-compatible completion_tokens INCLUDES reasoning_tokens (verified live: o4-mini completion=83, reasoning=64), but the TokenUsage contract and the Gemini provider treat output_tokens/total_tokens as visible-only with reasoning surfaced separately in thoughts_tokens.

Subtract thoughts_tokens from output_tokens (and total_tokens in call()) so cost attribution doesn't double-count reasoning. Add a convention test pinning the invariant.
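The subtraction can be worked through with the o4-mini figures quoted in this commit message. This is a sketch only: split_reasoning is a hypothetical helper, the prompt_tokens value is invented for illustration, and the final check is one plausible reading of the invariant rather than the actual convention test.

```python
def split_reasoning(completion_tokens, total_tokens, reasoning_tokens):
    """Hypothetical helper: make output/total visible-only.

    Reasoning is reported separately, so it is removed from both counts.
    """
    output_tokens = completion_tokens - reasoning_tokens
    visible_total = total_tokens - reasoning_tokens
    return output_tokens, visible_total, reasoning_tokens


completion_tokens = 83   # from the commit message (o4-mini)
reasoning_tokens = 64    # from the commit message (o4-mini)
prompt_tokens = 100      # hypothetical value, not from the PR
total_tokens = prompt_tokens + completion_tokens

output_tokens, visible_total, thoughts_tokens = split_reasoning(
    completion_tokens, total_tokens, reasoning_tokens
)

print(output_tokens)    # 19
print(visible_total)    # 119
print(thoughts_tokens)  # 64

# Visible output plus reasoning adds back up to what the provider reported,
# so nothing is counted twice and nothing is lost.
assert output_tokens + thoughts_tokens == completion_tokens
```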

…2378) (#2400)

PR #2378 added reasoning-token accounting in OpenAICompatibleLLM that subtracts thoughts_tokens from output/total. Several tool-call tests build their mock response with MagicMock() and set only prompt/completion/total tokens, leaving usage.completion_tokens_details as a truthy auto-MagicMock. The new code then does arithmetic on a MagicMock and raises TypeError, failing all test-api shards.

Set completion_tokens_details = None in the affected mock helpers (matching the explicit-field convention already documented in test_openrouter_null_content).
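The mock leak is easy to reproduce outside the test suite. The sketch below assumes the extraction has the shape quoted in the PR description; extract_thoughts_tokens is a hypothetical wrapper, and the example stops at showing that a non-integer leaks through rather than reproducing the exact TypeError.

```python
from unittest.mock import MagicMock


def extract_thoughts_tokens(usage):
    """Hypothetical wrapper around the getattr chain from PR #2378."""
    thoughts_tokens = 0
    if usage and getattr(usage, "completion_tokens_details", None):
        thoughts_tokens = getattr(usage.completion_tokens_details, "reasoning_tokens", 0) or 0
    return thoughts_tokens


# Mock usage that sets only the top-level counts, as in the affected helpers.
leaky_usage = MagicMock()
leaky_usage.prompt_tokens = 10
leaky_usage.completion_tokens = 5
leaky_usage.total_tokens = 15

# MagicMock auto-creates completion_tokens_details as a truthy child mock,
# so the guard passes and reasoning_tokens comes back as another MagicMock.
leaked = extract_thoughts_tokens(leaky_usage)
print(isinstance(leaked, int))  # False

# Setting the field explicitly to None restores the 0-safe path.
fixed_usage = MagicMock()
fixed_usage.prompt_tokens = 10
fixed_usage.completion_tokens = 5
fixed_usage.total_tokens = 15
fixed_usage.completion_tokens_details = None

print(extract_thoughts_tokens(fixed_usage))  # 0
```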

r266-tech and others added 2 commits June 24, 2026 01:08

nicoloboschi merged commit bc81369 into vectorize-io:main Jun 25, 2026
85 checks passed

nicoloboschi mentioned this pull request Jun 25, 2026

test(openai): fix tool-call mocks crashing reasoning-token accounting (#2378) #2400

Merged

nicoloboschi merged 2 commits into vectorize-io:main from r266-tech:fix/openai-compat-reasoning-tokens
