fix(openai): propagate reasoning_tokens into TokenUsage for OpenAI-compatible providers #2378
Merged
nicoloboschi merged 2 commits on Jun 25, 2026
…mpatible providers Follow-up to merged vectorize-io#2356, which shipped TokenUsage.thoughts_tokens but only wired the gemini provider. The OpenAI-compatible backend (the most-used class: OpenAI o-series/gpt-5, groq, deepseek-r1, plus NousLLM/FireworksLLM subclasses) never read completion_tokens_details.reasoning_tokens and never passed thoughts_tokens, so it reported 0 for every OpenAI-compatible reasoning model. Extract reasoning_tokens with a 0-safe getattr chain (mirroring the existing cached_tokens extraction and the gemini wiring) in both call() and call_with_tools(), and pass thoughts_tokens (plus cached_tokens for call_with_tools) into TokenUsage / LLMToolCallResult. Providers without completion_tokens_details (non-reasoning models, Ollama native) keep 0. Scoped to the OpenAI-compatible provider; anthropic_llm.py folds thinking into output_tokens with no separate reasoning sub-count, left as optional follow-up. Adds provider-level regression tests for call() and call_with_tools().
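The 0-safe getattr chain described in this commit could look roughly like the sketch below. The helper names and the SimpleNamespace stand-ins for the SDK usage object are assumptions for illustration, not the code in the provider; only the field names (completion_tokens_details.reasoning_tokens, prompt_tokens_details.cached_tokens) come from the commit message.

```python
from types import SimpleNamespace


def extract_reasoning_tokens(usage) -> int:
    """Read usage.completion_tokens_details.reasoning_tokens, falling back to 0.

    Hypothetical helper: every missing or None link in the chain yields 0.
    """
    details = getattr(usage, "completion_tokens_details", None)
    return getattr(details, "reasoning_tokens", 0) or 0


def extract_cached_tokens(usage) -> int:
    """Same pattern for usage.prompt_tokens_details.cached_tokens."""
    details = getattr(usage, "prompt_tokens_details", None)
    return getattr(details, "cached_tokens", 0) or 0


# Stand-ins for the usage object (assumed shapes, not real SDK types).
reasoning_usage = SimpleNamespace(
    prompt_tokens=100,
    completion_tokens=83,
    prompt_tokens_details=SimpleNamespace(cached_tokens=40),
    completion_tokens_details=SimpleNamespace(reasoning_tokens=64),
)
# A provider that reports no *_details fields at all.
plain_usage = SimpleNamespace(prompt_tokens=100, completion_tokens=20)
# A provider that reports the fields but sets them to None.
null_usage = SimpleNamespace(
    prompt_tokens_details=None,
    completion_tokens_details=None,
)

for usage in (reasoning_usage, plain_usage, null_usage, None):
    print(extract_cached_tokens(usage), extract_reasoning_tokens(usage))
```

The trailing `or 0` covers providers that send an explicit null for the counter itself, which a plain getattr default would not catch.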
…nt reasoning OpenAI-compatible completion_tokens INCLUDES reasoning_tokens (verified live: o4-mini completion=83, reasoning=64), but the TokenUsage contract and the Gemini provider treat output_tokens/total_tokens as visible-only with reasoning surfaced separately in thoughts_tokens. Subtract thoughts_tokens from output_tokens (and total_tokens in call()) so cost attribution doesn't double-count reasoning. Add a convention test pinning the invariant.
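As a worked example of the subtraction, here is a minimal sketch using the o4-mini numbers quoted above (completion=83, reasoning=64). VisibleUsage is a hypothetical stand-in for TokenUsage, the input_tokens field name and the prompt count of 12 are made up for illustration, and clamping at 0 is an assumption of this sketch rather than something the commit states.

```python
from dataclasses import dataclass


@dataclass
class VisibleUsage:
    """Hypothetical stand-in for TokenUsage; not the project's class."""

    input_tokens: int
    output_tokens: int    # visible completion tokens only
    thoughts_tokens: int  # reasoning tokens, reported separately
    total_tokens: int     # excludes reasoning, like output_tokens


def split_reasoning(prompt_tokens: int, completion_tokens: int,
                    total_tokens: int, reasoning_tokens: int) -> VisibleUsage:
    """Move reasoning tokens out of the OpenAI-style completion/total counts."""
    # OpenAI-compatible completion_tokens already includes reasoning_tokens,
    # so subtract once to avoid counting reasoning twice. The max(..., 0)
    # clamp is an assumption guarding against inconsistent provider data.
    visible_output = max(completion_tokens - reasoning_tokens, 0)
    visible_total = max(total_tokens - reasoning_tokens, 0)
    return VisibleUsage(
        input_tokens=prompt_tokens,
        output_tokens=visible_output,
        thoughts_tokens=reasoning_tokens,
        total_tokens=visible_total,
    )


# completion=83 and reasoning=64 are from the commit message; prompt=12 is invented.
usage = split_reasoning(prompt_tokens=12, completion_tokens=83,
                        total_tokens=95, reasoning_tokens=64)
# 83 - 64 = 19 visible output tokens; 95 - 64 = 31 visible total.
print(usage)
```

The invariant a convention test would pin is output_tokens + thoughts_tokens == completion_tokens as reported by the provider.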
nicoloboschi added a commit that referenced this pull request on Jun 25, 2026
…2378) (#2400) PR #2378 added reasoning-token accounting in OpenAICompatibleLLM that subtracts thoughts_tokens from output/total. Several tool-call tests build their mock response with MagicMock() and set only prompt/completion/total tokens, leaving usage.completion_tokens_details as a truthy auto-MagicMock. The new code then does arithmetic on a MagicMock and raises TypeError, failing all test-api shards. Set completion_tokens_details = None in the affected mock helpers (matching the explicit-field convention already documented in test_openrouter_null_content).
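The mock pitfall can be reproduced in isolation. In the sketch below, the helper names and the max(...) expression are assumptions standing in for the provider code, which is not shown here; the point is only that an unset attribute on a MagicMock is itself a truthy MagicMock, so a getattr chain with an `or 0` fallback never reaches the 0.

```python
from unittest.mock import MagicMock


def reasoning_tokens_of(usage):
    """Hypothetical 0-safe getattr chain, as described in PR #2378."""
    details = getattr(usage, "completion_tokens_details", None)
    return getattr(details, "reasoning_tokens", 0) or 0


def visible_output(usage):
    """Stand-in for the provider's subtraction; the real expression may differ."""
    return max(usage.completion_tokens - reasoning_tokens_of(usage), 0)


def make_usage(explicit_details: bool) -> MagicMock:
    """Build a mock usage object the way the affected test helpers did."""
    usage = MagicMock()
    usage.prompt_tokens = 10
    usage.completion_tokens = 20
    usage.total_tokens = 30
    if explicit_details:
        # The fix: state explicitly that the provider reported no details.
        usage.completion_tokens_details = None
    return usage


leaky = make_usage(explicit_details=False)
fixed = make_usage(explicit_details=True)

# Without the explicit None, the chain returns an auto-created MagicMock.
print(type(reasoning_tokens_of(leaky)).__name__)
print(reasoning_tokens_of(fixed))

try:
    visible_output(leaky)
except TypeError as exc:
    # In this sketch the error surfaces when max() compares a mock with an int.
    print("TypeError:", exc)

print(visible_output(fixed))
```

Setting every field the code under test reads, including the ones meant to be absent, keeps auto-created attributes from leaking into numeric code.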
Follow-up to merged #2356 (feat(tokens): propagate cached + thoughts tokens through return contexts).

Problem

#2356 shipped TokenUsage.thoughts_tokens "for cost attribution and prompt-cache tuning" and threaded it through retain_batch_async → operation_validator, but only the gemini provider actually populates it. The OpenAI-compatible backend, the most-used class (OpenAI o-series / gpt-5, groq, deepseek-r1, plus the NousLLM / FireworksLLM subclasses), never reads completion_tokens_details.reasoning_tokens and never passes thoughts_tokens when building its return values:

- call() extracts cached_tokens from usage.prompt_tokens_details but constructs TokenUsage(...) without thoughts_tokens → defaults to 0.
- call_with_tools() builds LLMToolCallResult(...) with neither cached_tokens nor thoughts_tokens → both default to 0.

So for every OpenAI-compatible reasoning model, the brand-new cost-attribution field silently reports thoughts_tokens=0, and the gap widens as more reasoning models are used.

Fix

Mirror the existing cached_tokens extraction (and the gemini wiring) with a 0-safe getattr chain:

- call(): extract and pass thoughts_tokens into TokenUsage(...).
- call_with_tools(): extract both cached_tokens and thoughts_tokens and pass them into LLMToolCallResult(...).

The getattr chain is 0-safe for providers that don't report completion_tokens_details (non-reasoning models, the Ollama native path), so they keep thoughts_tokens=0. NousLLM / FireworksLLM inherit the fix via subclassing.

Scoped to the OpenAI-compatible provider. anthropic_llm.py also omits thoughts_tokens, but Anthropic folds thinking into output_tokens with no separate reasoning sub-count, so it's left as an optional follow-up rather than bloating this PR.

Tests

tests/test_token_usage_cached_thoughts.py gains provider-level regression tests (mocking the OpenAI client usage object):

- call() and call_with_tools() surface reasoning_tokens → thoughts_tokens (and cached_tokens).
- Providers without completion_tokens_details keep thoughts_tokens=0 (0-safe).

The existing model-level propagation tests are unchanged.