fix(openai): propagate reasoning_tokens into TokenUsage for OpenAI-compatible providers #2378

Merged
nicoloboschi merged 2 commits into vectorize-io:main from r266-tech:fix/openai-compat-reasoning-tokens
Jun 25, 2026

@r266-tech

Contributor

Follow-up to merged #2356 (feat(tokens): propagate cached + thoughts tokens through return contexts).

Problem

#2356 shipped TokenUsage.thoughts_tokens "for cost attribution and prompt-cache tuning" and threaded it through retain_batch_async → operation_validator, but only the gemini provider actually populates it. The OpenAI-compatible backend is the most-used class (OpenAI o-series / gpt-5, groq, deepseek-r1, plus the NousLLM / FireworksLLM subclasses), yet it never reads completion_tokens_details.reasoning_tokens and never passes thoughts_tokens when building its return values:

  • call() extracts cached_tokens from usage.prompt_tokens_details but constructs TokenUsage(...) without thoughts_tokens → defaults to 0.
  • call_with_tools() builds LLMToolCallResult(...) with neither cached_tokens nor thoughts_tokens → both default to 0.

So for every OpenAI-compatible reasoning model, the brand-new cost-attribution field silently reports thoughts_tokens=0, and the gap widens as more reasoning models are used.

Fix

Mirror the existing cached_tokens extraction (and the gemini wiring) with a 0-safe getattr chain:

thoughts_tokens = 0
if usage and getattr(usage, "completion_tokens_details", None):
    thoughts_tokens = getattr(usage.completion_tokens_details, "reasoning_tokens", 0) or 0

  • call(): extract and pass thoughts_tokens into TokenUsage(...).
  • call_with_tools(): extract both cached_tokens and thoughts_tokens and pass them into LLMToolCallResult(...).

The getattr chain is 0-safe for providers that don't report completion_tokens_details (non-reasoning models, the Ollama native path), so they keep thoughts_tokens=0. NousLLM / FireworksLLM inherit the fix via subclassing.
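For reviewers who want to poke at the 0-safe behaviour outside the provider, here is a minimal standalone sketch. The helper name `extract_token_details` and the `SimpleNamespace` usage objects are hypothetical stand-ins, not code from this PR, and the `cached_tokens` branch assumes the existing extraction follows the same pattern as the snippet above.

```python
from types import SimpleNamespace


def extract_token_details(usage):
    """Return (cached_tokens, thoughts_tokens), defaulting each to 0.

    Hypothetical helper: it applies the getattr chain from this PR to both
    detail objects. The cached_tokens branch is an assumed mirror of it.
    """
    cached_tokens = 0
    if usage and getattr(usage, "prompt_tokens_details", None):
        cached_tokens = getattr(usage.prompt_tokens_details, "cached_tokens", 0) or 0

    thoughts_tokens = 0
    if usage and getattr(usage, "completion_tokens_details", None):
        thoughts_tokens = getattr(usage.completion_tokens_details, "reasoning_tokens", 0) or 0

    return cached_tokens, thoughts_tokens


# Reasoning model: both detail objects are present.
reasoning_usage = SimpleNamespace(
    prompt_tokens_details=SimpleNamespace(cached_tokens=128),
    completion_tokens_details=SimpleNamespace(reasoning_tokens=64),
)
# Provider that omits the detail objects entirely.
bare_usage = SimpleNamespace(prompt_tokens=10, completion_tokens=5)
# Provider that sends the detail object but leaves the field null.
null_field_usage = SimpleNamespace(
    prompt_tokens_details=None,
    completion_tokens_details=SimpleNamespace(reasoning_tokens=None),
)

print(extract_token_details(reasoning_usage))   # (128, 64)
print(extract_token_details(bare_usage))        # (0, 0)
print(extract_token_details(null_field_usage))  # (0, 0)
print(extract_token_details(None))              # (0, 0)
```

The trailing `or 0` is what turns an explicit `None` field into 0, and the outer `getattr(..., None)` guard covers a missing or null detail object.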

Scoped to the OpenAI-compatible provider. anthropic_llm.py also omits thoughts_tokens, but Anthropic folds thinking into output_tokens with no separate reasoning sub-count, so it's left as an optional follow-up rather than bloating this PR.

Tests

tests/test_token_usage_cached_thoughts.py gains provider-level regression tests (mocking the OpenAI client usage object):

  • call() and call_with_tools() surface reasoning_tokens → thoughts_tokens (and cached_tokens).
  • Providers without completion_tokens_details keep thoughts_tokens=0 (0-safe).

The existing model-level propagation tests are unchanged.

r266-tech and others added 2 commits June 24, 2026 01:08
…mpatible providers

Follow-up to merged vectorize-io#2356, which shipped TokenUsage.thoughts_tokens but only
wired the gemini provider. The OpenAI-compatible backend (the most-used class:
OpenAI o-series/gpt-5, groq, deepseek-r1, plus NousLLM/FireworksLLM subclasses)
never read completion_tokens_details.reasoning_tokens and never passed
thoughts_tokens, so it reported 0 for every OpenAI-compatible reasoning model.

Extract reasoning_tokens with a 0-safe getattr chain (mirroring the existing
cached_tokens extraction and the gemini wiring) in both call() and
call_with_tools(), and pass thoughts_tokens (plus cached_tokens for
call_with_tools) into TokenUsage / LLMToolCallResult. Providers without
completion_tokens_details (non-reasoning models, Ollama native) keep 0.

Scoped to the OpenAI-compatible provider; anthropic_llm.py folds thinking into
output_tokens with no separate reasoning sub-count, left as optional follow-up.
Adds provider-level regression tests for call() and call_with_tools().

…nt reasoning

OpenAI-compatible completion_tokens INCLUDES reasoning_tokens (verified live:
o4-mini completion=83, reasoning=64), but the TokenUsage contract and the
Gemini provider treat output_tokens/total_tokens as visible-only with
reasoning surfaced separately in thoughts_tokens. Subtract thoughts_tokens
from output_tokens (and total_tokens in call()) so cost attribution doesn't
double-count reasoning. Add a convention test pinning the invariant.
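
The subtraction described above can be sketched with the o4-mini numbers from this commit message (completion=83, reasoning=64). The `VisibleUsage` dataclass, its `input_tokens` field, the `split_reasoning` helper, and the prompt count of 20 are all hypothetical and chosen for illustration; only the two o4-mini figures come from the live check.

```python
from dataclasses import dataclass


@dataclass
class VisibleUsage:
    """Hypothetical stand-in for the TokenUsage fields discussed here."""
    input_tokens: int
    output_tokens: int    # visible output only
    total_tokens: int     # excludes reasoning
    thoughts_tokens: int  # reasoning, reported separately


def split_reasoning(prompt_tokens, completion_tokens, total_tokens, reasoning_tokens):
    """Move reasoning out of output/total so it is counted exactly once."""
    return VisibleUsage(
        input_tokens=prompt_tokens,
        output_tokens=completion_tokens - reasoning_tokens,
        total_tokens=total_tokens - reasoning_tokens,
        thoughts_tokens=reasoning_tokens,
    )


# completion=83 and reasoning=64 are from the live check; prompt=20 is made up.
usage = split_reasoning(
    prompt_tokens=20, completion_tokens=83, total_tokens=103, reasoning_tokens=64
)
print(usage.output_tokens)    # 19
print(usage.total_tokens)     # 39
print(usage.thoughts_tokens)  # 64

# The invariant: visible output plus reasoning adds back up to the raw count.
assert usage.output_tokens + usage.thoughts_tokens == 83
```

Without the subtraction, a consumer summing output_tokens and thoughts_tokens would bill 83 + 64 = 147 tokens for a completion that actually used 83.
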
nicoloboschi merged commit bc81369 into vectorize-io:main Jun 25, 2026
85 checks passed
nicoloboschi added a commit that referenced this pull request Jun 25, 2026
…2378) (#2400)

PR #2378 added reasoning-token accounting in OpenAICompatibleLLM that
subtracts thoughts_tokens from output/total. Several tool-call tests build
their mock response with MagicMock() and set only prompt/completion/total
tokens, leaving usage.completion_tokens_details as a truthy auto-MagicMock.
The new code then does arithmetic on a MagicMock and raises TypeError,
failing all test-api shards. Set completion_tokens_details = None in the
affected mock helpers (matching the explicit-field convention already
documented in test_openrouter_null_content).
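
A small sketch of why the auto-MagicMock slips past the 0-safe guard. The helper `reasoning_tokens_from` is a hypothetical wrapper around the getattr chain from #2378, not the provider code itself. It only shows that the extracted value ends up being a MagicMock rather than an int; where exactly the TypeError surfaces in the provider is not reproduced here.

```python
from unittest.mock import MagicMock


def reasoning_tokens_from(usage):
    """Hypothetical wrapper around the getattr chain added in #2378."""
    thoughts_tokens = 0
    if usage and getattr(usage, "completion_tokens_details", None):
        thoughts_tokens = getattr(usage.completion_tokens_details, "reasoning_tokens", 0) or 0
    return thoughts_tokens


# Mock that sets only the three top-level counts, as the affected tests did.
loose_usage = MagicMock()
loose_usage.prompt_tokens = 10
loose_usage.completion_tokens = 5
loose_usage.total_tokens = 15

# completion_tokens_details is auto-created and truthy, so the guard passes
# and reasoning_tokens comes back as yet another MagicMock.
leaked = reasoning_tokens_from(loose_usage)
print(isinstance(leaked, MagicMock))  # True

# The fix: set the field explicitly so the guard short-circuits.
strict_usage = MagicMock()
strict_usage.prompt_tokens = 10
strict_usage.completion_tokens = 5
strict_usage.total_tokens = 15
strict_usage.completion_tokens_details = None
print(reasoning_tokens_from(strict_usage))  # 0
```

An alternative would be building the mock usage from a plain object such as `types.SimpleNamespace`, which raises AttributeError for unset fields instead of inventing them.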