fix(proxy): include system/tools/sampling in cache key#1473
Conversation
PR governanceThis PR follows the template and is marked ready for human review. |
f5f2013 to
a0ff63c
Compare
JerrettDavis
left a comment
There was a problem hiding this comment.
This fixes an important cache-contamination class, but the key is still not conservative enough for the non-streaming request body that the handlers forward upstream.
For OpenAI chat completions, two requests with the same messages/tools but different tool_choice, response_format, parallel_tool_calls, seed, presence_penalty, frequency_penalty, logit_bias, n, or similar response-shaping fields can still collide and serve the first response. For Anthropic, thinking is another material request field that is not included. The current tests prove the newly added fields, but not that the cache key covers the full forwarded behavior surface.
Please expand the cache-key snapshot to include the remaining forwarded fields that can affect generation, and add at least one regression test for a currently omitted field such as OpenAI tool_choice or response_format and Anthropic thinking. For this cache, it is better to be slightly too specific than to return a response generated under different request semantics.
|
Thanks for the review @JerrettDavis! |
SemanticCache hashed {model, messages} plus only a partial field set,
so two non-streaming requests with identical messages but a different
response-shaping field collided on one key and the second caller was
served the first's response, generated under different request
semantics. cache_enabled defaults True, so this fires by default.
Collapse _compute_key/get/set to **key_fields so each handler's
cache_key_fields snapshot is the single source of truth.
_strip_cache_control runs on every value (scalars pass through;
system/tools keep their cache_control canonicalization). Widen the
OpenAI snapshot with tool_choice, response_format, parallel_tool_calls,
seed, presence_penalty, frequency_penalty, logit_bias, n, logprobs,
top_logprobs, reasoning_effort, verbosity, and modalities; widen the
Anthropic snapshot with tool_choice, thinking, and output_config.
Transport/metadata fields and the deprecated functions API stay out.
Add an OpenAI handler integration test (this threading path had no
miss-direction coverage) and an Anthropic thinking test; expand the
cache-key unit params. Non-streaming path only.
a0ff63c to
e56515e
Compare
|
Thanks @JerrettDavis, good call. Widened the key to the full forwarded generation surface and added handler-level coverage. OpenAI now folds To avoid threading a growing arg list through three signatures, I collapsed On the test gap: a Deliberately left out: transport/metadata ( ruff, ruff format, mypy, and the cache suite are green. |
Description
SemanticCache._compute_key(headroom/proxy/semantic_cache.py) hashed only{model, messages}. The proxy cache is on by default (cache_enabled=True), sotwo non-streaming requests with identical messages but a different top-level
systemprompt (Anthropic), tool set, sampling config, or other response-shapingfield collided on one key and the second caller was served the first's cached
response — generated under different request semantics. Deterministic
cross-request contamination. Found during a proxy-cache audit; no existing issue
tracks it.
Type of Change
Changes Made
proxy/semantic_cache.py:_compute_key/get/setcollapsed to**key_fieldsso each handler'scache_key_fieldssnapshot is the singlesource of truth for what is in the key.
_strip_cache_controlruns on everyvalue (scalars pass through;
system/toolskeepcache_controlcanonicalization so a moved Claude Code breakpoint does not fragment the key).
Absent fields do not contribute, so truly-identical requests still hit.
proxy/handlers/anthropic.py: snapshot foldssystem,tools,tool_choice,temperature,top_p,top_k,max_tokens,stop(stop_sequences),thinking, andoutput_config.proxy/handlers/openai.py: snapshot foldstools,tool_choice,response_format,parallel_tool_calls,temperature,top_p,max_tokens/max_completion_tokens,stop,seed,presence_penalty,frequency_penalty,logit_bias,n,logprobs,top_logprobs,reasoning_effort,verbosity, andmodalities(reconciled against theOpenAPI
CreateChatCompletionRequestschema, not just the literal reviewlist). Each handler snapshots the fields once at the cache read (pre-upstream)
and reuses them at write, so a body mutated by the pipeline cannot diverge the
key (confirmed
body["tools"]is reassigned in the OpenAI handler).Excluded by design: transport/metadata (
stream,stream_options,store,user,service_tier,metadata), the deprecatedfunctions/function_callAPI, and audio-output fields (
audio,prediction) — this path is text traffic.Testing
pytest)ruff check .)mypy headroom)Test Output
Real Behavior Proof
/v1/messagesand/v1/chat/completionshandlers plus SemanticCache with a stubbed upstream (no live API call / credits).pytest tests/test_proxy_openai_cache_key_integration.py— for each newly added field (response_format,tool_choice,seed,reasoning_effort) it sends request A, then request B with the same messages and only that field changed, then request A again, asserting upstream call counts.thinkingcase behaves the same, and the full cache suite is 96 passed.not stream).Review Readiness
Checklist
Additional Notes
_compute_keyunit tests.**key_fieldscollapse means adding a future field is one line in the handler snapshot, with no change to the cache signature.if self.cache and not stream). Agent traffic is largely streaming, so impact is real but bounded — stated honestly rather than overclaimed.headroom/cache/semantic.py, the embeddings layer); it does not touchproxy/semantic_cache.py, so no overlap.--no-verify: the localmake ci-precheckpre-push hook fails on an unrelated Rust latency benchmark (classify_under_10us_per_call) that flakes under machine load. This is a Python-only change; CI runs the benchmark on clean hardware.