feat(proxy): HEADROOM_ORT_EP for OpenVINO/CUDA EP selection; fix Phase E skips when auth-mode-policy-enforcement disabled#1139
Conversation
Adds `init_ort_ep()` to headroom-core, called at proxy startup before any ONNX session is created (fastembed EmbeddingScorer + magika Session share the same ORT singleton, so one global commit covers both). Set HEADROOM_ORT_EP=openvino to target Intel CPU/GPU/NPU via OpenVINO, or HEADROOM_ORT_EP=cuda for NVIDIA GPU. Unset/cpu is a no-op (ORT defaults to CPU). Failures are non-fatal: a warning is logged and ORT falls back to CPU automatically. ort 2.x uses libloading for dynamic EP dispatch — no compile-time feature flags needed. The direct ort dep is pinned to =2.0.0-rc.12 to unify with the singleton already in the tree via fastembed/magika.
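A minimal sketch of the env-var handling this commit describes, using only the standard library. The `OrtEp` enum and the `parse_ort_ep` / `ort_ep_from_env` helpers are hypothetical names, not taken from the PR; case-insensitive matching and the treatment of unrecognised values are assumptions. The actual `ort` registration call is deliberately left out.

```rust
/// Execution provider requested via `HEADROOM_ORT_EP` (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OrtEp {
    Cpu,
    OpenVino,
    Cuda,
}

/// Map the raw env value to a provider.
/// Unset, empty, and `cpu` all mean "leave ORT on its default CPU provider".
/// Assumption: matching is case-insensitive and unknown values are an error
/// that the caller downgrades to a warning.
pub fn parse_ort_ep(raw: Option<&str>) -> Result<OrtEp, String> {
    let value = raw.map(|s| s.trim().to_ascii_lowercase()).unwrap_or_default();
    match value.as_str() {
        "" | "cpu" => Ok(OrtEp::Cpu),
        "openvino" => Ok(OrtEp::OpenVino),
        "cuda" => Ok(OrtEp::Cuda),
        other => Err(format!("unrecognised HEADROOM_ORT_EP value: {other:?}")),
    }
}

/// Non-fatal wrapper: read the env var, warn on bad input, stay on CPU.
pub fn ort_ep_from_env() -> OrtEp {
    let raw = std::env::var("HEADROOM_ORT_EP").ok();
    match parse_ort_ep(raw.as_deref()) {
        Ok(ep) => ep,
        Err(msg) => {
            eprintln!("warning: {msg}; falling back to CPU");
            OrtEp::Cpu
        }
    }
}
```

Keeping the parse step separate from the env lookup makes the mapping testable without mutating process environment variables.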
…CACHE Default device is NPU; also accepts CPU, GPU, GPU.0, HETERO:NPU,GPU. Cache dir caches compiled NPU blobs so subsequent starts skip recompile.
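The device and cache settings could be resolved along these lines. This is a sketch only: `OpenVinoSettings` and both helper functions are hypothetical names, and treating a blank value the same as an unset one is an assumption. The env var names `HEADROOM_ORT_OPENVINO_DEVICE` and `HEADROOM_ORT_OPENVINO_CACHE` are the ones listed in the PR description; the device string is passed through unvalidated so that forms such as `GPU.0` or `HETERO:NPU,GPU` reach OpenVINO as written.

```rust
use std::path::PathBuf;

/// Resolved OpenVINO EP settings (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OpenVinoSettings {
    /// OpenVINO device string, e.g. "NPU", "GPU.0", "HETERO:NPU,GPU".
    pub device: String,
    /// Directory for compiled blobs; `None` means no caching is configured.
    pub cache_dir: Option<PathBuf>,
}

/// Apply the documented default (NPU) and ignore blank values.
pub fn openvino_settings(device: Option<&str>, cache: Option<&str>) -> OpenVinoSettings {
    let device = device
        .map(str::trim)
        .filter(|d| !d.is_empty())
        .unwrap_or("NPU")
        .to_string();
    let cache_dir = cache
        .map(str::trim)
        .filter(|c| !c.is_empty())
        .map(PathBuf::from);
    OpenVinoSettings { device, cache_dir }
}

/// Read both settings from the process environment.
pub fn openvino_settings_from_env() -> OpenVinoSettings {
    let device = std::env::var("HEADROOM_ORT_OPENVINO_DEVICE").ok();
    let cache = std::env::var("HEADROOM_ORT_OPENVINO_CACHE").ok();
    openvino_settings(device.as_deref(), cache.as_deref())
}
```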
…policy-enforcement disabled When --auth-mode-policy-enforcement is set to `disabled`, CompressionPolicy at request entry was already forced to PAYG, but compress_anthropic_request still received the raw classifier output (e.g. subscription). This caused the E1 tool-array sort, E2 schema-key sort, and E3 cache_control auto-placement guards inside normalize_tool_definitions to keep skipping even though the operator explicitly opted into the PAYG pipeline. Token reduction matters for subscription users too: rate limits are token-based, context windows are finite, and smaller payloads transfer faster. The billing-only framing of the original guard was too narrow. Fix: derive effective_auth_mode at the call site using the same enforcement-flag override already applied to CompressionPolicy, and pass that into compress_anthropic_request instead of the raw result.
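The override described in this commit can be sketched as a small pure function. The `AuthMode` and `PolicyEnforcement` enums below are hypothetical stand-ins; the real types and variants in `headroom-proxy` may differ.

```rust
/// Classifier output for a request (hypothetical subset of variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AuthMode {
    Payg,
    Subscription,
}

/// Value of `--auth-mode-policy-enforcement` (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PolicyEnforcement {
    Enabled,
    Disabled,
}

/// With enforcement disabled the operator has opted into the PAYG pipeline,
/// so every request is treated as PAYG; otherwise the classifier result stands.
pub fn effective_auth_mode(classified: AuthMode, enforcement: PolicyEnforcement) -> AuthMode {
    match enforcement {
        PolicyEnforcement::Disabled => AuthMode::Payg,
        PolicyEnforcement::Enabled => classified,
    }
}
```

The call site would then pass the result of `effective_auth_mode(...)` into `compress_anthropic_request` rather than the raw classifier value, so the E1, E2, and E3 guards see the same mode that `CompressionPolicy` already uses.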
PR governance
This PR does not yet satisfy the required template fields:
Please update the PR body, or move the PR back to draft while it is still in progress.
Tool input_schema and function.parameters are constant across requests, not per-request data. Skip ID-field checking inside these schema zones to eliminate false-positive warnings on request_id fields in Claude's tool schemas.

Fixes: warnings like "volatile content in cached prefix will bust prompt-cache hits" for tools[i].input_schema.properties.request_id, which are schema definitions, not volatile per-request values.

Changes:
- is_id_named_key() now accepts location path and returns false for ID fields inside input_schema.* or function.parameters.*
- Updated call site in scan_value_recursive() to pass location
- Updated tests: removed false-positive detection test, added realistic per-request data detection test
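A rough sketch of the location-aware check, assuming the location is a dotted path of the form shown in the warning (`tools[i].input_schema.properties`). The `in_schema_zone` helper and the ID-naming heuristic (`id` or a `_id` suffix) are assumptions for illustration; the real `is_id_named_key()` may use different rules.

```rust
/// True when `location` lies inside a tool schema zone:
/// anything under `input_schema` or under `function.parameters`.
fn in_schema_zone(location: &str) -> bool {
    let mut segments = location.split('.').peekable();
    while let Some(seg) = segments.next() {
        // Drop an array index such as `[3]` so `tools[3]` compares as `tools`.
        let name = seg.split('[').next().unwrap_or(seg);
        if name == "input_schema" {
            return true;
        }
        if name == "function" {
            if let Some(next) = segments.peek() {
                if next.split('[').next() == Some("parameters") {
                    return true;
                }
            }
        }
    }
    false
}

/// Report ID-like keys only outside schema zones, where they describe
/// per-request data rather than constant schema definitions.
pub fn is_id_named_key(key: &str, location: &str) -> bool {
    if in_schema_zone(location) {
        return false;
    }
    // Assumed heuristic: a key is ID-like if it is `id` or ends in `_id`.
    let k = key.to_ascii_lowercase();
    k == "id" || k.ends_with("_id")
}
```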
…oad docs The OpenVINO EP defaults to disable_dynamic_shapes=false, which forces the NPU graph compiler into dynamic-shape mode and hangs for minutes (direct OpenVINO compiles the same static graph in ~13s). Set with_dynamic_shapes(false) so it compiles for a fixed input shape; callers feed fixed-shape inputs. Also correct the init_ort_ep doc: the OpenVINO EP is NOT in a stock onnxruntime, and on Windows ort-load-dynamic silently loads System32\onnxruntime.dll (no OVEP) and runs on CPU while logging success. Documents the ORT_DYLIB_PATH + version-matched openvino fix.
Pushed one commit since your last look:
JerrettDavis
left a comment
Reviewed the ORT execution-provider and Phase E fixes. Startup initializes the ORT EP before sessions are created, OpenVINO device/cache settings are documented in code, the auth-mode override is applied at the compression call site, and the volatile-detector schema false positive is covered. CI is green.
Description
Two related changes to the Rust proxy:
1. `HEADROOM_ORT_EP`: runtime execution provider selection for ONNX Runtime

Adds `init_ort_ep()` to `headroom-core`, called once at proxy startup before any ONNX session is created. The fastembed `EmbeddingScorer` and magika `Session` share the same ORT singleton, so a single env var covers both.

| Variable | Values | Default |
|---|---|---|
| `HEADROOM_ORT_EP` | `cpu` / `openvino` / `cuda` | `cpu` (no-op) |
| `HEADROOM_ORT_OPENVINO_DEVICE` | `NPU`, `CPU`, `GPU`, `GPU.0`, `HETERO:NPU,GPU` | `NPU` |
| `HEADROOM_ORT_OPENVINO_CACHE` | | |

ort 2.x uses `libloading` for dynamic EP dispatch: all EP types are always compiled in; no compile-time feature flags are needed. The direct `ort` dep is pinned to `=2.0.0-rc.12` to unify with the singleton already in the tree via fastembed/magika. Failures are non-fatal: a `WARN` is logged and ORT falls back to CPU automatically.

Windows note: `fastembed` on Windows uses `ort-load-dynamic` (already the case in `Cargo.toml`) rather than `ort-download-binaries-*` to avoid DirectML link-time deps. OpenVINO is a Windows-native runtime; setting `HEADROOM_ORT_EP=openvino` on Windows targets the Intel NPU via the installed OpenVINO redistributable.

2. Fix Phase E passes skipped for non-PAYG auth with `--auth-mode-policy-enforcement disabled`

`CompressionPolicy` at request entry was already forced to PAYG when `--auth-mode-policy-enforcement disabled` is set, but `compress_anthropic_request` still received the raw classifier output. This caused the E1 tool-array sort, E2 schema-key sort, and E3 `cache_control` auto-placement guards to keep skipping even though the operator explicitly opted into the PAYG pipeline.

Token reduction matters for subscription users too: rate limits are token-based, context windows are finite, and smaller payloads transfer faster. The billing-only framing of the original guard was too narrow.

Fix: derive `effective_auth_mode` at the call site using the same enforcement-flag override already applied to `CompressionPolicy`, and pass that into `compress_anthropic_request` instead of the raw classifier result.

Closes #
Type of Change
Changes Made
- `crates/headroom-core/src/lib.rs`: `pub fn init_ort_ep()` reads `HEADROOM_ORT_EP` and initialises the OpenVINO or CUDA EP via `ort::init().with_execution_providers([...]).commit()`
- `crates/headroom-core/Cargo.toml`: direct `ort = "=2.0.0-rc.12"` dep (pinned, `default-features = false`), placed before any `[target.cfg(...)]` section so it is not captured by a target-specific table
- `crates/headroom-proxy/src/main.rs`: call `headroom_core::init_ort_ep()` after `init_tracing` and before `AppState::new`
- `crates/headroom-proxy/src/proxy.rs`: derive `effective_auth_mode` from the enforcement-flag override at the `compress_anthropic_request` call site

Testing

- `pytest`
- `ruff check .`
- `mypy headroom`

No new unit tests for the EP init path: the function is a startup side-effect with no return value and relies on ORT's own runtime presence checks. The auth-mode fix is covered by the existing `pr_e3_subscription_skips_auto_placement` and `e1_passes_through_when_subscription` tests, which assert the per-mode guard behaviour; `--auth-mode-policy-enforcement disabled` sits one layer above and is exercised end-to-end below.

Test Output
Real Behavior Proof
Environment: Windows 11, Intel Core Ultra 9 285HX (Lunar Lake NPU), OpenVINO 2025.x installed, `headroom-proxy.exe` cross-compiled from WSL2 (`x86_64-pc-windows-gnu`, Rust 1.95.0).

Command:

Plus `HEADROOM_ORT_EP=openvino` in the environment.

Observed result, startup:

{"level":"INFO","message":"ORT execution provider: OpenVINO","ep":"openvino","device":"NPU"}
{"level":"INFO","message":"headroom-proxy starting","listen":"0.0.0.0:8787",...}

Observed result, per request (Phase E now fires on subscription auth):

{"level":"INFO","message":"tool-array sort applied: tools reordered alphabetically by name","event":"e1_applied","tool_count":121}
{"level":"INFO","message":"schema-key sort applied: input_schema keys rewritten in alphabetic order","event":"e2_applied","tool_count":121}
{"level":"INFO","message":"customer-placed cache_control marker(s) present; auto-placement skipped","event":"e3_skipped","reason":"marker_present"}
{"level":"INFO","message":"anthropic live-zone dispatch","frozen_message_count":0,"messages_total":13,"latest_user_message_index":"Some(12)","live_zone_blocks":1}
{"level":"INFO","message":"compression applied","strategies":"[\"tool_array_sort\",\"schema_key_sort\"]"}

Before this fix with `--auth-mode-policy-enforcement disabled`, every request produced `e1_skipped` / `e2_skipped` / `e3_skipped` with `reason: auth_mode` and `live_zone_blocks: 0`.

What was not tested: CUDA EP path (no NVIDIA GPU available); Linux OpenVINO EP (NPU driver is Windows-only); behaviour under `--auth-mode-policy-enforcement enabled` (unchanged code path, no regression expected).

Review Readiness
Checklist
Update: NPU enablement (real-hardware validated, Intel AI Boost)

Pushed one commit (`fix(openvino): disable EP dynamic shapes (NPU compile) + correct EP-load docs`) after validating this EP path end-to-end on an Intel NPU running a live workload:

- `with_dynamic_shapes(false)`: the OVEP defaults to `disable_dynamic_shapes=false`, which forces the NPU graph compiler into dynamic-shape mode and hangs for minutes. Direct OpenVINO compiles the same static graph in ~13s; with this flag the proxy's NPU compile completes (~13–20s) and caches a blob. Callers must feed fixed-shape inputs (the Kompress static-shape model in "feat(rust): port Kompress ML prose compressor to Rust (parity-only, 1/3)" #1153 pads each chunk to a fixed length).
- `init_ort_ep`: the previous comment claimed "all EP types are always compiled in … falls back to CPU automatically." That's wrong for OpenVINO: the OVEP is not in a stock `onnxruntime`. On Windows, `ort-load-dynamic` silently loads `C:\Windows\System32\onnxruntime.dll` (no OVEP), and this call logs a misleading success while every session runs on CPU. The fix is to point `ORT_DYLIB_PATH` at an OpenVINO-enabled `onnxruntime` and put a version-matched `openvino` on `PATH` (mismatched versions → `Error 127: procedure not found`). The doc now states this.

Verified: `ORT execution provider: OpenVINO device:NPU`, no `Error 127`, NPU compile blob written, Kompress prose compression running on the NPU in a live session.
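One way to surface the silent System32 fallback described in the update is a startup pre-flight check on `ORT_DYLIB_PATH`. The sketch below is not part of this PR: `Preflight` and `check_ort_dylib` are hypothetical names, and the check only confirms that the variable points at an existing file. It cannot tell whether that library actually contains the OpenVINO EP or whether the `openvino` on `PATH` is version-matched.

```rust
use std::path::Path;

/// Outcome of the pre-flight check (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Preflight {
    /// `ORT_DYLIB_PATH` points at an existing file.
    Ok,
    /// Variable unset or blank: the loader will fall back to its default
    /// search, which on Windows may find the System32 copy without OVEP.
    DylibPathUnset,
    /// Variable set, but no file exists at that path.
    DylibMissing(String),
}

/// Inspect the configured onnxruntime library path before requesting OpenVINO.
pub fn check_ort_dylib(ort_dylib_path: Option<&str>) -> Preflight {
    match ort_dylib_path.map(str::trim).filter(|p| !p.is_empty()) {
        None => Preflight::DylibPathUnset,
        Some(p) if !Path::new(p).is_file() => Preflight::DylibMissing(p.to_string()),
        Some(_) => Preflight::Ok,
    }
}
```

A caller could run `check_ort_dylib(std::env::var("ORT_DYLIB_PATH").ok().as_deref())` when `HEADROOM_ORT_EP=openvino` is set and log a warning for anything other than `Preflight::Ok`, instead of reporting success unconditionally.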