feat(proxy): HEADROOM_ORT_EP for OpenVINO/CUDA EP selection; fix Phase E skips when auth-mode-policy-enforcement disabled#1139

Open: RubenAAA wants to merge 5 commits into chopratejas:main from RubenAAA:feat/openvino-ep

@RubenAAA RubenAAA commented Jun 18, 2026

Description

Two related changes to the Rust proxy:

1. HEADROOM_ORT_EP — runtime execution provider selection for ONNX Runtime

Adds init_ort_ep() to headroom-core, called once at proxy startup before any ONNX session is created. The fastembed EmbeddingScorer and magika Session share the same ORT singleton, so a single env var covers both.

| Variable | Values | Default |
| --- | --- | --- |
| HEADROOM_ORT_EP | cpu / openvino / cuda | cpu (no-op) |
| HEADROOM_ORT_OPENVINO_DEVICE | NPU, CPU, GPU, GPU.0, HETERO:NPU,GPU | NPU |
| HEADROOM_ORT_OPENVINO_CACHE | path to blob cache dir | unset |

ort 2.x uses libloading for dynamic EP dispatch — all EP types are always compiled in; no compile-time feature flags are needed. The direct ort dep is pinned to =2.0.0-rc.12 to unify with the singleton already in the tree via fastembed/magika. Failures are non-fatal: a WARN is logged and ORT falls back to CPU automatically.
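As an illustration of how the three variables above could be interpreted, here is a minimal, standard-library-only sketch. It is not the PR's `init_ort_ep()`: the `EpChoice` type, `parse_ep_choice`, and `ep_choice_from_env` are hypothetical names, and the case-insensitive matching and the error on unknown values are assumptions. The actual ORT registration call is deliberately left out.

```rust
use std::path::PathBuf;

/// Hypothetical representation of the operator's EP request.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum EpChoice {
    Cpu,
    OpenVino { device: String, cache_dir: Option<PathBuf> },
    Cuda,
}

/// Interpret raw variable values. Taking the values as arguments (instead of
/// reading the process environment here) keeps the function easy to test.
pub fn parse_ep_choice(
    ep: Option<&str>,
    device: Option<&str>,
    cache: Option<&str>,
) -> Result<EpChoice, String> {
    // Assumption: values are matched case-insensitively.
    let ep = ep.map(|s| s.trim().to_ascii_lowercase());
    match ep.as_deref() {
        // Unset, empty, or "cpu" is a no-op: ORT already defaults to CPU.
        None | Some("") | Some("cpu") => Ok(EpChoice::Cpu),
        Some("cuda") => Ok(EpChoice::Cuda),
        Some("openvino") => {
            // Default device is NPU, as in the table above.
            let device = device
                .map(str::trim)
                .filter(|d| !d.is_empty())
                .unwrap_or("NPU")
                .to_string();
            let cache_dir = cache
                .map(str::trim)
                .filter(|c| !c.is_empty())
                .map(PathBuf::from);
            Ok(EpChoice::OpenVino { device, cache_dir })
        }
        // Assumption: unknown values are reported so the caller can log a WARN
        // and continue on CPU.
        Some(other) => Err(format!("unknown HEADROOM_ORT_EP value: {other}")),
    }
}

/// Read the three variables from the process environment.
pub fn ep_choice_from_env() -> Result<EpChoice, String> {
    let get = |key: &str| std::env::var(key).ok();
    parse_ep_choice(
        get("HEADROOM_ORT_EP").as_deref(),
        get("HEADROOM_ORT_OPENVINO_DEVICE").as_deref(),
        get("HEADROOM_ORT_OPENVINO_CACHE").as_deref(),
    )
}
```

A caller would match on the returned `EpChoice` once at startup, before any ONNX session exists, and treat an `Err` as "log and stay on CPU".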

Windows note: fastembed on Windows uses ort-load-dynamic (already the case in Cargo.toml) rather than ort-download-binaries-* to avoid DirectML link-time deps. OpenVINO is a Windows-native runtime; setting HEADROOM_ORT_EP=openvino on Windows targets the Intel NPU via the installed OpenVINO redistributable.

2. Fix Phase E passes skipped for non-PAYG auth with --auth-mode-policy-enforcement disabled

CompressionPolicy at request entry was already forced to PAYG when --auth-mode-policy-enforcement disabled is set, but compress_anthropic_request still received the raw classifier output. This caused the E1 tool-array sort, E2 schema-key sort, and E3 cache_control auto-placement guards to keep skipping even though the operator explicitly opted into the PAYG pipeline.

Token reduction matters for subscription users too — rate limits are token-based, context windows are finite, and smaller payloads transfer faster. The billing-only framing of the original guard was too narrow.

Fix: derive effective_auth_mode at the call site using the same enforcement-flag override already applied to CompressionPolicy, and pass that into compress_anthropic_request instead of the raw classifier result.
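The override described in the fix can be sketched as a small pure function. This is an assumption-laden illustration, not the code in proxy.rs: the `AuthMode` and `PolicyEnforcement` enums and the `phase_e_guard_allows` helper are hypothetical stand-ins, and only the name `effective_auth_mode` comes from the description.

```rust
/// Hypothetical auth classification; the real classifier may have more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AuthMode {
    Payg,
    Subscription,
}

/// Hypothetical mirror of the --auth-mode-policy-enforcement flag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PolicyEnforcement {
    Enabled,
    Disabled,
}

/// With enforcement disabled the operator has opted into the PAYG pipeline,
/// so the classifier output is overridden; otherwise it is passed through.
pub fn effective_auth_mode(classified: AuthMode, enforcement: PolicyEnforcement) -> AuthMode {
    match enforcement {
        PolicyEnforcement::Disabled => AuthMode::Payg,
        PolicyEnforcement::Enabled => classified,
    }
}

/// Stand-in for a per-mode Phase E guard: the pass runs only for PAYG.
pub fn phase_e_guard_allows(mode: AuthMode) -> bool {
    mode == AuthMode::Payg
}
```

The bug amounted to calling the guard with the raw classifier value; passing the result of `effective_auth_mode` instead makes the guard agree with the policy already chosen at request entry.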

Closes #

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)

Changes Made

  • crates/headroom-core/src/lib.rs: pub fn init_ort_ep() reads HEADROOM_ORT_EP and initialises the OpenVINO or CUDA EP via ort::init().with_execution_providers([...]).commit()
  • crates/headroom-core/Cargo.toml — direct ort = "=2.0.0-rc.12" dep (pinned, default-features = false) placed before any [target.cfg(...)] section to avoid TOML inheritance
  • crates/headroom-proxy/src/main.rs — call headroom_core::init_ort_ep() after init_tracing and before AppState::new
  • crates/headroom-proxy/src/proxy.rs — derive effective_auth_mode from the enforcement-flag override at the compress_anthropic_request call site

Testing

  • Unit tests pass (pytest)
  • Linting passes (ruff check .)
  • Type checking passes (mypy headroom)
  • Manual testing performed

No new unit tests for the EP init path: the function is a startup side-effect with no return value and relies on ORT's own runtime presence checks. The auth-mode fix is covered by the existing pr_e3_subscription_skips_auto_placement and e1_passes_through_when_subscription tests, which assert the per-mode guard behaviour; --auth-mode-policy-enforcement disabled sits one layer above and is exercised end-to-end below.

Test Output

cargo build --release -p headroom-proxy --target x86_64-pc-windows-gnu
Finished `release` profile [optimized] target(s) in 33.50s

Real Behavior Proof

Environment: Windows 11, Intel Core Ultra 9 285HX (Arrow Lake-HX NPU), OpenVINO 2025.x installed, headroom-proxy.exe cross-compiled from WSL2 (x86_64-pc-windows-gnu, Rust 1.95.0).

Command:

headroom-proxy.exe \
  --upstream https://api.anthropic.com \
  --compression \
  --compression-mode live_zone \
  --auth-mode-policy-enforcement disabled \
  --cache-control-auto-frozen disabled

Plus HEADROOM_ORT_EP=openvino in the environment.

Observed result — startup:

{"level":"INFO","message":"ORT execution provider: OpenVINO","ep":"openvino","device":"NPU"}
{"level":"INFO","message":"headroom-proxy starting","listen":"0.0.0.0:8787",...}

Observed result — per request (Phase E now fires on subscription auth):

{"level":"INFO","message":"tool-array sort applied: tools reordered alphabetically by name","event":"e1_applied","tool_count":121}
{"level":"INFO","message":"schema-key sort applied: input_schema keys rewritten in alphabetic order","event":"e2_applied","tool_count":121}
{"level":"INFO","message":"customer-placed cache_control marker(s) present; auto-placement skipped","event":"e3_skipped","reason":"marker_present"}
{"level":"INFO","message":"anthropic live-zone dispatch","frozen_message_count":0,"messages_total":13,"latest_user_message_index":"Some(12)","live_zone_blocks":1}
{"level":"INFO","message":"compression applied","strategies":"[\"tool_array_sort\",\"schema_key_sort\"]"}

Before this fix with --auth-mode-policy-enforcement disabled, every request produced e1_skipped / e2_skipped / e3_skipped with reason: auth_mode and live_zone_blocks: 0.
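For readers unfamiliar with the E1 and E2 passes named in the logs above, the following sketch shows the general idea of canonicalising tool order and schema-key order. It is not the proxy's implementation: `Tool`, `make_tool`, `sort_tools_by_name`, and `sort_schema_keys` are hypothetical, and the ordered `(key, value)` vector merely stands in for a JSON object. One plausible motivation, which this PR does not state, is that two requests carrying the same tools in a different order then serialise to identical bytes.

```rust
/// Hypothetical tool definition; `input_schema` stands in for a JSON object
/// whose key order is preserved on the wire.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Tool {
    pub name: String,
    pub input_schema: Vec<(String, String)>,
}

/// Convenience constructor used only for illustration.
pub fn make_tool(name: &str, keys: &[&str]) -> Tool {
    Tool {
        name: name.to_string(),
        input_schema: keys
            .iter()
            .map(|k| (k.to_string(), "string".to_string()))
            .collect(),
    }
}

/// E1-style pass: sort tools alphabetically by name.
/// Returns true when the order actually changed.
pub fn sort_tools_by_name(tools: &mut [Tool]) -> bool {
    let already_sorted = tools.windows(2).all(|w| w[0].name <= w[1].name);
    if !already_sorted {
        tools.sort_by(|a, b| a.name.cmp(&b.name));
    }
    !already_sorted
}

/// E2-style pass: rewrite one tool's schema keys in alphabetic order.
pub fn sort_schema_keys(tool: &mut Tool) {
    tool.input_schema.sort_by(|a, b| a.0.cmp(&b.0));
}
```

Applying both passes to two differently ordered copies of the same tool set yields equal values, which is the property the `e1_applied` and `e2_applied` events report on.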

What was not tested: CUDA EP path (no NVIDIA GPU available); Linux OpenVINO EP (NPU driver is Windows-only); behaviour under --auth-mode-policy-enforcement enabled (unchanged code path, no regression expected).

Review Readiness

  • I have performed a self-review
  • This PR is ready for human review

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have made corresponding changes to the documentation (CHANGELOG)
  • My changes generate no new warnings
  • I have updated the CHANGELOG.md

Update — NPU enablement (real-hardware validated, Intel AI Boost)

Pushed one commit (fix(openvino): disable EP dynamic shapes (NPU compile) + correct EP-load docs) after validating this EP path end-to-end on an Intel NPU running a live workload:

  • with_dynamic_shapes(false): the OVEP defaults to disable_dynamic_shapes=false, which forces the NPU graph compiler into dynamic-shape mode, where it hangs for minutes. Direct OpenVINO compiles the same static graph in ~13s; with this flag the proxy's NPU compile completes (~13–20s) and caches a blob. Callers must feed fixed-shape inputs (the Kompress static-shape model in #1153, "feat(rust): port Kompress ML prose compressor to Rust (parity-only, 1/3)", pads each chunk to a fixed length).
  • Doc correction in init_ort_ep — the previous comment claimed "all EP types are always compiled in … falls back to CPU automatically." That's wrong for OpenVINO: the OVEP is not in a stock onnxruntime. On Windows ort-load-dynamic silently loads C:\Windows\System32\onnxruntime.dll (no OVEP) and this call logs a misleading success while every session runs on CPU. The fix is to point ORT_DYLIB_PATH at an OpenVINO-enabled onnxruntime and put a version-matched openvino on PATH (mismatched versions → Error 127: procedure not found). The doc now states this.

Verified: ORT execution provider: OpenVINO device:NPU, no Error 127, NPU compile blob written, Kompress prose compression running on the NPU in a live session.
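The fixed-shape requirement from the first bullet above means every input must have the same length before it reaches the session. A minimal sketch of such padding follows; `pad_to_fixed` is a hypothetical helper, and the pad token id, the `i64` element type, and truncation of over-long chunks are assumptions rather than details taken from #1153.

```rust
/// Pad (or truncate) `ids` to exactly `len` entries.
/// Returns `(input_ids, attention_mask)`, where the mask is 1 for real tokens
/// and 0 for padding, so the model can ignore the padded tail.
pub fn pad_to_fixed(ids: &[i64], len: usize, pad_id: i64) -> (Vec<i64>, Vec<i64>) {
    // Number of real tokens that fit into the fixed window.
    let kept = ids.len().min(len);

    let mut input_ids = Vec::with_capacity(len);
    input_ids.extend_from_slice(&ids[..kept]);
    input_ids.resize(len, pad_id);

    let mut attention_mask = vec![1i64; kept];
    attention_mask.resize(len, 0);

    (input_ids, attention_mask)
}
```

Because both returned vectors always have length `len`, the graph compiled for that static shape can be reused for every chunk.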

Ruben Avanesov added 3 commits June 18, 2026 23:11
Adds `init_ort_ep()` to headroom-core, called at proxy startup before
any ONNX session is created (fastembed EmbeddingScorer + magika Session
share the same ORT singleton, so one global commit covers both).

Set HEADROOM_ORT_EP=openvino to target Intel CPU/GPU/NPU via OpenVINO,
or HEADROOM_ORT_EP=cuda for NVIDIA GPU. Unset/cpu is a no-op (ORT
defaults to CPU). Failures are non-fatal: a warning is logged and ORT
falls back to CPU automatically.

ort 2.x uses libloading for dynamic EP dispatch — no compile-time
feature flags needed. The direct ort dep is pinned to =2.0.0-rc.12
to unify with the singleton already in the tree via fastembed/magika.
…CACHE

Default device is NPU; also accepts CPU, GPU, GPU.0, HETERO:NPU,GPU.
Cache dir caches compiled NPU blobs so subsequent starts skip recompile.
…policy-enforcement disabled

When --auth-mode-policy-enforcement is set to `disabled`, CompressionPolicy
at request entry was already forced to PAYG, but compress_anthropic_request
still received the raw classifier output (e.g. subscription). This caused
the E1 tool-array sort, E2 schema-key sort, and E3 cache_control
auto-placement guards inside normalize_tool_definitions to keep skipping
even though the operator explicitly opted into the PAYG pipeline.

Token reduction matters for subscription users too: rate limits are
token-based, context windows are finite, and smaller payloads transfer
faster. The billing-only framing of the original guard was too narrow.

Fix: derive effective_auth_mode at the call site using the same
enforcement-flag override already applied to CompressionPolicy, and
pass that into compress_anthropic_request instead of the raw result.

github-actions Bot commented Jun 18, 2026

Contributor

PR governance

This PR does not yet satisfy the required template fields:

  • Fill in Real Behavior Proof: Environment.
  • Fill in Real Behavior Proof: Exact command / steps.
  • Fill in Real Behavior Proof: Observed result.
  • Fill in Real Behavior Proof: Not tested.

Please update the PR body, or move the PR back to draft while it is still in progress.

@github-actions github-actions Bot added the "status: needs author action" label (Pull request body or readiness checklist still needs author updates) Jun 18, 2026
Ruben Avanesov added 2 commits June 19, 2026 00:16
Tool input_schema and function.parameters are constant across requests,
not per-request data. Skip ID-field checking inside these schema zones
to eliminate false-positive warnings on request_id fields in Claude's
tool schemas.

Fixes: warnings like "volatile content in cached prefix will bust
prompt-cache hits" for tools[i].input_schema.properties.request_id
which are schema definitions, not volatile per-request values.

Changes:
- is_id_named_key() now accepts location path and returns false
  for ID fields inside input_schema.* or function.parameters.*
- Updated call site in scan_value_recursive() to pass location
- Updated tests: removed false-positive detection test, added
  realistic per-request data detection test
…oad docs

The OpenVINO EP defaults to disable_dynamic_shapes=false, which forces
the NPU graph compiler into dynamic-shape mode and hangs for minutes
(direct OpenVINO compiles the same static graph in ~13s). Set
with_dynamic_shapes(false) so it compiles for a fixed input shape;
callers feed fixed-shape inputs.

Also correct the init_ort_ep doc: the OpenVINO EP is NOT in a stock
onnxruntime, and on Windows ort-load-dynamic silently loads
System32\onnxruntime.dll (no OVEP) and runs on CPU while logging
success. Documents the ORT_DYLIB_PATH + version-matched openvino fix.
@RubenAAA

Author

Pushed one commit since your last look: with_dynamic_shapes(false) so the OpenVINO NPU actually compiles (it defaults to dynamic-shape compilation and hangs for minutes; static compiles in ~13s), plus a doc correction — the OVEP isn't in a stock onnxruntime, and on Windows ort-load-dynamic silently grabs System32's onnxruntime.dll (no OVEP) and runs on CPU while logging success. Validated end-to-end on an Intel AI Boost NPU in a live session. Details in the updated description.
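The silent-fallback failure mode described in this comment could, in principle, be surfaced by a startup preflight check. The sketch below is a suggestion only and is not part of this PR: `Preflight` and `openvino_preflight` are hypothetical names. It relies only on the `ORT_DYLIB_PATH` variable mentioned in the description and cannot tell whether the library it points at actually contains the OpenVINO EP.

```rust
use std::path::Path;

/// Hypothetical preflight outcome.
#[derive(Debug, PartialEq, Eq)]
pub enum Preflight {
    Ready,
    Warn(String),
}

/// Check the value of ORT_DYLIB_PATH before requesting the OpenVINO EP.
/// The value is passed in as an argument so the check stays testable.
pub fn openvino_preflight(ort_dylib_path: Option<&str>) -> Preflight {
    match ort_dylib_path.map(str::trim).filter(|p| !p.is_empty()) {
        // Unset: the dynamic loader may pick up a stock onnxruntime
        // without the OpenVINO EP and quietly run on CPU.
        None => Preflight::Warn(
            "ORT_DYLIB_PATH is unset; a stock onnxruntime without the OpenVINO EP may be loaded"
                .to_string(),
        ),
        // Set, but not pointing at an existing file.
        Some(p) if !Path::new(p).is_file() => {
            Preflight::Warn(format!("ORT_DYLIB_PATH does not point to a file: {p}"))
        }
        // Presence check only; EP availability is not verified here.
        Some(_) => Preflight::Ready,
    }
}
```

A caller could pass `std::env::var("ORT_DYLIB_PATH").ok().as_deref()` and log any `Warn` message next to the "ORT execution provider" startup line.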

@JerrettDavis JerrettDavis left a comment

Collaborator

Reviewed the ORT execution-provider and Phase E fixes. Startup initializes the ORT EP before sessions are created, OpenVINO device/cache settings are documented in code, the auth-mode override is applied at the compression call site, and the volatile-detector schema false positive is covered. CI is green.
