feat(proxy): HEADROOM_ORT_EP for OpenVINO/CUDA EP selection; fix Phase E skips when auth-mode-policy-enforcement disabled by RubenAAA · Pull Request #1139 · chopratejas/headroom

RubenAAA · 2026-06-18T21:48:10Z

Description

Two related changes to the Rust proxy:

1. HEADROOM_ORT_EP — runtime execution provider selection for ONNX Runtime

Adds init_ort_ep() to headroom-core, called once at proxy startup before any ONNX session is created. The fastembed EmbeddingScorer and magika Session share the same ORT singleton, so a single env var covers both.

| Variable | Values | Default |
| --- | --- | --- |
| `HEADROOM_ORT_EP` | `cpu` / `openvino` / `cuda` | `cpu` (no-op) |
| `HEADROOM_ORT_OPENVINO_DEVICE` | `NPU`, `CPU`, `GPU`, `GPU.0`, `HETERO:NPU,GPU` | `NPU` |
| `HEADROOM_ORT_OPENVINO_CACHE` | path to blob cache dir | unset |

ort 2.x uses libloading for dynamic EP dispatch, so no compile-time feature flags are needed in this crate. The requested EP must still be present in the onnxruntime library that is loaded at runtime (see the NPU enablement update below for the OpenVINO case). The direct ort dep is pinned to =2.0.0-rc.12 to unify with the singleton already in the tree via fastembed/magika. Failures reported by ORT are non-fatal: a WARN is logged and ORT falls back to CPU.
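As a rough illustration of the variable handling described above, the sketch below resolves the three environment variables into a plain value before any ORT call is made. It is not the code in this PR: `OrtEp`, `resolve_ort_ep`, and `resolve_ort_ep_from_env` are hypothetical names, and the trimming, case-insensitive matching, and treatment of unknown values as CPU are assumptions made for the example.

```rust
use std::env;

/// Execution provider choice resolved from the environment (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum OrtEp {
    Cpu,
    OpenVino { device: String, cache_dir: Option<String> },
    Cuda,
}

/// Pure resolver: takes the raw variable values so it can be tested
/// without touching the process environment.
pub fn resolve_ort_ep(ep: Option<&str>, device: Option<&str>, cache: Option<&str>) -> OrtEp {
    // Assumption: values are trimmed and matched case-insensitively.
    let ep = ep.map(|s| s.trim().to_ascii_lowercase());
    match ep.as_deref() {
        // Unset, empty, or "cpu" is the documented no-op default.
        None | Some("") | Some("cpu") => OrtEp::Cpu,
        Some("cuda") => OrtEp::Cuda,
        Some("openvino") => OrtEp::OpenVino {
            // The documented default device is NPU.
            device: device
                .map(str::trim)
                .filter(|d| !d.is_empty())
                .unwrap_or("NPU")
                .to_string(),
            // The cache directory stays unset unless a non-empty path is given.
            cache_dir: cache
                .map(str::trim)
                .filter(|c| !c.is_empty())
                .map(str::to_string),
        },
        Some(other) => {
            // Assumption: an unknown value is non-fatal and resolves to CPU.
            // eprintln! stands in for a WARN-level log line.
            eprintln!("WARN: unknown HEADROOM_ORT_EP={other:?}; using CPU");
            OrtEp::Cpu
        }
    }
}

/// Thin wrapper that reads the three variables from the environment.
pub fn resolve_ort_ep_from_env() -> OrtEp {
    let ep = env::var("HEADROOM_ORT_EP").ok();
    let device = env::var("HEADROOM_ORT_OPENVINO_DEVICE").ok();
    let cache = env::var("HEADROOM_ORT_OPENVINO_CACHE").ok();
    resolve_ort_ep(ep.as_deref(), device.as_deref(), cache.as_deref())
}
```

Keeping the resolver separate from the environment read means the defaults in the table can be checked without mutating process-wide state.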

Windows note: fastembed on Windows uses ort-load-dynamic (already the case in Cargo.toml) rather than ort-download-binaries-* to avoid DirectML link-time deps. OpenVINO is a Windows-native runtime; setting HEADROOM_ORT_EP=openvino on Windows targets the Intel NPU via the installed OpenVINO redistributable.

2. Fix Phase E passes skipped for non-PAYG auth with --auth-mode-policy-enforcement disabled

CompressionPolicy at request entry was already forced to PAYG when --auth-mode-policy-enforcement disabled is set, but compress_anthropic_request still received the raw classifier output. This caused the E1 tool-array sort, E2 schema-key sort, and E3 cache_control auto-placement guards to keep skipping even though the operator explicitly opted into the PAYG pipeline.

Token reduction matters for subscription users too — rate limits are token-based, context windows are finite, and smaller payloads transfer faster. The billing-only framing of the original guard was too narrow.

Fix: derive effective_auth_mode at the call site using the same enforcement-flag override already applied to CompressionPolicy, and pass that into compress_anthropic_request instead of the raw classifier result.
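The override can be pictured with a minimal sketch like the one below. The enum names, the two-variant `AuthMode`, and the helper `phase_e_passes_run` are hypothetical stand-ins chosen for the example; the real types in headroom-proxy may differ.

```rust
/// Auth mode as reported by the request classifier (hypothetical, simplified).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AuthMode {
    Payg,
    Subscription,
}

/// Value of --auth-mode-policy-enforcement (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PolicyEnforcement {
    Enabled,
    Disabled,
}

/// Applies the enforcement-flag override to the classifier output.
/// With enforcement disabled, every request is treated as PAYG.
pub fn effective_auth_mode(classified: AuthMode, enforcement: PolicyEnforcement) -> AuthMode {
    match enforcement {
        PolicyEnforcement::Disabled => AuthMode::Payg,
        PolicyEnforcement::Enabled => classified,
    }
}

/// Stand-in for the per-mode guards on the E1/E2/E3 passes:
/// they run only when the mode they receive is PAYG.
pub fn phase_e_passes_run(mode: AuthMode) -> bool {
    matches!(mode, AuthMode::Payg)
}
```

Under these assumptions, passing the raw classifier result to the guard reproduces the skip for a subscription request, while passing the effective mode does not.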

Closes #

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)

Changes Made

crates/headroom-core/src/lib.rs — pub fn init_ort_ep(): reads HEADROOM_ORT_EP, initialises OpenVINO or CUDA EP via ort::init().with_execution_providers([...]).commit()
crates/headroom-core/Cargo.toml — direct ort = "=2.0.0-rc.12" dep (pinned, default-features = false) placed before any [target.cfg(...)] section to avoid TOML inheritance
crates/headroom-proxy/src/main.rs — call headroom_core::init_ort_ep() after init_tracing and before AppState::new
crates/headroom-proxy/src/proxy.rs — derive effective_auth_mode from the enforcement-flag override at the compress_anthropic_request call site

Testing

Unit tests pass (pytest)
Linting passes (ruff check .)
Type checking passes (mypy headroom)
Manual testing performed

No new unit tests for the EP init path: the function is a startup side-effect with no return value and relies on ORT's own runtime presence checks. The auth-mode fix is covered by the existing pr_e3_subscription_skips_auto_placement and e1_passes_through_when_subscription tests, which assert the per-mode guard behaviour; --auth-mode-policy-enforcement disabled sits one layer above and is exercised end-to-end below.

Test Output

cargo build --release -p headroom-proxy --target x86_64-pc-windows-gnu
Finished `release` profile [optimized] target(s) in 33.50s

Real Behavior Proof

Environment: Windows 11, Intel Core Ultra 9 285HX (Arrow Lake-HX, integrated NPU), OpenVINO 2025.x installed, headroom-proxy.exe cross-compiled from WSL2 (x86_64-pc-windows-gnu, Rust 1.95.0).

Command:

headroom-proxy.exe \
  --upstream https://api.anthropic.com \
  --compression \
  --compression-mode live_zone \
  --auth-mode-policy-enforcement disabled \
  --cache-control-auto-frozen disabled

Plus HEADROOM_ORT_EP=openvino in the environment.

Observed result — startup:

{"level":"INFO","message":"ORT execution provider: OpenVINO","ep":"openvino","device":"NPU"}
{"level":"INFO","message":"headroom-proxy starting","listen":"0.0.0.0:8787",...}

Observed result — per request (Phase E now fires on subscription auth):

{"level":"INFO","message":"tool-array sort applied: tools reordered alphabetically by name","event":"e1_applied","tool_count":121}
{"level":"INFO","message":"schema-key sort applied: input_schema keys rewritten in alphabetic order","event":"e2_applied","tool_count":121}
{"level":"INFO","message":"customer-placed cache_control marker(s) present; auto-placement skipped","event":"e3_skipped","reason":"marker_present"}
{"level":"INFO","message":"anthropic live-zone dispatch","frozen_message_count":0,"messages_total":13,"latest_user_message_index":"Some(12)","live_zone_blocks":1}
{"level":"INFO","message":"compression applied","strategies":"[\"tool_array_sort\",\"schema_key_sort\"]"}

Before this fix with --auth-mode-policy-enforcement disabled, every request produced e1_skipped / e2_skipped / e3_skipped with reason: auth_mode and live_zone_blocks: 0.

What was not tested: CUDA EP path (no NVIDIA GPU available); Linux OpenVINO EP (NPU driver is Windows-only); behaviour under --auth-mode-policy-enforcement enabled (unchanged code path, no regression expected).

Review Readiness

I have performed a self-review
This PR is ready for human review

Checklist

My code follows the project's style guidelines
I have performed a self-review of my code
I have made corresponding changes to the documentation (CHANGELOG)
My changes generate no new warnings
I have updated the CHANGELOG.md

Update — NPU enablement (real-hardware validated, Intel AI Boost)

Pushed one commit (fix(openvino): disable EP dynamic shapes (NPU compile) + correct EP-load docs) after validating this EP path end-to-end on an Intel NPU running a live workload:

with_dynamic_shapes(false): the OVEP defaults to disable_dynamic_shapes=false, which forces the NPU graph compiler into dynamic-shape mode, where it hangs for minutes. Direct OpenVINO compiles the same static graph in ~13s; with this flag set, the proxy's NPU compile completes in ~13–20s and caches a blob. Callers must feed fixed-shape inputs: the Kompress static-shape model in #1153 (feat(rust): port Kompress ML prose compressor to Rust (parity-only, 1/3)) pads each chunk to a fixed length.
Doc correction in init_ort_ep: the previous comment claimed "all EP types are always compiled in … falls back to CPU automatically." That is wrong for OpenVINO, because the OVEP is not part of a stock onnxruntime build. On Windows, ort-load-dynamic silently loads C:\Windows\System32\onnxruntime.dll (no OVEP), and this call logs a misleading success while every session runs on CPU. The fix is to point ORT_DYLIB_PATH at an OpenVINO-enabled onnxruntime and put a version-matched openvino on PATH (mismatched versions produce Error 127: procedure not found). The doc now states this.
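The fixed-shape requirement on callers can be illustrated with a small padding helper. This is a sketch only, not the Kompress code from #1153: `pad_to_fixed_len` is a hypothetical name, and the truncation of over-long input and the attention-mask convention (1 for real tokens, 0 for padding) are assumptions.

```rust
/// Pads (or, as an assumption, truncates) token ids to exactly `len` entries.
/// Returns the ids together with an attention mask: 1 = real token, 0 = padding.
pub fn pad_to_fixed_len(ids: &[i64], len: usize, pad_id: i64) -> (Vec<i64>, Vec<i64>) {
    // Number of real tokens that fit into the fixed window.
    let kept = ids.len().min(len);

    // Copy the real tokens, then fill the remainder with the pad id.
    let mut out = Vec::with_capacity(len);
    out.extend_from_slice(&ids[..kept]);
    out.resize(len, pad_id);

    // Mask marks which positions carry real tokens.
    let mut mask = vec![1i64; kept];
    mask.resize(len, 0);

    (out, mask)
}
```

Because every chunk then has the same length, the model input has a single static shape, which is what lets the NPU compiler build and cache one blob.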

Verified: ORT execution provider: OpenVINO device:NPU, no Error 127, NPU compile blob written, Kompress prose compression running on the NPU in a live session.
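Since the failure mode described above is silent, a startup preflight on ORT_DYLIB_PATH is one way an operator could catch the most obvious misconfiguration. The sketch below is not part of this PR, and `DylibCheck`, `check_ort_dylib`, and `check_ort_dylib_from_env` are hypothetical names. It only checks that the variable is set and points at a file; it cannot tell whether that onnxruntime build actually contains the OpenVINO EP.

```rust
use std::path::Path;

/// Outcome of the preflight check (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum DylibCheck {
    /// ORT_DYLIB_PATH is unset or empty; a default library may be picked up.
    Unset,
    /// ORT_DYLIB_PATH is set but nothing exists at that path.
    Missing(String),
    /// ORT_DYLIB_PATH points at an existing file.
    Present(String),
}

/// Pure check: the existence test is injected so it can be unit-tested.
pub fn check_ort_dylib(value: Option<&str>, exists: impl Fn(&Path) -> bool) -> DylibCheck {
    match value.map(str::trim).filter(|v| !v.is_empty()) {
        None => DylibCheck::Unset,
        Some(p) if exists(Path::new(p)) => DylibCheck::Present(p.to_string()),
        Some(p) => DylibCheck::Missing(p.to_string()),
    }
}

/// Reads ORT_DYLIB_PATH from the environment and checks the file system.
pub fn check_ort_dylib_from_env() -> DylibCheck {
    let value = std::env::var("ORT_DYLIB_PATH").ok();
    check_ort_dylib(value.as_deref(), |p| p.is_file())
}
```

A caller that requested `openvino` could log a warning on `Unset` or `Missing` instead of reporting success unconditionally.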

Adds `init_ort_ep()` to headroom-core, called at proxy startup before any ONNX session is created (fastembed EmbeddingScorer + magika Session share the same ORT singleton, so one global commit covers both). Set HEADROOM_ORT_EP=openvino to target Intel CPU/GPU/NPU via OpenVINO, or HEADROOM_ORT_EP=cuda for NVIDIA GPU. Unset/cpu is a no-op (ORT defaults to CPU). Failures are non-fatal: a warning is logged and ORT falls back to CPU automatically. ort 2.x uses libloading for dynamic EP dispatch — no compile-time feature flags needed. The direct ort dep is pinned to =2.0.0-rc.12 to unify with the singleton already in the tree via fastembed/magika.

…CACHE Default device is NPU; also accepts CPU, GPU, GPU.0, HETERO:NPU,GPU. Cache dir caches compiled NPU blobs so subsequent starts skip recompile.

…policy-enforcement disabled When --auth-mode-policy-enforcement is set to `disabled`, CompressionPolicy at request entry was already forced to PAYG, but compress_anthropic_request still received the raw classifier output (e.g. subscription). This caused the E1 tool-array sort, E2 schema-key sort, and E3 cache_control auto-placement guards inside normalize_tool_definitions to keep skipping even though the operator explicitly opted into the PAYG pipeline. Token reduction matters for subscription users too: rate limits are token-based, context windows are finite, and smaller payloads transfer faster. The billing-only framing of the original guard was too narrow. Fix: derive effective_auth_mode at the call site using the same enforcement-flag override already applied to CompressionPolicy, and pass that into compress_anthropic_request instead of the raw result.

github-actions · 2026-06-18T21:48:23Z

PR governance

This PR does not yet satisfy the required template fields:

Fill in Real Behavior Proof → Environment.
Fill in Real Behavior Proof → Exact command / steps.
Fill in Real Behavior Proof → Observed result.
Fill in Real Behavior Proof → Not tested.

Please update the PR body, or move the PR back to draft while it is still in progress.

Tool input_schema and function.parameters are constant across requests, not per-request data. Skip ID-field checking inside these schema zones to eliminate false-positive warnings on request_id fields in Claude's tool schemas.

Fixes: warnings like "volatile content in cached prefix will bust prompt-cache hits" for tools[i].input_schema.properties.request_id, which are schema definitions, not volatile per-request values.

Changes:
- is_id_named_key() now accepts location path and returns false for ID fields inside input_schema.* or function.parameters.*
- Updated call site in scan_value_recursive() to pass location
- Updated tests: removed false-positive detection test, added realistic per-request data detection test
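A minimal sketch of the location-aware check described in this commit message might look as follows. Only the function name `is_id_named_key` and the two schema zones come from the text; the dotted path format, the helper `in_schema_zone`, and the rule that a key counts as ID-named when it is `id` or ends in `_id` are assumptions for illustration.

```rust
/// Returns true when the dotted location path lies inside a tool schema zone,
/// i.e. under `input_schema` or under `function.parameters`.
fn in_schema_zone(location: &str) -> bool {
    let segs: Vec<&str> = location.split('.').collect();
    segs.iter().any(|s| *s == "input_schema")
        || segs
            .windows(2)
            .any(|w| w[0] == "function" && w[1] == "parameters")
}

/// Reports whether `key` looks like a per-request ID field at `location`.
/// Keys inside schema zones are never flagged, since they are definitions.
pub fn is_id_named_key(key: &str, location: &str) -> bool {
    if in_schema_zone(location) {
        return false;
    }
    // Assumption: "id" or any "*_id" key counts as ID-named.
    let k = key.to_ascii_lowercase();
    k == "id" || k.ends_with("_id")
}
```

With this shape, a `request_id` property declared in a tool schema is ignored, while the same key inside message content is still reported.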

…oad docs The OpenVINO EP defaults to disable_dynamic_shapes=false, which forces the NPU graph compiler into dynamic-shape mode and hangs for minutes (direct OpenVINO compiles the same static graph in ~13s). Set with_dynamic_shapes(false) so it compiles for a fixed input shape; callers feed fixed-shape inputs. Also correct the init_ort_ep doc: the OpenVINO EP is NOT in a stock onnxruntime, and on Windows ort-load-dynamic silently loads System32\onnxruntime.dll (no OVEP) and runs on CPU while logging success. Documents the ORT_DYLIB_PATH + version-matched openvino fix.

RubenAAA · 2026-06-19T11:15:31Z

Pushed one commit since your last look: with_dynamic_shapes(false) so the OpenVINO NPU actually compiles (it defaults to dynamic-shape compilation and hangs for minutes; static compiles in ~13s), plus a doc correction — the OVEP isn't in a stock onnxruntime, and on Windows ort-load-dynamic silently grabs System32's onnxruntime.dll (no OVEP) and runs on CPU while logging success. Validated end-to-end on an Intel AI Boost NPU in a live session. Details in the updated description.

JerrettDavis

Reviewed the ORT execution-provider and Phase E fixes. Startup initializes the ORT EP before sessions are created, OpenVINO device/cache settings are documented in code, the auth-mode override is applied at the compression call site, and the volatile-detector schema false positive is covered. CI is green.

Ruben Avanesov added 3 commits June 18, 2026 23:11

feat(ep): add HEADROOM_ORT_OPENVINO_DEVICE and HEADROOM_ORT_OPENVINO_…

e526dec


github-actions Bot added the "status: needs author action" label (Pull request body or readiness checklist still needs author updates) on Jun 18, 2026

Ruben Avanesov added 2 commits June 19, 2026 00:16

RubenAAA mentioned this pull request Jun 19, 2026

feat(rust): port Kompress ML prose compressor to Rust (parity-only, 1/3) #1153

Open

13 tasks

JerrettDavis approved these changes Jun 19, 2026

RubenAAA wants to merge 5 commits into chopratejas:main from RubenAAA:feat/openvino-ep
