feat(rust): wire CodeCompressor + Kompress into live-zone dispatch + gate (3/3) #1155
Open
RubenAAA wants to merge 10 commits into
added 2 commits
June 19, 2026 10:56
Ports `headroom.transforms.kompress_compressor` (the ModernBERT token compressor behind PlainText compression) to Rust. It was the last unported compressor alongside CodeCompressor. Engine only; live-zone dispatch wiring follows in a separate PR (matches how the other compressors landed: port + parity first, then wire).

What it does:
- Loads the trained `chopratejas/kompress-v2-base` ONNX model (the inference weights) + the `answerdotai/ModernBERT-base` tokenizer (a fine-tune reuses its base vocab). Runs the ONNX/proxy compression path: whitespace-split → 350-word chunks → pre-tokenized encode (is_split_into_words) → ONNX `final_scores` → max-score-per-word → keep `> 0.5` → join kept words. Inputs under 10 words pass through.
- Inference via `ort` (direct dep added here; it is the first session-API consumer and unifies to the ORT instance fastembed/magika already vendor). Tokenization via the `tokenizers` crate. Both reproduce the Python `transformers`/onnxruntime path exactly.

Parity (byte-exact against the Python reference):
- Tokenizer `input_ids`/`word_ids` reproduce HF on pre-tokenized input.
- ONNX scores match to ~1e-6 (far below the 0.5 keep threshold).
- Kept-word set + joined output match byte-for-byte.
- `KompressComparator` wired into the parity harness; 21 fixtures recorded via recorder.py (enable_ccr=False → deterministic output). `cargo run -p headroom-parity -- run --only kompress`: 21/21 matched. Model-gated: skips (not fails) when the model isn't in the HF cache.

Tests:
- crates/headroom-core/tests/kompress_parity.rs (byte-parity + passthrough)
- module unit tests (config defaults, result helpers)
- clippy clean, no regressions across the full parity suite.

CCR offload of dropped words is left to the dispatcher (the `<<ccr:>>` convention), not this engine; the inline Python marker is intentionally not reproduced.
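The word-level keep step described in this commit can be sketched roughly as follows. This is a minimal illustration, not the engine code: `compress_words`, `max_score_per_word`, and the `score_chunk` callback (standing in for tokenization plus the ONNX session) are hypothetical names, and joining kept words with a single space is an assumption.

```rust
// Constants taken from the commit message above.
const CHUNK_WORDS: usize = 350;
const KEEP_THRESHOLD: f32 = 0.5;
const MIN_WORDS: usize = 10;

/// Collapse per-token scores to per-word scores by taking the max over each
/// word's tokens. `word_ids[i]` is the word index of token `i`, or `None` for
/// special tokens. Words with no tokens keep a score of negative infinity.
fn max_score_per_word(
    token_scores: &[f32],
    word_ids: &[Option<usize>],
    n_words: usize,
) -> Vec<f32> {
    let mut out = vec![f32::NEG_INFINITY; n_words];
    for (score, wid) in token_scores.iter().zip(word_ids) {
        if let Some(w) = *wid {
            if w < n_words && *score > out[w] {
                out[w] = *score;
            }
        }
    }
    out
}

/// Whitespace-split, chunk, score, and keep words whose score exceeds the
/// threshold. `score_chunk` is a hypothetical stand-in for the model: it must
/// return one score per word in the chunk it is given.
fn compress_words<F>(text: &str, mut score_chunk: F) -> String
where
    F: FnMut(&[&str]) -> Vec<f32>,
{
    let words: Vec<&str> = text.split_whitespace().collect();
    // Short inputs pass through unchanged.
    if words.len() < MIN_WORDS {
        return text.to_string();
    }
    let mut kept: Vec<&str> = Vec::new();
    for chunk in words.chunks(CHUNK_WORDS) {
        let scores = score_chunk(chunk);
        for (word, score) in chunk.iter().zip(scores) {
            if score > KEEP_THRESHOLD {
                kept.push(*word);
            }
        }
    }
    // Assumption: kept words are joined with a single space.
    kept.join(" ")
}
```

Keeping the model behind a callback makes the chunking and thresholding testable without the ONNX weights, which is the same reason the parity tests are model-gated.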
21 tasks
Contributor
PR governance: This PR follows the template and is marked ready for human review.
added 8 commits
June 19, 2026 13:10
…ape ONNX

Three self-contained additions to the cache-only loader, all in kompress.rs:
- Cross-platform HF cache resolution: resolve the model via HF_HUB_CACHE / HF_HOME / HOME / USERPROFILE (was $HOME-only), so from_cache finds the model on native Windows, not just Linux.
- Load diagnostics: from_cache now logs WHY it defers (kompress_cache_miss / kompress_session_build_failed with the searched roots + the real session build error) instead of returning None silently, which was the #1 cause of an unexplained kompress_ready=false.
- Static-shape ONNX support: detect a fixed input_ids seq dimension on the loaded model and right-pad each chunk to it (masked padding => identical scores); dynamic models keep their natural length at zero padding cost. Enables execution providers that cannot compile dynamic shapes (OpenVINO NPU).

Parity 21/21 in both static and dynamic modes.
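Two of the additions above lend themselves to a short sketch: the cache-root lookup order and the masked right-padding. The names `hf_cache_roots` and `pad_to_static` are hypothetical, the `hub` and `.cache/huggingface/hub` suffixes follow the usual Hugging Face cache layout, and the pad token id is left as a parameter because the commit does not state it.

```rust
use std::path::PathBuf;

/// Candidate hub-cache roots, most specific first. `get` abstracts the
/// environment lookup so the function can be tested; in real use it could be
/// `|k| std::env::var(k).ok()`.
fn hf_cache_roots<F>(get: F) -> Vec<PathBuf>
where
    F: Fn(&str) -> Option<String>,
{
    let mut roots = Vec::new();
    if let Some(dir) = get("HF_HUB_CACHE") {
        roots.push(PathBuf::from(dir));
    }
    if let Some(dir) = get("HF_HOME") {
        roots.push(PathBuf::from(dir).join("hub"));
    }
    // HOME covers Unix-like systems, USERPROFILE covers native Windows.
    for home_var in ["HOME", "USERPROFILE"] {
        if let Some(home) = get(home_var) {
            roots.push(
                PathBuf::from(home)
                    .join(".cache")
                    .join("huggingface")
                    .join("hub"),
            );
        }
    }
    roots
}

/// Right-pad `input_ids` to a model's fixed sequence length and build the
/// matching attention mask (1 = real token, 0 = padding). With `None`
/// (a dynamic-shape model) the input keeps its natural length.
/// Assumption: inputs longer than `static_len` are left as-is in this sketch.
fn pad_to_static(
    input_ids: &[i64],
    static_len: Option<usize>,
    pad_id: i64,
) -> (Vec<i64>, Vec<i64>) {
    let real = input_ids.len();
    let target = static_len.map_or(real, |n| n.max(real));
    let mut ids = input_ids.to_vec();
    let mut mask = vec![1i64; real];
    ids.resize(target, pad_id);
    mask.resize(target, 0);
    (ids, mask)
}
```

Returning every candidate root rather than the first hit is what makes the "searched roots" diagnostic possible on a cache miss.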
…harness

Ports headroom.transforms.code_compressor (CodeAwareCompressor) to Rust on the same branch as the Kompress port, so one PR delivers both new compressors (SmartCrusher is already upstream).

Engine (crates/headroom-core/src/transforms/code_compressor.rs):
- tree-sitter AST parsing for python/js/ts/go/rust/java/c/cpp
- language detection (regex prefilter -> fewest-errors tree-sitter)
- symbol-importance scoring (min-max normalized, round-3 half-even) + body budget allocation, statement-level body truncation, omitted-line comments with call info, Python docstring handling (first_line/full/remove incl. multiline first-line reconstruction), syntax-validity guard (re-parse; return original on ERROR/MISSING).

Grammar-version parity is the precondition: the Rust tree-sitter-<lang> crates are pinned to the exact versions of the Python wheels the fixtures were recorded against (same version number on crates.io + PyPI = same grammar source = identical ASTs). A canary over 9 samples x 8 languages confirmed node-for-node identical node-type + line-span trees at these pins.

Parity: cargo run -p headroom-parity -- run --only code_aware_compressor -> 30/30 matched (18 non-trivial across all 8 languages + unknown + invalid-syntax fallback; all 3 docstring modes). py_round_int/py_round3 half-to-even verified against CPython.

Integration test + record_code_compressor_fixtures.py mirror the Kompress harness. Fixtures recorded with enable_ccr=False + fallback_to_kompress=False for determinism; live-zone dispatch wiring (SourceCode slot) is a deliberate follow-up.
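The rounding detail matters because Python's `round(x)` on a float sends ties to the even neighbor, while Rust's `f64::round` sends ties away from zero. A minimal sketch of the integer case and of min-max normalization follows. The bodies are illustrative rather than the PR's implementation, `round_half_even` and `min_max_normalize` are hypothetical names, and mapping an all-equal score set to 0.0 is an assumption. The three-decimal case (`py_round3`) is deliberately left out, since scaling by 1000 and rounding does not reproduce CPython's decimal-exact behavior for every input.

```rust
/// Round to the nearest integer with ties going to the even neighbor,
/// matching Python's `round(x)` on a float.
fn round_half_even(x: f64) -> f64 {
    if (x - x.trunc()).abs() == 0.5 {
        // Exactly halfway: x / 2 has a fractional part of .25 or .75, so the
        // ordinary round is unambiguous and doubling lands on an even value.
        2.0 * (x / 2.0).round()
    } else {
        x.round()
    }
}

/// Integer result, as Python returns for `round(x)` with no digits argument.
fn py_round_int(x: f64) -> i64 {
    round_half_even(x) as i64
}

/// Min-max normalize scores into [0, 1].
/// Assumption: when every score is equal, all outputs are 0.0.
fn min_max_normalize(scores: &[f64]) -> Vec<f64> {
    let min = scores.iter().copied().fold(f64::INFINITY, f64::min);
    let max = scores.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let span = max - min;
    scores
        .iter()
        .map(|s| if span > 0.0 { (s - min) / span } else { 0.0 })
        .collect()
}
```

For example, `py_round_int(2.5)` yields 2 where `(2.5f64).round()` would yield 3.0, which is the kind of one-off difference that would break byte-parity in a budget calculation.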
Mirrors the Python content_router so the two newly-ported compressors actually run in the proxy's live zone (they were engine-only before).

- ContentType::SourceCode → CodeCompressor. The grammars are statically linked, so the singleton constructs in microseconds: a synchronous one-liner like Diff/Log/Search. Flips the existing `source_code_tool_result_routes_to_no_op` contract test (whose comment invited a future "wire it up" PR) to assert code_aware_compressor.
- ContentType::PlainText → Kompress, loaded CACHE-ONLY. This mirrors the Python reference's `allow_download=False` preload path: never download on a hot/startup thread; when the ~261 MB model isn't in the local HF cache, yield None and pass the text through unchanged, exactly as Python does when Kompress is unavailable. New `Kompress::from_cache` constructor + `hf_cache_file` resolver back this. The PlainText routing test is model-gated (asserts strategy "kompress" when cached, passthrough when not), matching the kompress parity test's gating.
- `warm_live_zone_compressors()` (exported) mirrors Python's `eager_load_compressors`: force the cache-only singletons off the request path. Proxy startup can call it; the lazy path works without it.

No regressions: headroom-core 918 tests + 7 live_zone_dispatch tests pass, full parity unchanged (all comparators green), clippy clean across core/parity/proxy, workspace builds.
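The routing contract in this commit reduces to a small match. The sketch below is an assumption-laden simplification: the real `ContentType` has more variants (Diff, Log, Search, and others), `route` is a hypothetical name, and readiness is modeled as a plain boolean instead of a loaded engine.

```rust
/// Simplified stand-in for the dispatcher's content classification.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ContentType {
    SourceCode,
    PlainText,
    Other,
}

/// Return the strategy name a block would be routed to, or `None` for
/// passthrough. `kompress_ready` models "enabled and loaded from cache".
fn route(content: ContentType, kompress_ready: bool) -> Option<&'static str> {
    match content {
        // Always on: the grammars are statically linked, so no I/O is needed.
        ContentType::SourceCode => Some("code_aware_compressor"),
        // Only when the model is available; otherwise text passes through.
        ContentType::PlainText if kompress_ready => Some("kompress"),
        _ => None,
    }
}
```

The model-gated routing test described above corresponds to the two `PlainText` cases: "kompress" when the model is cached, passthrough when it is not.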
Kompress carries a ~261 MB ONNX model, so (unlike the always-on structural compressors and the AST CodeCompressor) it now loads only when an operator opts in, mirroring the Python reference's `config.enable_kompress`.

- core: process-wide `KOMPRESS_ENABLED` (default off) + `set_kompress_enabled`. `kompress()` checks it before the OnceLock, so a disabled proxy never loads the model and PlainText passes through. `warm_live_zone_compressors` only warms Kompress when enabled.
- proxy: `--enable-kompress` / `HEADROOM_PROXY_ENABLE_KOMPRESS` flag (default false), threaded through CliArgs → Config → for_test. main.rs sets the gate from config and, when enabled, fires a `spawn_blocking` cache-only warm-up off the request path (mirrors Python's `eager_load_compressors`); a cold cache just leaves it deferred rather than stalling the bind.
- the model-gated PlainText dispatch test enables the gate explicitly, like the proxy does at startup.

No regressions: core 918 passed, proxy 407 passed, clippy clean across core/parity/proxy, workspace builds.
kompress() used OnceLock::get_or_init, so a request thread that hit a PlainText block while the ~261 MB model was still loading (or while a slow EP graph compile ran) blocked on that init and stalled the proxy. Make the request-path accessor a non-blocking OnceLock::get (None => pass through until ready); perform the one-time load only in warm_live_zone_compressors, off the request path.
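The difference between the two accessors can be sketched with the standard library alone. This is a hedged illustration of the pattern, not the PR's code: `Model`, `ENABLED`, `MODEL`, `set_enabled`, `model`, and `warm` are hypothetical names, and the loader is passed in as a closure where the real code would read from the HF cache.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for the loaded engine.
struct Model {
    name: &'static str,
}

/// Process-wide opt-in gate, off by default.
static ENABLED: AtomicBool = AtomicBool::new(false);
/// Holds the load result once a warm-up has run. `None` inside the lock means
/// the load was attempted and failed; this sketch remembers that outcome.
static MODEL: OnceLock<Option<Model>> = OnceLock::new();

fn set_enabled(on: bool) {
    ENABLED.store(on, Ordering::Relaxed);
}

/// Request-path accessor: uses `OnceLock::get`, which never runs the
/// initializer and never waits on one, so callers fall back to passthrough
/// until a warm-up has finished.
fn model() -> Option<&'static Model> {
    if !ENABLED.load(Ordering::Relaxed) {
        return None;
    }
    MODEL.get().and_then(|m| m.as_ref())
}

/// Warm-up, intended to run off the request path (for example on a blocking
/// worker thread). `get_or_init` may block, which is acceptable here.
fn warm<F>(load: F)
where
    F: FnOnce() -> Option<Model>,
{
    if ENABLED.load(Ordering::Relaxed) {
        MODEL.get_or_init(load);
    }
}
```

With this split, a request that arrives while `warm` is still running sees `None` from `model()` and passes the text through, instead of parking on the initializer as it would with `get_or_init`.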
ebb76ed to 80ce58e
13 tasks
Author
Rebased onto the updated #1153/#1154 and added …
JerrettDavis
requested changes
Jun 19, 2026
JerrettDavis
left a comment
Collaborator
This PR is not merge-ready in current GitHub state (mergeStateStatus=UNSTABLE). Please update from current main, resolve any conflicts if present, and rerun/clear required CI before this can be approved.
Description

Wires the two ported compressors (Kompress + CodeCompressor) into the live-zone dispatcher and adds the Kompress opt-in gate. This is PR 3 of 3 splitting #1143 and the only behavior-changing piece.

`dispatch_compressor` now routes `SourceCode -> CodeCompressor` and `PlainText -> Kompress`, filling the slots it had reserved with TODOs.

Stacked on PR 1 (Kompress) and PR 2 (CodeCompressor). Until those merge, this PR's diff against `main` also includes their content; review #1 and #2 first. Once they land, this PR shrinks to the dispatch wiring + flag (~250 lines, mostly the routing test).

Type of Change

Changes Made

- headroom-core, `transforms/live_zone.rs`: `dispatch_compressor` routes `SourceCode -> CodeCompressor` (always-on; grammars are statically linked, constructs in microseconds with no I/O) and `PlainText -> Kompress` (cache-only, gated). Adds `set_kompress_enabled` + `warm_live_zone_compressors`.
- headroom-core, `transforms/kompress.rs`: small additions to support the dispatcher's cache-only path.
- headroom-proxy, `config.rs` / `main.rs`: `--enable-kompress` / `HEADROOM_PROXY_ENABLE_KOMPRESS` flag (default off; Kompress carries a ~261 MB model so it is opt-in, mirroring the Python `config.enable_kompress`). Startup sets the gate and fires an off-request-path cache-only warm-up when enabled.
- tests: flipped/added live-zone dispatch routing tests.
- `CHANGELOG.md`: Unreleased entries for both compressors.

Testing

- `pytest`: N/A for runtime (Rust change)
- `ruff check .`: also `cargo clippy -D warnings`
- `mypy headroom`: N/A (Rust-only runtime change)

Test Output

Real Behavior Proof

- `cargo test -p headroom-core` and `cargo test -p headroom-proxy`; `cargo test -p headroom-core --test live_zone_dispatch`; re-ran the parity harness to confirm wiring did not perturb output.
- `PlainText` passes through unchanged; with `--enable-kompress` and a cold cache, dispatch still passes plain text through (Kompress unavailable) rather than blocking.

Review Readiness

Additional Notes

- … `<<ccr:>>` convention, so fixtures are recorded with `enable_ccr=False` for determinism.
- `pytest`/`mypy` checklist items are N/A (Rust change); equivalent gates are `cargo test` + `cargo clippy`, run above.

Update: non-blocking model load + rebase

Pushed `fix(kompress): never block the request path on model load` and rebased this branch onto the updated #1153/#1154.

- `kompress()`: the request-path accessor previously used `OnceLock::get_or_init`, so a request thread that hit a `PlainText` block while the ~261 MB model was still loading (or while a slow EP graph compile ran; the OpenVINO NPU compile takes ~13s+) blocked on that init and stalled the proxy. It's now a non-blocking `OnceLock::get` (returns `None` ⇒ pass through until ready); the one-time load happens only in `warm_live_zone_compressors`, off the request path. So an enabled-but-not-yet-warm Kompress degrades to passthrough instead of hanging live traffic.
- `from_cache`/`hf_cache_file` moved to the engine PR (feat(rust): port Kompress ML prose compressor to Rust (parity-only, 1/3) #1153) where they belong; this branch's wiring commit reduces to the dispatch glue. Stack integrity intact (stack/1 ← stack/2 ← stack/3).

Validated downstream against a live Intel NPU run: proxy stays responsive through the NPU compile, then `kompress_ready:true` and prose blocks compress on-device. `cargo fmt`/`clippy` clean, dispatch + core tests green.