
feat(rust): wire CodeCompressor + Kompress into live-zone dispatch + gate (3/3)#1155

Open
RubenAAA wants to merge 10 commits into
chopratejas:main from
RubenAAA:stack/3-dispatch-gate

Conversation

@RubenAAA commented Jun 19, 2026

Description

Wires the two ported compressors (Kompress + CodeCompressor) into the live-zone dispatcher and adds the Kompress opt-in gate. This is PR 3 of 3 splitting #1143 — the only behavior-changing piece. dispatch_compressor now routes SourceCode -> CodeCompressor and PlainText -> Kompress, filling the slots it had reserved with TODOs.
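To make the routing concrete, here is a minimal sketch under simplifying assumptions. The `ContentType` enum, the `Dispatched` struct, the `dispatch` function, and the `passthrough` strategy name are hypothetical stand-ins (only `code_aware_compressor` and `kompress` appear as strategy names in this PR), and the compressors are passed in as closures rather than the real singletons. Passing `None` for Kompress models both the gate-off default and an enabled-but-cold cache.

```rust
/// Simplified stand-in for the live-zone content classifier (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ContentType {
    SourceCode,
    PlainText,
    Other,
}

/// Strategy name plus output text for one dispatched block.
#[derive(Debug, PartialEq, Eq)]
pub struct Dispatched {
    pub strategy: &'static str,
    pub output: String,
}

/// Hypothetical dispatcher: SourceCode always goes to the code compressor;
/// PlainText goes to Kompress only when a ready instance is supplied,
/// otherwise the text passes through unchanged instead of waiting.
pub fn dispatch(
    kind: ContentType,
    text: &str,
    compress_code: &dyn Fn(&str) -> String,
    kompress: Option<&dyn Fn(&str) -> String>,
) -> Dispatched {
    // Shared fallback: return the input verbatim.
    let passthrough = || Dispatched {
        strategy: "passthrough",
        output: text.to_string(),
    };
    match kind {
        ContentType::SourceCode => Dispatched {
            strategy: "code_aware_compressor",
            output: compress_code(text),
        },
        ContentType::PlainText => match kompress {
            Some(k) => Dispatched {
                strategy: "kompress",
                output: k(text),
            },
            None => passthrough(),
        },
        ContentType::Other => passthrough(),
    }
}
```

Keeping "unavailable" as a plain `None` means the dispatcher never has to distinguish "disabled" from "still loading": both degrade to passthrough.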

Stacked on PR 1 (Kompress) and PR 2 (CodeCompressor). Until those merge, this PR's diff against main also includes their content; review PR 1 and PR 2 first. Once they land, this PR shrinks to the dispatch wiring + flag (~250 lines, mostly the routing test).

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Performance improvement
  • Code refactoring (no functional changes)

Changes Made

  • headroom-core: transforms/live_zone.rs, where dispatch_compressor now routes SourceCode -> CodeCompressor (always-on; grammars are statically linked, so it constructs in microseconds with no I/O) and PlainText -> Kompress (cache-only, gated). Also adds set_kompress_enabled + warm_live_zone_compressors.
  • headroom-core: transforms/kompress.rs — small additions to support the dispatcher's cache-only path.
  • headroom-proxy: config.rs / main.rs add the --enable-kompress / HEADROOM_PROXY_ENABLE_KOMPRESS flag (default off; Kompress carries a ~261 MB model, so it is opt-in, mirroring the Python config.enable_kompress). When enabled, startup sets the gate and fires a cache-only warm-up off the request path.
  • tests: flipped/added live-zone dispatch routing tests.
  • CHANGELOG.md: Unreleased entries for both compressors.

Testing

  • Unit tests pass (pytest) — N/A for runtime (Rust change)
  • Linting passes (ruff check .) — also cargo clippy -D warnings
  • Type checking passes (mypy headroom) — N/A (Rust-only runtime change)
  • New tests added for new functionality
  • Manual testing performed

Test Output

# Rust suites
cargo test -p headroom-core   ->  918 passed, 3 ignored (14 suites)
cargo test -p headroom-proxy  ->  407 passed (34 suites)
cargo test -p headroom-core --test live_zone_dispatch  ->  routing tests green

# Parity still byte-exact after wiring
[code_aware_compressor] total=30 matched=30 skipped=0 diffed=0
[kompress             ] total=21 matched=21 skipped=0 diffed=0

cargo clippy -p headroom-core -p headroom-proxy -p headroom-parity -- -D warnings  ->  No issues found
cargo fmt --check  ->  clean
ruff check . && ruff format --check .  ->  All checks passed

Real Behavior Proof

  • Environment: Linux (WSL2, kernel 6.6), Rust stable toolchain.
  • Exact command / steps: cargo test -p headroom-core and cargo test -p headroom-proxy; cargo test -p headroom-core --test live_zone_dispatch; re-ran the parity harness to confirm wiring did not perturb output.
  • Observed result: 918 core + 407 proxy tests pass; live-zone dispatch routing tests green; parity remains 30/30 code + 21/21 kompress; clippy and fmt clean. With the gate off (default), PlainText passes through unchanged; with --enable-kompress and a cold cache, dispatch still passes plain text through (Kompress unavailable) rather than blocking.
  • Not tested: end-to-end against a live provider (dispatcher is covered by integration tests, not live traffic in this PR); alternate ONNX execution providers (default CPU EP here).

Review Readiness

  • I have performed a self-review
  • This PR is ready for human review

Additional Notes

  • Split of feat(rust): port CodeCompressor + Kompress live-zone compressors to Rust (parity-gated) #1143 — PR 3 of 3. Lands the engines from PR 1 + PR 2 into the dispatcher.
  • CCR markers are left to the dispatcher: both engines return the compressed string only; live-zone CCR uses the <<ccr:>> convention, so fixtures are recorded with enable_ccr=False for determinism.
  • pytest/mypy checklist items are N/A — Rust change; equivalent gates are cargo test + cargo clippy, run above.

Update — non-blocking model load + rebase

Pushed fix(kompress): never block the request path on model load and rebased this branch onto the updated #1153/#1154.

  • Non-blocking kompress() — the request-path accessor previously used OnceLock::get_or_init, so a request thread that hit a PlainText block while the ~261 MB model was still loading (or while a slow EP graph compile ran — the OpenVINO NPU compile takes ~13s+) blocked on that init and stalled the proxy. It's now a non-blocking OnceLock::get (returns None ⇒ pass through until ready); the one-time load happens only in warm_live_zone_compressors, off the request path. So an enabled-but-not-yet-warm Kompress degrades to passthrough instead of hanging live traffic.
  • Rebase: from_cache/hf_cache_file moved to the engine PR (feat(rust): port Kompress ML prose compressor to Rust (parity-only, 1/3) #1153), where they belong; this branch's wiring commit reduces to the dispatch glue. Stack integrity is intact (stack/1 ← stack/2 ← stack/3).
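The difference between the old and new accessors can be sketched with `std::sync::OnceLock` alone. `KompressModel`, `warm`, and `warm_in_background` are hypothetical names, and a plain `std::thread::spawn` stands in for the proxy's `spawn_blocking` warm-up; the point is only that `get()` returns immediately while `get_or_init` would make the caller wait for the load.

```rust
use std::sync::OnceLock;
use std::thread;

/// Hypothetical stand-in for the loaded ONNX session + tokenizer.
pub struct KompressModel {
    pub name: &'static str,
}

/// Process-wide slot, filled once by the warm-up.
static KOMPRESS: OnceLock<KompressModel> = OnceLock::new();

/// Request-path accessor. `get()` never waits: while the model is still
/// loading it returns `None` and the caller passes text through.
/// (`KOMPRESS.get_or_init(load)` here would block the request thread
/// until the load, or a slow EP graph compile, finished.)
pub fn kompress() -> Option<&'static KompressModel> {
    KOMPRESS.get()
}

/// One-time load, meant to run off the request path. If another thread
/// already published a model, `set` returns `Err` and this copy is dropped.
pub fn warm(load: impl FnOnce() -> KompressModel) {
    let _ = KOMPRESS.set(load());
}

/// Run the warm-up on a background thread so neither startup nor
/// in-flight requests wait on it.
pub fn warm_in_background() -> thread::JoinHandle<()> {
    thread::spawn(|| {
        warm(|| KompressModel {
            name: "kompress-v2-base",
        })
    })
}
```

The trade-off of this shape is that a request arriving mid-warm-up gets uncompressed text rather than a delayed response.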

Validated downstream against a live Intel NPU run: the proxy stays responsive through the NPU compile, then reports kompress_ready:true, and prose blocks compress on-device. cargo fmt/clippy clean, dispatch + core tests green.

Ruben Avanesov added 2 commits June 19, 2026 10:56
Ports `headroom.transforms.kompress_compressor` (the ModernBERT token
compressor behind PlainText compression) to Rust — the last unported
compressor alongside CodeCompressor. Engine only; live-zone dispatch
wiring follows in a separate PR (matches how the other compressors
landed: port + parity first, then wire).

What it does:
- Loads the trained `chopratejas/kompress-v2-base` ONNX model (the
  inference weights) + the `answerdotai/ModernBERT-base` tokenizer (a
  fine-tune reuses its base vocab). Runs the ONNX/proxy compression
  path: whitespace-split → 350-word chunks → pre-tokenized encode
  (is_split_into_words) → ONNX `final_scores` → max-score-per-word →
  keep `> 0.5` → join kept words. Inputs under 10 words pass through unchanged.
- Inference via `ort` (direct dep added here — first session-API
  consumer; unifies to the ORT instance fastembed/magika already
  vendor). Tokenization via the `tokenizers` crate. Both reproduce the
  Python `transformers`/onnxruntime path exactly.
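A rough sketch of the word-level selection loop, assuming the tokenizer and ONNX session are replaced by a `score_chunk` closure that returns one score per token along with the token's `word_id` (`None` for special tokens). The 350-word chunk size, the 10-word minimum, and the strict `> 0.5` threshold come from the commit message; `TokenScore` and `compress_words` are hypothetical names, and dropping words that receive no token at all is an assumption of this sketch.

```rust
/// Hypothetical per-token output: the word the token belongs to
/// (`None` for special tokens) and the model's keep score.
pub struct TokenScore {
    pub word_id: Option<usize>,
    pub score: f32,
}

pub const CHUNK_WORDS: usize = 350;
pub const MIN_WORDS: usize = 10;
pub const KEEP_THRESHOLD: f32 = 0.5;

/// Sketch of the word-level selection. `score_chunk` stands in for
/// pre-tokenized encoding + ONNX inference on one chunk of words, with
/// `word_id`s relative to that chunk.
pub fn compress_words(text: &str, score_chunk: impl Fn(&[&str]) -> Vec<TokenScore>) -> String {
    let words: Vec<&str> = text.split_whitespace().collect();
    // Short inputs are returned unchanged.
    if words.len() < MIN_WORDS {
        return text.to_string();
    }
    let mut kept: Vec<&str> = Vec::new();
    for chunk in words.chunks(CHUNK_WORDS) {
        // Collapse subword scores to one score per word by taking the max.
        let mut word_max = vec![f32::NEG_INFINITY; chunk.len()];
        for tok in score_chunk(chunk) {
            if let Some(w) = tok.word_id {
                if w < chunk.len() && tok.score > word_max[w] {
                    word_max[w] = tok.score;
                }
            }
        }
        // Keep a word only if its best token score is strictly above 0.5.
        for (word, &best) in chunk.iter().zip(&word_max) {
            if best > KEEP_THRESHOLD {
                kept.push(*word);
            }
        }
    }
    kept.join(" ")
}
```

Taking the max over a word's subword tokens means one confident subword is enough to keep the whole word.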

Parity (byte-exact against the Python reference):
- Tokenizer `input_ids`/`word_ids` reproduce HF on pre-tokenized input.
- ONNX scores match to ~1e-6 (far below the 0.5 keep threshold).
- Kept-word set + joined output match byte-for-byte.
- `KompressComparator` wired into the parity harness; 21 fixtures
  recorded via recorder.py (enable_ccr=False → deterministic output).
  `cargo run -p headroom-parity -- run --only kompress`: 21/21 matched.
  Model-gated: skips (not fails) when the model isn't in the HF cache.

Tests:
- crates/headroom-core/tests/kompress_parity.rs (byte-parity + passthrough)
- module unit tests (config defaults, result helpers)
- clippy clean, no regressions across the full parity suite.

CCR offload of dropped words is left to the dispatcher (the `<<ccr:>>`
convention), not this engine — the inline Python marker is intentionally
not reproduced.
@github-actions

Contributor

PR governance

This PR follows the template and is marked ready for human review.

github-actions bot added the "status: ready for review" label (pull request body is complete and the author marked it ready for human review) on Jun 19, 2026
Ruben Avanesov added 8 commits June 19, 2026 13:10
…ape ONNX

Three self-contained additions to the cache-only loader, all in kompress.rs:

- Cross-platform HF cache resolution: resolve the model via HF_HUB_CACHE /
  HF_HOME / HOME / USERPROFILE (was $HOME-only), so from_cache finds the
  model on native Windows, not just Linux.
- Load diagnostics: from_cache now logs WHY it defers (kompress_cache_miss /
  kompress_session_build_failed with the searched roots + the real session
  build error) instead of returning None silently — the #1 cause of an
  unexplained kompress_ready=false.
- Static-shape ONNX support: detect a fixed input_ids seq dimension on the
  loaded model and right-pad each chunk to it (masked padding => identical
  scores); dynamic models keep their natural length at zero padding cost.
  Enables execution providers that cannot compile dynamic shapes (OpenVINO
  NPU). Parity 21/21 in both static and dynamic modes.
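Two of these additions can be sketched in isolation. `hf_cache_roots` is a hypothetical helper that lists candidate roots in the order named above, assuming the standard huggingface_hub layout (`HF_HUB_CACHE` is the hub cache itself, `HF_HOME` holds a `hub/` subdirectory, and the default is `.cache/huggingface/hub` under the home directory); it stops at the roots and does not locate model files inside them. `pad_to_static` is an equally hypothetical illustration of right-padding a chunk to a fixed `seq_len` with a zeroed attention mask; the pad token id is passed in rather than assumed.

```rust
use std::path::PathBuf;

/// Hypothetical helper: candidate Hugging Face hub cache roots in priority
/// order. `env` is injected (e.g. `|k| std::env::var(k).ok()`) so the
/// lookup is testable without touching the real environment.
pub fn hf_cache_roots(env: impl Fn(&str) -> Option<String>) -> Vec<PathBuf> {
    let mut roots = Vec::new();
    // HF_HUB_CACHE points directly at the hub cache.
    if let Some(dir) = env("HF_HUB_CACHE") {
        roots.push(PathBuf::from(dir));
    }
    // HF_HOME holds the hub cache in a `hub` subdirectory.
    if let Some(home) = env("HF_HOME") {
        roots.push(PathBuf::from(home).join("hub"));
    }
    // Default location under the home directory: HOME on Unix,
    // USERPROFILE on native Windows.
    for var in ["HOME", "USERPROFILE"] {
        if let Some(h) = env(var) {
            roots.push(PathBuf::from(h).join(".cache").join("huggingface").join("hub"));
        }
    }
    roots
}

/// Hypothetical helper: right-pad one chunk to a static sequence length.
/// Padded positions get attention_mask = 0, so real tokens are unaffected.
pub fn pad_to_static(ids: &[i64], seq_len: usize, pad_id: i64) -> (Vec<i64>, Vec<i64>) {
    assert!(ids.len() <= seq_len, "chunk is longer than the model's fixed seq dim");
    let mut input_ids = ids.to_vec();
    let mut attention_mask = vec![1_i64; ids.len()];
    input_ids.resize(seq_len, pad_id);
    attention_mask.resize(seq_len, 0);
    (input_ids, attention_mask)
}
```

For a dynamic-shape model the padding step is simply skipped, which is why it costs nothing there.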
…harness

Ports headroom.transforms.code_compressor (CodeAwareCompressor) to Rust on the
same branch as the Kompress port, so one PR delivers both new compressors
(SmartCrusher is already upstream).

Engine (crates/headroom-core/src/transforms/code_compressor.rs):
- tree-sitter AST parsing for python/js/ts/go/rust/java/c/cpp
- language detection (regex prefilter -> fewest-errors tree-sitter)
- symbol-importance scoring (min-max normalized, round-3 half-even) + body
  budget allocation, statement-level body truncation, omitted-line comments
  with call info, Python docstring handling (first_line/full/remove incl.
  multiline first-line reconstruction), syntax-validity guard (re-parse;
  return original on ERROR/MISSING).
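The scoring arithmetic can be illustrated with a small sketch. `min_max_normalize`, `round_half_even`, and `round3_approx` are hypothetical names, and mapping a constant input to all zeros is an assumption. Note that `round3_approx` scales by 1000 before rounding, which only approximates Python's `round(x, 3)`: CPython rounds the exact decimal value of the float, so a byte-exact port needs more care than this sketch takes.

```rust
/// Min-max normalize raw symbol scores into [0, 1].
/// Assumption: a constant input maps to all zeros.
pub fn min_max_normalize(raw: &[f64]) -> Vec<f64> {
    let min = raw.iter().copied().fold(f64::INFINITY, f64::min);
    let max = raw.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let span = max - min;
    raw.iter()
        .map(|&x| if span > 0.0 { (x - min) / span } else { 0.0 })
        .collect()
}

/// Round half to even, like Python's built-in `round(x)` on floats.
/// (On Rust 1.77+, `f64::round_ties_even` does the same.)
pub fn round_half_even(x: f64) -> f64 {
    if x.fract().abs() == 0.5 {
        // Exact tie: 2 * round(x / 2) lands on the even neighbour.
        2.0 * (x / 2.0).round()
    } else {
        // Not a tie, so ordinary rounding agrees with half-to-even.
        x.round()
    }
}

/// Approximates Python's `round(x, 3)` by scaling; see the caveat above.
pub fn round3_approx(x: f64) -> f64 {
    round_half_even(x * 1000.0) / 1000.0
}
```

Rust's `f64::round` breaks ties away from zero (2.5 becomes 3.0), which is exactly the divergence from Python that half-to-even rounding avoids.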

Grammar-version parity is the precondition: the Rust tree-sitter-<lang> crates
are pinned to the exact versions of the Python wheels the fixtures were
recorded against (same version number on crates.io + PyPI = same grammar
source = identical ASTs). A canary over 9 samples x 8 languages confirmed
node-for-node identical node-type + line-span trees at these pins.

Parity: cargo run -p headroom-parity -- run --only code_aware_compressor
-> 30/30 matched (18 non-trivial across all 8 languages + unknown +
invalid-syntax fallback; all 3 docstring modes). py_round_int/py_round3
half-to-even verified against CPython. Integration test +
record_code_compressor_fixtures.py mirror the Kompress harness. Fixtures
recorded with enable_ccr=False + fallback_to_kompress=False for determinism;
live-zone dispatch wiring (SourceCode slot) is a deliberate follow-up.

Mirrors the Python content_router so the two newly-ported compressors
actually run in the proxy's live zone (they were engine-only before).

- ContentType::SourceCode → CodeCompressor. The grammars are statically
  linked, so the singleton constructs in microseconds — a synchronous
  one-liner like Diff/Log/Search. Flips the existing
  `source_code_tool_result_routes_to_no_op` contract test (whose comment
  invited a future "wire it up" PR) to assert code_aware_compressor.

- ContentType::PlainText → Kompress, loaded CACHE-ONLY. This mirrors the
  Python reference's `allow_download=False` preload path: never download
  on a hot/startup thread; when the ~261 MB model isn't in the local HF
  cache, yield None and pass the text through unchanged, exactly as Python
  does when Kompress is unavailable. New `Kompress::from_cache` constructor
  + `hf_cache_file` resolver back this. The PlainText routing test is
  model-gated (asserts strategy "kompress" when cached, passthrough when
  not), matching the kompress parity test's gating.

- `warm_live_zone_compressors()` (exported) mirrors Python's
  `eager_load_compressors`: force the cache-only singletons off the request
  path. Proxy startup can call it; the lazy path works without it.

No regressions: headroom-core 918 tests + 7 live_zone_dispatch tests pass,
full parity unchanged (all comparators green), clippy clean across
core/parity/proxy, workspace builds.

Kompress carries a ~261 MB ONNX model, so — unlike the always-on structural
compressors and the AST CodeCompressor — it now loads only when an operator
opts in, mirroring the Python reference's `config.enable_kompress`.

- core: process-wide `KOMPRESS_ENABLED` (default off) + `set_kompress_enabled`.
  `kompress()` checks it before the OnceLock, so a disabled proxy never loads
  the model and PlainText passes through. `warm_live_zone_compressors` only
  warms Kompress when enabled.
- proxy: `--enable-kompress` / `HEADROOM_PROXY_ENABLE_KOMPRESS` flag (default
  false), threaded through CliArgs → Config → for_test. main.rs sets the gate
  from config and, when enabled, fires a `spawn_blocking` cache-only warm-up
  off the request path (mirrors Python's `eager_load_compressors`) — a cold
  cache just leaves it deferred rather than stalling the bind.
- the model-gated PlainText dispatch test enables the gate explicitly, like
  the proxy does at startup.
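A minimal sketch of the gate and flag handling, assuming a plain `AtomicBool` for the process-wide switch. `kompress_enabled`, `use_kompress`, and `resolve_enable_kompress` are hypothetical helpers, and the accepted environment values (`1`, `true`, `yes`, `on`, case-insensitive) are an assumption; the real proxy threads the flag through CliArgs → Config and may parse it differently.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Process-wide opt-in gate, default off.
static KOMPRESS_ENABLED: AtomicBool = AtomicBool::new(false);

/// Set once at startup from config. Relaxed ordering suffices because the
/// flag guards no other data; the model itself is published separately.
pub fn set_kompress_enabled(on: bool) {
    KOMPRESS_ENABLED.store(on, Ordering::Relaxed);
}

pub fn kompress_enabled() -> bool {
    KOMPRESS_ENABLED.load(Ordering::Relaxed)
}

/// Request path: check the gate first, so a disabled proxy never touches
/// the model. `model_ready` stands in for the non-blocking cache lookup.
pub fn use_kompress(model_ready: bool) -> bool {
    kompress_enabled() && model_ready
}

/// Hypothetical flag resolution: the CLI switch wins; otherwise the
/// environment value is read, and anything unrecognised means "off".
pub fn resolve_enable_kompress(cli_flag: bool, env_value: Option<&str>) -> bool {
    if cli_flag {
        return true;
    }
    matches!(
        env_value.map(|v| v.trim().to_ascii_lowercase()).as_deref(),
        Some("1" | "true" | "yes" | "on")
    )
}
```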

No regressions: core 918 passed, proxy 407 passed, clippy clean across
core/parity/proxy, workspace builds.

kompress() used OnceLock::get_or_init, so a request thread that hit a
PlainText block while the ~261 MB model was still loading (or while a slow
EP graph compile ran) blocked on that init and stalled the proxy. Make the
request-path accessor a non-blocking OnceLock::get (None => pass through
until ready); perform the one-time load only in warm_live_zone_compressors,
off the request path.
@RubenAAA

Author

Rebased onto the updated #1153/#1154 and added fix(kompress): never block the request path on model load — the old get_or_init accessor blocked request threads on the (slow, ~13s+ on NPU) model load and stalled the proxy; it's now a non-blocking get that degrades to passthrough until the off-path warm completes. Stack integrity intact. Description updated.

@JerrettDavis left a comment

Collaborator

This PR is not merge-ready in current GitHub state (mergeStateStatus=UNSTABLE). Please update from current main, resolve any conflicts if present, and rerun/clear required CI before this can be approved.
