feat(rust): port CodeCompressor AST compressor to Rust (parity-only, 2/3)#1154
Open
RubenAAA wants to merge 5 commits into
added 2 commits
June 19, 2026 10:56
Ports `headroom.transforms.kompress_compressor` (the ModernBERT token compressor behind PlainText compression) to Rust, the last unported compressor alongside CodeCompressor. Engine only; live-zone dispatch wiring follows in a separate PR (matches how the other compressors landed: port + parity first, then wire).

What it does:
- Loads the trained `chopratejas/kompress-v2-base` ONNX model (the inference weights) + the `answerdotai/ModernBERT-base` tokenizer (a fine-tune reuses its base vocab). Runs the ONNX/proxy compression path: whitespace-split → 350-word chunks → pre-tokenized encode (is_split_into_words) → ONNX `final_scores` → max-score-per-word → keep `> 0.5` → join kept words. Inputs of fewer than 10 words pass through.
- Inference via `ort` (direct dep added here; first session-API consumer; unifies to the ORT instance fastembed/magika already vendor). Tokenization via the `tokenizers` crate. Both reproduce the Python `transformers`/onnxruntime path exactly.

Parity (byte-exact against the Python reference):
- Tokenizer `input_ids`/`word_ids` reproduce HF on pre-tokenized input.
- ONNX scores match to ~1e-6 (far below the 0.5 keep threshold).
- Kept-word set + joined output match byte-for-byte.
- `KompressComparator` wired into the parity harness; 21 fixtures recorded via recorder.py (enable_ccr=False → deterministic output). `cargo run -p headroom-parity -- run --only kompress`: 21/21 matched. Model-gated: skips (not fails) when the model isn't in the HF cache.

Tests:
- crates/headroom-core/tests/kompress_parity.rs (byte-parity + passthrough)
- module unit tests (config defaults, result helpers)
- clippy clean, no regressions across the full parity suite.

CCR offload of dropped words is left to the dispatcher (the `<<ccr:>>` convention), not this engine; the inline Python marker is intentionally not reproduced.
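The compression path described in the commit message above (split, chunk, score, max per word, threshold, join) can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `compress_words` and the `score_chunk` callback are hypothetical names, and the callback stands in for the tokenizer plus ONNX session by returning `(word_index, score)` pairs for each sub-word token.

```rust
/// Sketch of the keep/drop pipeline. `score_chunk` is a hypothetical stand-in
/// for tokenizer + ONNX inference: it returns one `(word_index, score)` pair
/// per sub-word token of the chunk.
fn compress_words<F>(text: &str, chunk_words: usize, threshold: f32, score_chunk: F) -> String
where
    F: Fn(&[&str]) -> Vec<(usize, f32)>,
{
    let words: Vec<&str> = text.split_whitespace().collect();
    // Inputs of fewer than 10 words pass through unchanged.
    if words.len() < 10 {
        return text.to_string();
    }
    let mut kept: Vec<&str> = Vec::new();
    for chunk in words.chunks(chunk_words) {
        // Reduce token scores to a single score per word by taking the max.
        let mut best = vec![f32::NEG_INFINITY; chunk.len()];
        for (word_idx, score) in score_chunk(chunk) {
            if word_idx < chunk.len() && score > best[word_idx] {
                best[word_idx] = score;
            }
        }
        // Keep only words whose best token score is strictly above the threshold.
        for (word, score) in chunk.iter().zip(&best) {
            if *score > threshold {
                kept.push(*word);
            }
        }
    }
    kept.join(" ")
}
```

With the values from the commit message, a call would look like `compress_words(text, 350, 0.5, scorer)`. Injecting the scorer keeps the word-level logic testable without a model in the HF cache.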
Contributor
PR governance: This PR follows the template and is marked ready for human review.
21 tasks
added 3 commits
June 19, 2026 13:10
…ape ONNX

Three self-contained additions to the cache-only loader, all in kompress.rs:
- Cross-platform HF cache resolution: resolve the model via HF_HUB_CACHE / HF_HOME / HOME / USERPROFILE (was $HOME-only), so from_cache finds the model on native Windows, not just Linux.
- Load diagnostics: from_cache now logs WHY it defers (kompress_cache_miss / kompress_session_build_failed with the searched roots + the real session build error) instead of returning None silently, which was the #1 cause of an unexplained kompress_ready=false.
- Static-shape ONNX support: detect a fixed input_ids seq dimension on the loaded model and right-pad each chunk to it (masked padding => identical scores); dynamic models keep their natural length at zero padding cost. Enables execution providers that cannot compile dynamic shapes (OpenVINO NPU).

Parity 21/21 in both static and dynamic modes.
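As an illustration of the cross-platform cache resolution described above, here is a minimal sketch. It assumes the standard Hugging Face layout (`HF_HUB_CACHE` points at the hub cache itself, `HF_HOME` contains a `hub` subdirectory, and the default is `.cache/huggingface/hub` under the home directory). The function name `hf_cache_roots`, the priority order, and the choice to collect every candidate are assumptions for the example, not necessarily what `from_cache` does.

```rust
use std::path::PathBuf;

/// Candidate hub-cache roots in an assumed priority order.
/// `get_env` is injected so the lookup can be exercised without
/// mutating the real process environment.
fn hf_cache_roots<F>(get_env: F) -> Vec<PathBuf>
where
    F: Fn(&str) -> Option<String>,
{
    // Treat empty variables the same as unset ones.
    let non_empty = |key: &str| get_env(key).filter(|v| !v.is_empty());
    let mut roots = Vec::new();

    // Explicit hub cache directory.
    if let Some(dir) = non_empty("HF_HUB_CACHE") {
        roots.push(PathBuf::from(dir));
    }
    // HF_HOME holds the hub cache in its `hub` subdirectory.
    if let Some(dir) = non_empty("HF_HOME") {
        roots.push(PathBuf::from(dir).join("hub"));
    }
    // Default location under the home directory: HOME on Unix-likes,
    // USERPROFILE on native Windows.
    for key in ["HOME", "USERPROFILE"] {
        if let Some(home) = non_empty(key) {
            roots.push(PathBuf::from(home).join(".cache").join("huggingface").join("hub"));
        }
    }
    roots
}
```

In production code the closure would typically be `|k| std::env::var(k).ok()`. Returning the full list of searched roots also makes it easy to include them in a cache-miss log line.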
…harness

Ports headroom.transforms.code_compressor (CodeAwareCompressor) to Rust on the same branch as the Kompress port, so one PR delivers both new compressors (SmartCrusher is already upstream).

Engine (crates/headroom-core/src/transforms/code_compressor.rs):
- tree-sitter AST parsing for python/js/ts/go/rust/java/c/cpp
- language detection (regex prefilter -> fewest-errors tree-sitter)
- symbol-importance scoring (min-max normalized, round-3 half-even) + body budget allocation, statement-level body truncation, omitted-line comments with call info, Python docstring handling (first_line/full/remove incl. multiline first-line reconstruction), syntax-validity guard (re-parse; return original on ERROR/MISSING).

Grammar-version parity is the precondition: the Rust tree-sitter-<lang> crates are pinned to the exact versions of the Python wheels the fixtures were recorded against (same version number on crates.io + PyPI = same grammar source = identical ASTs). A canary over 9 samples x 8 languages confirmed node-for-node identical node-type + line-span trees at these pins.

Parity: cargo run -p headroom-parity -- run --only code_aware_compressor -> 30/30 matched (18 non-trivial across all 8 languages + unknown + invalid-syntax fallback; all 3 docstring modes). py_round_int/py_round3 half-to-even verified against CPython.

Integration test + record_code_compressor_fixtures.py mirror the Kompress harness. Fixtures recorded with enable_ccr=False + fallback_to_kompress=False for determinism; live-zone dispatch wiring (SourceCode slot) is a deliberate follow-up.
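To make the rounding requirement concrete, the sketch below shows ties-to-even rounding to an integer (the rule CPython's `round(x)` applies) next to a plain min-max normalization. The names `round_half_even` and `min_max_normalize` are hypothetical and are not the `py_round_int`/`py_round3` helpers from this PR. The 3-decimal case is deliberately left out: naively scaling by 1000 and rounding is not guaranteed to agree with CPython's `round(x, 3)` for every float. Mapping a constant input to zeros is also an assumption.

```rust
/// Round to the nearest integer, resolving exact .5 ties toward the even
/// neighbour (unlike `f64::round`, which rounds ties away from zero).
fn round_half_even(x: f64) -> f64 {
    let floor = x.floor();
    let diff = x - floor;
    if diff > 0.5 {
        floor + 1.0
    } else if diff < 0.5 {
        floor
    } else if floor % 2.0 == 0.0 {
        // Exact tie and the lower neighbour is even: keep it.
        floor
    } else {
        // Exact tie and the lower neighbour is odd: go up to the even one.
        floor + 1.0
    }
}

/// Min-max normalize scores into [0, 1].
/// Assumption: when all scores are equal (zero range), every output is 0.0.
fn min_max_normalize(scores: &[f64]) -> Vec<f64> {
    let min = scores.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let range = max - min;
    scores
        .iter()
        .map(|s| if range > 0.0 { (s - min) / range } else { 0.0 })
        .collect()
}
```

The tie rule matters for parity because `2.5` rounds to `2` under half-even but to `3` under round-half-away-from-zero, and a one-unit difference in a budget can change which statements survive truncation.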
92a9c51 to
c5dd978
Compare
13 tasks
JerrettDavis
requested changes
Jun 19, 2026
JerrettDavis
left a comment
Collaborator
This PR is not merge-ready in current GitHub state (mergeStateStatus=UNSTABLE). Please update from current main, resolve any conflicts if present, and rerun/clear required CI before this can be approved.
Description
Ports the CodeCompressor, an AST-based, syntax-preserving source-code compressor (tree-sitter), to the Rust `headroom-core` engine, a byte-for-byte port of `headroom/transforms/code_compressor.py` (`CodeAwareCompressor`). This is PR 2 of 3 splitting #1143. It adds the engine + parity fixtures only and does not wire it into the live-zone dispatcher (PR 3). The engine is additive and inert until dispatch lands.

Stacked on PR 1 (Kompress). Until PR 1 merges, this PR's diff against `main` also includes PR 1's content; review PR 1 first.

Type of Change
Changes Made
- `headroom-core`, `transforms/code_compressor.rs`: full CodeCompressor port, covering language detection (regex prefilter then fewest-errors tree-sitter), symbol-importance scoring (min-max normalized, CPython half-to-even rounding), per-function body-budget allocation, statement-level body truncation with `# [N lines omitted; calls: ...]` summaries, Python docstring handling, and a re-parse syntax-validity guard that returns the original on any invalid output.
- `headroom-core`, `transforms/mod.rs`: exports the new module.
- `headroom-parity`: `CodeCompressorComparator` + recorder (`scripts/record_code_compressor_fixtures.py`) + 30 recorded fixtures under `tests/parity/fixtures/code_aware_compressor/`.
- `Cargo.toml`: adds `tree-sitter` (0.25.2) + 8 grammar crates, each pinned (`=`) to the exact version of the Python `tree-sitter-*` wheels the fixtures were recorded against, so Rust and Python parsers emit node-for-node identical ASTs (the precondition for byte-parity).

Testing
- `pytest`: N/A for runtime (Rust change); dev-time fixture tooling passes `ruff check` + `ruff format`
- `ruff check .`: also `cargo clippy -D warnings`
- `mypy headroom`: N/A (Rust-only runtime change)

Test Output
Real Behavior Proof
`cargo run -p headroom-parity -- run --only code_aware_compressor`; then `cargo test -p headroom-core`.

`cargo test -p headroom-core` reports 917 passed, 3 ignored; clippy and fmt clean. Compressed output preserves signatures, imports, the first statement of each body, and inter-function call edges; output re-parses cleanly (the syntax-validity guard returns the original otherwise).

Review Readiness
Additional Notes
- … `--enable-kompress` gate.
- … `grammar.js` source, hence the same generated `parser.c`. Bumping any grammar pin requires re-running the canary and re-recording fixtures.
- `pytest`/`mypy` checklist items are N/A (Rust change); equivalent gates are `cargo test` + `cargo clippy`, run above.
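
For readers unfamiliar with the truncation summaries and the validity guard listed under Changes Made, the following is a deliberately simplified sketch. It works on lines rather than tree-sitter statements, and `truncate_body`, `guard`, and the `is_valid` callback are hypothetical names. The real engine re-parses the output with tree-sitter; here the validity check is injected as a closure.

```rust
/// Keep the first `keep` lines of a function body and replace the rest with
/// a one-line summary comment. Line-based simplification of the
/// statement-level truncation described in this PR.
fn truncate_body(body: &[&str], keep: usize, calls: &[&str]) -> Vec<String> {
    if body.len() <= keep {
        return body.iter().map(|l| l.to_string()).collect();
    }
    let mut out: Vec<String> = body[..keep].iter().map(|l| l.to_string()).collect();
    let omitted = body.len() - keep;
    // Reuse the indentation of the first omitted line so the comment
    // stays inside the body block.
    let indent: String = body[keep].chars().take_while(|c| c.is_whitespace()).collect();
    out.push(format!(
        "{}# [{} lines omitted; calls: {}]",
        indent,
        omitted,
        calls.join(", ")
    ));
    out
}

/// Return the compressed text only if it passes the validity check;
/// otherwise fall back to the original input.
fn guard<'a>(original: &'a str, compressed: &'a str, is_valid: impl Fn(&str) -> bool) -> &'a str {
    if is_valid(compressed) {
        compressed
    } else {
        original
    }
}
```

The guard makes compression fail safe: a caller never receives output that the validity check rejects, at the cost of no savings for that input.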