feat(rust): port CodeCompressor AST compressor to Rust (parity-only, 2/3) #1154

Open: RubenAAA wants to merge 5 commits into chopratejas:main from RubenAAA:stack/2-code-compressor

@RubenAAA

Description

Ports the CodeCompressor, an AST-based, syntax-preserving source-code compressor built on tree-sitter, to the Rust headroom-core engine. It is a byte-for-byte port of headroom/transforms/code_compressor.py (CodeAwareCompressor). This is PR 2 of 3 splitting #1143. It adds the engine and parity fixtures only; wiring into the live-zone dispatcher follows in PR 3. The engine is additive and inert until dispatch lands.

Stacked on PR 1 (Kompress). Until PR 1 merges, this PR's diff against main also includes PR 1's content; review PR 1 first.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Performance improvement
  • Code refactoring (no functional changes)

Changes Made

  • headroom-core: transforms/code_compressor.rs — full CodeCompressor port: language detection (regex prefilter then fewest-errors tree-sitter), symbol-importance scoring (min-max normalized, CPython half-to-even rounding), per-function body-budget allocation, statement-level body truncation with # [N lines omitted; calls: ...] summaries, Python docstring handling, and a re-parse syntax-validity guard that returns the original on any invalid output.
  • headroom-core: transforms/mod.rs — exports the new module.
  • headroom-parity: CodeCompressorComparator + recorder (scripts/record_code_compressor_fixtures.py) + 30 recorded fixtures under tests/parity/fixtures/code_aware_compressor/.
  • Cargo.toml: adds tree-sitter (0.25.2) + 8 grammar crates, each pinned (=) to the exact version of the Python tree-sitter-* wheels the fixtures were recorded against, so Rust and Python parsers emit node-for-node identical ASTs (the precondition for byte-parity).
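
The scoring arithmetic named in the first bullet can be sketched as follows. This is a minimal illustration, not the code in code_compressor.rs: the function names are hypothetical, returning 0.0 when all scores are equal is an assumption, and the scale-by-1000 rounding is a simplification that can differ from CPython's round(x, 3) on inputs where the multiplication itself rounds.

```rust
/// Round to the nearest integer, sending exact ties to the even neighbour
/// (the rule CPython's `round(x)` applies).
fn round_half_even(x: f64) -> f64 {
    let floor = x.floor();
    // The fractional part of a finite double is exactly representable,
    // so the comparison against 0.5 below is exact.
    let frac = x - floor;
    if frac < 0.5 {
        floor
    } else if frac > 0.5 {
        floor + 1.0
    } else if floor % 2.0 == 0.0 {
        floor
    } else {
        floor + 1.0
    }
}

/// Simplified three-decimal rounding: scale, round ties-to-even, unscale.
/// Assumption: this is NOT guaranteed to match CPython's `round(x, 3)`
/// on every input, because `x * 1000.0` can itself round.
fn round3_half_even(x: f64) -> f64 {
    round_half_even(x * 1000.0) / 1000.0
}

/// Min-max normalize scores into [0, 1], rounded to three decimals.
/// Assumption: when all scores are equal, every normalized score is 0.0.
fn min_max_normalize(scores: &[f64]) -> Vec<f64> {
    let min = scores.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let span = max - min;
    scores
        .iter()
        .map(|&s| {
            if span > 0.0 {
                round3_half_even((s - min) / span)
            } else {
                0.0
            }
        })
        .collect()
}
```

The tie cases are where a port typically diverges: round_half_even(2.5) yields 2.0, whereas Rust's f64::round yields 3.0.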

Testing

  • Unit tests pass (pytest) — N/A for runtime (Rust change); dev-time fixture tooling passes ruff check + ruff format
  • Linting passes (ruff check .) — also cargo clippy -D warnings
  • Type checking passes (mypy headroom) — N/A (Rust-only runtime change)
  • New tests added for new functionality
  • Manual testing performed

Test Output

# Byte-parity: Rust port vs Python reference (headroom-parity harness)
cargo run -p headroom-parity -- run --only code_aware_compressor
[code_aware_compressor] total=30 matched=30 skipped=0 diffed=0

# Grammar canary (precondition): node-type + line-span trees, Python vs pinned Rust crates
=> 9/9 samples across 8 languages produce byte-identical ASTs

# Rust suites
cargo test -p headroom-core   ->  917 passed, 3 ignored (14 suites)
cargo clippy -p headroom-core -p headroom-parity -- -D warnings  ->  No issues found
cargo fmt --check  ->  clean
ruff check . && ruff format --check .  ->  All checks passed

Real Behavior Proof

  • Environment: Linux (WSL2, kernel 6.6), Rust stable toolchain, Python 3.13 for reference recording.
  • Exact command / steps: grammar canary (dump + compare node-type/line-span trees, Python tree-sitter stack vs pinned Rust crates); then cargo run -p headroom-parity -- run --only code_aware_compressor; then cargo test -p headroom-core.
  • Observed result: 9/9 canary samples produce byte-identical ASTs; 30/30 code fixtures match the Python reference byte-for-byte; cargo test -p headroom-core reports 917 passed, 3 ignored; clippy and fmt clean. Compressed output preserves signatures, imports, the first statement of each body, and inter-function call edges; output re-parses cleanly (the syntax-validity guard returns the original otherwise).
  • Not tested: live-zone dispatch routing (intentionally not wired here — lands in PR 3); non-ASCII identifiers/string contents (documented out-of-parity-scope).

Review Readiness

  • I have performed a self-review
  • This PR is ready for human review

Additional Notes

  • Split of #1143 ("feat(rust): port CodeCompressor + Kompress live-zone compressors to Rust (parity-gated)"). This is PR 2 of 3, stacked on PR 1. PR 3 wires both engines into the live-zone dispatcher and adds the --enable-kompress gate.
  • Grammar-version parity is the make-or-break invariant: same version number on crates.io + PyPI means the same grammar.js source, hence the same generated parser.c. Bumping any grammar pin requires re-running the canary and re-recording fixtures.
  • pytest/mypy checklist items are N/A — Rust change; equivalent gates are cargo test + cargo clippy, run above.
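
The grammar canary described above compares node-type and line-span trees from the two parser stacks. A minimal sketch of that comparison step follows, assuming each tree has already been flattened into a pre-order list; the NodeSpan type and function names are hypothetical, and the actual canary tooling may work differently.

```rust
/// One flattened AST node: its grammar node type and its line span.
#[derive(Debug, Clone, PartialEq, Eq)]
struct NodeSpan {
    kind: String,
    start_line: usize,
    end_line: usize,
}

/// Convenience constructor for building dumps by hand.
fn span(kind: &str, start_line: usize, end_line: usize) -> NodeSpan {
    NodeSpan {
        kind: kind.to_string(),
        start_line,
        end_line,
    }
}

/// Return the index of the first node where the two dumps disagree,
/// or `None` when they are identical node for node.
fn first_divergence(python: &[NodeSpan], rust: &[NodeSpan]) -> Option<usize> {
    let common = python.len().min(rust.len());
    for i in 0..common {
        if python[i] != rust[i] {
            return Some(i);
        }
    }
    // Equal prefixes but different lengths: one side has extra nodes.
    if python.len() != rust.len() {
        Some(common)
    } else {
        None
    }
}
```

Reporting the first diverging index, rather than a plain pass/fail, points directly at the grammar rule that changed when a pin is bumped.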

Ruben Avanesov added 2 commits June 19, 2026 10:56
Ports `headroom.transforms.kompress_compressor` (the ModernBERT token
compressor behind PlainText compression) to Rust — the last unported
compressor alongside CodeCompressor. Engine only; live-zone dispatch
wiring follows in a separate PR (matches how the other compressors
landed: port + parity first, then wire).

What it does:
- Loads the trained `chopratejas/kompress-v2-base` ONNX model (the
  inference weights) + the `answerdotai/ModernBERT-base` tokenizer (a
  fine-tune reuses its base vocab). Runs the ONNX/proxy compression
  path: whitespace-split → 350-word chunks → pre-tokenized encode
  (is_split_into_words) → ONNX `final_scores` → max-score-per-word →
  keep `> 0.5` → join kept words. <10 words passes through.
- Inference via `ort` (direct dep added here — first session-API
  consumer; unifies to the ORT instance fastembed/magika already
  vendor). Tokenization via the `tokenizers` crate. Both reproduce the
  Python `transformers`/onnxruntime path exactly.

Parity (byte-exact against the Python reference):
- Tokenizer `input_ids`/`word_ids` reproduce HF on pre-tokenized input.
- ONNX scores match to ~1e-6 (far below the 0.5 keep threshold).
- Kept-word set + joined output match byte-for-byte.
- `KompressComparator` wired into the parity harness; 21 fixtures
  recorded via recorder.py (enable_ccr=False → deterministic output).
  `cargo run -p headroom-parity -- run --only kompress`: 21/21 matched.
  Model-gated: skips (not fails) when the model isn't in the HF cache.

Tests:
- crates/headroom-core/tests/kompress_parity.rs (byte-parity + passthrough)
- module unit tests (config defaults, result helpers)
- clippy clean, no regressions across the full parity suite.

CCR offload of dropped words is left to the dispatcher (the `<<ccr:>>`
convention), not this engine — the inline Python marker is intentionally
not reproduced.
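
The word-selection steps in the commit message above (whitespace split, 350-word chunks, max score per word, keep above 0.5, join) can be sketched without the model. This is an illustration under stated assumptions, not the code in kompress.rs: the score_chunk callback is a hypothetical stand-in for tokenization plus ONNX inference, word_ids uses None for special tokens as Hugging Face tokenizers do, and joining kept words with a single space is an assumption.

```rust
const CHUNK_WORDS: usize = 350; // words per chunk, from the commit message
const MIN_WORDS: usize = 10; // inputs below this pass through unchanged
const KEEP_THRESHOLD: f32 = 0.5; // a word is kept when its best score exceeds this

/// Collapse per-token scores into one score per word by taking the maximum
/// over each word's tokens. Tokens with `None` (special tokens) are skipped.
fn max_score_per_word(
    word_ids: &[Option<usize>],
    token_scores: &[f32],
    n_words: usize,
) -> Vec<f32> {
    let mut best = vec![f32::NEG_INFINITY; n_words];
    for (wid, &score) in word_ids.iter().zip(token_scores) {
        if let Some(w) = *wid {
            if w < n_words && score > best[w] {
                best[w] = score;
            }
        }
    }
    best
}

/// Run the selection pipeline. `score_chunk` is a hypothetical callback that
/// returns (word_ids, token_scores) for one chunk of pre-split words.
fn compress_words<F>(text: &str, mut score_chunk: F) -> String
where
    F: FnMut(&[&str]) -> (Vec<Option<usize>>, Vec<f32>),
{
    let words: Vec<&str> = text.split_whitespace().collect();
    if words.len() < MIN_WORDS {
        return text.to_string();
    }
    let mut kept: Vec<&str> = Vec::new();
    for chunk in words.chunks(CHUNK_WORDS) {
        let (word_ids, scores) = score_chunk(chunk);
        let best = max_score_per_word(&word_ids, &scores, chunk.len());
        for (word, score) in chunk.iter().zip(best.iter()) {
            if *score > KEEP_THRESHOLD {
                kept.push(*word);
            }
        }
    }
    // Assumption: kept words are joined with a single space.
    kept.join(" ")
}
```

Because the keep decision is a strict comparison against 0.5, score differences on the order of 1e-6 between runtimes only matter for words whose best score sits within that distance of the threshold.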
@github-actions

Contributor

PR governance

This PR follows the template and is marked ready for human review.

github-actions bot added the "status: ready for review" label (Pull request body is complete and the author marked it ready for human review) on Jun 19, 2026
Ruben Avanesov added 3 commits June 19, 2026 13:10
…ape ONNX

Three self-contained additions to the cache-only loader, all in kompress.rs:

- Cross-platform HF cache resolution: resolve the model via HF_HUB_CACHE /
  HF_HOME / HOME / USERPROFILE (was $HOME-only), so from_cache finds the
  model on native Windows, not just Linux.
- Load diagnostics: from_cache now logs WHY it defers (kompress_cache_miss /
  kompress_session_build_failed with the searched roots + the real session
  build error) instead of returning None silently — the #1 cause of an
  unexplained kompress_ready=false.
- Static-shape ONNX support: detect a fixed input_ids seq dimension on the
  loaded model and right-pad each chunk to it (masked padding => identical
  scores); dynamic models keep their natural length at zero padding cost.
  Enables execution providers that cannot compile dynamic shapes (OpenVINO
  NPU). Parity 21/21 in both static and dynamic modes.
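
Two of the loader additions above lend themselves to a short sketch. This is a hedged illustration rather than the code in kompress.rs: the function names are hypothetical, the directory layout follows the usual Hugging Face convention (HF_HUB_CACHE is the hub directory itself, HF_HOME contains a hub subdirectory, and the default lives under .cache/huggingface/hub in the home directory), and returning None for a chunk longer than the fixed dimension is an assumption.

```rust
use std::path::PathBuf;

/// Candidate hub-cache roots in priority order. `lookup` abstracts the
/// environment so the function can be exercised without touching real
/// variables.
fn hf_cache_roots<F>(lookup: F) -> Vec<PathBuf>
where
    F: Fn(&str) -> Option<String>,
{
    let mut roots = Vec::new();
    if let Some(dir) = lookup("HF_HUB_CACHE") {
        roots.push(PathBuf::from(dir));
    }
    if let Some(dir) = lookup("HF_HOME") {
        roots.push(PathBuf::from(dir).join("hub"));
    }
    // HOME covers Linux and macOS; USERPROFILE covers native Windows.
    for var in ["HOME", "USERPROFILE"] {
        if let Some(dir) = lookup(var) {
            roots.push(
                PathBuf::from(dir)
                    .join(".cache")
                    .join("huggingface")
                    .join("hub"),
            );
        }
    }
    roots
}

/// Right-pad a chunk to a fixed sequence length and build the matching
/// attention mask (1 for real tokens, 0 for padding).
/// Assumption: a chunk longer than `seq_len` is rejected with `None`.
fn pad_to_static(input_ids: &[i64], seq_len: usize, pad_id: i64) -> Option<(Vec<i64>, Vec<i64>)> {
    if input_ids.len() > seq_len {
        return None;
    }
    let mut ids = input_ids.to_vec();
    let mut mask = vec![1i64; input_ids.len()];
    ids.resize(seq_len, pad_id);
    mask.resize(seq_len, 0);
    Some((ids, mask))
}
```

In real use the lookup would be something like `|name| std::env::var(name).ok()`. The zeroed mask positions are what keep padded tokens from influencing the scores of real ones.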
…harness

Ports headroom.transforms.code_compressor (CodeAwareCompressor) to Rust on the
same branch as the Kompress port, so one PR delivers both new compressors
(SmartCrusher is already upstream).

Engine (crates/headroom-core/src/transforms/code_compressor.rs):
- tree-sitter AST parsing for python/js/ts/go/rust/java/c/cpp
- language detection (regex prefilter -> fewest-errors tree-sitter)
- symbol-importance scoring (min-max normalized, round-3 half-even) + body
  budget allocation, statement-level body truncation, omitted-line comments
  with call info, Python docstring handling (first_line/full/remove incl.
  multiline first-line reconstruction), syntax-validity guard (re-parse;
  return original on ERROR/MISSING).

Grammar-version parity is the precondition: the Rust tree-sitter-<lang> crates
are pinned to the exact versions of the Python wheels the fixtures were
recorded against (same version number on crates.io + PyPI = same grammar
source = identical ASTs). A canary over 9 samples x 8 languages confirmed
node-for-node identical node-type + line-span trees at these pins.

Parity: cargo run -p headroom-parity -- run --only code_aware_compressor
-> 30/30 matched (18 non-trivial across all 8 languages + unknown +
invalid-syntax fallback; all 3 docstring modes). py_round_int/py_round3
half-to-even verified against CPython. Integration test +
record_code_compressor_fixtures.py mirror the Kompress harness. Fixtures
recorded with enable_ccr=False + fallback_to_kompress=False for determinism;
live-zone dispatch wiring (SourceCode slot) is a deliberate follow-up.
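
The fewest-errors language detection and the syntax-validity guard described above can be sketched with the parser abstracted away. This is a minimal sketch, not the engine's code: count_errors and is_valid are hypothetical callbacks standing in for tree-sitter parsing, and breaking ties in favor of the earlier candidate is an assumption.

```rust
/// Pick the candidate language whose parse reports the fewest errors.
/// `count_errors(lang, source)` returns `None` when no grammar is available
/// for `lang`. Assumption: on a tie, the earlier candidate wins.
fn detect_language<'a, F>(candidates: &[&'a str], source: &str, count_errors: F) -> Option<&'a str>
where
    F: Fn(&str, &str) -> Option<usize>,
{
    let mut best: Option<(&'a str, usize)> = None;
    for &lang in candidates {
        if let Some(errors) = count_errors(lang, source) {
            match best {
                // Keep the current best unless this candidate is strictly better.
                Some((_, e)) if e <= errors => {}
                _ => best = Some((lang, errors)),
            }
        }
    }
    best.map(|(lang, _)| lang)
}

/// Syntax-validity guard: return the compressed text only if it re-parses
/// cleanly; otherwise fall back to the original, untouched.
fn guard_output<F>(original: &str, compressed: String, is_valid: F) -> String
where
    F: Fn(&str) -> bool,
{
    if is_valid(&compressed) {
        compressed
    } else {
        original.to_string()
    }
}
```

The guard makes compression fail-safe: a truncation bug can cost savings but cannot emit source that no longer parses.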

@JerrettDavis left a comment

Collaborator

This PR is not merge-ready in current GitHub state (mergeStateStatus=UNSTABLE). Please update from current main, resolve any conflicts if present, and rerun/clear required CI before this can be approved.
