LLM Compression Layer

Universal semantic compression layer for LLM inputs. Compresses memories, code, and context before API calls while preserving reasoning equivalence across Claude, GPT, and Gemini.

Quick Start

python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pytest tests/ -v

Environment Variables

Create a .env file with:

ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
GOOGLE_API_KEY=...
HF_TOKEN=hf_...
TINKER_API_KEY=tk_...

Ground-Truth Run Snapshot (from local artifacts)

These metrics are taken directly from run/eval artifacts in this repo (not from memory):

models/runs/mlx/latest/run.json
models/runs/mlx/latest/train.log
models/adapters/tinker/tinker_run.json
models/eval/tinker_eval_tightened_full.jsonl
models/eval/tinker_equiv_tightened_full_llm_judge.jsonl

Local 4B run (MLX)

From models/runs/mlx/latest/run.json and models/runs/mlx/latest/train.log:

Metric	Value
Model	`mlx-community/Qwen3-4B-Instruct-2507-8bit`
LoRA	rank `8`, alpha `16`, layers `16`
Train config	`500` iters, batch size `4`, lr `1e-4`
Final val loss (iter 500)	`0.510`

Cloud 8B run (Tinker)

From models/adapters/tinker/tinker_run.json:

Metric	Value
Run ID	`161b1f39-3e50-53c0-9d75-f5ce804db7eb:train:0`
Model	`Qwen/Qwen3-8B`
Train examples	`19,845`
LoRA	rank `16`, alpha `32`, dropout `0.05`
Train config	epochs `2`, batch size `4`, lr `2e-4`
Status	completed, early stopped at step `4962` (planned total `9924`)
Best val loss	`0.2657`
Final loss	`0.2845`
Latest checkpoint	`tinker://.../weights/final`

8B evaluation results

Source files:

Compression: models/eval/tinker_eval_tightened_full.jsonl
Equivalence: models/eval/tinker_equiv_tightened_full_llm_judge.jsonl

Compression Results (8B, full test set)

Run	Split Size	Evaluated	Avg Compression Ratio	Avg Token Reduction	Median Compression Ratio
Tinker Qwen3-8B tightened eval	2,497	2,497	0.5207	47.93%	0.5847

Equivalence Results (8B, current progress)

Run	Threshold	Judge	Models	Evaluated	Coverage	Pass Rate	Avg Min-Equiv	Median Min-Equiv	Min Min-Equiv
Tinker Qwen3-8B tightened equiv	0.80	On (`--use-llm-judge`)	Claude Sonnet + GPT-4o-mini + Gemini 2.0 Flash	1,753	70.20% (1,753/2,497)	21.11%	0.6791	0.7102	0.1538

Equivalence Model Breakdown

Model	Combined Avg	Combined Min	Judge-Only Avg	Judge-Only Min
`claude-sonnet-4-20250514`	0.7868	0.1842	0.7784	0.2500
`gpt-4o-mini`	0.7911	0.1538	0.7777	0.2500
`gemini-2.0-flash`	0.7092	0.1682	0.6968	0.2562

Local Tinker-Style 4B Training (MLX backend)

This uses the same train_tinker.py workflow but runs locally on MLX.

source .venv/bin/activate

python scripts/train_tinker.py \
  --backend local \
  --model Qwen/Qwen3-4B-Instruct-2507 \
  --local-model mlx-community/Qwen3-4B-Instruct-2507-8bit \
  --hf-dataset Sudhendra/semantic-compression-sft \
  --output models/adapters/tinker_local_4b

Evaluate that local run with the same eval workflow:

python scripts/evaluate_tinker.py \
  --backend local \
  --model mlx-community/Qwen3-4B-Instruct-2507-8bit \
  --adapter-path models/adapters/tinker_local_4b \
  --hf-dataset Sudhendra/semantic-compression-sft \
  --output models/eval/tinker4b_local_eval.jsonl

Data Bootstrapping: Generate -> Clean -> Format -> Sanitize

1) Generate synthetic pairs with local adapter

python scripts/generate_synthetic.py \
  --input data/raw/code.jsonl \
  --domain code \
  --adapter models/runs/mlx/latest/adapter \
  --output data/synthetic/code_v2.jsonl \
  --resume

python scripts/generate_synthetic.py \
  --input data/raw/nl_docs.jsonl \
  --domain nl \
  --adapter models/runs/mlx/latest/adapter \
  --output data/synthetic/nl_v2.jsonl \
  --resume

2) Clean and ratio-filter synthetic outputs

python scripts/preprocess_synthetic.py \
  --input data/synthetic/code_v2.jsonl \
  --output data/validated/code_pairs.jsonl \
  --rejected data/validated/rejected_code_pairs.jsonl \
  --max-char-ratio 1.0 \
  --max-token-ratio 1.0

python scripts/preprocess_synthetic.py \
  --input data/synthetic/nl_v2.jsonl \
  --output data/validated/nl_pairs.jsonl \
  --rejected data/validated/rejected_nl_pairs.jsonl \
  --max-char-ratio 0.95 \
  --max-token-ratio 0.95

3) Format validated pairs into train/valid/test

python scripts/format_training_data.py \
  --input data/validated \
  --output data/training \
  --train-ratio 0.8 \
  --valid-ratio 0.1 \
  --test-ratio 0.1 \
  --seed 42

4) Sanitize each split

for split in train valid test; do
  python scripts/data_sanitization.py \
    --input "data/training/${split}.jsonl" \
    --sanitized "data/training/sanitized_${split}.jsonl" \
    --unsanitized "data/training/unsanitized_${split}.jsonl"
done

for split in train valid test; do
  mv "data/training/sanitized_${split}.jsonl" "data/training/${split}.jsonl"
done

Current data counts in workspace

File	Rows
`data/raw/code.jsonl`	1,571
`data/raw/nl_docs.jsonl`	1,234
`data/synthetic/code_v2.jsonl`	17,315
`data/synthetic/nl_v2.jsonl`	17,056
`data/validated/code_pairs.jsonl`	17,125
`data/validated/nl_pairs.jsonl`	10,505
`data/training/train.jsonl`	19,845
`data/training/valid.jsonl`	2,473
`data/training/test.jsonl`	2,497

Reproduce the 8B Evaluation Pipeline

1) Run generation eval against Tinker checkpoint

evaluate_tinker.py will auto-read checkpoint from models/adapters/tinker/tinker_run.json if --checkpoint-path is omitted.

python scripts/evaluate_tinker.py \
  --backend tinker \
  --hf-dataset Sudhendra/semantic-compression-sft \
  --output models/eval/tinker_eval_tightened_full.jsonl \
  --show-examples 0 \
  --resume

2) Convert generation output to validation pairs

python - <<'PY'
import json
from pathlib import Path
from src.inference.domain_classifier import DomainClassifier

src = Path("models/eval/tinker_eval_tightened_full.jsonl")
dst = Path("models/eval/tinker_eval_tightened_full_pairs.jsonl")
clf = DomainClassifier()

with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
    for line in fin:
        if not line.strip():
            continue
        row = json.loads(line)
        verbose = row["input_text"]
        compressed = row["generated_output"]
        domain = clf.classify(verbose).value
        fout.write(json.dumps({"verbose": verbose, "compressed": compressed, "domain": domain}) + "\n")

print(dst)
PY

3) Run equivalence eval (with LLM judge)

python scripts/validate_batch.py \
  --input models/eval/tinker_eval_tightened_full_pairs.jsonl \
  --output models/eval/tinker_equiv_tightened_full_llm_judge.jsonl \
  --models claude gpt gemini \
  --threshold 0.80 \
  --use-llm-judge \
  --save-all \
  --concurrency 2 \
  --resume

Project Structure

compression-layer/
├── src/
│   ├── validation/     # Cross-model equivalence testing
│   ├── generation/     # Compression pair generation
│   ├── training/       # Tinker + MLX training
│   ├── inference/      # Compression inference paths
│   └── utils/          # Tokenizers, caching, cost tracking
├── data/               # Corpora and generated datasets (gitignored)
├── models/             # Checkpoints and eval artifacts (gitignored)
├── configs/            # YAML configs
├── scripts/            # Entry points
└── tests/

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 123 Commits
.github/workflows		.github/workflows
configs		configs
data/training		data/training
docs		docs
scripts		scripts
src		src
tests		tests
visualization		visualization
.env.example		.env.example
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
autoformat.sh		autoformat.sh
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLM Compression Layer

Quick Start

Environment Variables

Ground-Truth Run Snapshot (from local artifacts)

Local 4B run (MLX)

Cloud 8B run (Tinker)

8B evaluation results

Compression Results (8B, full test set)

Equivalence Results (8B, current progress)

Equivalence Model Breakdown

Local Tinker-Style 4B Training (MLX backend)

Data Bootstrapping: Generate -> Clean -> Format -> Sanitize

1) Generate synthetic pairs with local adapter

2) Clean and ratio-filter synthetic outputs

3) Format validated pairs into train/valid/test

4) Sanitize each split

Current data counts in workspace

Reproduce the 8B Evaluation Pipeline

1) Run generation eval against Tinker checkpoint

2) Convert generation output to validation pairs

3) Run equivalence eval (with LLM judge)

Project Structure

License

About

Uh oh!

Releases

Packages

Contributors 3

Languages

License

Sudhendra/compression-layer

Folders and files

Latest commit

History

Repository files navigation

LLM Compression Layer

Quick Start

Environment Variables

Ground-Truth Run Snapshot (from local artifacts)

Local 4B run (MLX)

Cloud 8B run (Tinker)

8B evaluation results

Compression Results (8B, full test set)

Equivalence Results (8B, current progress)

Equivalence Model Breakdown

Local Tinker-Style 4B Training (MLX backend)

Data Bootstrapping: Generate -> Clean -> Format -> Sanitize

1) Generate synthetic pairs with local adapter

2) Clean and ratio-filter synthetic outputs

3) Format validated pairs into train/valid/test

4) Sanitize each split

Current data counts in workspace

Reproduce the 8B Evaluation Pipeline

1) Run generation eval against Tinker checkpoint

2) Convert generation output to validation pairs

3) Run equivalence eval (with LLM judge)

Project Structure

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages