feat: upgrade LRP merge method to Transformer-aware eX-LRP with AttnLRP propagation with Google's TurboQuant support #682
Tusm11 wants to merge 5 commits into
Thanks for the iteration on this, pointing at AttnLRP and

Blocker 1: the merge silently falls back to magnitude pruning

    # mergekit/merge_methods/lrp.py L97-114
    importance = None
    ref_str = str(ref)
    if self.lrp_scores is not None and ref_str in self.lrp_scores:
        lrp_path = self.lrp_scores[ref_str]
        if lrp_path not in _lrp_cache:
            _lrp_cache[lrp_path] = torch.load(lrp_path, map_location="cpu")
        importance = _lrp_cache[lrp_path].get(self.weight_info.name)
        if importance is not None:
            importance = importance.to(delta.device)
    # Fallback to magnitude-based importance
    if importance is None:
        importance = delta.abs()

But

    merge_method: lrp
    base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
    parameters: { density: 0.7 }
    models:
      - model: psmathur/orca_mini_v3_13b
      - model: garage-bAInd/Platypus2-13B

So as merged today, Fix: add a
Whichever you pick, please add an integration test that asserts the resulting mask differs from a magnitude-only baseline on a non-trivial weight; right now nothing in CI would catch the silent fallback.

Blocker 2:
| model | fp16 weights | with bnb 4-bit | activations (checkpointed) | fits on 24 GB? |
|---|---|---|---|---|
| TinyLlama 1.1B | 2.2 GB | 0.6 GB | ~200 MB | trivial |
| Llama-2 13B | 26 GB → OOM | 6.5 GB | ~800 MB | yes (with 4-bit) |
| Qwen3.5 27B | 54 GB → OOM | 13.5 GB | ~1.5 GB | yes (with 4-bit) |
Note: 4-bit works here because ∇_input propagates fine through a 4-bit Linear via bitsandbytes' dequantize-on-the-fly forward. For the per-weight R_w itself with quantized weights, switch to the equivalent epsilon-rule form R_w[j,i] = R_out[j] · x[i] / (out[j] + ε) — lxt does this internally when you register AttnLRP rules.
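To make the epsilon-rule form concrete, here is a minimal pure-Python sketch for a single bias-free linear layer. It uses the standard LRP-ε per-weight form, which also carries the weight factor `W[j][i]` (assuming the formula in the note abbreviates that factor). `eps_rule_weight_relevance`, the sign-aware stabilizer, and the toy numbers are illustrative assumptions, not lxt or mergekit code.

```python
def eps_rule_weight_relevance(W, x, R_out, eps=1e-6):
    """Per-weight LRP-epsilon relevance for out = W @ x (no bias).

    W: list of rows (out_features x in_features), x: input list,
    R_out: relevance arriving at each output unit.
    """
    out = [sum(w_ji * x_i for w_ji, x_i in zip(row, x)) for row in W]
    R_w = []
    for j, row in enumerate(W):
        # Assumption: stabilise the denominator away from zero, keeping its sign.
        denom = out[j] + (eps if out[j] >= 0 else -eps)
        R_w.append([w_ji * x_i * R_out[j] / denom for w_ji, x_i in zip(row, x)])
    return R_w


W = [[0.5, -1.0, 2.0],
     [1.5, 0.25, -0.5]]
x = [1.0, 2.0, 3.0]
R_out = [0.7, 0.3]

R_w = eps_rule_weight_relevance(W, x, R_out)
# Conservation check: each row of R_w should sum back to (almost) R_out[j].
row_sums = [sum(row) for row in R_w]
```

Only the forward activations `x` and `out` plus the incoming relevance are needed here, so a dequantized copy of the weights is enough and no gradient with respect to the quantized weight is required.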
Smaller cleanup items
- A 0-byte file literally named `git` was committed; remove it.
- `LRP_Merge.ipynb` is +5321 lines, mostly cell outputs. Strip outputs (`jupyter nbconvert --clear-output`) or move to a separate examples repo.
- `finetune_fakenews.py` (+392 lines) doesn't belong in a merge-method PR.
- `examples/lrp.yml` merges TinyLlama 1.1B (base) with two Llama-2 13B fine-tunes: incompatible architectures, will explode on shape mismatch even with `--allow-crimes`. Use two same-architecture fine-tunes of the same base.
Suggested acceptance criteria
Before this can land:
- `lrp_scores` reachable from YAML and validated end-to-end (or precompute baked into `make_task`).
- CI test that asserts the produced mask differs from `delta.abs()` magnitude top-k on at least one weight in a small toy model.
- `lrp_computer.py` either replaced with an `lxt`-based implementation or removed in favor of an external precompute step.
- Empirical comparison vs. plain magnitude DARE on at least one downstream benchmark (TinyLlama or 1B-class is fine for a sanity check) showing the LRP path doesn't regress.
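A minimal sketch of what the mask-difference check could look like, in pure Python on a toy flattened weight. `topk_mask` and the toy `delta` / `relevance` values are hypothetical stand-ins, not mergekit's actual helpers; a real test would run the merge task on a small model and compare tensors.

```python
def topk_mask(scores, density):
    """Keep the top `density` fraction of entries by score (1 = keep, 0 = drop)."""
    k = max(1, int(round(density * len(scores))))
    # Indices of the k largest scores; ties are broken by position.
    keep = set(sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]


def test_lrp_mask_differs_from_magnitude():
    delta = [0.9, -0.1, 0.4, -0.8, 0.05, 0.3]      # toy task vector
    relevance = [0.01, 0.7, 0.02, 0.03, 0.9, 0.6]  # toy LRP scores, not magnitude-ordered
    density = 0.5

    magnitude_mask = topk_mask([abs(d) for d in delta], density)
    lrp_mask = topk_mask(relevance, density)

    # If the relevance path silently fell back to delta.abs(), these would be equal.
    assert lrp_mask != magnitude_mask
    return magnitude_mask, lrp_mask


magnitude_mask, lrp_mask = test_lrp_mask_differs_from_magnitude()
```

The relevance values are deliberately chosen so their ranking disagrees with the magnitude ranking; with real scores the test should pick a weight where that is known to hold.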
Happy to review again once those are in.
|
Big improvement — thanks for the rapid turnaround. The structural blockers from my last review are addressed: Fixed in 0125b2e / 3e92daf / 4ac9c1d:
That's the bulk of the work. A handful of follow-ups remain; most are also flagged by Cursor Bugbot, so it's probably easiest to address them in one pass:

Remaining issues

1. The new CI test is dead-on-arrival (High)
Two reasonable fixes:
I'd go with (a): it answers the question "does providing relevance actually change anything compared to a real magnitude method?" rather than "does providing relevance change anything compared to providing different relevance?".

2. lxt failure should be loud, not a warning (Medium)

In

    try:
        if "llama" in model_type:
            from lxt.models.llama import attnlrp
        elif "qwen" in model_type:
            from lxt.models.qwen2 import attnlrp
        elif "mistral" in model_type:
            from lxt.models.mistral import attnlrp
        else:
            raise ValueError(
                f"AttnLRP not supported for model_type={model_type!r}. "
                f"Currently supported: llama, qwen2/2.5, mistral. "
                f"For other architectures, contribute an lxt rules module or use a different importance method."
            )
        attnlrp.register(self.model)
    except ImportError:
        raise ImportError("lxt is required for AttnLRP. Install with: pip install lxt") from None

Add a corresponding

Also flag this in the README: currently it says "AttnLRP" without listing supported architectures. Note that

3. 13B target won't fit on a 24 GB card without 4-bit loading (Medium)

The example YAML targets Llama-2-13B, but

    def load_model(self, load_in_4bit: bool = False) -> None:
        ...
        quant_config = None
        if load_in_4bit:
            from transformers import BitsAndBytesConfig
            quant_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_compute_dtype=torch.float16,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_use_double_quant=True,
            )
        self.model = AutoModelForCausalLM.from_pretrained(
            self.config.model_path,
            torch_dtype=torch_dtype,
            quantization_config=quant_config,
            device_map=device_map,
            low_cpu_mem_usage=True,
        )

And expose

4.
|
|
Quick correction on the must-fix triage from my previous comment, after rereading the diff: Promoting items 4 and 6 to must-fix-before-merge:
Reframing item 3 (4-bit loading): You're right that this isn't a hard blocker — anyone running a 13B+ AttnLRP precompute is realistically on a 48 GB+ rented pod (A40 / A6000 / L40S / single H100), where fp16 fits with headroom. So I'd downgrade this from must-fix to "strongly recommended". Still worth adding because:
So: please add

Updated must-fix list before merge:
Everything else from my prior comment can land as follow-up. |
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit a928e72.
|
Thanks for the detailed review, @mann1x. I’ve addressed the must-fix issues along with the points raised by Cursor Bugbot.

Summary of Fixes
|
|
Verified all 4 must-fix items — looks good to me:
Bonus fixes I noticed went in beyond what we asked:
One downstream consequence worth noting in the README

Switching to fp32 for relevance computation roughly doubles the memory budget vs the original fp16 path:

The example YAML targets Llama-2-13B, which is now an H100-class job rather than a 48 GB card job. Worth a one-line README note on hardware expectations so users on smaller pods don't hit the wall mid-precompute. (Optional follow-up, not blocking: a

My take

From my side: must-fix list is satisfied. Happy to defer to the maintainers on whether the LRP-vs-LRP test or the H100-class memory footprint should block landing, but those are choices about scope, not correctness gaps. Nice work on the rapid iteration.
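The fp32 memory doubling mentioned above follows directly from bytes per parameter. A rough back-of-the-envelope helper, counting weights only and ignoring activations, gradients, and framework overhead (`weight_memory_gb` and the `BYTES_PER_PARAM` table are hypothetical names used only for illustration):

```python
# Approximate storage cost per parameter; nf4 ignores quantization metadata.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "nf4": 0.5}


def weight_memory_gb(n_params, dtype):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


llama2_13b = 13e9
fp16_gb = weight_memory_gb(llama2_13b, "fp16")
fp32_gb = weight_memory_gb(llama2_13b, "fp32")
nf4_gb = weight_memory_gb(llama2_13b, "nf4")
fits_48gb_fp32 = fp32_gb < 48
```

Under these assumptions a 13B model needs about 26 GB of weights in fp16 and about 52 GB in fp32, so the fp32 path no longer fits on a 48 GB card even before activations are counted.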
|
Thanks for the thorough re-review and detailed validation @mann1x. I appreciate it. Glad to hear the must-fix items are satisfied. This is actually my first open-source contribution, and I’m still an undergrad, so the review feedback and mentorship throughout this process have been extremely valuable for me. I’ll add the README hardware note as a small follow-up for the fp32 memory expectations. @cg123 — I’d appreciate your review on the current state of the PR when you have a chance. Appreciate all the guidance throughout the review. |
|
You're welcome, thank you for the contribution! |
|
Thank you @mann1x. That means a lot to hear, especially coming from someone who actively uses MergeKit for real workflows. I’m glad you found the contribution valuable, and I really appreciate all the time you took to review and help improve it throughout the process. |
|
Hey @mann1x, I've just pushed a major update to the PR! I've supercharged the LRP method with my "Turbo" optimizations and fixed all the multimodal issues you mentioned. Here’s what I did:

- Multimodal Fix: It now handles models like Qwen-3.5-VL perfectly. I added logic to handle optional tensors and properly dispatch the language model branch. Everything is tested and working end-to-end now.

I hope you will check it out. |
|
Big thanks to @Tusm11 for the supercharged ex-LRP turbo branch. Re-ran my 4B Qwen3.5 merge study with HEAD

Setup
Results (Q6_K, single 3090)
Δ M4-v2 vs M4-orig: HE +4.27 pp, MBPP +2.80 pp. M4-v2 takes the MBPP medal of the whole study while staying competitive on HumanEval. The turbo branch + rebalanced hyperparams clearly beat the original PR head on this configuration.

M4-v2 weights

Published at: https://huggingface.co/ManniX-ITA/Qwen3.5-4B-M4-v2-ex-LRP-turbo (Other variants in the study: M1, M2, M3, M4-orig, M5.)

Patches needed against multimodal
|
|
Update: Final Multimodal & Architectural Refinements

Hello @mann1x,

Key Fixes Included:

- Layer-Aware Weight Detection: Fixed architecture inference for hybrid models (like Qwen 3.5) by updating auto.py to check all layers for optionality. This prevents crashes on alternating attention layers. |
Field-test feedback from a Qwen3.5-4B merge study

Hi @Tusm11, running this PR end-to-end on a 3-model Qwen3.5-4B (base + 2 fine-tunes) merge today, ran into two issues. Reproducible on

Issue 1:
|
Follow-up: planner fix for hybrid architectures (Qwen3.5-4B and similar)

Continuing from the earlier comment: tracked down the underlying reason mergekit-yaml + LRP fails on Qwen3.5-4B regardless of which commit on this branch I tried (

The bug

In

    optional = (full_name.replace("${layer_index}", "0") not in in_all_models) or ...

Qwen3.5-4B uses hybrid attention: most layers are (Layer 7 only has

The same bug also bites the MTP-layers regression I mentioned in the previous comment, except that one became visible only after

Fix

A template should be optional if any of its layer instantiations are missing from

    -    def _wi(template: str, prefix: str) -> WeightInfo:
    +    def _wi(template: str, prefix: str, num_layers: int = 1) -> WeightInfo:
             full_name = prefix + template
    -        optional = (full_name.replace("${layer_index}", "0") not in in_all_models) or (
    +        # A template is optional if ANY of its layer instantiations are missing
    +        # from in_all_models. Required for hybrid architectures like Qwen3.5
    +        # (alternating self_attn / linear_attn per layer) where layer 0 may have
    +        # both kinds of weights but later layers may have only one. Without this,
    +        # layer 0 satisfies the lookup, but the planner emits required LoadTensor
    +        # for `linear_attn.norm.weight` at every layer index 0..N-1 — and a
    +        # self_attn-only layer raises at execute time.
    +        if "${layer_index}" in full_name:
    +            layer_optional = any(
    +                full_name.replace("${layer_index}", str(i)) not in in_all_models
    +                for i in range(num_layers)
    +            )
    +        else:
    +            layer_optional = full_name not in in_all_models
    +        optional = layer_optional or (
                 tied_keys is not None
                 and any(re.search(pat, full_name) for pat in tied_keys)
             )
    @@ -180,9 +194,9 @@ def infer_architecture_info(
                 definition=JsonModuleArchDef(
                     model_type="",
                     architectures=[],
    -                pre_weights=[_wi(t, "") for t in module_loose_weights[prefix]],
    +                pre_weights=[_wi(t, "", num_layers) for t in module_loose_weights[prefix]],
                     layer_templates=JsonLayerTemplates(
    -                    weights=[_wi(t, "") for t in module_templates[prefix]]
    +                    weights=[_wi(t, "", num_layers) for t in module_templates[prefix]]
                     ),

Verification

Before the patch, on

After the patch:

mergekit-yaml then runs to completion on this 3-model set with the LRP merge method.

Combined with the previous comment

This patch + the safetensors loader patch from the previous comment together unblock the full LRP path on Qwen3.5-4B. (The MTP strict-presence regression in

Happy to open a PR against your branch with both patches + a regression test if useful. Otherwise feel free to integrate directly.
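The layer-aware optionality rule from the planner patch can be sketched in isolation. This is a standalone illustration, not the mergekit code: `is_template_optional` is a hypothetical helper, and the tensor names in `present` describe a made-up two-layer hybrid model.

```python
def is_template_optional(template, present_names, num_layers):
    """True if any per-layer instantiation of `template` is missing."""
    placeholder = "${layer_index}"
    if placeholder not in template:
        return template not in present_names
    return any(
        template.replace(placeholder, str(i)) not in present_names
        for i in range(num_layers)
    )


# Toy hybrid model: layer 0 uses linear attention, layer 1 uses self-attention.
present = {
    "model.layers.0.linear_attn.norm.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.1.self_attn.q_proj.weight",
    "model.layers.1.mlp.up_proj.weight",
}

linear_attn_optional = is_template_optional(
    "model.layers.${layer_index}.linear_attn.norm.weight", present, num_layers=2
)
mlp_optional = is_template_optional(
    "model.layers.${layer_index}.mlp.up_proj.weight", present, num_layers=2
)
# What a layer-0-only lookup would conclude for the linear_attn template.
layer0_only_optional = "model.layers.0.linear_attn.norm.weight" not in present
```

In this toy setup the layer-0-only lookup treats the `linear_attn` template as required, while the all-layers check marks it optional because layer 1 lacks it; the MLP template stays required under both.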
|
Hello @mann1x I've just pushed the set of refinements to unblock the full LRP path for hybrid-architecture models. What’s New: Hybrid-Architecture Planner Fix: Updated auto.py to be truly layer-aware. The planner now checks every layer instantiation (0..N-1) rather than just layer 0. This correctly identifies weights as optional in models like Qwen 3.5-4B (which alternates between self_attn and linear_attn), preventing the "Tensor required but not present" crash. |
|
Hi @cg123, just following up on PR #682. The review feedback from @mann1x has been addressed, and the must-fix items were verified in the review thread. I know everyone is busy, but I wanted to check whether there are any remaining concerns, requested changes, or next steps from the maintainer side. Thanks again for your time and consideration. |
|
Hey @Tusm11 https://github.com/arcee-ai/mergekit?tab=contributing-ov-file#contributor-license-agreement-cla Just post in a comment: I have read the CLA Document and I hereby sign the CLA Now I signed it too :) |
|
I have read the CLA Document and I hereby sign the CLA. |

This PR upgrades the existing LRP merge method into a more Transformer-aware version, which I’m calling eX-LRP.
The previous implementation already used relevance scores, but the main limitation was that it did not properly account for how relevance should propagate through Transformer architectures. In practice, that meant relevance attribution was not faithfully modeling residual connections, attention blocks, normalization layers, or the actual computation graph used in modern LLMs.
With this update, I reworked the implementation to use an AttnLRP-style propagation approach so relevance is computed in a way that better reflects how Transformer-based models make predictions.
Concretely, this update adds Transformer-aware relevance propagation through self-attention, MLP blocks, and normalization layers, introduces proportional relevance splitting across residual connections, and includes epsilon-based numerical stabilization for safer backward propagation in deep networks.
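As a rough illustration of proportional splitting with epsilon stabilization, here is a generic sketch of the standard LRP rule for an addition `y = a + b`, such as a skip path plus a block output. It is not the eX-LRP code itself; `split_residual_relevance` is a hypothetical name and the scalar inputs are toy values.

```python
def split_residual_relevance(a, b, R_y, eps=1e-6):
    """Split relevance R_y of y = a + b between the two branches.

    Each branch receives a share proportional to its contribution to y.
    """
    y = a + b
    # Push the denominator away from zero while keeping its sign.
    denom = y + (eps if y >= 0 else -eps)
    return a * R_y / denom, b * R_y / denom


# Skip path contributes 3.0, block output contributes 1.0,
# and relevance 1.0 arrives at their sum.
R_skip, R_block = split_residual_relevance(3.0, 1.0, 1.0)
```

The two shares add back up to the incoming relevance up to the small epsilon term, which is the conservation property the residual split is meant to preserve.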
The relevance computation is now integrated directly into the merge pipeline and runs through a single backward pass seeded from the model’s predicted token logits. This allows weight importance to be determined based on prediction-faithful relevance attribution rather than simplified propagation assumptions.
I also updated the example LRP configuration and expanded the README with documentation covering the new algorithm, usage instructions, and implementation details.
This is a breaking behavioral change in the sense that merge_method: lrp will now use the new Transformer-aware eX-LRP implementation automatically, though no config changes are required.
Tested on TinyLlama, LLaMA 2, and Mistral-based checkpoints with positive results in preserving fine-tuned behavior while maintaining base model capabilities.
Primary review areas:
- Correctness of Transformer relevance propagation
- Residual split logic
- Numerical stability safeguards
- Regression/merge quality compared to prior LRP implementation
I would also like to thank @mann1x for pointing me to this GitHub repo: https://github.com/rachtibat/LRP-eXplains-Transformers
eX-LRP: https://github.com/Tusm11/ex-LRP.git
Note
High Risk
Replaces how all LRP importance is computed and applied, affecting which weights survive sparsification across the full model stack; incorrect residual/attention propagation or numerics could silently degrade merge quality versus the prior LRP implementation.
Overview
`merge_method: lrp` now uses eX-LRP with AttnLRP-style relevance propagation instead of simpler per-weight scoring, so importance reflects Transformer blocks (self-attention, MLP, norms) and residual paths with epsilon stabilization. Relevance is computed in one backward pass seeded from predicted token logits, then drives the existing merge flow (task vectors masked by `lrp_scores` and `density`, with full density on norms/embed/head/bias).

The scoring pipeline and `mergekit` LRP merge task gain multimodal / optional-tensor handling, in-place merge math, periodic GC (“Iron-Man”) for large runs, and TurboQuant-related optimizations per the PR. Example LRP YAML and README are expanded for the new algorithm; configs still use `lrp` / `lrp_scores` paths, so no rename is required, but merge behavior changes when precomputed scores come from the new propagator.

Reviewed by Cursor Bugbot for commit 48efcab.