fix: multimodal-LM (Qwen3.5_5) support for ex-LRP by mann1x · Pull Request #1 · Tusm11/mergekit

mann1x · 2026-04-30T17:36:36Z

Summary

End-to-end fixes that make this PR's ex-LRP method work against multimodal models like Qwen3_5ForConditionalGeneration. Five minimal patches across mergekit/architecture/{base,auto}.py, mergekit/merge_methods/lrp.py, mergekit/config.py, and lrp_computer.py. Each patch includes a one-paragraph rationale in the commit body.

Without these patches, ex-LRP cannot be invoked on any multimodal LM (vision tower + language model + MTP heads); it fails at first model-load (pydantic), or at graph-execution (missing tensors in hybrid-attention layers), or at save (tied tensors), or in the LRP score computation (model_type not recognized).

What's tested

Validated end-to-end on Qwen3.5-4B with two source fine-tunes merged into base via mergekit-yaml driving the lrp method (this PR's recipe), with LRP relevance scores precomputed via lrp_computer.py (also patched). All scoring was done via lm_eval local-chat-completions/local-completions against llama-server running the Q6_K quantization, with temperature 0 and max_gen_toks=2048.

Variant	Recipe	Merger	Importance	HumanEval	MBPP
M4 — ex-LRP (THIS PR)	mergekit-yaml	mergekit (PR arcee-ai#682)	LRP	51.22%	49.40%
M5 — OMv2 + LRP signal	OMv2 (DARE-TIES + OBIM-lite + DAREx-q + EMR election)	dare_ties_merge.py	LRP (same scores)	53.05%	51.40%
M3 — OMv2 + Fisher	OMv2	dare_ties_merge.py	Fisher	57.93%	48.80%
M2 — OMv2 (no signal)	OMv2	dare_ties_merge.py	none	52.44%	49.40%
M1 — DARE-TIES baseline	DARE-TIES	dare_ties_merge.py	none	51.22%	47.00%

The M5 row is the apples-to-apples LRP comparison vs M4 — same LRP scores, different merger. M4 (ex-LRP via this PR) and M5 (OMv2 + LRP via a custom merger) bracket the merger effect; they share an identical importance signal.

Reproducer artefacts (HF)

All weights, GGUFs, READMEs, importance signals (Fisher .safetensors, LRP .safetensors), and the merge YAMLs / configs used in the test runs are published:

Qwen3.5-4B-M4-ex-LRP — this PR's recipe applied to the test sources. BF16 + Q6_K + lrp/ subdir with the multimodal-prefixed LRP scores. The lrp/lrp_config.yaml is the exact mergekit-yaml input.
Qwen3.5-4B-M5-OMv2-LRP — apples-to-apples comparison (same LRP signal, OMv2 recipe).
Qwen3.5-4B-M3-Fisher — Fisher signal twin (with the actual Fisher safetensors under fisher/).
Qwen3.5-4B-M2-OMv2, Qwen3.5-4B-M1-Dare-Ties — recipe-only baselines.

Each repo's README has the full 5-way comparison table for cross-reference.

Companion lxt patch

The lrp_computer.py change in this PR depends on lxt (LRP-eXplains-Transformers) being importable in environments without vit_torch. A separate one-line PR has been opened at rachtibat/LRP-eXplains-Transformers#42 wrapping that import in try/except.
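
The companion change comes down to guarding an optional dependency at import time. Below is a minimal sketch of that pattern, not the actual lxt patch: `optional_import` is a hypothetical helper, and the package name in the second call is a deliberately nonexistent placeholder standing in for a missing vision dependency.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the imported module, or None if it is not installed.

    Hypothetical helper for illustration; the real patch wraps one import
    statement in try/except.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        # Text-only environments carry on without the vision dependency.
        return None


# A stdlib module is always importable.
json_mod = optional_import("json")
# Placeholder name standing in for an optional package that is absent.
missing_mod = optional_import("some_vision_package_that_is_not_installed")

# Downstream code can branch on this flag instead of crashing at import.
HAS_VISION = missing_mod is not None
```

Code that needs the optional package can check the flag and raise a clear error only when the vision path is actually used.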

Test plan

mergekit-yaml runs to completion against Qwen3.5-4B (multimodal: visual + language_model + mtp.layers)
BF16 shards saved without "tensor share storage" errors (tied lm_head/embed_tokens cloned)
LRP scores actually applied (tensor-name prefix matches via the language_model branch dispatch)
Visual tower + MTP heads remain base-passthrough (no LRP signal attempted there)
Hybrid-attention layers don't error on missing optional tensors (e.g. dt_bias only in some layers)
Output Q6_K quantizes cleanly via llama.cpp convert_hf_to_gguf.py + llama-quantize
HumanEval pass@1 + MBPP pass@1 land in the expected range vs same-architecture baselines
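
The score-application and passthrough items in the list above can be sketched as follows. This is an illustration under stated assumptions, not the code in mergekit/merge_methods/lrp.py: tensors are plain lists, the `model.language_model.` prefix, the tensor names, and the threshold mask are all hypothetical choices made for the example.

```python
from typing import Dict, List, Optional

Tensor = List[float]  # stand-in for a real tensor type

# Assumed prefix of the language-model branch in a multimodal checkpoint.
LM_PREFIX = "model.language_model."


def lookup_scores(name: str, scores: Dict[str, Tensor]) -> Optional[Tensor]:
    """Find scores for a tensor, trying the name with and without the prefix."""
    if name in scores:
        return scores[name]
    if name.startswith(LM_PREFIX):
        return scores.get("model." + name[len(LM_PREFIX):])
    return None


def merge_tensor(
    name: str,
    base: Tensor,
    tuned: Tensor,
    scores: Dict[str, Tensor],
    threshold: float = 0.5,  # hypothetical masking rule
) -> Tensor:
    """Take the tuned value where the score passes the threshold, else base."""
    s = lookup_scores(name, scores)
    if s is None:
        # No relevance signal (e.g. vision tower, MTP heads): base passthrough.
        return list(base)
    return [t if r >= threshold else b for b, t, r in zip(base, tuned, s)]


scores = {"model.layers.0.mlp.up_proj.weight": [0.9, 0.1]}
merged_lm = merge_tensor(
    "model.language_model.layers.0.mlp.up_proj.weight",
    [1.0, 1.0], [2.0, 2.0], scores,
)
merged_vis = merge_tensor(
    "model.visual.blocks.0.attn.qkv.weight",
    [1.0, 1.0], [2.0, 2.0], scores,
)
```

Returning base weights instead of raising is what lets a single recipe cover a checkpoint where only one branch has scores.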

Note

Patches kept as small and orthogonal as possible; happy to split into separate commits if preferred. The lrp_computer.py change is the largest by line count because two related issues collapse into one file (model_type dispatch + tied-tensor save).


…process interpreter
- Fix CUDA device fallback: Normalize device selection in LRPConfig.__post_init__ to gracefully fall back to CPU when CUDA is unavailable
- Fix Qwen architecture dispatch: Use exact model_type matching instead of substring to avoid false positives (qwen3, qwen2_moe)
- Fix subprocess interpreter: Use sys.executable instead of bare 'python' to ensure correct venv/conda environment
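
To see why exact matching matters for the architecture dispatch, here is a small sketch. The supported set and both function names are hypothetical; only the false-positive examples (qwen3, qwen2_moe) come from the commit message.

```python
# Illustrative set of model_type values a dispatcher might support.
SUPPORTED_QWEN_TYPES = frozenset({"qwen2"})


def is_supported_substring(model_type: str) -> bool:
    """Old-style check: any model_type containing 'qwen' is accepted."""
    return "qwen" in model_type


def is_supported_exact(model_type: str) -> bool:
    """Exact check: only explicitly listed model_type values are accepted."""
    return model_type in SUPPORTED_QWEN_TYPES


candidates = ["qwen2", "qwen3", "qwen2_moe", "llama"]
substring_hits = [m for m in candidates if is_supported_substring(m)]
exact_hits = [m for m in candidates if is_supported_exact(m)]
```

With the substring check, architectures whose layer layout differs would be routed to the wrong handler; the exact check rejects them up front.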

…ation, float16 underflow
- Fix weight normalization: Remove total_weight division to respect user-supplied weights and match documented merge formula (Masked task vector × weight)
- Fix cached LRP scores: Return deep copy from _load_lrp_scores to prevent cross-task tensor mutation and aliasing
- Fix float16 underflow: Always use float32 for model loading to prevent gradient underflow in backward pass (float16 min normal ~6e-5 causes small gradients to zero)
- Add warning message about float32 requirement for accurate LRP computation
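
The cached-scores fix follows a common pattern: cache the expensive load, but hand each caller a private copy. A minimal sketch, assuming a dict of lists in place of real tensors; `_read_scores` and `load_scores` are hypothetical names, not the `_load_lrp_scores` implementation.

```python
import copy
from functools import lru_cache
from typing import Dict, List


@lru_cache(maxsize=8)
def _read_scores(path: str) -> Dict[str, List[float]]:
    """Stand-in for an expensive file read; returns fixed data here."""
    return {"w": [0.1, 0.9]}


def load_scores(path: str) -> Dict[str, List[float]]:
    """Return a deep copy so callers cannot mutate the cached object."""
    return copy.deepcopy(_read_scores(path))


first = load_scores("scores.safetensors")  # hypothetical path
first["w"][0] = 99.0                       # in-place edit by one task
second = load_scores("scores.safetensors")  # a later task reads the same cache
```

Without the copy, the in-place edit would leak into every later caller that shares the cached entry.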

The total_weight computation and zero-check were dead code after removing weight normalization. This misleading validation could confuse future maintainers into 'fixing' the apparent bug by re-introducing normalization, silently changing merge semantics.

Removed:
- total_weight = sum(self.model_weights.values())
- Zero-check validation

The merge now clearly uses un-normalized per-model weights as documented.
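
A small arithmetic sketch of what "un-normalized per-model weights" means in practice. It assumes the masked, weighted task vectors are summed onto the base weights; the function name, the list-based tensors, and the example values are all hypothetical.

```python
from typing import Dict, List


def merge_unnormalized(
    base: List[float],
    task_vectors: Dict[str, List[float]],
    masks: Dict[str, List[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Add weight * mask * task_vector per model, with no division by sum(weights)."""
    merged = list(base)
    for model, tv in task_vectors.items():
        w = weights[model]
        m = masks[model]
        for i in range(len(merged)):
            merged[i] += w * m[i] * tv[i]
    return merged


base = [1.0, 1.0]
task_vectors = {"a": [2.0, 2.0], "b": [4.0, 4.0]}  # tuned minus base
masks = {"a": [1.0, 0.0], "b": [1.0, 1.0]}          # 1.0 keeps, 0.0 drops
weights = {"a": 0.5, "b": 0.25}                     # used exactly as supplied

merged = merge_unnormalized(base, task_vectors, masks, weights)
```

Here the weights sum to 0.75, so dividing by their total would scale every contribution by 4/3 and produce a different result from the one the user asked for.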

…script location
- Fix silent warnings: validate() now raises FileNotFoundError immediately if model paths don't exist, preventing confusing subprocess errors later
- Fix script path resolution: compute_lrp_scores() now resolves lrp_computer.py relative to the pipeline script's directory using __file__, allowing the pipeline to be run from any working directory
- Add clear error messages with actionable guidance for both issues
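
Both fixes, together with the sys.executable change from the earlier commit, can be sketched with the standard library alone. The function names are hypothetical and the anchor path is passed in explicitly so the example runs anywhere; in a real pipeline script the anchor would be `__file__`.

```python
import sys
import tempfile
from pathlib import Path
from typing import Iterable, List


def validate_model_paths(paths: Iterable[str]) -> None:
    """Fail fast with an actionable message if a local model path is missing."""
    for p in paths:
        if not Path(p).exists():
            raise FileNotFoundError(
                f"Model path does not exist: {p!r}. Check the path before running."
            )


def sibling_script(anchor_file: str, script_name: str) -> Path:
    """Resolve script_name next to anchor_file, independent of the working dir."""
    return Path(anchor_file).resolve().parent / script_name


def build_command(anchor_file: str, script_name: str, args: List[str]) -> List[str]:
    # sys.executable keeps the subprocess in the current venv/conda environment.
    return [sys.executable, str(sibling_script(anchor_file, script_name)), *args]


with tempfile.TemporaryDirectory() as tmp:
    anchor = str(Path(tmp) / "pipeline.py")  # hypothetical pipeline location
    cmd = build_command(anchor, "lrp_computer.py", [])
    expected_script = Path(tmp).resolve() / "lrp_computer.py"
    validate_model_paths([tmp])  # an existing directory passes silently
    try:
        validate_model_paths([str(Path(tmp) / "missing-model")])
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True
```

Raising before the subprocess starts means the user sees the missing path directly, not a traceback from inside the child process.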

End-to-end fixes that make the ex-LRP method in this PR work against multimodal models like Qwen3_5ForConditionalGeneration. Validated on Qwen3.5-4B (jackrong-v2 + crow-4b → base) producing a clean Q6_K with HumanEval pass@1 = 51.22% and MBPP pass@1 = 49.40%.

Five minimal patches:

* mergekit/architecture/base.py
  Pydantic v2 forward-references in ConfiguredModuleArchitecture and ConfiguredModelArchitecture aren't resolved eagerly; add model_rebuild() at module load. Without this, the first model load fails with PydanticUserError: not fully defined.

* mergekit/architecture/auto.py
  Make 'optional' layer-aware. Hybrid-attention archs (Qwen3.5_5 alternates full / linear attention) have tensors like dt_bias only in some layers. The original _wi flagged optional based on a single layer's presence; widen to true if missing in *any* layer.

* mergekit/merge_methods/lrp.py
  Replace strict raise with base-passthrough when LRP scores are missing for a tensor. Multimodal LRP is computed only on the language_model branch; vision tower / MTP heads have no relevance signal and should retain base weights instead of failing the merge.

* mergekit/config.py
  Allow `str` in the ParameterSetting union so per-source params like `lrp_scores: "/path/to/scores.safetensors"` parse cleanly via mergekit-yaml.

* lrp_computer.py
  - Recognize qwen3_5_text inner-LM model_type (the inner LM of Qwen3_5ForConditionalGeneration) and dispatch AttnLRP against it.
  - Clone tied tensors (lm_head.weight ↔ model.embed_tokens.weight) by data_ptr() before save_file, since safetensors save rejects shared-storage tensors.

Reproducer + LRP signal artefacts published at https://huggingface.co/ManniX-ITA/Qwen3.5-4B-M4-ex-LRP

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
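
The layer-aware 'optional' rule from the auto.py patch can be illustrated without any mergekit code. This sketch assumes each layer is summarised as a set of layer-relative tensor names; the function name and the tensor names are hypothetical, and it is not the actual `_wi` logic.

```python
from typing import List, Set


def optional_tensor_names(per_layer: List[Set[str]]) -> Set[str]:
    """Names present in at least one layer but missing from at least one other."""
    if not per_layer:
        return set()
    present_somewhere = set().union(*per_layer)
    present_everywhere = set.intersection(*per_layer)
    return present_somewhere - present_everywhere


# Hypothetical hybrid stack: layer 0 uses full attention, layer 1 linear attention.
layers = [
    {"self_attn.q_proj.weight", "mlp.up_proj.weight"},
    {"linear_attn.dt_bias", "mlp.up_proj.weight"},
]
optional = optional_tensor_names(layers)
required = set.intersection(*layers)
```

Deciding optionality from one layer alone would mark `linear_attn.dt_bias` as required or as absent depending on which layer was inspected; comparing across all layers avoids that.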

Tusm11 · 2026-05-01T13:31:37Z

@copilot resolve the merge conflicts in this pull request

Tusm11 and others added 13 commits April 25, 2026 21:31

docs: Update README with eX-LRP documentation and features

90a9d86

Address PR feedback: wire lrp_scores, migrate to lxt for true AttnLRP…

0125b2e

…, add LRP mask test, and clean up files

Fix LRP bugs: add safetensors support, implement global cache, suppor…

3e92daf

…t multiple architectures, and enforce explicit lrp_scores

Fix pipeline argument parsing, enforce strict LRP scores without fall…

4ac9c1d

…back, add lru_cache for memory safety, and clean up unused LRP parameters

Fix test fallback logic, use dynamic mergekit-yaml path, and cleanly …

a899412

…restore model state after LRP backward pass

Guarantee gradient checkpointing restoration via try/finally and add …

d6e8dd1

…lxt to pyproject extras

Fix build_mask import, correctly trace embedding gradients in LRP, is…

ae32700

…olate LRP score output directories, and patch notebook paths

Fix tied embeddings in LRP, remove incompatible 4-bit loading, fix le…

26571b8

…ft-padded attention masks, and fix pipeline device arguments

mann1x mentioned this pull request Apr 30, 2026

fix(efficient): make vit_torch import optional for text-only LRP environments rachtibat/LRP-eXplains-Transformers#42

Open

4 tasks

Tusm11 force-pushed the feature/ex-lrp branch from cfcfc82 to 8d989f6 Compare May 1, 2026 11:36

mann1x wants to merge 13 commits into Tusm11:feature/ex-lrp from mann1x:fix/multimodal-lm-support


2 participants
