Add Qwen3.5 architecture support #686
All contributors have signed the CLA ✍️ ✅
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1296196de9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
if getattr(_text_config(config), "attention_bias", False):
    res.extend(
        WeightInfo(name=f"{prefix}.self_attn.{name}", optional=True)
        for name in ("q_proj.bias", "k_proj.bias", "v_proj.bias")
Include the full-attention output bias
When attention_bias is enabled, Qwen3.5 full-attention layers also instantiate self_attn.o_proj.bias (o_proj is constructed with bias=config.attention_bias in the Transformers Qwen3.5 implementation), but this architecture only enumerates q/k/v biases. For any Qwen3.5 dense or MoE checkpoint with attention_bias=True, mergekit will never plan or write the existing output-projection bias, so merged outputs silently drop that tensor.
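A minimal sketch of what the suggestion above amounts to, assuming the fix is simply to enumerate all four projection biases. The `WeightInfo` dataclass here is a simplified stand-in for mergekit's class, the helper name `full_attention_bias_weights` is hypothetical, and the `model.layers.3` prefix is only an illustrative value:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WeightInfo:
    """Simplified stand-in for mergekit's WeightInfo (only the fields used here)."""
    name: str
    optional: bool = False


def full_attention_bias_weights(prefix: str, attention_bias: bool) -> List[WeightInfo]:
    """List the bias tensors of one full-attention layer (hypothetical helper)."""
    if not attention_bias:
        return []
    # o_proj is listed alongside q/k/v so the output-projection bias is planned too.
    return [
        WeightInfo(name=f"{prefix}.self_attn.{proj}.bias", optional=True)
        for proj in ("q_proj", "k_proj", "v_proj", "o_proj")
    ]


# Illustrative prefix; the real prefix depends on the checkpoint layout.
biases = full_attention_bias_weights("model.layers.3", attention_bias=True)
print([w.name for w in biases])
```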
Branch updated from 18c79d1 to 69f901f.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 69f901f.
💡 Codex Review
Reviewed commit: a812d01a66
WeightInfo(name=f"{prefix}.self_attn.q_proj.weight", optional=True),
WeightInfo(name=f"{prefix}.self_attn.k_proj.weight", optional=True),
WeightInfo(name=f"{prefix}.self_attn.v_proj.weight", optional=True),
WeightInfo(name=f"{prefix}.self_attn.o_proj.weight", optional=True),
Include attention biases for MTP layers
When attention_bias=True and mtp_num_hidden_layers is enabled, the MTP full-attention block uses the same q/k/v/o projections as Qwen3.5 decoder attention, so checkpoints contain mtp.layers.N.self_attn.{q,k,v,o}_proj.bias tensors. The main decoder path now enumerates those biases, but the MTP architecture still only lists the weights and norms here, so merges will silently omit the MTP attention biases from the output checkpoint.
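One way to keep the decoder and MTP listings from drifting apart is to build both from a single helper. The sketch below is an assumption about how that could look, not the PR's code: `WeightInfo` is a simplified stand-in for mergekit's class, `attention_weights` is a hypothetical helper, and the `model.layers.3` and `mtp.layers.0` prefixes are illustrative:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WeightInfo:
    """Simplified stand-in for mergekit's WeightInfo (only the fields used here)."""
    name: str
    optional: bool = False


PROJECTIONS = ("q_proj", "k_proj", "v_proj", "o_proj")


def attention_weights(prefix: str, attention_bias: bool) -> List[WeightInfo]:
    """List projection weights, plus biases when attention_bias is set (hypothetical helper)."""
    weights = [
        WeightInfo(name=f"{prefix}.self_attn.{proj}.weight", optional=True)
        for proj in PROJECTIONS
    ]
    if attention_bias:
        weights.extend(
            WeightInfo(name=f"{prefix}.self_attn.{proj}.bias", optional=True)
            for proj in PROJECTIONS
        )
    return weights


# The same helper serves both paths, so neither can omit a bias the other lists.
decoder = attention_weights("model.layers.3", attention_bias=True)
mtp = attention_weights("mtp.layers.0", attention_bias=True)
print([w.name for w in mtp])
```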

Summary
Testing
- pytest tests/test_qwen35_architecture.py -q
- pytest -q
Note
Medium Risk
Adds a new architecture mapping that drives weight discovery for Qwen3.5 models (dense, MoE, multimodal), so mistakes could cause missing/extra tensors or incorrect merges for these checkpoints, but the change is largely additive and gated by architecture name checks.
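The gating by architecture name mentioned above can be sketched roughly as follows. The names in `QWEN35_ARCHITECTURE_NAMES` are deliberate placeholders, not the real Qwen3.5 architecture strings, and `pick_resolver` is a hypothetical function that returns a label rather than calling into mergekit:

```python
from typing import Any, Mapping, Optional

# Placeholder values for illustration only; the PR matches the real
# Qwen3.5 dense/MoE architecture names, which are not reproduced here.
QWEN35_ARCHITECTURE_NAMES = frozenset({"ExampleQwen35Dense", "ExampleQwen35Moe"})


def pick_resolver(config: Mapping[str, Any]) -> Optional[str]:
    """Return "qwen35" when architectures[0] is a known name, else None."""
    architectures = config.get("architectures") or []
    if architectures and architectures[0] in QWEN35_ARCHITECTURE_NAMES:
        return "qwen35"
    # Anything else falls through to the existing resolution path.
    return None


print(pick_resolver({"architectures": ["ExampleQwen35Moe"]}))
print(pick_resolver({"architectures": ["SomeOtherModel"]}))
```

Because only the first entry of `architectures` is checked, configs for other model families are unaffected by the new branch.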
Overview
Adds first-class Qwen3.5 architecture support by routing arch_info_for_config to a new qwen35_architecture_for_config resolver when the config architectures[0] matches known Qwen3.5 dense/MoE names.
Introduces mergekit/architecture/qwen35.py, defining module weight layouts for the Qwen3.5 text decoder (including mixed linear/full attention and optional attention biases), MoE/shared-expert variants, optional MTP blocks, and the multimodal vision tower; it also sets multimodal tagalong files and correct vocab-size config keys.
Adds tests/test_qwen35_architecture.py to assert full coverage of transformers state_dict keys for dense and MoE configs and to smoke-test passthrough (dense) and linear (MoE) merges end-to-end.
Reviewed by Cursor Bugbot for commit a812d01.
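The coverage assertion described in the overview can be approximated with a plain set difference. This is a sketch under assumptions, not the PR's test: it works on lists of tensor names instead of a real transformers state_dict, and `undeclared_keys` is a hypothetical helper:

```python
from typing import Iterable, List


def undeclared_keys(state_dict_keys: Iterable[str], declared_names: Iterable[str]) -> List[str]:
    """Return checkpoint tensor names that the architecture definition does not list."""
    return sorted(set(state_dict_keys) - set(declared_names))


# Illustrative names only: a layer with all four biases in the checkpoint,
# checked against a definition that lists just q/k/v.
checkpoint_keys = [
    f"model.layers.0.self_attn.{proj}.bias"
    for proj in ("q_proj", "k_proj", "v_proj", "o_proj")
]
declared = [
    f"model.layers.0.self_attn.{proj}.bias"
    for proj in ("q_proj", "k_proj", "v_proj")
]

missing = undeclared_keys(checkpoint_keys, declared)
print(missing)
```

A test built this way fails as soon as the checkpoint contains a tensor the definition omits, which is the kind of silent drop both review comments describe.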