Add Qwen3.5 architecture support #686

Open
ZhangYiqun018 wants to merge 4 commits into arcee-ai:main from ZhangYiqun018:qwen35-architecture-support
Conversation

ZhangYiqun018 commented May 7, 2026

Summary

  • add architecture support for Qwen3.5 dense and MoE models
  • cover Qwen3.5 multimodal wrapper weights, text decoder, vision tower, mixed linear/full attention, shared experts, and MTP weights
  • add Qwen3.5 architecture tests for dense passthrough and MoE linear merges

Testing

  • pytest tests/test_qwen35_architecture.py -q
  • pytest -q
  • verified official full-precision Qwen3.5 dense/MoE model index coverage with zero missing tensors

Note

Medium Risk
Adds a new architecture mapping that drives weight discovery for Qwen3.5 models (dense, MoE, multimodal). Mistakes could cause missing or extra tensors, or incorrect merges, for these checkpoints; however, the change is largely additive and gated by architecture name checks.

Overview
Adds first-class Qwen3.5 architecture support by routing arch_info_for_config to a new qwen35_architecture_for_config resolver when the config's architectures[0] matches a known Qwen3.5 dense or MoE name.
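
A minimal sketch of the kind of name-gated dispatch described above. It does not reproduce the PR's code: `pick_resolver` and the two architecture strings are placeholders invented for illustration, and the real resolver returns architecture metadata rather than a label.

```python
from types import SimpleNamespace
from typing import Optional

# Placeholder names for illustration only; the real Qwen3.5 architecture
# strings are defined in the PR and are not reproduced here.
KNOWN_QWEN35_ARCHITECTURES = frozenset({"ExampleQwen35Dense", "ExampleQwen35Moe"})


def pick_resolver(config) -> Optional[str]:
    """Return a resolver label for the config, or None to fall through."""
    architectures = getattr(config, "architectures", None) or []
    # Only the first entry is consulted, mirroring the architectures[0] check.
    if architectures and architectures[0] in KNOWN_QWEN35_ARCHITECTURES:
        return "qwen35"
    return None


dense = SimpleNamespace(architectures=["ExampleQwen35Dense"])
other = SimpleNamespace(architectures=["SomethingElse"])
print(pick_resolver(dense), pick_resolver(other))
```

Because the check returns None for any unknown name, existing architectures keep their current code path, which is what makes the change additive.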

Introduces mergekit/architecture/qwen35.py, defining module weight layouts for the Qwen3.5 text decoder (including mixed linear/full attention and optional attention biases), MoE/shared-expert variants, optional MTP blocks, and the multimodal vision tower; it also sets multimodal tagalong files and correct vocab-size config keys.

Adds tests/test_qwen35_architecture.py to assert full coverage of transformers state_dict keys for dense and MoE configs and to smoke-test passthrough (dense) and linear (MoE) merges end-to-end.
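
The coverage assertion described above can be sketched roughly as a set comparison between declared weight names and the model's state_dict keys. This is an assumption about the shape of the check, not the PR's test code: `WeightSpec` is a stand-in for mergekit's `WeightInfo`, and `coverage_gaps` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class WeightSpec:
    """Stand-in for a declared weight; not mergekit's actual class."""
    name: str
    optional: bool = False


def coverage_gaps(
    declared: Iterable[WeightSpec], state_dict_keys: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (missing, unlisted) tensor names.

    missing:  required declared names absent from the model.
    unlisted: model tensors that no declaration covers.
    """
    declared = list(declared)
    keys = set(state_dict_keys)
    required = {w.name for w in declared if not w.optional}
    all_declared = {w.name for w in declared}
    return sorted(required - keys), sorted(keys - all_declared)


declared = [
    WeightSpec("layer.q_proj.weight"),
    WeightSpec("layer.q_proj.bias", optional=True),
]
missing, unlisted = coverage_gaps(
    declared, ["layer.q_proj.weight", "layer.o_proj.bias"]
)
print(missing, unlisted)
```

A full-coverage test would assert that both lists are empty; a non-empty `unlisted` list is the case where a merge would silently drop tensors.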

Reviewed by Cursor Bugbot for commit a812d01.

github-actions Bot commented May 7, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

ZhangYiqun018 (Author) commented

I have read the CLA Document and I hereby sign the CLA

ZhangYiqun018 (Author) commented

recheck

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1296196de9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread mergekit/architecture/qwen35.py Outdated
if getattr(_text_config(config), "attention_bias", False):
    res.extend(
        WeightInfo(name=f"{prefix}.self_attn.{name}", optional=True)
        for name in ("q_proj.bias", "k_proj.bias", "v_proj.bias")

P2: Include the full-attention output bias

When attention_bias is enabled, Qwen3.5 full-attention layers also instantiate self_attn.o_proj.bias (o_proj is constructed with bias=config.attention_bias in the Transformers Qwen3.5 implementation), but this architecture only enumerates q/k/v biases. For any Qwen3.5 dense or MoE checkpoint with attention_bias=True, mergekit will never plan or write the existing output-projection bias, so merged outputs silently drop that tensor.
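
One way the suggestion could be addressed is to enumerate all four projection biases together. The sketch below is illustrative only: `WeightSpec` stands in for mergekit's `WeightInfo`, and `attention_bias_weights` is a hypothetical helper rather than a function from the PR.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import List


@dataclass(frozen=True)
class WeightSpec:
    """Stand-in for a declared weight; not mergekit's actual class."""
    name: str
    optional: bool = False


# o_proj is listed alongside q/k/v so the output-projection bias is planned too.
ATTENTION_PROJECTIONS = ("q_proj", "k_proj", "v_proj", "o_proj")


def attention_bias_weights(prefix: str, text_config) -> List[WeightSpec]:
    """List bias tensors for one full-attention layer when attention_bias is set."""
    if not getattr(text_config, "attention_bias", False):
        return []
    return [
        WeightSpec(name=f"{prefix}.self_attn.{proj}.bias", optional=True)
        for proj in ATTENTION_PROJECTIONS
    ]


cfg = SimpleNamespace(attention_bias=True)
for spec in attention_bias_weights("model.layers.0", cfg):
    print(spec.name)
```

The layer prefix `model.layers.0` is an example value; the actual prefix depends on how the multimodal wrapper nests the text decoder.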

Comment thread mergekit/architecture/qwen35.py Outdated
ZhangYiqun018 force-pushed the qwen35-architecture-support branch from 18c79d1 to 69f901f on May 7, 2026 07:14

cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 69f901f.

Comment thread mergekit/architecture/qwen35.py

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a812d01a66

Comment on lines +203 to +206
WeightInfo(name=f"{prefix}.self_attn.q_proj.weight", optional=True),
WeightInfo(name=f"{prefix}.self_attn.k_proj.weight", optional=True),
WeightInfo(name=f"{prefix}.self_attn.v_proj.weight", optional=True),
WeightInfo(name=f"{prefix}.self_attn.o_proj.weight", optional=True),

P2: Include attention biases for MTP layers

When attention_bias=True and mtp_num_hidden_layers is enabled, the MTP full-attention block uses the same q/k/v/o projections as Qwen3.5 decoder attention, so checkpoints contain mtp.layers.N.self_attn.{q,k,v,o}_proj.bias tensors. The main decoder path now enumerates those biases, but the MTP architecture still only lists the weights and norms here, so merges will silently omit the MTP attention biases from the output checkpoint.
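
A rough sketch of what enumerating the MTP attention tensors with biases might look like, based only on the tensor naming quoted in this comment (`mtp.layers.N.self_attn.{q,k,v,o}_proj.bias`). The function name `mtp_attention_tensor_names` is hypothetical, and the sketch assumes `mtp_num_hidden_layers` and `attention_bias` are readable as attributes of the text config.

```python
from types import SimpleNamespace
from typing import List

ATTENTION_PROJECTIONS = ("q_proj", "k_proj", "v_proj", "o_proj")


def mtp_attention_tensor_names(text_config) -> List[str]:
    """List attention weight names, plus bias names when attention_bias is set,
    for every MTP layer."""
    num_layers = getattr(text_config, "mtp_num_hidden_layers", 0) or 0
    with_bias = bool(getattr(text_config, "attention_bias", False))
    names: List[str] = []
    for layer_idx in range(num_layers):
        prefix = f"mtp.layers.{layer_idx}.self_attn"
        for proj in ATTENTION_PROJECTIONS:
            names.append(f"{prefix}.{proj}.weight")
            if with_bias:
                # Without this branch the biases would be omitted from the plan.
                names.append(f"{prefix}.{proj}.bias")
    return names


cfg = SimpleNamespace(mtp_num_hidden_layers=1, attention_bias=True)
for name in mtp_attention_tensor_names(cfg):
    print(name)
```

Sharing one projection tuple between the decoder path and the MTP path is one way to keep the two enumerations from drifting apart again.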
