perf(qwen35): fixed-width verify graph for CUDA-graph replay by cheese-cakee · Pull Request #424 · Luce-Org/lucebox-hub

cheese-cakee · 2026-06-19T14:11:23Z

Summary

The kvflash spec-decode verify path already builds a step-invariant ggml graph. ggml_set_rows + kv_write_rows carry kv_start in data, a stride-256 flash-attention span keeps the node properties constant, and a persistent step arena keeps the graph at a stable address. That lets the ggml-cuda CUDA-graph cache replay it across decode steps instead of relaunching every kernel.

The one place that still breaks replay is the post-accept replay verify. Each round runs a main verify over the full draft width q_len, then (on the non-fast-rollback branch) a replay verify over commit_n tokens. Since commit_n changes from step to step, the replay builds a graph with different node dimensions than the main verify, so ggml-cuda has to recapture. ggml-cuda compares every node's struct plus each source ne/nb/data-ptr byte for byte, and recaptures on any mismatch, so alternating verify and replay never settles on one captured graph.
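As a rough illustration of why a changing commit_n defeats replay, the sketch below models the cache check on a toy one-node graph. NodeProps, needs_recapture and toy_verify_graph are hypothetical names invented for this example, not ggml-cuda types, and the real comparison covers more fields than shown here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for the per-node properties a CUDA-graph
// cache compares between steps. This is not ggml-cuda's real struct.
struct NodeProps {
    int op;                          // operation id
    std::array<std::int64_t, 4> ne;  // element count per dimension
    std::array<std::size_t, 4> nb;   // byte stride per dimension
    const void * data;               // output buffer address
};

// Returns true when the next graph cannot reuse the captured one.
inline bool needs_recapture(const std::vector<NodeProps> & captured,
                            const std::vector<NodeProps> & next) {
    if (captured.size() != next.size()) return true;
    for (std::size_t i = 0; i < captured.size(); ++i) {
        const NodeProps & a = captured[i];
        const NodeProps & b = next[i];
        if (a.op != b.op || a.ne != b.ne || a.nb != b.nb || a.data != b.data) {
            return true;
        }
    }
    return false;
}

// Toy one-node "verify graph": a [hidden, n_tokens] activation in a fixed buffer.
inline std::vector<NodeProps> toy_verify_graph(std::int64_t n_tokens, const void * buf) {
    const std::int64_t hidden = 64;
    const std::size_t row_bytes = sizeof(float) * static_cast<std::size_t>(hidden);
    const std::size_t all_bytes = row_bytes * static_cast<std::size_t>(n_tokens);
    NodeProps n{};
    n.op = 1;
    n.ne = {hidden, n_tokens, 1, 1};
    n.nb = {sizeof(float), row_bytes, all_bytes, all_bytes};
    n.data = buf;
    return {n};
}
```

In this toy model a replay built at commit_n = 3 differs from the q_len = 8 main verify in ne[1] and the outer strides, so the check reports a recapture, while a replay built at q_len matches the captured graph.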

This PR adds an opt-in fixed-width verify so the replay reuses the main verify's graph.

Changes

server/src/common/dflash_target.h: DFlashTarget::verify_batch gets a trailing int pad_to = 0. The default keeps today's variable-width behavior.
server/src/qwen35/qwen35_dflash_target.cpp: implements it. The forward is built at n_tokens = max(pad_to, tokens.size()), the padding rows carry a zero embedding, and embed / argmax / cur_pos all use the real token count n_real.
server/src/qwen35/qwen35_backend.cpp: gated behind DFLASH_QWEN35_FIXED_VERIFY (off by default). When set, the chain replay verify is padded to q_len.
server/src/gemma4/gemma4_dflash_target.* and server/src/qwen35/qwen35_layer_split_dflash_target.*: accept and ignore pad_to, just to match the base signature.
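The width and padding rules above can be sketched in isolation as follows. This is a minimal model under stated assumptions, not the code in qwen35_dflash_target.cpp: forward_width, pad_embeddings and replay_pad_to are hypothetical helper names, and the embeddings are assumed to be a row-major [n_tokens x hidden] float buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: the width the forward is built at.
// pad_to = 0 (the default) keeps the variable-width behavior.
inline int forward_width(int pad_to, int n_real) {
    return std::max(pad_to, n_real);
}

// Hypothetical helper: lay out row-major [n_tokens x hidden] embeddings.
// Rows 0..n_real-1 hold the real embeddings; rows n_real..n_tokens-1 stay zero.
// Assumes real_rows holds at least n_real * hidden floats.
inline std::vector<float> pad_embeddings(const std::vector<float> & real_rows,
                                         int n_real, int hidden, int pad_to) {
    const int n_tokens = forward_width(pad_to, n_real);
    std::vector<float> out(static_cast<std::size_t>(n_tokens) * hidden, 0.0f);
    const std::ptrdiff_t n_copy = static_cast<std::ptrdiff_t>(n_real) * hidden;
    std::copy(real_rows.begin(), real_rows.begin() + n_copy, out.begin());
    return out;
}

// Hypothetical caller-side choice: pad the chain replay to q_len only when
// the opt-in flag is on; otherwise pass 0 and keep the variable width.
inline int replay_pad_to(bool fixed_verify, int q_len) {
    return fixed_verify ? q_len : 0;
}
```

With the flag off, replay_pad_to returns 0 and forward_width falls back to the real token count, which matches the default behavior described above.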

How it works

With n_tokens = pad_to, the real rows 0..n_real-1 attend only to committed positions (pos < base_pos) and to their own causal slots, slots[0..q], all of which lie below n_real. Both the slot-space mask and the causal mask leave every padded column at -inf for the real rows, so those columns add exp(-inf) = 0 to the softmax denominator. The embeddings and positions for the real rows are unchanged, so the argmax for positions 0..n_real-1 comes out the same as in an unpadded call.
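The masking argument can be checked on a single attention row. The sketch below is a scalar reference, not the flash-attention kernel, and masked_softmax is a hypothetical name; it assumes the mask is additive (0 or -inf) and that at least one column is unmasked.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Softmax over one attention row with an additive mask (0 or -inf per column).
// Assumes scores and mask have equal length and at least one unmasked column.
inline std::vector<float> masked_softmax(const std::vector<float> & scores,
                                         const std::vector<float> & mask) {
    float mx = -std::numeric_limits<float>::infinity();
    for (std::size_t i = 0; i < scores.size(); ++i) {
        mx = std::max(mx, scores[i] + mask[i]);
    }
    std::vector<float> p(scores.size());
    float denom = 0.0f;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        // A masked column is -inf here, and exp(-inf) is exactly 0.
        p[i] = std::exp(scores[i] + mask[i] - mx);
        denom += p[i];
    }
    for (float & v : p) v /= denom;
    return p;
}
```

In this reference, appending masked columns with arbitrary finite scores leaves the probabilities of the real columns unchanged, because the masked terms contribute 0 to both the running maximum and the denominator. A tiled GPU kernel may reduce in a different order, so this only illustrates the argument rather than proving bit-identity for the real kernel.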

The caller only pads the replay to q_len, which is the width the round's main verify already used, so under kvflash the slots for [committed, committed+q_len) are already resident. restore_kv() only restores SSM/conv state and never touches the pager, and slot_for returns the same slot for an already-resident position, so the padded replay allocates no new pool slot and triggers no eviction. The padded positions [committed+commit_n, committed+q_len) are never read again: cur_pos advances by n_real, and the next round's mask skips any slot at pos >= base_pos and overwrites it before attention.
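The residency argument can be modelled with a toy pager. ToyPager and touch_range are hypothetical names for this sketch; the real pager works on chunks and can evict, neither of which is modelled here. The only property shown is that looking up an already-resident position allocates nothing.

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical toy pager: maps a KV position to a pool slot, per position.
struct ToyPager {
    std::unordered_map<int, int> slot_of;  // position -> pool slot
    int allocations = 0;                   // number of slots handed out so far

    int slot_for(int pos) {
        auto it = slot_of.find(pos);
        if (it != slot_of.end()) return it->second;  // resident: same slot, no allocation
        const int slot = allocations++;
        slot_of.emplace(pos, slot);
        return slot;
    }
};

// Touch positions [first_pos, first_pos + width) and return how many
// new slots that required.
inline int touch_range(ToyPager & pager, int first_pos, int width) {
    const int before = pager.allocations;
    for (int i = 0; i < width; ++i) pager.slot_for(first_pos + i);
    return pager.allocations - before;
}
```

In this model the main verify over [committed, committed+q_len) allocates q_len slots, and a padded replay over the same range allocates none. The caller would then advance cur_pos by the real token count only, so the padded positions stay outside the committed range.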

Performance

No numbers in this PR yet. The change is correctness-preserving and off by default, so the default path is unchanged.

To evaluate the opt-in (RTX 3090, CUDA 12.8):

# correctness: same token stream with the flag on vs off, fixed prompt + seed
DFLASH_QWEN35_FIXED_VERIFY=1 <serve cmd ...>

# throughput A/B at a graph-eligible block size (see Limitations)
<serve cmd ...>                                 # baseline
DFLASH_QWEN35_FIXED_VERIFY=1 <serve cmd ...>    # fixed-width replay
# confirm graphs actually engage: compare against GGML_CUDA_DISABLE_GRAPHS=1

block_size	tok/s baseline	tok/s fixed-verify	recaptures/step
<= 8	TBD (maintainer)	TBD (maintainer)	TBD (maintainer)

Limitations

ggml-cuda turns CUDA graphs off for any graph whose MUL_MAT_ID (the expert matmul) token batch ne[2] is larger than mmvq_mmid_max (about 8 on Turing+, see [TAG_MUL_MAT_ID_CUDA_GRAPHS] in ggml-cuda.cu). So the replay win only shows up when block_size stays within that limit. Past it the verify graph is not captured at all and padding would only add expert-matmul work. That is why this is opt-in and off by default, and why the perf table is scoped to block_size <= 8.
Only the chain replay site is wired. The tree-verify replay/bonus, the size-1 bonus verify, and the floor / tool-prefix / budget-close replays in qwen35_backend.cpp also vary in width. Padding those (especially the size-1 bonus) trades more MoE compute for replay and needs its own measurement, so it is left as follow-up.
The generic run_dflash_spec_decode loop (layer-split path) is unchanged, and its target ignores pad_to.
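The trade-off in the first limitation can be written down as a small guard. This PR does not add such a check (the flag is purely opt-in); verify_graph_capturable, should_pad_replay and padded_rows are hypothetical names, and the limit is passed in as a parameter rather than read from ggml-cuda.

```cpp
#include <cassert>

// Hypothetical: a verify graph of n_tokens rows is assumed capturable only
// while the expert-matmul token batch stays within mmid_max.
inline bool verify_graph_capturable(int n_tokens, int mmid_max) {
    return n_tokens <= mmid_max;
}

// Hypothetical guard: padding the replay to q_len is only worthwhile when the
// flag is on and the q_len-wide graph can actually be captured.
inline bool should_pad_replay(bool fixed_verify, int q_len, int mmid_max) {
    return fixed_verify && verify_graph_capturable(q_len, mmid_max);
}

// Extra rows the padded replay computes compared with an unpadded one.
inline int padded_rows(int q_len, int commit_n) {
    return q_len > commit_n ? q_len - commit_n : 0;
}
```

For example, with a limit of 8, a q_len of 8 passes the guard and a replay of 3 committed tokens pays for 5 extra rows, while a q_len of 16 fails the guard, so padding there would add work without enabling replay.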

Verification

Default path is byte-identical with the flag off.
Reviewed for: real-row output staying identical (mask causality, masked columns contributing zero to softmax, argmax ordering); no committed-KV corruption and no extra eviction (restore_kv leaves the pager untouched, slot_for is idempotent for resident chunks, cur_pos uses n_real, the next-round mask excludes the pad slots); and signature/arity across all three overrides and every call site.
The opt-in path still needs the maintainer A/B and correctness runs listed under Performance.

The kvflash spec-decode verify path already builds a step-invariant ggml graph (set_rows + kv_write_rows, stride-256 FA span, persistent step arena), so the ggml-cuda CUDA-graph cache can replay it across decode steps. But the post-accept replay verify runs at a variable width (commit_n), building a graph whose node dimensions differ from the q_len-wide main verify: every low-acceptance step forces a recapture, and alternating verify/replay never settles on one captured graph.

Add an optional fixed-width path. verify_batch(..., pad_to) builds the forward at max(pad_to, tokens.size()) tokens; the padding rows carry a zero embedding and are masked out. Real rows attend only to committed positions and their own causal slots, so causality/masking excludes every padded column, and the argmax consumed for the real positions is bit-identical to an unpadded call. The caller pads the replay to q_len so it reuses the main verify's graph; those slots are already resident from the same round's main verify, so this allocates no new pool slot and triggers no eviction.

- DFlashTarget::verify_batch grows a trailing `pad_to = 0`; the default preserves the current variable-width behavior. Qwen35DFlashTarget implements it; the gemma4 and layer-split overrides accept and ignore it.
- Gated behind DFLASH_QWEN35_FIXED_VERIFY (off by default). The win only lands when the verify graph is CUDA-graph-eligible: for an MoE target ggml-cuda disables graphs when the mul_mat_id token batch exceeds mmvq_mmid_max (~8 on Turing+), so a wider block_size would only pay the padded compute.

cubic-dev-ai

No issues found across 8 files


cubic-dev-ai Bot reviewed Jun 19, 2026

cheese-cakee wants to merge 1 commit into Luce-Org:main from cheese-cakee:perf/qwen35-stepinvariant-verify-graph
