perf(qwen35): fixed-width verify graph for CUDA-graph replay #424
Open
cheese-cakee wants to merge 1 commit into
The kvflash spec-decode verify path already builds a step-invariant ggml graph (set_rows + kv_write_rows, stride-256 FA span, persistent step arena), so the ggml-cuda CUDA-graph cache can replay it across decode steps. But the post-accept replay verify runs at a variable width (commit_n), building a graph whose node dimensions differ from the q_len-wide main verify: every low-acceptance step forces a recapture, and alternating verify/replay never settles on one captured graph.

Add an optional fixed-width path. verify_batch(..., pad_to) builds the forward at max(pad_to, tokens.size()) tokens; the padding rows carry a zero embedding and are masked out. Real rows attend only to committed positions and their own causal slots, so causality/masking excludes every padded column, and the argmax consumed for the real positions is bit-identical to an unpadded call. The caller pads the replay to q_len so it reuses the main verify's graph; those slots are already resident from the same round's main verify, so this allocates no new pool slot and triggers no eviction.

- DFlashTarget::verify_batch grows a trailing `pad_to = 0`; the default preserves the current variable-width behavior. Qwen35DFlashTarget implements it; the gemma4 and layer-split overrides accept and ignore it.
- Gated behind DFLASH_QWEN35_FIXED_VERIFY (off by default). The win only lands when the verify graph is CUDA-graph-eligible: for an MoE target ggml-cuda disables graphs when the mul_mat_id token batch exceeds mmvq_mmid_max (~8 on Turing+), so a wider block_size would only pay the padded compute.
Summary
The kvflash spec-decode verify path already builds a step-invariant ggml graph. `ggml_set_rows` + `kv_write_rows` carry `kv_start` in data, a stride-256 flash-attention span keeps the node properties constant, and a persistent step arena keeps the graph at a stable address. That lets the ggml-cuda CUDA-graph cache replay it across decode steps instead of relaunching every kernel.

The one place that still breaks replay is the post-accept replay verify. Each round runs a main verify over the full draft width `q_len`, then (on the non-fast-rollback branch) a replay verify over `commit_n` tokens. Since `commit_n` changes from step to step, the replay builds a graph with different node dimensions than the main verify, so ggml-cuda has to recapture. ggml-cuda compares every node's struct plus each source `ne`/`nb`/data-ptr byte for byte, and recaptures on any mismatch, so alternating verify and replay never settles on one captured graph.

This PR adds an opt-in fixed-width verify so the replay reuses the main verify's graph.
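
The recapture pattern described above can be illustrated with a toy model. This is a minimal sketch, not ggml-cuda's actual cache: `ToyGraphKey`, `ToyGraphCache`, and `simulate` are hypothetical names, and a whole graph is reduced to a single width field standing in for the per-node properties that get compared.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the node properties a graph cache compares.
// The whole graph is reduced to the token width of its verify batch.
struct ToyGraphKey {
    int64_t n_tokens;  // batch width; in a real graph this drives node ne[] values
    bool operator==(const ToyGraphKey &o) const { return n_tokens == o.n_tokens; }
};

// Single-slot cache: replays when the key matches the captured graph,
// recaptures (and overwrites the slot) on any mismatch.
struct ToyGraphCache {
    bool has_graph = false;
    ToyGraphKey captured{};
    int captures = 0;
    int replays = 0;

    void launch(const ToyGraphKey &key) {
        if (has_graph && captured == key) {
            ++replays;
            return;
        }
        captured = key;
        has_graph = true;
        ++captures;
    }
};

// Runs one main verify (q_len wide) and one replay verify per round.
// commit_ns[i] is an assumed accepted-token count for round i.
// With pad_replay set, the replay is widened to q_len, mirroring pad_to.
inline ToyGraphCache simulate(int q_len, const std::vector<int> &commit_ns,
                              bool pad_replay) {
    ToyGraphCache cache;
    for (int commit_n : commit_ns) {
        assert(commit_n >= 1 && commit_n <= q_len);
        cache.launch({q_len});                                   // main verify
        const int replay_width = pad_replay ? q_len : commit_n;  // replay verify
        cache.launch({replay_width});
    }
    return cache;
}
```

In this model, any round with `commit_n != q_len` costs two captures when the replay is left at its natural width, while the padded variant captures once and replays from then on. The real cache compares far more than width, so the sketch only shows why a single varying dimension is enough to defeat replay.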
Changes
- `server/src/common/dflash_target.h`: `DFlashTarget::verify_batch` gets a trailing `int pad_to = 0`. The default keeps today's variable-width behavior.
- `server/src/qwen35/qwen35_dflash_target.cpp`: implements it. The forward is built at `n_tokens = max(pad_to, tokens.size())`, the padding rows carry a zero embedding, and embed / argmax / `cur_pos` all use the real token count `n_real`.
- `server/src/qwen35/qwen35_backend.cpp`: gated behind `DFLASH_QWEN35_FIXED_VERIFY` (off by default). When set, the chain replay verify is padded to `q_len`.
- `server/src/gemma4/gemma4_dflash_target.*` and `server/src/qwen35/qwen35_layer_split_dflash_target.*`: accept and ignore `pad_to`, just to match the base signature.

How it works
With `n_tokens = pad_to`, the real rows `0..n_real-1` attend only to committed positions (`pos < base_pos`) and their own causal slots `slots[0..q]`, which are all below `n_real`. Both the slot-space mask and the causal mask leave every padded column at `-inf` for the real rows, so those columns add `exp(-inf) = 0` to the softmax denominator. The embeddings and positions for the real rows are unchanged. So the argmax for positions `0..n_real-1` comes out the same as an unpadded call.

The caller only pads the replay to `q_len`, which is the width the round's main verify already used, so under kvflash the slots for `[committed, committed+q_len)` are already resident. `restore_kv()` only restores SSM/conv state and never touches the pager, and `slot_for` returns the same slot for an already-resident position, so the padded replay allocates no new pool slot and triggers no eviction. The padded positions `[committed+commit_n, committed+q_len)` are never read again: `cur_pos` advances by `n_real`, and the next round's mask skips any slot at `pos >= base_pos` and overwrites it before attention.

Performance
No numbers in this PR yet. The change is correctness-preserving and off by default, so the default path is unchanged.
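
The correctness-preserving claim rests on masked columns contributing `exp(-inf) = 0` to each real row. The following is a minimal sketch of that arithmetic, assuming a toy single-query attention with scalar values and a plain causal mask; `attend_row`, `causal_mask`, and `row_output` are hypothetical helpers, not functions from this PR, and a real flash-attention kernel reduces in a different order.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Softmax-weighted sum of values for one query row.
// mask[j] is 0 for visible columns and -inf for masked ones.
inline float attend_row(const std::vector<float> &scores,
                        const std::vector<float> &mask,
                        const std::vector<float> &values) {
    assert(scores.size() == mask.size() && scores.size() == values.size());
    float max_s = -std::numeric_limits<float>::infinity();
    for (std::size_t j = 0; j < scores.size(); ++j) {
        max_s = std::max(max_s, scores[j] + mask[j]);
    }
    float denom = 0.0f;
    float acc = 0.0f;
    for (std::size_t j = 0; j < scores.size(); ++j) {
        const float w = std::exp(scores[j] + mask[j] - max_s);  // 0 for masked columns
        denom += w;
        acc += w * values[j];
    }
    assert(denom > 0.0f);  // every row sees at least its own column
    return acc / denom;
}

// Causal mask for query row `row` over n_cols columns: columns > row are masked.
inline std::vector<float> causal_mask(std::size_t row, std::size_t n_cols) {
    std::vector<float> m(n_cols, 0.0f);
    for (std::size_t j = row + 1; j < n_cols; ++j) {
        m[j] = -std::numeric_limits<float>::infinity();
    }
    return m;
}

// Output of a real row when the block is built n_cols wide.
// Columns past the real count are padding: score 0 and value 0 (zero embedding).
inline float row_output(std::size_t row, std::vector<float> scores,
                        std::vector<float> values, std::size_t n_cols) {
    assert(row < scores.size() && n_cols >= scores.size());
    scores.resize(n_cols, 0.0f);
    values.resize(n_cols, 0.0f);
    return attend_row(scores, causal_mask(row, n_cols), values);
}
```

Because the padded columns sit after every real row's causal window, they only ever add zero to `denom` and `acc`, so in this toy the padded and unpadded outputs compare equal with `==`. That holds under default floating-point semantics; flags such as `-ffast-math` would void the comparison.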
To evaluate the opt-in (RTX 3090, CUDA 12.8):
Limitations
- For an MoE target, ggml-cuda disables CUDA graphs when the `MUL_MAT_ID` (the expert matmul) token batch `ne[2]` is larger than `mmvq_mmid_max` (about 8 on Turing+, see `[TAG_MUL_MAT_ID_CUDA_GRAPHS]` in `ggml-cuda.cu`). So the replay win only shows up when `block_size` stays within that limit. Past it the verify graph is not captured at all and padding would only add expert-matmul work. That is why this is opt-in and off by default, and why the perf table is scoped to `block_size <= 8`.
- Other verify calls in `qwen35_backend.cpp` also vary in width. Padding those (especially the size-1 bonus) trades more MoE compute for replay and needs its own measurement, so it is left as follow-up.
- The `run_dflash_spec_decode` loop (layer-split path) is unchanged, and its target ignores `pad_to`.

Verification
… (`restore_kv` leaves the pager untouched, `slot_for` is idempotent for resident chunks, `cur_pos` uses `n_real`, the next-round mask excludes the pad slots); and signature/arity across all three overrides and every call site.