perf(qwen35): fixed-width verify graph for CUDA-graph replay #424

Open: cheese-cakee wants to merge 1 commit into Luce-Org:main from cheese-cakee:perf/qwen35-stepinvariant-verify-graph

cheese-cakee (Contributor) commented Jun 19, 2026

Summary

The kvflash spec-decode verify path already builds a step-invariant ggml graph. ggml_set_rows + kv_write_rows carry kv_start in data, a stride-256 flash-attention span keeps the node properties constant, and a persistent step arena keeps the graph at a stable address. That lets the ggml-cuda CUDA-graph cache replay it across decode steps instead of relaunching every kernel.

The one remaining case that breaks CUDA-graph replay is the post-accept replay verify. Each round runs a main verify over the full draft width q_len and then, on the non-fast-rollback branch, a replay verify over commit_n tokens. Because commit_n changes from step to step, the replay verify builds a graph whose node dimensions differ from the main verify's, so ggml-cuda has to recapture. ggml-cuda compares every node's struct, plus each source's ne, nb, and data pointer, byte for byte and recaptures on any mismatch, so alternating between the main verify and the replay verify never settles on one captured graph.
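The effect of alternating widths can be illustrated with a deliberately simplified model. The sketch below assumes a single-entry cache keyed only on graph width; ToyGraphCache and count_captures are hypothetical names for illustration and do not reflect the actual ggml-cuda property comparison.

```cpp
#include <cassert>
#include <vector>

// Toy model (assumption): one cached graph, keyed on width alone.
// The real ggml-cuda check compares full node properties.
struct ToyGraphCache {
    int captured_width = -1;  // -1 means nothing captured yet
    int captures = 0;         // counts the initial capture and every recapture

    void launch(int width) {
        assert(width > 0);
        if (width != captured_width) {
            captured_width = width;  // shape changed: capture again
            ++captures;
        }
        // otherwise the cached graph is replayed as-is
    }
};

// Simulate rounds of (main verify at q_len, replay verify at commit_n).
// With fixed_width, the replay verify is padded to q_len.
int count_captures(int q_len, const std::vector<int>& commit_ns, bool fixed_width) {
    ToyGraphCache cache;
    for (int commit_n : commit_ns) {
        cache.launch(q_len);
        cache.launch(fixed_width ? q_len : commit_n);
    }
    return cache.captures;
}
```

In this model, three rounds with q_len = 8 and commit_n values 3, 5, 2 capture six times at variable width and once at fixed width.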

This PR adds an opt-in fixed-width verify so the replay reuses the main verify's graph.

Changes

  • server/src/common/dflash_target.h: DFlashTarget::verify_batch gets a trailing int pad_to = 0. The default keeps today's variable-width behavior.
  • server/src/qwen35/qwen35_dflash_target.cpp: implements it. The forward is built at n_tokens = max(pad_to, tokens.size()), the padding rows carry a zero embedding, and embed / argmax / cur_pos all use the real token count n_real.
  • server/src/qwen35/qwen35_backend.cpp: gated behind DFLASH_QWEN35_FIXED_VERIFY (off by default). When set, the chain replay verify is padded to q_len.
  • server/src/gemma4/gemma4_dflash_target.* and server/src/qwen35/qwen35_layer_split_dflash_target.*: accept and ignore pad_to, just to match the base signature.
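As a rough illustration of the padding rule in the qwen35 target, here is a minimal sketch of building inputs at n_tokens = max(pad_to, tokens.size()) with zeroed padding rows. PaddedBatch and build_padded_batch are hypothetical stand-ins, not the types or functions in this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the padded inputs (not the real DFlashTarget types).
struct PaddedBatch {
    int n_real = 0;            // tokens actually supplied by the caller
    int n_tokens = 0;          // graph width: max(pad_to, n_real)
    int embed_dim = 0;
    std::vector<float> embed;  // n_tokens * embed_dim values, padding rows zeroed
};

// Build graph inputs at a fixed width. pad_to = 0 keeps variable-width behavior.
PaddedBatch build_padded_batch(const std::vector<std::vector<float>>& token_embeds,
                               int embed_dim, int pad_to = 0) {
    PaddedBatch b;
    b.n_real = static_cast<int>(token_embeds.size());
    b.n_tokens = std::max(pad_to, b.n_real);
    b.embed_dim = embed_dim;
    // Start from all zeros so rows n_real..n_tokens-1 carry a zero embedding.
    b.embed.assign(static_cast<std::size_t>(b.n_tokens) * embed_dim, 0.0f);
    for (int i = 0; i < b.n_real; ++i) {
        assert(static_cast<int>(token_embeds[i].size()) == embed_dim);
        std::copy(token_embeds[i].begin(), token_embeds[i].end(),
                  b.embed.begin() + static_cast<std::ptrdiff_t>(i) * embed_dim);
    }
    return b;
}
```

Downstream consumers (argmax, cur_pos) would read only the first n_real rows, which is why the struct keeps n_real separate from n_tokens.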

How it works

With n_tokens = pad_to, the real rows 0..n_real-1 attend only to committed positions (pos < base_pos) and to their own causal slots, slots[0..q], all of which are below n_real. Both the slot-space mask and the causal mask leave every padded column at -inf for the real rows, so those columns add exp(-inf) = 0 to the softmax denominator. The embeddings and positions for the real rows are unchanged. As a result, the argmax for positions 0..n_real-1 is the same as in an unpadded call.
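The masking argument can be checked on a single attention row. The sketch below is a standalone toy, not the ggml flash-attention kernel: softmax_row and pad_masked are hypothetical helpers, and it assumes IEEE floating point without fast-math so that exp(-inf) is exactly 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Numerically stable softmax over one attention row.
// Masked columns are expected to hold -infinity.
std::vector<float> softmax_row(const std::vector<float>& scores) {
    assert(!scores.empty());
    const float row_max = *std::max_element(scores.begin(), scores.end());
    assert(std::isfinite(row_max));  // needs at least one unmasked column
    std::vector<float> w(scores.size());
    float denom = 0.0f;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        w[i] = std::exp(scores[i] - row_max);  // exp(-inf) == 0 for masked columns
        denom += w[i];                         // adding 0 leaves denom unchanged
    }
    for (float& x : w) x /= denom;
    return w;
}

// Append n_pad masked columns to a row, as a padded verify would see them.
std::vector<float> pad_masked(std::vector<float> scores, int n_pad) {
    assert(n_pad >= 0);
    scores.insert(scores.end(), static_cast<std::size_t>(n_pad),
                  -std::numeric_limits<float>::infinity());
    return scores;
}
```

Comparing softmax_row(row) with softmax_row(pad_masked(row, k)) gives equal weights on the original columns and zero weight on the padded ones, so the padded value rows cannot contribute to a real row's output.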

The caller only pads the replay to q_len, which is the width the round's main verify already used, so under kvflash the slots for [committed, committed+q_len) are already resident. restore_kv() only restores SSM/conv state and never touches the pager, and slot_for returns the same slot for an already-resident position, so the padded replay allocates no new pool slot and triggers no eviction. The padded positions [committed+commit_n, committed+q_len) are never read again: cur_pos advances by n_real, and the next round's mask skips any slot at pos >= base_pos and overwrites it before attention.
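The residency argument can be sketched with a toy pager. ToyPager and run_verify below are hypothetical and ignore eviction, chunking, and the real kvflash pool; the only behavior modeled is that slot_for returns the existing slot for a resident position and that cur_pos advances by n_real.

```cpp
#include <algorithm>
#include <cassert>
#include <unordered_map>

// Toy pager (assumption): position -> slot map with no eviction.
struct ToyPager {
    std::unordered_map<int, int> pos_to_slot;
    int next_free = 0;
    int allocations = 0;

    // Idempotent for resident positions: returns the same slot, allocates nothing.
    int slot_for(int pos) {
        auto it = pos_to_slot.find(pos);
        if (it != pos_to_slot.end()) return it->second;
        ++allocations;
        const int slot = next_free++;
        pos_to_slot[pos] = slot;
        return slot;
    }
};

// Touch slots for [base_pos, base_pos + n_tokens) and return the next cur_pos,
// which advances by the real token count only.
int run_verify(ToyPager& pager, int base_pos, int n_real, int pad_to) {
    assert(n_real > 0);
    const int n_tokens = std::max(pad_to, n_real);
    for (int i = 0; i < n_tokens; ++i) pager.slot_for(base_pos + i);
    return base_pos + n_real;
}
```

Running a main verify at width 8 from position 100, then a replay of 3 real tokens padded to 8 from the same position, leaves the allocation count at 8 and moves cur_pos to 103.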

Performance

No numbers in this PR yet. The change is correctness-preserving and off by default, so the default path is unchanged.

To evaluate the opt-in (RTX 3090, CUDA 12.8):

# correctness: same token stream with the flag on vs off, fixed prompt + seed
DFLASH_QWEN35_FIXED_VERIFY=1 <serve cmd ...>

# throughput A/B at a graph-eligible block size (see Limitations)
<serve cmd ...>                                 # baseline
DFLASH_QWEN35_FIXED_VERIFY=1 <serve cmd ...>    # fixed-width replay
# confirm graphs actually engage: compare against GGML_CUDA_DISABLE_GRAPHS=1

| block_size | tok/s baseline | tok/s fixed-verify | recaptures/step |
| --- | --- | --- | --- |
| <= 8 | maintainer | maintainer | maintainer |

Limitations

  • ggml-cuda turns CUDA graphs off for any graph whose MUL_MAT_ID (the expert matmul) token batch ne[2] is larger than mmvq_mmid_max (about 8 on Turing+, see [TAG_MUL_MAT_ID_CUDA_GRAPHS] in ggml-cuda.cu). So the replay win only shows up when block_size stays within that limit. Past it the verify graph is not captured at all and padding would only add expert-matmul work. That is why this is opt-in and off by default, and why the perf table is scoped to block_size <= 8.
  • Only the chain replay site is wired. The tree-verify replay/bonus, the size-1 bonus verify, and the floor / tool-prefix / budget-close replays in qwen35_backend.cpp also vary in width. Padding those (especially the size-1 bonus) trades more MoE compute for replay and needs its own measurement, so it is left as follow-up.
  • The generic run_dflash_spec_decode loop (layer-split path) is unchanged, and its target ignores pad_to.
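The first limitation amounts to a simple decision rule, sketched below. This PR does not add such a guard (the flag alone controls padding), and how the flag value is parsed is an assumption here: flag_enabled and choose_pad_to are hypothetical helpers, and mmid_max is passed in rather than read from ggml-cuda.

```cpp
#include <cassert>
#include <cstring>

// Assumption: any non-empty value other than "0" turns the flag on.
bool flag_enabled(const char* value) {
    return value != nullptr && value[0] != '\0' && std::strcmp(value, "0") != 0;
}

// Returns the pad_to a caller might pass for the chain replay verify.
// 0 means "do not pad" (variable-width behavior).
int choose_pad_to(const char* flag_value, int q_len, int mmid_max) {
    assert(q_len > 0);
    if (!flag_enabled(flag_value)) return 0;  // opt-in is off
    if (q_len > mmid_max) return 0;           // graph would not be captured; padding only adds work
    return q_len;                             // reuse the main verify's width
}
```

A caller could pass the result of std::getenv("DFLASH_QWEN35_FIXED_VERIFY") as flag_value and the documented limit of about 8 as mmid_max.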

Verification

  • Default path is byte-identical with the flag off.
  • Reviewed for: real-row output staying identical (mask causality, masked columns contributing zero to softmax, argmax ordering); no committed-KV corruption and no extra eviction (restore_kv leaves the pager untouched, slot_for is idempotent for resident chunks, cur_pos uses n_real, the next-round mask excludes the pad slots); and signature/arity across all three overrides and every call site.
  • The opt-in path still needs the maintainer A/B and correctness runs, using the commands above.

The kvflash spec-decode verify path already builds a step-invariant ggml graph
(set_rows + kv_write_rows, stride-256 FA span, persistent step arena), so the
ggml-cuda CUDA-graph cache can replay it across decode steps. But the post-accept
replay verify runs at a variable width (commit_n), building a graph whose node
dimensions differ from the q_len-wide main verify — every low-acceptance step
forces a recapture, and alternating verify/replay never settles on one captured
graph.

Add an optional fixed-width path. verify_batch(..., pad_to) builds the forward at
max(pad_to, tokens.size()) tokens; the padding rows carry a zero embedding and are
masked out. Real rows attend only to committed positions and their own causal
slots, so causality/masking excludes every padded column — the argmax consumed for
the real positions is bit-identical to an unpadded call. The caller pads the replay
to q_len so it reuses the main verify's graph; those slots are already resident from
the same round's main verify, so this allocates no new pool slot and triggers no
eviction.

- DFlashTarget::verify_batch grows a trailing `pad_to = 0`; the default preserves
  the current variable-width behavior. Qwen35DFlashTarget implements it; the gemma4
  and layer-split overrides accept and ignore it.
- Gated behind DFLASH_QWEN35_FIXED_VERIFY (off by default). The win only lands when
  the verify graph is CUDA-graph-eligible: for an MoE target ggml-cuda disables
  graphs when the mul_mat_id token batch exceeds mmvq_mmid_max (~8 on Turing+), so a
  wider block_size would only pay the padded compute.

cubic-dev-ai (Bot, Contributor) left a comment

No issues found across 8 files
