Summary
I hit a multi-GPU autosplit bug in ExLlamaV3's SDPA paged-KV fallback.
In flash_attn_with_kvcache_sdpa_fallback, block_table can live on a different CUDA device than the local cache shard (k_cache / v_cache). The fallback then indexes k_cache directly with indices from block_table, which raises a device mismatch error.
This showed up while debugging a separate FlashAttention relayer crash, but the fallback bug is independent of that crash and appears to be a straightforward correctness issue in autosplit mode.
Environment
- ExLlamaV3 local tree: /home/grace/exllamav3
- Newer local branch/runtime than the older stack I had previously used for MiniMax-M2.5 runs
- Autosplit across 2x NVIDIA GH200 96GB
- CUDA 12.8 container runtime
- Model: MiniMax-M2.7 EXL3 5.0bpw
Failure
Failure site: exllamav3/modules/attn.py, function flash_attn_with_kvcache_sdpa_fallback
Observed error:
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:0)
Relevant lines before patch:
phys_blocks = block_table[b, :num_blocks_needed]
k_buf = k_cache[phys_blocks].reshape(-1, nheads_k, headdim)
In autosplit mode, block_table[b, ...] was not guaranteed to be on k_cache.device.
Minimal fix
This local patch fixed the issue for me:
phys_blocks = block_table[b, :num_blocks_needed].to(k_cache.device, non_blocking = True)
Then indexing proceeds normally:
k_buf = k_cache[phys_blocks].reshape(-1, nheads_k, headdim)
v_buf = v_cache[phys_blocks].reshape(-1, nheads_k, headdim)
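For reference, here is a minimal standalone sketch of the patched gather, outside of ExLlamaV3. The helper name gather_paged_kv, the cache layout (num_blocks, page_size, nheads_k, headdim), and the toy sizes are assumptions for illustration and are not taken from attn.py. On a machine with fewer than two GPUs everything lands on the CPU and the .to() call is a no-op, so the sketch only shows that the patched lines behave the same in the single-device case.

```python
import torch

def gather_paged_kv(k_cache, v_cache, block_table, b, num_blocks_needed):
    # Hypothetical helper wrapping the patched lines from this issue.
    # Assumed cache layout: (num_blocks, page_size, nheads_k, headdim).
    nheads_k, headdim = k_cache.shape[-2], k_cache.shape[-1]
    # Move the block indices to the device that holds this cache shard
    # before using them to index the cache.
    phys_blocks = block_table[b, :num_blocks_needed].to(k_cache.device, non_blocking=True)
    k_buf = k_cache[phys_blocks].reshape(-1, nheads_k, headdim)
    v_buf = v_cache[phys_blocks].reshape(-1, nheads_k, headdim)
    return k_buf, v_buf

# Toy setup: put the cache and the block table on different GPUs when two
# are available, otherwise fall back to CPU for both.
multi_gpu = torch.cuda.device_count() > 1
cache_device = torch.device("cuda:1" if multi_gpu else "cpu")
table_device = torch.device("cuda:0" if multi_gpu else "cpu")

num_blocks, page_size, nheads_k, headdim = 8, 4, 2, 16
k_cache = torch.randn(num_blocks, page_size, nheads_k, headdim, device=cache_device)
v_cache = torch.randn_like(k_cache)
# int64 indices are used here for portability across PyTorch versions.
block_table = torch.tensor([[5, 2, 7, 0]], dtype=torch.long, device=table_device)

k_buf, v_buf = gather_paged_kv(k_cache, v_cache, block_table, b=0, num_blocks_needed=3)
```

With two GPUs, dropping the .to(...) call in the helper should reproduce the RuntimeError quoted above, since the indices would then sit on cuda:0 while the cache sits on cuda:1.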
Why I think this is upstream-worthy
- This is not specific to my experiment logic.
- It is a device-placement bug in the fallback itself.
- The fix is small and local.
- It only became visible because I forced SDPA fallback while working around a separate FlashAttention issue.
Notes
I used:
to route around a different illegal-memory-access bug in the FlashAttention paged-KV path. Once that workaround was enabled, this autosplit device mismatch became the next blocker.
Suggested issue title
SDPA paged-KV fallback uses block_table indices from the wrong device in autosplit multi-GPU runs