ExLlamaV3: SDPA paged-KV fallback indexes cache with block_table on the wrong GPU in autosplit mode #196

Description

@dnhkng

Summary

I hit a multi-GPU autosplit bug in ExLlamaV3's SDPA paged-KV fallback.

In flash_attn_with_kvcache_sdpa_fallback, block_table can live on a different CUDA device than the local cache shard (k_cache / v_cache). The fallback then indexes k_cache directly with indices from block_table, which raises a device mismatch error.

This showed up while I was debugging a separate FlashAttention relayer crash, but the fallback bug is independent of that crash and appears to be a straightforward correctness issue in autosplit mode.

Environment

  • ExLlamaV3 local tree: /home/grace/exllamav3
  • Newer local branch/runtime than the older stack I had previously used for MiniMax-M2.5 runs
  • Autosplit across 2x NVIDIA GH200 96GB
  • CUDA 12.8 container runtime
  • Model: MiniMax-M2.7 EXL3 5.0bpw

Failure

Failure site:

  • exllamav3/modules/attn.py
  • function: flash_attn_with_kvcache_sdpa_fallback

Observed error:

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:0)

Relevant lines before patch:

phys_blocks = block_table[b, :num_blocks_needed]
k_buf = k_cache[phys_blocks].reshape(-1, nheads_k, headdim)

In autosplit mode, block_table[b, ...] was not guaranteed to be on k_cache.device.
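The mismatch can be illustrated outside ExLlamaV3 with a small stand-in tensor. This is only a sketch: the function name `index_across_devices`, the cache shape, and the index values are made up for illustration and are not taken from `attn.py`. It assumes a cache laid out as (num_pages, page_size, nheads_k, headdim), and the failing case needs two visible CUDA devices.

```python
import torch

def index_across_devices(cache_device, index_device):
    """Index a stand-in cache with page indices that live on index_device."""
    # Illustrative shape (assumption): (num_pages, page_size, nheads_k, headdim)
    k_cache = torch.zeros(8, 4, 2, 16, device=cache_device)
    # Stand-in block table for a single sequence (values are arbitrary)
    block_table = torch.tensor([[3, 1, 0]], device=index_device)
    phys_blocks = block_table[0, :2]
    # Fails when phys_blocks is on a different CUDA device than k_cache
    return k_cache[phys_blocks]

if torch.cuda.device_count() >= 2:
    try:
        index_across_devices("cuda:0", "cuda:1")
    except RuntimeError as e:
        print("device mismatch:", e)
    # CPU-resident indices are accepted when indexing a CUDA tensor
    index_across_devices("cuda:0", "cpu")
else:
    print("two CUDA devices are needed to reproduce the mismatch")
```

On a single-device or CPU-only machine the script only prints the notice, since the indices and the cache cannot end up on different CUDA devices there.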

Minimal fix

This local patch fixed the issue for me:

phys_blocks = block_table[b, :num_blocks_needed].to(k_cache.device, non_blocking = True)

Then indexing proceeds normally:

k_buf = k_cache[phys_blocks].reshape(-1, nheads_k, headdim)
v_buf = v_cache[phys_blocks].reshape(-1, nheads_k, headdim)
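For reference, here is the same pattern as a standalone sketch that runs on CPU. The helper name `gather_pages` and the tensor shapes are hypothetical and chosen only for illustration. The real fallback does more than this, and the cache layout (num_pages, page_size, nheads_k, headdim) is an assumption inferred from the reshape above.

```python
import torch

def gather_pages(k_cache, v_cache, block_table, b, num_blocks_needed):
    """Hypothetical helper: gather the K/V pages of sequence b from the local cache shard."""
    # Move the page indices to the cache's device before indexing.
    # This is a no-op when block_table already lives on that device.
    phys_blocks = block_table[b, :num_blocks_needed].to(k_cache.device, non_blocking=True)
    nheads_k, headdim = k_cache.shape[-2], k_cache.shape[-1]
    k_buf = k_cache[phys_blocks].reshape(-1, nheads_k, headdim)
    v_buf = v_cache[phys_blocks].reshape(-1, nheads_k, headdim)
    return k_buf, v_buf

# CPU-only usage with made-up shapes: 8 pages of 4 tokens, 2 KV heads, head dim 16
k_cache = torch.arange(8 * 4 * 2 * 16, dtype=torch.float32).reshape(8, 4, 2, 16)
v_cache = -k_cache
block_table = torch.tensor([[3, 1, 0]])
k_buf, v_buf = gather_pages(k_cache, v_cache, block_table, b=0, num_blocks_needed=2)
print(k_buf.shape)  # torch.Size([8, 2, 16])
```

Because the indices follow whatever device the local shard is on, the same call should work whether `block_table` starts on the CPU, on the same GPU, or on a different GPU.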

Why I think this is upstream-worthy

  • This is not specific to my experiment logic.
  • It is a device-placement bug in the fallback itself.
  • The fix is small and local.
  • It only became visible because I forced SDPA fallback while working around a separate FlashAttention issue.

Notes

I used:

EXL3_FORCE_PAGED_SDPA=1

to route around a different illegal-memory-access bug in the FlashAttention paged-KV path. Once that workaround was enabled, this autosplit device mismatch became the next blocker.

Suggested issue title

SDPA paged-KV fallback uses block_table indices from the wrong device in autosplit multi-GPU runs