ExLlamaV3: SDPA paged-KV fallback indexes cache with block_table on the wrong GPU in autosplit mode

## Summary

I hit a multi-GPU autosplit bug in ExLlamaV3's SDPA paged-KV fallback.

In `flash_attn_with_kvcache_sdpa_fallback`, `block_table` can live on a different CUDA device than the local cache shard (`k_cache` / `v_cache`). The fallback then indexes `k_cache` directly with indices from `block_table`, which raises a device mismatch error.

This showed up while I was debugging a separate FlashAttention relayer crash, but the fallback bug exists independently of that crash and appears to be a straightforward correctness issue in autosplit mode.

## Environment

- ExLlamaV3 local tree: `/home/grace/exllamav3`
- Local branch and runtime newer than the stack I previously used for MiniMax-M2.5 runs
- Autosplit across 2x NVIDIA GH200 96GB
- CUDA 12.8 container runtime
- Model: MiniMax-M2.7 EXL3 5.0bpw

## Failure

Failure site:

- `exllamav3/modules/attn.py`
- function: `flash_attn_with_kvcache_sdpa_fallback`

Observed error:

```text
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:0)
```

Relevant lines before the patch:

```python
phys_blocks = block_table[b, :num_blocks_needed]
k_buf = k_cache[phys_blocks].reshape(-1, nheads_k, headdim)
```

In autosplit mode, `block_table[b, ...]` was not guaranteed to be on `k_cache.device`.

## Minimal fix

This local patch fixed the issue for me:

```python
phys_blocks = block_table[b, :num_blocks_needed].to(k_cache.device, non_blocking = True)
```

Then indexing proceeds normally:

```python
k_buf = k_cache[phys_blocks].reshape(-1, nheads_k, headdim)
v_buf = v_cache[phys_blocks].reshape(-1, nheads_k, headdim)
```
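For reference, here is a minimal standalone sketch of the patched gather step. It is not the actual ExLlamaV3 code: the helper name `gather_paged_kv` is hypothetical, and the cache layout `(num_pages, page_size, nheads_k, headdim)` is an assumption. The toy data runs on CPU, where the `.to(...)` call is a no-op; in an autosplit run it would move the index row onto the local cache shard's device before indexing.

```python
import torch

def gather_paged_kv(k_cache, v_cache, block_table, b, num_blocks_needed, nheads_k, headdim):
    """Hypothetical helper: gather the K/V pages for batch row b into flat buffers.

    Assumes k_cache and v_cache are shaped (num_pages, page_size, nheads_k, headdim)
    and live on the same device as each other.
    """
    # Move the small index row to the cache shard's device before indexing.
    # This is a no-op when block_table is already on that device.
    phys_blocks = block_table[b, :num_blocks_needed].to(k_cache.device, non_blocking = True)
    k_buf = k_cache[phys_blocks].reshape(-1, nheads_k, headdim)
    v_buf = v_cache[phys_blocks].reshape(-1, nheads_k, headdim)
    return k_buf, v_buf

# Toy data on CPU (sizes are arbitrary, chosen only for illustration).
num_pages, page_size, nheads_k, headdim = 8, 4, 2, 16
k_cache = torch.randn(num_pages, page_size, nheads_k, headdim)
v_cache = torch.randn(num_pages, page_size, nheads_k, headdim)
block_table = torch.tensor([[5, 2, 7, 0]])  # one sequence, four logical blocks

# Gather the first three logical blocks of sequence 0: physical pages 5, 2, 7.
k_buf, v_buf = gather_paged_kv(k_cache, v_cache, block_table, 0, 3, nheads_k, headdim)
```

Because the index row only holds a handful of integers per sequence, copying it should be much cheaper than moving any part of the cache itself.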

## Why I think this is upstream-worthy

- This is not specific to my experiment logic.
- It is a device-placement bug in the fallback itself.
- The fix is small and local.
- It only became visible because I forced SDPA fallback while working around a separate FlashAttention issue.

## Notes

I used:

```bash
EXL3_FORCE_PAGED_SDPA=1
```

to route around a different illegal-memory-access bug in the FlashAttention paged-KV path. Once that workaround was enabled, this autosplit device mismatch became the next blocker.

## Suggested issue title

`SDPA paged-KV fallback uses block_table indices from the wrong device in autosplit multi-GPU runs`

