Describe the bug:
Cross-link: theroyallab/tabbyAPI#428
When a streaming job is cancelled mid-generation (the frontend disconnects), the async generator appears to deadlock: no job ever completes afterward. Newly enqueued jobs are accepted but never produce tokens, GPU memory stays fully allocated, and GPU utilization drops to ~0%. Only a full reload recovers it. The serving frontend stays otherwise responsive (e.g. /v1/models keeps returning 200), so it is easy to miss.
Note: commit a03f0ff ("AsyncGenerator: Ensure cancel request is forwarded but don't crash if frontend breaks contract") is already present in 0.0.42, and 0.0.43 has no further generator/cancel changes, so this path still wedges. It looks like the cancelled job isn't reliably reaped from the batch scheduler, stalling the whole queue.
Reproduction steps:
- Load a 27B exl3 model (3.08bpw), 2× GPU tensor split, continuous batching (max_batch_size=6, cache_size=327680, max_seq_len=81920).
- Drive streaming generations through a frontend (here: tabbyAPI).
- Cancel a request mid-generation (client disconnect / "stop" / tab close).
- Observe: subsequent jobs are accepted but never generate any tokens.
Full repro, logs and config: theroyallab/tabbyAPI#428
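
For anyone who wants to trigger the disconnect without a browser, the steps above can be approximated with a small client. This is only a sketch under assumptions: it targets an OpenAI-style `/v1/completions` streaming endpoint, and the host, port, API key, auth header, and payload fields are placeholders that need to be adjusted to the actual frontend config. It is not the exact client used in the linked report.

```python
import http.client
import json


def build_request(prompt, max_tokens=512):
    """Build a streaming completion payload (OpenAI-style schema, assumed)."""
    return json.dumps({"prompt": prompt, "max_tokens": max_tokens, "stream": True})


def stream_and_disconnect(host, port, body, api_key=None,
                          read_lines=5, path="/v1/completions"):
    """Start a streaming request, read a few lines, then drop the connection.

    Closing the socket mid-stream stands in for a tab close / client disconnect.
    Host, port, path and the auth header are assumptions, not confirmed values.
    """
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    conn = http.client.HTTPConnection(host, port, timeout=60)
    try:
        conn.request("POST", path, body=body, headers=headers)
        resp = conn.getresponse()
        lines = []
        for _ in range(read_lines):
            line = resp.readline()
            if not line:
                break
            lines.append(line)
        return resp.status, lines
    finally:
        # Close without draining the stream: this is the "cancel".
        conn.close()


if __name__ == "__main__":
    # Placeholder address; replace with the real frontend host and port.
    status, lines = stream_and_disconnect(
        "127.0.0.1", 5000, build_request("Write a long story."))
    print(status, len(lines))
```

After running it once, send a normal (non-cancelled) request; if the hang reproduces, that request is accepted but never produces tokens.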
Expected behavior:
A cancelled job should be reaped from the batch scheduler; remaining and newly enqueued jobs should continue to generate normally.
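
To make the expected behavior concrete, here is a toy asyncio model of it. `ToyScheduler` and everything in it is hypothetical and written only for illustration; it does not reflect exllamav3's actual generator or scheduler code. The point is just that cancelling one job removes it from the active batch and unblocks its consumer, while other jobs, including ones enqueued after the cancel, run to completion.

```python
import asyncio


class ToyScheduler:
    """Hypothetical stand-in for a batch scheduler (not exllamav3's code)."""

    def __init__(self):
        self.active = {}   # job_id -> tokens still to generate
        self.queues = {}   # job_id -> per-job output queue

    def enqueue(self, job_id, num_tokens):
        self.active[job_id] = num_tokens
        self.queues[job_id] = asyncio.Queue()
        return self.queues[job_id]

    def cancel(self, job_id):
        # Reap the job: drop it from the batch and send the end-of-stream
        # marker so nothing keeps waiting on its queue.
        if self.active.pop(job_id, None) is not None:
            self.queues.pop(job_id).put_nowait(None)

    async def run(self):
        # One "batch step" per iteration: every active job emits one token.
        while self.active:
            for job_id in list(self.active):
                self.active[job_id] -= 1
                self.queues[job_id].put_nowait(f"{job_id}-tok")
                if self.active[job_id] == 0:
                    del self.active[job_id]
                    self.queues.pop(job_id).put_nowait(None)
            await asyncio.sleep(0)  # yield so clients can read or cancel


async def collect(queue):
    tokens = []
    while (tok := await queue.get()) is not None:
        tokens.append(tok)
    return tokens


async def demo():
    sched = ToyScheduler()
    qa = sched.enqueue("a", 100)
    qb = sched.enqueue("b", 3)
    runner = asyncio.create_task(sched.run())
    await qa.get()               # client "a" reads one token...
    sched.cancel("a")            # ...then disconnects mid-generation
    qc = sched.enqueue("c", 2)   # new job arrives after the cancel
    b_tokens = await collect(qb)
    c_tokens = await collect(qc)
    await runner
    return len(b_tokens), len(c_tokens), sorted(sched.active)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this model, jobs "b" and "c" both finish and the active set ends up empty. The bug described here behaves as if the cancelled job were never removed from the batch, so the queue behind it never drains.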
Environment / versions:
exllamav3 0.0.42 (source commit 595d6c4)
torch 2.11.0+cu128, CUDA 12.8, Python 3.11.15
2× GPU (12GB + 8GB), tensor split
model: 27B exl3 @ 3.08bpw, continuous batching
Logs / Additional context:
Full logs and config are in theroyallab/tabbyAPI#428