There is a chat completions error caused when running Gemma4. Possibly with other vision models but I haven't tested it. Sending images now crashes the generator.
- Regression Introduced in: v0.0.35
- Last known working version: v0.0.34
- Still broken v0.0.43
- Error: AssertionError: self.position == stashed["position"] in sliding_attn.py:69
Reproduction:
Running this model Gemma4-26b-A4B-it 5.10bpw works mostly fine. However, when you send an image to the vision model, it will first respond accordingly. But the message immediately following will throw the error. A retry will fix the issue an the model will begin chatting once more.
However, sending a second image will throw the error once more and this error will NOT go away until the entire backend is restarted.
Confirmed by:
Running TabbyAPI on the latest version (using exllama v0.0.43). I tested it on both Sillytavern and OpenWebUI to confirm. Rolling back TabbyAPI to commit #fef811d (which uses exl3 v0.0.34) confirms vision works properly without issues.
Traceback:
2026-06-17 22:22:54.725 INFO: Received chat completion streaming request 77a7704e400d48b6957e39e5963a80e4
2026-06-17 22:22:54.727 ERROR: FATAL ERROR with generation. Attempting to recreate the generator. If this fails, please restart the
server.
2026-06-17 22:22:54.728 WARNING: Immediately terminating all jobs. Clients will have their requests cancelled.
2026-06-17 22:22:54.728 ERROR: Error during chat completion
2026-06-17 22:22:54.732 ERROR: Error Traceback (most recent call last):
2026-06-17 22:22:54.732 ERROR: File "./tabbyAPI/endpoints/OAI/utils/chat_completion.py", line 573, in
stream_generate_chat_completion
2026-06-17 22:22:54.732 ERROR: raise generation
2026-06-17 22:22:54.732 ERROR: File "./tabbyAPI/endpoints/OAI/utils/chat_completion.py", line 396, in
_chat_stream_collector
2026-06-17 22:22:54.732 ERROR: async for generation in new_generation:
2026-06-17 22:22:54.732 ERROR: File "./tabbyAPI/backends/exllamav3/model.py", line 885, in stream_generate
2026-06-17 22:22:54.732 ERROR: async for generation_chunk in self.generate_gen(
2026-06-17 22:22:54.732 ERROR: File "./tabbyAPI/backends/exllamav3/model.py", line 1258, in generate_gen
2026-06-17 22:22:54.732 ERROR: raise ex
2026-06-17 22:22:54.732 ERROR: File "./tabbyAPI/backends/exllamav3/model.py", line 1188, in generate_gen
2026-06-17 22:22:54.732 ERROR: async for result in job:
2026-06-17 22:22:54.732 ERROR: File
"./tabbyAPI/venv/lib/python3.12/site-packages/exllamav3/generator/async_generator.py", line 109, in __aiter__
2026-06-17 22:22:54.732 ERROR: raise result
2026-06-17 22:22:54.732 ERROR: File
"./tabbyAPI/venv/lib/python3.12/site-packages/exllamav3/generator/async_generator.py", line 27, in
_run_iteration
2026-06-17 22:22:54.732 ERROR: results = self.generator.iterate()
2026-06-17 22:22:54.732 ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^
2026-06-17 22:22:54.732 ERROR: File
"./tabbyAPI/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
2026-06-17 22:22:54.732 ERROR: return func(*args, **kwargs)
2026-06-17 22:22:54.732 ERROR: ^^^^^^^^^^^^^^^^^^^^^
2026-06-17 22:22:54.732 ERROR: File
"./tabbyAPI/venv/lib/python3.12/site-packages/exllamav3/generator/generator.py", line 341, in iterate
2026-06-17 22:22:54.732 ERROR: self.iterate_start_jobs(results)
2026-06-17 22:22:54.732 ERROR: File
"./tabbyAPI/venv/lib/python3.12/site-packages/exllamav3/generator/generator.py", line 943, in iterate_start_jobs
2026-06-17 22:22:54.732 ERROR: job.allocate_pages()
2026-06-17 22:22:54.732 ERROR: File
"./tabbyAPI/venv/lib/python3.12/site-packages/exllamav3/generator/job.py", line 1209, in allocate_pages
2026-06-17 22:22:54.732 ERROR: self.recurrent_state = self.generator.cache.new_from_stashed(
2026-06-17 22:22:54.732 ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-06-17 22:22:54.732 ERROR: File
"./tabbyAPI/venv/lib/python3.12/site-packages/exllamav3/cache/cache.py", line 323, in new_from_stashed
2026-06-17 22:22:54.732 ERROR: return self.recurrent_state_cls(
2026-06-17 22:22:54.732 ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^
2026-06-17 22:22:54.732 ERROR: File
"./tabbyAPI/venv/lib/python3.12/site-packages/exllamav3/modules/sliding_attn.py", line 39, in __init__
2026-06-17 22:22:54.732 ERROR: self.unstash(stashed)
2026-06-17 22:22:54.732 ERROR: File
"./tabbyAPI/venv/lib/python3.12/site-packages/exllamav3/modules/sliding_attn.py", line 69, in unstash
2026-06-17 22:22:54.732 ERROR: assert self.position == stashed["position"]
2026-06-17 22:22:54.732 ERROR: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2026-06-17 22:22:54.732 ERROR: AssertionError
2026-06-17 22:22:54.740 ERROR: Sent to request: Chat completion aborted. Please check the server console.
Tested on: Ubuntu version Release 26.04.
There is a chat completions error caused when running Gemma4. Possibly with other vision models but I haven't tested it. Sending images now crashes the generator.
Reproduction:
Running this model Gemma4-26b-A4B-it 5.10bpw works mostly fine. However, when you send an image to the vision model, it will first respond accordingly. But the message immediately following will throw the error. A retry will fix the issue an the model will begin chatting once more.
However, sending a second image will throw the error once more and this error will NOT go away until the entire backend is restarted.
Confirmed by:
Running TabbyAPI on the latest version (using exllama v0.0.43). I tested it on both Sillytavern and OpenWebUI to confirm. Rolling back TabbyAPI to commit #fef811d (which uses exl3 v0.0.34) confirms vision works properly without issues.
Traceback:
Tested on: Ubuntu version Release 26.04.