Yeah, it is WIP, but I figure that if nobody reports it, it might go under the radar with all the MoE changes. I tested v.28, master and dev. On dev, non-TP gave me 1.2 t/s on 4x3090, about what I'd expect from a model running on CPU. Then I tried native TP as normal and it produced the error log below. NCCL TP still outputs correctly.
Also, for some reason encode_special_tokens still adds a BOS token to all tokenizations, even though SillyTavern now sends add_bos_token as False (see the logged request below). This means all SillyTavern token bans and anything related are wrong. With encode_special_tokens disabled, <s> is tokenized correctly, so I don't get what this setting is for. I can try to PR them to add the parameter and set it to false, but I feel like this can burn any front end that tokenizes via tabby.
2026-05-03 07:42:37.761 INFO: Headers: {'accept': '*/*', 'accept-encoding': 'gzip, deflate, br', 'authorization': 'Bearer befed36be355afb56f593ff82e18dd93', 'content-length': '36', 'content-type': 'application/json', 'user-agent': 'node-fetch', 'x-api-key': 'befed36be355afb56f593ff82e18dd93', 'host': '192.168.1.211:5000', 'connection': 'close'}
2026-05-03 07:42:37.761 INFO: Body: {'text': '<s>', 'add_bos_token': False}
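For anyone who wants to check the BOS behaviour without going through SillyTavern, here is a minimal sketch that replays the logged body by hand. Several things in it are my assumptions and are not shown in the log: the endpoint path `/v1/token/encode`, a JSON reply containing a `tokens` list, and a BOS id of 1 (typical for Llama-style tokenizers, so check your model). The helper names are made up for illustration.

```python
import json
import urllib.request


def build_encode_request(base_url, api_key, text, add_bos_token=False):
    """Build (but do not send) a POST carrying the same body as the logged request."""
    body = json.dumps({"text": text, "add_bos_token": add_bos_token}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/token/encode",  # assumed endpoint path
        data=body,
        headers={"content-type": "application/json", "x-api-key": api_key},
        method="POST",
    )


def fetch_tokens(request, timeout=10.0):
    """Send the request and return token ids (assumes a JSON reply with a "tokens" list)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["tokens"]


def starts_with_bos(tokens, bos_id):
    """True if the first token id equals the BOS id."""
    return len(tokens) > 0 and tokens[0] == bos_id


# Example (not run here; host and key are placeholders, bos_id=1 is an assumption):
# req = build_encode_request("http://127.0.0.1:5000", "YOUR_API_KEY", "hello")
# print(starts_with_bos(fetch_tokens(req), bos_id=1))
```

Probing with plain text such as "hello" rather than "<s>" keeps the check unambiguous: with add_bos_token set to False, any leading BOS id in the reply was added by the server and did not come from the text itself.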
## Exception in child process
Traceback (most recent call last):
  File "/home/supermicro/miniconda3/envs/cuda12/lib/python3.11/site-packages/exllamav3/model/model_tp_fn.py", line 81, in mp_model_worker
    result = func(local_context, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/supermicro/miniconda3/envs/cuda12/lib/python3.11/site-packages/exllamav3/model/model_tp_fn.py", line 215, in mp_model_forward
    x = module.forward(x, params)
        ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/supermicro/miniconda3/envs/cuda12/lib/python3.11/site-packages/exllamav3/modules/transformer.py", line 167, in forward
    y = self.mlp.forward(y, params)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/supermicro/miniconda3/envs/cuda12/lib/python3.11/site-packages/exllamav3/modules/mlp.py", line 675, in forward
    params["backend"].all_reduce(d)
  File "/home/supermicro/miniconda3/envs/cuda12/lib/python3.11/site-packages/exllamav3/model/model_tp_backend.py", line 333, in all_reduce
    ext.pg_all_reduce_cpu(
RuntimeError: Synchronization timeout
----------------------------------------
## Synchronization timeout in kernel: pg_all_reduce_cpu_kernel
----------------------------------------
## Exception in child process
Traceback (most recent call last):
  File "/home/supermicro/miniconda3/envs/cuda12/lib/python3.11/site-packages/exllamav3/model/model_tp_fn.py", line 81, in mp_model_worker
    result = func(local_context, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/supermicro/miniconda3/envs/cuda12/lib/python3.11/site-packages/exllamav3/model/model_tp_fn.py", line 215, in mp_model_forward
    x = module.forward(x, params)
        ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/supermicro/miniconda3/envs/cuda12/lib/python3.11/site-packages/exllamav3/modules/gather.py", line 65, in forward
    backend.gather(x, out_tensor, self.gather_devices, self.output_device, self.ldims)
  File "/home/supermicro/miniconda3/envs/cuda12/lib/python3.11/site-packages/exllamav3/model/model_tp_backend.py", line 372, in gather
    ext.pg_gather(
RuntimeError: Synchronization timeout