Hi all, I've been testing torch.compile on SGLang with Gemma 3 12B, and noticed some significant startup time differences compared to vLLM.
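For context, here's roughly how I'm enabling compile on each side (a minimal sketch, one engine at a time; I'm assuming the offline Python APIs accept the same knobs as the server flags, and the exact parameter names may differ by version):

```python
# vLLM: compilation level 3 is, as I understand it, the piecewise
# torch.compile + CUDA graph preset (the -O3 path); treat the exact
# knob name as an assumption on my part.
from vllm import LLM

vllm_engine = LLM(model="google/gemma-3-12b-it", compilation_config=3)

# SGLang: compile is opt-in and capped by batch size; it also implies
# CUDA graph capture for the compiled batch sizes.
import sglang as sgl

sgl_engine = sgl.Engine(
    model_path="google/gemma-3-12b-it",
    enable_torch_compile=True,
    torch_compile_max_bs=16,
)
```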
What I'm seeing
I'm seeing 5-15% performance gains from compile at lower batch sizes (bs < 16), so I'd like to use it, but the startup cost is steep.
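If anyone wants to reproduce, this is roughly how I'm timing startup (a sketch; it assumes `sgl.Engine` exposes `shutdown()` and accepts the kwargs above):

```python
import time
import sglang as sgl

def timed_startup(**kwargs):
    # Measure wall-clock time from engine construction to ready.
    t0 = time.perf_counter()
    engine = sgl.Engine(model_path="google/gemma-3-12b-it", **kwargs)
    print(f"startup with {kwargs or 'defaults'}: {time.perf_counter() - t0:.1f}s")
    engine.shutdown()

timed_startup()                            # baseline, no compile
timed_startup(enable_torch_compile=True,   # compile path
              torch_compile_max_bs=16)
```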
My guess
vLLM uses piecewise compilation by default, which compiles much faster than a full graph. In SGLang, compile appears to be tied to CUDA graph capture, so piecewise compile is only available together with piecewise CUDA graphs, and the extra launch overhead of piecewise CUDA graphs might negate the compile gains anyway.
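To make the distinction concrete, here's a toy illustration of full-graph vs. piecewise compile using plain `torch.compile` (not vLLM's actual mechanism, which as I understand it splits the graph around attention ops so CUDA graphs can be captured per piece; mine just splits at module boundaries):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.lin(x))

blocks = nn.Sequential(*[Block(64) for _ in range(4)])

# Full-graph: one compiled artifact for the whole model;
# best runtime, longest compile time.
full = torch.compile(blocks, fullgraph=True)

# Piecewise: compile each block separately; identical small graphs
# compile fast (and can hit the compile cache), with eager glue
# between the pieces.
piecewise = nn.Sequential(*(torch.compile(b) for b in blocks))

x = torch.randn(8, 64)
full(x)
piecewise(x)
```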
I understand "beat torch compile" is the long-term direction(#4748) and compile isn't really the focus right now. But given the gains I'm seeing on some models, I'm curious: does anyone know what's actually different between vLLM and SGLang's compile implementations here?
Thanks!