Hi all, I've been testing torch.compile on SGLang with Gemma 3 12B, and noticed some significant startup time differences compared to vLLM.
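For context, here's roughly how I'm enabling compile on each side (a minimal sketch, one engine at a time; I'm assuming the offline Python APIs accept the same knobs as the server flags, and the exact parameter names may differ by version):

```python
# vLLM: compilation level 3 is, as I understand it, the piecewise
# torch.compile + CUDA graph preset (the -O3 path); treat the exact
# knob name as an assumption on my part.
from vllm import LLM

vllm_engine = LLM(model="google/gemma-3-12b-it", compilation_config=3)

# SGLang: compile is opt-in and capped by batch size; it also implies
# CUDA graph capture for the compiled batch sizes.
import sglang as sgl

sgl_engine = sgl.Engine(
    model_path="google/gemma-3-12b-it",
    enable_torch_compile=True,
    torch_compile_max_bs=16,
)
```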
What I'm seeing
I'm seeing 5-15% performance gains from compile at lower batch sizes (bs < 16), so I'd like to use it, but the startup cost is steep.
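If anyone wants to reproduce, this is roughly how I'm timing startup (a sketch; it assumes `sgl.Engine` exposes `shutdown()` and accepts the kwargs above):

```python
import time
import sglang as sgl

def timed_startup(**kwargs):
    # Measure wall-clock time from engine construction to ready.
    t0 = time.perf_counter()
    engine = sgl.Engine(model_path="google/gemma-3-12b-it", **kwargs)
    print(f"startup with {kwargs or 'defaults'}: {time.perf_counter() - t0:.1f}s")
    engine.shutdown()

timed_startup()                            # baseline, no compile
timed_startup(enable_torch_compile=True,   # compile path
              torch_compile_max_bs=16)
```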
My guess
vLLM uses piecewise compilation by default, which compiles much faster than a full graph. In SGLang, compile appears to be tied to CUDA graph capture, so piecewise compile is only available together with piecewise CUDA graphs, and the extra launch overhead of piecewise CUDA graphs might negate the compile gains anyway.
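To make the distinction concrete, here's a toy illustration of full-graph vs. piecewise compile using plain `torch.compile` (not vLLM's actual mechanism, which as I understand it splits the graph around attention ops so CUDA graphs can be captured per piece; mine just splits at module boundaries):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.lin(x))

blocks = nn.Sequential(*[Block(64) for _ in range(4)])

# Full-graph: one compiled artifact for the whole model;
# best runtime, longest compile time.
full = torch.compile(blocks, fullgraph=True)

# Piecewise: compile each block separately; identical small graphs
# compile fast (and can hit the compile cache), with eager glue
# between the pieces.
piecewise = nn.Sequential(*(torch.compile(b) for b in blocks))

x = torch.randn(8, 64)
full(x)
piecewise(x)
```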
I understand "beat torch compile" is the long-term direction(#4748) and compile isn't really the focus right now. But given the gains I'm seeing on some models, I'm curious: does anyone know what's actually different between vLLM and SGLang's compile implementations here?
Thanks!