Summary
MLLMScheduler (the scheduler for vision/multimodal models — gemma-4-12b, qwen-vl, etc.) has zero penalty wiring. All three OpenAI-compatible penalty params are silently ignored on requests routed to it.
$ grep -c "make_logits_processor\|presence_penalty\|frequency_penalty\|repetition_penalty" vllm_mlx/mllm_scheduler.py
0
Surfaced by the round-2 codex review of PR #510. Pre-existing — not introduced by the #470 fix.
Repro
Send frequency_penalty=2.0 to any vision model (e.g. gemma-4-12b) via /v1/chat/completions. The penalty has no observable effect on output — same content with or without the param. The text-model scheduler (vllm_mlx/scheduler.py) correctly applies penalties since #355 + #510; vision models silently drop them.
Root cause
vllm_mlx/mllm_scheduler.py has three blockers, all visible in current main:
- Hardcoded sampler (line 286-287): a TODO comment from the original author already acknowledges the gap:
sampler = make_sampler(temp=0.7, top_p=0.9)
# Default sampler (can be overridden per-request in future)
- API signature only accepts
temperature + top_p (lines 309-310, 921-922) — the penalty fields aren't surfaced as scheduler-level kwargs.
- No
make_logits_processors call anywhere in the file — the text-scheduler's whole penalty-wiring block (vllm_mlx/scheduler.py:2783-2814) has no analogue here.
Proposed fix (sketch — 3 layers)
routes/chat.py MLLM branch: forward presence_penalty / frequency_penalty / repetition_penalty into the scheduler call (analogous to text branch).
MLLMScheduler.__init__ / generate API: accept the three penalty fields, store on SamplingParams.
- Replace the hardcoded
make_sampler(...) with a per-request sampler + logits_processors extended with make_logits_processors(...) matching scheduler.py:2783-2814 (including presence_context_size / frequency_context_size matching the text-scheduler value).
Priority
Low. No user reports — vision-model users with frequency_penalty are a small subset. Filing for visibility so it gets picked up in a future MLLM refactor, or by anyone hitting the silent no-op.
Related
Summary
MLLMScheduler(the scheduler for vision/multimodal models — gemma-4-12b, qwen-vl, etc.) has zero penalty wiring. All three OpenAI-compatible penalty params are silently ignored on requests routed to it.$ grep -c "make_logits_processor\|presence_penalty\|frequency_penalty\|repetition_penalty" vllm_mlx/mllm_scheduler.py 0Surfaced by the round-2 codex review of PR #510. Pre-existing — not introduced by the #470 fix.
Repro
Send
frequency_penalty=2.0to any vision model (e.g.gemma-4-12b) via/v1/chat/completions. The penalty has no observable effect on output — same content with or without the param. The text-model scheduler (vllm_mlx/scheduler.py) correctly applies penalties since #355 + #510; vision models silently drop them.Root cause
vllm_mlx/mllm_scheduler.pyhas three blockers, all visible in currentmain:temperature+top_p(lines 309-310, 921-922) — the penalty fields aren't surfaced as scheduler-level kwargs.make_logits_processorscall anywhere in the file — the text-scheduler's whole penalty-wiring block (vllm_mlx/scheduler.py:2783-2814) has no analogue here.Proposed fix (sketch — 3 layers)
routes/chat.pyMLLM branch: forwardpresence_penalty / frequency_penalty / repetition_penaltyinto the scheduler call (analogous to text branch).MLLMScheduler.__init__/ generate API: accept the three penalty fields, store onSamplingParams.make_sampler(...)with a per-request sampler +logits_processorsextended withmake_logits_processors(...)matchingscheduler.py:2783-2814(includingpresence_context_size/frequency_context_sizematching the text-scheduler value).Priority
Low. No user reports — vision-model users with
frequency_penaltyare a small subset. Filing for visibility so it gets picked up in a future MLLM refactor, or by anyone hitting the silent no-op.Related