Found during the recurring babysit onboarding sweep on qwen3.6-27b-8bit (PyPI rapid-mlx 0.6.66). Filing these five issues as one consolidated tracker since they cluster around the legacy /v1/completions endpoint and request-boundary validation. Channel-routed chat on this hybrid model works fine; the gaps are on the legacy/boundary surface.
1. /v1/completions echo:true silently ignored
curl -X POST .../v1/completions -d '{"model":"qwen3.6-27b-8bit","prompt":"Once upon a time","max_tokens":15,"echo":true}'
Expected: response text begins with "Once upon a time". Actual: the text contains only the continuation. echo is accepted by the schema but never honored.
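A minimal sketch of the expected echo behavior, assuming the server has the prompt and the generated continuation available as plain strings when it builds each choice. The helper name apply_echo is hypothetical and is not taken from rapid-mlx.

```python
def apply_echo(prompt: str, completion: str, echo: bool = False) -> str:
    """Return the text for one choice, prepending the prompt when echo is set."""
    # Hypothetical helper: with echo=True the prompt comes back verbatim,
    # followed by the generated continuation.
    return prompt + completion if echo else completion


text = apply_echo("Once upon a time", ", there was a fox.", echo=True)
```

With echo left at its default, the same call returns only the continuation, which matches what the endpoint does today for every request.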
2. /v1/completions logprobs schema mismatch with OpenAI
OpenAI spec: logprobs: int (0-5). rapid-mlx declares logprobs: bool. Sending "logprobs":3 (the canonical OpenAI form) → HTTP 422 bool_parsing. Sending "logprobs":true is accepted, but no logprobs field appears in the response, so the parameter is unusable in either form.
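One way the request-side check could look, written with the standard library only so it runs without the server's dependencies. The function name parse_logprobs is hypothetical, and the 0-5 range is the one quoted above. Rejecting booleans is a design choice in this sketch, not a confirmed requirement; a transition period might instead map true to a small integer.

```python
from typing import Any, Optional

MAX_LOGPROBS = 5  # upper bound quoted above for the OpenAI completions schema


def parse_logprobs(value: Any) -> Optional[int]:
    """Validate `logprobs` as an optional integer in [0, MAX_LOGPROBS]."""
    if value is None:
        return None
    # bool is a subclass of int in Python, so it has to be rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError("logprobs must be an integer between 0 and 5")
    if not 0 <= value <= MAX_LOGPROBS:
        raise ValueError("logprobs must be an integer between 0 and 5")
    return value
```

Under this sketch "logprobs":3 passes validation. Populating the logprobs field in the response is a separate piece of work that the sketch does not cover.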
3. /v1/completions streaming per-chunk id rotation
Each SSE data: chunk gets a fresh UUID (cmpl-d62f..., cmpl-8f7b...) instead of sharing one stream id. The OpenAI streaming spec requires all chunks of one completion to share the same id. Clients that key on id for dedup/aggregation will break.
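A sketch of the expected streaming shape, assuming the generator receives the text pieces to emit. The function name sse_completion_chunks and the exact chunk fields are illustrative, not rapid-mlx's actual code. The only point being made is that the id is created once, before the loop, rather than once per chunk.

```python
import json
import uuid
from typing import Iterable, Iterator


def sse_completion_chunks(pieces: Iterable[str], model: str) -> Iterator[str]:
    """Yield SSE lines for one completion; every chunk reuses one id."""
    stream_id = f"cmpl-{uuid.uuid4().hex}"  # generated once, outside the loop
    for piece in pieces:
        chunk = {
            "id": stream_id,
            "object": "text_completion",
            "model": model,
            "choices": [{"index": 0, "text": piece, "finish_reason": None}],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"


lines = list(sse_completion_chunks(["Once", " upon", " a time"], "qwen3.6-27b-8bit"))
```

A regression test can parse each data: line and assert that the set of ids has exactly one element.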
4. n=0 accepted as n=1 silently
Boundary validation gap. n=0 should be rejected (Pydantic ge=1), but the route accepts it and returns 1 choice. n=null and n=1 behave correctly; n>=2 is correctly rejected with HTTP 400.
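A standard-library sketch of the lower-bound check, mirroring what a Pydantic ge=1 constraint would enforce. The name parse_n is hypothetical. One common way for 0 to slip through is a falsy fallback such as `n or 1`, which turns 0 into 1; whether that is the actual cause here has not been confirmed. The sketch only covers the lower bound and leaves the existing rejection of n>=2 to the route.

```python
from typing import Any


def parse_n(value: Any) -> int:
    """Resolve `n`: None defaults to 1, anything below 1 is rejected."""
    if value is None:  # explicit None check, so 0 is not treated as "unset"
        return 1
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError("n must be an integer >= 1")
    if value < 1:
        raise ValueError("n must be an integer >= 1")
    return value
```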
5. qwen3.6 default thinking-marker leak in content
Default request (no enable_thinking flag) returns content like "Here's a thinking process:\n\n1. Analyze..." with no reasoning_content populated. Explicit enable_thinking:true parses correctly. Either qwen3.6's default differs from qwen3.5's, or our alias's recommended_template_kwargs is omitting enable_thinking. UX hit: a default user sees raw reasoning bleeding into content.
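If the second hypothesis holds, one possible mitigation is to make enable_thinking explicit whenever template kwargs are resolved, so that the template and the reasoning parser never disagree about the default. The sketch below is an assumption-laden illustration: resolve_template_kwargs and fallback_enable_thinking are hypothetical names, and which fallback value is correct for qwen3.6 is not decided here.

```python
from typing import Any, Dict, Optional


def resolve_template_kwargs(
    alias_defaults: Optional[Dict[str, Any]],
    request_kwargs: Optional[Dict[str, Any]],
    fallback_enable_thinking: bool,
) -> Dict[str, Any]:
    """Merge template kwargs so enable_thinking is always set explicitly."""
    # Precedence, lowest to highest: fallback, alias defaults, request values.
    merged: Dict[str, Any] = {"enable_thinking": fallback_enable_thinking}
    merged.update(alias_defaults or {})
    merged.update(request_kwargs or {})
    return merged
```

The same resolved value can then be handed to both the chat template and the parser that fills reasoning_content.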
Out of scope for the in-flight fix (#460 is harmony-specific channel-routing); filing as a consolidated tracker for future iterations.