Problem
Upstream llama.cpp supports several speculative decoding strategies, but llamadart's native llama.cpp generation path currently exposes none of them through its public API.
Existing issue #168 tracks MTP specifically. This issue tracks the rest of the llama.cpp speculative decoding surface, especially draft-model and n-gram strategies.
Current local evidence:
GenerationParams has no speculative decoding options.
LlamaCppService.generate(...) initializes a normal sampler, then samples one token and makes one llama_decode(...) call per loop iteration.
No Dart API exists for an external draft model, draft token counts, n-gram speculative strategy selection, or speculative acceptance metrics.
Upstream llama.cpp docs list speculative types such as draft-simple, draft-mtp, ngram-cache, ngram-simple, ngram-map-k, ngram-map-k4v, and ngram-mod.
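To make the gap concrete, here is a toy sketch of what a speculative step does differently from the one-token-per-decode loop described above. It is an illustration only, not upstream's implementation: `NextToken`, `draftFromNgram`, `StepResult`, and `speculativeStep` are hypothetical names, the target model is stood in for by a plain function, and the drafter is a naive n-gram lookup over the existing context.

```dart
/// Hypothetical stand-in for one greedy target-model step.
typedef NextToken = int Function(List<int> context);

/// Proposes up to [maxDraft] tokens by finding the most recent earlier
/// occurrence of the last [n] tokens and copying what followed it.
List<int> draftFromNgram(List<int> context, {int n = 2, int maxDraft = 4}) {
  if (context.length <= n) return const [];
  final tail = context.sublist(context.length - n);
  for (var start = context.length - n - 1; start >= 0; start--) {
    var matches = true;
    for (var j = 0; j < n; j++) {
      if (context[start + j] != tail[j]) {
        matches = false;
        break;
      }
    }
    if (!matches) continue;
    final from = start + n;
    final end = from + maxDraft;
    return context.sublist(from, end < context.length ? end : context.length);
  }
  return const [];
}

/// Outcome of one speculative step.
class StepResult {
  StepResult(this.emitted, this.drafted, this.accepted);
  final List<int> emitted; // tokens appended to the output in this step
  final int drafted; // draft tokens proposed
  final int accepted; // draft tokens the target agreed with
}

/// Keeps the longest draft prefix the target agrees with, then adds one
/// token chosen by the target itself. A real implementation would verify
/// the whole draft in a single batched decode; this sketch calls the
/// target once per token purely for readability.
StepResult speculativeStep(List<int> context, NextToken target) {
  final draft = draftFromNgram(context);
  final emitted = <int>[];
  var accepted = 0;
  for (final token in draft) {
    final expected = target([...context, ...emitted]);
    if (expected != token) {
      emitted.add(expected); // first mismatch: use the target's token
      return StepResult(emitted, draft.length, accepted);
    }
    emitted.add(token);
    accepted++;
  }
  emitted.add(target([...context, ...emitted])); // token after a full accept
  return StepResult(emitted, draft.length, accepted);
}
```

With a greedy target the emitted tokens match what plain decoding would produce; the potential win is that several tokens can come out of one verification pass when the draft is accepted.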
Proposed work
Design a public speculative decoding API that can cover llama.cpp strategies without forcing LiteRT-LM semantics into the same shape prematurely.
Keep #168 focused on draft-mtp; use this issue for external draft-model and n-gram strategies unless the implementation naturally unifies them.
Decide whether upstream common/speculative.* should be wrapped in llamadart-native or reimplemented at the Dart/native-service layer. Prefer native wrapper ownership if upstream common code remains the source of truth.
Add native wrapper symbols and Dart bindings for strategy selection, draft limits, draft model loading where applicable, and stats collection.
Preserve current default behavior with speculation disabled.
Gate incompatible options, such as unsupported web/runtime assets, multimodal combinations, state persistence, prompt-prefix reuse, or grammar paths until validated.
Add diagnostics for generated/accepted draft tokens and acceptance rate.
Add local-only E2E scenarios for at least one n-gram strategy and one external draft-model strategy.
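One possible shape for the public options, as a starting point for discussion rather than a settled design. Everything here is hypothetical: `SpeculativeStrategy`, `SpeculativeDecodingParams`, and their fields do not exist in llamadart today. The string values mirror the speculative type names listed in the upstream docs, draft-mtp is left out because #168 tracks it, and the default is disabled so existing behavior is preserved.

```dart
/// Hypothetical strategy enum. `upstreamName` mirrors the upstream
/// llama.cpp type names; `disabled` is this sketch's own sentinel and
/// has no upstream name.
enum SpeculativeStrategy {
  disabled(null),
  draftSimple('draft-simple'),
  ngramCache('ngram-cache'),
  ngramSimple('ngram-simple'),
  ngramMapK('ngram-map-k'),
  ngramMapK4v('ngram-map-k4v'),
  ngramMod('ngram-mod');

  const SpeculativeStrategy(this.upstreamName);
  final String? upstreamName;

  /// Only the external draft-model strategy needs a second model file.
  bool get needsDraftModel => this == SpeculativeStrategy.draftSimple;
}

/// Hypothetical options object that could hang off GenerationParams.
class SpeculativeDecodingParams {
  const SpeculativeDecodingParams({
    this.strategy = SpeculativeStrategy.disabled,
    this.draftModelPath,
    this.maxDraftTokens = 0,
  });

  final SpeculativeStrategy strategy;
  final String? draftModelPath;
  final int maxDraftTokens;

  bool get enabled => strategy != SpeculativeStrategy.disabled;

  /// Returns human-readable problems; an empty list means the options
  /// are internally consistent.
  List<String> validate() {
    final problems = <String>[];
    if (!enabled) return problems;
    if (maxDraftTokens <= 0) {
      problems.add('maxDraftTokens must be positive when speculation is enabled');
    }
    if (strategy.needsDraftModel && draftModelPath == null) {
      problems.add('${strategy.upstreamName} requires draftModelPath');
    }
    if (!strategy.needsDraftModel && draftModelPath != null) {
      problems.add('${strategy.upstreamName} does not use draftModelPath');
    }
    return problems;
  }
}
```

Keeping the strategy as an enum with an explicit `disabled` value means callers opt in explicitly, and a default-constructed options object always validates cleanly.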
Acceptance criteria
Users can enable at least one non-MTP llama.cpp speculative strategy explicitly.
Default generation remains unchanged.
Unsupported combinations fail with LlamaUnsupportedException or another typed/actionable error.
Metrics expose draft generated/accepted counts and acceptance rate.
Docs distinguish the MTP support tracked in #168 from this broader speculative-decoding feature set.
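The metrics and failure-mode criteria could be checked against something like the sketch below. All names are hypothetical: `SpeculativeStats` and `ensureSpeculationSupported` are illustrative, and `UnsupportedSpeculationError` is a local stand-in for LlamaUnsupportedException, whose actual constructor is not assumed here. The flag names simply echo the incompatible combinations listed under proposed work.

```dart
/// Hypothetical per-generation counters.
class SpeculativeStats {
  const SpeculativeStats({
    required this.draftGenerated,
    required this.draftAccepted,
  });

  final int draftGenerated;
  final int draftAccepted;

  /// Fraction of drafted tokens the target accepted; 0.0 when nothing
  /// was drafted, to avoid dividing by zero.
  double get acceptanceRate =>
      draftGenerated == 0 ? 0.0 : draftAccepted / draftGenerated;
}

/// Local stand-in for a typed, actionable error.
class UnsupportedSpeculationError implements Exception {
  UnsupportedSpeculationError(this.reasons);
  final List<String> reasons;

  @override
  String toString() =>
      'Speculative decoding is not supported with: ${reasons.join(', ')}';
}

/// Throws if speculation is requested alongside a combination that has
/// not been validated yet. Does nothing when speculation is disabled.
void ensureSpeculationSupported({
  required bool speculationEnabled,
  bool isWeb = false,
  bool hasMultimodalInput = false,
  bool usesStatePersistence = false,
  bool usesPromptPrefixReuse = false,
  bool usesGrammar = false,
}) {
  if (!speculationEnabled) return;
  final reasons = <String>[
    if (isWeb) 'web runtime',
    if (hasMultimodalInput) 'multimodal input',
    if (usesStatePersistence) 'state persistence',
    if (usesPromptPrefixReuse) 'prompt-prefix reuse',
    if (usesGrammar) 'grammar-constrained sampling',
  ];
  if (reasons.isNotEmpty) throw UnsupportedSpeculationError(reasons);
}
```

Collecting every blocking reason before throwing lets the error message name all conflicting options at once, instead of making users fix them one at a time.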