Expose llama.cpp speculative decoding strategies beyond MTP #190

@leehack

Problem

Upstream llama.cpp supports several speculative decoding strategies, but llamadart's native llama.cpp generation path currently exposes none of them through its public API.

Existing issue #168 tracks MTP specifically. This issue tracks the rest of the llama.cpp speculative decoding surface, especially draft-model and n-gram strategies.

Current local evidence:

  • GenerationParams has no speculative decoding options.
  • LlamaCppService.generate(...) initializes a normal sampler, then samples one token and calls llama_decode(...) once per loop iteration.
  • No Dart API exists for an external draft model, draft token counts, n-gram speculative strategy selection, or speculative acceptance metrics.
  • Upstream llama.cpp docs list speculative types such as draft-simple, draft-mtp, ngram-cache, ngram-simple, ngram-map-k, ngram-map-k4v, and ngram-mod.
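
For readers new to the idea, here is a minimal Python sketch of what a draft-then-verify loop does differently from the one-token-per-decode loop described above. It is a toy illustration, not llama.cpp's or llamadart's implementation: `ngram_draft`, `verify_greedy`, and `toy_target` are hypothetical names, the n-gram lookup is a simplified stand-in for the upstream n-gram strategies, and verification is shown token by token for clarity where a real implementation would typically verify the whole draft in a batched decode.

```python
from typing import Callable, List, Sequence, Tuple


def ngram_draft(tokens: Sequence[int], n: int, max_draft: int) -> List[int]:
    """Draft tokens by copying what followed the most recent earlier
    occurrence of the last n tokens (hypothetical, simplified lookup)."""
    if n < 1 or max_draft < 1 or len(tokens) <= n:
        return []
    key = list(tokens[-n:])
    # Walk backwards over earlier positions, skipping the trailing n-gram itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if list(tokens[start:start + n]) == key:
            return list(tokens[start + n:start + n + max_draft])
    return []


def verify_greedy(
    next_token: Callable[[Sequence[int]], int],
    context: Sequence[int],
    draft: Sequence[int],
) -> Tuple[List[int], int]:
    """Accept draft tokens while they match the target's greedy choice,
    then append one token chosen by the target itself."""
    ctx = list(context)
    accepted = 0
    for tok in draft:
        if next_token(ctx) != tok:
            break
        ctx.append(tok)
        accepted += 1
    # Either the correction for the first mismatch or a bonus token.
    ctx.append(next_token(ctx))
    return ctx[len(context):], accepted


def toy_target(ctx: Sequence[int]) -> int:
    """Stand-in for the target model: cycles 1 -> 2 -> 3 -> 4 -> 1."""
    return (ctx[-1] % 4) + 1


history = [1, 2, 3, 4, 1, 2]
draft = ngram_draft(history, n=2, max_draft=3)              # [3, 4, 1]
new_tokens, accepted = verify_greedy(toy_target, history, draft)
print(draft, new_tokens, accepted)                          # [3, 4, 1] [3, 4, 1, 2] 3
```

In this toy run a single verification step yields four tokens instead of one, which is the gain that the draft and acceptance counters are meant to make visible.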

Proposed work

  1. Design a public speculative decoding API that can cover llama.cpp strategies without forcing LiteRT-LM semantics into the same shape prematurely.
  2. Keep #168 focused on draft-mtp; use this issue for external draft-model and n-gram strategies unless the implementation naturally unifies them.
  3. Decide whether upstream common/speculative.* should be wrapped in llamadart-native or reimplemented at the Dart/native-service layer. Prefer native wrapper ownership if upstream common code remains the source of truth.
  4. Add native wrapper symbols and Dart bindings for strategy selection, draft limits, draft model loading where applicable, and stats collection.
  5. Preserve current default behavior with speculation disabled.
  6. Gate incompatible options, such as unsupported web/runtime assets, multimodal combinations, state persistence, prompt-prefix reuse, or grammar paths, until they are validated.
  7. Add diagnostics for generated/accepted draft tokens and acceptance rate.
  8. Add local-only E2E scenarios for at least one n-gram strategy and one external draft-model strategy.
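
As a starting point for the API design discussion, items 1, 5, and 6 could be sketched roughly as follows. This is Python used for illustration only, not a proposed Dart signature: `SpeculativeOptions`, `validate`, `UnsupportedSpeculationError`, the field names, and the default draft limit are all assumptions. The strategy strings are the non-MTP names listed in the evidence above.

```python
from dataclasses import dataclass
from typing import Optional

# Non-MTP strategy names as listed in this issue; draft-mtp stays with #168.
STRATEGIES = frozenset({
    "draft-simple", "ngram-cache", "ngram-simple",
    "ngram-map-k", "ngram-map-k4v", "ngram-mod",
})


class UnsupportedSpeculationError(Exception):
    """Hypothetical stand-in for a typed error such as LlamaUnsupportedException."""


@dataclass(frozen=True)
class SpeculativeOptions:
    strategy: Optional[str] = None          # None keeps speculation disabled
    draft_model_path: Optional[str] = None  # only used by draft-model strategies
    max_draft_tokens: int = 8               # hypothetical default


def validate(
    opts: SpeculativeOptions,
    *,
    is_web: bool = False,
    is_multimodal: bool = False,
    uses_grammar: bool = False,
) -> None:
    """Raise on invalid or not-yet-validated combinations; return None otherwise."""
    if opts.strategy is None:
        return  # default path: nothing to gate, behavior unchanged
    if opts.strategy not in STRATEGIES:
        raise ValueError(f"unknown speculative strategy: {opts.strategy!r}")
    if opts.max_draft_tokens < 1:
        raise ValueError("max_draft_tokens must be at least 1")
    if opts.strategy.startswith("draft-") and not opts.draft_model_path:
        raise ValueError(f"{opts.strategy} requires draft_model_path")
    flags = (("web runtime", is_web), ("multimodal", is_multimodal),
             ("grammar", uses_grammar))
    blocked = [name for name, enabled in flags if enabled]
    if blocked:
        raise UnsupportedSpeculationError(
            f"{opts.strategy} is not validated with: {', '.join(blocked)}"
        )


validate(SpeculativeOptions())                         # disabled: always passes
validate(SpeculativeOptions(strategy="ngram-simple"))  # enabled, no gated features
```

Keeping the disabled state as the default value means existing callers never reach the gating logic, which is one way to satisfy the "default behavior unchanged" requirement.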

Acceptance criteria

  • Users can enable at least one non-MTP llama.cpp speculative strategy explicitly.
  • Default generation remains unchanged.
  • Unsupported combinations fail with LlamaUnsupportedException or another typed/actionable error.
  • Metrics expose draft generated/accepted counts and acceptance rate.
  • Docs distinguish the MTP support tracked in #168 from this broader speculative-decoding feature set.
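
The metrics criterion can be made concrete with a small sketch. Again this is illustrative Python rather than the eventual Dart surface, and `SpeculativeStats` with its field names is hypothetical. It assumes acceptance rate is defined as accepted draft tokens divided by generated draft tokens, reported as 0.0 when nothing has been drafted.

```python
from dataclasses import dataclass


@dataclass
class SpeculativeStats:
    """Running counters for one generation request (hypothetical shape)."""
    drafted: int = 0   # draft tokens proposed
    accepted: int = 0  # draft tokens the target model accepted

    def record(self, drafted: int, accepted: int) -> None:
        """Add the counts from one draft-and-verify step."""
        if not 0 <= accepted <= drafted:
            raise ValueError("accepted must be between 0 and drafted")
        self.drafted += drafted
        self.accepted += accepted

    @property
    def acceptance_rate(self) -> float:
        """Accepted / drafted, or 0.0 before any draft was produced."""
        return self.accepted / self.drafted if self.drafted else 0.0


stats = SpeculativeStats()
stats.record(drafted=4, accepted=3)
stats.record(drafted=4, accepted=1)
print(stats.drafted, stats.accepted, stats.acceptance_rate)  # 8 4 0.5
```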

Labels: enhancement (New feature or request)