Expose llama.cpp speculative decoding strategies beyond MTP #190

## Problem

Upstream llama.cpp supports several speculative decoding strategies, but llamadart's native llama.cpp generation path currently exposes none of them through its public API.

Existing issue #168 tracks MTP specifically. This issue tracks the rest of the llama.cpp speculative decoding surface, especially draft-model and n-gram strategies.

Current local evidence:

- `GenerationParams` has no speculative decoding options.
- `LlamaCppService.generate(...)` initializes a normal sampler, then samples one token and makes one `llama_decode(...)` call per loop iteration.
- No Dart API exists for an external draft model, draft token counts, n-gram speculative strategy selection, or speculative acceptance metrics.
- Upstream llama.cpp docs list speculative types such as `draft-simple`, `draft-mtp`, `ngram-cache`, `ngram-simple`, `ngram-map-k`, `ngram-map-k4v`, and `ngram-mod`.
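
For context, the gap between the current one-token loop and a speculative loop can be sketched in a few lines. The following is a minimal, self-contained C++ illustration of a naive n-gram draft plus greedy verification. It is not llama.cpp's `common/speculative.*` code and does not claim to match any of the upstream strategy types listed above; the names `ngram_draft` and `count_accepted` are hypothetical and exist only for this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using Token = std::int32_t;

// Hypothetical helper: propose up to n_draft tokens by finding the most recent
// earlier occurrence of the last n tokens of `history` and copying what followed it.
std::vector<Token> ngram_draft(const std::vector<Token>& history,
                               std::size_t n, std::size_t n_draft) {
    std::vector<Token> draft;
    if (n == 0 || history.size() <= n) return draft;
    const std::size_t tail = history.size() - n;  // start of the suffix to match
    // Scan candidate start positions from newest to oldest, skipping the suffix itself.
    for (std::size_t start = tail; start-- > 0;) {
        if (std::equal(history.begin() + start, history.begin() + start + n,
                       history.begin() + tail)) {
            std::size_t next = start + n;
            while (next < history.size() && draft.size() < n_draft) {
                draft.push_back(history[next++]);
            }
            break;
        }
    }
    return draft;
}

// Greedy verification: accept the longest draft prefix that matches the tokens
// the target model would have chosen at the same positions.
std::size_t count_accepted(const std::vector<Token>& draft,
                           const std::vector<Token>& target) {
    std::size_t i = 0;
    while (i < draft.size() && i < target.size() && draft[i] == target[i]) ++i;
    return i;
}
```

In a real integration the `target` tokens would come from a single batched decode over the drafted positions, which is where the speedup over one `llama_decode(...)` per token would come from. How that batching is wired up depends on the wrapper decision in the proposed work below.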

## Proposed work

1. Design a public speculative decoding API that can cover llama.cpp strategies without forcing LiteRT-LM semantics into the same shape prematurely.
2. Keep #168 focused on `draft-mtp`; use this issue for external draft-model and n-gram strategies unless the implementation naturally unifies them.
3. Decide whether upstream `common/speculative.*` should be wrapped in `llamadart-native` or reimplemented at the Dart/native-service layer. Prefer native wrapper ownership if upstream common code remains the source of truth.
4. Add native wrapper symbols and Dart bindings for strategy selection, draft limits, draft model loading where applicable, and stats collection.
5. Preserve current default behavior with speculation disabled.
6. Gate incompatible options (for example unsupported web/runtime assets, multimodal combinations, state persistence, prompt-prefix reuse, or grammar paths) until each has been validated.
7. Add diagnostics for generated/accepted draft tokens and acceptance rate.
8. Add local-only E2E scenarios for at least one n-gram strategy and one external draft-model strategy.
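
One possible shape for the native-side options and the gating in items 4 to 6 is sketched below. This is a hedged illustration only: `SpecStrategy`, `SpecOptions`, `SessionFeatures`, and `validate_spec_options` are hypothetical names, the enum covers only a subset of the upstream strategy types, and the set of gated features is taken from the list above rather than from any validated compatibility matrix.

```cpp
#include <string>

// Hypothetical strategy selector; None keeps today's default behavior.
enum class SpecStrategy { None, DraftSimple, NgramSimple, NgramCache };

// Hypothetical option bundle a Dart binding could map onto.
struct SpecOptions {
    SpecStrategy strategy = SpecStrategy::None;  // speculation disabled by default
    int n_draft_max = 0;                         // upper bound on drafted tokens per step
    std::string draft_model_path;                // only meaningful for DraftSimple
};

// Hypothetical description of what the current session is using.
struct SessionFeatures {
    bool multimodal = false;
    bool grammar = false;
    bool state_persistence = false;
    bool prompt_prefix_reuse = false;
};

// Returns an empty string when the combination is allowed, otherwise a message
// that the caller could surface through a typed error.
std::string validate_spec_options(const SpecOptions& opts, const SessionFeatures& feats) {
    if (opts.strategy == SpecStrategy::None) return "";  // default path is never gated
    if (opts.n_draft_max <= 0) {
        return "n_draft_max must be positive when speculation is enabled";
    }
    if (opts.strategy == SpecStrategy::DraftSimple && opts.draft_model_path.empty()) {
        return "draft-model strategy requires a draft model path";
    }
    if (feats.multimodal) return "speculation with multimodal input is not validated";
    if (feats.grammar) return "speculation with grammar sampling is not validated";
    if (feats.state_persistence) return "speculation with state persistence is not validated";
    if (feats.prompt_prefix_reuse) return "speculation with prompt-prefix reuse is not validated";
    return "";
}
```

Checking `SpecStrategy::None` first means existing callers never hit the new validation, which is one way to keep item 5 true by construction.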

## Acceptance criteria

- Users can enable at least one non-MTP llama.cpp speculative strategy explicitly.
- Default generation remains unchanged.
- Unsupported combinations fail with `LlamaUnsupportedException` or another typed/actionable error.
- Metrics expose draft generated/accepted counts and acceptance rate.
- Docs distinguish #168 MTP support from this broader speculative-decoding feature set.
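
The metrics criterion could be met with a very small counter type. The sketch below assumes acceptance rate is defined as accepted draft tokens divided by generated draft tokens, reported as 0 when nothing has been drafted; `SpecStats` and its members are hypothetical names, not existing llamadart or llama.cpp symbols.

```cpp
#include <cstdint>

// Hypothetical accumulator for speculative decoding diagnostics.
struct SpecStats {
    std::uint64_t draft_generated = 0;  // total tokens proposed by the draft strategy
    std::uint64_t draft_accepted = 0;   // total proposed tokens the target model accepted

    // Record the outcome of one draft-and-verify step.
    void record(std::uint64_t generated, std::uint64_t accepted) {
        draft_generated += generated;
        draft_accepted += accepted;
    }

    // Fraction of drafted tokens that were accepted; 0.0 before any draft is made.
    double acceptance_rate() const {
        if (draft_generated == 0) return 0.0;
        return static_cast<double>(draft_accepted) / static_cast<double>(draft_generated);
    }
};
```

Exposing the two raw counters alongside the derived rate lets callers aggregate across generations without losing precision.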

## Related

- #168: llama.cpp MTP support.
- #188: LiteRT-LM speculative decoding / MTP path.
