Expose LiteRT-LM speculative decoding / MTP through LlamaEngine #188

## Problem

LiteRT-LM upstream and the pinned native runtime expose speculative decoding for Gemma 4 / MTP-style acceleration, but the normal llamadart LiteRT-LM backend currently disables it.

Current local evidence:

- `LiteRtLmRuntimeClient.initialize(...)` has a `speculativeDecoding` parameter and forwards it to `litert_lm_engine_settings_set_enable_speculative_decoding`.
- `LiteRtLmService._ensureClientForRuntime(...)` hard-codes `speculativeDecoding: false` for the public `LlamaEngine(LlamaBackend())` / `LiteRtLmBackend` path.
- `example/chat_app/lib/litert_lm_benchmark_app.dart` exposes a speculative toggle but prints `speculative: ignored by backend API`.
- Upstream LiteRT-LM README documents Gemma 4 MTP/speculative decoding CLI usage with `--enable-speculative-decoding=true`.

As a result, users can only experiment with speculative decoding through the low-level runtime client, not through the normal llamadart engine/backend API that apps use.

## Likely reason it was disabled initially

PR #167 was scoped as a stable LiteRT-LM backend and fair benchmark POC, not full feature parity. Forcing speculative decoding off kept the high-level path deterministic and comparable while cancellation, streaming, tool/thinking parsing, metrics, backend selection, and runtime packaging were being stabilized. There is currently no public API knob, capability probe, validation, metrics split, or real-model E2E coverage for speculative-on behavior.

## Proposed work

1. Add an explicit public opt-in, likely on `GenerationParams` or LiteRT-LM-specific generation options.
2. Pass the option through `LiteRtLmService` into `LiteRtLmRuntimeClient.initialize(...)`.
3. Decide default behavior. Keep default off unless real-model validation proves speculative-on is safe and consistently beneficial for the relevant `.litertlm` bundles.
4. Report whether speculative decoding was enabled in diagnostics/metrics so benchmark output is unambiguous.
5. Add validation for unsupported platforms/models/runtime versions with typed actionable errors.
6. Add smoke/E2E coverage with a Gemma 4 `.litertlm` model on at least one native platform, ideally including Pixel GPU/NPU if available.
7. Update README, backend-selection docs, benchmark docs, and changelog.
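
Steps 1, 2, 4, and 5 could be sketched roughly as follows. This is only an illustration under stated assumptions: `LiteRtLmOptions`, `SpeculativeCapability`, `SpeculativeDecodingUnsupportedException`, `resolveSpeculativeDecoding`, and `describeSpeculative` are hypothetical names that do not exist in llamadart today, and how a capability would actually be probed is left open. The resolved boolean is what `LiteRtLmService` would forward as `speculativeDecoding` to `LiteRtLmRuntimeClient.initialize(...)` in place of the hard-coded `false`.

```dart
/// Hypothetical LiteRT-LM-specific options (not part of the current API).
class LiteRtLmOptions {
  const LiteRtLmOptions({this.speculativeDecoding = false});

  /// Explicit opt-in; off by default, matching step 3.
  final bool speculativeDecoding;
}

/// Hypothetical result of a capability probe for the current
/// platform/model/runtime combination.
class SpeculativeCapability {
  const SpeculativeCapability({required this.supported, this.reason});

  final bool supported;
  final String? reason;
}

/// Hypothetical typed error so unsupported combinations fail loudly.
class SpeculativeDecodingUnsupportedException implements Exception {
  SpeculativeDecodingUnsupportedException(this.reason);

  final String reason;

  @override
  String toString() => 'SpeculativeDecodingUnsupportedException: $reason';
}

/// Returns the value to forward as `speculativeDecoding`.
/// Throws when the caller opted in but the combination is unsupported,
/// instead of silently ignoring the option.
bool resolveSpeculativeDecoding(
  LiteRtLmOptions options,
  SpeculativeCapability capability,
) {
  if (!options.speculativeDecoding) return false;
  if (!capability.supported) {
    throw SpeculativeDecodingUnsupportedException(
      capability.reason ?? 'not supported by this model/platform/runtime',
    );
  }
  return true;
}

/// Diagnostics line so benchmark output is unambiguous (step 4).
String describeSpeculative(bool enabled) =>
    'speculative: ${enabled ? 'enabled' : 'disabled'}';

void main() {
  const capability = SpeculativeCapability(supported: true);
  final enabled = resolveSpeculativeDecoding(
    const LiteRtLmOptions(speculativeDecoding: true),
    capability,
  );
  print(describeSpeculative(enabled));
}
```

Keeping the resolution in one pure function would make the default-off behavior and the failure path unit-testable without loading a model.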

## Acceptance criteria

- Users can explicitly enable LiteRT-LM speculative decoding through the normal `LlamaEngine` path.
- The benchmark app's speculative toggle actually affects LiteRT-LM runtime initialization or is removed from the UI.
- Default behavior remains stable and documented.
- Metrics and logs clearly state whether speculative decoding was enabled or disabled.
- Unsupported combinations fail loudly instead of silently ignoring the option.
- Real-model validation records throughput and output sanity for speculative on vs off.
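
The last criterion could be recorded with something like the sketch below. `BenchmarkRun` and `compareRuns` are hypothetical helpers, not existing benchmark-app code, and the values in `main` are placeholders for illustration, not measurements. Exact string equality is only one possible output-sanity check and assumes deterministic sampling settings; real validation may need a looser comparison.

```dart
/// Hypothetical record of one benchmark run.
class BenchmarkRun {
  const BenchmarkRun({
    required this.speculativeEnabled,
    required this.generatedTokens,
    required this.elapsed,
    required this.output,
  });

  final bool speculativeEnabled;
  final int generatedTokens;
  final Duration elapsed;
  final String output;

  /// Decode throughput in tokens per second.
  double get tokensPerSecond =>
      generatedTokens * Duration.microsecondsPerSecond / elapsed.inMicroseconds;
}

/// Summarizes a speculative-off run against a speculative-on run.
String compareRuns(BenchmarkRun off, BenchmarkRun on) {
  if (off.speculativeEnabled || !on.speculativeEnabled) {
    throw ArgumentError('expected one speculative-off and one speculative-on run');
  }
  final speedup = on.tokensPerSecond / off.tokensPerSecond;
  final identical = on.output == off.output;
  return 'speculative off: ${off.tokensPerSecond.toStringAsFixed(1)} tok/s, '
      'on: ${on.tokensPerSecond.toStringAsFixed(1)} tok/s, '
      'speedup: ${speedup.toStringAsFixed(2)}x, '
      'identical output: $identical';
}

void main() {
  // Placeholder values only; replace with real measurements.
  const off = BenchmarkRun(
    speculativeEnabled: false,
    generatedTokens: 100,
    elapsed: Duration(seconds: 2),
    output: 'placeholder',
  );
  const on = BenchmarkRun(
    speculativeEnabled: true,
    generatedTokens: 100,
    elapsed: Duration(seconds: 1),
    output: 'placeholder',
  );
  print(compareRuns(off, on));
}
```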

## Related

- PR #167: initial LiteRT-LM backend support.
- Issue #168 tracks llama.cpp MTP; this issue is specifically for the LiteRT-LM `.litertlm` runtime path.
- Existing follow-ups: #173 LiteRT-LM LoRA, #174 LiteRT-LM structured output.
