Problem
LiteRT-LM upstream and the pinned native runtime expose speculative decoding for Gemma 4 / MTP-style acceleration, but the normal llamadart LiteRT-LM backend currently disables it.
Current local evidence:
- LiteRtLmRuntimeClient.initialize(...) has a speculativeDecoding parameter and forwards it to litert_lm_engine_settings_set_enable_speculative_decoding.
- LiteRtLmService._ensureClientForRuntime(...) hard-codes speculativeDecoding: false for the public LlamaEngine(LlamaBackend()) / LiteRtLmBackend path.
- example/chat_app/lib/litert_lm_benchmark_app.dart exposes a speculative toggle but prints "speculative: ignored by backend API".
- Upstream LiteRT-LM README documents Gemma 4 MTP/speculative decoding CLI usage with --enable-speculative-decoding=true.
This means users can only experiment through the low-level runtime client, not through the normal llamadart engine/backend API used by apps.
Likely reason it was disabled initially
PR #167 was scoped as a stable LiteRT-LM backend and fair benchmark POC, not full feature parity. Forcing speculative decoding off kept the high-level path deterministic and comparable while cancellation, streaming, tool/thinking parsing, metrics, backend selection, and runtime packaging were being stabilized. There is currently no public API knob, capability probe, validation, metrics split, or real-model E2E coverage for speculative-on behavior.
Proposed work
- Add an explicit public opt-in, likely on GenerationParams or LiteRT-LM-specific generation options.
- Pass the option through LiteRtLmService into LiteRtLmRuntimeClient.initialize(...).
- Decide default behavior. Keep default off unless real-model validation proves speculative-on is safe and consistently beneficial for the relevant .litertlm bundles.
- Report whether speculative decoding was enabled in diagnostics/metrics so benchmark output is unambiguous.
- Add validation for unsupported platforms/models/runtime versions with typed actionable errors.
- Add smoke/E2E coverage with a Gemma 4 .litertlm model on at least one native platform, ideally including Pixel GPU/NPU if available.
- Update README, backend-selection docs, benchmark docs, and changelog.
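The opt-in, validation, and diagnostics items above could fit together roughly as follows. This is a minimal sketch, not the llamadart API: LiteRtLmOptions, SpeculativeSupport, SpeculativeDecodingUnsupportedException, resolveSpeculativeDecoding, and speculativeDiagnostics are all hypothetical names, and how support is actually probed for a given platform, model, or runtime version is left open.

```dart
/// Hypothetical LiteRT-LM-specific options; not part of the current API.
class LiteRtLmOptions {
  const LiteRtLmOptions({this.speculativeDecoding = false});

  /// Explicit opt-in. Defaults to off, matching today's hard-coded value.
  final bool speculativeDecoding;
}

/// Hypothetical result of a capability probe for the loaded model/runtime.
class SpeculativeSupport {
  const SpeculativeSupport({required this.supported, this.reason});

  final bool supported;
  final String? reason;
}

/// Hypothetical typed error for unsupported platform/model/runtime combos.
class SpeculativeDecodingUnsupportedException implements Exception {
  SpeculativeDecodingUnsupportedException(this.reason);

  final String reason;

  @override
  String toString() => 'SpeculativeDecodingUnsupportedException: $reason';
}

/// Resolves the flag that the service would forward to
/// LiteRtLmRuntimeClient.initialize(...). Throws when the caller asked for
/// speculative decoding but it is unsupported, instead of silently
/// ignoring the request.
bool resolveSpeculativeDecoding(
  LiteRtLmOptions options,
  SpeculativeSupport support,
) {
  if (!options.speculativeDecoding) return false;
  if (!support.supported) {
    throw SpeculativeDecodingUnsupportedException(
      support.reason ??
          'speculative decoding is not supported by this model or runtime',
    );
  }
  return true;
}

/// Diagnostics line so benchmark output states the effective setting.
String speculativeDiagnostics(bool enabled) =>
    'speculative: ${enabled ? 'enabled' : 'disabled'}';
```

Keeping the resolution in one pure function means the default-off behavior, the loud failure, and the reported value all come from the same decision, which also makes it easy to unit test without a model.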
Acceptance criteria
- Users can explicitly enable LiteRT-LM speculative decoding through the normal LlamaEngine path.
- The benchmark app's speculative toggle actually affects LiteRT-LM runtime initialization or is removed from the UI.
- Default behavior remains stable and documented.
- Metrics/logs clearly state speculative enabled/disabled.
- Unsupported combinations fail loudly instead of silently ignoring the option.
- Real-model validation records throughput and output sanity for speculative on vs off.
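For the last criterion, a validation record for speculative on vs off might look like the sketch below. BenchmarkRun and compareRuns are hypothetical names, the fields are assumptions about what a run would capture, and the non-empty-output check is only a placeholder for whatever output sanity check the real validation settles on.

```dart
/// Hypothetical record of one benchmark run; not an existing llamadart type.
class BenchmarkRun {
  const BenchmarkRun({
    required this.speculativeEnabled,
    required this.generatedTokens,
    required this.decodeTime,
    required this.output,
  });

  final bool speculativeEnabled;
  final int generatedTokens;
  final Duration decodeTime;
  final String output;

  /// Decode throughput in tokens per second.
  double get tokensPerSecond =>
      generatedTokens /
      (decodeTime.inMicroseconds / Duration.microsecondsPerSecond);
}

/// Summarizes an off/on pair so the report states which run was which.
String compareRuns(BenchmarkRun off, BenchmarkRun on) {
  if (off.speculativeEnabled || !on.speculativeEnabled) {
    throw ArgumentError('expected (speculative off, speculative on) runs');
  }
  final speedup = on.tokensPerSecond / off.tokensPerSecond;
  // Placeholder sanity check: only verifies that both runs produced text.
  final sane =
      off.output.trim().isNotEmpty && on.output.trim().isNotEmpty;
  return 'speculative off: ${off.tokensPerSecond.toStringAsFixed(1)} tok/s, '
      'speculative on: ${on.tokensPerSecond.toStringAsFixed(1)} tok/s, '
      'speedup: ${speedup.toStringAsFixed(2)}x, '
      'output sane: $sane';
}
```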
Related
.litertlm runtime path.