Expose LiteRT-LM speculative decoding / MTP through LlamaEngine #188

@leehack

Problem

LiteRT-LM upstream and the pinned native runtime expose speculative decoding for Gemma 4 / MTP-style acceleration, but the normal llamadart LiteRT-LM backend currently disables it.

Current local evidence:

  • LiteRtLmRuntimeClient.initialize(...) has a speculativeDecoding parameter and forwards it to litert_lm_engine_settings_set_enable_speculative_decoding.
  • LiteRtLmService._ensureClientForRuntime(...) hard-codes speculativeDecoding: false for the public LlamaEngine(LlamaBackend()) / LiteRtLmBackend path.
  • example/chat_app/lib/litert_lm_benchmark_app.dart exposes a speculative toggle, but the toggle has no effect: the app prints "speculative: ignored by backend API".
  • Upstream LiteRT-LM README documents Gemma 4 MTP/speculative decoding CLI usage with --enable-speculative-decoding=true.

This means users can only experiment through the low-level runtime client, not through the normal llamadart engine/backend API used by apps.

Likely reason it was disabled initially

PR #167 was scoped as a stable LiteRT-LM backend and fair benchmark POC, not full feature parity. Forcing speculative decoding off kept the high-level path deterministic and comparable while cancellation, streaming, tool/thinking parsing, metrics, backend selection, and runtime packaging were being stabilized. There is currently no public API knob, capability probe, validation, metrics split, or real-model E2E coverage for speculative-on behavior.

Proposed work

  1. Add an explicit public opt-in, likely on GenerationParams or LiteRT-LM-specific generation options.
  2. Pass the option through LiteRtLmService into LiteRtLmRuntimeClient.initialize(...).
  3. Decide default behavior. Keep default off unless real-model validation proves speculative-on is safe and consistently beneficial for the relevant .litertlm bundles.
  4. Report whether speculative decoding was enabled in diagnostics/metrics so benchmark output is unambiguous.
  5. Add validation for unsupported platforms/models/runtime versions with typed actionable errors.
  6. Add smoke/E2E coverage with a Gemma 4 .litertlm model on at least one native platform, ideally including Pixel GPU/NPU if available.
  7. Update README, backend-selection docs, benchmark docs, and changelog.
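
A minimal sketch of how steps 1, 2, 3, and 5 could fit together, offered as an illustration rather than a design decision. Every type and function name here (LiteRtLmOptions, RuntimeCapabilities, SpeculativeDecodingUnsupportedException, resolveSpeculativeDecoding) is hypothetical and does not exist in llamadart today; the only assumption taken from the code base is that LiteRtLmRuntimeClient.initialize(...) accepts a speculativeDecoding flag.

```dart
// Hypothetical sketch: none of these types exist in llamadart today.

/// Hypothetical LiteRT-LM-specific options carrying the explicit opt-in.
class LiteRtLmOptions {
  const LiteRtLmOptions({this.speculativeDecoding = false});

  /// Off by default, matching step 3 of the proposed work.
  final bool speculativeDecoding;
}

/// Hypothetical typed error for unsupported platform/model/runtime combos.
class SpeculativeDecodingUnsupportedException implements Exception {
  SpeculativeDecodingUnsupportedException(this.reason);

  final String reason;

  @override
  String toString() => 'SpeculativeDecodingUnsupportedException: $reason';
}

/// Hypothetical result of a capability probe for the loaded model/runtime.
class RuntimeCapabilities {
  const RuntimeCapabilities({required this.supportsSpeculativeDecoding});

  final bool supportsSpeculativeDecoding;
}

/// Resolves the value the service would forward to
/// LiteRtLmRuntimeClient.initialize(speculativeDecoding: ...).
/// Throws instead of silently ignoring an unsupported request.
bool resolveSpeculativeDecoding(
  LiteRtLmOptions options,
  RuntimeCapabilities capabilities,
) {
  if (!options.speculativeDecoding) return false;
  if (!capabilities.supportsSpeculativeDecoding) {
    throw SpeculativeDecodingUnsupportedException(
      'speculative decoding was requested, but the loaded model or runtime '
      'does not support it',
    );
  }
  return true;
}

void main() {
  const unsupported = RuntimeCapabilities(supportsSpeculativeDecoding: false);

  // Default options never enable speculative decoding.
  print(resolveSpeculativeDecoding(const LiteRtLmOptions(), unsupported));

  // An explicit opt-in on an unsupported runtime fails loudly.
  try {
    resolveSpeculativeDecoding(
      const LiteRtLmOptions(speculativeDecoding: true),
      unsupported,
    );
  } on SpeculativeDecodingUnsupportedException catch (e) {
    print(e);
  }
}
```

Keeping the resolution in one pure function would make the "fail loudly" behavior unit-testable without loading a model; where the option ultimately lives (GenerationParams or a backend-specific options object) is left open, as in step 1.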

Acceptance criteria

  • Users can explicitly enable LiteRT-LM speculative decoding through the normal LlamaEngine path.
  • The benchmark app's speculative toggle actually affects LiteRT-LM runtime initialization or is removed from the UI.
  • Default behavior remains stable and documented.
  • Metrics/logs clearly state speculative enabled/disabled.
  • Unsupported combinations fail loudly instead of silently ignoring the option.
  • Real-model validation records throughput and output sanity for speculative on vs off.
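
As a rough illustration of the metrics and validation criteria above, the sketch below records whether speculative decoding was enabled alongside throughput, so an on-vs-off comparison is unambiguous. BenchmarkRun, describe, and speedup are hypothetical names, not existing llamadart APIs, and the numbers in main are placeholders rather than measurements.

```dart
/// Hypothetical benchmark record; field names are illustrative only.
class BenchmarkRun {
  const BenchmarkRun({
    required this.speculativeEnabled,
    required this.generatedTokens,
    required this.elapsed,
  });

  final bool speculativeEnabled;
  final int generatedTokens;
  final Duration elapsed;

  /// Generated tokens divided by wall-clock seconds.
  double get tokensPerSecond =>
      generatedTokens /
      (elapsed.inMicroseconds / Duration.microsecondsPerSecond);

  String get speculativeLabel => speculativeEnabled ? 'enabled' : 'disabled';

  /// One log line that always states the speculative mode explicitly.
  String describe() => 'speculative: $speculativeLabel, '
      'tokens: $generatedTokens, '
      'tok/s: ${tokensPerSecond.toStringAsFixed(1)}';
}

/// Throughput ratio of the speculative run over the baseline run.
double speedup(BenchmarkRun baseline, BenchmarkRun speculative) =>
    speculative.tokensPerSecond / baseline.tokensPerSecond;

void main() {
  // Placeholder values chosen for easy arithmetic, not real measurements.
  const off = BenchmarkRun(
    speculativeEnabled: false,
    generatedTokens: 100,
    elapsed: Duration(seconds: 4),
  );
  const on = BenchmarkRun(
    speculativeEnabled: true,
    generatedTokens: 100,
    elapsed: Duration(seconds: 2),
  );

  print(off.describe());
  print(on.describe());
  print('speedup: ${speedup(off, on).toStringAsFixed(2)}x');
}
```

Output sanity (for example, comparing generated text between the two runs) would need a separate check; this sketch only covers the throughput side.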

Labels: enhancement
