Expose LiteRT-LM speculative decoding through GenerationParams #191

Merged
leehack merged 3 commits into main from litert-lm-speculative-decoding on May 31, 2026

leehack (Owner) commented May 31, 2026

Summary

Refs #188.

This exposes LiteRT-LM speculative decoding as an opt-in GenerationParams.speculativeDecoding flag and wires it through the native LiteRT-LM runtime settings. The default remains disabled.

Scope

  • Add GenerationParams.speculativeDecoding with default false and copy support.
  • Forward the flag to native LiteRT-LM initialization.
  • Reject the flag explicitly on llama.cpp, WebGPU, and LiteRT-LM web until those paths expose equivalent support.
  • Update the LiteRT-LM benchmark app and macOS benchmark helper so the speculative toggle actually drives generation and is recorded in metrics.
  • Document support, unsupported combinations, benchmark guidance, and the measured Gemma 4 E2B results.
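
As a rough illustration of the first scope item, the sketch below shows an opt-in boolean with a `false` default and copy support. It is a simplified, hypothetical stand-in, not the package's actual class: only the field name `speculativeDecoding` and its default come from this PR, while `maxTokens` and the `copyWith` method name are assumptions made for the example.

```dart
// Hypothetical, simplified stand-in for the package's GenerationParams.
// Only `speculativeDecoding` (default false) is taken from the PR text;
// `maxTokens` and the `copyWith` name are illustrative assumptions.
class GenerationParams {
  const GenerationParams({
    this.maxTokens = 256,
    this.speculativeDecoding = false,
  });

  final int maxTokens;
  final bool speculativeDecoding;

  // Returns a copy, overriding only the arguments that were provided.
  GenerationParams copyWith({int? maxTokens, bool? speculativeDecoding}) {
    return GenerationParams(
      maxTokens: maxTokens ?? this.maxTokens,
      speculativeDecoding: speculativeDecoding ?? this.speculativeDecoding,
    );
  }
}

void main() {
  const defaults = GenerationParams();
  final optedIn = defaults.copyWith(speculativeDecoding: true);

  print(defaults.speculativeDecoding); // false
  print(optedIn.speculativeDecoding); // true
  print(optedIn.maxTokens); // 256 (unchanged by the copy)
}
```

Keeping the default at `false` means existing callers see no behavior change unless they opt in explicitly.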

Benchmark Notes

The flag is exposed as a tuning knob, not enabled as a default optimization. In measured Gemma 4 E2B runs it was slower:

  • Pixel 9 Pro LiteRT-LM GPU: false 15.50 wall tok/s, true 9.06 wall tok/s, about 42% slower.
  • Apple M4 Max LiteRT-LM Metal: false 135.02 wall tok/s, true 118.96 wall tok/s, about 12% slower.
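
The "percent slower" figures above can be reproduced from the wall tok/s numbers. The helper below is a small sketch written for this note (the function name `slowdownPercent` is hypothetical and not part of the repository); it computes the relative throughput drop against the non-speculative baseline.

```dart
// Relative throughput drop of `candidate` versus `baseline`, in percent.
// Hypothetical helper for illustration; not part of the repository.
double slowdownPercent(double baseline, double candidate) =>
    (baseline - candidate) / baseline * 100;

void main() {
  // Pixel 9 Pro, LiteRT-LM GPU: 15.50 -> 9.06 wall tok/s.
  print(slowdownPercent(15.50, 9.06).toStringAsFixed(1)); // 41.5

  // Apple M4 Max, LiteRT-LM Metal: 135.02 -> 118.96 wall tok/s.
  print(slowdownPercent(135.02, 118.96).toStringAsFixed(1)); // 11.9
}
```

Rounded to whole percentages, these match the roughly 42% and 12% slowdowns quoted above.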

The Pixel NPU path was attempted for gemma-4-E2B-it.litertlm, but native LiteRT-LM failed engine creation for backend npu on this device/model bundle, so there is no NPU performance claim in this PR.

Validation

  • dart format --output=none --set-exit-if-changed .
  • dart analyze
  • dart test -p vm test/unit/core/models/inference/generation_params_test.dart test/unit/backends/litert_lm/litert_lm_service_test.dart test/unit/backends/llama_cpp/llama_cpp_service_test.dart
  • dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart test/unit/backends/webgpu/webgpu_backend_test.dart
  • bash -n tool/macos_fair_litert_vs_llamadart.sh tool/litert_lm_pixel_benchmark.sh
  • ./tool/docs/validate_links.sh
  • git diff --check

leehack marked this pull request as ready for review May 31, 2026 15:02
Copilot AI review requested due to automatic review settings May 31, 2026 15:02

Copilot AI left a comment

Pull request overview

This PR exposes LiteRT-LM native speculative decoding through GenerationParams while keeping unsupported backends explicit and documenting benchmark guidance/results.

Changes:

  • Adds GenerationParams.speculativeDecoding and forwards it to native LiteRT-LM runtime initialization.
  • Rejects speculative decoding for llama.cpp, WebGPU, and LiteRT-LM web with tests.
  • Updates benchmark tooling/app metrics and documentation for LiteRT-LM speculative decoding comparisons.
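
Explicit rejection on unsupported backends could look something like the guard below. This is a minimal sketch under assumptions: the function name, the use of `UnsupportedError`, and the message wording are all hypothetical, and the actual backends may signal the error differently.

```dart
// Hypothetical guard for backends without speculative decoding support.
// The function name, exception type, and message are assumptions; the
// real backends may report the rejection differently.
void rejectSpeculativeDecoding({
  required bool speculativeDecoding,
  required String backendName,
}) {
  if (speculativeDecoding) {
    throw UnsupportedError(
      'speculativeDecoding is not supported on $backendName.',
    );
  }
}

void main() {
  // Default (false) passes through without error.
  rejectSpeculativeDecoding(
    speculativeDecoding: false,
    backendName: 'llama.cpp',
  );

  // Opting in on an unsupported backend fails loudly instead of being
  // silently ignored.
  try {
    rejectSpeculativeDecoding(
      speculativeDecoding: true,
      backendName: 'WebGPU',
    );
  } on UnsupportedError catch (e) {
    print(e.message);
  }
}
```

Failing fast here keeps a caller from assuming speculative decoding is active on a backend that would otherwise drop the flag.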

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| lib/src/core/models/inference/generation_params.dart | Adds the public speculative decoding flag and copy support. |
| lib/src/backends/litert_lm/litert_lm_service.dart | Tracks and forwards speculative decoding to native LiteRT-LM runtime settings. |
| lib/src/backends/litert_lm/litert_lm_backend_web.dart | Rejects speculative decoding on LiteRT-LM web. |
| lib/src/backends/llama_cpp/llama_cpp_service.dart | Rejects speculative decoding for llama.cpp. |
| lib/src/backends/webgpu/webgpu_backend.dart | Rejects speculative decoding for WebGPU. |
| example/chat_app/lib/litert_lm_benchmark_app.dart | Wires the benchmark toggle into generation and records it in metrics. |
| tool/macos_fair_litert_vs_llamadart.sh | Adds a SPECULATIVE env toggle for macOS LiteRT-LM benchmarks. |
| test/unit/core/models/inference/generation_params_test.dart | Covers default and copy behavior for the new parameter. |
| test/unit/backends/litert_lm/litert_lm_service_test.dart | Verifies native LiteRT-LM default/off and opt-in/on forwarding. |
| test/unit/backends/litert_lm/litert_lm_backend_web_test.dart | Covers LiteRT-LM web rejection. |
| test/unit/backends/llama_cpp/llama_cpp_service_test.dart | Covers llama.cpp rejection. |
| test/unit/backends/webgpu/webgpu_backend_test.dart | Covers WebGPU rejection. |
| README.md | Documents LiteRT-LM support and unsupported speculative decoding paths. |
| CHANGELOG.md | Records the new speculative decoding opt-in behavior. |
| website/docs/configuration/runtime-parameters.md | Documents the new GenerationParams field. |
| website/docs/guides/backend-selection.md | Adds native LiteRT-LM support guidance and benchmark caveats. |
| website/docs/guides/backend-benchmarks.md | Adds measured speculative decoding benchmark results and commands. |
| website/docs/guides/performance-tuning.md | Adds tuning guidance and benchmark commands. |
| website/docs/platforms/support-matrix.md | Updates LiteRT-LM platform support notes. |

Comment thread example/chat_app/lib/litert_lm_benchmark_app.dart

codecov-commenter commented May 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.57%. Comparing base (7a9f9d6) to head (5358b08).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #191      +/-   ##
==========================================
+ Coverage   80.55%   80.57%   +0.02%     
==========================================
  Files          85       85              
  Lines       11380    11392      +12     
==========================================
+ Hits         9167     9179      +12     
  Misses       2213     2213              

| Flag | Coverage Δ |
| --- | --- |
| unittests | 80.57% <100.00%> (+0.02%) ⬆️ |

leehack merged commit bd5984f into main May 31, 2026
10 checks passed
leehack deleted the litert-lm-speculative-decoding branch May 31, 2026 15:19