Expose LiteRT-LM speculative decoding through GenerationParams by leehack · Pull Request #191 · leehack/llamadart

leehack · 2026-05-31T15:00:27Z

Summary

Refs #188.

This exposes LiteRT-LM speculative decoding as an opt-in GenerationParams.speculativeDecoding flag and wires it through the native LiteRT-LM runtime settings. The default remains disabled.

Scope

Add GenerationParams.speculativeDecoding with default false and copy support.
Forward the flag to native LiteRT-LM initialization.
Reject the flag explicitly on llama.cpp, WebGPU, and LiteRT-LM web until those paths expose equivalent support.
Update the LiteRT-LM benchmark app and macOS benchmark helper so the speculative toggle actually drives generation and is recorded in metrics.
Document support, unsupported combinations, benchmark guidance, and the measured Gemma 4 E2B results.
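
As a rough illustration of the scope above, the opt-in flag and the explicit rejection on unsupported backends could look like the sketch below. Only `GenerationParams.speculativeDecoding` and its `false` default come from this PR. The `copyWith` shape, the `maxTokens` field, the `Backend` enum, and `checkSpeculativeDecoding` are hypothetical stand-ins, not the package's real API.

```dart
// Illustrative stand-in for the real class; only `speculativeDecoding`
// and its default of `false` are taken from the PR description.
class GenerationParams {
  const GenerationParams({
    this.maxTokens = 256,
    this.speculativeDecoding = false,
  });

  final int maxTokens; // hypothetical field, present only to show copying
  final bool speculativeDecoding; // opt-in; stays disabled by default

  // Assumed shape of the "copy support" mentioned in the scope list.
  GenerationParams copyWith({int? maxTokens, bool? speculativeDecoding}) {
    return GenerationParams(
      maxTokens: maxTokens ?? this.maxTokens,
      speculativeDecoding: speculativeDecoding ?? this.speculativeDecoding,
    );
  }
}

// Hypothetical backend identifiers for the four paths named in the PR.
enum Backend { liteRtLmNative, liteRtLmWeb, llamaCpp, webGpu }

// Rejects the flag loudly instead of silently ignoring it.
void checkSpeculativeDecoding(Backend backend, GenerationParams params) {
  if (params.speculativeDecoding && backend != Backend.liteRtLmNative) {
    throw UnsupportedError(
      'speculativeDecoding is not supported on ${backend.name}.',
    );
  }
}

void main() {
  const defaults = GenerationParams();
  final optedIn = defaults.copyWith(speculativeDecoding: true);

  checkSpeculativeDecoding(Backend.liteRtLmNative, optedIn); // accepted
  checkSpeculativeDecoding(Backend.llamaCpp, defaults); // flag off, accepted

  try {
    checkSpeculativeDecoding(Backend.llamaCpp, optedIn);
  } on UnsupportedError catch (e) {
    print('rejected: ${e.message}');
  }
}
```

Throwing on unsupported backends, rather than ignoring the flag, keeps a benchmark comparison from quietly measuring the non-speculative path.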

Benchmark Notes

The flag is exposed as a tuning knob, not enabled as a default optimization. In measured Gemma 4 E2B runs it was slower:

Pixel 9 Pro, LiteRT-LM GPU: 15.50 wall tok/s with speculativeDecoding false, 9.06 wall tok/s with true, about 42% slower.
Apple M4 Max, LiteRT-LM Metal: 135.02 wall tok/s with speculativeDecoding false, 118.96 wall tok/s with true, about 12% slower.
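
The "about N% slower" figures are the relative drop in wall-clock throughput against the flag-off run. A minimal sketch of that arithmetic, using only the numbers quoted above (the helper name `percentSlower` is made up for this example):

```dart
/// Relative slowdown of [withFlag] against [baseline], as a percentage.
double percentSlower(double baseline, double withFlag) =>
    (baseline - withFlag) / baseline * 100;

void main() {
  // Wall tok/s values copied from the benchmark notes.
  final pixel = percentSlower(15.50, 9.06);
  final m4Max = percentSlower(135.02, 118.96);

  print('Pixel 9 Pro GPU: ${pixel.toStringAsFixed(1)}% slower'); // 41.5
  print('Apple M4 Max Metal: ${m4Max.toStringAsFixed(1)}% slower'); // 11.9
}
```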

The Pixel NPU path was attempted for gemma-4-E2B-it.litertlm, but native LiteRT-LM failed engine creation for backend npu on this device/model bundle, so there is no NPU performance claim in this PR.

Validation

dart format --output=none --set-exit-if-changed .
dart analyze
dart test -p vm test/unit/core/models/inference/generation_params_test.dart test/unit/backends/litert_lm/litert_lm_service_test.dart test/unit/backends/llama_cpp/llama_cpp_service_test.dart
dart test -p chrome test/unit/backends/litert_lm/litert_lm_backend_web_test.dart test/unit/backends/webgpu/webgpu_backend_test.dart
bash -n tool/macos_fair_litert_vs_llamadart.sh tool/litert_lm_pixel_benchmark.sh
./tool/docs/validate_links.sh
git diff --check

Copilot

Pull request overview

This PR exposes LiteRT-LM native speculative decoding through GenerationParams while keeping unsupported backends explicit and documenting benchmark guidance/results.

Changes:

Adds GenerationParams.speculativeDecoding and forwards it to native LiteRT-LM runtime initialization.
Rejects speculative decoding for llama.cpp, WebGPU, and LiteRT-LM web with tests.
Updates benchmark tooling/app metrics and documentation for LiteRT-LM speculative decoding comparisons.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `lib/src/core/models/inference/generation_params.dart` | Adds the public speculative decoding flag and copy support. |
| `lib/src/backends/litert_lm/litert_lm_service.dart` | Tracks and forwards speculative decoding to native LiteRT-LM runtime settings. |
| `lib/src/backends/litert_lm/litert_lm_backend_web.dart` | Rejects speculative decoding on LiteRT-LM web. |
| `lib/src/backends/llama_cpp/llama_cpp_service.dart` | Rejects speculative decoding for llama.cpp. |
| `lib/src/backends/webgpu/webgpu_backend.dart` | Rejects speculative decoding for WebGPU. |
| `example/chat_app/lib/litert_lm_benchmark_app.dart` | Wires the benchmark toggle into generation and records it in metrics. |
| `tool/macos_fair_litert_vs_llamadart.sh` | Adds a `SPECULATIVE` env toggle for macOS LiteRT-LM benchmarks. |
| `test/unit/core/models/inference/generation_params_test.dart` | Covers default and copy behavior for the new parameter. |
| `test/unit/backends/litert_lm/litert_lm_service_test.dart` | Verifies native LiteRT-LM default/off and opt-in/on forwarding. |
| `test/unit/backends/litert_lm/litert_lm_backend_web_test.dart` | Covers LiteRT-LM web rejection. |
| `test/unit/backends/llama_cpp/llama_cpp_service_test.dart` | Covers llama.cpp rejection. |
| `test/unit/backends/webgpu/webgpu_backend_test.dart` | Covers WebGPU rejection. |
| `README.md` | Documents LiteRT-LM support and unsupported speculative decoding paths. |
| `CHANGELOG.md` | Records the new speculative decoding opt-in behavior. |
| `website/docs/configuration/runtime-parameters.md` | Documents the new `GenerationParams` field. |
| `website/docs/guides/backend-selection.md` | Adds native LiteRT-LM support guidance and benchmark caveats. |
| `website/docs/guides/backend-benchmarks.md` | Adds measured speculative decoding benchmark results and commands. |
| `website/docs/guides/performance-tuning.md` | Adds tuning guidance and benchmark commands. |
| `website/docs/platforms/support-matrix.md` | Updates LiteRT-LM platform support notes. |

codecov-commenter · 2026-05-31T15:07:04Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.57%. Comparing base (7a9f9d6) to head (5358b08).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #191      +/-   ##
==========================================
+ Coverage   80.55%   80.57%   +0.02%     
==========================================
  Files          85       85              
  Lines       11380    11392      +12     
==========================================
+ Hits         9167     9179      +12     
  Misses       2213     2213

| Flag | Coverage Δ |
| --- | --- |
| unittests | `80.57% <100.00%> (+0.02%)` ⬆️ |

Expose LiteRT-LM speculative decoding

b26f9d5

leehack marked this pull request as ready for review May 31, 2026 15:02

Copilot AI review requested due to automatic review settings May 31, 2026 15:02

Copilot started reviewing on behalf of leehack May 31, 2026 15:02

Copilot AI reviewed May 31, 2026

Comment thread example/chat_app/lib/litert_lm_benchmark_app.dart

leehack added 2 commits May 31, 2026 11:07

Default speculative benchmark runs off

ad88208

Cover speculative LiteRT-LM client reuse

5358b08

leehack merged commit bd5984f into main May 31, 2026
10 checks passed

leehack deleted the litert-lm-speculative-decoding branch May 31, 2026 15:19
