Optimize CUDA inference path by groxaxo · Pull Request #71 · k2-fsa/OmniVoice

groxaxo · 2026-04-09T09:32:15Z

Summary

auto-tune OmniVoice CUDA inference runtime for Ampere-class GPUs
replace dense iterative-decoding attention masks with flex-attention block masks and avoid full-logit fp32 upcasts
document the bottleneck audit and add runtime coverage tests
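
For reviewers unfamiliar with block masks, here is a rough plain-Python sketch of the idea behind the second bullet. It is not the code in this PR and does not call PyTorch. It assumes a simple padding mask (the hypothetical `padding_mask` helper), whereas the actual iterative-decoding mask may differ. The mask is expressed as a function of `(q_idx, kv_idx)` instead of a dense tensor, and each tile is classified so that fully masked tiles can be skipped and fully visible tiles need no per-element check. The names `block_mask_summary`, `padding_mask`, and the tile size of 4 are illustrative only.

```python
from typing import Callable, List

MaskFn = Callable[[int, int], bool]


def padding_mask(valid_len: int) -> MaskFn:
    """Hypothetical mask: a query may attend only to non-padding keys."""
    def mask_fn(q_idx: int, kv_idx: int) -> bool:
        return kv_idx < valid_len
    return mask_fn


def block_mask_summary(mask_fn: MaskFn, q_len: int, kv_len: int,
                       block: int = 4) -> List[List[str]]:
    """Classify each (q_block, kv_block) tile as 'full', 'partial', or 'empty'."""
    rows = []
    for q0 in range(0, q_len, block):
        row = []
        for k0 in range(0, kv_len, block):
            # Evaluate the mask for every (query, key) pair inside this tile.
            vals = [
                mask_fn(q, k)
                for q in range(q0, min(q0 + block, q_len))
                for k in range(k0, min(k0 + block, kv_len))
            ]
            if all(vals):
                row.append("full")      # no per-element masking needed
            elif not any(vals):
                row.append("empty")     # the whole tile can be skipped
            else:
                row.append("partial")   # mask must be applied element-wise
        rows.append(row)
    return rows


# 8 queries, 12 key positions, of which only the first 6 are real tokens.
summary = block_mask_summary(padding_mask(valid_len=6), q_len=8, kv_len=12)
skipped = sum(tile == "empty" for row in summary for tile in row)
```

In this toy case one of the three key tiles in each query row is entirely padding, so a block-sparse kernel would never touch it, while a dense mask would still materialize and read all 8 × 12 entries.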

Testing

PYTHONPATH=. pytest -q


OpenCode and others added 2 commits April 9, 2026 19:15

Add OmniVoice credits page

6985abd

Serve a browser-friendly /ui page, document it, and cover it with unit tests. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Optimize CUDA inference path

2d20c4c

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

groxaxo wants to merge 2 commits into k2-fsa:master from groxaxo:copilot/cuda-inference-optimizations
