
Optimize CUDA inference path #71

Open
groxaxo wants to merge 2 commits into k2-fsa:master from groxaxo:copilot/cuda-inference-optimizations
groxaxo commented Apr 9, 2026

Summary

  • auto-tune the OmniVoice CUDA inference runtime for Ampere-class GPUs
  • replace dense iterative-decoding attention masks with flex-attention block masks and avoid full-logit fp32 upcasts
  • document the bottleneck audit and add runtime coverage tests
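
As a rough illustration of the second bullet, the sketch below shows why a block mask can be cheaper than a dense mask when decoding over padded sequences. It is plain Python, not the PR's code and not the flex-attention API: the helper names `padding_mask` and `build_block_mask`, the padding-only mask rule, and the tile labels are all assumptions made for this example. PyTorch's flex attention builds a comparable per-tile summary from a mask function, so tiles that are entirely masked out never need to be computed.

```python
def padding_mask(valid_len):
    """Hypothetical mask rule: attention is allowed only between non-padded positions."""
    def allowed(q_idx, kv_idx):
        return q_idx < valid_len and kv_idx < valid_len
    return allowed


def build_block_mask(allowed, seq_len, block_size):
    """Classify each (query block, key block) tile as 'empty', 'full' or 'partial'.

    A dense mask stores seq_len * seq_len booleans. A block mask stores one
    label per tile, and an attention kernel can skip 'empty' tiles outright.
    """
    n_blocks = (seq_len + block_size - 1) // block_size  # ceiling division
    tiles = []
    for qb in range(n_blocks):
        q_range = range(qb * block_size, min((qb + 1) * block_size, seq_len))
        row = []
        for kb in range(n_blocks):
            k_range = range(kb * block_size, min((kb + 1) * block_size, seq_len))
            # Count how many (query, key) pairs inside this tile are allowed.
            hits = sum(allowed(q, k) for q in q_range for k in k_range)
            total = len(q_range) * len(k_range)
            if hits == 0:
                label = "empty"      # nothing to compute for this tile
            elif hits == total:
                label = "full"       # no per-element masking needed
            else:
                label = "partial"    # per-element mask still required
            row.append(label)
        tiles.append(row)
    return tiles


# A 12-token padded sequence of which only the first 6 tokens are real.
tiles = build_block_mask(padding_mask(valid_len=6), seq_len=12, block_size=4)
skipped = sum(row.count("empty") for row in tiles)

for row in tiles:
    print(row)
print(f"{skipped} of {len(tiles) ** 2} tiles can be skipped")
```

With these toy sizes, 5 of the 9 tiles are empty, so more than half of the attention work for that sequence could be skipped. The real saving depends on the block size and on how much padding the iterative decoder actually produces.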

Testing

  • PYTHONPATH=. pytest -q

OpenCode and others added 2 commits April 9, 2026 19:15
Serve a browser-friendly /ui page, document it, and cover it with unit tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
