support Flash attention #74

Open
yuekaizhang wants to merge 2 commits into k2-fsa:master from yuekaizhang:flash_attn

@yuekaizhang

Flash Attention Support

Tested with 50 randomly distributed audio samples, randomly grouped into batches of 4.

FlashAttention-2 with packed input (varlen) avoids redundant computation on padding tokens, reducing inference time:

| GPU | w/o flash_attn | w/ flash_attn | Speedup |
|-----|----------------|---------------|---------|
| L20 | 29s | 26s | ~10% |
| H20 | 25s | 23s | ~8% |

Usage:

omnivoice-infer-batch --use_flash_attn --batch_size 4 ...
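
To illustrate why packed (varlen) input skips padding work, here is a minimal sketch in plain Python. It is not the code from this PR: the sequence lengths and the helper names `pack_lengths` and `padding_fraction` are hypothetical. It only shows the bookkeeping that varlen attention kernels typically expect, namely cumulative sequence offsets (the flash-attn package's `flash_attn_varlen_func` takes such offsets as `cu_seqlens_q` / `cu_seqlens_k`), and how much of a padded batch would be padding.

```python
from itertools import accumulate


def pack_lengths(seq_lens):
    """Return (cu_seqlens, max_seqlen, total_tokens) for a packed batch.

    cu_seqlens[i] is the offset of sequence i in the packed token
    dimension; the last entry equals the total number of real tokens.
    """
    cu_seqlens = [0, *accumulate(seq_lens)]
    return cu_seqlens, max(seq_lens), cu_seqlens[-1]


def padding_fraction(seq_lens):
    """Fraction of a right-padded (batch, max_len) tensor that is padding."""
    padded_tokens = len(seq_lens) * max(seq_lens)
    return 1.0 - sum(seq_lens) / padded_tokens


# Hypothetical token counts for one batch of 4 utterances.
seq_lens = [120, 310, 95, 400]

cu_seqlens, max_seqlen, total_tokens = pack_lengths(seq_lens)
print("cu_seqlens:", cu_seqlens)
print("max_seqlen:", max_seqlen)
print("real tokens:", total_tokens)
print("padding fraction if padded:", padding_fraction(seq_lens))

# With a single sequence there is nothing to pad, so packing removes no work.
print("padding fraction, batch of 1:", padding_fraction([400]))
```

In this sketch the attention cost saved by packing grows with the spread of lengths inside a batch. A batch of one has no padding, so any gain there would have to come from the kernel itself rather than from packing.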

@huangxuegang1129-oss

Was the demo.py file not modified?

@ZovutVanya

Does flash attention only help with batches, or single audio too?

3 participants