Checks
Question details
Hello! I would like to contribute my research. I don't want to change any code or add any features; I want to share statistics measured on one setup (RTX 5090) so people can see the real RTF and the limits they can reach with F5 (at least at the current stage of my work). I will use TTFS to mean Time to First Sound rather than Time To Final Segment, because I implemented chunked streaming and my final segment is effectively the first chunk.
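To make the TTFS definition concrete, here is a minimal sketch of how Time to First Sound could be measured against a chunked stream. The `fake_stream` generator and the `measure_ttfs` helper are hypothetical stand-ins for illustration, not part of F5 or of my implementation.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def fake_stream(n_chunks: int = 3, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a chunked TTS stream: yields dummy audio chunks."""
    for _ in range(n_chunks):
        time.sleep(delay_s)          # pretend this is synthesis time for one chunk
        yield b"\x00" * 1024         # dummy PCM bytes


def measure_ttfs(chunks: Iterable[bytes]) -> Tuple[Optional[float], float, int]:
    """Return (time to first chunk, total time, chunk count), times in seconds."""
    start = time.perf_counter()
    ttfs: Optional[float] = None
    count = 0
    for _ in chunks:
        if ttfs is None:
            # The first chunk to arrive defines Time to First Sound.
            ttfs = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return ttfs, total, count


ttfs, total, count = measure_ttfs(fake_stream())
print(f"TTFS: {ttfs * 1000:.1f} ms, total: {total * 1000:.1f} ms, chunks: {count}")
```

With chunked streaming, TTFS is bounded by the first chunk only, while the total time still covers the whole utterance.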
The topics I would like to cover:
- FP16 vs FP32 (quality / latency)
- Preprocess and decode overhead
- Rust-based deployment and comparison
- Rust vs Python ORT delta
- Chunked streaming benchmarks
- Runtime vs schedule optimisation breakdown
- Duration formula side effects (slow mode + artefacts vs last word cut)
Here is one example of what I would like to provide:

| Stage | ours-onnx (Rust) | onnx-dakeqq (Python, IO Binding) | pytorch (FP16) | Notes |
| --- | --- | --- | --- | --- |
| Preprocess | 10ms | 42ms | 1ms | Rust vs Python ORT overhead; PyTorch has no separate preprocess |
| Transformer | 266ms | 290ms | 297ms | IO Binding vs IO Binding vs native PyTorch (sway sampling) |
| Decode | 3ms | 17ms | 2ms | Vocos; PyTorch runs decode in-process |
| Total | 280ms | 350ms | 299ms | |
| per step | 16.6ms | 18.1ms | 18.5ms | Rust IO Binding + custom ORT wins per-step |
| Output duration | 7.97s | 7.94s | 7.95s | Forced equal via fixed mel frames |
| RTF | 0.035 | 0.044 | 0.038 | |
| Steps | 16 (EPSS) | 16 (EPSS) | 16 (sway) | |
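
For readers who want to check the derived rows, here is a small sketch of how they relate to the measured stages, assuming RTF is total synthesis time divided by output duration and the per-step figure is transformer time divided by the step count. The `StageTimings` class is a hypothetical helper for illustration; the numbers are the rounded ours-onnx (Rust) values from the table, so the sum can differ from the reported total by about 1 ms.

```python
from dataclasses import dataclass


@dataclass
class StageTimings:
    """Hypothetical container for one benchmark column (times in ms, audio in s)."""
    preprocess_ms: float
    transformer_ms: float
    decode_ms: float
    steps: int
    output_s: float

    @property
    def total_ms(self) -> float:
        # End-to-end synthesis time is the sum of the three stages.
        return self.preprocess_ms + self.transformer_ms + self.decode_ms

    @property
    def per_step_ms(self) -> float:
        # Assumption: the per-step figure covers the transformer stage only.
        return self.transformer_ms / self.steps

    @property
    def rtf(self) -> float:
        # Real-time factor: seconds of compute per second of generated audio.
        return (self.total_ms / 1000.0) / self.output_s


# Rounded values from the ours-onnx (Rust) column above.
rust = StageTimings(preprocess_ms=10, transformer_ms=266, decode_ms=3,
                    steps=16, output_s=7.97)
print(f"total={rust.total_ms:.0f} ms, "
      f"per step={rust.per_step_ms:.1f} ms, RTF={rust.rtf:.3f}")
```

An RTF below 1 means audio is generated faster than it plays back; lower is better.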
I could also provide .wav examples generated with the same voice, seed, and text for comparison.
My question: what is the best way to share this information? A pull request, or a Discussion?