Streaming SoundQueue refactor. Addition of Latency Metrics. Addition of Piper backend #268
mitcheb0219 wants to merge 13 commits into karashiiro:main from
Conversation
Added Logging to measure time to first audio
    bool includeSpeakAttributes = true)
{
    text = System.Security.SecurityElement.Escape(text);
This can be ignored as it was already added as a bugfix.
"Microsoft.ML.OnnxRuntime.Gpu": {
  "type": "Transitive",
  "resolved": "1.23.2",
  "contentHash": "4GNQUc6FHiWHvp95Yhu95SUDa6HVm+RSQxm7QCH3PIlderDhTPdU98fHHKXmLy4xIQikkEraMcGe+KXEQU5tew==",
  "dependencies": {
    "Microsoft.ML.OnnxRuntime.Gpu.Linux": "1.23.2",
    "Microsoft.ML.OnnxRuntime.Gpu.Windows": "1.23.2",
    "Microsoft.ML.OnnxRuntime.Managed": "1.23.2"
  }
},
"Microsoft.ML.OnnxRuntime.Gpu.Linux": {
  "type": "Transitive",
  "resolved": "1.23.2",
  "contentHash": "bcv2zpP8GNnfdUCkOjE9lzIoslAOCuY0T9QHpI5+Qm6qUcehRPtGC8wF4nvySwyfTe0g3rVINP3SSj1zinkE7Q==",
  "dependencies": {
    "Microsoft.ML.OnnxRuntime.Managed": "1.23.2"
  }
},
"Microsoft.ML.OnnxRuntime.Gpu.Windows": {
  "type": "Transitive",
  "resolved": "1.23.2",
  "contentHash": "qOU3DVcxq4XalFV3wlrNrdatYWufIqvg8FZqVC3LS2rFPoTfl++xpMC2nnaxB2Wc5jrpDb2izrcDsQatCyjVnA==",
  "dependencies": {
    "Microsoft.ML.OnnxRuntime.Managed": "1.23.2"
  }
},
I thought I had removed these before pushing. I was attempting to use CUDA for synthesis processing, but the entire suite proved too bulky/cumbersome for only marginal gains.
long methodStart = Stopwatch.GetTimestamp();
_ttsCts?.Cancel();
_ttsCts = new CancellationTokenSource();
var token = _ttsCts.Token;
Cancellation tokens were added to each backend's Say method in order to allow cancellation at all 3 stages:
- Before synthesis starts (cancellation token)
- During synthesis / playback (cancellation token and StopHardware)
- After synthesis / during playback (StopHardware)
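The three stages above can be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: `SynthesizeAsync`, `Play`, and `StopHardware` are assumed names, while the `_ttsCts` pattern mirrors the diff shown above.

```csharp
private CancellationTokenSource? _ttsCts;

public async Task Say(string text)
{
    // Cancel any in-flight request and create a fresh token for this one.
    _ttsCts?.Cancel();
    _ttsCts = new CancellationTokenSource();
    var token = _ttsCts.Token;

    // Stage 1: before synthesis starts — the token alone is enough to bail out.
    token.ThrowIfCancellationRequested();

    // Stage 2: during synthesis/playback — the token is passed into the backend
    // so it can abort mid-stream; StopHardware() additionally halts the device.
    var stream = await SynthesizeAsync(text, token);

    // Stage 3: after synthesis completes, only StopHardware() can interrupt playback.
    Play(stream);
}
```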
private static void HandleResult(SpeechSynthesisResult res)
{
    if (res.Reason == ResultReason.Canceled)
    {
        var cancellation = SpeechSynthesisCancellationDetails.FromResult(res);
        if (cancellation.Reason == CancellationReason.Error)
        {
            DetailedLog.Error($"Azure request error: ({cancellation.ErrorCode}) \"{cancellation.ErrorDetails}\"");
        }
        else
        {
            DetailedLog.Warn($"Azure request failed in state \"{cancellation.Reason}\"");
        }

        return;
    }

    if (res.Reason != ResultReason.SynthesizingAudioCompleted)
    {
        DetailedLog.Warn($"Speech synthesis request completed in incomplete state \"{res.Reason}\"");
    }
}
Given the new synth method, it didn't seem this was necessary any more. I could be wrong, but wanted to make a note of it here so it gets attention.
public static bool ImGuiStylesCombo(string label, string previewText, SortedSet<int> selectedIndices, List<string> styles)
{
    // Use the passed-in string, or a placeholder if it's empty
    string displayValue = !string.IsNullOrEmpty(previewText) ? previewText : "None selected";

    bool didChange = false;

    // The second parameter of BeginCombo controls what is shown in the closed box
    if (ImGui.BeginCombo(label, displayValue))
    {
        for (int i = 0; i < styles.Count; i++)
        {
            bool isSelected = selectedIndices.Contains(i);

            // Use Selectable with DontClosePopups for multi-select
            if (ImGui.Selectable(styles[i], isSelected, ImGuiSelectableFlags.DontClosePopups))
            {
                if (!isSelected)
                    selectedIndices.Add(i);
                else
                    selectedIndices.Remove(i);

                didChange = true;
            }

            if (isSelected)
                ImGui.SetItemDefaultFocus();
        }

        ImGui.EndCombo();
    }

    return didChange;
}
This is for the backends with custom styles enabled (ElevenLabs and OpenAI). The user can now select multiple styles from whatever they have configured. The backends do a decent job (sometimes) of mixing them as intended (example: Accent: Irish and Tone: Shouting).
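A hypothetical usage sketch of the multi-select combo above; the label, style names, and `config.Save()` call are illustrative, not taken from the PR.

```csharp
var selected = new SortedSet<int>();
var styles = new List<string> { "Accent: Irish", "Tone: Shouting" };

// Build the closed-box preview from the current selection; when nothing is
// selected, ImGuiStylesCombo falls back to "None selected".
string preview = string.Join(", ", selected.Select(i => styles[i]));

if (ImGuiStylesCombo("Styles", preview, selected, styles))
{
    // Persist whenever the selection changes.
    config.Save();
}
```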
var validSampleRates = new[] { "8000", "16000", "22050", "24000" };
var sampleRate = currentVoicePreset.SampleRate.ToString();
var sampleRateIndex = Array.IndexOf(validSampleRates, sampleRate);
if (ImGui.Combo($"Sample rate##{MemoizedId.Create()}", ref sampleRateIndex, validSampleRates,
    validSampleRates.Length))
{
    currentVoicePreset.SampleRate = int.Parse(validSampleRates[sampleRateIndex]);
    this.config.Save();
}

var pitch = currentVoicePreset.Pitch ?? 0;
if (ImGui.SliderFloat($"Pitch##{MemoizedId.Create()}", ref pitch, -10f, 10f, "%.2fx"))
{
    currentVoicePreset.Pitch = pitch;
    config.Save();
}
Moving GoogleCloud to its streaming capability made it necessary to remove these sliders. Only Chirp and HD-3 voices are capable of streaming their output, and these voices cannot have their sample rates or pitches changed in the request. This is being removed to avoid unintended Bad Request responses from the GoogleCloud SDK.
// 0.25 - 4.0 (default 1.0)
// 0.25 - 2.0 (default 1.0)
Adjusted to stay within GoogleCloud Chirp voice maximums.
if (nextItem.Aborted) break;

var samples = model.Infer(chunk, nextItem.Voice.Features, nextItem.Speed);
byte[] bytes = KokoroPlayback.GetBytes(samples);

// POST-INFERENCE ABORT CHECK: Prevent enqueuing "zombie" audio
if (nextItem.Aborted) break;

lock (this.soundLock)
{
    if (this.bufferedProvider != null && this.soundOut != null)
    {
        this.bufferedProvider.AddSamples(bytes, 0, bytes.Length);
        if (this.soundOut.PlaybackState != PlaybackState.Playing)
        {
            if (nextItem.StartTime.HasValue)
            {
                var elapsed = Stopwatch.GetElapsedTime(nextItem.StartTime.Value);
                this.latencyTracker.AddLatency(elapsed.TotalMilliseconds);
                Log.Debug("Total Latency (Say -> Play): {Ms}", elapsed.TotalMilliseconds);
            }
            this.soundOut.Play();
        }
    }
}
Kokoro still has its own SoundQueue. This probably could be joined into the StreamingSoundQueue class via the EnqueueSound method.
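A rough sketch of what routing Kokoro through the shared queue might look like. The `EnqueueSound` signature is taken from a diff elsewhere in this PR; the `MemoryStream` wrapping and the `StreamFormat.Raw` value are assumptions for illustration.

```csharp
var samples = model.Infer(chunk, nextItem.Voice.Features, nextItem.Speed);
byte[] bytes = KokoroPlayback.GetBytes(samples);

if (!nextItem.Aborted)
{
    // Hand the PCM bytes to the shared streaming queue instead of the
    // Kokoro-local buffer; StartTime carries through for latency tracking.
    this.streamingSoundQueue.EnqueueSound(
        new MemoryStream(bytes),
        nextItem.Source,
        nextItem.Volume,
        StreamFormat.Raw,   // assumed raw-PCM format value
        response: null,
        timeStamp: nextItem.StartTime);
}
```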
foreach (var styleName in config.CustomVoiceStyles)
{
    bool isSelected = currentVoicePreset.Styles.Contains(styleName);

    if (ImGui.Selectable(styleName, isSelected, ImGuiSelectableFlags.DontClosePopups))
    {
        if (isSelected)
            currentVoicePreset.Styles.Remove(styleName);
        else
            currentVoicePreset.Styles.Add(styleName);

        currentVoicePreset.SyncStringFromSet();
        this.config.Save();
    }
}
ImGui.EndCombo();
OpenAI voice styles now allow the user to select multiple styles from the list. They are concatenated together as part of the voice preset and sent with every TTS request. This is very similar to the existing instructions field for OpenAI, but it ties in to the new VoiceStyles config window.
Entirely new backend, built to function similarly to Kokoro. Upon selection, the engine is downloaded/installed within the TextToTalk directory, and a single voice is downloaded with it. The user then has the option to download/delete additional voices at their discretion. A UI element for voice directory size was added for user visibility, as the total suite CAN take up to 9GB of drive storage.
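A minimal sketch of how the voice-directory size shown in the UI could be computed; the method name, path handling, and formatting are assumptions rather than the PR's implementation.

```csharp
static string GetVoiceDirectorySize(string voiceDir)
{
    if (!Directory.Exists(voiceDir)) return "0 MB";

    // Sum the size of every downloaded voice file, recursively.
    long bytes = Directory
        .EnumerateFiles(voiceDir, "*", SearchOption.AllDirectories)
        .Sum(f => new FileInfo(f).Length);

    return $"{bytes / (1024.0 * 1024.0):F1} MB";
}
```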
Azure,
System,
Uberduck,
Piper,
PiperLow,
PiperHigh,
Additional StreamFormats added to ensure proper routing in StreamingSoundQueue class.
Log.Information("Playing");
this.soundOut = new DirectSoundOut(playbackDeviceId);
this.soundOut.PlaybackStopped += (_, _) => { this.speechCompleted.Set(); };
this.soundOut.Init(volumeSampleProvider);
These changes were just to log initial timestamps for audio playback. This class has been left unchanged; however, I believe this build will have left 0 references to it.
IMO we should just delete this class if it's unused.
New class built to handle streaming requests. Network vs. seekable streams have their logic split, as does compressed vs. uncompressed stream audio, which varies from backend to backend.
Ultimately the LatencyTracker logic ends up here to keep track of timestamp diffs, so we can empirically see whether this whole effort has resulted in better performance.
Audio device selection capability is preserved as well.
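The compressed/uncompressed split described above might look roughly like this, assuming NAudio types; this is a simplified sketch (real non-seekable network MP3 streams would need frame-by-frame decoding rather than `Mp3FileReader`, and the sample rate shown is illustrative).

```csharp
WaveStream BuildWaveStream(Stream data, StreamFormat format)
{
    if (format == StreamFormat.Mp3)
    {
        // Compressed audio: decode the MP3 container.
        return new Mp3FileReader(data);
    }

    // Uncompressed PCM: wrap the raw bytes with a known wave format
    // (16-bit mono at a backend-specific sample rate; 24 kHz assumed here).
    var waveFormat = new WaveFormat(24000, 16, 1);
    return new RawSourceWaveStream(data, waveFormat);
}
```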
Because Microsoft Speech Synthesis is so rigid about its audio playback, System now has its own sound queue again. Though it does utilize streaming, it required a custom bridge to allow the bytes to stream from the SpeakAsync request to the playback device. Because of this, it made sense to keep that custom code contained here.
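A hedged sketch of such a push-to-pull bridge between a synthesis callback and the playback device, assuming NAudio's `BufferedWaveProvider`; the wave format and names are illustrative, not the PR's actual bridge code.

```csharp
var bridge = new BufferedWaveProvider(new WaveFormat(16000, 16, 1))
{
    BufferDuration = TimeSpan.FromSeconds(30),
};

// Synthesis side: each audio chunk is pushed into the buffer as it arrives.
void OnAudioChunk(byte[] chunk) => bridge.AddSamples(chunk, 0, chunk.Length);

// Playback side: the device pulls from the same buffer, so playback can start
// before synthesis finishes.
using var output = new DirectSoundOut();
output.Init(bridge);
output.Play();
```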
Uberduck needed a major rework, as this backend hasn't functioned at all since they updated to a versioned API.
Even with the streaming capabilities, their API still locks any audio from being downloaded until the entire synthesis is complete, so latency is still kind of bad with this one.
Gathering = 67,
FCAnnouncement = 69,
FCLogin = 70,
RetainerSale = 71,
//RetainerSale = 71,
PartyFinderState = 72,
ActionUsedOnYou = 2091,
FailedActionUsedOnYou = 2218,
This can be ignored, as it was part of the last bugfix. I just forgot to sync up.
if (!this.filters.OnlyMessagesFromYou(speaker?.Name.TextValue ?? sender.TextValue)) return;

if (!this.filters.ShouldSayFromYou(speaker?.Name.TextValue ?? sender.TextValue)) return;

OnTextEmit.Invoke(new ChatTextEmitEvent(
    GetCleanSpeakerName(speaker, sender),
    textValue,
    speaker,
    type));

else if (type == XivChatType.TellOutgoing && config.SkipMessagesFromYou == true) return;

OnTextEmit.Invoke(new ChatTextEmitEvent(
    GetCleanSpeakerName(speaker, sender),
    textValue,
    speaker,
    type));
This can also be ignored, as it was part of the last bugfix regarding outgoing tells.
{
    this.synthesizer?.Dispose();
    this.soundQueue?.Dispose();
    this.soundQueue?.Dispose();
nit: duplicate Dispose() call
    apiKey = (credentials.Password);
}
//RawStreamingSoundQueue = new RawStreamingSoundQueue(config);
OpenAi = new OpenAiClient(SoundQueue, apiKey);
What happens if the credentials are incorrect here (if nothing happens that's fine)?
Because the API key is checked in the UI when it's entered, an invalid key won't get saved.
I tried revoking my OpenAI key after having it saved in the CredentialManager and sending a speech request. It simply returned a 401 Unauthorized error, with no crashes.
public int? SampleRate { get; set; }

// -20.0 - 20.0 is theoretical max, but it's lowered to work better with sliders (default 0.0)
public float? Pitch { get; set; }
Are these supposed to be removed?
Yes for Pitch, because a key to utilizing streaming with Google Cloud is to only call the voices/voice types that are enabled for streaming. These voices are more lifelike and do not have toggles for Pitch.
Sample Rate was supposed to be a temporary omission until I got the playback working. I've added it back in the next commit.
foreach (var voice in response.Voices)
{
    fetchedVoices.Add(voice.Name, new
    if (voice.Name.Contains("Chirp3") || voice.Name.Contains("Chirp-HD")) // Focusing on Chirp 3 and Chirp HD voices, as these are the only ones enabled for streaming. From what I can tell, this actually reduces duplicates of the same voice under different formats.
How does this behave when someone already had a different engine in use before this?
Hmm.
---> Grpc.Core.RpcException: Status(StatusCode="InvalidArgument", Detail="Currently, only Chirp 3: HD voices are supported for streaming synthesis.")
This response comes back from the API. What do you think should be done here? Force onto a new voice? Migrate the presets? Or keep the old non-streaming enabled voices and fork the audio processing?
Turns out it wasn't too hard to branch the voice processing based on a boolean that's determined by the voice name. This way, all voices can be kept in the backend; HD voices will now leverage bi-directional streaming.
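The name-based branching described here might look something like the following; the `Contains` checks are quoted from the diff above, while the two synthesis method names are assumptions for illustration.

```csharp
// Chirp 3 / Chirp HD voices are the only ones enabled for streaming synthesis.
bool isStreamingVoice = voice.Name.Contains("Chirp3") || voice.Name.Contains("Chirp-HD");

if (isStreamingVoice)
{
    // Bi-directional streaming path for HD voices.
    await SynthesizeStreamingAsync(request, token);
}
else
{
    // Older voices fall back to the non-streaming request path, avoiding the
    // InvalidArgument response quoted above.
    await SynthesizeBatchAsync(request, token);
}
```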
private float[] dataArray = Array.Empty<float>();
private DateTime lastUpdateTime = DateTime.MinValue;
private readonly object updateLock = new();
public static StatsWindow? Instance { get; private set; }
We can retrieve the window instance from WindowSystem, so we shouldn't need a static reference here.
This was my way of shortcutting the need to pass the window instance into the Configuration Window. It's the same way I configured the Voice Styles window so it could be opened independently of the Config Window.
}

// 2. Branch logic based on format (Encoded vs Raw)
if (nextItem.Format == StreamFormat.Mp3 || nextItem.Format == StreamFormat.Uberduck)
This appears to be the only place where we use StreamFormat.Uberduck, so we can just remove it and use StreamFormat.Mp3 instead.
Change made for the next commit.
private LatencyTracker latencyTracker = latencyTracker;

public void EnqueueSound(Stream data, TextSource source, float volume, StreamFormat format, HttpResponseMessage? response, long? timeStamp)
Maybe we should add a waveFormat parameter to this method, and to StreamingSoundQueueItem? This class shouldn't need to know the audio formats of each backend beyond the actual encoding. That way we can get rid of all of the provider-specific StreamFormat values, I think.
In the next commit I will have cleaned up the extraneous formats and streamlined the raw PCM streams into one of four wave formats, all 16-bit mono:
- 8 kHz
- 16 kHz
- 22.05 kHz
- 24 kHz

Any backends that leverage MP3 have their sample rates read directly from the frames and do not require the rate to be given; Polly is an example.
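Mapping those four raw-PCM variants onto NAudio wave formats could be as simple as the following sketch; the `StreamFormat` member names here are assumptions, not the PR's final enum values.

```csharp
static WaveFormat GetWaveFormat(StreamFormat format) => format switch
{
    // All four variants are 16-bit mono; only the sample rate differs.
    StreamFormat.Raw8k  => new WaveFormat(8000, 16, 1),
    StreamFormat.Raw16k => new WaveFormat(16000, 16, 1),
    StreamFormat.Raw22k => new WaveFormat(22050, 16, 1),
    StreamFormat.Raw24k => new WaveFormat(24000, 16, 1),
    _ => throw new ArgumentOutOfRangeException(nameof(format)),
};
```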
{
    frame = Mp3Frame.LoadFromStream(readFullyStream);
}
catch (Exception) // Catching interruptions here
Can we catch a more specific class here?
I've actually removed this try block altogether. It was originally put in when I was figuring out how to gracefully cancel the synthesis generation while playback was occurring. Initially the cancellations generated unhandled exceptions so this was put in to smooth out the testing process. It's no longer needed.
{
    this.uiModel.SoundQueue.CancelAllSounds();

    if (uiModel.OpenAi._ttsCts != null)
We should add a Cancel() method on OpenAiClient and possibly also the UIModel instead of accessing deep internals like this. Same applies to CancelSay.
I did something similar for ElevenLabs as well. I was trying to avoid having to pass the Client class into the backend class, both to avoid a circular dependency and to avoid re-ordering the way the backend gets initialized. But agreed, this one feels bad.
I'll need to look closer at the other backends (like GoogleCloud) and mimic their constructors. It'll be the next change I write up.
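The reviewer's suggestion could be sketched as exposing a `Cancel()` method on the client instead of reaching into `_ttsCts` from the UI. The names below mirror the diff above, but the exact class shape is an assumption.

```csharp
public class OpenAiClient
{
    private CancellationTokenSource? _ttsCts;

    // Public wrapper so callers never touch the token source directly.
    public void Cancel()
    {
        _ttsCts?.Cancel();
    }
}

// The UI side then becomes:
// this.uiModel.SoundQueue.CancelAllSounds();
// this.uiModel.OpenAi.Cancel();
```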
Refactored OpenAI and ElevenLabs backends to align more closely with the other backends in terms of structure. This reduces the number of layers these methods have to reach through in order to call the cancellation tokens.
Extensive rework of SoundQueues to utilize streaming to a buffer. This allows playback while synthesis is still in progress. For some backends this has significantly reduced latency.
Made changes to the Uberduck backend to support their versioned API and their use of API keys.
Added Piper backend. The user has full control over downloaded voices and can monitor the size of the voice directory in the ConfigWindow.
Added Latency Tracker (accessible via the ConfigWindow or the /tttstats command). This shows latency for all TTS requests during the session. Stats can be manually cleared for testing. This is to give the user an idea of the time to first audio.
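The latency measurement itself can be sketched as below. `Stopwatch.GetTimestamp`/`Stopwatch.GetElapsedTime` are the calls shown in the PR's diffs; the `LatencyTracker` internals here are an assumption about what a minimal version might look like.

```csharp
public class LatencyTracker
{
    private readonly List<double> latenciesMs = new();

    public void AddLatency(double ms) => latenciesMs.Add(ms);

    public double AverageMs => latenciesMs.Count > 0 ? latenciesMs.Average() : 0;

    // Exposed so the stats window can be manually cleared for testing.
    public void Clear() => latenciesMs.Clear();
}

// At Say time:
long start = Stopwatch.GetTimestamp();
// ... synthesis runs; once the first audio reaches the device ...
var elapsed = Stopwatch.GetElapsedTime(start);
latencyTracker.AddLatency(elapsed.TotalMilliseconds);
```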