
Streaming SoundQueue refactor. Addition of Latency Metrics. Addition of Piper backend #268

Open
mitcheb0219 wants to merge 13 commits into karashiiro:main from mitcheb0219:main

Conversation

@mitcheb0219 (Collaborator)

Extensive rework of the SoundQueues to stream synthesized audio to a buffer. This allows playback to begin while synthesis is still in progress; for some backends this has significantly reduced latency.

Updated the Uberduck backend to support their versioned API and its use of API keys.

Added a Piper backend. The user has full control over downloaded voices and can monitor the size of the voice directory in the ConfigWindow.

Added a Latency Tracker (accessible via the ConfigWindow or the /tttstats command). This shows latency for all TTS requests during the session, and stats can be manually cleared for testing. The goal is to give the user an idea of the time to first audio.
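
A sketch of what such a tracker could look like (the class name and AddLatency mirror code quoted later in this PR; the rest is an assumption about the implementation):

using System.Collections.Generic;
using System.Linq;

public class LatencyTracker
{
    private readonly object statsLock = new();
    private readonly List<double> latenciesMs = new();

    // Fed time-to-first-audio measurements in milliseconds.
    public void AddLatency(double ms)
    {
        lock (this.statsLock) this.latenciesMs.Add(ms);
    }

    // Cleared manually from the ConfigWindow or /tttstats for testing.
    public void Clear()
    {
        lock (this.statsLock) this.latenciesMs.Clear();
    }

    public (int Count, double AverageMs) GetStats()
    {
        lock (this.statsLock)
        {
            return (this.latenciesMs.Count,
                    this.latenciesMs.Count > 0 ? this.latenciesMs.Average() : 0);
        }
    }
}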

bool includeSpeakAttributes = true)

{
text = System.Security.SecurityElement.Escape(text);
Collaborator Author

This can be ignored as it was already added as a bugfix.

Comment on lines +262 to +287
"Microsoft.ML.OnnxRuntime.Gpu": {
"type": "Transitive",
"resolved": "1.23.2",
"contentHash": "4GNQUc6FHiWHvp95Yhu95SUDa6HVm+RSQxm7QCH3PIlderDhTPdU98fHHKXmLy4xIQikkEraMcGe+KXEQU5tew==",
"dependencies": {
"Microsoft.ML.OnnxRuntime.Gpu.Linux": "1.23.2",
"Microsoft.ML.OnnxRuntime.Gpu.Windows": "1.23.2",
"Microsoft.ML.OnnxRuntime.Managed": "1.23.2"
}
},
"Microsoft.ML.OnnxRuntime.Gpu.Linux": {
"type": "Transitive",
"resolved": "1.23.2",
"contentHash": "bcv2zpP8GNnfdUCkOjE9lzIoslAOCuY0T9QHpI5+Qm6qUcehRPtGC8wF4nvySwyfTe0g3rVINP3SSj1zinkE7Q==",
"dependencies": {
"Microsoft.ML.OnnxRuntime.Managed": "1.23.2"
}
},
"Microsoft.ML.OnnxRuntime.Gpu.Windows": {
"type": "Transitive",
"resolved": "1.23.2",
"contentHash": "qOU3DVcxq4XalFV3wlrNrdatYWufIqvg8FZqVC3LS2rFPoTfl++xpMC2nnaxB2Wc5jrpDb2izrcDsQatCyjVnA==",
"dependencies": {
"Microsoft.ML.OnnxRuntime.Managed": "1.23.2"
}
},
Collaborator Author

I thought I had removed these before pushing. I was making attempts to utilize CUDA for synth processing, but the entire suite was proving too bulky/cumbersome for only marginal gains.

Comment on lines +79 to +82
long methodStart = Stopwatch.GetTimestamp();
_ttsCts?.Cancel();
_ttsCts = new CancellationTokenSource();
var token = _ttsCts.Token;
Collaborator Author

Cancellation tokens were added to each backend's Say method to allow cancellation at all three stages (a sketch follows the list):

  1. Before synthesis starts (cancellation token)
  2. During synthesis / playback (cancellation token and StopHardware)
  3. After synthesis / during playback (StopHardware)
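
A minimal sketch of the three-stage pattern; the _ttsCts/token handling mirrors the diff above, while SynthesizeToQueueAsync and StopHardware are hypothetical stand-ins:

using System.Threading;
using System.Threading.Tasks;

public class ExampleBackend
{
    private CancellationTokenSource? ttsCts;

    public async Task Say(string text)
    {
        // Stage 1: cancel any in-flight request before starting a new one.
        this.ttsCts?.Cancel();
        this.ttsCts = new CancellationTokenSource();
        var token = this.ttsCts.Token;

        // Stage 2: the token is observed throughout synthesis/streaming.
        await this.SynthesizeToQueueAsync(text, token);
    }

    public void CancelSay()
    {
        this.ttsCts?.Cancel(); // stops pending/in-progress synthesis
        this.StopHardware();   // stage 3: stops playback already in progress
    }

    // Hypothetical stand-ins for the real synthesis and playback code.
    private Task SynthesizeToQueueAsync(string text, CancellationToken token) => Task.CompletedTask;
    private void StopHardware() { }
}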

Comment on lines -103 to -125
private static void HandleResult(SpeechSynthesisResult res)
{
    if (res.Reason == ResultReason.Canceled)
    {
        var cancellation = SpeechSynthesisCancellationDetails.FromResult(res);
        if (cancellation.Reason == CancellationReason.Error)
        {
            DetailedLog.Error($"Azure request error: ({cancellation.ErrorCode}) \"{cancellation.ErrorDetails}\"");
        }
        else
        {
            DetailedLog.Warn($"Azure request failed in state \"{cancellation.Reason}\"");
        }

        return;
    }

    if (res.Reason != ResultReason.SynthesizingAudioCompleted)
    {
        DetailedLog.Warn($"Speech synthesis request completed in incomplete state \"{res.Reason}\"");
    }
}

Collaborator Author

Given the new synth method, it didn't seem this was necessary any more. I could be wrong, but wanted to make a note of it here so it gets attention.

Comment on lines +147 to +180
public static bool ImGuiStylesCombo(string label, string previewText, SortedSet<int> selectedIndices, List<string> styles)
{
    // Use the passed-in string, or a placeholder if it's empty
    string displayValue = !string.IsNullOrEmpty(previewText) ? previewText : "None selected";

    bool didChange = false;

    // The second parameter of BeginCombo controls what is shown in the closed box
    if (ImGui.BeginCombo(label, displayValue))
    {
        for (int i = 0; i < styles.Count; i++)
        {
            bool isSelected = selectedIndices.Contains(i);

            // Use Selectable with DontClosePopups for multi-select
            if (ImGui.Selectable(styles[i], isSelected, ImGuiSelectableFlags.DontClosePopups))
            {
                if (!isSelected)
                    selectedIndices.Add(i);
                else
                    selectedIndices.Remove(i);

                didChange = true;
            }

            if (isSelected)
                ImGui.SetItemDefaultFocus();
        }

        ImGui.EndCombo();
    }

    return didChange;
}
Collaborator Author

For the backends with custom styles enabled (ElevenLabs and OpenAI), the user can now select multiple styles from whatever they have configured. The backends do a decent job (sometimes) of mixing them as intended (example: Accent: Irish and Tone: Shouting).

Comment on lines 105 to 121
var validSampleRates = new[] { "8000", "16000", "22050", "24000" };
var sampleRate = currentVoicePreset.SampleRate.ToString();
var sampleRateIndex = Array.IndexOf(validSampleRates, sampleRate);
if (ImGui.Combo($"Sample rate##{MemoizedId.Create()}", ref sampleRateIndex, validSampleRates,
        validSampleRates.Length))
{
    currentVoicePreset.SampleRate = int.Parse(validSampleRates[sampleRateIndex]);
    this.config.Save();
}

var pitch = currentVoicePreset.Pitch ?? 0;
if (ImGui.SliderFloat($"Pitch##{MemoizedId.Create()}", ref pitch, -10f, 10f, "%.2fx"))
{
    currentVoicePreset.Pitch = pitch;
    config.Save();
}

Collaborator Author

Moving GoogleCloud to its streaming capability made it necessary to remove these sliders: only Chirp and HD-3 voices are capable of streaming their output, and those voices cannot have their sample rates or pitches changed in the request. Removing the sliders avoids unintended Bad Request responses from the GoogleCloud SDK.

Comment on lines 14 to 9
// 0.25 - 4.0 (default 1.0)
// 0.25 - 2.0 (default 1.0)
Collaborator Author

Adjusted to stay within GoogleCloud Chirp voice maximums.

Comment on lines +79 to +103
if (nextItem.Aborted) break;

var samples = model.Infer(chunk, nextItem.Voice.Features, nextItem.Speed);
byte[] bytes = KokoroPlayback.GetBytes(samples);

// POST-INFERENCE ABORT CHECK: Prevent enqueuing "zombie" audio
if (nextItem.Aborted) break;

lock (this.soundLock)
{
    if (this.bufferedProvider != null && this.soundOut != null)
    {
        this.bufferedProvider.AddSamples(bytes, 0, bytes.Length);
        if (this.soundOut.PlaybackState != PlaybackState.Playing)
        {
            if (nextItem.StartTime.HasValue)
            {
                var elapsed = Stopwatch.GetElapsedTime(nextItem.StartTime.Value);
                this.latencyTracker.AddLatency(elapsed.TotalMilliseconds);
                Log.Debug("Total Latency (Say -> Play): {Ms}", elapsed.TotalMilliseconds);
            }
            this.soundOut.Play();
        }
    }
}
Collaborator Author

Kokoro still has its own SoundQueue. This probably could be joined into the StreamingSoundQueue class via the EnqueueSound method.

Comment on lines +190 to +205
foreach (var styleName in config.CustomVoiceStyles)
{
    bool isSelected = currentVoicePreset.Styles.Contains(styleName);

    if (ImGui.Selectable(styleName, isSelected, ImGuiSelectableFlags.DontClosePopups))
    {
        if (isSelected)
            currentVoicePreset.Styles.Remove(styleName);
        else
            currentVoicePreset.Styles.Add(styleName);

        currentVoicePreset.SyncStringFromSet();
        this.config.Save();
    }
}
ImGui.EndCombo();
Collaborator Author

OpenAI voice styles now allow the user to select multiple styles from the list. They are concatenated together as part of the voice preset and sent with every TTS request. This is very similar to the already-existing instructions field for OpenAI, but it ties in to the new VoiceStyles config window.
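
The quoted code calls SyncStringFromSet, whose implementation isn't shown in this diff; a plausible sketch of how the selected styles could be concatenated into the request string (the details are assumptions):

using System.Collections.Generic;

public class ExampleVoicePreset
{
    public SortedSet<string> Styles { get; } = new();

    // Concatenated form sent with each TTS request,
    // e.g. "Accent: Irish, Tone: Shouting".
    public string StylesString { get; private set; } = string.Empty;

    public void SyncStringFromSet()
    {
        this.StylesString = string.Join(", ", this.Styles);
    }
}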

Collaborator Author

Entirely new backend, built to function similarly to Kokoro. Upon selection, the engine is downloaded/installed within the TextToTalk directory and a single voice is downloaded with it. The user then has the option to download/delete additional voices at their discretion. A UI element for voice directory size was added for user visibility, as the total suite CAN take up to 9GB of drive storage.
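
A minimal sketch of how the voice directory size readout could be computed (names and structure here are illustrative, not the PR's actual code):

using System.IO;
using System.Linq;

public static class VoiceDirectoryInfo
{
    // Sum the sizes of all files under the voice directory, in bytes.
    public static long GetSizeBytes(string voiceDirectory)
    {
        if (!Directory.Exists(voiceDirectory)) return 0;
        return new DirectoryInfo(voiceDirectory)
            .EnumerateFiles("*", SearchOption.AllDirectories)
            .Sum(f => f.Length);
    }

    // Human-readable form for the ConfigWindow, e.g. "8.73 GB".
    public static string Format(long bytes)
    {
        string[] units = { "B", "KB", "MB", "GB" };
        double size = bytes;
        var unit = 0;
        while (size >= 1024 && unit < units.Length - 1)
        {
            size /= 1024;
            unit++;
        }
        return $"{size:0.##} {units[unit]}";
    }
}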

Comment on lines 8 to 13
Azure,
System,
Uberduck,
Piper,
PiperLow,
PiperHigh,
Collaborator Author

Additional StreamFormats added to ensure proper routing in StreamingSoundQueue class.

Log.Information("Playing");
this.soundOut = new DirectSoundOut(playbackDeviceId);
this.soundOut.PlaybackStopped += (_, _) => { this.speechCompleted.Set(); };
this.soundOut.Init(volumeSampleProvider);
Collaborator Author

These changes were just to log initial timestamps for audio playback. This class has otherwise been left unchanged; however, I believe this build leaves zero references to it.

Owner

IMO we should just delete this class if it's unused.

Collaborator Author · Jan 22, 2026

New class built to handle streaming requests. Network and seekable streams have their logic split, as does compressed versus uncompressed stream audio, which varies from backend to backend.

Ultimately the LatencyTracker logic ends up here to keep track of timestamp diffs, so we can empirically see whether this whole effort has resulted in better performance.

Audio device selection capability is preserved as well.
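
A rough sketch of the branching and the latency measurement described above, with assumed names (the real class handles far more; LatencyTracker refers to the earlier sketch):

using System.Diagnostics;
using System.IO;

public record StreamingSoundQueueItemSketch(Stream Data, long? StartTime);

public class StreamingSoundQueueSketch
{
    private readonly LatencyTracker latencyTracker = new();

    public void HandleItem(StreamingSoundQueueItemSketch item)
    {
        // Network streams (e.g. HTTP response bodies) can't seek and must be
        // decoded incrementally; seekable streams can be read directly.
        if (item.Data.CanSeek) this.PlaySeekable(item);
        else this.PlayNetworkStream(item);
    }

    // Called when audio actually starts playing; StartTime is a
    // Stopwatch.GetTimestamp() captured when the Say request began.
    private void RecordLatency(StreamingSoundQueueItemSketch item)
    {
        if (item.StartTime is { } start)
        {
            var elapsed = Stopwatch.GetElapsedTime(start); // .NET 7+
            this.latencyTracker.AddLatency(elapsed.TotalMilliseconds);
        }
    }

    private void PlaySeekable(StreamingSoundQueueItemSketch item) { /* direct decode */ }
    private void PlayNetworkStream(StreamingSoundQueueItemSketch item) { /* incremental decode */ }
}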

Collaborator Author

Because Microsoft Speech Synthesis is so rigid about its audio playback, System now has its own sound queue again. Though it does utilize streaming, a custom bridge had to be built to allow the bytes to stream from the SpeakAsync request to the playback device. Because of this, it made sense to keep that custom code contained here.
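
A common way to build such a bridge is a write-only Stream that forwards synthesized PCM into an NAudio BufferedWaveProvider, so playback can begin before synthesis finishes; a hedged sketch that may not match the PR's exact bridge:

using System;
using System.IO;
using NAudio.Wave;

public class WaveProviderBridgeStream : Stream
{
    private readonly BufferedWaveProvider provider;

    public WaveProviderBridgeStream(BufferedWaveProvider provider)
    {
        this.provider = provider;
    }

    // The synthesizer writes PCM here (e.g. via SetOutputToAudioStream),
    // and the bytes go straight into the playback buffer.
    public override void Write(byte[] buffer, int offset, int count)
        => this.provider.AddSamples(buffer, offset, count);

    public override bool CanRead => false;
    public override bool CanSeek => false;
    public override bool CanWrite => true;
    public override long Length => throw new NotSupportedException();
    public override long Position { get => 0; set => throw new NotSupportedException(); }

    public override void Flush() { }
    public override int Read(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
}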

Collaborator Author

Uberduck needed a major rework, as this backend hadn't functioned at all since they updated to a versioned API.

Even with the streaming capabilities, their API still locks any audio from being downloaded until the entire synthesis is complete, so latency is still kind of bad with this one.

Comment on lines 10 to 16
Gathering = 67,
FCAnnouncement = 69,
FCLogin = 70,
RetainerSale = 71,
//RetainerSale = 71,
PartyFinderState = 72,
ActionUsedOnYou = 2091,
FailedActionUsedOnYou = 2218,
Collaborator Author

This can be ignored, as it was part of the last bugfix. I just forgot to sync up.

Comment on lines 127 to +137
if (!this.filters.OnlyMessagesFromYou(speaker?.Name.TextValue ?? sender.TextValue)) return;

if (!this.filters.ShouldSayFromYou(speaker?.Name.TextValue ?? sender.TextValue)) return;

OnTextEmit.Invoke(new ChatTextEmitEvent(
GetCleanSpeakerName(speaker, sender),
textValue,
speaker,
type));

else if (type == XivChatType.TellOutgoing && config.SkipMessagesFromYou == true) return;

OnTextEmit.Invoke(new ChatTextEmitEvent(
GetCleanSpeakerName(speaker, sender),
textValue,
speaker,
type));
Collaborator Author

This can also be ignored, as it was part of the last bugfix regarding outgoing tells.

{
this.synthesizer?.Dispose();
this.soundQueue?.Dispose();
this.soundQueue?.Dispose();
Owner

nit: duplicate Dispose() call

Collaborator Author

Fixed.

apiKey = (credentials.Password);
}
//RawStreamingSoundQueue = new RawStreamingSoundQueue(config);
OpenAi = new OpenAiClient(SoundQueue, apiKey);
Owner

What happens if the credentials are incorrect here (if nothing happens that's fine)?

Collaborator Author

Because the API key is checked in the UI when it's entered, an invalid key won't get saved.

I tried revoking my OpenAI key after having it saved in the CredentialManager and sending a speech request. It simply returned a 401 Unauthorized error, with no crashes.


Comment on lines -7 to -10
public int? SampleRate { get; set; }

// -20.0 - 20.0 is theoretical max, but it's lowered to work better with sliders (default 0.0)
public float? Pitch { get; set; }
Owner

Are these supposed to be removed?

Collaborator Author · Jan 23, 2026

Yes for Pitch, because a key to utilizing streaming with Google Cloud is to only call the voices/voice types that are enabled for streaming. These voices are more lifelike and do not have toggles for Pitch.

Sample Rate was supposed to be a temporary omission until I got the playback working. I've added it back in for the next commit.

foreach (var voice in response.Voices)
{
fetchedVoices.Add(voice.Name, new
if (voice.Name.Contains("Chirp3") || voice.Name.Contains("Chirp-HD")) // Focusing on Chirp 3 and Chirp HD voices as these are the only ones enabled for streaming. From what I can tell, this actually reduces duplicates of the same voice under different formats.
Owner

How does this behave when someone already had a different engine in use before this?

Collaborator Author

Hmm.

---> Grpc.Core.RpcException: Status(StatusCode="InvalidArgument", Detail="Currently, only Chirp 3: HD voices are supported for streaming synthesis.")

This response comes back from the API. What do you think should be done here? Force onto a new voice? Migrate the presets? Or keep the old non-streaming enabled voices and fork the audio processing?

Collaborator Author

Turns out it wasn't too hard to branch the voice processing based on a boolean that's determined by the voice name. This way all voices can be kept in the backend; HD voices will now leverage bi-directional streaming.
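
A sketch of that branch, based on the name check visible in the diff above (the method names are hypothetical stand-ins):

using System.Threading;
using System.Threading.Tasks;

public record VoiceInfo(string Name);

public class GoogleCloudBranchSketch
{
    // Only Chirp 3 / Chirp HD voices support streaming synthesis, so pick the
    // request path based on the voice name.
    public Task SynthesizeAsync(VoiceInfo voice, string text, CancellationToken token)
    {
        bool supportsStreaming =
            voice.Name.Contains("Chirp3") || voice.Name.Contains("Chirp-HD");

        return supportsStreaming
            ? this.SynthesizeStreamingAsync(voice, text, token) // bi-directional streaming
            : this.SynthesizeBatchAsync(voice, text, token);    // full payload, then play
    }

    // Hypothetical stand-ins for the two request paths.
    private Task SynthesizeStreamingAsync(VoiceInfo v, string t, CancellationToken c) => Task.CompletedTask;
    private Task SynthesizeBatchAsync(VoiceInfo v, string t, CancellationToken c) => Task.CompletedTask;
}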

private float[] dataArray = Array.Empty<float>();
private DateTime lastUpdateTime = DateTime.MinValue;
private readonly object updateLock = new();
public static StatsWindow? Instance { get; private set; }
Owner

We can retrieve the window instance from WindowSystem, so we shouldn't need a static reference here.
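
For reference, a sketch of looking the window up through Dalamud's WindowSystem rather than a static reference (assuming the plugin already holds a WindowSystem instance):

using System.Linq;
using Dalamud.Interface.Windowing;

public static class WindowLookup
{
    // Find the stats window among the registered windows instead of using
    // a static StatsWindow.Instance reference.
    public static StatsWindow? FindStatsWindow(WindowSystem windowSystem)
        => windowSystem.Windows.OfType<StatsWindow>().FirstOrDefault();
}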

Collaborator Author

This was my way of shortcutting the need to pass the window instance into the Configuration Window. It's the same way I configured the Voice Styles window so it could be opened independently of the Config Window.

}

// 2. Branch logic based on format (Encoded vs Raw)
if (nextItem.Format == StreamFormat.Mp3 || nextItem.Format == StreamFormat.Uberduck)
Owner

This appears to be the only place where we use StreamFormat.Uberduck, so we can just remove it and use StreamFormat.Mp3 instead.

Collaborator Author

Change made for next commit


private LatencyTracker latencyTracker = latencyTracker;

public void EnqueueSound(Stream data, TextSource source, float volume, StreamFormat format, HttpResponseMessage? response, long? timeStamp)
Owner

Maybe we should add a waveFormat parameter to this method, and to StreamingSoundQueueItem? This class shouldn't need to know the audio formats of each backend beyond the actual encoding. That way we can get rid of all of the provider-specific StreamFormat values, I think.

Collaborator Author

In the next commit I will have cleaned up the extraneous formats and streamlined the raw PCM streams into one of four wave formats.

All are 16-bit mono:

  8 kHz
  16 kHz
  22.05 kHz
  24 kHz

Any backends that leverage Mp3 have their sample rates read directly from the frames and do not require the rate to be given; Polly is an example.
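
A sketch of that normalization with NAudio (the set of rates mirrors the list above; the helper itself is an assumption):

using System;
using NAudio.Wave;

public static class RawPcmFormats
{
    // All raw PCM streams are normalized to 16-bit mono at one of four rates.
    public static WaveFormat ForSampleRate(int sampleRate) => sampleRate switch
    {
        8000 or 16000 or 22050 or 24000 => new WaveFormat(sampleRate, 16, 1),
        _ => throw new ArgumentOutOfRangeException(nameof(sampleRate)),
    };
}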

{
frame = Mp3Frame.LoadFromStream(readFullyStream);
}
catch (Exception) // Catching interruptions here
Owner

Can we catch a more specific class here?

Collaborator Author · Jan 23, 2026

I've actually removed this try block altogether. It was originally put in when I was figuring out how to gracefully cancel the synthesis generation while playback was occurring. Initially the cancellations generated unhandled exceptions, so this was put in to smooth out the testing process. It's no longer needed.

{
this.uiModel.SoundQueue.CancelAllSounds();

if (uiModel.OpenAi._ttsCts != null)
Owner

We should add a Cancel() method on OpenAiClient and possibly also the UIModel instead of accessing deep internals like this. Same applies to CancelSay.

Collaborator Author

I did something similar for ElevenLabs as well. I was trying to avoid having to pass the Client class into the backend class in order to avoid a circular dependency, or having to re-order the way the backend gets initialized. But agreed, this one feels bad.

I'll need to look closer at the other backends (like GoogleCloud) and mimic their constructors. It'll be the next change I write up.

Collaborator Author

Refactored the OpenAI and ElevenLabs backends to align more closely with the other backends in terms of structure. This reduces the number of layers these methods have to reach through in order to call the cancellation tokens.
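
Per the suggestion above, a minimal sketch of exposing Cancel() on the client so callers never touch the CancellationTokenSource directly (the body is an assumption):

using System.Threading;

public class OpenAiClientSketch
{
    private CancellationTokenSource? ttsCts;

    // Public entry point for the UI; replaces reaching into _ttsCts from outside.
    public void Cancel()
    {
        this.ttsCts?.Cancel();
    }
}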
