
Streaming SoundQueue refactor. Addition of Latency Metrics. Addition of Piper backend #268

Open
mitcheb0219 wants to merge 13 commits into karashiiro:main from mitcheb0219:main

Conversation

@mitcheb0219 (Collaborator)

Extensive rework of the SoundQueues to stream synthesized audio to a buffer. This allows playback to begin while synthesis is still in progress; for some backends this has significantly reduced latency.

Updated the Uberduck backend to support their versioned API and its use of API keys.

Added a Piper backend. The user has full control over downloaded voices and can monitor the size of the voice directory in the ConfigWindow.

Added a Latency Tracker (accessible via the ConfigWindow or the /tttstats command). This shows latency for all TTS requests during the session, and stats can be manually cleared for testing. The goal is to give the user an idea of the time to first audio.
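
A sketch of what such a tracker could look like (the class name and AddLatency mirror code quoted later in this PR; the rest is an assumption about the implementation):

using System.Collections.Generic;
using System.Linq;

public class LatencyTracker
{
    private readonly object statsLock = new();
    private readonly List<double> latenciesMs = new();

    // Fed time-to-first-audio measurements in milliseconds.
    public void AddLatency(double ms)
    {
        lock (this.statsLock) this.latenciesMs.Add(ms);
    }

    // Cleared manually from the ConfigWindow or /tttstats for testing.
    public void Clear()
    {
        lock (this.statsLock) this.latenciesMs.Clear();
    }

    public (int Count, double AverageMs) GetStats()
    {
        lock (this.statsLock)
        {
            return (this.latenciesMs.Count,
                    this.latenciesMs.Count > 0 ? this.latenciesMs.Average() : 0);
        }
    }
}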

bool includeSpeakAttributes = true)

{
text = System.Security.SecurityElement.Escape(text);
Collaborator Author

This can be ignored as it was already added as a bugfix.

Comment on lines +262 to +287
"Microsoft.ML.OnnxRuntime.Gpu": {
"type": "Transitive",
"resolved": "1.23.2",
"contentHash": "4GNQUc6FHiWHvp95Yhu95SUDa6HVm+RSQxm7QCH3PIlderDhTPdU98fHHKXmLy4xIQikkEraMcGe+KXEQU5tew==",
"dependencies": {
"Microsoft.ML.OnnxRuntime.Gpu.Linux": "1.23.2",
"Microsoft.ML.OnnxRuntime.Gpu.Windows": "1.23.2",
"Microsoft.ML.OnnxRuntime.Managed": "1.23.2"
}
},
"Microsoft.ML.OnnxRuntime.Gpu.Linux": {
"type": "Transitive",
"resolved": "1.23.2",
"contentHash": "bcv2zpP8GNnfdUCkOjE9lzIoslAOCuY0T9QHpI5+Qm6qUcehRPtGC8wF4nvySwyfTe0g3rVINP3SSj1zinkE7Q==",
"dependencies": {
"Microsoft.ML.OnnxRuntime.Managed": "1.23.2"
}
},
"Microsoft.ML.OnnxRuntime.Gpu.Windows": {
"type": "Transitive",
"resolved": "1.23.2",
"contentHash": "qOU3DVcxq4XalFV3wlrNrdatYWufIqvg8FZqVC3LS2rFPoTfl++xpMC2nnaxB2Wc5jrpDb2izrcDsQatCyjVnA==",
"dependencies": {
"Microsoft.ML.OnnxRuntime.Managed": "1.23.2"
}
},
Collaborator Author

I thought I had removed these before pushing. I was making attempts to utilize CUDA for synth processing, but the entire suite was proving too bulky/cumbersome for only marginal gains.

Comment on lines +79 to +82
long methodStart = Stopwatch.GetTimestamp();
_ttsCts?.Cancel();
_ttsCts = new CancellationTokenSource();
var token = _ttsCts.Token;
Collaborator Author

Cancellation tokens were added to each backend's Say method to allow cancellation at all three stages (a sketch follows the list):

  1. Before synthesis starts (cancellation token)
  2. During synthesis / playback (cancellation token and StopHardware)
  3. After synthesis / during playback (StopHardware)
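
A minimal sketch of the three-stage pattern; the _ttsCts/token handling mirrors the diff above, while SynthesizeToQueueAsync and StopHardware are hypothetical stand-ins:

using System.Threading;
using System.Threading.Tasks;

public class ExampleBackend
{
    private CancellationTokenSource? ttsCts;

    public async Task Say(string text)
    {
        // Stage 1: cancel any in-flight request before starting a new one.
        this.ttsCts?.Cancel();
        this.ttsCts = new CancellationTokenSource();
        var token = this.ttsCts.Token;

        // Stage 2: the token is observed throughout synthesis/streaming.
        await this.SynthesizeToQueueAsync(text, token);
    }

    public void CancelSay()
    {
        this.ttsCts?.Cancel(); // stops pending/in-progress synthesis
        this.StopHardware();   // stage 3: stops playback already in progress
    }

    // Hypothetical stand-ins for the real synthesis and playback code.
    private Task SynthesizeToQueueAsync(string text, CancellationToken token) => Task.CompletedTask;
    private void StopHardware() { }
}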

Comment on lines -103 to -125
private static void HandleResult(SpeechSynthesisResult res)
{
    if (res.Reason == ResultReason.Canceled)
    {
        var cancellation = SpeechSynthesisCancellationDetails.FromResult(res);
        if (cancellation.Reason == CancellationReason.Error)
        {
            DetailedLog.Error($"Azure request error: ({cancellation.ErrorCode}) \"{cancellation.ErrorDetails}\"");
        }
        else
        {
            DetailedLog.Warn($"Azure request failed in state \"{cancellation.Reason}\"");
        }

        return;
    }

    if (res.Reason != ResultReason.SynthesizingAudioCompleted)
    {
        DetailedLog.Warn($"Speech synthesis request completed in incomplete state \"{res.Reason}\"");
    }
}

Collaborator Author

Given the new synth method, it didn't seem this was necessary any more. I could be wrong, but wanted to make a note of it here so it gets attention.

Comment on lines +147 to +180
public static bool ImGuiStylesCombo(string label, string previewText, SortedSet<int> selectedIndices, List<string> styles)
{
    // Use the passed-in string, or a placeholder if it's empty
    string displayValue = !string.IsNullOrEmpty(previewText) ? previewText : "None selected";

    bool didChange = false;

    // The second parameter of BeginCombo controls what is shown in the closed box
    if (ImGui.BeginCombo(label, displayValue))
    {
        for (int i = 0; i < styles.Count; i++)
        {
            bool isSelected = selectedIndices.Contains(i);

            // Use Selectable with DontClosePopups for multi-select
            if (ImGui.Selectable(styles[i], isSelected, ImGuiSelectableFlags.DontClosePopups))
            {
                if (!isSelected)
                    selectedIndices.Add(i);
                else
                    selectedIndices.Remove(i);

                didChange = true;
            }

            if (isSelected)
                ImGui.SetItemDefaultFocus();
        }

        ImGui.EndCombo();
    }

    return didChange;
}
Collaborator Author

For the backends with custom styles enabled (ElevenLabs and OpenAI), the user can now select multiple styles from whatever they have configured. The backends do a decent job (sometimes) of mixing them as intended (example: Accent: Irish and Tone: Shouting).

Comment on lines 105 to 121
var validSampleRates = new[] { "8000", "16000", "22050", "24000" };
var sampleRate = currentVoicePreset.SampleRate.ToString();
var sampleRateIndex = Array.IndexOf(validSampleRates, sampleRate);
if (ImGui.Combo($"Sample rate##{MemoizedId.Create()}", ref sampleRateIndex, validSampleRates,
        validSampleRates.Length))
{
    currentVoicePreset.SampleRate = int.Parse(validSampleRates[sampleRateIndex]);
    this.config.Save();
}

var pitch = currentVoicePreset.Pitch ?? 0;
if (ImGui.SliderFloat($"Pitch##{MemoizedId.Create()}", ref pitch, -10f, 10f, "%.2fx"))
{
    currentVoicePreset.Pitch = pitch;
    config.Save();
}

Collaborator Author

Moving GoogleCloud to its streaming capability made it necessary to remove these sliders: only Chirp and HD-3 voices are capable of streaming their output, and those voices cannot have their sample rates or pitches changed in the request. Removing the sliders avoids unintended Bad Request responses from the GoogleCloud SDK.

Comment on lines 14 to 9
// 0.25 - 4.0 (default 1.0)
// 0.25 - 2.0 (default 1.0)
Collaborator Author

Adjusted to stay within GoogleCloud Chirp voice maximums.

Comment on lines +79 to +103
if (nextItem.Aborted) break;

var samples = model.Infer(chunk, nextItem.Voice.Features, nextItem.Speed);
byte[] bytes = KokoroPlayback.GetBytes(samples);

// POST-INFERENCE ABORT CHECK: Prevent enqueuing "zombie" audio
if (nextItem.Aborted) break;

lock (this.soundLock)
{
    if (this.bufferedProvider != null && this.soundOut != null)
    {
        this.bufferedProvider.AddSamples(bytes, 0, bytes.Length);
        if (this.soundOut.PlaybackState != PlaybackState.Playing)
        {
            if (nextItem.StartTime.HasValue)
            {
                var elapsed = Stopwatch.GetElapsedTime(nextItem.StartTime.Value);
                this.latencyTracker.AddLatency(elapsed.TotalMilliseconds);
                Log.Debug("Total Latency (Say -> Play): {Ms}", elapsed.TotalMilliseconds);
            }
            this.soundOut.Play();
        }
    }
}
Collaborator Author

Kokoro still has its own SoundQueue. This probably could be joined into the StreamingSoundQueue class via the EnqueueSound method.

Comment on lines +190 to +205
foreach (var styleName in config.CustomVoiceStyles)
{
    bool isSelected = currentVoicePreset.Styles.Contains(styleName);

    if (ImGui.Selectable(styleName, isSelected, ImGuiSelectableFlags.DontClosePopups))
    {
        if (isSelected)
            currentVoicePreset.Styles.Remove(styleName);
        else
            currentVoicePreset.Styles.Add(styleName);

        currentVoicePreset.SyncStringFromSet();
        this.config.Save();
    }
}
ImGui.EndCombo();
Collaborator Author

OpenAI voice styles now allow the user to select multiple styles from the list. They are concatenated together as part of the voice preset and sent with every TTS request. This is very similar to the already-existing instructions field for OpenAI, but it ties in to the new VoiceStyles config window.
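
The quoted code calls SyncStringFromSet, whose implementation isn't shown in this diff; a plausible sketch of how the selected styles could be concatenated into the request string (the details are assumptions):

using System.Collections.Generic;

public class ExampleVoicePreset
{
    public SortedSet<string> Styles { get; } = new();

    // Concatenated form sent with each TTS request,
    // e.g. "Accent: Irish, Tone: Shouting".
    public string StylesString { get; private set; } = string.Empty;

    public void SyncStringFromSet()
    {
        this.StylesString = string.Join(", ", this.Styles);
    }
}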

Collaborator Author

Entirely new backend, built to function similarly to Kokoro. Upon selection, the engine is downloaded/installed within the TextToTalk directory and a single voice is downloaded with it. The user then has the option to download/delete additional voices at their discretion. A UI element for voice directory size was added for user visibility, as the total suite CAN take up to 9GB of drive storage.
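
A minimal sketch of how the voice directory size readout could be computed (names and structure here are illustrative, not the PR's actual code):

using System.IO;
using System.Linq;

public static class VoiceDirectoryInfo
{
    // Sum the sizes of all files under the voice directory, in bytes.
    public static long GetSizeBytes(string voiceDirectory)
    {
        if (!Directory.Exists(voiceDirectory)) return 0;
        return new DirectoryInfo(voiceDirectory)
            .EnumerateFiles("*", SearchOption.AllDirectories)
            .Sum(f => f.Length);
    }

    // Human-readable form for the ConfigWindow, e.g. "8.73 GB".
    public static string Format(long bytes)
    {
        string[] units = { "B", "KB", "MB", "GB" };
        double size = bytes;
        var unit = 0;
        while (size >= 1024 && unit < units.Length - 1)
        {
            size /= 1024;
            unit++;
        }
        return $"{size:0.##} {units[unit]}";
    }
}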

Comment on lines 8 to 13
Azure,
System,
Uberduck,
Piper,
PiperLow,
PiperHigh,
Collaborator Author

Additional StreamFormats added to ensure proper routing in StreamingSoundQueue class.

Log.Information("Playing");
this.soundOut = new DirectSoundOut(playbackDeviceId);
this.soundOut.PlaybackStopped += (_, _) => { this.speechCompleted.Set(); };
this.soundOut.Init(volumeSampleProvider);
Collaborator Author

These changes were just to log initial timestamps for audio playback. This class has otherwise been left unchanged; however, I believe this build leaves zero references to it.

Owner

IMO we should just delete this class if it's unused.

Collaborator Author · Jan 22, 2026

New class built to handle streaming requests. Network and seekable streams have their logic split, as does compressed versus uncompressed stream audio, which varies from backend to backend.

Ultimately the LatencyTracker logic ends up here to keep track of timestamp diffs, so we can empirically see whether this whole effort has resulted in better performance.

Audio device selection capability is preserved as well.
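
A rough sketch of the branching and the latency measurement described above, with assumed names (the real class handles far more; LatencyTracker refers to the earlier sketch):

using System.Diagnostics;
using System.IO;

public record StreamingSoundQueueItemSketch(Stream Data, long? StartTime);

public class StreamingSoundQueueSketch
{
    private readonly LatencyTracker latencyTracker = new();

    public void HandleItem(StreamingSoundQueueItemSketch item)
    {
        // Network streams (e.g. HTTP response bodies) can't seek and must be
        // decoded incrementally; seekable streams can be read directly.
        if (item.Data.CanSeek) this.PlaySeekable(item);
        else this.PlayNetworkStream(item);
    }

    // Called when audio actually starts playing; StartTime is a
    // Stopwatch.GetTimestamp() captured when the Say request began.
    private void RecordLatency(StreamingSoundQueueItemSketch item)
    {
        if (item.StartTime is { } start)
        {
            var elapsed = Stopwatch.GetElapsedTime(start); // .NET 7+
            this.latencyTracker.AddLatency(elapsed.TotalMilliseconds);
        }
    }

    private void PlaySeekable(StreamingSoundQueueItemSketch item) { /* direct decode */ }
    private void PlayNetworkStream(StreamingSoundQueueItemSketch item) { /* incremental decode */ }
}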

Collaborator Author

Because Microsoft Speech Synthesis is so rigid about its audio playback, System now has its own sound queue again. Though it does utilize streaming, a custom bridge had to be built to allow the bytes to stream from the SpeakAsync request to the playback device. Because of this, it made sense to keep that custom code contained here.
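
A common way to build such a bridge is a write-only Stream that forwards synthesized PCM into an NAudio BufferedWaveProvider, so playback can begin before synthesis finishes; a hedged sketch that may not match the PR's exact bridge:

using System;
using System.IO;
using NAudio.Wave;

public class WaveProviderBridgeStream : Stream
{
    private readonly BufferedWaveProvider provider;

    public WaveProviderBridgeStream(BufferedWaveProvider provider)
    {
        this.provider = provider;
    }

    // The synthesizer writes PCM here (e.g. via SetOutputToAudioStream),
    // and the bytes go straight into the playback buffer.
    public override void Write(byte[] buffer, int offset, int count)
        => this.provider.AddSamples(buffer, offset, count);

    public override bool CanRead => false;
    public override bool CanSeek => false;
    public override bool CanWrite => true;
    public override long Length => throw new NotSupportedException();
    public override long Position { get => 0; set => throw new NotSupportedException(); }

    public override void Flush() { }
    public override int Read(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
}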

Collaborator Author

Uberduck needed a major rework, as this backend hadn't functioned at all since they updated to a versioned API.

Even with the streaming capabilities, their API still locks any audio from being downloaded until the entire synthesis is complete, so latency is still kind of bad with this one.

Comment on lines 10 to 16
Gathering = 67,
FCAnnouncement = 69,
FCLogin = 70,
RetainerSale = 71,
//RetainerSale = 71,
PartyFinderState = 72,
ActionUsedOnYou = 2091,
FailedActionUsedOnYou = 2218,
Collaborator Author

This can be ignored, as it was part of the last bugfix. I just forgot to sync up.

Comment on lines 127 to +137
if (!this.filters.OnlyMessagesFromYou(speaker?.Name.TextValue ?? sender.TextValue)) return;

if (!this.filters.ShouldSayFromYou(speaker?.Name.TextValue ?? sender.TextValue)) return;

OnTextEmit.Invoke(new ChatTextEmitEvent(
GetCleanSpeakerName(speaker, sender),
textValue,
speaker,
type));

else if (type == XivChatType.TellOutgoing && config.SkipMessagesFromYou == true) return;

OnTextEmit.Invoke(new ChatTextEmitEvent(
GetCleanSpeakerName(speaker, sender),
textValue,
speaker,
type));
Collaborator Author

This can also be ignored, as it was part of the last bugfix regarding outgoing tells.

{
this.synthesizer?.Dispose();
this.soundQueue?.Dispose();
this.soundQueue?.Dispose();
Owner

nit: duplicate Dispose() call

Collaborator Author

Fixed.

apiKey = (credentials.Password);
}
//RawStreamingSoundQueue = new RawStreamingSoundQueue(config);
OpenAi = new OpenAiClient(SoundQueue, apiKey);
Owner

What happens if the credentials are incorrect here (if nothing happens that's fine)?

Collaborator Author

Because the API key is checked in the UI when it's entered, an invalid key won't get saved.

I tried revoking my OpenAI key after having it saved in the CredentialManager and sending a speech request. It simply returned a 401 Unauthorized error, with no crashes.


Comment on lines -7 to -10
public int? SampleRate { get; set; }

// -20.0 - 20.0 is theoretical max, but it's lowered to work better with sliders (default 0.0)
public float? Pitch { get; set; }
Owner

Are these supposed to be removed?

Collaborator Author · Jan 23, 2026

Yes for Pitch, because a key to utilizing streaming with Google Cloud is to only call the voices/voice types that are enabled for streaming. These voices are more lifelike and do not have toggles for Pitch.

Sample Rate was supposed to be a temporary omission until I got the playback working. I've added it back in for the next commit.

foreach (var voice in response.Voices)
{
fetchedVoices.Add(voice.Name, new
if (voice.Name.Contains("Chirp3") || voice.Name.Contains("Chirp-HD")) // Focusing on Chirp 3 and Chirp HD voices as these are the only ones enabled for streaming. From what I can tell, this actually reduces duplicates of the same voice under different formats.
Owner

How does this behave when someone already had a different engine in use before this?

Collaborator Author

Hmm.

---> Grpc.Core.RpcException: Status(StatusCode="InvalidArgument", Detail="Currently, only Chirp 3: HD voices are supported for streaming synthesis.")

This response comes back from the API. What do you think should be done here? Force onto a new voice? Migrate the presets? Or keep the old non-streaming enabled voices and fork the audio processing?

Collaborator Author

Turns out it wasn't too hard to branch the voice processing based on a boolean that's determined by the voice name. This way all voices can be kept in the backend; HD voices will now leverage bi-directional streaming.
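
A sketch of that branch, based on the name check visible in the diff above (the method names are hypothetical stand-ins):

using System.Threading;
using System.Threading.Tasks;

public record VoiceInfo(string Name);

public class GoogleCloudBranchSketch
{
    // Only Chirp 3 / Chirp HD voices support streaming synthesis, so pick the
    // request path based on the voice name.
    public Task SynthesizeAsync(VoiceInfo voice, string text, CancellationToken token)
    {
        bool supportsStreaming =
            voice.Name.Contains("Chirp3") || voice.Name.Contains("Chirp-HD");

        return supportsStreaming
            ? this.SynthesizeStreamingAsync(voice, text, token) // bi-directional streaming
            : this.SynthesizeBatchAsync(voice, text, token);    // full payload, then play
    }

    // Hypothetical stand-ins for the two request paths.
    private Task SynthesizeStreamingAsync(VoiceInfo v, string t, CancellationToken c) => Task.CompletedTask;
    private Task SynthesizeBatchAsync(VoiceInfo v, string t, CancellationToken c) => Task.CompletedTask;
}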

private float[] dataArray = Array.Empty<float>();
private DateTime lastUpdateTime = DateTime.MinValue;
private readonly object updateLock = new();
public static StatsWindow? Instance { get; private set; }
Owner

We can retrieve the window instance from WindowSystem, so we shouldn't need a static reference here.
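
For reference, a sketch of looking the window up through Dalamud's WindowSystem rather than a static reference (assuming the plugin already holds a WindowSystem instance):

using System.Linq;
using Dalamud.Interface.Windowing;

public static class WindowLookup
{
    // Find the stats window among the registered windows instead of using
    // a static StatsWindow.Instance reference.
    public static StatsWindow? FindStatsWindow(WindowSystem windowSystem)
        => windowSystem.Windows.OfType<StatsWindow>().FirstOrDefault();
}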

Collaborator Author

This was my way of shortcutting the need to pass the window instance into the Configuration Window. It's the same way I configured the Voice Styles window so it could be opened independently of the Config Window.

}

// 2. Branch logic based on format (Encoded vs Raw)
if (nextItem.Format == StreamFormat.Mp3 || nextItem.Format == StreamFormat.Uberduck)
Owner

This appears to be the only place where we use StreamFormat.Uberduck, so we can just remove it and use StreamFormat.Mp3 instead.

Collaborator Author

Change made for next commit


private LatencyTracker latencyTracker = latencyTracker;

public void EnqueueSound(Stream data, TextSource source, float volume, StreamFormat format, HttpResponseMessage? response, long? timeStamp)
Owner

Maybe we should add a waveFormat parameter to this method, and to StreamingSoundQueueItem? This class shouldn't need to know the audio formats of each backend beyond the actual encoding. That way we can get rid of all of the provider-specific StreamFormat values, I think.

Collaborator Author

In the next commit I will have cleaned up the extraneous formats and streamlined the raw PCM streams into one of four wave formats.

All are 16-bit mono:

  8 kHz
  16 kHz
  22.05 kHz
  24 kHz

Any backends that leverage Mp3 have their sample rates read directly from the frames and do not require the rate to be given; Polly is an example.
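
A sketch of that normalization with NAudio (the set of rates mirrors the list above; the helper itself is an assumption):

using System;
using NAudio.Wave;

public static class RawPcmFormats
{
    // All raw PCM streams are normalized to 16-bit mono at one of four rates.
    public static WaveFormat ForSampleRate(int sampleRate) => sampleRate switch
    {
        8000 or 16000 or 22050 or 24000 => new WaveFormat(sampleRate, 16, 1),
        _ => throw new ArgumentOutOfRangeException(nameof(sampleRate)),
    };
}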

{
frame = Mp3Frame.LoadFromStream(readFullyStream);
}
catch (Exception) // Catching interruptions here
Owner

Can we catch a more specific class here?

Collaborator Author · Jan 23, 2026

I've actually removed this try block altogether. It was originally put in when I was figuring out how to gracefully cancel the synthesis generation while playback was occurring. Initially the cancellations generated unhandled exceptions, so this was put in to smooth out the testing process. It's no longer needed.

{
this.uiModel.SoundQueue.CancelAllSounds();

if (uiModel.OpenAi._ttsCts != null)
Owner

We should add a Cancel() method on OpenAiClient and possibly also the UIModel instead of accessing deep internals like this. Same applies to CancelSay.

Collaborator Author

I did something similar for ElevenLabs as well. I was trying to avoid having to pass the Client class into the backend class in order to avoid a circular dependency, or having to re-order the way the backend gets initialized. But agreed, this one feels bad.

I'll need to look closer at the other backends (like GoogleCloud) and mimic their constructors. It'll be the next change I write up.

Collaborator Author

Refactored the OpenAI and ElevenLabs backends to align more closely with the other backends in terms of structure. This reduces the number of layers these methods have to reach through in order to call the cancellation tokens.
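
Per the suggestion above, a minimal sketch of exposing Cancel() on the client so callers never touch the CancellationTokenSource directly (the body is an assumption):

using System.Threading;

public class OpenAiClientSketch
{
    private CancellationTokenSource? ttsCts;

    // Public entry point for the UI; replaces reaching into _ttsCts from outside.
    public void Cancel()
    {
        this.ttsCts?.Cancel();
    }
}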
