Zero Shot RVC #1195

rasenganai · 2025-10-13T11:38:35Z

Hi team, the F5 models work really well at copying both tonality and speaker identity.
Thanks for open-sourcing this amazing repo.

I'm exploring voice cloning with RVC-style models, which convert the input speech into HuBERT embeddings and train a GAN on top of them to get the waveform back. At inference, the input speech can come from any speaker, and the trained model superimposes the target speaker's identity while keeping the semantics the same.

What do you think about releasing a better zero-shot RVC alternative that follows the same training recipe as F5 but uses wav2vec2/HuBERT features as input instead of text?
May I suggest using seamless wav2vec2-xls-r-1b to extract these features and train the model?
I think the community would really like a zero-shot RVC alternative. I don't have the compute to run this myself, but I would really like to know your thoughts.

SWivid · 2025-10-13T11:41:47Z

Maintainer

cc @Jerrister


Jerrister · 2025-10-13T13:08:21Z

Hi @rasenganai

I do have some ideas on using the F5 architecture for zero-shot VC, but I don't quite understand how F5 could help RVC. Could you explain your idea more clearly?

I guess work such as Seed-VC or vec2wav 2.0 could be relevant to your idea?


rasenganai · 2025-10-13T14:32:41Z

Author

Hi @Jerrister,
Seed-VC is really interesting. I was proposing the same thing but without any timbre change or speaker verification embedding, which I think would be even better.
I was thinking of simply replacing the text-conditioning part of the current F5 model with HuBERT features.

Current F5 takes:
(text, mel_spectrogram) -> F5 for training

For voice conversion:
original_speech -> HuBERT -> semantic_features
(semantic_features, original_audio_mel_spectrogram) -> F5 for training
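
One practical detail in the pairing above: the semantic features and the mel-spectrogram usually come at different frame rates (HuBERT-style encoders typically emit about one frame per 20 ms, while the mel frame rate depends on the hop length used). Below is a minimal sketch of stretching the semantic sequence to the mel length with linear interpolation. The function name `resample_frames` and the toy data are hypothetical, and this is only one possible alignment strategy, not something taken from the F5 code base.

```python
def resample_frames(features, target_len):
    """Linearly interpolate a [T, D] list of feature frames to target_len frames.

    Hypothetical helper: aligns semantic features to the mel-spectrogram length
    so both sequences can be fed to the model frame by frame.
    """
    if not features:
        raise ValueError("features must contain at least one frame")
    if target_len <= 0:
        return []
    src_len = len(features)
    if src_len == 1 or target_len == 1:
        # Nothing to interpolate between: repeat the first frame.
        return [list(features[0]) for _ in range(target_len)]

    scale = (src_len - 1) / (target_len - 1)
    out = []
    for t in range(target_len):
        pos = t * scale                  # fractional position in the source
        lo = int(pos)                    # left neighbour
        hi = min(lo + 1, src_len - 1)    # right neighbour, clamped at the end
        w = pos - lo                     # interpolation weight
        out.append([(1 - w) * a + w * b
                    for a, b in zip(features[lo], features[hi])])
    return out


# Toy example: 4 semantic frames (dim 2) stretched to 7 mel frames.
semantic = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
aligned = resample_frames(semantic, 7)
```

With real data, `target_len` would be the number of mel frames of the same utterance, so that `(aligned, mel)` forms one training pair.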

So we condition the model on semantic information (continuous features or k-means units) instead of text.
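
For the k-means variant, each continuous frame would be replaced by the index of its nearest centroid. A minimal sketch of that assignment step follows, assuming the centroids have already been fitted elsewhere. The names `nearest_centroid` and `quantize` and the toy centroids are hypothetical, not part of any existing library.

```python
def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to frame (squared L2 distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, centroid in enumerate(centroids):
        dist = sum((x - y) ** 2 for x, y in zip(frame, centroid))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def quantize(features, centroids):
    """Map a [T, D] feature sequence to a length-T sequence of unit ids."""
    return [nearest_centroid(frame, centroids) for frame in features]


# Toy example: 3 centroids in 2-D, 4 feature frames.
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
features = [[0.1, -0.1], [0.9, 1.2], [3.5, 4.2], [0.6, 0.6]]
units = quantize(features, centroids)
```

The unit ids keep one entry per frame, so the duration information of the source speech is preserved. The ids could then index an embedding table in place of the text embedding.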

I will try Seed-VC.
I don't understand why Seed-VC appends a speaker verification embedding, since the audio prompt mel-spectrogram is already there.
The author also showed that the reference file carries more weight for speaker similarity.

Please let me know your thoughts.


Jerrister · 2025-10-14T04:31:08Z

I get your point now.

I think the key is to make sure that your semantic features do not contain speaker information. Otherwise, synthetic parallel data is needed, like what Seed-VC did.
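
One rough way to check for speaker leakage is a simple probe: mean-pool the semantic features per utterance and see how well a nearest-centroid classifier can guess the speaker. The sketch below is only an illustrative diagnostic under that assumption. The function names and the toy data are hypothetical, and this is not how Seed-VC evaluates anything.

```python
def mean_pool(frames):
    """Average a [T, D] list of frames into a single D-dimensional vector."""
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def speaker_probe_accuracy(train, test):
    """Nearest-centroid speaker probe.

    train and test are lists of (speaker_id, frames). Accuracy close to
    1 / num_speakers suggests this probe finds little speaker information;
    accuracy well above that suggests the features leak speaker identity.
    """
    sums, counts = {}, {}
    for speaker, frames in train:
        vec = mean_pool(frames)
        if speaker not in sums:
            sums[speaker] = [0.0] * len(vec)
            counts[speaker] = 0
        sums[speaker] = [s + v for s, v in zip(sums[speaker], vec)]
        counts[speaker] += 1
    centroids = {spk: [s / counts[spk] for s in total]
                 for spk, total in sums.items()}

    correct = 0
    for speaker, frames in test:
        vec = mean_pool(frames)
        predicted = min(
            centroids,
            key=lambda spk: sum((a - b) ** 2
                                for a, b in zip(vec, centroids[spk])),
        )
        correct += int(predicted == speaker)
    return correct / len(test)


# Toy data in which the two speakers are trivially separable (strong leakage).
train = [("a", [[0.0, 0.0], [0.2, 0.2]]), ("b", [[5.0, 5.0], [5.2, 4.8]])]
test = [("a", [[0.1, 0.3]]), ("b", [[4.9, 5.1]])]
accuracy = speaker_probe_accuracy(train, test)
```

A low score from such a weak probe does not prove the features are speaker-free; a stronger classifier might still recover the speaker.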

0 replies

rasenganai · 2025-10-14T05:52:23Z

Author

Yes, that makes sense.
Timbre shifting achieves that.
Thanks for providing your insights.
