Some doubt about any to any voice conversion #6

Open
980202006 opened this issue Sep 3, 2021 · 98 comments
Labels
discussion New research topic

Comments

@980202006

Hi, thanks for this project. I have tried removing the domain information from the style encoder. This has some effect and can generate natural-sounding speech, but there are the following problems:

  1. Low similarity with the target speaker
  2. The sound quality decreased significantly

Reconstruction works better when the original audio is fed to the style encoder.

Data used:

  1. 60 speakers: 20 English speakers, 6 Chinese singers, 1 singer who sings in English and Korean, 12 English singers, and the rest are Chinese speech

Batch size: 32 (8 per GPU)

Can you provide some suggestions, on either the data or the model?
@980202006
Author

In addition, training of the mapping network is even worse.

@980202006
Author

Should I remove the mapping network if I only use reference audio?

@yl4579
Owner

yl4579 commented Sep 3, 2021

What you are asking is an open research question that nobody has an answer to at this point, but I will give you my two cents on this issue. It is only a discussion, not meant to provide any viable solutions.

In general, there are mainly two ways to do voice conversion:

  • one reconstructs the speech from speaker-independent information;
  • the other uses discriminators to make the converted speech sound like the target speaker.

The first method usually suffers from poor sound quality because it is difficult to completely disentangle speakers from speech while keeping enough information to reconstruct the speech with high quality (unless you use text labels which make it a TTS system and thus impossible to work in real-time), while the latter suffers from dissimilarity as the input speaker information is often leaked into the decoder. This paper introduces adversarial classifier loss to mitigate the second problem, so we can guarantee the converted results sound similar to the target speaker for seen input speakers and sometimes for unseen input speakers while maintaining a reasonable degree of naturalness in synthesized speech.

However, when it comes to zero-shot conversion, the trick of adversarial classifier loss is no longer applicable, because such a classifier is not even able to find patterns with fewer than a hundred speakers, let alone the thousands of speakers that are usually required to train zero-shot conversion models. In addition, if you read the original StarGAN v2 paper, you will see that the style encoder is trained only to reconstruct the image. Hence it works well for reconstruction but poorly for conversion when the disentanglement in the encoder is insufficient and when there are so many speakers that the style space becomes extremely complicated and the discriminator loses track of bad samples from the generator.

That is to say, if you want to do the zero-shot conversion, you will need to work heavily on improving the current discriminator settings. For example, build a set of discriminators each of which only works on a subset of speakers, or use speaker embeddings to help the model set the right goals for discriminations.

You can also instead disentangle the input speakers as much as possible and try to reconstruct the speech with the given style. There are several ways of disentangling the input speaker information, for example, Huang et al. 2020. Another way is to use speaker-agnostic features such as PPG and F0 to reconstruct the speech, but spoiler alert: these features are usually not good enough to synthesize natural-sounding speech.

Of course, if you can find a way to make the adversarial classifier work in the zero-shot setting while keeping the same sound quality, I believe it will deserve a publication at a top machine learning conference such as NIPS or ICML.

@yl4579 yl4579 added the discussion New research topic label Sep 3, 2021
@980202006
Author

Thanks! I will try adding a multi-band loss like the one in HiFi-GAN, and the loss from "Sequence-to-Sequence Singing Voice Synthesis with Perceptual Entropy Loss". If there is progress, I will share it with you as soon as possible.

@980202006
Author

Hello, I recently tried some solutions to achieve any-to-any voice conversion. Simply increasing the number of speakers has given the best result so far. I am now trying to use x-vector as a style encoder. Is there anything I need to pay attention to?
In addition, I want to try cross-domain conversion, such as singing, but I encountered a problem: when the F0 of the source is low and the F0 of the target is high, the F0 jitters. It is also difficult to further improve the similarity with the target speaker or singer. Are there any suggestions for improvement?

@980202006
Author

In addition, you mentioned that when there are too many speakers, the speaker discriminator has difficulty converging. Could its loss be replaced with a different loss?

@yl4579
Owner

yl4579 commented Sep 22, 2021

Sorry for the late reply. I hope you've got some good results using x-vector, though I believe it would not work better than style encoder alone because x-vector has much less information about the target speaker than the trained style encoder does.

The jittering F0 is probably caused by how the encoder processes the F0 features. They pass through only a single ResBlock, which is unlikely to remove all the input F0 information. The subsequent AdaIN blocks then have to transform these low-pitch features into high-pitch features, which is difficult and inevitably loses detailed information, hence the jitter. My suggestion is to add a few more instance normalization layers to process the F0 features, so that the features fed into the decoder hopefully contain only the pitch curve rather than the exact F0 values in Hz, which is what the model was trained for.
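
To illustrate why instance normalization helps here, below is a minimal sketch, not the model's actual F0 branch. It applies the same operation (zero mean and unit variance per utterance) directly to raw F0 contours instead of to learned feature maps, and the contour values are hypothetical. The same melody sung one octave apart ends up with practically the same normalized curve, so only the shape of the pitch contour survives.

```python
import math


def instance_normalize(contour, eps=1e-5):
    """Zero-mean, unit-variance normalization over a single utterance."""
    mean = sum(contour) / len(contour)
    var = sum((v - mean) ** 2 for v in contour) / len(contour)
    return [(v - mean) / math.sqrt(var + eps) for v in contour]


# Hypothetical voiced-frame F0 values in Hz: one melody, sung low and an octave higher.
low_f0 = [100.0, 110.0, 120.0, 110.0, 100.0]
high_f0 = [2.0 * v for v in low_f0]

low_curve = instance_normalize(low_f0)
high_curve = instance_normalize(high_f0)

# After normalization the absolute pitch in Hz is gone; only the curve shape remains.
max_gap = max(abs(a - b) for a, b in zip(low_curve, high_curve))
print(f"largest difference after normalization: {max_gap:.2e}")
```

In the real model the normalization would act on feature maps derived from F0, so this only shows the intended effect, not how much input pitch information a given stack of layers actually removes.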

The problems of low similarity with a large number of speakers are probably caused by the limited capacity of the discriminators. I do not have any good suggestions for you, but you may try something like large hypernetworks that generate weights of discriminators for each individual speaker after some shared layers to further process speaker-specific characteristics. This can also be applied to the mapping network. The basic idea is to make the discriminators powerful enough to memorize the characteristics of each speaker. Another very simple way is to have multiple discriminators, each of which only acts on a specific set of speakers. For example, discriminator 1 is trained on speakers 1 to 10, 2 is trained on 11 to 20, and so on.
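
The "very simple way" at the end of the paragraph above can be sketched as a plain index mapping. This is only an illustration under assumed conventions: speaker ids are 0-based, groups are contiguous, and the function names are made up for this example.

```python
import math


def num_discriminators(num_speakers, group_size=10):
    """How many discriminators are needed if each one covers `group_size` speakers."""
    return math.ceil(num_speakers / group_size)


def discriminator_index(speaker_id, group_size=10):
    """Map a 0-based speaker id to the 0-based index of the discriminator that judges it."""
    return speaker_id // group_size


# Speakers 1 to 10 (ids 0..9) go to discriminator 0, speakers 11 to 20 to discriminator 1, ...
for speaker_id in (0, 9, 10, 19, 20):
    print(speaker_id, "->", discriminator_index(speaker_id))
```

During training, each sample would then be routed only to the discriminator returned for its target speaker. How the per-discriminator losses are combined is left open here.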

@980202006
Author

Thank you! I will try it. If there is any progress, I will share it with you as soon as possible.

@980202006
Author

Using multiple discriminators is effective. When the model converges, the sound quality on unseen speakers is better, and the similarity to the target speaker is better than the original one.
If I use x-vector, the model can capture sound characteristics that do not appear in the training set, but the sound quality is worse, and capturing those characteristics alone does not improve the overall similarity much. With the original style encoder, many unseen sound characteristics are lost.
Can you give some optimization suggestions?
In addition, I would like to ask how to fine-tune on unseen speakers starting from the trained model, especially since the discriminator does not reserve an index for unseen speakers.

@yl4579
Owner

yl4579 commented Oct 18, 2021

I think it depends on the number of speakers you have in the training set and what the latent space of your speaker embedding looks like. Usually, people assume a multivariate Gaussian, so you may want to add an additional loss term on the latent variables from the style encoder or x-vector to enforce the underlying Gaussian distribution (an L2 norm would do the job). When you say many unseen sound characteristics are lost, what exactly do you mean by "sound characteristics"? Can you give some examples of the "lost characteristics" versus what the "characteristics" should actually be like?
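
As a rough sketch of the L2 term mentioned above: up to a constant and a factor of 1/2, the squared L2 norm of a style vector is the negative log-density of a zero-mean, unit-variance Gaussian, so penalizing it pulls the styles toward that prior. The function name and the weight below are assumptions for illustration, not values from this repository.

```python
def l2_style_penalty(styles):
    """Mean squared L2 norm over a batch of style vectors (lists of floats)."""
    return sum(sum(v * v for v in style) for style in styles) / len(styles)


# Toy batch of two 2-dimensional style vectors.
batch = [[3.0, 4.0], [0.0, 0.0]]
penalty = l2_style_penalty(batch)  # (3^2 + 4^2 + 0) / 2

# Hypothetical weight; it would have to be tuned against the other loss terms.
weight = 0.01
extra_loss = weight * penalty
print(penalty, extra_loss)
```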

Another way to test if the latent space actually encodes unseen speakers that are readily available to use by the generator is to use gradient descent to find the style that reconstructs the unseen speaker's speech. That is, after training your model, you simply fix everything and make the style vector a trainable parameter, and use the gradient descent to minimize the reconstruction loss between the input mel and output mel of unseen speakers. If the loss does not converge to a reasonable value, it means there's no style in the space the generator has learned to faithfully reconstruct unseen speakers' speech.
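
The test described above can be sketched with a toy stand-in. Everything here is an assumption for illustration: the "generator" is a fixed linear map rather than the real decoder, and squared error is used in place of a mel reconstruction loss because its gradient is easy to write by hand. The structure is the same though: the generator is frozen and only the style vector is updated.

```python
# Frozen toy "generator": a fixed linear map from a 2-dim style to a 3-bin "mel" frame.
W = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]


def generate(style):
    return [sum(w * s for w, s in zip(row, style)) for row in W]


def loss_and_grad(style, target):
    """Squared reconstruction error and its gradient with respect to the style only."""
    residual = [o - t for o, t in zip(generate(style), target)]
    loss = 0.5 * sum(r * r for r in residual)
    grad = [sum(W[i][j] * residual[i] for i in range(len(W))) for j in range(len(style))]
    return loss, grad


def fit_style(target, steps=200, lr=0.1):
    style = [0.0, 0.0]  # the only trainable parameter
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(style, target)
        history.append(loss)
        style = [s - lr * g for s, g in zip(style, grad)]
    return style, history


# Stand-in for an unseen speaker's mel frame, produced by a style the optimizer never sees.
target_mel = generate([0.5, -1.0])
style, history = fit_style(target_mel)
print(f"first loss {history[0]:.3f}, last loss {history[-1]:.3e}")
```

If the loss plateaus at a high value in the real setting, that is the signal described above: no style in the learned space reconstructs the unseen speaker.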

One easy way to fine-tune for unseen speakers is to simply remove the last projection layer that converts the 512 channels to the number of speakers. Another, more complicated way is to use a hypernetwork or weight AdaIN (see Chen et al.) so that the discriminator is speaker-independent and only style-dependent. You will need to train a style encoder for the discriminator too, though, or use a pretrained x-vector for that purpose.

@980202006
Author

https://drive.google.com/drive/folders/1lQO7ZtWN6MvyZeMFwoB2L0AjDPL_9V1p?usp=sharing
Ref_wav is the target wav. Y_out is the output of the model. 1300y_out is obtained by replacing the style encoder with trainable parameters and training them for 1300 steps with stochastic gradient descent. I found that the gradient is mainly concentrated in the InstanceNorm2d layers of the decoder. As you can hear in y_out, compared to ref_wav, some of the person's voice characteristics are lost.

@yl4579
Owner

yl4579 commented Oct 20, 2021

I think 1300y_out is very similar to Ref_wav, so the good news is that the generator is capable of reconstructing unseen speakers without any further training. Have you tried using the style obtained with gradient descent to convert other input audio? Does it work? If so, at least the model can do one-shot learning with a few iterations of gradient descent.

You're right that Y_out does not sound very similar to Ref_wav though, is this the result from X-vectors or style encoders without specific speakers? If the style obtained from gradient descent works with other input, it means the problem is not in the generator or the discriminator, but the style encoder that is unable to find a style embedding space with unseen speakers. If the style does not work with other input, it means the encoder of the generator may have been overfitted to reconstruct the input, so disentangling the input speaker information may be necessary.

@980202006
Author

1300y_out is the result with the style encoder. 1300x_vector_out is the result with x_vector. I tested the style obtained from gradient descent on another song. 0y_out_huangmeixi_with_f0 is the result from the model. 1300y_out_huangmeixi_with_f0 is the result with the style obtained from gradient descent.
If you give the wrong F0, the output will be out of tune on individual notes rather than on all of them. 0y_out_huangmeixi_error_f0 is the model output when using the F0 of the previous song for the current song. In other words, the encoder will also encode F0.
In addition, the style encoder also affects the level of environmental noise in the output.
The reconstruction loss (L1 loss) using SGD is as follows, printed once every 100 steps.
image

@yl4579
Owner

yl4579 commented Oct 21, 2021

This looks promising, so the problem probably is in the style encoder then. Can I know how many speakers you used to train the style encoder and how many discriminators were there and how you assigned these discriminators to those speakers?

By the way, I didn't see "0y_out_huangmeixi_error_f0", maybe you didn't upload it there, so I'm not sure what you meant by "In other words, the encoder will also encode f0."

It is expected that the style encoder encodes the background noise; it is actually the most obvious thing for it to encode given how the loss is set up. However, if you don't want it to encode the recording environment, you can use a contrastive loss to make it noise-robust. That is, generate a noise-degraded copy of your audio and make the style encoder encode both copies into the same style vector. This is also usually how speaker embeddings like x-vector are trained.
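
A minimal sketch of the "same style vector for clean and degraded copy" idea, under loud assumptions: the encoder below is a hypothetical stand-in (chunk energies of a waveform), the noise is white Gaussian at a chosen SNR, and only the positive-pair term is shown. A full contrastive loss would also push different speakers apart.

```python
import math
import random


def add_noise(signal, snr_db, rng):
    """Return a copy of `signal` with white Gaussian noise at roughly `snr_db` dB SNR."""
    power = sum(x * x for x in signal) / len(signal)
    noise_std = math.sqrt(power / (10 ** (snr_db / 10)))
    return [x + rng.gauss(0.0, noise_std) for x in signal]


def toy_style_encoder(signal, dim=4):
    """Hypothetical stand-in for the style encoder: mean energy of `dim` equal chunks."""
    chunk = len(signal) // dim
    return [sum(x * x for x in signal[i * chunk:(i + 1) * chunk]) / chunk
            for i in range(dim)]


def consistency_loss(clean, noisy):
    """Mean absolute difference between the two style vectors (0 when they agree)."""
    s_clean, s_noisy = toy_style_encoder(clean), toy_style_encoder(noisy)
    return sum(abs(a - b) for a, b in zip(s_clean, s_noisy)) / len(s_clean)


rng = random.Random(0)
clean = [math.sin(2 * math.pi * 5 * n / 400) for n in range(400)]
loss_mild = consistency_loss(clean, add_noise(clean, 30.0, rng))
loss_heavy = consistency_loss(clean, add_noise(clean, 0.0, rng))
print(f"30 dB SNR: {loss_mild:.4f}, 0 dB SNR: {loss_heavy:.4f}")
```

With a trainable encoder, minimizing this term over many clean/degraded pairs is what would make the style vector insensitive to the recording environment; the toy encoder here cannot learn, so it only shows how the loss is computed.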

@980202006
Author

Sorry for the late reply. A total of 117 speakers are used as the data set. There may be some noise in these data, including mouse clicks, pink noise, etc., but the noise is not loud. Twenty are English speech, and the rest are singing.
I re-uploaded “0y_out_huangmeixi_error_f0”.
Thank you!

@980202006
Author

One discriminator for every 10 speakers, so there are 12 discriminators. I haven't had time to try other speaker-to-discriminator assignments. I also did not try sharing parameters between the discriminators.

@yl4579
Owner

yl4579 commented Oct 26, 2021

I have listened to the "0y_out_huangmeixi_error_f0" you uploaded and, if I understand correctly, you think the style is somehow "overfitted" in the sense that it also encodes the F0 of the reconstruction target? I think this is not true, because a vector of size 64 can't encode a whole F0 curve. However, one training objective is that the average pitch of the reference matches the average pitch of the converted output, so the style definitely learns the average F0. It also encodes how the pitch deviates from the input F0, because the style diversification loss also tries to maximize the F0 difference between outputs generated with two different styles. Hence, the style also encodes some information about the speaking/singing style of the target, which is desirable in our case.

The discriminator settings seem fair, but how did you train the style encoder? Are you still using the unshared linear projection, or is the style encoder now independent of the input speakers? What about the mapping network? Did you remove it entirely?

@980202006
Author

Sorry for the late reply. I removed the mapping network. I use the original network with the unshared linear projection. Have you tried the improvements from StyleGAN2? According to my observation, if the sample input is fixed and optimized continuously with SGD, the gradient is mainly concentrated in the instance normalization layers. In addition, can the bCR-GAN loss be replaced by StyleGAN2 with adaptive discriminator augmentation (ADA)?

@980202006
Author

There is a problem with breathing sound modeling, is there a way to deal with it?

@yl4579
Owner

yl4579 commented Nov 9, 2021

I don't think StyleGAN2 is relevant to StarGANv2, because the main change in StyleGAN2 is that the instance normalization works without the affine component (i.e., it only normalizes and learns the standard deviation, not the mean). The same setting hurts performance in StarGANv2 because our model decodes from a latent space produced by the encoder instead of from noise. I believe StyleGAN3 is more relevant if you are willing to try implementing an alias-free generator instead.

As for ADA, I was not able to find a set of augmentations and probabilities such that no leaks occur, which is the main reason I used bCR-GAN. The augmentation doesn't matter much if you have enough data, so it doesn't really help for the VCTK-20 dataset. I put it there only for cases where some speakers have much less data than others (like only 5 mins instead of 30 mins as in VCTK). It helps with emotional conversion and noisy datasets, though.

I didn't encounter any problems with breath sounds. You can listen to the demo here and the breath can be heard clearly. My guess is that your dataset is noisy, so the breath sound was filtered out as noise by the encoder. In that case, you may want to intentionally corrupt your input with audio augmentation.

@yl4579
Owner

yl4579 commented Nov 9, 2021

Back to the style encoder problem, how do you encode unseen speakers if you have unshared components?

@980202006
Author

980202006 commented Nov 15, 2021

Sorry, there is a misunderstanding in the description here.
I converted the non-shared mapping to the shared mapping, as shown in the code below.

```python
import torch.nn as nn

# ResBlk is the residual block from this repository's model code (not reproduced here).


class StyleEncoder(nn.Module):
    def __init__(self, dim_in=48, style_dim=48, num_domains=2, max_conv_dim=384):
        super().__init__()
        blocks = []
        blocks += [nn.Conv2d(1, dim_in, 3, 1, 1)]

        repeat_num = 4
        for _ in range(repeat_num):
            dim_out = min(dim_in*2, max_conv_dim)
            blocks += [ResBlk(dim_in, dim_out, downsample="half")]
            dim_in = dim_out

        blocks += [nn.LeakyReLU(0.2)]
        blocks += [nn.Conv2d(dim_out, dim_out, 5, 1, 0)]
        blocks += [nn.AdaptiveAvgPool2d(1)]
        blocks += [nn.LeakyReLU(0.2)]
        self.shared = nn.Sequential(*blocks)

        # Original per-speaker (unshared) projections, now disabled:
        # self.unshared = nn.ModuleList()
        # for _ in range(num_domains):
        #     self.unshared += [nn.Linear(dim_out, style_dim)]
        # A single projection shared by all speakers (attribute name kept as-is).
        self.unshared = nn.Linear(dim_out, style_dim)

    def forward(self, x, y):
        # y (speaker index) is no longer used; it is kept so callers need no changes.
        h = self.shared(x)
        h = h.view(h.size(0), -1)
        # Original N-speaker encoder, now disabled:
        # for layer in self.unshared:
        #     out += [layer(h)]
        # out = torch.stack(out, dim=1)  # (batch, num_domains, style_dim)
        # idx = torch.LongTensor(range(y.size(0))).to(y.device)
        # s = out[idx, y]  # (batch, style_dim)
        s = self.unshared(h)
        return s
```

@980202006
Author

Is it possible to add a wavelet transform to the model, for example following the design of SWAGAN's generator?

@yl4579
Owner

yl4579 commented Nov 20, 2021

@980202006 It's definitely possible to add wavelet transform to the model and it could theoretically make a big difference because the high-frequency content is what makes speech clear even though the mel-spectrogram looks visually the same. However, I can't say exactly how much high-frequency content is there in mel-spectrogram because the resolution of mel specs is usually very low and what vocoders do is exactly uncover the lost high-frequency information. I think fine-tuning with hifi-gan probably would do the same thing, but you can definitely try and see if it helps.
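
For readers unfamiliar with what a wavelet transform would expose, here is a minimal one-level Haar transform on a 1-D sequence. This is only a sketch of the general idea of splitting a signal into low-frequency and high-frequency parts. Which wavelet and which dimensionality a SWAGAN-style design would use is not specified in this thread, and the input values are made up.

```python
import math

SCALE = 1.0 / math.sqrt(2.0)


def haar_forward(signal):
    """One-level Haar transform of an even-length sequence: (approximation, detail)."""
    approx = [(signal[i] + signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Exact inverse of haar_forward."""
    out = []
    for a, d in zip(approx, detail):
        out += [(a + d) * SCALE, (a - d) * SCALE]
    return out


# Hypothetical values along the frequency axis of one mel frame.
frame = [1.0, 3.0, 2.0, 2.0, 5.0, 1.0]
approx, detail = haar_forward(frame)
rebuilt = haar_inverse(approx, detail)
print("detail (high-frequency part):", detail)
```

The detail coefficients are zero wherever neighbouring values are equal, which is why a loss or generator branch that works on them focuses on fine, high-frequency structure.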

@yl4579
Owner

yl4579 commented Nov 20, 2021

Back to the style encoder problem: I think you removed the unshared linear layers (N of them, where N is the number of speakers) and replaced them with a single linear projection shared by every speaker. I have tried this approach too, but the style encoder seems to have a hard time encoding speaker characteristics and usually returns a style vector that sounds like a combination of speakers seen during training instead. However, if you use simple gradient descent to find the style that can reconstruct unseen speakers, it is usually possible to find such a style, and it preserves most of the characteristics during reconstruction, exactly like what you have presented here. In fact, in my case the style encoder sometimes even fails to find a style that reconstructs the seen speakers. My hypothesis is that the shared projection lacks the power to separate different speakers, while unshared projections force the model to learn more about the speaker characteristics.

One way to verify this is to train a linear projection for each speaker that reconstructs the given input by fixing both the self.shared part of the style encoder and the generator, and retrain everything from scratch by only training the style encoder with the original recipe (i.e. use the unshared linear projections). If my hypothesis is correct, the style encoder trained with one linear projection will be worse than the one trained with N linear projections in terms of encoding speaker characteristics, and we can proceed from there if it is correct.

@980202006
Author

In my model, I regard the style encoder as a speaker information extraction model; that is, it extracts a high-dimensional representation of the speaker from the mel instead of fitting a specific speaker vector space. I prefer to represent a speaker as a point rather than as a space, which may result in the loss of some information. This is because I found that the original style encoder has an average pooling operation, which is very similar to x-vector or d-vector.
The problem may be caused by an insufficient number of speakers on your side. I used speech and singing data, with at least 70 speakers.
I will try the effect of a single linear layer.

@980202006
Author

Thank you!

@ZhaoZeqing

@yl4579 My own F0 model seems ok, like this:
image

but I didn't add noise for augmentation when training ASR and F0 model, is data augmentation necessary?

One more question, I want to train an any-to-one VC model, do I need to use Auto-Encoder instead of StarGAN?

@980202006
Author

I found that for some speakers I haven't seen before, the voice change effect is OK, but the effect of other speakers is poor.
Is this because the vocal features of the person's speaker are not present in the dataset?
Is the formant ratio sufficient to uniquely identify a person's timbre, or is there any absolute representation of a person's timbre?

@980202006 A person's timbre is determined by a lot of things; it's not just the formants but also the energy and high-frequency harmonics (which by definition are formants, but we usually don't consider formants of higher orders). Note that AdaIN normalizes to those features of a person's voice, and the main reason it doesn't work very well for some speakers is covariate shift (i.e., the speaker is too different from the speakers seen in the training set). I don't believe handcrafting any specific features helps here; most if not all deep learning problems can be solved by enlarging the model capacity and adding more data, unfortunately.

@yl4579 thanks. I also found that the audio recorded from the mobile phone h5 has poor sound-changing effect, similar to this example; on the contrary, the sound-changing effect of dry sound is ok. Is there any solution for mobile phone channel compensation or data enhancement?
This is a dry voice conversion example.
https://drive.google.com/drive/folders/1kcl8WH8r7MLP4iGrmNyHEe682XViR2_K?usp=sharing
This is the result of a mobile phone recording.
https://drive.google.com/drive/folders/115KJUzg7wvKHHZkJI2loBZJ90Fp4pV-L

@980202006
Author

This is more likely to be a problem with your data or model, or a back-propagation problem caused by the torch statement. Since it cannot fit the data well, the model is constantly trying to increase or decrease the scale of the data.

@980202006
Author

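Using multiple discriminators is effective. When the model converges, the sound quality on unseen speakers is better, and the similarity to the target speaker is better than the original one. If I use x-vector, the model can capture sound characteristics that do not appear in the training set, but the sound quality is worse, and capturing those characteristics alone does not improve the overall similarity much. If you use the original style encoder, many unseen sound characteristics are lost. Can you give some optimization suggestions? In addition, I would like to ask how to fine-tune on unseen speakers based on the trained model, especially since the discriminator does not reserve an index for unseen speakers.

Hi, how do you apply multiple discriminators? It seems complicated, because it is tied to speaker identity.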

I am still trying to sort out the ideas here. The basic idea is to use multiple discriminators, each of which only discriminates a part of the speakers (random selection).
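
A minimal sketch of the random variant, assuming the numbers given earlier in this thread (117 speakers, 10 speakers per discriminator). The function name and the fixed seed are choices made for this example; the seed only keeps the assignment reproducible between runs.

```python
import random


def random_speaker_groups(num_speakers, group_size=10, seed=0):
    """Shuffle speaker ids and cut them into groups, one group per discriminator."""
    ids = list(range(num_speakers))
    random.Random(seed).shuffle(ids)
    return [sorted(ids[i:i + group_size]) for i in range(0, num_speakers, group_size)]


groups = random_speaker_groups(117)
print(len(groups), "discriminators; last group has", len(groups[-1]), "speakers")
```

Each speaker still belongs to exactly one group, so a lookup table from speaker id to group index is enough to route samples during training.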

@yl4579
Owner

yl4579 commented Apr 14, 2022

@980202006 The clean voice sounds very good, though fine-tuning the vocoder would improve the sound quality. You may want to use vocoders specifically designed for singing synthesis.

However, I cannot listen to the mobile phone recorded results. I don't have permission for that, can you share the folder please?

Although I can't listen to the samples, my guess is that voices recorded with mobile phones are worse in sound quality so the speakers' characteristics cannot be well captured by the model. You can either use data augmentation to corrupt the input to the style encoder for a more robust style representation or you can just do speech enhancement to make the sound quality better. This for example sounds exceptionally good: https://daps.cs.princeton.edu/projects/Su2021HiFi2/index.php

@980202006
Author

980202006 commented Apr 15, 2022

@yl4579
Thank you, is there a better way to use data augmentation? I tried the common data augmentation: adding reverberation and noise, but no good results were achieved.
I modified the folder permissions and it should be viewable now.

@980202006
Author

980202006 commented Apr 17, 2022

@yl4579 I'm missing something about the problems deep learning might have. Are there reviews that cover various issues, such as covariate shift?

@yl4579
Owner

yl4579 commented Apr 18, 2022

@980202006 Did you add reverb to the input for the style encoder? How did you do the data augmentation?
As for the deep learning problems, I believe this is less a problem for deep learning but more for machine learning. I'd suggest you take systematic machine learning classes that focus on the theories (instead of the practices).

@980202006
Author

@yl4579 Yes, I added reverb to the input data of the style encoder. Thank you.

@Kristopher-Chen

@yl4579 @980202006 I found that speech intelligibility gets worse compared to the source, especially when I test Chinese with a model trained on English. How can this be alleviated?

And @980202006 are you using a multi-language ASR for the multi-speaker training, as your datasets include Chinese, English, and singing?

@Kristopher-Chen

@980202006 I listened to your demo, and I think they are pretty good, especially the speech intelligibility is very good. How do you manage that? Could you leave an e-mail for more discussions for details in Chinese demos?

@yl4579
Owner

yl4579 commented Apr 24, 2022

@980202006 How does the result differ when your input to the style encoder is reverberated and not reverberated? Do they sound similar or quite different?

@yl4579
Owner

yl4579 commented Apr 24, 2022

@Kristopher-Chen the original model was not proposed to tackle cross-lingual voice conversion, so you may need to train an ASR model that works for both English and Chinese (e.g., using IPAs) and train a model with both English and Chinese data. The ASR training code will be made available soon, at the latest in late May.

@Kristopher-Chen

@Kristopher-Chen the original model was not proposed to tackle cross-lingual voice conversion, so you may need to train an ASR model that works for both English and Chinese (e.g., using IPAs) and train a model with both English and Chinese data. The ASR training code will be made available soon, at the latest in late May.

@yl4579 Recently, I trained a model with 100 speakers from VCTK. When evaluating, I met with some problems.

https://drive.google.com/drive/folders/1lraGNF3tGzExGnmhvXo3QDrc72uE23zg?usp=sharing

  1. Speech intelligibility degradation, as mentioned, even in seen speakers. You can find the examples from the link above.
    Also, I found the ASR loss decreased from about 0.3 to 0.1 when the speakers increased from 20 to 100.

  2. Reference for style encoder. When testing with different references with the same target speaker, the results vary significantly, and some may become unacceptable. Also, examples from the link above. Should I use the average of more sentences, otherwise how to choose a proper reference?

@980202006
Author

@980202006 How does the result differ when your input to the style encoder is reverberated and not reverberated? Do they sound similar or quite different?

If the reverberation data (style encoder) is added during training, it will alleviate the problem during unseen inference, but it will not completely solve the problem.
If the reverberation is more obvious, training on the model without adding reverberation data will result in poor results, and each word is blurred.
I can't find the old model, I'm training a new model, I expect to give an example in early May.
Thank you!

@980202006
Author

980202006 commented Apr 25, 2022

@Kristopher-Chen
fujindemi@gmail.com

Maybe you want to check whether your speaker classification discriminator has collapsed. This discriminator must be trained carefully. I guess it is better not to let its loss drop towards 0, but to keep it at a balanced value.

@MMMMichaelzhang

@Kristopher-Chen the original model was not proposed to tackle cross-lingual voice conversion, so you may need to train an ASR model that works for both English and Chinese (e.g., using IPAs) and train a model with both English and Chinese data. The ASR training code will be made available soon, at the latest in late May.

How is the ASR training code progressing now? I am very much looking forward to it.

@yl4579
Owner

yl4579 commented Jun 8, 2022

@MMMMichaelzhang It is available here: https://github.com/yl4579/AuxiliaryASR

@Charlottecuc

@MMMMichaelzhang It is available here: https://github.com/yl4579/AuxiliaryASR

Hi. How is the JDC code progressing? Thank you very much~

@yl4579
Owner

yl4579 commented Jun 14, 2022

@Charlottecuc I'm still working on it, I'll create another repo probably by this week.

@yl4579
Owner

yl4579 commented Jun 15, 2022

@Charlottecuc The training code for F0 model is available now: https://github.com/yl4579/PitchExtractor

@CrackerHax

CrackerHax commented Aug 17, 2022

I trained on a set from an audio book, over 4000 samples from a single voice and training 110 epochs. When I generate it sounds like the source audio, not the trained voice OR the style. Any idea what the problem could be? Do I just need a lot more training or what?
image

@yl4579
Owner

yl4579 commented Aug 22, 2022

@CrackerHax Your loss becomes nan, so the model is broken. This is likely caused by bad normalization because some value exceeds 65504 (the float16 maximum). See #6 (comment)
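
For reference, the largest finite value that half precision can represent is 65504. A small stdlib-only sketch to check whether a value fits, using the `"e"` (half-precision) format of Python's `struct` module; the helper name is made up for this example and is not part of the training code.

```python
import struct

FLOAT16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_float16(value):
    """True if `value` can be packed as a half-precision float without overflowing."""
    try:
        struct.pack("<e", value)
    except OverflowError:
        return False
    return True


for candidate in (1.0, FLOAT16_MAX, 70000.0):
    print(candidate, fits_in_float16(candidate))
```

Checking the range of intermediate activations or loss terms this way can help tell an overflow in mixed-precision training apart from other causes of nan.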

@CrackerHax

CrackerHax commented Aug 24, 2022

@CrackerHax Your loss becomes nan, so the model is broken. This is likely caused by bad normalization because some value exceeds 65504 (the float16 maximum). See #6 (comment)

I trained again with fp16=false and still got nan (at the same epoch as with fp16=true). The only change I made in the config file was setting a single voice (num_domains: 1).
The dataset is about 4000 samples at 24000, and I was training from scratch (no transfer).

@CrackerHax

@CrackerHax Your loss becomes nan, so the model is broken. This is likely caused by bad normalization because some value exceeds 65504 (the float16 maximum). See #6 (comment)

I did some transfer learning with 20 voices on the default model and it worked fine.

@Liujingxiu23

Liujingxiu23 commented Sep 15, 2022

@980202006 @yl4579
Your discussion is very enlightening; however, as a beginner, I can't fully understand all of it.
My subject is cross-domain singing voice conversion for only four speakers. Speakers 1 and 2 are singers with only song data, and speakers 3 and 4 are speakers with only speech data. What I want to do is only 1/2 --> 3/4, so that speakers 3 and 4 have song data. All speakers are Chinese speakers.
What should I do to improve the result?
1. Can I remove the style-encoder and map_encoder and just use a one-hot speaker embedding? Will it help?
2. Should I remove loss_f0_sty?
3. How do the current ASR model and F0 model perform on song data? Is it necessary to retrain these two models?
Do you have any other suggestions?
Thank you again.

@MMMMichaelzhang

I set num_domain=1 and I have the same problem. Have you solved it? @CrackerHax
