
Conditional Layer Normalization #2

Open

Liujingxiu23 opened this issue Mar 17, 2021 · 4 comments


@Liujingxiu23

Hi, I have followed your work for several months and am really pleasantly surprised at how quickly you keep up with new algorithms.
For AdaSpeech, have you verified that the two acoustic encoders really help training for custom speakers? How do they compare to speaker embeddings generated by a speaker encoder as used in speaker-verification tasks?
And you have not implemented the "Conditional Layer Normalization" yet, right? Are the following references suitable if I implement it myself? Or can you give any suggestions on how to do this?
https://github.com/exe1023/CBLN/blob/e395edc2d6d952497b411f81eae4aafb96749bc2/model/cbn.py
https://github.com/CyberZHG/torch-layer-normalization/blob/master/torch_layer_normalization/layer_normalization.py

@hoyden

hoyden commented Mar 17, 2021

In my opinion, the utterance-level encoder is an alternative to an external speaker encoder model. So if you can use an external speaker encoder model to extract speaker embeddings, that may work better.
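For what it's worth, a pretrained speaker-verification encoder can extract such an embedding directly from audio. A minimal sketch, assuming the Resemblyzer package (a GE2E speaker encoder) and a hypothetical wav path:

```python
from resemblyzer import VoiceEncoder, preprocess_wav

# Load the pretrained GE2E speaker-verification encoder.
encoder = VoiceEncoder()

# "speaker_sample.wav" is a hypothetical path to one enrollment recording.
wav = preprocess_wav("speaker_sample.wav")

# embed_utterance returns a 256-dim, L2-normalized numpy vector that can
# be fed to the TTS model as the speaker embedding.
embedding = encoder.embed_utterance(wav)
```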

@rishikksh20
Owner

@Liujingxiu23 https://github.com/CyberZHG/torch-layer-normalization/blob/master/torch_layer_normalization/layer_normalization.py works well. Yes, speaker embeddings generated by a speaker encoder used in speaker verification also work.
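For reference, a minimal sketch of how the linked LayerNormalization could be made conditional in the AdaSpeech sense, with the scale and bias predicted from the speaker embedding by two small linear layers (module and argument names here are illustrative, not from this repo):

```python
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    """Layer norm whose scale (gamma) and bias (beta) are predicted
    from a speaker embedding instead of being learned directly."""

    def __init__(self, hidden_dim, speaker_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale_proj = nn.Linear(speaker_dim, hidden_dim)
        self.bias_proj = nn.Linear(speaker_dim, hidden_dim)
        # Initialize so the layer starts out as vanilla LayerNorm
        # (gamma = 1, beta = 0) regardless of the speaker embedding.
        nn.init.zeros_(self.scale_proj.weight)
        nn.init.ones_(self.scale_proj.bias)
        nn.init.zeros_(self.bias_proj.weight)
        nn.init.zeros_(self.bias_proj.bias)

    def forward(self, x, speaker_emb):
        # x: (batch, time, hidden_dim); speaker_emb: (batch, speaker_dim)
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)
        gamma = self.scale_proj(speaker_emb).unsqueeze(1)  # broadcast over time
        beta = self.bias_proj(speaker_emb).unsqueeze(1)
        return gamma * x_norm + beta
```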

@Liujingxiu23
Author

@rishikksh20 Thank you for your reply. I am trying this and other similar methods to realize personalized TTS where users record audio on their mobile phones. But the results are not very good; shakiness and instability are the main problems in the synthesized wavs. I am wondering if it is a vocoder problem, since I could not find a universal deep-learning vocoder.

@MMingabc

MMingabc commented Mar 3, 2022

My experiments showed that in a multi-speaker scenario the phoneme-level mel encoder encodes too much information. As a consequence, if the phoneme-level predictor is not capable enough, performance drops a lot.
