VAE-Tacotron-2/1 and VQ-VAE: principles and implementation notes.

An implementation of VAE Tacotron speech synthesis in TensorFlow. (https://arxiv.org/abs/1812.04342)

1. https://github.com/yanggeng1995/vae_tacotron

2. All dependencies in requirements.txt are satisfied.

3. Blizzard2013 368K training results. Will vae-tacotron2 get better results?

  1. I trained the model for 368k steps on Blizzard2013, and here is the result (parallel transfer):
  2. https://drive.google.com/drive/folders/12dBWg883S1VXQ0jEzJ7Lcz1bxI3lI_2t
  3. You can hear that 118 and 119 sound good, but 120 has less prosody.
  4. These audios have metallic tones because of Griffin-Lim. I will use a vocoder such as WaveNet or WaveRNN to improve audio quality.
  5. I think that with fewer than 200k steps the model cannot learn prosody; its task within the first 200k steps is to learn a good alignment. Beyond 200k it begins to learn prosody according to the reference.
  6. 368k steps is not enough to train a good prosody model; I think more training is needed, so I will keep training.
  7. By the way, if I change the model to Tacotron 2, will it produce better results or results similar to Tacotron 1? Has anyone trained a model as good as in the paper?

4. In any case the results are not good; the best so far are those in item 3. If a ready-made VAE-Tacotron-2/1 is hard to find, it also works to find a good plain VAE implementation, pre-train it, and transfer it over. GST is already quite mature now (https://github.com/syang1993/gst-tacotron, Kyubyong/expressive_tacotron); consider doing GST first and adding the VAE afterwards, keeping the two steps separate and not rushing the sub-condition information, or use the 赤鞘巨人 structure I designed.

It suddenly occurred to me that the prosody part is not the key concern! The VAE does not have to be applied there, and the prosody structure does not even have to be included. What needs thought is how to make the speaker id transfer across different phonemes. The intuitive approach for now is an adversarial loss, with some correlation-based terms added later; I need to ask others for advice. Also think about the explanation in the Google paper for why the prosody structure exists: it is nice to have but not required, and it has nothing to do with voice cloning; my goal is not voice cloning anyway, but code-switching. A minimal sketch of the adversarial-loss idea is given below.
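A minimal sketch of that adversarial-loss idea, assuming a gradient-reversal layer plus a speaker classifier on top of the phoneme/text encoder outputs; the function names and shapes are my own illustration, not code from any of the repos above:

```python
import tensorflow as tf

@tf.custom_gradient
def gradient_reversal(x):
    """Identity in the forward pass, flipped gradient in the backward pass."""
    def grad(dy):
        return -dy
    return tf.identity(x), grad

def adversarial_speaker_loss(encoder_outputs, speaker_labels, speaker_classifier):
    """Speaker-classification loss on gradient-reversed encoder outputs.

    encoder_outputs: (batch, T_text, dim) phoneme/text encoder outputs.
    speaker_labels:  (batch,) integer speaker ids.
    speaker_classifier: e.g. tf.keras.layers.Dense(n_speakers), shared across steps.
    Minimizing this trains the classifier to recognize the speaker, while the
    reversed gradient pushes the encoder to drop speaker information.
    """
    pooled = tf.reduce_mean(gradient_reversal(encoder_outputs), axis=1)
    logits = speaker_classifier(pooled)
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=speaker_labels, logits=logits))
```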

I do not understand Kyubyong/vq-vae yet.

5. It has code for processing the BC (Blizzard Challenge) data.

VAE Tacotron-2 (https://github.com/rishikksh20/vae_tacotron2)

A TensorFlow implementation of "Learning Latent Representations for Style Control and Transfer in End-to-End Speech Synthesis".

1. In my testing, I haven't gotten good results on the style-transfer side so far.

2. The authors of the paper used 105 hours of the Blizzard Challenge 2013 dataset.

 

Start coding: first use yanggeng1995's implementation directly, then study the classic VAE, port it over (ideally with pre-training), and finally combine it with syang1993's code.
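As a reminder of which "classic VAE" pieces would be ported over, here is a minimal sketch of the reparameterization trick and the KL term added to the reconstruction loss; z, kl_weight, and the L1 reconstruction loss are my own illustrative choices, not taken from the repos:

```python
import tensorflow as tf

def reparameterize(mu, logvar):
    """Sample z ~ N(mu, sigma^2) in a differentiable way."""
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * logvar) * eps

def kl_divergence(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch."""
    return tf.reduce_mean(
        -0.5 * tf.reduce_sum(1.0 + logvar - tf.square(mu) - tf.exp(logvar), axis=-1))

def vae_loss(mel_target, mel_pred, mu, logvar, kl_weight=1e-3):
    """Total loss = mel reconstruction + kl_weight * KL term.

    kl_weight is usually annealed up from 0 to reduce posterior collapse.
    """
    recon = tf.reduce_mean(tf.abs(mel_target - mel_pred))
    return recon + kl_weight * kl_divergence(mu, logvar)
```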

Also, strictly follow the structure in the paper: For each encoder, a mel spectrogram is first passed through two convolutional layers, each containing 512 filters of shape 3 × 1. The output of these convolutional layers is then fed to a stack of two bidirectional LSTM layers with 256 cells in each direction. A mean pooling layer is used to summarize the LSTM outputs across time, followed by a linear projection layer to predict the posterior mean and log variance.
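A tf.keras sketch of that reference/posterior encoder, following the description above; n_mels and z_dim are assumptions of mine, since the paragraph does not fix them:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_posterior_encoder(n_mels=80, z_dim=32):
    mel = tf.keras.Input(shape=(None, n_mels))      # (batch, time, n_mels)
    x = mel
    # Two convolutional layers with 512 filters of shape 3x1 along time.
    for _ in range(2):
        x = layers.Conv1D(512, kernel_size=3, padding='same', activation='relu')(x)
    # A stack of two bidirectional LSTM layers, 256 cells in each direction.
    for _ in range(2):
        x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    # Mean pooling across time, then a linear projection to the posterior
    # mean and log variance.
    x = layers.GlobalAveragePooling1D()(x)
    mu = layers.Dense(z_dim, name='posterior_mean')(x)
    logvar = layers.Dense(z_dim, name='posterior_logvar')(x)
    return tf.keras.Model(mel, [mu, logvar], name='vae_reference_encoder')
```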

 

Accent and prosody both have analogous treatments. If "speaker" refers only to the voice, then carefully define what "voice" means in voice cloning. Cantonese and English are quite alike, so the path Mandarin => Cantonese => English is possible.

"Read ten thousand books, travel ten thousand miles, meet countless people, and have a wise guide point the way."

 

Rewrite Jin Yong's novels in the style of Yu Qiuyu (add), then read them aloud again; that would be very good.

When there is time, review the pattern recognition I studied as an undergraduate, and learn/brush up on signal processing.

No need for poetry collections; song lyrics will do.

 

Very important!!!

 

Use separate experts!!!

 

VAE, GAN, and the other one.

Google is very used to, and fond of, concatenating the conditioning information right before the decoder.
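A small sketch of that "concatenate before the decoder" pattern: the per-utterance latent z is broadcast across time and concatenated with the text-encoder outputs that the attention/decoder consumes (the shapes are my assumption):

```python
import tensorflow as tf

def condition_encoder_outputs(encoder_outputs, z):
    """encoder_outputs: (batch, T_text, enc_dim); z: (batch, z_dim)."""
    time_steps = tf.shape(encoder_outputs)[1]
    z_tiled = tf.tile(tf.expand_dims(z, 1), [1, time_steps, 1])  # (batch, T_text, z_dim)
    return tf.concat([encoder_outputs, z_tiled], axis=-1)        # fed to attention/decoder
```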

 

[8] W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang, "Hierarchical generative modeling for controllable speech synthesis," arXiv preprint arXiv:1810.07217, 2018.

 

 

But how do we prevent the latent from carrying too much information? Combine it with an adversarial loss!

"Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder" is another, almost identical paper.

 
