Reproducing the paper END-TO-END CODE-SWITCHED TTS WITH MIX OF MONOLINGUAL RECORDINGS: understanding the paper and the code, plus experimental results.

Could you show us the samples? By the way, you should change the mel loss function to MAE and check the alignment again.
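A quick sketch of the suggested change, assuming the loss is averaged over a predicted and a target mel spectrogram (shapes here are made up):

```python
import numpy as np

def mel_mse(pred, target):
    # L2 / MSE loss: penalizes large errors heavily, tends to oversmooth mels
    return float(np.mean((pred - target) ** 2))

def mel_mae(pred, target):
    # L1 / MAE loss: more robust to outlier frames, often gives sharper mels
    return float(np.mean(np.abs(pred - target)))

# toy mel spectrograms: (frames, mel_bins)
pred = np.zeros((4, 3))
target = np.ones((4, 3)) * 2.0
print(mel_mse(pred, target))  # 4.0
print(mel_mae(pred, target))  # 2.0
```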

These plots show that BahdanauMonotonicAttention performs better.

What are the advantages of Location Sensitive Attention?

Maybe it is better to let the network learn without any monotonic pressure. However, https://arxiv.org/abs/1803.09047 claims to use GMM attention on Tacotron and obtain better results, especially for longer sequences.


Do you have a change related to guided attention?

I am thinking of using phone duration information to generate the guided attention targets for training. Right, the durations only provide a "reference value", so the network should not trust them completely. Design the network accordingly.
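A minimal numpy sketch of the usual guided-attention penalty (the soft diagonal mask from DC-TTS). The duration-based variant above would replace the linear n/N position with cumulative phone durations; that part is left as an idea here:

```python
import numpy as np

def guided_attention_weights(n_text, n_frames, g=0.2):
    # Soft diagonal penalty matrix: W[n, t] = 1 - exp(-((n/N - t/T)^2) / (2 g^2)).
    # Near-diagonal (roughly linear-in-time) alignments are penalized little;
    # attention far from the diagonal is penalized with weight close to 1.
    n = np.arange(n_text)[:, None] / n_text
    t = np.arange(n_frames)[None, :] / n_frames
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g * g))

def guided_attention_loss(alignment, weights):
    # alignment: (n_text, n_frames) attention matrix from the decoder
    return float(np.mean(alignment * weights))

W = guided_attention_weights(50, 200)
# the penalty vanishes on the diagonal and saturates far away from it
print(W[25, 100], W[0, 199])
```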


Can you provide the code for GMM attention? I cannot find a working version anywhere that gives good alignments.

I don't have it anymore either; I ditched it completely. You can pick it out of the "voice loop" repo.
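Since a working version is hard to find, here is a hedged numpy sketch of Graves-style GMM attention, the mechanism such repos implement; the parameter shapes and the use of exp (rather than softplus) for the transforms are assumptions:

```python
import numpy as np

def gmm_attention_step(params, kappa_prev, enc_len):
    # One decoder step of Graves (2013) GMM attention.
    # params: (K, 3) unnormalized [omega_hat, beta_hat, kappa_hat] per mixture,
    # normally produced by a small projection of the decoder state.
    omega = np.exp(params[:, 0])               # mixture weights
    beta = np.exp(params[:, 1])                # inverse widths
    kappa = kappa_prev + np.exp(params[:, 2])  # means only ever move forward
    j = np.arange(enc_len)[None, :]            # encoder positions
    # phi[j] = sum_k omega_k * exp(-beta_k * (kappa_k - j)^2)
    phi = (omega[:, None] * np.exp(-beta[:, None] * (kappa[:, None] - j) ** 2)).sum(0)
    return phi, kappa

# two decoder steps with a single mixture: kappa advances by exp(0) = 1 per step
kappa = np.zeros(1)
phi1, kappa = gmm_attention_step(np.zeros((1, 3)), kappa, enc_len=10)
phi2, kappa = gmm_attention_step(np.zeros((1, 3)), kappa, enc_len=10)
print(phi1.argmax(), phi2.argmax())  # attention peak moves forward: 1 2
```

Because kappa is a cumulative sum of positive increments, the attention window can only slide forward, which is why this mechanism tends to stay monotonic on long sequences.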

FORWARD ATTENTION IN SEQUENCE-TO-SEQUENCE ACOUSTIC MODELING FOR SPEECH SYNTHESIS

https://github.com/geneing/WaveRNN-Pytorch   Fast WaveRNN

https://github.com/mozilla/TTS/blob/master/notebooks/Benchmark.ipynb

“On-line and Linear-Time Attention by Enforcing Monotonic Alignments”

In machine learning, is there any work on adding priors to attention mechanisms, or any special initialization methods for them?

As the title says: in some problems the attention follows a fairly obvious pattern; for example, in machine translation some language pairs have nearly identical word order. In such cases, can we give the attention read/write head an appropriate prior so that the network converges faster?

Answering my own question, because today I happened to see a paper, already accepted at ICML 2017:

Online and Linear-Time Attention by Enforcing Monotonic Alignments

Search for this title and you can probably find the corresponding architecture. (1)

The gist: use a coin flip to decide whether to keep moving forward, and pick only a single encoder state as the context each time, so the attention makes exactly one left-to-right pass over the encoder.
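The sampling process just described can be sketched like this (a simplification of hard monotonic attention; stop_probs would come from the energy function in the real model):

```python
import numpy as np

def hard_monotonic_step(stop_probs, prev_pos, rng):
    # Scan forward from the previous position; at each encoder index,
    # flip a coin with probability stop_probs[i] to stop and attend there.
    for i in range(prev_pos, len(stop_probs)):
        if rng.random() < stop_probs[i]:
            return i  # context = this single encoder state
    return len(stop_probs) - 1  # never stopped: attend to the last state

rng = np.random.default_rng(0)
pos = 0
positions = []
for _ in range(5):
    pos = hard_monotonic_step(np.full(8, 0.5), pos, rng)
    positions.append(pos)
# positions never decrease: a single left-to-right pass over the encoder
print(positions)
```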

Use it as an approximation first:

Attention comes in two flavors, content-based and location-based; I think location-based attention is very close to the prior you describe.
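A tiny numpy sketch of how location-based (location-sensitive) features act as such a prior: the previous alignment is convolved with a kernel and fed into the attention energy, so the model knows where it just attended. The kernel here is a fixed average; in the real model it is learned:

```python
import numpy as np

def location_features(prev_alignment, kernel):
    # Convolve the previous alignment with a 1-D kernel; the result is added
    # as an input to the attention energy, giving the model a positional prior
    # around the positions it attended to at the last decoder step.
    return np.convolve(prev_alignment, kernel, mode="same")

prev = np.zeros(10)
prev[3] = 1.0                               # last step attended position 3
feats = location_features(prev, np.ones(3) / 3)
# nonzero only around position 3, so the energy "sees" where attention just was
print(np.round(feats, 3))
```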

Reference: http://papers.nips.cc/paper/58

Start writing code: LDE

determined by the language boundary information in the CS text. 

"performing discriminative code lookup": for the speaker id, implement an approximation first. Is there a method that allows differentiated initialization or differentiated lookup?


This design enables the generated speech in a single speaker’s voice. The language embedding and discriminative embedding are jointly learned with the model by back-propagation.  This is also a good entry point.


The discriminative embedding is obtained by performing discriminative code lookup, and is concatenated with the previous time-step decoder output and context information before being sent to the decoder RNN.  On this point, the original paper and the common reading differ; this version of the code runs the original Tacotron-2, not the Tacotron-2 as Microsoft understands it.
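A small numpy sketch of the concatenation described in that sentence; all dimensions are illustrative, and disc_table is a hypothetical stand-in for the learned code table:

```python
import numpy as np

n_codes, embed_dim = 4, 8                          # illustrative sizes
disc_table = np.random.randn(n_codes, embed_dim)   # learned code table (stand-in)

def decoder_rnn_input(prev_output, context, code_id):
    # "discriminative code lookup": index the table with the code id, then
    # concatenate with the previous decoder output and the attention context
    # before the result is fed to the decoder RNN.
    disc_embed = disc_table[code_id]
    return np.concatenate([prev_output, context, disc_embed])

# e.g. an 80-dim mel frame and a 128-dim context vector
x = decoder_rnn_input(np.zeros(80), np.zeros(128), code_id=1)
print(x.shape)  # (216,)
```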


https://www.tensorflow.org/api_docs/python/tf/nn/bidirectional_dynamic_rnn  The paper is not written clearly, so I concatenated it according to my own understanding; there was actually also a bug in the init. (2)

The difference between np.zeros() and a Python list kept causing errors:  return array(a, dtype, copy=False, order=order) ValueError: setting an array element with a sequence.
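A minimal reproduction of that error: it appears when np.array is asked to build a rectangular array from rows of unequal length. Preallocating with np.zeros and filling row by row avoids it:

```python
import numpy as np

ragged = [[1, 2, 3], [4, 5]]            # rows of unequal length
try:
    np.array(ragged, dtype=np.float32)  # triggers the ValueError above
except ValueError as e:
    print("ValueError:", e)

# fix: preallocate a padded array with np.zeros and copy each row in
out = np.zeros((2, 3), dtype=np.float32)
for i, row in enumerate(ragged):
    out[i, :len(row)] = row
print(out)
```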


It feels like a decoder step is missing!!! Don't change it for now; wait for the results, then change it. I can't make sense of it; maybe there is no bug after all. (3)

In the file Architecture_wrappers.py:


https://github.com/begeekmyfriend?tab=repositories  Dig into their work.

https://github.com/fatchord?tab=repositories  And his as well.

https://github.com/r9y9/gantts  VAE, another possible path.


Tacotron: Advanced attention module (e.g. Monotonic attention) #13

https://github.com/mozilla/TTS/issues/13

https://github.com/mozilla/TTS



Guided Attention Loss #346

https://github.com/Rayhane-mamah/Tacotron-2/issues/346


http://itjcc.com/1172/html  Cracking UltraEdit 26. Once I get a salary I will definitely pay for it.

https://blog.csdn.net/xiliuhu/article/details/5757305   Setting up multiple windows in UltraEdit.

Collect statistics on how often the attention is monotonic vs. non-monotonic when no constraint is applied, then add the monotonicity requirement. These are really two separate routes, and both can be justified; meanwhile, use the si-monotonic signal to guide the attention without overriding it.
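A possible sketch of the counting step, treating an alignment as monotonic when its per-frame argmax never moves backward (one reasonable definition among several):

```python
import numpy as np

def is_monotonic(alignment):
    # alignment: (n_frames, n_text) attention matrix; call it monotonic
    # if the attended text position never moves backward over time
    peaks = alignment.argmax(axis=1)
    return bool(np.all(np.diff(peaks) >= 0))

def monotonicity_stats(alignments):
    # count (monotonic, non-monotonic) utterances in a batch of alignments
    flags = [is_monotonic(a) for a in alignments]
    return sum(flags), len(flags) - sum(flags)

good = np.eye(4)                 # perfect diagonal alignment
bad = np.eye(4)[[0, 2, 1, 3]]    # jumps back from text position 2 to 1
print(monotonicity_stats([good, bad]))  # (1, 1)
```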


Build the training datasets and scripts based on LJSpeech-1.1 and the Databaker (标贝) corpus

1. grapheme


Whether to rescale audio prior to preprocessing: I can't figure out this parameter.

rescale = False, #Whether to rescale audio prior to preprocessing
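As I understand it, in the Rayhane-mamah Tacotron-2 preprocessing this flag peak-normalizes each waveform before feature extraction; a sketch of that interpretation:

```python
import numpy as np

def rescale_wav(wav, rescaling_max=0.999):
    # Peak-normalize the waveform so its largest absolute sample equals
    # rescaling_max; this guards against clipping and evens out level
    # differences between recordings.
    return wav / np.abs(wav).max() * rescaling_max

wav = np.array([0.1, -0.5, 0.25])
out = rescale_wav(wav)
print(np.abs(out).max())  # 0.999
```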


#M-AILABS (and other datasets) trim params
    trim_fft_size = 512,
    trim_hop_size = 128,
    trim_top_db = 60,

I don't understand these parameters either.
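For reference, a rough numpy sketch of what librosa.effects.trim does with these parameters (the real implementation differs in framing and centering details): frame the signal, measure per-frame energy in dB relative to the loudest frame, and cut leading and trailing frames quieter than trim_top_db below it.

```python
import numpy as np

def trim_silence(wav, trim_top_db=60, trim_fft_size=512, trim_hop_size=128):
    # Per-frame RMS over windows of trim_fft_size samples, hopped by
    # trim_hop_size; frames more than trim_top_db below the peak are "silence".
    n_frames = max(1, 1 + (len(wav) - trim_fft_size) // trim_hop_size)
    rms = np.array([
        np.sqrt(np.mean(wav[i * trim_hop_size : i * trim_hop_size + trim_fft_size] ** 2))
        for i in range(n_frames)
    ])
    db = 20 * np.log10(np.maximum(rms, 1e-10) / max(rms.max(), 1e-10))
    keep = np.where(db > -trim_top_db)[0]
    if len(keep) == 0:
        return wav[:0]
    start = keep[0] * trim_hop_size
    end = min(len(wav), keep[-1] * trim_hop_size + trim_fft_size)
    return wav[start:end]

# silence - tone - silence: trimming removes most of the leading/trailing silence
wav = np.concatenate([np.zeros(2048), 0.5 * np.sin(np.arange(2048)), np.zeros(2048)])
print(len(trim_silence(wav)), "<", len(wav))
```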

How to use sox:

https://blog.csdn.net/centnetHY/article/details/88571352

Batch_Size = 32 => 16, due to insufficient GPU memory.

 watch -n 10 nvidia-smi

As for SPE, the code is easy to write:

All that's left is organizing the data and the experimental results, then publishing a demo web page.
