Tacotron 2 Study Notes 3: Decoder


Author: 取什么名才不重复呢



Tags: tensorflow, deep learning

Article link: https://blog.csdn.net/weixin_42256251/article/details/105371156

The decoder consists of the following parts: prenet, attention, decoder_lstm, frame_projection, and stop_projection.

First, the prenet. As the Prenet class in the code shows, the prenet is essentially two fully connected layers, each followed by a dropout layer.
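A minimal NumPy sketch of the prenet, assuming the sizes given in the Tacotron 2 paper (two layers of 256 ReLU units, dropout rate 0.5). The paper applies this dropout at inference time as well, so the sketch never switches it off. The function name `prenet` and the random weights are illustrative only; the actual code builds these layers with TensorFlow.

```python
import numpy as np

def prenet(x, weights, biases, drop_rate=0.5, rng=None):
    """Two fully connected ReLU layers, each followed by dropout.

    Dropout is applied unconditionally (also at inference), as the
    Tacotron 2 paper describes for the prenet.
    """
    rng = np.random.default_rng() if rng is None else rng
    for w, b in zip(weights, biases):
        x = np.maximum(x @ w + b, 0.0)              # dense layer + ReLU
        keep = rng.random(x.shape) >= drop_rate     # random dropout mask
        x = x * keep / (1.0 - drop_rate)            # inverted-dropout scaling
    return x

rng = np.random.default_rng(0)
n_mels, units = 80, 256
weights = [rng.normal(0, 0.1, (n_mels, units)), rng.normal(0, 0.1, (units, units))]
biases = [np.zeros(units), np.zeros(units)]
prev_frame = rng.normal(size=(1, n_mels))           # last decoder output frame
out = prenet(prev_frame, weights, biases, rng=rng)
print(out.shape)  # (1, 256)
```

Keeping dropout on at inference is a deliberate choice in the paper: it injects noise into the autoregressive input, which the authors report helps generalization.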

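Next comes the attention part, called LocationSensitiveAttention. The code shows that it inherits from tensorflow.contrib.seq2seq.python.ops.attention_wrapper.BahdanauAttention, i.e. it is additive (Bahdanau-style) attention extended with location features computed from the cumulative alignments of previous decoder steps.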
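The location-sensitive scoring can be written out as a NumPy sketch. For decoder step i and encoder step j, the energy is e_{i,j} = v^T tanh(W s_i + V h_j + U f_{i,j} + b), where f_{i,j} comes from a 1-D convolution over the cumulative alignments, and the alignments are a softmax of the energies over j. The sizes (128-dim attention space, 32 filters of length 31) follow the Tacotron 2 paper; the parameter dict `p`, the function names, and the random weights are hypothetical.

```python
import numpy as np

def location_features(cum_align, filters):
    """1-D convolution ('same' zero padding) over the cumulative alignments.

    cum_align: (T,), filters: (n_filters, k) with k odd -> (T, n_filters)
    """
    k = filters.shape[1]
    padded = np.pad(cum_align, k // 2)
    windows = np.stack([padded[t:t + k] for t in range(cum_align.shape[0])])  # (T, k)
    return windows @ filters.T

def location_sensitive_attention(query, memory, cum_align, p):
    """Additive energies plus location features; returns c_i, alignments, new cumsum."""
    f = location_features(cum_align, p["F"])                                  # (T, n_filters)
    hidden = np.tanh(query @ p["W"] + memory @ p["V"] + f @ p["U"] + p["b"])  # (T, d_a)
    energies = hidden @ p["v"]                                                # (T,)
    weights = np.exp(energies - energies.max())
    alignments = weights / weights.sum()            # softmax over encoder steps
    context = alignments @ memory                   # weighted sum of encoder outputs
    return context, alignments, cum_align + alignments

rng = np.random.default_rng(0)
T, d_q, d_m, d_a, n_f, k = 12, 1024, 512, 128, 32, 31
p = {"W": rng.normal(0, 0.05, (d_q, d_a)), "V": rng.normal(0, 0.05, (d_m, d_a)),
     "U": rng.normal(0, 0.05, (n_f, d_a)), "F": rng.normal(0, 0.05, (n_f, k)),
     "b": np.zeros(d_a), "v": rng.normal(0, 0.05, d_a)}
memory = rng.normal(size=(T, d_m))                  # encoder outputs h_j
query = rng.normal(size=d_q)                        # decoder RNN output s_i
cum = np.zeros(T)                                   # no alignment history at step 0
context, align, cum = location_sensitive_attention(query, memory, cum, p)
print(context.shape)  # (512,)
```

Feeding the cumulative alignments into the score is what lets the attention "remember" where it has already been, which encourages it to move forward monotonically through the input text.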

decoder_lstm: this module is just a plain MultiRNNCell (a stack of RNN cells); there is nothing special about it.
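A rough NumPy picture of what the stacked cell does at each step: the output h of one layer is the input of the next, and each layer keeps its own (h, c) state. The Tacotron 2 paper uses two LSTM layers of 1024 units; this sketch uses toy sizes, and the gate order (i, f, g, o) and the names `lstm_step` / `multi_rnn_step` are assumptions of the sketch, not the TensorFlow API.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h, c, W, b):
    """One LSTM cell step. W: (d_in + d_h, 4 * d_h); gate order i, f, g, o."""
    z = np.concatenate([x, h]) @ W + b
    i, f, g, o = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)    # update cell state
    h = sigmoid(o) * np.tanh(c)                     # new hidden state / output
    return h, c

def multi_rnn_step(x, states, params):
    """Stacked cells as in MultiRNNCell: layer k's output is layer k+1's input."""
    new_states = []
    for (h, c), (W, b) in zip(states, params):
        h, c = lstm_step(x, h, c, W, b)
        new_states.append((h, c))
        x = h
    return x, new_states

rng = np.random.default_rng(0)
d_in, units = 24 + 16, 32      # toy sizes: (prenet output + previous context) -> LSTM units
params = [(rng.normal(0, 0.1, (d_in + units, 4 * units)), np.zeros(4 * units)),
          (rng.normal(0, 0.1, (2 * units, 4 * units)), np.zeros(4 * units))]
states = [(np.zeros(units), np.zeros(units)) for _ in range(2)]
s_i, states = multi_rnn_step(rng.normal(size=d_in), states, params)
print(s_i.shape)  # (32,)
```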

frame_projection: this module outputs the mel spectrogram; internally it is also a fully connected layer.
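A sketch of the frame projection, assuming, as the TacotronDecoderCell docstring states, that its input is the concatenation of the decoder RNN output s_i and the context vector c_i, and that it is a linear layer (no activation) emitting r frames of `n_mels` bins per decoder step, where r is the number of outputs per step. The sizes and the function name are illustrative.

```python
import numpy as np

def frame_projection(s, c, W, b, n_mels, r):
    """Linear layer on [s_i; c_i]; returns r mel frames of n_mels bins each."""
    x = np.concatenate([s, c])          # (d_s + d_c,)
    y = x @ W + b                       # linear projection, no activation
    return y.reshape(r, n_mels)         # one row per predicted frame

rng = np.random.default_rng(0)
d_s, d_c, n_mels, r = 1024, 512, 80, 2
W = rng.normal(0, 0.01, (d_s + d_c, n_mels * r))
b = np.zeros(n_mels * r)
frames = frame_projection(rng.normal(size=d_s), rng.normal(size=d_c), W, b, n_mels, r)
print(frames.shape)  # (2, 80)
```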

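stop_projection: this module predicts whether decoding should stop. It is a simple binary classification problem, and internally it is again a fully connected layer.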
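The stop decision can be sketched as a dense layer on [s_i; c_i] followed by a sigmoid, giving the probability that the sequence has finished; during training it would be fit like any binary classifier (cross-entropy against a 0/1 stop target). The 0.5 threshold follows the Tacotron 2 paper; stopping when *any* of the r frames crosses it, and the function names, are assumptions of this sketch.

```python
import numpy as np

def stop_projection(s, c, w, b):
    """Dense layer on [s_i; c_i], one output per frame, squashed by a sigmoid."""
    logits = np.concatenate([s, c]) @ w + b     # (r,)
    return 1.0 / (1.0 + np.exp(-logits))       # probability that decoding is finished

def should_stop(stop_probs, threshold=0.5):
    """Stop as soon as any predicted frame's stop probability exceeds the threshold."""
    return bool(np.any(stop_probs > threshold))

rng = np.random.default_rng(0)
d_s, d_c, r = 1024, 512, 2
w = rng.normal(0, 0.01, (d_s + d_c, r))
probs = stop_projection(rng.normal(size=d_s), rng.normal(size=d_c), w, np.zeros(r))
print(probs.shape, should_stop(probs))
```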

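Finally, Tacotron 2 defines the class TacotronDecoderCell, which wraps all of the modules above together. Let's look at how this class works internally, starting with its original docstring: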
Decoder Step i:

1) Prenet to compress last output information
2) Concat compressed inputs with previous context vector (input feeding) *
3) Decoder RNN (actual decoding) to predict current state s_{i} *
4) Compute new context vector c_{i} based on s_{i} and a cumulative sum of previous alignments *
5) Predict new output y_{i} using s_{i} and c_{i} (concatenated)
6) Predict <stop_token> output ys_{i} using s_{i} and c_{i} (concatenated)
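In short: the prenet output is concatenated with the context vector computed at the previous decoding step, and the result is fed into the RNN decoder. The RNN decoder's output is used to compute a new context vector. Finally, the new context vector is concatenated with the decoder output and fed to the projection layers, which predict the output frames and the stop token.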
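To tie the six steps together, here is a NumPy sketch of one decoder step plus a greedy inference loop. The sub-modules are passed in as callables, and to keep the example short they are replaced by toy stand-ins: a single tanh RNN instead of the stacked LSTMs, a simplified history-penalised attention instead of LocationSensitiveAttention, and r = 1. All names here (`decoder_step`, `toy_rnn`, `toy_attention`, the weight matrices) are hypothetical and do not mirror the actual TacotronDecoderCell API.

```python
import numpy as np

def decoder_step(prev_frame, prev_context, rnn_state, cum_align, memory, m):
    """One TacotronDecoderCell-style step; `m` maps module names to callables."""
    x = m["prenet"](prev_frame)                            # 1) compress last output
    rnn_in = np.concatenate([x, prev_context])             # 2) input feeding
    s, rnn_state = m["rnn"](rnn_in, rnn_state)             # 3) decoder RNN -> s_i
    context, align = m["attention"](s, memory, cum_align)  # 4) c_i from s_i and cumulative alignments
    proj_in = np.concatenate([s, context])                 # [s_i; c_i]
    frame = m["frame_projection"](proj_in)                 # 5) y_i
    stop = m["stop_projection"](proj_in)                   # 6) ys_i
    return frame, stop, context, rnn_state, cum_align + align

# Toy stand-ins (hypothetical, random weights) just to make the data flow runnable.
rng = np.random.default_rng(0)
T, d_m, n_mels, d_p, d_s = 10, 16, 8, 12, 20
memory = rng.normal(size=(T, d_m))                         # encoder outputs
Wp = rng.normal(0, 0.1, (n_mels, d_p))
Wr = rng.normal(0, 0.1, (d_p + d_m + d_s, d_s))
Wq = rng.normal(0, 0.1, (d_s, d_m))
Wf = rng.normal(0, 0.1, (d_s + d_m, n_mels))
ws = rng.normal(0, 0.1, d_s + d_m)

def toy_rnn(x, h):
    h = np.tanh(np.concatenate([x, h]) @ Wr)               # plain tanh RNN, not stacked LSTMs
    return h, h

def toy_attention(s, mem, cum):
    e = mem @ (s @ Wq) - cum                               # content score, penalised by attention history
    a = np.exp(e - e.max())
    a /= a.sum()
    return a @ mem, a

modules = {
    "prenet": lambda f: np.maximum(f @ Wp, 0.0),
    "rnn": toy_rnn,
    "attention": toy_attention,
    "frame_projection": lambda z: z @ Wf,
    "stop_projection": lambda z: 1.0 / (1.0 + np.exp(-(z @ ws))),
}

frame, context, state, cum = np.zeros(n_mels), np.zeros(d_m), np.zeros(d_s), np.zeros(T)
outputs = []
for _ in range(50):                                        # hard cap on decoding steps
    frame, stop, context, state, cum = decoder_step(frame, context, state, cum, memory, modules)
    outputs.append(frame)                                  # predicted frame is fed back next step
    if stop > 0.5:
        break
mel = np.stack(outputs)                                    # (n_steps, n_mels)
print(mel.shape)
```

At inference each predicted frame becomes the next prenet input, as in the loop above; during training the ground-truth previous frame is fed instead (teacher forcing).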
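At the end of the network, a postnet predicts a residual. The postnet takes the decoder output above as its input, and the resulting residual is then fed into a frame_projection layer. The final output of the network is this projected residual added to the decoder output. Note that because the decoder outputs a mel spectrogram (or a linear spectrogram), a vocoder such as WaveNet is still needed afterwards to turn it into a waveform.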
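The postnet and residual connection can be sketched as follows, assuming five 1-D convolution layers with tanh on all but the last (the paper uses 512 channels and kernel size 5, plus batch normalization, which is omitted here), then a linear frame projection back to `n_mels`, and an element-wise addition to the decoder output. Channel counts are shrunk for the demo, and all names are illustrative.

```python
import numpy as np

def conv1d_same(x, w, b):
    """x: (T, c_in), w: (k, c_in, c_out) with k odd; zero 'same' padding over time."""
    k, T = w.shape[0], x.shape[0]
    xp = np.pad(x, ((k // 2, k // 2), (0, 0)))
    return sum(xp[j:j + T] @ w[j] for j in range(k)) + b

def postnet(mel, layers):
    x = mel
    for i, (w, b) in enumerate(layers):
        x = conv1d_same(x, w, b)
        if i < len(layers) - 1:                 # tanh on every layer except the last
            x = np.tanh(x)
    return x

def refine(decoder_mel, layers, Wp, bp):
    residual = postnet(decoder_mel, layers)     # (T, channels)
    projected = residual @ Wp + bp              # frame projection back to (T, n_mels)
    return decoder_mel + projected              # final output = decoder output + residual

rng = np.random.default_rng(0)
T, n_mels, ch, k = 20, 80, 32, 5
dims = [n_mels] + [ch] * 5
layers = [(rng.normal(0, 0.05, (k, c_in, c_out)), np.zeros(c_out))
          for c_in, c_out in zip(dims[:-1], dims[1:])]
Wp, bp = rng.normal(0, 0.05, (ch, n_mels)), np.zeros(n_mels)
decoder_mel = rng.normal(size=(T, n_mels))      # stand-in for the decoder's mel output
final_mel = refine(decoder_mel, layers, Wp, bp)
print(final_mel.shape)  # (20, 80)
```

Because the postnet only adds a correction, a zero projection leaves the decoder output unchanged; the postnet learns just the refinement rather than the whole spectrogram.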
