Attention Is All You Need

DeepWWJ

Article link: https://blog.csdn.net/qq_21157073/article/details/97779598

[Figure: the network architecture, with the encoder on the left and the decoder on the right]
The network consists of two parts, the "encoder" and the "decoder":

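The "encoder" receives the data from "input_", applies an embedding lookup to it, and adds the position encoding "positional_encoding". The result then passes through 6 "num_blocks" operations, each of which contains two layers: "multihead_attention" and "positionwise_feedforward". After the 6 "num_blocks" operations the encoder yields a tensor, encoder_output. This is the left part of the figure.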
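The "decoder" receives encoder_output and the targets (the targets are the data in inputs shifted back by one position). It first embeds the targets and adds the "positional_encoding" position information. Then, like the "encoder", it runs 6 "num_block" operations. A "num_block" in the "decoder" has 3 parts: "self_attention", "vanilla_attention", and "positionwise_feedforward". "positionwise_feedforward" is again a plain feed-forward network, and "self_attention" and "vanilla_attention" are implemented in much the same way as "multihead_attention" in the "encoder". "self_attention" takes the embedded targets and applies self-attention to them. Its result is then fed, together with encoder_output, into "vanilla_attention", which uses the similarity between the targets and the encoder output to produce the final decoding result, decoder_output. This is the right half of the figure. In addition, "self_attention" in the "decoder" must not only apply a query mask and a key mask but also mask out future information.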
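For example:
The network input "input_" is a two-dimensional integer array obtained by mapping each word in a batch of sentences to its ID. Assume its shape is (10, 100), meaning that a batch holds 10 sentences and each sentence is 100 tokens long (longer sentences are truncated, shorter ones are padded).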
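The truncate-or-pad step can be sketched as follows. This is a minimal illustration, assuming NumPy, a padding ID of 0, and a hypothetical helper named pad_or_truncate; none of these come from the original post.

```python
import numpy as np

PAD_ID = 0  # assumption: ID 0 is reserved for padding


def pad_or_truncate(sentences, max_len=100, pad_id=PAD_ID):
    """Turn a list of token-ID lists into an int array of shape (batch, max_len)."""
    batch = np.full((len(sentences), max_len), pad_id, dtype=np.int32)
    for row, ids in enumerate(sentences):
        ids = ids[:max_len]            # truncate sentences that are too long
        batch[row, :len(ids)] = ids    # shorter sentences keep their trailing padding
    return batch


# Ten toy "sentences" of varying length, filled with made-up token IDs.
lengths = (3, 50, 100, 120, 7, 1, 99, 101, 64, 12)
sentences = [list(range(1, 1 + n)) for n in lengths]

input_ = pad_or_truncate(sentences)   # shape (10, 100)
key_mask = input_ != PAD_ID           # True wherever a real token is present
```

The boolean key_mask is the kind of information a key mask or query mask would be built from, so that padded positions do not take part in attention.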
Encoder

The tensor then enters the "encoder" to be encoded. The "encoder" works as follows:

"embedding_lookup" uses a (32000, 512) matrix to map the input data (10, 100) to (10, 100, 512); that is, each word ID is mapped to a vector of length 512.
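The lookup itself is just row indexing into the embedding matrix. A minimal NumPy sketch, assuming a randomly initialized table in place of the learned embeddings and random token IDs in place of real data:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 32000, 512

# Assumption: random values stand in for the learned embedding matrix.
embeddings = rng.standard_normal((vocab_size, d_model), dtype=np.float32)

# A batch of 10 sentences with 100 token IDs each (random IDs for illustration).
input_ = rng.integers(0, vocab_size, size=(10, 100))

# Integer-array indexing selects one 512-dimensional row per token ID.
enc = embeddings[input_]   # shape (10, 100, 512)
```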
Because the network uses neither a CNN nor an RNN to extract local or sequential information, "positional_encoding" has to add a position encoding to the input data.
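The post does not say which position encoding is used. The sketch below assumes the fixed sinusoidal encoding (sine on even dimensions, cosine on odd ones); the function body and the random stand-in for the embedded input are illustrative only.

```python
import numpy as np


def positional_encoding(max_len, d_model):
    """Sinusoidal position encoding of shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]    # (max_len, 1)
    dim = np.arange(d_model)[None, :]    # (1, d_model)
    # Each pair of dimensions (2i, 2i+1) shares the frequency 1 / 10000^(2i / d_model).
    angle = pos / np.power(10000.0, (2 * (dim // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])  # even dimensions
    pe[:, 1::2] = np.cos(angle[:, 1::2])  # odd dimensions
    return pe


pe = positional_encoding(100, 512)

# Stand-in for the embedded batch; broadcasting adds the same encoding to every sentence.
embedded = np.zeros((10, 100, 512))
with_position = embedded + pe[None, :, :]
```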
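Finally, the resulting tensor is fed into a dropout layer. The output of the dropout layer is still (10, 100, 512).
The tensor then goes through the same operation 6 times, i.e. "num_blocks_(0, 1, 2, 3, 4, 5)" in the figure. Each num_blocks has a two-layer structure: multi-head attention, "multihead_attention", followed by a feed-forward layer, "positionwise_feedforward". "positionwise_feedforward" is simply a plain neural network.
After the 6 "num_block" encoding steps, the encoder outputs a (10, 100, 512) tensor.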
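The six-block loop can be sketched in NumPy as below. This is a simplified illustration, not the code the post describes: the head count (8), the inner feed-forward size (2048), and the random weights are assumptions, and dropout, masking, and any residual or normalization steps are left out.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multihead_attention(queries, keys, values, w_q, w_k, w_v, num_heads):
    n, t_q, d = queries.shape
    t_k = keys.shape[1]
    d_head = d // num_heads

    def split_heads(x, t):
        # (n, t, d) -> (n, num_heads, t, d_head)
        return x.reshape(n, t, num_heads, d_head).transpose(0, 2, 1, 3)

    q = split_heads(queries @ w_q, t_q)
    k = split_heads(keys @ w_k, t_k)
    v = split_heads(values @ w_v, t_k)

    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)  # (n, heads, t_q, t_k)
    out = softmax(scores) @ v                                # (n, heads, t_q, d_head)
    return out.transpose(0, 2, 1, 3).reshape(n, t_q, d)      # merge the heads again


def positionwise_feedforward(x, w1, b1, w2, b2):
    # Two linear maps with a ReLU in between, applied to every position separately.
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2


rng = np.random.default_rng(0)
d_model, d_ff, num_heads = 512, 2048, 8  # assumed hyperparameters


def init(*shape):
    return rng.standard_normal(shape) * 0.02


enc = rng.standard_normal((10, 100, d_model))  # stand-in for the dropout output
for _ in range(6):  # num_blocks_0 ... num_blocks_5, each with its own weights
    enc = multihead_attention(enc, enc, enc,
                              init(d_model, d_model), init(d_model, d_model),
                              init(d_model, d_model), num_heads)
    enc = positionwise_feedforward(enc, init(d_model, d_ff), np.zeros(d_ff),
                                   init(d_ff, d_model), np.zeros(d_model))
```

Every block maps (10, 100, 512) to (10, 100, 512), which is why the blocks can be stacked without any reshaping in between.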

Decoder
The decoder takes three inputs: the word embedding matrix, the encoded targets, and the output of the encoder.

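The inputs shifted back by one position serve as the targets. As in the encoder, the "embedding_lookup" layer maps the targets to a (10, 100, 512) tensor.
Then "positional_encoding" adds the sequence information, and the result passes through a dropout layer.
Finally, the result is fed into the "num_block" layers together with the encoder output.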
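One common way to realize a one-position shift is to prepend a start token and drop the last token, so the shape stays (batch, length). The post does not say what fills the vacated first position, so START_ID and the helper shift_right below are hypothetical.

```python
import numpy as np

START_ID = 2  # assumption: hypothetical ID of a start-of-sentence token


def shift_right(seq, start_id=START_ID):
    """Prepend start_id and drop the last token, keeping the shape (batch, length)."""
    start = np.full((seq.shape[0], 1), start_id, dtype=seq.dtype)
    return np.concatenate([start, seq[:, :-1]], axis=1)


seq = np.array([[11, 12, 13, 14],
                [21, 22, 23, 24]])
decoder_inputs = shift_right(seq)
```

With this shift, position i of the decoder input holds the token that precedes position i of the unshifted sequence.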

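Like the encoder, the decoder has 6 "num_block" layers. The difference is that a num_block in the encoder has two layers, "multihead_attention" and "positionwise_feedforward", whereas a num_block in the decoder has three: "self_attention", "vanilla_attention", and "positionwise_feedforward". Here "positionwise_feedforward" is again a feed-forward neural network. "self_attention" computes the attention within the targets, "vanilla_attention" computes the attention between the "self_attention" result and the encoder output, and the result is finally passed to "positionwise_feedforward".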
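What distinguishes "vanilla_attention" from "self_attention" is only where the queries, keys, and values come from. A single-head NumPy sketch, assuming scaled dot-product attention and random tensors in place of real activations:

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(queries, keys, values):
    d = queries.shape[-1]
    scores = queries @ keys.transpose(0, 2, 1) / np.sqrt(d)  # (n, t_q, t_k)
    weights = softmax(scores)                                # each row sums to 1
    return weights @ values, weights


rng = np.random.default_rng(0)
dec = rng.standard_normal((10, 100, 512))     # stand-in for the self_attention result
memory = rng.standard_normal((10, 100, 512))  # stand-in for the encoder output

# Queries come from the decoder; keys and values come from the encoder.
out, weights = scaled_dot_product_attention(dec, memory, memory)
```

Each row of weights says how strongly one target position attends to every source position.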

    # Masked self-attention (Note that causality is True at this time)
    dec = multihead_attention(queries=dec,
                              keys=dec,
                              values=dec,
                              num_heads=self.hp.num_heads,
                              dropout_rate=self.hp.dropout_rate,
                              training=training,
                              causality=True,
                              scope="self_attention")

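In "self_attention", causality is used to suppress future inputs. The causality parameter indicates whether information from future positions should be masked out (during decoder self-attention, a position must not see the positions that come after it); the masking is applied when causality is True, as in the call above. For details, see https://blog.csdn.net/mijiaoxiaosan/article/details/74909076.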
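The effect of such a mask can be sketched in NumPy: scores for future positions are replaced with a very large negative number before the softmax, so their attention weights become zero. The value -1e9 and the helper name causal_softmax are assumptions for illustration.

```python
import numpy as np


def causal_softmax(scores):
    """Softmax over the last axis with future positions (key index > query index) masked."""
    t_q, t_k = scores.shape[-2:]
    future = np.triu(np.ones((t_q, t_k), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(future, -1e9, scores)                 # assumption: -1e9 as "minus infinity"
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)
    return e / e.sum(axis=-1, keepdims=True)


# Uniform scores for one sentence of 4 positions.
scores = np.zeros((1, 4, 4))
weights = causal_softmax(scores)
```

With uniform scores, the first position attends only to itself, the second splits its attention evenly over the first two positions, and so on.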
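Finally, the decoder outputs a (10, 100, 512) tensor. A tf.einsum operation turns it into a (10, 100, 32000) tensor, which is then passed, together with the labels mapped to shape (10, 100, 32000), to "softmax_cross_entropy_with_logits" to compute the loss, and training begins.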
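The projection and loss can be sketched in NumPy with small stand-in sizes. The einsum subscripts, the random projection matrix proj, and the one-hot label mapping are assumptions for illustration; the post does not show these details.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, d, vocab = 2, 5, 16, 50  # small stand-ins for (10, 100, 512, 32000)

dec = rng.standard_normal((n, t, d))          # stand-in for the decoder output
proj = rng.standard_normal((d, vocab))        # hypothetical projection matrix
labels = rng.integers(0, vocab, size=(n, t))  # made-up label IDs

# Contract the model dimension: (n, t, d) x (d, vocab) -> (n, t, vocab).
logits = np.einsum("ntd,dk->ntk", dec, proj)

# Assumption: the labels are mapped to the vocabulary dimension by one-hot encoding.
one_hot = np.eye(vocab)[labels]               # (n, t, vocab)

# Softmax cross-entropy per position, computed via a stable log-softmax.
shifted = logits - logits.max(axis=-1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
loss_per_token = -(one_hot * log_probs).sum(axis=-1)  # (n, t)
loss = loss_per_token.mean()
```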
