The Principles and Implementation of Attention

Vinsmoke -Hou

Article link: https://blog.csdn.net/one_super_dreamer/article/details/120033649

Contents

Objectives
I. Introduction to Attention
II. How Attention Works

Objectives

1. Understand what Attention is for
2. Understand how Attention works
3. Be able to implement Attention in code

I. Introduction to Attention

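In a plain RNN-based Seq2Seq model, the Encoder has to compress an entire sentence into a single vector, which the Decoder then consumes. This requires the Encoder to pack all the information of the source sentence into that one vector. When the input is long (an entire article, for example), this is hard to achieve and the vector becomes a bottleneck. We could use a deeper RNN with more units, but that is expensive. Is there a way to improve the existing RNN structure instead?
Attention means what the word suggests, and models built on it are called Attention-based models. When we look at a painting, we can quickly say what it is mainly about while ignoring the background, because we tend to attend to the main content.
Applied to our RNN: the LSTM or GRU Encoder gives us information for every position of the input, and at each Decoder time step we focus only on the relevant parts of it instead of using all of the Encoder's information. This addresses the bottleneck described in the first paragraph.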

II. How Attention Works

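Suppose we need to translate text, say "机器学习" into "machine learning". The Seq2Seq model covered earlier can do this.
[Figure: Seq2Seq Encoder-Decoder structure]
In the figure above, the Encoder on the left produces the hidden_state that is used on the right.
The content in the blue boxes in the Decoder is teacher forcing, a technique used to speed up training: the ground-truth token is fed in as the next input. Without it, the previous step's output would be fed in as the next step's input (in an Attention model this is no longer the whole input). So how does the process change if we use Attention?
Previously we used the Encoder's last output as the Decoder's initial hidden state. That is no longer necessary.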
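
The difference between teacher forcing and feeding back the model's own output can be sketched with a toy decoding loop. Everything here is hypothetical: `toy_step` stands in for a real decoder step and is just a lookup table with one deliberate mistake, so that the two modes diverge.

```python
# A toy "decoder step": a hypothetical stand-in for a real RNN step.
# It predicts the next token from the current input token via a lookup table,
# with one deliberate mistake ("machine" -> "learn" instead of "learning").
TOY_PREDICTIONS = {
    "<sos>": "machine",
    "machine": "learn",
    "learning": "<eos>",
    "learn": "<unk>",
}

def toy_step(input_token):
    return TOY_PREDICTIONS.get(input_token, "<unk>")

def decode(target, use_teacher_forcing):
    """Run the toy decoder for len(target) steps and return its predictions."""
    outputs = []
    decoder_input = "<sos>"
    for gold_token in target:
        predicted = toy_step(decoder_input)
        outputs.append(predicted)
        # Teacher forcing feeds the ground-truth token to the next step;
        # otherwise the model's own prediction is fed back.
        decoder_input = gold_token if use_teacher_forcing else predicted
    return outputs

target = ["machine", "learning", "<eos>"]
forced = decode(target, use_teacher_forcing=True)
free_running = decode(target, use_teacher_forcing=False)
print(forced)        # ['machine', 'learn', '<eos>']
print(free_running)  # ['machine', 'learn', '<unk>']
```

With teacher forcing the mistake at the second step does not propagate, because the third step still receives the correct token.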

2.1 The Attention Process

[Figure: the Attention computation process]

Source for the above: http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html

2.2 Types of Attention

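In the process above, a score computed from the Decoder state and the Encoder states is used as a weight and multiplied with the Encoder output at each time step. This requires training a suitable match function. As a result, each Decoder time step draws on different parts of the Encoder's information, so the model attends only to a local region of the input, which is the Attention effect.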
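
The weighting described above can be sketched in plain Python. This is a minimal illustration, assuming a plain dot product as the match function (in a real model the match function has trained parameters); the vectors are made-up values.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def match(decoder_state, encoder_state):
    """Stand-in match function: a plain dot product (an assumption for this sketch)."""
    return sum(d * e for d, e in zip(decoder_state, encoder_state))

def attend(decoder_state, encoder_outputs):
    """Return (weights, context) for one decoder time step."""
    scores = [match(decoder_state, h) for h in encoder_outputs]
    weights = softmax(scores)
    hidden_size = len(encoder_outputs[0])
    # context vector = weighted sum of the encoder outputs
    context = [sum(w * h[i] for w, h in zip(weights, encoder_outputs))
               for i in range(hidden_size)]
    return weights, context

# 3 source positions, hidden size 2 (made-up values)
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights_a, context_a = attend([4.0, 0.0], encoder_outputs)  # one decoder state
weights_b, context_b = attend([0.0, 4.0], encoder_outputs)  # a different decoder state
print(weights_a, context_a)
print(weights_b, context_b)
```

The two decoder states put their weight on different source positions, which is the "different Encoder information at different time steps" behaviour.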

2.2.1 Soft-Attention and Hard-Attention

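Soft attention: a probability is computed for every Encoder output.
Hard attention: only one Encoder output is selected, so all of the weight goes to that single position.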
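
A small sketch of the contrast, given the same raw scores. Picking the position with argmax is an assumption made here to keep the example deterministic; sampling a position from the soft weights is another common choice.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(scores):
    """Every encoder position receives a non-zero weight."""
    return softmax(scores)

def hard_attention(scores):
    """All weight goes to a single position (chosen by argmax in this sketch)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]

scores = [2.0, 0.5, 1.0]  # made-up match scores for 3 encoder outputs
soft = soft_attention(scores)
hard = hard_attention(scores)
print(soft)
print(hard)  # [1.0, 0.0, 0.0]
```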

2.2.3 Global-Attention and Local-Attention

Global attention: the Attention weights are computed over all of the Encoder outputs.
Local attention: the weights are computed over only a subset of the Encoder outputs.
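
A simplified sketch of the difference. The `center` and `half_width` parameters are hypothetical names for this example; how the window position is chosen (fixed, or predicted by the model) is left out.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def global_attention(scores):
    """Normalise over every encoder position."""
    return softmax(scores)

def local_attention(scores, center, half_width):
    """Normalise only inside [center - half_width, center + half_width]; zero elsewhere."""
    lo = max(0, center - half_width)
    hi = min(len(scores), center + half_width + 1)
    window = softmax(scores[lo:hi])
    return [0.0] * lo + window + [0.0] * (len(scores) - hi)

scores = [0.1, 0.4, 2.0, 1.5, 0.3, 0.2]  # made-up scores for 6 encoder outputs
global_w = global_attention(scores)
local_w = local_attention(scores, center=2, half_width=1)
print(global_w)
print(local_w)  # non-zero only at positions 1, 2, 3
```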

2.2.4 Bahdanau Attention and Luong Attention

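Bahdanau Attention is computed as follows:

1. The previous hidden state is matched against the Encoder output to get the attention weight
2. The attention weight and the Encoder output are combined into the context vector
3. The context vector is used as input to the current time step
4. The current time step also takes the previous hidden state as input
5. This yields the current time step's output and hidden state

Luong Attention is computed as follows:

1. The GRU computes the Decoder hidden state
2. The hidden state is matched against the Encoder output to get a_t (the attention weight)
3. The attention weight and the Encoder output are combined into the context vector
4. The context vector is merged with the GRU output of the current time step to compute the final output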
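Differences between the two:
On the Encoder side:
Bahdanau Attention encodes with a bidirectional GRU and concatenates the forward and backward outputs to form the Encoder result.
Luong Attention uses a unidirectional multi-layer GRU and takes the output of the last layer as the Encoder output.
On the Decoder side:
Bahdanau Attention matches the previous hidden state against the Encoder output to get the attention weight and the context vector, which is fed into the GRU as input.
Luong Attention matches the current time step's output against the Encoder output to get the attention weight, combines that weight with the Encoder output to get the context vector, and concatenates the context vector with the Decoder output to form the output.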
On the computation side, $h_t$ is the current Decoder hidden state and $h_s$ denotes all of the Encoder hidden states (the Encoder output).
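
For reference, the score functions of the two approaches can be written with these symbols as follows, taking $h_s$ at a single source position $s$. The notation ($W_a$, $U_a$ and $v_a$ as learned parameters, $h_{t-1}$ as the previous Decoder hidden state) follows the common presentation of Bahdanau et al. and Luong et al. and is an assumption, not taken from this article.

```latex
\begin{aligned}
\text{Bahdanau:}\quad & \mathrm{score}(h_{t-1}, h_s) = v_a^{\top} \tanh\left(W_a h_{t-1} + U_a h_s\right) \\
\text{Luong (dot):}\quad & \mathrm{score}(h_t, h_s) = h_t^{\top} h_s \\
\text{Luong (general):}\quad & \mathrm{score}(h_t, h_s) = h_t^{\top} W_a h_s \\
\text{Luong (concat):}\quad & \mathrm{score}(h_t, h_s) = v_a^{\top} \tanh\left(W_a [h_t; h_s]\right) \\
\text{weights:}\quad & a_t(s) = \frac{\exp\left(\mathrm{score}(h_t, h_s)\right)}{\sum_{s'} \exp\left(\mathrm{score}(h_t, h_{s'})\right)}
\end{aligned}
```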
In practice the two give fairly similar results, so from here on we can use Luong attention in our code.
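
The main structural difference on the Decoder side is the order of operations, which the following toy sketch tries to make visible. It is not either paper's model: `toy_cell` is a hypothetical stand-in for a GRU cell, the match function is a plain dot product, and the values are made up.

```python
import math

def softmax(scores):
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_outputs):
    """Dot-product match, softmax, then weighted sum of the encoder outputs."""
    weights = softmax([sum(q * e for q, e in zip(query, h)) for h in encoder_outputs])
    size = len(encoder_outputs[0])
    return [sum(w * h[i] for w, h in zip(weights, encoder_outputs)) for i in range(size)]

def toy_cell(step_input, hidden):
    """Hypothetical stand-in for a GRU cell: mixes the mean of the input into the hidden state."""
    mean_in = sum(step_input) / len(step_input)
    return [math.tanh(h + mean_in) for h in hidden]

def bahdanau_step(embedded, prev_hidden, encoder_outputs):
    context = attend(prev_hidden, encoder_outputs)  # 1. attend with the PREVIOUS hidden state
    cell_input = embedded + context                 # 2. context becomes part of the cell input (list concat)
    hidden = toy_cell(cell_input, prev_hidden)      # 3. run the recurrent cell
    return hidden, hidden                           # output and new hidden state

def luong_step(embedded, prev_hidden, encoder_outputs):
    hidden = toy_cell(embedded, prev_hidden)        # 1. run the recurrent cell first
    context = attend(hidden, encoder_outputs)       # 2. attend with the CURRENT hidden state
    output = hidden + context                       # 3. concat the cell output with the context
    return output, hidden

encoder_outputs = [[1.0, 0.0], [0.0, 1.0]]
embedded, prev_hidden = [0.5, -0.5], [0.2, 0.1]
b_out, b_hidden = bahdanau_step(embedded, prev_hidden, encoder_outputs)
l_out, l_hidden = luong_step(embedded, prev_hidden, encoder_outputs)
print(len(b_out), len(l_out))  # 2 4
```

In the Luong-style step the concatenated vector is twice as wide as the hidden state, which is why a projection layer follows it in a full model.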

III. Code Implementation of Attention

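Before writing the code, let us fix the plan. What the attention module needs to compute is the attention weight. From the previous sections we know that attention_weight = f(hidden, encoder_outputs), so the main task is to implement the three Luong attention score operations: dot, general, and concat.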

import torch
import torch.nn as nn
import torch.nn.functional as F

import config  # the project's own settings module; supplies the hidden sizes used below


class attention(nn.Module):
    def __init__(self, method="general"):
        super(attention, self).__init__()
        assert method in ["dot", "general", "concat"], "method error"
        self.method = method
        if method == "general":
            self.wa = nn.Linear(config.chatbot_encoder_hidden_size, config.chatbot_decoder_hidden_size, bias=False)
        if method == "concat":
            self.wa = nn.Linear(config.chatbot_encoder_hidden_size + config.chatbot_decoder_hidden_size, config.chatbot_decoder_hidden_size)
            self.va = nn.Linear(config.chatbot_decoder_hidden_size, 1)

    def forward(self, hidden_state, encoder_outputs):
        """
        :param hidden_state: [num_layers, batch_size, decoder_hidden_size]
        :param encoder_outputs: [batch_size, seq_len, encoder_hidden_size]
        :return: attention_weight, [batch_size, seq_len]
        """
        hidden_state = hidden_state[-1]  # last layer only: [batch_size, decoder_hidden_size]
        if self.method == "dot":
            # dot requires encoder_hidden_size == decoder_hidden_size
            hidden_state = hidden_state.unsqueeze(-1)  # [batch_size, decoder_hidden_size, 1]
            attention_weight = encoder_outputs.bmm(hidden_state).squeeze(-1)  # [batch_size, seq_len]
        elif self.method == "general":
            encoder_outputs = self.wa(encoder_outputs)  # [batch_size, seq_len, decoder_hidden_size]
            hidden_state = hidden_state.unsqueeze(-1)  # [batch_size, decoder_hidden_size, 1]
            attention_weight = encoder_outputs.bmm(hidden_state).squeeze(-1)  # [batch_size, seq_len]
        elif self.method == "concat":
            seq_len = encoder_outputs.size(1)
            hidden_state = hidden_state.unsqueeze(1).repeat(1, seq_len, 1)  # [batch_size, seq_len, decoder_hidden_size]
            concated = torch.cat([hidden_state, encoder_outputs], dim=-1)  # [batch_size, seq_len, decoder_hidden_size+encoder_hidden_size]
            # nn.Linear acts on the last dimension, so the 3-D tensor can be passed in directly
            attention_weight = self.va(torch.tanh(self.wa(concated))).squeeze(-1)  # [batch_size, seq_len]

        # normalise the scores over the source positions
        return F.softmax(attention_weight, dim=-1)  # [batch_size, seq_len]
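
To see what the three branches compute without the batching, here is a plain-Python sketch of the three scores for a single decoder state and a single encoder state. The weights `W_general`, `W_concat` and `v` are made-up values for illustration, not trained parameters, and bias terms are left out.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]

def score_dot(h_t, h_s):
    # h_t . h_s (both vectors must have the same size)
    return dot(h_t, h_s)

def score_general(h_t, h_s, W):
    # h_t . (W h_s): project the encoder state to the decoder size first
    return dot(h_t, matvec(W, h_s))

def score_concat(h_t, h_s, W, v):
    # v . tanh(W [h_t; h_s])
    projected = matvec(W, h_t + h_s)  # list concatenation builds [h_t; h_s]
    return dot(v, [math.tanh(p) for p in projected])

h_t = [1.0, 2.0]   # decoder hidden state
h_s = [0.5, -1.0]  # one encoder output
W_general = [[1.0, 0.0], [0.0, 1.0]]  # identity, so "general" reduces to "dot" here
W_concat = [[0.1, 0.2, 0.3, 0.4], [0.0, 0.0, 0.0, 0.0]]
v = [1.0, 1.0]

print(score_dot(h_t, h_s))                  # -1.5
print(score_general(h_t, h_s, W_general))   # -1.5
print(score_concat(h_t, h_s, W_concat, v))
```

Each function returns one raw score; applying softmax over the scores of all source positions gives the attention weights.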

Using it in the decoder's forward_step method:

    def forward_step(self, decoder_input, decoder_hidden, encoder_outputs):
        """
        Compute the result for a single time step
        :param decoder_input: [batch_size, 1]
        :param decoder_hidden: [1, batch_size, hidden_size]
        :param encoder_outputs: [batch_size, seq_len, encoder_hidden_size]
        :return: output [batch_size, vocab_size], decoder_hidden [1, batch_size, hidden_size]
        """
        decoder_input_embeded = self.embedding(decoder_input)
        # out: [batch_size, 1, hidden_size]
        # decoder_hidden: [1, batch_size, hidden_size]
        out, decoder_hidden = self.gru(decoder_input_embeded, decoder_hidden)

        ############# attention start ##############
        # the attention output is [batch_size, seq_len]; to multiply it with
        # encoder_outputs [batch_size, seq_len, encoder_hidden_size] it needs an extra dimension
        attention_weight = self.attn(decoder_hidden, encoder_outputs).unsqueeze(1)  # [batch_size, 1, seq_len]
        context_vector = attention_weight.bmm(encoder_outputs)  # [batch_size, 1, encoder_hidden_size]
        concated = torch.cat([out, context_vector], dim=-1).squeeze(1)  # [batch_size, decoder_hidden_size+encoder_hidden_size]
        out = self.wc(concated)  # [batch_size, hidden_size]
        ############# attention end ##############
        # out = out.squeeze(1) # [batch_size, hidden_size]
        output = F.log_softmax(self.fc(out), dim=-1)  # [batch_size, vocab_size]

        return output, decoder_hidden