Attention Series, Part 1: A Summary of Traditional Attention in seq2seq

Kingslayer_

Article link: https://blog.csdn.net/qq_33278884/article/details/89313428

As the title says, this post summarizes traditional attention and introduces the different ways attention is used in seq2seq models.

Abstract

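First, seq2seq consists of two modules, an encoder and a decoder. Both can be built from RNN structures such as LSTM or GRU, which was the classic approach before the Transformer appeared. (This post mainly compares the examples from the official TensorFlow tutorial and the PyTorch tutorial to explain this in detail.)
CNNs can also replace RNNs in the encoder and decoder, mainly through the gated linear unit (GLU) structure, which will be covered in detail later. See the paper "Convolutional Sequence to Sequence Learning".
Then there is the recently popular Transformer, which drops traditional RNN and CNN structures in favor of self-attention and Multi-Head Attention. The next post covers it in detail. See the paper "Attention Is All You Need".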

Disclaimer: my understanding may be shallow or wrong in places, so please read with your own judgment. Corrections are welcome.

The Seq2Seq Model (without attention)

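Before introducing attention, let's look at the network structure without it. Every decoder position uses the same context vector C, which is the encoder's final hidden state. The model therefore cannot express how strongly each encoder word influences the word the decoder is currently generating.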
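The fixed-context setup described above can be written compactly as follows. Only C comes from the text; the symbols f, g, W, h_t, s_i, x_t and y_i are notation assumed here for illustration.

```latex
\begin{aligned}
h_t &= f(x_t, h_{t-1}), \quad t = 1, \dots, T \\
C &= h_T \\
s_i &= g(y_{i-1}, s_{i-1}, C) \\
p(y_i \mid y_{<i}, x) &= \operatorname{softmax}(W s_i)
\end{aligned}
```

Because C does not depend on i, every output position sees the same summary of the input. A common variant uses C only to initialize the decoder state s_0 instead of feeding it in at every step.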

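Attention addresses two problems. First, model performance degrades when the input sentence is long: information is lost as it propagates through the RNN, so little of the information from words near the start of the sentence survives to the end. Second, attention assigns a different weight to each word in the sentence, so the model can focus on the words that matter most for the current output. Experiments show that on long texts, models with attention clearly outperform models without it.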
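The per-word weighting described above can be sketched in plain Python. This is a minimal illustration, not the method of any particular paper: it assumes a dot-product score between the decoder state and each encoder state (the post does not specify a scoring function), and the names `attend`, `encoder_states` and `decoder_state` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step.

    Assumption: the score is a plain dot product between the decoder
    state and each encoder hidden state.
    """
    scores = [dot(decoder_state, h) for h in encoder_states]
    weights = softmax(scores)
    dim = len(decoder_state)
    # The context vector is the weighted sum of the encoder states.
    context = [
        sum(w * h[k] for w, h in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context


# Toy data: three encoder hidden states of size 2 (made up for illustration).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
print(weights, context)
```

Unlike the single vector C, the context here is recomputed for each decoder step, so encoder positions that score higher against the current decoder state contribute more.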

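Finally, here is the PyTorch implementation:

[Figures: encoder and decoder network diagrams]

As you can see, the decoder uses only the hidden state from the encoder's last time step.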

import torch
import torch.nn as nn
import torch.nn.functional as F

# Run on the GPU when one is available, otherwise on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        # input_size is the source vocabulary size.
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        # One token per call, reshaped to (seq_len=1, batch=1, hidden_size).
        embedded = self.embedding(input).view(1, 1, -1)
        output, hidden = self.gru(embedded, hidden)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)


class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size

        # output_size is the target vocabulary size.
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        # hidden starts as the encoder's final hidden state (the context vector).
        output = self.embedding(input).view(1, 1, -1)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)
        # Log-probabilities over the target vocabulary, shape (1, output_size).
        output = self.softmax(self.out(output[0]))
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
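To show how the two classes above fit together, here is a rough sketch of greedy decoding with untrained weights. It inlines the same layer types so that it runs on its own. The vocabulary sizes, the `SOS_TOKEN`/`EOS_TOKEN` ids, `MAX_LEN` and the function name `greedy_translate` are all assumptions made for illustration, not part of the original code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes and special-token ids, chosen only for illustration.
SRC_VOCAB, TGT_VOCAB, HIDDEN = 10, 12, 8
SOS_TOKEN, EOS_TOKEN, MAX_LEN = 0, 1, 5

# Same layer types as EncoderRNN / DecoderRNN, inlined to stay self-contained.
enc_embedding = nn.Embedding(SRC_VOCAB, HIDDEN)
enc_gru = nn.GRU(HIDDEN, HIDDEN)
dec_embedding = nn.Embedding(TGT_VOCAB, HIDDEN)
dec_gru = nn.GRU(HIDDEN, HIDDEN)
dec_out = nn.Linear(HIDDEN, TGT_VOCAB)


def greedy_translate(src_tokens):
    """Encode src_tokens one at a time, then decode greedily."""
    with torch.no_grad():
        # Encoder: feed tokens one by one, carrying the hidden state forward.
        hidden = torch.zeros(1, 1, HIDDEN)
        for tok in src_tokens:
            emb = enc_embedding(torch.tensor([tok])).view(1, 1, -1)
            _, hidden = enc_gru(emb, hidden)
        context = hidden.clone()  # the only thing the decoder receives

        # Decoder: start from SOS and always pick the most likely next token.
        dec_input = torch.tensor([SOS_TOKEN])
        result = []
        for _ in range(MAX_LEN):
            emb = torch.relu(dec_embedding(dec_input)).view(1, 1, -1)
            output, hidden = dec_gru(emb, hidden)
            log_probs = torch.log_softmax(dec_out(output[0]), dim=1)
            next_tok = int(log_probs.argmax(dim=1).item())
            if next_tok == EOS_TOKEN:
                break
            result.append(next_tok)
            dec_input = torch.tensor([next_tok])
    return context, result


context, tokens = greedy_translate([3, 4, 5])
print(context.shape, tokens)
```

The weights are random, so the output tokens are meaningless. The point is the data flow: the whole source sentence reaches the decoder only through the single tensor `context`.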