Machine Learning Weekly Report: Week 8

Article link: https://blog.csdn.net/2301_78609379/article/details/132003631

Abstract

A conventional RNN has feedback connections that pass the hidden state of the current time step on to the next one, which lets it capture temporal dependencies in sequential data. LSTM is an improved recurrent neural network that adds an input gate, a forget gate, and an output gate. The input gate decides which parts of the current input are written to the memory cell, the forget gate controls whether the memory from the previous time step is kept, and the output gate controls which parts of the memory are activated and emitted in the current hidden state. With these gates, an LSTM can selectively keep important information in its memory cell and capture long-term dependencies more effectively, which improves its ability to model complex sequential data.

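I. Recurrent Neural Networks (RNN)

1. Motivating Example

In a slot filling task, the same word can belong to different slots depending on the sentence it appears in, as the figure below shows. Once a sentence is fed in, the model therefore cannot decide a word's slot from the word alone; it also has to take the surrounding context into account. A recurrent neural network (RNN) makes this possible.
[Figure: slot filling example in which the same word belongs to different slots]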

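2. Recurrent Neural Network

RNNs are widely used in natural language processing, speech recognition, time series forecasting, and similar areas. Compared with a feedforward network, an RNN adds a memory that stores the hidden layer's output from the previous time step, so that earlier information can be passed along inside the network.
As shown in the figure below, a1 is the output of the hidden layer, and the RNN stores it in memory. The stored value then becomes part of the input for the next computation.
[Figure: RNN whose hidden layer output a1 is stored in memory]
Take the slot filling case as an example. The word arrive is fed into the model, the hidden layer's output a1 is stored in memory, and the output y1 gives the probability that arrive belongs to a given slot. Taipei is then fed in; the hidden layer now considers both the stored output a1 and the input Taipei to produce y2, the probability that Taipei belongs to a given slot. The same process repeats for the remaining words.
[Figure: RNN processing arrive and Taipei step by step]
This is how an RNN solves the problem from the motivating example. In the figure below, the input x2 is Taipei in both sentences. In the first sentence x1 is the word leave, and the resulting hidden output a1 also influences the output for x2; in the second sentence x1 is arrive. The model can therefore use x1 to tell that Taipei belongs to the departure slot in the first sentence and to the destination slot in the second.
[Figure: the same input Taipei after leave and after arrive]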
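
The effect of the stored hidden state can be sketched with a scalar toy model. Everything numeric here is an assumption made for illustration: the word encodings in `WORDS`, the weights `w_in`, `w_rec`, and `w_out`, and the choice of tanh and sigmoid. A real slot filling model would use word vectors and learned weight matrices.

```python
import math

def elman_step(x, a_prev, w_in, w_rec, w_out):
    """One Elman RNN step with a scalar input, hidden state, and output."""
    # The new hidden state depends on the input AND the stored hidden state.
    a = math.tanh(w_in * x + w_rec * a_prev)
    # Sigmoid score for a single hypothetical slot, e.g. "destination".
    y = 1.0 / (1.0 + math.exp(-w_out * a))
    return a, y

# Hypothetical scalar encodings of the words (not from the original post).
WORDS = {"leave": -1.0, "arrive": 1.0, "Taipei": 0.5}

def run(sentence, w_in=1.0, w_rec=2.0, w_out=3.0):
    a = 0.0  # the memory starts empty
    outputs = []
    for word in sentence:
        a, y = elman_step(WORDS[word], a, w_in, w_rec, w_out)
        outputs.append(y)
    return outputs

y_leave = run(["leave", "Taipei"])
y_arrive = run(["arrive", "Taipei"])
# Same input word "Taipei", different scores because the memory differs.
print(y_leave[1], y_arrive[1])
```

With these made-up weights the score for Taipei is low after leave and high after arrive, which mirrors the departure versus destination distinction described above.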

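3. Two Variants

RNNs come in several variants. Two of them are the Elman Network and the Jordan Network. The RNN used in the slot filling example above is an Elman Network, which stores the hidden layer's output in memory. A Jordan Network instead stores the model's output in memory. The Jordan Network is often said to perform better: the output has a fixed target value during training, so what ends up in memory is better defined than an unconstrained hidden state.
[Figure: Elman Network versus Jordan Network]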
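
The only structural difference between the two variants is what gets written to memory. A minimal scalar sketch, where the function names, the weights, and the tanh and sigmoid activations are all illustrative assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def elman_step(x, memory, w_in, w_mem, w_out):
    hidden = math.tanh(w_in * x + w_mem * memory)
    y = sigmoid(w_out * hidden)
    return y, hidden  # Elman: the hidden layer output goes to memory

def jordan_step(x, memory, w_in, w_mem, w_out):
    hidden = math.tanh(w_in * x + w_mem * memory)
    y = sigmoid(w_out * hidden)
    return y, y       # Jordan: the network output goes to memory

y_e, mem_e = elman_step(1.0, 0.0, 1.0, 1.0, 1.0)
y_j, mem_j = jordan_step(1.0, 0.0, 1.0, 1.0, 1.0)
print("Elman memory:", mem_e, "Jordan memory:", mem_j)
```

On the first step both variants produce the same output; they diverge from the second step on because the stored values differ.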

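II. Long Short-Term Memory (LSTM)

LSTM stands for "Long Short-Term Memory" and is a variant of the recurrent neural network (RNN). An LSTM unit has four inputs, namely the input itself and the control signals for the input gate, the forget gate, and the output gate, and it has a single output. It alleviates the vanishing gradient problem of conventional RNNs and their limited memory on long sequences. Exploding gradients are usually handled separately, for example with gradient clipping.

The input gate controls whether the input is written to the LSTM's memory cell. It uses a sigmoid activation, which produces a value between 0 and 1. When the input gate's output is close to 1, the input is let in; when it is close to 0, the input is blocked.
The forget gate controls whether the content of the memory cell is kept. It also uses a sigmoid activation with output between 0 and 1. When the forget gate's output is close to 1, the value stored in the memory cell at the previous step is kept; when it is close to 0, that value is forgotten.
The output gate controls what is emitted. It likewise uses a sigmoid activation with output between 0 and 1, and its output decides whether the memory cell's content is activated and passed on as the output.

[Figure: LSTM memory cell with input gate, forget gate, and output gate]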
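
The three gates described above can be summarized in two equations. The symbols are assumptions introduced here, not taken from the original post: $z$ is the input, $z_i$, $z_f$, $z_o$ are the control signals of the input, forget, and output gates, $c$ is the value previously stored in the memory cell, $c'$ is the updated value, $a$ is the output, $f$ is the sigmoid function, and $g$, $h$ are activation functions.

$$
c' = g(z)\, f(z_i) + c\, f(z_f), \qquad a = h(c')\, f(z_o)
$$

When $f(z_i) \approx 0$ the input is blocked, when $f(z_f) \approx 0$ the old memory is erased, and when $f(z_o) \approx 0$ the output is close to 0 regardless of what is stored.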

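As shown in the figure below, when x2 = 0 the input gate's control signal defaults to a negative value, so the gate is closed, while the forget gate's control signal defaults to a positive value, so that gate is open. When x3 = 0 the output gate's control signal defaults to a negative value, so it is closed. Now feed the first vector into the model. The input value is 3 and the input gate is open, so 3 is written to the memory cell. The forget gate is also open, so 3 is added to the value previously stored in the memory cell. The output gate, however, is closed, so nothing gets out and the output y is 0. The remaining vectors can be analyzed in the same way.
[Figure: step-by-step LSTM example with inputs x1, x2, x3 and output y]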
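
The walkthrough above can be reproduced with a small sketch. The figure's actual weights are not given in the text, so the numbers below are assumptions chosen to match the described behavior: weight 1 on x1 for the input, weight 100 on x2 with bias -10 for the input gate and bias +10 for the forget gate, weight 100 on x3 with bias -10 for the output gate, and identity activations for the input and the cell output. The input vectors `(3, 1, 0)` and `(0, 0, 1)` are likewise hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell_step(x, c_prev):
    """One step of a single LSTM cell with hand-picked (assumed) weights."""
    x1, x2, x3 = x
    z = x1                      # input value, weight 1 on x1
    i = sigmoid(100 * x2 - 10)  # input gate: closed by default (x2 = 0)
    f = sigmoid(100 * x2 + 10)  # forget gate: open by default (x2 = 0)
    o = sigmoid(100 * x3 - 10)  # output gate: closed by default (x3 = 0)
    c = f * c_prev + i * z      # keep old memory, add gated input
    y = o * c                   # gated read-out of the memory
    return c, y

# First vector: input 3 with the input gate open and the output gate closed.
c1, y1 = lstm_cell_step((3, 1, 0), c_prev=0.0)
# A later vector that only opens the output gate.
c2, y2 = lstm_cell_step((0, 0, 1), c_prev=c1)
print(round(c1, 3), round(y1, 3), round(c2, 3), round(y2, 3))
```

After the first step the memory holds about 3 while the output stays near 0; once the output gate opens in the second step, the stored value is read out.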
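Compared with an ordinary neural network, an LSTM can be thought of as the same network with each neuron replaced by an LSTM memory cell. The difference is that an ordinary neuron needs only one input, which is computed by multiplying the input vector by the corresponding weights, whereas an LSTM unit needs four such inputs, each with its own weights. For the same number of units, an LSTM layer therefore has roughly four times as many input weights.
[Figure: neurons of a feedforward network replaced by LSTM memory cells]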
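
A rough parameter count makes the factor of four concrete. This is a simplified sketch: the function names are made up for illustration, and the LSTM count deliberately ignores the recurrent connections from the previous hidden state, which add further weights in a real LSTM layer.

```python
def dense_layer_params(n_in, n_units):
    """Weights plus one bias per neuron in an ordinary fully connected layer."""
    return n_units * (n_in + 1)

def lstm_input_params(n_in, n_units):
    """Input weights of an LSTM layer: one set each for the input and the
    input, forget, and output gates (recurrent weights are not counted)."""
    return 4 * dense_layer_params(n_in, n_units)

print(dense_layer_params(10, 20), lstm_input_params(10, 20))
```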

III. Self-Attention Code

Let the hyperparameter num_attention_heads be the number of attention heads.

self.num_attention_heads = num_attention_heads
self.attention_head_size = int(hidden_size / num_attention_heads)
self.all_head_size = hidden_size

Define the three projection matrices $W^{q}$, $W^{k}$, and $W^{v}$.

self.query = nn.Linear(input_size, self.all_head_size)
self.key = nn.Linear(input_size, self.all_head_size)
self.value = nn.Linear(input_size, self.all_head_size)

Multiply the input features by $W^{q}$, $W^{k}$, and $W^{v}$ to obtain the query, key, and value matrices.

mixed_query_layer = self.query(input_tensor)
mixed_key_layer = self.key(input_tensor)
mixed_value_layer = self.value(input_tensor)

Split the result into num_attention_heads heads and rearrange the dimensions.

def transpose_for_scores(self, x):
    new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
    x = x.view(*new_x_shape)
    return x.permute(0, 2, 1, 3)

query_layer = self.transpose_for_scores(mixed_query_layer)
key_layer = self.transpose_for_scores(mixed_key_layer)
value_layer = self.transpose_for_scores(mixed_value_layer)
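
To see what this reshaping does to the data, here is a sketch of the same rearrangement on plain nested lists instead of tensors. The helper `split_heads` is hypothetical and only mirrors the `view` followed by `permute(0, 2, 1, 3)`: it turns `[batch, seq_len, hidden]` into `[batch, num_heads, seq_len, head_size]`.

```python
def split_heads(x, num_heads):
    """[batch, seq_len, hidden] -> [batch, num_heads, seq_len, head_size]."""
    hidden = len(x[0][0])
    if hidden % num_heads != 0:
        raise ValueError("hidden size must be a multiple of num_heads")
    head_size = hidden // num_heads
    return [
        [
            # Head h takes a contiguous slice of every token's features.
            [token[h * head_size:(h + 1) * head_size] for token in sequence]
            for h in range(num_heads)
        ]
        for sequence in x
    ]

# One sequence of 3 tokens with hidden size 4, split into 2 heads of size 2.
x = [[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]]
heads = split_heads(x, num_heads=2)
print(heads[0][0])  # features seen by head 0
print(heads[0][1])  # features seen by head 1
```

Each head sees every token, but only its own slice of the feature dimension.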

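Multiply the query matrix by the transpose of the key matrix to obtain the attention scores, divide them by the square root of the per-head dimension, and apply softmax to turn the scores into attention weights.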

attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
attention_probs = nn.Softmax(dim=-1)(attention_scores)

Multiply the attention weights by the value matrix.

context_layer = torch.matmul(attention_probs, value_layer)
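
The score, softmax, and weighted-sum steps can be checked by hand for a single head. The following is a dependency-free sketch on nested lists; the helper names `softmax` and `attention` and the toy matrices are assumptions for illustration, not part of the original code.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are lists of rows."""
    d = len(q[0])
    scores = [
        [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d) for k_row in k]
        for q_row in q
    ]
    probs = [softmax(row) for row in scores]
    context = [
        [sum(p * v_row[j] for p, v_row in zip(p_row, v)) for j in range(len(v[0]))]
        for p_row in probs
    ]
    return probs, context

k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
probs, context = attention([[1.0, 0.0], [0.0, 1.0]], k, v)
# A zero query scores every key equally, so the context is the mean of v.
uniform_probs, uniform_context = attention([[0.0, 0.0]], k, v)
print(probs)
print(uniform_context)
```

Each row of the attention weights sums to 1, and each context row is a weighted average of the value rows.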

Rearrange the dimensions of context_layer so that the results of all heads can be concatenated.

context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
context_layer = context_layer.view(*new_context_layer_shape)

Complete Code

import math

import torch
import torch.nn as nn


class LayerNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-12):
        """Construct a layernorm module in the TF style (epsilon inside the square root).
        """
        super(LayerNorm, self).__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.bias = nn.Parameter(torch.zeros(hidden_size))
        self.variance_epsilon = eps

    def forward(self, x):
        u = x.mean(-1, keepdim=True)
        s = (x - u).pow(2).mean(-1, keepdim=True)
        x = (x - u) / torch.sqrt(s + self.variance_epsilon)
        return self.weight * x + self.bias
        
class SelfAttention(nn.Module):
    def __init__(self, num_attention_heads, input_size, hidden_size, hidden_dropout_prob,
                 attention_probs_dropout_prob=None):
        super(SelfAttention, self).__init__()
        if hidden_size % num_attention_heads != 0:
            raise ValueError(
                "The hidden size (%d) is not a multiple of the number of attention "
                "heads (%d)" % (hidden_size, num_attention_heads))
        # Dropout rate for the attention weights; defaults to hidden_dropout_prob when not given.
        if attention_probs_dropout_prob is None:
            attention_probs_dropout_prob = hidden_dropout_prob
        self.num_attention_heads = num_attention_heads
        self.attention_head_size = int(hidden_size / num_attention_heads)
        self.all_head_size = hidden_size

        self.query = nn.Linear(input_size, self.all_head_size)
        self.key = nn.Linear(input_size, self.all_head_size)
        self.value = nn.Linear(input_size, self.all_head_size)

        self.attn_dropout = nn.Dropout(attention_probs_dropout_prob)

        # After self-attention: a fully connected layer, dropout, then residual connection and LayerNorm
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.LayerNorm = LayerNorm(hidden_size, eps=1e-12)
        self.out_dropout = nn.Dropout(hidden_dropout_prob)

    def transpose_for_scores(self, x):
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(self, input_tensor):
        mixed_query_layer = self.query(input_tensor)
        mixed_key_layer = self.key(input_tensor)
        mixed_value_layer = self.value(input_tensor)

        query_layer = self.transpose_for_scores(mixed_query_layer)
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)

        # Take the dot product between "query" and "key" to get the raw attention scores.
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))

        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        # Apply the attention mask (precomputed for all layers in the BertModel forward() function)
        # [batch_size heads seq_len seq_len] scores
        # [batch_size 1 1 seq_len]

        # attention_scores = attention_scores + attention_mask

        # Normalize the attention scores to probabilities.
        attention_probs = nn.Softmax(dim=-1)(attention_scores)
        # This is actually dropping out entire tokens to attend to, which might
        # seem a bit unusual, but is taken from the original Transformer paper.
        # Fixme
        attention_probs = self.attn_dropout(attention_probs)
        context_layer = torch.matmul(attention_probs, value_layer)
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(*new_context_layer_shape)
        hidden_states = self.dense(context_layer)
        hidden_states = self.out_dropout(hidden_states)
        # Residual connection; adding input_tensor requires input_size == hidden_size
        hidden_states = self.LayerNorm(hidden_states + input_tensor)

        return hidden_states
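
The normalization performed by the LayerNorm class above can be verified on a single vector without any framework. This is a sketch under the assumption of the default parameters, that is weight 1 and bias 0; the function name `layer_norm` is made up for illustration.

```python
import math

def layer_norm(x, eps=1e-12):
    """Normalize one feature vector to zero mean and unit variance
    (epsilon inside the square root, weight = 1 and bias = 0)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return [(v - mean) / math.sqrt(var + eps) for v in x]

out = layer_norm([1.0, 2.0, 3.0, 4.0])
out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
print(out_mean, out_var)
```

The output has a mean close to 0 and a variance close to 1, which is what the residual sum in SelfAttention.forward is normalized to along the last dimension.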