NLP_Transformer_Attention_mechanism

Latest recommended article published 2023-03-25 12:26:45

Jason24_Zeng

Category: DL. Tags: artificial intelligence, deep learning, algorithms

Article link: https://blog.csdn.net/Jason24_Zeng/article/details/109006162

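Techniques before the Transformer and their limitations

RNN: can capture long-range dependency information, but cannot be parallelized
CNN: can be parallelized, but cannot capture long-range dependencies directly (the receptive field has to be enlarged through pooling or larger kernels)
Traditional attention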

Notes on some layers in NLP neural networks

Attention Mechanism

An important model to know that predates attention models: seq2seq, or the RNN encoder-decoder,
http://emnlp2014.org/papers/pdf/EMNLP2014179.pdf

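Attention first became popular with the Google DeepMind paper Recurrent Models of Visual Attention, which uses attention in an RNN for image classification.
Attention-based RNN in NLP
First used with RNNs in NLP in the paper Neural Machine Translation by Jointly Learning to Align and Translate.
It is applied to Neural Machine Translation (NMT). NMT is a typical encoder-decoder model that uses two RNNs: one encodes the source language, and the other decodes that encoding into the translation.

That covers the basic idea. So how is attention designed? On to the math.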

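Understanding:
Read the formulas from the bottom up:
$e_{ij} = a(s_{i-1}, h_j)$: the alignment score $e_{ij}$ between the word being generated at position $i$ and source word $j$ depends on the previous decoder state $s_{i-1}$ and the encoder annotation $h_j$ of that source word (from a bidirectional RNN).
$a(s_{i-1}, h_j) = v_a^{T} \tanh(W_a s_{i-1} + U_a h_j)$: this is a single-hidden-layer perceptron (a small feedforward network) whose parameters $v_a$, $W_a$ and $U_a$ are trained jointly with the rest of the model.
$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum^{T_x}_{k=1} \exp(e_{ik})}$: a softmax that normalizes the scores $e_{ij}$ into attention weights.
$c_i = \sum^{T_x}_{j=1}\alpha_{ij}h_j$: yields a context vector that summarizes all the encoder annotations. It takes part in predicting the word at decoding position $i$, and it is different for each $i$.
$s_i = f(s_{i-1},y_{i-1},c_i)$: each decoder state $s_i$ depends on the previous decoder state $s_{i-1}$, the previous output $y_{i-1}$, and the context vector $c_i$ coming from the encoder RNN.
$p(y_i|y_1,...,y_{i-1},x) = g(y_{i-1},s_i, c_i)$: the output word $y_i$ depends on the previous output $y_{i-1}$, the decoder state $s_i$, and the context vector $c_i$.
The attention matrix should then be $A = \{\alpha_{ij}\}$.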
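
The formulas above can be sketched for a single decoding step $i$ in NumPy. This is only a minimal illustration, not the paper's implementation: the function name `additive_attention`, the dimensions, and the random parameters are assumptions made up for the example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """One decoding step of additive attention.

    s_prev: previous decoder state s_{i-1}, shape (n,)
    H:      encoder annotations h_1..h_{T_x}, shape (T_x, m)
    Returns (alpha_i, c_i): weights of shape (T_x,), context of shape (m,).
    """
    # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), computed for all j at once
    e = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a
    alpha = softmax(e)   # alpha_ij, sums to 1 over j
    c = alpha @ H        # c_i = sum_j alpha_ij h_j
    return alpha, c

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
T_x, n, m, d = 5, 4, 6, 3
H = rng.normal(size=(T_x, m))
s_prev = rng.normal(size=n)
W_a = rng.normal(size=(d, n))
U_a = rng.normal(size=(d, m))
v_a = rng.normal(size=d)

alpha, c = additive_attention(s_prev, H, W_a, U_a, v_a)
```

Stacking the `alpha` vectors of all decoding steps row by row gives the attention matrix $A$.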
Another representative paper
Effective Approaches to Attention-based Neural Machine Translation
This paper proposes and tests two simple and effective attention mechanisms: a global approach that attends to all source words, and a local one that attends to only a subset of source words.

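The paper notes several characteristics of NMT:

Traditional MT has to store large phrase tables and language models, whereas NMT has a small memory footprint.
NMT is a large neural network trained end to end, and it handles the translation of long sentences well.
NMT decoders are much simpler than traditional MT decoders.

The paper's two models: the global one resembles the model of the previous paper but is architecturally simpler; the local one can be viewed as a bridge between hard and soft attention models. It is computationally cheaper than the global model or soft attention and, unlike hard attention, local attention is differentiable almost everywhere.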
Architecture:

encoder: computes a representation ***s*** for each source sentence
decoder: generates one target word at a time:
$\log p(y|x) = \sum^m_{j=1} \log p(y_j|y_{<j},\boldsymbol{s})$
The decoder usually models this decomposition with an RNN architecture (approaches differ in which RNN architecture is used and in how ***s*** is computed).

Math:

Parameterize the probability of decoding each word $y_j$ as $p(y_j|y_{<j},\boldsymbol{s}) = \mathrm{softmax}(g(\boldsymbol{h}_j))$
with $g$ being the transformation function that outputs a vocabulary-sized vector. Here $h_j$ is the RNN hidden unit:
$h_j = f(h_{j-1}, s)$
$f$ computes the current hidden state given the previous hidden state $h_{j-1}$, and can be a vanilla (ordinary, conventional) RNN unit, a GRU, or an LSTM unit. In the earliest work, ***s*** is used only once, to initialize the decoder hidden state. Later, it implies a set of source hidden states which are consulted throughout the entire course of the translation process (my understanding: the ***s*** used at each step is different; it always contains information from all the source hidden states, but with a different emphasis each time). This is the attention mechanism.
This paper uses a stacking LSTM architecture. The training objective (loss function) is:
$J_t = \sum_{(x,y)\in \mathcal{D}}-\log p(y|x)$
where $\mathcal{D}$ is the parallel training corpus.
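
As a small worked example of the objective, the sketch below computes $-\log p(y|x)$ for one sentence pair from per-step softmax outputs and then sums over a toy corpus. The function names and the probability values are invented for illustration.

```python
import numpy as np

def sequence_nll(step_probs, target_ids):
    """-log p(y|x) = -sum_j log p(y_j | y_<j, s) for one sentence pair.

    step_probs: shape (m, V); row j is the decoder's softmax output at step j
    target_ids: length-m sequence of reference word indices
    """
    picked = step_probs[np.arange(len(target_ids)), target_ids]
    return -np.log(picked).sum()

def corpus_loss(pairs):
    """J_t: sum of per-sentence negative log-likelihoods over the corpus D."""
    return sum(sequence_nll(p, y) for p, y in pairs)

# Toy values: a 2-word target sentence over a 3-word vocabulary.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
targets = [0, 1]
loss = sequence_nll(probs, targets)           # -(log 0.7 + log 0.8)
total = corpus_loss([(probs, targets)] * 2)   # same pair counted twice
```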

Attention-based Models

The global and local models differ in how the context vector $c_t$ is derived. $h_t$ is the target hidden state at the top layer of a stacking LSTM.

Employ a simple concatenation layer to get the attentional vector $\tilde{h}_t$: $\tilde{h}_t = \tanh(W_c[c_t;h_t])$

Then a softmax layer produces the predictive distribution:
$p(y_t|y_{<t},x) = \text{softmax}(W_s\tilde{h}_t)$
Details:
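
A minimal NumPy sketch of these two steps, assuming the context vector $c_t$ has already been computed. The function name `attentional_output`, the sizes, and the random weights are placeholders chosen for the example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def attentional_output(c_t, h_t, W_c, W_s):
    """Return (h_tilde, p) for one decoding step.

    h_tilde = tanh(W_c [c_t; h_t]),  p = softmax(W_s h_tilde)
    """
    h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))
    p = softmax(W_s @ h_tilde)
    return h_tilde, p

# Hypothetical sizes: hidden size d, vocabulary size V.
rng = np.random.default_rng(1)
d, V = 4, 10
c_t = rng.normal(size=d)
h_t = rng.normal(size=d)
W_c = rng.normal(size=(d, 2 * d))
W_s = rng.normal(size=(V, d))

h_tilde, p = attentional_output(c_t, h_t, W_c, W_s)
next_word = int(np.argmax(p))   # greedy choice of the next target word index
```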
a. Global Attention
Global attention model
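This model works with a variable-length alignment vector $a_t$, which holds one attention weight per source position, so its size equals the number of time steps on the source side:
$a_t(s) = \text{align}(h_t,\bar{h}_s) =\frac{\exp(\text{score}(h_t,\bar{h}_s))}{\sum_{s'}\exp(\text{score}(h_t,\bar{h}_{s'}))}$
where score is a content-based function. The paper considers three alternatives:
dot: $\text{score}(h_t,\bar{h}_s) = h_t^\top \bar{h}_s$
general: $\text{score}(h_t,\bar{h}_s) = h_t^\top W_a \bar{h}_s$
concat: $\text{score}(h_t,\bar{h}_s) = v_a^\top \tanh(W_a[h_t;\bar{h}_s])$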

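This idea is very close to the previous paper, with several differences:

It uses the hidden states at the top LSTM layers in both the encoder and the decoder. The previous paper uses the concatenation of the forward and backward source hidden states in the bi-directional encoder, and target hidden states from a non-stacking unidirectional decoder.
The computation path is simpler: $h_t \to a_t\to c_t\to \tilde{h}_t$, then predict. In the previous paper, at any time $t$ the path starts from the previous hidden state, $h_{t-1}\to a_t\to c_t\to h_t$, and then goes through a deep-output layer and a maxout layer before making predictions.
The previous paper uses only one alignment function, the concat product; this paper tries other alternatives as well.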
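
The global attention path $h_t \to a_t \to c_t$ can be sketched as follows. The score functions are written as I understand them from the paper: dot is $h_t^\top \bar{h}_s$, general is $h_t^\top W_a \bar{h}_s$, and concat is $v_a^\top \tanh(W_a[h_t;\bar{h}_s])$. The function names, shapes, and random parameters are assumptions for illustration only.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def score(h_t, H_s, kind, W_a=None, v_a=None):
    """Content-based scores between h_t (d,) and every source state in H_s (S, d)."""
    if kind == "dot":
        return H_s @ h_t
    if kind == "general":                      # W_a has shape (d, d)
        return H_s @ (W_a.T @ h_t)
    if kind == "concat":                       # W_a has shape (k, 2d), v_a (k,)
        stacked = np.hstack([np.tile(h_t, (len(H_s), 1)), H_s])
        return np.tanh(stacked @ W_a.T) @ v_a
    raise ValueError(f"unknown score kind: {kind}")

def global_attention(h_t, H_s, kind, W_a=None, v_a=None):
    """Return (a_t, c_t): alignment over all source positions and context vector."""
    a_t = softmax(score(h_t, H_s, kind, W_a, v_a))
    c_t = a_t @ H_s                            # weighted average of source states
    return a_t, c_t

# Hypothetical sizes and random parameters.
rng = np.random.default_rng(2)
S, d, k = 7, 4, 5
H_s = rng.normal(size=(S, d))
h_t = rng.normal(size=d)
W_general = rng.normal(size=(d, d))
W_concat = rng.normal(size=(k, 2 * d))
v_a = rng.normal(size=k)

a_dot, c_dot = global_attention(h_t, H_s, "dot")
a_gen, c_gen = global_attention(h_t, H_s, "general", W_a=W_general)
a_cat, c_cat = global_attention(h_t, H_s, "concat", W_a=W_concat, v_a=v_a)
```

Because `a_t` has one entry per source position, its length changes with the source sentence, which is the "variable-length" property of the global model.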

b. Local Attention
Local Attention Model
Global attention attends to every word in the source sentence, which is expensive and impractical for translating long sentences.
This model is inspired by the tradeoff between soft and hard attentional models. (Soft is global attention; hard selects only one source word at a time, is non-differentiable, and requires more complicated techniques such as variance reduction or reinforcement learning to train.)
The model selectively focuses on a small window of context and is therefore differentiable.
Advantage: This approach has an advantage of avoiding the expensive computation incurred in the soft attention and at the same time, is easier to train than the hard attention approach.
Details: at time $t$ the model first generates an aligned position $p_t$ for the current target word. The context vector $c_t$ is then a weighted average over the source hidden states within the window $[p_t - D, p_t + D]$, where $D$ is chosen empirically. Unlike in the global model, $a_t$ now has fixed length, i.e. $a_t \in \mathbb{R}^{2D+1}$.
Two variants are considered:

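Monotonic alignment (local-m): simply set $p_t = t$, assuming that the source and target sequences are roughly monotonically aligned. The alignment vector $a_t$ is computed as in the global model.
Predictive alignment (local-p): the model predicts the aligned position with:
$p_t = S\cdot\text{sigmoid}(v_p^\top\tanh(W_p h_t))$
where $W_p$ and $v_p$ are model parameters learned to predict positions and $S$ is the source sentence length, so that $p_t\in[0,S]$. To favor alignment points near $p_t$, a Gaussian distribution centered at $p_t$ is applied:
$a_t(s) = \text{align}(h_t,\bar{h}_s)\exp(-\frac{(s-p_t)^2}{2\sigma^2})$, where align is the same function as in the global model and $\sigma = \frac{D}{2}$. Note that $p_t$ is a real number, while $s$ is an integer within the window centered at $p_t$.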
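
A rough sketch of local-p attention for one decoding step. Several choices here are my own assumptions rather than details from the paper: 0-based source positions, rounding $p_t$ to pick the window center, clipping the window at the sentence boundaries, and using the dot score inside align.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_p_attention(h_t, H_s, W_p, v_p, D):
    """Return (p_t, positions, a_t, c_t) for predictive local attention, D >= 1."""
    S = len(H_s)
    # p_t = S * sigmoid(v_p^T tanh(W_p h_t)), a real-valued position
    p_t = S * sigmoid(v_p @ np.tanh(W_p @ h_t))
    # Integer window [p_t - D, p_t + D], clipped to valid indices (assumption).
    center = int(round(float(p_t)))
    lo, hi = max(0, center - D), min(S - 1, center + D)
    positions = np.arange(lo, hi + 1)
    window = H_s[positions]
    # align(h_t, h_s) restricted to the window, with the dot score (assumption).
    align = softmax(window @ h_t)
    sigma = D / 2.0
    # The Gaussian term favors positions near p_t; a_t is not renormalized.
    a_t = align * np.exp(-((positions - p_t) ** 2) / (2.0 * sigma ** 2))
    c_t = a_t @ window
    return p_t, positions, a_t, c_t

# Hypothetical sizes and random parameters.
rng = np.random.default_rng(3)
S, d, k, D = 12, 4, 5, 2
H_s = rng.normal(size=(S, d))
h_t = rng.normal(size=d)
W_p = rng.normal(size=(k, d))
v_p = rng.normal(size=k)

p_t, positions, a_t, c_t = local_p_attention(h_t, H_s, W_p, v_p, D)
```

For local-m, the same code applies with `p_t` replaced by the current target step `t` and without the Gaussian term.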

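The global and local approaches make each attentional decision independently, without taking past alignment decisions into account, so they are suboptimal. The Input-Feeding Approach addresses this by feeding the attentional vector $\tilde{h}_t$ back in as input at the next time step.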