In 2014, Bahdanau et al. published the paper Neural Machine Translation by Jointly Learning to Align and Translate, which first proposed the attention mechanism and made RNN-based models hugely popular. Variants appeared later, such as Luong et al.'s 2015 paper Effective Approaches to Attention-based Neural Machine Translation.
Bahdanau attention is a Seq2Seq model that strengthens how the decoder extracts information from the encoder compared with the traditional Seq2Seq model. For the architecture of the traditional Seq2Seq model, see the linked article.
The Bahdanau Attention Mechanism - Machine Learning Mastery
link: https://machinelearningmastery.com/the-bahdanau-attention-mechanism/
Bahdanau uses a bidirectional RNN in the encoder, but that is not the key point. The key is that at every decoder time step, the entire encoded sequence is taken into account, and a context vector is computed from it.
Let $s_i$ denote the decoder's hidden state at time step $i$. It is computed as
$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$
As we can see, computing $s_i$ uses the previous hidden state $s_{i-1}$, the previous output $y_{i-1}$, and a context vector $c_i$ that is specific to time step $i$.
How is this $c_i$ computed? It is a linear combination of the encoded sequence $h_1, h_2, \dots, h_T$:
$$c_i = \sum_{j=1}^{T} \alpha_{ij} h_j, \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}$$
Here $e_{ij} = a(s_{i-1}, h_j)$ can be called the attention score, because it measures how well the input around position $j$ matches the output at position $i$.
Bahdanau models the function $a$ with a simple feed-forward neural network, namely
$$e_{ij} = v_a^{\top} \tanh(W_a s_{i-1} + U_a h_j)$$
Because it "adds" $s_{i-1}$ and $h_j$ together, this attention mechanism is also known as additive attention.
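As a concrete illustration, here is a minimal NumPy sketch of the additive score. The weight names `W_a`, `U_a`, `v_a` follow the paper's notation, but the dimensions and random initialization are purely illustrative:

```python
import numpy as np

def additive_score(s_prev, H, W_a, U_a, v_a):
    """e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), computed for all j at once."""
    # s_prev: (d_dec,)   previous decoder hidden state s_{i-1}
    # H:      (T, d_enc) encoded sequence h_1..h_T
    return np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a   # shape (T,)

# Illustrative shapes and random weights
rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att = 5, 8, 6, 7
H = rng.normal(size=(T, d_enc))                 # encoder outputs h_1..h_T
s_prev = rng.normal(size=d_dec)                 # s_{i-1}
W_a = rng.normal(size=(d_att, d_dec))
U_a = rng.normal(size=(d_att, d_enc))
v_a = rng.normal(size=d_att)

e = additive_score(s_prev, H, W_a, U_a, v_a)    # attention scores e_i1..e_iT
alpha = np.exp(e - e.max()) / np.exp(e - e.max()).sum()   # softmax over j
c_i = alpha @ H                                 # context vector c_i
```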
The decoder decides parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector. – Neural Machine Translation by Jointly Learning to Align and Translate
The passage above captures the essence of the attention mechanism: the decoder decides which parts of the encoded sequence matter and assigns them higher weights $\alpha_{ij}$; this frees the encoder from having to compress the entire input sequence into one fixed-length context vector.
One more detail worth noting: in the original paper the encoder is a bidirectional RNN, so at every position $j$ there are a forward hidden state $\overrightarrow{h}_j$ and a backward hidden state $\overleftarrow{h}_j$. Concatenating the two gives the encoded sequence: $h_j = \big[\overrightarrow{h}_j^{\top};\ \overleftarrow{h}_j^{\top}\big]^{\top}$
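A tiny NumPy sketch of this concatenation, with purely illustrative shapes:

```python
import numpy as np

T, d = 5, 4
h_fwd = np.random.randn(T, d)   # forward RNN states (read left to right)
h_bwd = np.random.randn(T, d)   # backward RNN states (read right to left)

# h_j = [h_fwd_j ; h_bwd_j]  ->  encoded sequence of shape (T, 2d)
H = np.concatenate([h_fwd, h_bwd], axis=-1)
```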
Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated. The model then predicts a target word based on the context vectors associated with these source positions and all the previous generated target words.
The attention computation flow
To summarize (a code sketch of the full flow follows this list):
1. The encoder computes the encoded sequence $h_1, h_2, \dots, h_T$.
2. A feed-forward neural network models the function $a$ and computes the attention scores $e_{ij} = a(s_{i-1}, h_j)$.
3. A softmax over the $j$ dimension of $e_{ij}$ gives the weights $\alpha_{ij} = \dfrac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}$.
4. For decoder position $i$, the context vector $c_i$ is computed as a linear combination of the encoded sequence $h_1, h_2, \dots, h_T$ with coefficients $\alpha_{ij}$: $c_i = \sum_{j=1}^{T} \alpha_{ij} h_j$.
5. From $s_{i-1}$, $y_{i-1}$, and $c_i$, compute $s_i$ and then $y_i$; the computation details are shown in the figure below.
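Putting the five steps above together, here is a minimal NumPy sketch of one decoder step. It is only an illustration under assumed toy shapes: the tanh recurrence in step 5 stands in for the GRU cell used in the paper, and all weight names (`W_a`, `U_a`, `v_a`, `W_s`, `U_s`, `C_s`) are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step(s_prev, y_prev, H, W_a, U_a, v_a, W_s, U_s, C_s):
    """One Bahdanau-style decoder step: scores -> weights -> context -> new state."""
    # Step 2: additive attention scores e_ij = a(s_{i-1}, h_j)
    e = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a          # shape (T,)
    # Step 3: attention weights alpha_ij = softmax over j
    alpha = softmax(e)
    # Step 4: context vector c_i = sum_j alpha_ij * h_j
    c = alpha @ H                                           # shape (d_enc,)
    # Step 5: s_i = f(s_{i-1}, y_{i-1}, c_i); a plain tanh cell stands in for the GRU
    s = np.tanh(s_prev @ W_s.T + y_prev @ U_s.T + c @ C_s.T)
    return s, c, alpha

# Toy usage with illustrative dimensions
rng = np.random.default_rng(0)
T, d_enc, d_dec, d_att, d_emb = 5, 8, 6, 7, 4
H = rng.normal(size=(T, d_enc))            # step 1: encoded sequence h_1..h_T
s_prev = rng.normal(size=d_dec)            # s_{i-1}
y_prev = rng.normal(size=d_emb)            # embedding of the previous output y_{i-1}
W_a, U_a, v_a = rng.normal(size=(d_att, d_dec)), rng.normal(size=(d_att, d_enc)), rng.normal(size=d_att)
W_s, U_s, C_s = rng.normal(size=(d_dec, d_dec)), rng.normal(size=(d_dec, d_emb)), rng.normal(size=(d_dec, d_enc))
s_i, c_i, alpha_i = decoder_step(s_prev, y_prev, H, W_a, U_a, v_a, W_s, U_s, C_s)
```

In a full model this step runs once per output position, feeding `s_i` and the embedding of the predicted word back in as `s_prev` and `y_prev`.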
The Luong Attention Mechanism - Machine Learning Mastery
link: https://machinelearningmastery.com/the-luong-attention-mechanism/
In this work, we design, with simplicity and effectiveness in mind, two novel types of attention-based models: a global approach which always attends to all source words and a local one that only looks at a subset of source words at a time. – Effective Approaches to Attention-based Neural Machine Translation
In his paper, Luong proposed two attention mechanisms: a global one and a local one. The former attends to the entire encoded sequence and makes some modifications to the classic attention mechanism; the latter attends to only part of the encoded sequence when computing the attention scores $e_{ij}$.
Global attention model
Procedure: at each decoding step $t$, the global model first computes the decoder hidden state $h_t$, scores it against every encoder hidden state $\bar{h}_s$, normalizes the scores with a softmax to get the alignment weights $a_t$, forms the context vector $c_t$ as the weighted sum of the encoder states, and finally combines them as $\tilde{h}_t = \tanh(W_c[c_t; h_t])$ to predict the output word.
Differences from Bahdanau's approach:
The Bahdanau architecture (left) vs. the Luong architecture (right)
In my view, Luong's main idea is to use the decoder's hidden state at the current time step to compute the attention scores for the current step; everything else is in service of this idea. By contrast, Bahdanau uses the decoder's hidden state from the previous time step to compute the current step's attention scores.
In addition, for computing the encoder and decoder hidden states, Luong uses a deep (stacked) LSTM and takes the output of the last (topmost) layer as the hidden state, whereas Bahdanau uses bidirectional RNN units.
Finally, Luong broadens how the attention score can be computed: besides additive attention there is also multiplicative attention, which also shows up later in self-attention.
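As a sketch of the multiplicative idea: the "general" score from Luong's paper, $\text{score}(h_t, \bar{h}_s) = h_t^{\top} W_a \bar{h}_s$, can be computed for all source states at once. The shapes, the random weights, and the final $\tilde{h}_t$ projection below are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def general_score(h_t, H_enc, W_a):
    """Multiplicative ('general') score: score(h_t, h_s) = h_t^T W_a h_s, for every source state."""
    # h_t:   (d,)   current decoder hidden state (current, not previous)
    # H_enc: (T, d) encoder hidden states
    return H_enc @ (W_a @ h_t)                  # shape (T,)

def dot_score(h_t, H_enc):
    """Dot score: score(h_t, h_s) = h_t^T h_s."""
    return H_enc @ h_t

# Illustrative usage
rng = np.random.default_rng(1)
T, d = 6, 8
H_enc = rng.normal(size=(T, d))
h_t = rng.normal(size=d)
W_a = rng.normal(size=(d, d))
W_c = rng.normal(size=(d, 2 * d))

a_t = softmax(general_score(h_t, H_enc))        # alignment weights over source positions
c_t = a_t @ H_enc                               # context vector
h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))   # attentional state h~_t = tanh(W_c [c_t; h_t])
```

The dot variant is the same computation with $W_a$ fixed to the identity, which is essentially the (unscaled) dot-product form that self-attention later builds on.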
Local attention model
The global attention has a drawback that it has to attend to all words on the source side for each target word, which is expensive and can potentially render it impractical to translate longer sequences, e.g., paragraphs or documents. To address this deficiency, we propose a local attentional mechanism that chooses to focus only on a small subset of the source positions per target word. – Effective Approaches to Attention-based Neural Machine Translation
The idea is this: for a given decoder position $i$, computing its attention score against every position of the encoded sequence is expensive. Can we instead compute scores only against the positions inside a window of the encoded sequence?
Of course we can. The question is: how do we choose that window over the encoded sequence?
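The paper gives two answers: local-m simply centers the window at the current target position (monotonic alignment), while local-p predicts the center $p_t = S \cdot \sigma\big(v_p^{\top} \tanh(W_p h_t)\big)$ and reweights the alignment with a Gaussian centered at $p_t$ (with standard deviation $D/2$ for a window half-width $D$). Below is a minimal NumPy sketch of the local-p variant; the half-width `D`, the weight shapes, and the choice of the "general" score inside the window are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def local_p_attention(h_t, H_enc, W_p, v_p, W_a, D=2):
    """Luong local-p attention: predict a window center, attend only inside the window."""
    S = H_enc.shape[0]
    # Predicted center: p_t = S * sigmoid(v_p^T tanh(W_p h_t))
    p_t = S / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))
    lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)
    window = H_enc[lo:hi]                        # at most 2D + 1 source states
    scores = window @ (W_a @ h_t)                # 'general' score, but only inside the window
    a = softmax(scores)
    # Favor positions near p_t with a Gaussian (sigma = D / 2, as in the paper)
    positions = np.arange(lo, hi)
    a = a * np.exp(-((positions - p_t) ** 2) / (2 * (D / 2) ** 2))
    c_t = a @ window                             # context vector built from the window only
    return c_t, a, p_t

# Illustrative usage
rng = np.random.default_rng(2)
S, d, d_p = 12, 8, 6
H_enc = rng.normal(size=(S, d))
h_t = rng.normal(size=d)
c_t, a, p_t = local_p_attention(h_t, H_enc, rng.normal(size=(d_p, d)), rng.normal(size=d_p), rng.normal(size=(d, d)))
```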
Closing thoughts
The original Bahdanau attention model gave us a first look at the basic architecture of the attention mechanism; Luong's improvements deepened that understanding and introduced its variants, which will be useful when we want to improve attention mechanisms ourselves.
References
https://zhuanlan.zhihu.com/p/577664617