7_Attention (Attention Mechanism)

少云清

Category: NLP. Tags: deep learning, machine learning, pytorch

Article link: https://blog.csdn.net/brawly/article/details/122710717

1. Seq2Seq Model

Shortcoming: The final state is incapable of remembering a long sequence.

Attention tremendously improves the Seq2Seq model.
With attention, the Seq2Seq model does not forget the source input.
With attention, the decoder knows where to focus. (Every time the decoder updates its state, it looks at all encoder states again, so nothing is forgotten; attention also tells the decoder which encoder state to focus on.)
Downside: much more computation.
https://distill.pub/2016/augmented-rnns/

After the encoder finishes, the decoder and attention start working together.

Method 1 (used in the original paper):

Method 2 (more popular; the same as in Transformer):
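
The formulas for the two methods are not reproduced in the text above, so the following is only a sketch under stated assumptions. It assumes Method 1 is the additive score from the original attention paper, score_i = v^T tanh(W [h_i; s]), and Method 2 is the projected dot product used in Transformer, k_i = W_K h_i, q = W_Q s, score_i = k_i^T q. In both cases the scores are passed through a softmax to obtain the weights α_1, ..., α_m. The names `additive_scores`, `dot_product_scores`, `W`, `v`, `W_K`, `W_Q` and all toy values are hypothetical.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, x) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_scores(H, s, W, v):
    """Method 1 (assumed): score_i = v^T tanh(W [h_i; s])."""
    # h + s concatenates the encoder state h_i with the decoder state s.
    return [dot(v, [math.tanh(z) for z in matvec(W, h + s)]) for h in H]

def dot_product_scores(H, s, W_K, W_Q):
    """Method 2 (assumed): k_i = W_K h_i, q = W_Q s, score_i = k_i^T q."""
    q = matvec(W_Q, s)
    return [dot(matvec(W_K, h), q) for h in H]

# Toy inputs: m = 3 encoder states of dimension 2, one decoder state s.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s = [0.5, -0.5]
W = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]  # maps [h_i; s] (dim 4) to dim 2
v = [1.0, -1.0]
W_K = [[1.0, 0.0], [0.0, 1.0]]  # identity, to keep the example easy to follow
W_Q = [[1.0, 0.0], [0.0, 1.0]]

alpha_1 = softmax(additive_scores(H, s, W, v))         # m weights, Method 1
alpha_2 = softmax(dot_product_scores(H, s, W_K, W_Q))  # m weights, Method 2
```

Either way, one decoder state yields exactly m weights that sum to 1; only the scoring function differs. In a trained model the matrices and `v` would be learned parameters rather than fixed values.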

Question: How many weights α_i have been computed?

To compute one vector C_i, we compute m weights: α₁, α₂, α₃, ···, α_m.
The decoder has t states, so there are mt weights in total. (If the decoder runs for t steps, mt weights are computed, so the time complexity is O(mt).)
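
The count above can be checked with a small sketch. It assumes a plain dot product as the score function (a simplification) and uses a fixed list of decoder states, whereas a real decoder would produce its states one by one using the context vectors. The function name `attention_step` and the toy values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_step(H, s):
    """For one decoder state s, return the m weights and the context vector."""
    # Assumed score: plain dot product between each encoder state and s.
    scores = [sum(h_k * s_k for h_k, s_k in zip(h, s)) for h in H]
    alpha = softmax(scores)
    dim = len(H[0])
    # Context vector: weighted average of the encoder states.
    context = [sum(a * h[k] for a, h in zip(alpha, H)) for k in range(dim)]
    return alpha, context

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]    # m = 4 encoder states
decoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # t = 3 decoder states

results = [attention_step(H, s) for s in decoder_states]
num_weights = sum(len(alpha) for alpha, _ in results)   # equals m * t
```

With m = 4 and t = 3, `num_weights` is 12: each of the t decoder states needs its own set of m weights.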

Standard Seq2Seq model: the decoder looks at only its current state. (It produces the next state from the current one, so the new state may have forgotten part of the encoder's input.)
Attention: the decoder additionally looks at all the states of the encoder. (Before producing the next state, the decoder goes over every encoder state, so it has the complete information from the encoder and does not forget.)
Attention: the decoder knows where to focus. (Besides solving the forgetting problem, attention tells the decoder which encoder state to focus on.)
Downside: higher time complexity.
- m: source sequence length
- t: target sequence length
- Standard Seq2Seq: O(m + t) time complexity (the encoder reads the input sequence once; after that, the decoder generates the output sequence step by step without looking at the encoder input again)
- Seq2Seq + attention: O(mt) time complexity (each time the decoder updates its state it looks at all m encoder states, which costs m per step; the decoder has t states, so the total is mt)
Attention improves accuracy at the cost of more computation.
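
To make the gap concrete, here is a toy cost model, not a benchmark. It assumes one unit of work per RNN step and one unit per attention weight; the function name `unit_cost` and the lengths chosen are hypothetical.

```python
def unit_cost(m, t, use_attention):
    """Toy cost model: one unit per RNN step, one unit per attention weight."""
    cost = m + t            # encoder reads m tokens, decoder emits t tokens
    if use_attention:
        cost += m * t       # each of the t decoder steps scores all m encoder states
    return cost

m, t = 50, 40
standard = unit_cost(m, t, use_attention=False)        # 50 + 40 = 90
with_attention = unit_cost(m, t, use_attention=True)   # 90 + 50 * 40 = 2090
```

Under this model the m * t term dominates as the sequences grow, which is why the attention variant is described as O(mt).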