Attention model

DecafTea

Category: NLP

Article link: https://blog.csdn.net/DecafTea/article/details/111249350

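1. Encoder: Bi-RNN

[Figure: Bi-RNN encoder]
x(1:T): the input sequence, of length T
a(1:T): hidden state / activation value / history vector; it stores information from the current and earlier time steps
a(0), a(end): the a at the start / end token
a'(1:T): the concatenation of the forward and backward history vectors, -->a(t) and <--a(t)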
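
The concatenation step can be sketched in plain Python. This is only an illustration under stated assumptions: the vanilla tanh RNN cell, the helper names (`rnn_step`, `run_rnn`, `bi_rnn_encode`, `init_params`) and the tiny random parameters are hypothetical, and a real encoder would more likely use GRU or LSTM cells.

```python
import math
import random

def rnn_step(W, U, b, x_t, a_prev):
    """One vanilla RNN step (assumed cell): a_t = tanh(W x_t + U a_prev + b)."""
    return [
        math.tanh(
            sum(w * x for w, x in zip(W[i], x_t))
            + sum(u * a for u, a in zip(U[i], a_prev))
            + b[i]
        )
        for i in range(len(b))
    ]

def run_rnn(params, xs):
    """Run the RNN over xs and return the hidden state at every time step."""
    W, U, b = params
    a = [0.0] * len(b)  # a(0): initial state
    states = []
    for x_t in xs:
        a = rnn_step(W, U, b, x_t, a)
        states.append(a)
    return states

def bi_rnn_encode(fwd_params, bwd_params, xs):
    """Return a'(1:T): forward and backward states concatenated per time step."""
    fwd = run_rnn(fwd_params, xs)              # -->a(1:T)
    bwd = run_rnn(bwd_params, xs[::-1])[::-1]  # <--a(1:T), re-aligned to t
    return [f + g for f, g in zip(fwd, bwd)]   # list concatenation

def init_params(rng, n_in, n_hidden):
    """Hypothetical random initialisation, for illustration only."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    U = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_hidden)]
    b = [0.0] * n_hidden
    return W, U, b

rng = random.Random(0)
T, n_in, n_hidden = 4, 3, 2
xs = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(T)]
fwd_params = init_params(rng, n_in, n_hidden)
bwd_params = init_params(rng, n_in, n_hidden)
a_prime = bi_rnn_encode(fwd_params, bwd_params, xs)
```

Each a'(t) has twice the hidden size, because it holds the forward state (which has seen x(1:t)) followed by the backward state (which has seen x(t:T)).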

2. Decoder:

[Figure: decoder]
s: the decoder hidden state, written as s to distinguish it from the encoder's a
y^(1:T): output sequence

The input marked "?" is critical: it determines which parts of the Bi-RNN below the output y(t) can pay attention to. For each y(t), we have:
[Figure]
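
A common way to write this input, assuming the standard attention formulation (the symbol c(t) and the name "context vector" are assumptions here, not taken from the text above), is as a weighted sum of the encoder states:

```latex
c(t) = \sum_{t'=1}^{T} \alpha(t, t')\, a'(t'),
\qquad \alpha(t, t') \ge 0,
\qquad \sum_{t'=1}^{T} \alpha(t, t') = 1
```

Under this reading, a large alpha(t, t') means that y(t) draws mostly on the encoder state at position t'.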

**Key question:** how do we obtain, for each y(t), the attention weight alpha(t, t'), i.e. the amount of attention that y(t) pays to each a'(t')?

What does the attention weight alpha(t, t') of each y(t) depend on?
It depends on the decoder's hidden state from the previous step, s(t-1), and on the encoder hidden state at the position being scored, a'(t').

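Hence, we train a neural network that takes s(t-1) and a'(t') as inputs and outputs a score e(t, t'). The attention weights are obtained by applying softmax to these scores over t', which gives a vector whose length is the sequence length T: one weight for each word in the sequence 1:T. The weights sum to 1, which is why we use softmax. Different papers propose different designs for this network, so that attention is concentrated better and the output improves.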
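
The scoring and normalisation steps can be sketched as follows. This is a minimal illustration, not the design of any particular paper: the one-hidden-layer scorer, the helper names (`score`, `attend`), the sizes and the random parameters are all assumptions. The final weighted sum of encoder states (often called a context vector) is the standard way the weights are used and is likewise an assumption here.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax: positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def score(params, s_prev, a_t):
    """Assumed scorer: e(t, t') = v . tanh(W [s(t-1); a'(t')] + b)."""
    W, b, v = params
    z = s_prev + a_t  # concatenate the two input vectors
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, z)) + b_i)
        for row, b_i in zip(W, b)
    ]
    return sum(v_i * h for v_i, h in zip(v, hidden))

def attend(params, s_prev, a_prime):
    """Return alpha(t, 1:T) and the weighted sum of the encoder states."""
    e = [score(params, s_prev, a_t) for a_t in a_prime]  # one score per t'
    alpha = softmax(e)                                   # normalise over t'
    dim = len(a_prime[0])
    context = [
        sum(alpha[t] * a_prime[t][k] for t in range(len(a_prime)))
        for k in range(dim)
    ]
    return alpha, context

# Hypothetical sizes and random values, for illustration only.
rng = random.Random(0)
T, n_s, n_a, n_h = 6, 3, 4, 5
W = [[rng.uniform(-0.5, 0.5) for _ in range(n_s + n_a)] for _ in range(n_h)]
b = [0.0] * n_h
v = [rng.uniform(-0.5, 0.5) for _ in range(n_h)]
s_prev = [rng.uniform(-1, 1) for _ in range(n_s)]                       # s(t-1)
a_prime = [[rng.uniform(-1, 1) for _ in range(n_a)] for _ in range(T)]  # a'(1:T)

alpha, context = attend((W, b, v), s_prev, a_prime)
```

The same scorer is applied to every encoder position t' with the same s(t-1), so a single decoder step produces T scores, and the softmax turns them into T weights.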

reference:

[1] https://www.bilibili.com/video/BV1cb411W7w9
[2] 4F10 notes: Deep Learning for Sequence Data