Attention Mechanism Model
Model: consists of an Encoder layer, an Attention layer, and a Decoder layer.
At each output time step, copy the previous decoder state s<t-1> Tx times and concatenate it with all of the Encoder layer's activations a (Tx time steps) as the Attention layer's input. The Attention layer computes Tx weights α; each α is the attention weight on one input word, and is multiplied with the corresponding activation a, i.e. α⋅a. The sum of these Tx products is the Attention layer's output, which becomes the input to the Decoder layer at one time step.
The main idea is that, for each output time step, every input time step's activation is multiplied by its own attention weight, so different words receive different amounts of attention.
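The weighted sum described above can be sketched in plain NumPy; the sizes and the energy values here are illustrative assumptions, not the course code:

```python
import numpy as np

def softmax(e):
    """Softmax over the time axis so the attention weights sum to 1."""
    exp = np.exp(e - e.max())
    return exp / exp.sum()

Tx, n_a = 4, 3                      # 4 encoder time steps, 3 hidden units (assumed sizes)
a = np.arange(Tx * n_a, dtype=float).reshape(Tx, n_a)  # encoder activations a<1..Tx>
e = np.array([0.1, 2.0, 0.3, 0.5])  # unnormalized attention energies (made-up values)

alpha = softmax(e)                  # attention weights alpha<t,t'>, one per encoder step
context = (alpha[:, None] * a).sum(axis=0)  # weighted sum over t': alpha * a<t'>

print(context.shape)                # one context vector, shape (3,), for one decoder step
```

The softmax guarantees the Tx weights form a probability distribution, so the context vector is a convex combination of the encoder activations.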
The figure below shows the Attention-to-context portion of the figure above, i.e. the implementation of the attention mechanism:
Process:
There are two separate LSTMs in this model (see diagram on the left). Because the one at the bottom of the picture is a Bi-directional LSTM and comes before the attention mechanism, we will call it pre-attention Bi-LSTM. The LSTM at the top of the diagram comes after the attention mechanism, so we will call it the post-attention LSTM. The pre-attention Bi-LSTM goes through Tx time steps; the post-attention LSTM goes through Ty time steps.
Translation: The model has two LSTM layers; they differ in whether they come before or after the attention mechanism and in which time steps they connect to. The lower one is called the pre-attention Bi-LSTM and the upper one the post-attention LSTM; they correspond to the Encoder and Decoder parts of a Seq2Seq model, respectively. Bi-LSTM means bidirectional LSTM. The pre-attention Bi-LSTM connects to the Tx input time steps, and the post-attention LSTM to the Ty output time steps.
The post-attention LSTM passes s⟨t⟩, c⟨t⟩ from one time step to the next. In the lecture videos, we were using only a basic RNN for the post-activation sequence model, so the state captured by the RNN was the output activation s⟨t⟩. But since we are using an LSTM here, the LSTM has both the output activation s⟨t⟩ and the hidden cell state c⟨t⟩. However, unlike previous text generation examples (such as Dinosaurus in week 1), in this model the post-activation LSTM at time t will not take the previously generated prediction as input; it only takes s⟨t⟩ and c⟨t⟩ as input. We have designed the model this way, because (unlike language generation where adjacent characters are highly correlated) there isn't as strong a dependency between the previous character and the next character in a YYYY-MM-DD date.
Translation: Here s refers to the hidden-state activation of the post-attention LSTM. The post-attention LSTM is not bidirectional because, in this example (date translation), adjacent output characters are not strongly dependent on each other. The initial values of the activation s and the memory cell c are the same as in an ordinary LSTM (usually zeros), while the per-step input comes from the Attention layer's computation.
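The zero-initialized states and the step-to-step state passing can be sketched in Keras as follows; the batch size, state size, and context size here are assumptions:

```python
import tensorflow as tf
from tensorflow.keras.layers import LSTM

m, n_s = 2, 8                          # batch size and decoder state size (assumed)
post_lstm = LSTM(n_s, return_state=True)

s = tf.zeros((m, n_s))                 # initial activation s<0> (zeros, as usual)
c = tf.zeros((m, n_s))                 # initial memory cell c<0> (zeros, as usual)
context = tf.random.normal((m, 1, 6))  # one attention context per decoder step (assumed size)

# One decoder step: the LSTM consumes the context plus the previous (s, c),
# and returns the new (s, c) to feed into the next step.
s, _, c = post_lstm(context, initial_state=[s, c])
print(s.shape, c.shape)
```

With `return_state=True`, the layer returns the output, the hidden state, and the cell state; for a single step the output equals the hidden state, hence the `s, _, c` unpacking.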
We use a⟨t⟩ = [a→⟨t⟩; a←⟨t⟩] to represent the concatenation of the activations of both the forward-direction and backward-direction of the pre-attention Bi-LSTM.
Translation: notation for the bidirectional RNN's concatenated activations.
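The concatenation can be sketched in NumPy; the sizes here are assumptions:

```python
import numpy as np

Tx, n_a = 4, 3                       # encoder steps and per-direction units (assumed)
a_fwd = np.random.rand(Tx, n_a)      # forward-direction activations a→<t>
a_bwd = np.random.rand(Tx, n_a)      # backward-direction activations a←<t>

# a<t> = [a→<t>; a←<t>]: join along the feature axis, per time step
a = np.concatenate([a_fwd, a_bwd], axis=-1)
print(a.shape)                       # each time step now carries 2*n_a features
```

This is why the Bi-LSTM's output below has feature size 2·n_a rather than n_a.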
The diagram on the right uses a RepeatVector node to copy s⟨t−1⟩'s value Tx times, and then Concatenation to concatenate s⟨t−1⟩ and a⟨t⟩ to compute e⟨t,t′⟩, which is then passed through a softmax to compute α⟨t,t′⟩. We'll explain how to use RepeatVector and Concatenation in Keras below.
Translation: The s<t-1> entering the Attention layer is the post-attention (decoder) LSTM's hidden state from the previous time step. It is copied Tx times so that it can be concatenated with the Encoder layer's Tx activations a (similar to forming an augmented matrix), and the concatenated result is the Attention layer's input.
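Keras's RepeatVector and Concatenate layers perform exactly this copy-and-join step; a minimal sketch with assumed sizes:

```python
import tensorflow as tf
from tensorflow.keras.layers import RepeatVector, Concatenate

Tx, n_a, n_s, m = 4, 3, 8, 2                 # assumed sizes

a = tf.random.normal((m, Tx, 2 * n_a))       # Bi-LSTM activations, all Tx steps
s_prev = tf.random.normal((m, n_s))          # decoder state s<t-1>, a single vector

s_rep = RepeatVector(Tx)(s_prev)             # (m, Tx, n_s): s<t-1> copied Tx times
concat = Concatenate(axis=-1)([a, s_rep])    # (m, Tx, 2*n_a + n_s)
print(concat.shape)
```

After this, each of the Tx rows holds one encoder activation side by side with the same decoder state, ready for the small network that computes the energies e<t,t′>.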
Model implementation: implement two functions, one_step_attention() and model().
one_step_attention(): At step t, given all the hidden states of the Bi-LSTM ([a<1>, a<2>, ..., a<Tx>]) and the previous hidden state of the second LSTM (s<t-1>), one_step_attention() will compute the attention weights ([α<t,1>, α<t,2>, ..., α<t,Tx>]) and output the context vector (see Figure 1 (right) for details):
context<t> = ∑_{t′=0}^{Tx} α<t,t′> ⋅ a<t′>   (1)
Note that we are denoting the attention in this notebook context⟨t⟩. In the lecture videos, the context was denoted c⟨t⟩, but here we are calling it context⟨t⟩ to avoid confusion with the (post-attention) LSTM's internal memory cell variable, which is sometimes also denoted c⟨t⟩.
Translation: At each time step t, copy the post-attention LSTM's previous hidden state s<t-1> Tx times (t-1 rather than t, because at step t only the previous state is available yet) and concatenate it with all of the Encoder layer's outputs as the Attention layer's input. The computation yields Tx weights α; multiply each α (as an attention weight) with the corresponding time step's encoder output and sum the products to obtain context, which is the Decoder layer's input.
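A sketch of one_step_attention() along the lines of the step above. The Dense sizes (10 and 1) and the other sizes are assumptions for illustration; treat this as a reimplementation sketch, not the assignment's exact code:

```python
import tensorflow as tf
from tensorflow.keras.layers import (RepeatVector, Concatenate, Dense,
                                     Softmax, Dot)

Tx, n_a, n_s = 4, 3, 8                       # assumed sizes

# Shared layers: defined once so their weights are reused across all Ty decoder steps
repeator = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor1 = Dense(10, activation="tanh")       # small network computing energies e<t,t'>
densor2 = Dense(1, activation="relu")
activator = Softmax(axis=1)                  # softmax over the Tx time axis
dotor = Dot(axes=1)                          # weighted sum over time

def one_step_attention(a, s_prev):
    """Compute context<t> from all encoder activations a and decoder state s<t-1>."""
    s_prev = repeator(s_prev)                # (m, Tx, n_s)
    concat = concatenator([a, s_prev])       # (m, Tx, 2*n_a + n_s)
    e = densor1(concat)                      # (m, Tx, 10)
    energies = densor2(e)                    # (m, Tx, 1)
    alphas = activator(energies)             # (m, Tx, 1): attention weights, sum to 1
    context = dotor([alphas, a])             # (m, 1, 2*n_a): sum of alpha * a<t'>
    return context

m = 2
context = one_step_attention(tf.random.normal((m, Tx, 2 * n_a)),
                             tf.random.normal((m, n_s)))
print(context.shape)
```

The `Dot(axes=1)` call contracts the time axis of the weights against the time axis of the activations, which is exactly the sum in equation (1).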
model(): Implements the entire model. It first runs the input through a Bi-LSTM to get back [a<1>, a<2>, ..., a<Tx>]. Then, it calls one_step_attention() Ty times (for loop). At each iteration of this loop, it gives the computed context vector c<t> to the second LSTM, and runs the output of the LSTM through a dense layer with softmax activation to generate a prediction ŷ<t>.
Translation: The model function first computes the Encoder layer, then uses its cached outputs for the Decoder layer's computation. The final output is obtained after Ty time steps; at each step the Attention layer is called once (and each call involves the Encoder layer's cached outputs from all Tx time steps).
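Putting the pieces together, here is a self-contained sketch of model(). All sizes and layer widths are assumptions, and the attention sub-network is the same illustrative sketch as above; the structure (one Bi-LSTM pass, then Ty attention-plus-LSTM steps) follows the description:

```python
import tensorflow as tf
from tensorflow.keras.layers import (Input, Bidirectional, LSTM, Dense,
                                     RepeatVector, Concatenate, Softmax, Dot)
from tensorflow.keras.models import Model

Tx, Ty, n_a, n_s, in_vocab, out_vocab = 5, 3, 4, 8, 10, 12  # assumed sizes

# Shared attention layers (created once, reused at every decoder step)
repeator = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor1 = Dense(10, activation="tanh")
densor2 = Dense(1, activation="relu")
activator = Softmax(axis=1)
dotor = Dot(axes=1)

def one_step_attention(a, s_prev):
    s_prev = repeator(s_prev)
    concat = concatenator([a, s_prev])
    alphas = activator(densor2(densor1(concat)))
    return dotor([alphas, a])                # context<t>

post_lstm = LSTM(n_s, return_state=True)     # post-attention (decoder) LSTM
output_layer = Dense(out_vocab, activation="softmax")

def build_model():
    X = Input(shape=(Tx, in_vocab))
    s0 = Input(shape=(n_s,))
    c0 = Input(shape=(n_s,))
    s, c = s0, c0
    # Encoder: pre-attention Bi-LSTM over all Tx input steps
    a = Bidirectional(LSTM(n_a, return_sequences=True))(X)
    outputs = []
    for _ in range(Ty):                      # one attention call per output step
        context = one_step_attention(a, s)
        s, _, c = post_lstm(context, initial_state=[s, c])
        outputs.append(output_layer(s))      # prediction y^<t>
    return Model(inputs=[X, s0, c0], outputs=outputs)

model = build_model()
print(len(model.outputs))                    # Ty predictions
```

Note that the encoder runs once while the loop reuses its cached activations a at every step, which matches the description above.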