Seq2Seq + Attention: Model Principles, Training, and the Decoding Process

Chen_Meng_

Categories: deep learning, NLP. Tags: deep learning, seq2seq, LSTM

Article link: https://blog.csdn.net/Chen_Meng_/article/details/103786231

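1. Introduction

The basic structure of Seq2Seq is an encoder-decoder, and the goal of the model is to generate a complete sentence. This model once brought a substantial improvement to Google Translate. The rest of this post uses machine translation as the running example to describe the model in detail.

Note: this post assumes basic familiarity with LSTMs.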

2. Seq2Seq

The Seq2Seq framework relies on an encoder-decoder pair: the encoder encodes the input sequence, and the decoder generates the target sequence.

2.1 Encoder

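Suppose the encoder input is "how are you". Each word is mapped to a $d$-dimensional word vector $w\in \mathbb{R}^{d}$, so in this example the input becomes $[w_{0},w_{1},w_{2}]\in \mathbb{R}^{d\times 3}$. Running it through an LSTM gives one hidden state per word, $[e_{0},e_{1},e_{2}]$, together with a vector $e$ that represents the whole sentence. Here $e = e_{2}$, the last hidden state.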
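To make the encoder step concrete, here is a minimal NumPy sketch of an LSTM run over three word vectors. It is only an illustration: the gate layout inside `W`, the random weights, and the dimensions `d = 4` and `h_dim = 5` are assumptions chosen for the example, not values from this post.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*h, d+h) and b has shape (4*h,)."""
    h_dim = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:h_dim])                 # input gate
    f = sigmoid(z[h_dim:2 * h_dim])         # forget gate
    o = sigmoid(z[2 * h_dim:3 * h_dim])     # output gate
    g = np.tanh(z[3 * h_dim:])              # candidate cell state
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

def encode(word_vectors, W, b, h_dim):
    """Return all hidden states [e_0, ..., e_n] and the sentence vector e."""
    h = np.zeros(h_dim)
    c = np.zeros(h_dim)
    states = []
    for w in word_vectors:
        h, c = lstm_step(w, h, c, W, b)
        states.append(h)
    return np.stack(states), h

rng = np.random.default_rng(0)
d, h_dim = 4, 5                                  # assumed toy dimensions
words = rng.normal(size=(3, d))                  # stand-ins for w_0, w_1, w_2
W = rng.normal(scale=0.1, size=(4 * h_dim, d + h_dim))
b = np.zeros(4 * h_dim)
E, e = encode(words, W, b, h_dim)                # E holds e_0, e_1, e_2
```

The last row of `E` is the same vector as `e`, which mirrors the statement $e = e_{2}$.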

2.2 Decoder

We now have the vector $e$ that represents the sentence. We feed it into a second LSTM as the initial hidden state and use a special token $w_{sos}$ as the start symbol to generate the target sequence.

At time step 0:

$h_{0}=\mathrm{LSTM}(e,w_{sos}) \qquad (1)$

$s_{0} = g(h_{0}) \qquad (2)$

$p_{0} = \mathrm{softmax}(s_{0}) \qquad (3)$

$i_{0} = \mathrm{argmax}(p_{0}) \qquad (4)$

$e$: the sentence vector output by the encoder.

$w_{sos}$: a special token that marks the start position; it is the input at the current time step.

$h_{0}$: the hidden state at the current time step, $h_{0}\in \mathbb{R}^{h}$, where $h$ is the hidden dimension.

$s_{0}$: a score for every word in the vocabulary, $s_{0}\in \mathbb{R}^{v}$, where $v$ is the vocabulary size.

$g$: a function (in practice an affine map with a weight matrix $W$ and a bias $b$), $\mathbb{R}^{h} \mapsto \mathbb{R}^{v}$.

$p_{0}$: the probability distribution over the vocabulary obtained by normalizing $s_{0}$ with $\mathrm{softmax}$, $p_{0}\in \mathbb{R}^{v}$.

$i_{0}$: the index of the most probable word in $p_{0}$, an integer.
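Equations (2) to (4) can be sketched in a few lines of NumPy. The weight matrix `W_g`, the bias `b_g`, the random hidden state `h0`, and the sizes are placeholders assumed for illustration; in a trained model they would come from the decoder.

```python
import numpy as np

def softmax(s):
    """Numerically stable softmax over a 1-D score vector."""
    z = s - np.max(s)
    p = np.exp(z)
    return p / p.sum()

rng = np.random.default_rng(1)
h_dim, v = 5, 3                     # assumed hidden size and vocabulary size
W_g = rng.normal(size=(v, h_dim))   # g is an affine map R^h -> R^v
b_g = np.zeros(v)
h0 = rng.normal(size=h_dim)         # stand-in for the LSTM output h_0

s0 = W_g @ h0 + b_g                 # equation (2): one score per vocabulary word
p0 = softmax(s0)                    # equation (3): probability distribution
i0 = int(np.argmax(p0))             # equation (4): index of the most likely word
```

Because softmax is monotonic, taking the argmax of `p0` selects the same index as taking the argmax of `s0`.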

At time step 1:

$h_{1}=\mathrm{LSTM}(h_{0},w_{i_{0}}) \qquad (5)$

$s_{1} = g(h_{1}) \qquad (6)$

$p_{1} = \mathrm{softmax}(s_{1}) \qquad (7)$

$i_{1} = \mathrm{argmax}(p_{1}) \qquad (8)$

Compared with time step 0, the inputs to the LSTM change:

$e\rightarrow h_{0}$: the hidden-state input changes from $e$ to the hidden state of the previous time step.

$w_{sos}\rightarrow w_{i_{0}}$: the input word becomes the word predicted at the previous time step.

Decoding continues in this way and stops only when the special token <eos> is predicted.

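In effect, the method above makes the following approximation, in which the full history is summarized by the last predicted word, the hidden state, and the sentence vector:

$\mathbb{P}[y_{t+1}|y_{1},\cdots ,y_{t},x_{0},\cdots ,x_{n}] \mapsto \mathbb{P}[y_{t+1}|y_{t},h_{t},e]$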
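The decoding loop described above (feed back the predicted word until `<eos>` appears) might look like the sketch below. The function `toy_step` and the table `TRANSITIONS` are hypothetical stand-ins for the LSTM plus softmax: the toy "model" looks only at the previous token, which is enough to show the control flow.

```python
import numpy as np

VOCAB = ["<sos>", "comment", "vas", "tu", "<eos>"]

# Toy stand-in for the decoder: row = previous token, column = next-token probability.
TRANSITIONS = np.array([
    [0.0, 0.7, 0.1, 0.1, 0.1],   # after <sos>
    [0.0, 0.1, 0.6, 0.2, 0.1],   # after comment
    [0.0, 0.1, 0.1, 0.6, 0.2],   # after vas
    [0.0, 0.1, 0.1, 0.1, 0.7],   # after tu
    [0.0, 0.0, 0.0, 0.0, 1.0],   # after <eos>
])

def toy_step(h, token_id):
    """Hypothetical decoder step: returns (new hidden state, p_t)."""
    return h + 1, TRANSITIONS[token_id]

def greedy_decode(step, h, sos_id, eos_id, max_len=20):
    """Feed the argmax word back in until <eos> or max_len is reached."""
    token, output = sos_id, []
    for _ in range(max_len):
        h, p = step(h, token)
        token = int(np.argmax(p))
        if token == eos_id:
            break
        output.append(token)
    return output

ids = greedy_decode(toy_step, 0, VOCAB.index("<sos>"), VOCAB.index("<eos>"))
sentence = [VOCAB[i] for i in ids]
```

The `max_len` guard is an addition for safety: a real model is not guaranteed to ever emit `<eos>`.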

3. Seq2Seq with Attention

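Adding an attention mechanism to seq2seq generally improves the model. During decoding, the model can attend to specific parts of the encoder sequence instead of relying only on the single vector $e$ that represents the whole sentence.

With attention, the encoder is unchanged; only the decoder changes accordingly.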

3.1 Decoder

$h_{t} = \mathrm{LSTM}(h_{t-1},[w_{i_{t-1}},c_{t}]) \qquad (9)$

$s_{t}=g(h_{t}) \qquad (10)$

$p_{t}=\mathrm{softmax}(s_{t}) \qquad (11)$

$i_{t}=\mathrm{argmax}(p_{t}) \qquad (12)$

$h_{t-1}$: the hidden state from the previous time step, used as the hidden input of the current step. $h_{t-1}\in \mathbb{R}^{h}$

$h_{t}$: the hidden state output at the current time step; it becomes the hidden input of the next step. $h_{t}\in \mathbb{R}^{h}$

$w_{i_{t-1}}$: the word vector of the word predicted at the previous time step, $w_{i_{t-1}}\in\mathbb{R}^{d}$, where $d$ is the word-vector dimension.

$c_{t}$: the context vector, a weighted sum of the encoder outputs, $c_{t}\in \mathbb{R}^{h}$, where $h$ is the LSTM hidden dimension.

$g$, $s_{t}$, $p_{t}$, and $i_{t}$ are defined exactly as in Section 2.2. What remains is how $c_{t}$ is obtained:

$\alpha_{t'} = f(h_{t-1},e_{t'})\in \mathbb{R} \qquad \text{for all } t' \qquad (13)$

$\bar{\alpha} = \mathrm{softmax}(\alpha) \qquad (14)$

$c_{t} = \sum_{t'=0}^{T}\bar{\alpha}_{t'}e_{t'} \qquad (15)$

$e_{t'}$: the encoder hidden state at time step $t'$.

$h_{t-1}$: the hidden state fed into the current decoder time step. It has to be $h_{t-1}$ rather than $h_{t}$, because equation (9) needs $c_{t}$ before $h_{t}$ can be computed.

$\alpha_{t'}$: the score for how much the current decoder time step attends to encoder time step $t'$.

$\alpha =[\alpha_{0},\alpha_{1},\cdots ,\alpha_{T}]$: the vector of scores over all encoder time steps.

$\bar{\alpha}$: the result of normalizing $\alpha$ with softmax, $\bar{\alpha} =[\bar{\alpha}_{0},\bar{\alpha}_{1},\cdots ,\bar{\alpha}_{T}]$.

$c_{t}$: the weighted sum of the encoder outputs at decoder time step $t$.

For the function $f$, common choices include the following three, though others are possible; in practice, use whichever works best.

$f(h_{t-1},e_{t'})=\left\{\begin{matrix} h_{t-1}^{\top}e_{t'} & \text{dot} \\ h_{t-1}^{\top}We_{t'} & \text{general} \\ v^{\top}\tanh(W[h_{t-1},e_{t'}]) & \text{concat} \end{matrix}\right. \qquad (16)$
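As a rough sketch, the three scoring functions and the attention steps (score every encoder state, normalize with softmax, take the weighted sum) could be written in NumPy as follows. The shapes, the random parameters `W_gen`, `W_cat`, `v_cat`, and the function names are assumptions made for this example. The query argument `h_q` is whichever decoder hidden state is used to score the encoder states.

```python
import numpy as np

def softmax(a):
    z = a - np.max(a)
    w = np.exp(z)
    return w / w.sum()

def score_dot(h_q, e_s):
    return h_q @ e_s

def score_general(h_q, e_s, W):
    return h_q @ W @ e_s

def score_concat(h_q, e_s, W, v):
    return v @ np.tanh(W @ np.concatenate([h_q, e_s]))

def attention_context(h_q, E, score):
    """Score each encoder state, normalize, and return (weights, context)."""
    alpha = np.array([score(h_q, e_s) for e_s in E])   # raw scores
    alpha_bar = softmax(alpha)                         # normalized weights
    c_t = alpha_bar @ E                                # weighted sum of encoder states
    return alpha_bar, c_t

rng = np.random.default_rng(2)
h_dim, T = 4, 3                                  # assumed toy sizes
E = rng.normal(size=(T, h_dim))                  # encoder hidden states
h_q = rng.normal(size=h_dim)                     # decoder hidden state used as query
W_gen = rng.normal(size=(h_dim, h_dim))
W_cat = rng.normal(size=(h_dim, 2 * h_dim))
v_cat = rng.normal(size=h_dim)

w_dot, c_dot = attention_context(h_q, E, score_dot)
w_gen, c_gen = attention_context(h_q, E, lambda h, e: score_general(h, e, W_gen))
w_cat, c_cat = attention_context(h_q, E, lambda h, e: score_concat(h, e, W_cat, v_cat))
```

The dot variant has no parameters but requires the encoder and decoder hidden sizes to match; the other two add learned parameters and relax that constraint.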

4. Train

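Back to the example: the goal is to translate "how are you" into "comment vas tu".

If, during training, the decoder takes the word it predicted at time step $t-1$ as the input at time step $t$, a single wrong prediction is likely to derail the rest of the sequence. Errors accumulate, the model is trained on inputs that do not come from the correct distribution, and training becomes slow or may fail altogether. A common trick to speed things up is to feed the ground-truth token sequence $[<sos>,comment,vas,tu]$ as input and predict the next token at each position, $[comment,vas,tu,<eos>]$.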
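This trick is commonly called teacher forcing. The shifted input and target sequences can be built as in the small sketch below; the variable names are illustrative only.

```python
target = ["comment", "vas", "tu"]

decoder_input = ["<sos>"] + target      # what the decoder reads at each step
decoder_target = target + ["<eos>"]     # what it should predict at each step

# Each pair is (input token at step t, token to predict at step t).
pairs = list(zip(decoder_input, decoder_target))
```

Both sequences have the same length, so every decoder step has exactly one input token and one token to predict.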

At each time step $t$, the decoder outputs a probability distribution over the vocabulary, $p_{t}\in \mathbb{R}^{v}$, where $v$ is the vocabulary size. For a given target sequence $[y_{1},y_{2},\cdots ,y_{n}]$, we can compute the probability of the whole sentence:

$\mathbb{P}(y_{1},\cdots ,y_{n})=\prod_{t=1}^{n}p_{t}[y_{t}] \qquad (17)$

Here $p_{t}[y_{t}]$ is the probability that the decoder assigns to the target word $y_{t}$ at time step $t$. We want to maximize this probability on the target sequence, which is equivalent to minimizing

$-\log\mathbb{P}(y_{1},\cdots ,y_{n}) \qquad (18)$

We take expression (18) as the loss function.

$-\log\mathbb{P}(y_{1},\cdots ,y_{n}) =-\log\prod_{t=1}^{n}p_{t}[y_{t}] \qquad (19)$

$-\log\mathbb{P}(y_{1},\cdots ,y_{n}) =-\sum_{t=1}^{n}\log p_{t}[y_{t}] \qquad (20)$

In the concrete example, the objective is to minimize:

$-\log p_{1}[comment]-\log p_{2}[vas]-\log p_{3}[tu]-\log p_{4}[<eos>] \qquad (21)$

This loss is exactly the cross-entropy loss.
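As a small numeric illustration of equations (17) to (21), the sketch below computes the loss for the example sentence. The probability table `P` is made up for the example; each row plays the role of one $p_{t}$.

```python
import numpy as np

VOCAB = ["comment", "vas", "tu", "<eos>"]

# Made-up decoder outputs: row t is the distribution p_t over VOCAB.
P = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.2, 0.6, 0.1],
    [0.1, 0.1, 0.1, 0.7],
])
target_ids = [VOCAB.index(w) for w in ["comment", "vas", "tu", "<eos>"]]

def sequence_nll(P, target_ids):
    """Negative log-likelihood: -sum_t log p_t[y_t], as in equation (20)."""
    picked = P[np.arange(len(target_ids)), target_ids]   # p_t[y_t] for each t
    return -np.sum(np.log(picked))

loss = sequence_nll(P, target_ids)
sentence_prob = np.exp(-loss)   # recovers the product in equation (17)
```

With these numbers the picked probabilities are 0.7, 0.5, 0.6, and 0.7, so the sentence probability is their product, 0.147.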

5. Decoding

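This section describes the decoding process (how output sequences are generated at inference time), not the decoder module.

5.1 Theory

Greedy decoding feeds the most likely word predicted at the previous step as input to the next step. With this approach, a single mistake at one step can derail the whole decoded sequence. To reduce this risk as far as possible (it cannot currently be eliminated), beam search is used: instead of keeping only the highest-scoring candidate at the current time step, we keep the $k$ highest-scoring ones.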

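The set of decoding hypotheses over time steps $[1,t]$ is $H_{t}:=\{(w_{1}^{1},\cdots ,w_{t}^{1}),\cdots ,(w_{1}^{k},\cdots ,w_{t}^{k})\}$, $k$ sequences in total. The subscript is the time step, and the superscript indexes the hypothesis among the top $k$.

How do we get the candidate set $C_{t+1}$ at time $t+1$ from $H_{t}$?

$C_{t+1}:=\bigcup_{i=1}^{k}\{(w_{1}^{i},\cdots ,w_{t}^{i},1),\cdots ,(w_{1}^{i},\cdots ,w_{t}^{i},v)\}$. This candidate set has $k\times v$ elements; the $k$ highest-scoring ones are then selected as $H_{t+1}$.

Note: the words here are chosen from the vocabulary, which has $v$ words in total. This is an important point, because it differs from the pointer network covered in the next post.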
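One expansion step from $H_{t}$ to $H_{t+1}$ might be sketched like this. The function name `beam_step`, the use of the summed log-probability as the score, and all the numbers are assumptions for illustration. Words are represented by their vocabulary indices, with 0 = comment, 1 = vas, 2 = tu.

```python
import numpy as np

def beam_step(hypotheses, log_probs, next_log_probs, k):
    """Expand k hypotheses into k*v candidates and keep the best k.

    hypotheses:     list of k tuples of token ids
    log_probs:      summed log-probability of each hypothesis
    next_log_probs: array of shape (k, v), log p of each next word
    """
    candidates = []
    for i, hyp in enumerate(hypotheses):
        for w in range(next_log_probs.shape[1]):
            score = log_probs[i] + next_log_probs[i, w]
            candidates.append((score, hyp + (w,)))
    candidates.sort(key=lambda c: c[0], reverse=True)
    top = candidates[:k]
    return [h for _, h in top], [s for s, _ in top]

H2 = [(0, 1), (0, 2)]                          # (comment, vas), (comment, tu)
H2_log_probs = np.log([0.4, 0.3])              # made-up hypothesis probabilities
next_log_probs = np.log(np.array([
    [0.1, 0.2, 0.7],                           # next-word distribution after (comment, vas)
    [0.2, 0.7, 0.1],                           # next-word distribution after (comment, tu)
]))

H3, H3_log_probs = beam_step(H2, H2_log_probs, next_log_probs, k=2)
```

With these made-up numbers the two survivors are (comment, vas, tu) and (comment, tu, vas), with probabilities 0.28 and 0.21.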

5.2 Example

Assume $k=2$, $H_{2}:=\{(comment,vas),(comment,tu)\}$, and that the decoder can choose from only three words, $[comment,vas,tu]$, so $v=3$.

At $t=2$ there are two outputs, $[vas,tu]$. At $t=3$, treat the model as running with $batch\_size=2$: feeding $[vas,tu]$ into the model gives an output $output_{3}$ of shape $(batch\_size,vocab\_size)=(2,3)$, six entries in total, which form the candidate set:

$C_{3}=\{(comment,vas,comment),(comment,vas,vas),(comment,vas,tu)\}\\\cup \{(comment,tu,comment),(comment,tu,vas),(comment,tu,tu)\}$

From $output_{3}[0]$, pick the 2 words with the highest $\log(p)$ at $t=3$, and likewise pick the 2 words with the highest $\log(p)$ from $output_{3}[1]$, to form $\bar{C_{3}}$:

$\bar{C_{3}}=\{(comment,vas,comment),(comment,vas,tu)\}\cup \{(comment,tu,comment),(comment,tu,vas)\}$

Then pick the $k=2$ sentences with the highest overall $score$ from $\bar{C_{3}}$ to obtain $H_{3}$:

$H_{3}:=\{(comment,vas,tu),(comment,tu,vas)\}$

The $score$ is computed as follows.

The goal is to pick the $k=2$ sentences with the largest $score$ from $\bar{C_{3}}$. ${score}_{i}$ denotes the score of the $i$-th sentence in $\bar{C_{3}}$, $i\in \{1,2,3,4\}$.

${score}_{1} = (\log p(comment)+\log p(vas)+\log p(comment))/3$

${score}_{2} = (\log p(comment)+\log p(vas)+\log p(tu))/3$

${score}_{3} = (\log p(comment)+\log p(tu)+\log p(comment))/3$

${score}_{4} = (\log p(comment)+\log p(tu)+\log p(vas))/3$

The sum is divided by the sentence length because length affects the score. Take ${score}_{1}$ as an example:

${score}_{1} = \log\sqrt[3]{p(comment)\cdot p(vas)\cdot p(comment)}$

Without the cube root, the longer the sentence, the smaller the product of probabilities, so the final prediction would be biased toward shorter sentences.

After $H_{3}$ is obtained, decoding continues in the same way, and a sentence stops growing once the end token <eos> is predicted. Finally, the sentence with the highest score among the candidates is returned as the output.
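The effect of length normalization can be checked with a short sketch. The per-word probabilities are invented for the example: a 2-word sentence with 0.5 per word against a 4-word sentence with 0.6 per word.

```python
import math

def raw_score(word_probs):
    """Sum of log-probabilities (log of the product)."""
    return sum(math.log(p) for p in word_probs)

def normalized_score(word_probs):
    """Average log-probability, i.e. log of the geometric mean."""
    return raw_score(word_probs) / len(word_probs)

short = [0.5, 0.5]                 # made-up 2-word hypothesis
long = [0.6, 0.6, 0.6, 0.6]        # made-up 4-word hypothesis

prefers_short_raw = raw_score(short) > raw_score(long)
prefers_long_normalized = normalized_score(long) > normalized_score(short)
```

The raw score favors the short sentence (product 0.25 against 0.1296) even though each of its words is less likely, while the normalized score favors the long one (geometric mean 0.6 against 0.5).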

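6. Summary

From top to bottom, this post covered the basic structure of the seq2seq model and the attention mechanism, and described how this class of models is trained and how it generates output. If anything in this post falls short, corrections are welcome.

Note: this post does not include a project implementation, because the next post, on the pointer-generator network, includes seq2seq + attention along with implementation code and results.

Link to the implementation post (note: it does not implement plain seq2seq; it implements a seq2seq-based model, the pointer-generator network; paper link)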
