Lecture 5 Transformer


Yi_cAt


Category: 2022 Spring 李宏毅 (Hung-yi Lee) ML. Tags: transformer, deep learning, natural language processing

Article link: https://blog.csdn.net/Yi_cAt/article/details/127124736


Lecture 5: Sequence to sequence

Transformer

Sequence-to-sequence (Seq2seq)

Seq2seq for Syntactic Parsing

Grammar as a Foreign Language, 1412.7449.pdf (arxiv.org)

Seq2seq for Multi-label Classification

Seq2seq for Object Detection

End-to-End Object Detection with Transformers (arxiv.org)

Overview of Seq2seq Model

Sequence to Sequence Learning with Neural Networks (arxiv.org)

Attention Is All You Need (arxiv.org)

Encoder

The Encoder's task is to take in a sequence of vectors and output a sequence of vectors. Many models can do this, such as self-attention, CNN, and RNN.

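A $\text{block}$ in the Transformer's Encoder is not just $\text{self-attention}$ . The input vector passes through $\text{self-attention}$ to give $\bf a$ , which is then added to the original input $\bf b$ (this operation is called a $\text{residual}$ connection). Finally, $\text{layer normalization}$ is applied to the $\text{residual}$ result ( $\text{Add \& Norm}$ ), and the normalized vector becomes the input to the next layer, the $\text{FC, Fully-Connected Network}$ .
The $\text{FC}$ also uses a $\text{residual}$ design: the input vector passes through the $\text{FC}$ to give an output vector, the output vector is added to the original input vector, and $\text{layer normalization}$ is applied again. The result is the output of one $\text{Block}$ of the Encoder.
Compared with $\text{batch normalization}$ , $\text{layer normalization}$ is simpler because it does not depend on the $\text{batch}$ . Note the difference: $\text{batch normalization}$ normalizes the same dimension across different feature vectors (the examples in a batch), whereas $\text{layer normalization}$ normalizes the values of different dimensions within the same feature vector.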
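
The Add & Norm steps and the contrast between the two normalizations can be sketched in plain Python. This is a minimal illustration, not the lecture's implementation: `toy_self_attention` and `toy_fc` are hypothetical stand-ins for the trained sub-layers, the learnable gain and bias of the normalizations are omitted, and `eps` is an assumed small constant for numerical stability.

```python
import math

def standardize(values, eps=1e-5):
    """Shift and scale a list of numbers to zero mean and roughly unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm(x):
    """Normalize across the dimensions of ONE feature vector."""
    return standardize(x)

def batch_norm(batch):
    """Normalize each dimension across ALL feature vectors in the batch."""
    columns = [standardize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def add_and_norm(x, sublayer_out):
    """Residual connection (x + sublayer output) followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

def toy_self_attention(vectors):
    # Hypothetical stand-in: every position receives the average of all inputs.
    n = len(vectors)
    avg = [sum(col) / n for col in zip(*vectors)]
    return [avg[:] for _ in vectors]

def toy_fc(x):
    # Hypothetical stand-in for the fully-connected network.
    return [max(0.0, 2.0 * v - 1.0) for v in x]

def encoder_block(vectors):
    attended = toy_self_attention(vectors)                             # gives a
    hidden = [add_and_norm(b, a) for b, a in zip(vectors, attended)]   # a + b, then norm
    return [add_and_norm(h, toy_fc(h)) for h in hidden]                # FC, add, norm

x = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
print(layer_norm(x[0]))   # statistics taken inside one vector
print(batch_norm(x))      # statistics taken per dimension, across vectors
print(encoder_block(x))
```

Note that `layer_norm` works on a single vector in isolation, while `batch_norm` needs the whole batch before it can compute anything, which is why layer normalization is free of the batch constraint.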

To learn more

On Layer Normalization in the Transformer Architecture (mlr.press)

PowerNorm: Rethinking Batch Normalization in Transformers (mlr.press)

Decoder

Autoregressive (AT) (Speech Recognition as example)

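The Decoder takes the Encoder's output as input. In the speech recognition example, a special ${\text{token: <bos>}}$ is needed to tell the Decoder that recognition begins.
The input vector passes through the Decoder and then a $\text{softmax}$ , producing an output vector whose size equals the size of the $\text{vocabulary}$ . It assigns a score (probability) to every character, and the character with the highest score is taken as the result. This process is then repeated.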

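As we can see, the Decoder's output at one time step is its input at the next time step. So if the output at one time step is wrong (e.g. "器 → 气"), will the subsequent outputs go wrong as well? We set this question aside for now.
Another special ${\text{token: <eos>}}$ is also needed to tell the Decoder that recognition ends.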
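
The autoregressive loop described above can be sketched as greedy decoding. This is only an illustration under stated assumptions: `VOCAB` is a made-up six-entry vocabulary, and `toy_decoder` is a hypothetical stand-in for a trained Decoder that simply scores the next entry of `VOCAB` highest and ignores the Encoder output.

```python
import math

VOCAB = ["<bos>", "机", "器", "学", "习", "<eos>"]  # assumed toy vocabulary

def softmax(scores):
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def toy_decoder(encoder_output, prefix):
    """Hypothetical decoder: one raw score per vocabulary entry,
    favouring the entry that follows the last generated token."""
    nxt = min(VOCAB.index(prefix[-1]) + 1, len(VOCAB) - 1)
    return [5.0 if i == nxt else 0.0 for i in range(len(VOCAB))]

def greedy_decode(encoder_output, max_len=10):
    tokens = ["<bos>"]                    # <bos> tells the Decoder to start
    while len(tokens) <= max_len:
        probs = softmax(toy_decoder(encoder_output, tokens))
        best = max(range(len(VOCAB)), key=lambda i: probs[i])
        tokens.append(VOCAB[best])        # this output is the next step's input
        if tokens[-1] == "<eos>":         # <eos> tells the Decoder to stop
            break
    return tokens[1:]

print(greedy_decode(encoder_output=None))
```

The `max_len` cap is a design choice of this sketch: without it, a decoder that never emits `<eos>` would loop forever.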

Non-autoregressive (NAT)

AT vs. NAT

Masked Multi-Head Attention

For how self-attention is computed, see: Lecture 4 Sequence as input

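$\text{Masked Self-attention}$ differs from Self-attention in how the outputs are computed. When computing the output ${\bf{b}}^1$ , only the input ${\bf a}^1$ is considered; for ${\bf b}^2$ , only ${\bf a}^1,\ {\bf a}^2$ ; for ${\bf b}^3$ , only ${\bf a}^1,\ {\bf a}^2,\ {\bf a}^3$ ; and for ${\bf b}^4$ , all of ${\bf a}^1,\ {\bf a}^2,\ {\bf a}^3,\ {\bf a}^4$ .
More concretely, take the computation of ${\bf b}^2$ as an example: we only use the ${\text {query}}\ {\bf q}^2$ and ${\text{key}}\ {\bf k}^2$ computed from ${\bf a}^2$ , together with the ${\text {key}}\ {\bf k}^1$ computed from ${\bf a}^1$ , to compute the $\text{attention score}\ \alpha_{2,1}',\ \alpha_{2,2}'$ . Why is $\text{Masked}$ needed? The intuition is simple: the Decoder's input comes from its output at the previous time step, so it can never have all input vectors at the same time.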
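
A minimal sketch of the masking rule, assuming a single head and, for brevity, that queries, keys, and values are the inputs themselves (that is, $W^q$, $W^k$, $W^v$ are taken to be identity matrices). The division by $\sqrt{d}$ follows the scaled dot-product form of "Attention Is All You Need" and is an assumption here, not something stated in these notes.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def masked_self_attention(q, k, v):
    """q, k, v: lists of vectors, one per position.
    Output i is computed from positions 0..i only."""
    d = len(k[0])
    outputs = []
    for i in range(len(q)):
        # Only keys up to and including position i are scored.
        scores = [dot(q[i], k[j]) / math.sqrt(d) for j in range(i + 1)]
        weights = softmax(scores)
        out = [sum(w * v[j][c] for j, w in enumerate(weights))
               for c in range(len(v[0]))]
        outputs.append(out)
    return outputs

a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
b = masked_self_attention(a, a, a)
print(b[0])  # depends on a[0] only
print(b[1])  # depends on a[0] and a[1] only
```

Because position `i` never looks at later positions, replacing the last input leaves the first three outputs unchanged, which is the property the Decoder relies on when it generates one token at a time.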

Cross Attention

Cross attention is the bridge between the Encoder and the Decoder. It takes three inputs, of which $\text{\color{blue}two}$ come from the Encoder and $\text{\color{green}one}$ comes from the Decoder.

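The Encoder outputs ${\bf a}^{(1,2,3)}$ are multiplied by $W^k$ to give the $\text{key}\ {\bf k}^{(1,2,3)}$ . In the Decoder, the input passes through $\text{masked multi-head attention}$ to give an output, which is multiplied by $W^q$ to give the $\text{query}\ {\bf q}$ . The $\text{attention score}$ is computed from ${\bf q}$ and ${\bf k}^{(1,2,3)}$ .
The Encoder outputs ${\bf a}^{(1,2,3)}$ are also multiplied by $W^v$ to give the $\text{value}\ {\bf v}^{(1,2,3)}$ , and ${\bf v}=\text{weightedsum}({\bf v}^{(1,2,3)})$ , weighted by the attention scores. ${\bf v}$ is then fed into the subsequent $\text{FC}$ .
The $\text{Cross}$ refers to the fact that the $\text{key}\ {\bf k}$ and $\text{value}\ {\bf v}$ come from the Encoder, while the $\text{query}\ {\bf q}$ comes from the Decoder.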
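
The data flow of cross attention can be sketched as follows. This is a single-head illustration under assumptions: the weight matrices are passed in as plain nested lists, the example uses identity matrices in place of learned weights, and the scores are normalized with a scaled softmax as in "Attention Is All You Need".

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    return [dot(row, x) for row in W]

def cross_attention(decoder_state, encoder_outputs, W_q, W_k, W_v):
    q = matvec(W_q, decoder_state)                       # query: from the Decoder
    keys = [matvec(W_k, a) for a in encoder_outputs]     # keys: from the Encoder
    values = [matvec(W_v, a) for a in encoder_outputs]   # values: from the Encoder
    scores = [dot(q, k) / math.sqrt(len(q)) for k in keys]
    weights = softmax(scores)
    # Weighted sum of the values; this vector goes on to the FC.
    v = [sum(w * val[c] for w, val in zip(weights, values))
         for c in range(len(values[0]))]
    return v, weights

identity = [[1.0, 0.0], [0.0, 1.0]]                 # assumed stand-in for learned weights
enc_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # a^1, a^2, a^3
dec_state = [10.0, 0.0]                             # output of masked attention
v, weights = cross_attention(dec_state, enc_out, identity, identity, identity)
print(weights)
print(v)
```

In this example the query points along the first dimension, so the Encoder outputs with a large first component (`a^1` and `a^3`) receive almost all of the attention weight.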

Training

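During training, we feed the $\text{ground truth}$ to the Decoder as its input; this is called $\text{Teacher Forcing}$ . At test time there is obviously no $\text{ground truth}$ to feed to the Decoder. Likewise, in real applications, at the $\text{inference}$ stage no $\text{ground truth}$ is available, so the Decoder only sees its own previous outputs as input. This leads to a $\text{mismatch}$ between training and inference.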
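
The mismatch can be made concrete by comparing what the Decoder is fed in each phase. A minimal sketch: `toy_predict_next` is a hypothetical model that makes the "器 → 气" mistake from the earlier example, and the lookup table inside it is invented for illustration.

```python
def training_inputs(ground_truth):
    """Teacher forcing: the Decoder input is <bos> followed by the ground truth."""
    return ["<bos>"] + ground_truth

def inference_inputs(predict_next, max_len=10):
    """Inference: the Decoder input is <bos> followed by its own earlier outputs."""
    inputs = ["<bos>"]
    for _ in range(max_len):
        nxt = predict_next(inputs)
        if nxt == "<eos>":
            break
        inputs.append(nxt)
    return inputs

def toy_predict_next(prefix):
    # Hypothetical model with one error: after "机" it outputs "气" instead of "器".
    table = {"<bos>": "机", "机": "气", "器": "学", "学": "习", "习": "<eos>"}
    # "气" never appeared as an input during training, so the toy model has no
    # entry for it and falls back to ending the sequence.
    return table.get(prefix[-1], "<eos>")

print(training_inputs(["机", "器", "学", "习"]))
print(inference_inputs(toy_predict_next))
```

With teacher forcing the Decoder always sees the correct "器" at the third position, whereas at inference it sees its own wrong "气", an input it was never trained on.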