Building the Transformer Model

RuizhiHe

Category: Natural Language Processing. Tags: artificial intelligence, machine learning, deep learning, natural language processing, attention

Article link: https://blog.csdn.net/qq_24178985/article/details/118884171

1. Introduction

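This article uses the Attention layer and the Self-Attention layer to build a deep neural network: the Transformer model.
For all of my articles, see: Blog Article Navigation Index
This article belongs to: Natural Language Processing series
For the practice code of this series, see: My GitHub
Previous article: Attention is all you need: removing the RNN, keeping Attention
Next article: BERT and ERNIE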

2. Multi-Head Attention

2.1 Multi-Head Self-Attention Layer

As described in the previous article, a Self-Attention layer takes $X=[x_1,x_2,\cdots,x_m]$ as input and outputs a sequence of the same length, $C=[c_1,c_2,\cdots,c_m]$, that is, $C=\mathrm{Attn}(X,X)$. The Self-Attention layer described there can also be called a Single-Head Self-Attention Layer.
A multi-head Self-Attention layer consists of $l$ single-head Self-Attention layers that are independent of one another and share no parameters. Each single-head Self-Attention has three parameter matrices $W_Q,W_K,W_V$, so a multi-head Self-Attention layer built from $l$ heads has $3l$ parameter matrices in total: $W_{Q1},W_{Q2},\cdots,W_{Ql},W_{K1},W_{K2},\cdots,W_{Kl},W_{V1},W_{V2},\cdots,W_{Vl}$.

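As the figure above shows, in a multi-head Self-Attention layer every single-head Self-Attention takes the same input $X$ and outputs a sequence of length $m$ whose elements each have dimension $d$. The output of the multi-head layer is the concatenation of all the single-head outputs: a sequence of length $m$ whose elements each have dimension $ld$.
The multi-head Self-Attention layer can be represented by the figure below. Compared with the Self-Attention layer from the previous article, both the dimension of each output element and the number of parameter matrices are $l$ times those of a single-head Self-Attention layer.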
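
To make the shapes concrete, here is a minimal NumPy sketch of the multi-head Self-Attention layer described above. It follows this article's convention (one column per position, unscaled inner products $k_i^Tq_t$); the original Transformer paper additionally divides the scores by $\sqrt{d}$. The function names, the toy sizes and the random weights are illustrative assumptions, not code from this series.

```python
import numpy as np

def softmax(a, axis=0):
    # Numerically stable softmax along the given axis.
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_head_self_attention(X, W_Q, W_K, W_V):
    # X has shape (d_in, m): one column per position x_1..x_m.
    Q, K, V = W_Q @ X, W_K @ X, W_V @ X   # each of shape (d, m)
    A = softmax(K.T @ Q, axis=0)          # A[i, t] = alpha_{ti}; every column sums to 1
    return V @ A                          # shape (d, m); column t is c_t

def multi_head_self_attention(X, heads):
    # heads: list of l independent (W_Q, W_K, W_V) triples, no parameter sharing.
    outputs = [single_head_self_attention(X, *h) for h in heads]
    return np.concatenate(outputs, axis=0)  # shape (l*d, m)

# Toy sizes (assumed for illustration only).
rng = np.random.default_rng(0)
d_in, d, l, m = 8, 4, 3, 5
X = rng.normal(size=(d_in, m))
heads = [tuple(rng.normal(size=(d, d_in)) for _ in range(3)) for _ in range(l)]

C = multi_head_self_attention(X, heads)
print(C.shape)  # (12, 5): length m = 5, element dimension l*d = 12
```

The first $d$ rows of `C` are exactly the output of the first head, the next $d$ rows the output of the second head, and so on.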

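A single-head Self-Attention can be compared to one convolution kernel in a convolutional layer of a convolutional neural network. A convolutional layer usually uses several kernels in order to extract different features from the previous layer's image. Likewise, the Attention layers of the Transformer model use multi-head Self-Attention in order to extract different attention features from the previous layer's sequence.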

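2.2 Multi-Head Attention Layer

Similar to the multi-head Self-Attention layer, $l$ single-head Attention layers can be combined into a multi-head Attention layer. Every single-head Attention takes $X$ and $X^\prime$ as input, and the heads are independent of one another and share no parameters. Concatenating the outputs $C_1,C_2,\cdots,C_l$ of the $l$ single-head Attention layers gives the output of the multi-head Attention layer.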
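
The following NumPy sketch illustrates the multi-head Attention layer. It assumes, in line with the generation steps later in this article, that keys and values are computed from $X$ and queries from $X^\prime$, so the output has the same length as $X^\prime$. Function names, sizes and random weights are illustrative assumptions.

```python
import numpy as np

def softmax(a, axis=0):
    # Numerically stable softmax along the given axis.
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(X, X_prime, W_Q, W_K, W_V):
    # X: (d_in, m) supplies keys and values; X_prime: (d_in, t) supplies queries.
    Q = W_Q @ X_prime              # (d, t)
    K = W_K @ X                    # (d, m)
    V = W_V @ X                    # (d, m)
    A = softmax(K.T @ Q, axis=0)   # (m, t); column j holds the weights for query j
    return V @ A                   # (d, t)

def multi_head_attention(X, X_prime, heads):
    # heads: list of l independent (W_Q, W_K, W_V) triples.
    outputs = [single_head_attention(X, X_prime, *h) for h in heads]
    return np.concatenate(outputs, axis=0)   # (l*d, t)

# Toy sizes (assumed for illustration only).
rng = np.random.default_rng(0)
d_in, d, l, m, t = 8, 4, 3, 5, 2
X = rng.normal(size=(d_in, m))
X_prime = rng.normal(size=(d_in, t))
heads = [tuple(rng.normal(size=(d, d_in)) for _ in range(3)) for _ in range(l)]

Z = multi_head_attention(X, X_prime, heads)
print(Z.shape)  # (12, 2): length t = 2, element dimension l*d = 12
```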

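3. Building the Transformer Model

The Transformer is a Seq2Seq model with one Encoder and one Decoder. At the time of writing, it is among the most effective approaches to NLP problems such as machine translation, and it performs substantially better than RNN-based models.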

3.1 Transformer’s Encoder

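The Encoder of the Transformer model consists of several Encoder Blocks, and each Encoder Block consists of a multi-head Self-Attention layer and a fully connected layer.
From Section 2.1, the multi-head Self-Attention layer takes $X=[x_1,x_2,\cdots,x_m]$ as input and outputs $C=[c_1,c_2,\cdots,c_m]$, where every element of $C$ has dimension $ld$. Feeding each element $c_i,~i=1,\cdots,m$ of $C$ separately into the fully connected layer gives the Encoder Block output sequence $U=[u_1,u_2,\cdots,u_m]$, with $u_i=\mathrm{ReLU}(W_Uc_i),~i=1,\cdots,m$.
The structure shown in Figure 4, made of a multi-head Self-Attention layer followed by a fully connected layer, is an Encoder Block. The $m$ Dense modules that map $c_i$ to $u_i$ are identical: all Dense modules in Figure 4 share the same parameter matrix $W_U$.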
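
As a rough illustration of the Encoder Block just described, the sketch below applies multi-head Self-Attention and then the same Dense map $u_i=\mathrm{ReLU}(W_Uc_i)$ to every position. Applying one matrix to all columns at once is what "all Dense modules share $W_U$" amounts to. Names, toy sizes and random weights are assumptions made for this example.

```python
import numpy as np

def softmax(a, axis=0):
    # Numerically stable softmax along the given axis.
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, heads):
    # X: (d_model, m); heads: list of l (W_Q, W_K, W_V) triples.
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = W_Q @ X, W_K @ X, W_V @ X
        outputs.append(V @ softmax(K.T @ Q, axis=0))
    return np.concatenate(outputs, axis=0)        # (l*d, m)

def encoder_block(X, heads, W_U):
    C = multi_head_self_attention(X, heads)       # (l*d, m)
    # One matrix product applies the same W_U to every column c_i.
    return np.maximum(W_U @ C, 0.0)               # ReLU, shape (d_model, m)

# Toy sizes (assumed); the article itself uses element dimension 512.
rng = np.random.default_rng(0)
d_model, d, l, m = 16, 4, 4, 5
X = rng.normal(size=(d_model, m))
heads = [tuple(rng.normal(size=(d, d_model)) for _ in range(3)) for _ in range(l)]
W_U = rng.normal(size=(d_model, l * d))

U = encoder_block(X, heads, W_U)
print(U.shape)  # (16, 5): same shape as the input X
```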

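The input sequence $X$ and the output sequence $U$ of an Encoder Block have the same number of elements. As explained in the previous article, each $c_i$ depends on all $m$ inputs $x_1,x_2,\cdots,x_m$, so each $u_i$ also depends on all $m$ inputs $x_1,x_2,\cdots,x_m$. Changing any single element of the input sequence changes every element of the Encoder Block output sequence $U$.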
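
This dependence can be checked numerically. The sketch below changes only $x_1$ and then tests which output columns differ. It is a toy experiment with small random weights (an assumption, chosen so that the Softmax weights do not saturate), not a proof.

```python
import numpy as np

def softmax(a, axis=0):
    # Numerically stable softmax along the given axis.
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_block(X, heads, W_U):
    # Multi-head Self-Attention followed by the shared Dense layer with ReLU.
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = W_Q @ X, W_K @ X, W_V @ X
        outputs.append(V @ softmax(K.T @ Q, axis=0))
    C = np.concatenate(outputs, axis=0)
    return np.maximum(W_U @ C, 0.0)

# Toy sizes and small random weights (assumed for illustration only).
rng = np.random.default_rng(0)
d_model, d, l, m = 16, 4, 4, 5
X = rng.normal(size=(d_model, m))
heads = [tuple(rng.normal(scale=0.1, size=(d, d_model)) for _ in range(3))
         for _ in range(l)]
W_U = rng.normal(scale=0.1, size=(d_model, l * d))

U = encoder_block(X, heads, W_U)

X_changed = X.copy()
X_changed[:, 0] += 1.0                     # modify only the first input x_1
U_changed = encoder_block(X_changed, heads, W_U)

# For each output position, check whether any component changed.
changed = np.any(~np.isclose(U, U_changed), axis=0)
print(changed)
```

Every entry of `changed` should be `True`: not only $u_1$ but also $u_2,\cdots,u_m$ move, because $x_1$ contributes a key and a value to every position's weighted average.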

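In the Transformer's Encoder, one Block contains one multi-head Self-Attention layer and one fully connected layer, and maps the sequence $X$ to the sequence $U$ as shown in Figure 4. Both $X$ and $U$ have length $m$, and every element $x_i$ and $u_i$ has dimension 512. In other words, the input and the output of an Encoder Block in the Transformer model are both matrices of size $512\times m$.

The structure of the Transformer's Encoder is shown in the figure below. It is a stack of 6 Encoder Blocks applied one after another; the Blocks are independent of one another and share no parameters. The input and the output of the Transformer's Encoder are both matrices of size $512\times m$.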

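Because the input and the output of every Block in the Transformer's Encoder are matrices of the same size $512\times m$, the Encoder can be built with the skip connection technique from ResNet, together with common normalization techniques such as Batch Normalization (the original Transformer paper uses Layer Normalization).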
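
A possible way to stack the Blocks with skip connections is sketched below. Two simplifications are assumptions of this example: the skip connection wraps a whole Block (the original paper wraps each sub-layer), and normalization is a per-position layer normalization without learnable gain and bias. Names, toy sizes and random weights are illustrative.

```python
import numpy as np

def softmax(a, axis=0):
    # Numerically stable softmax along the given axis.
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_block(X, heads, W_U):
    # Multi-head Self-Attention followed by the shared Dense layer with ReLU.
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = W_Q @ X, W_K @ X, W_V @ X
        outputs.append(V @ softmax(K.T @ Q, axis=0))
    return np.maximum(W_U @ np.concatenate(outputs, axis=0), 0.0)

def layer_norm(X, eps=1e-5):
    # Normalize every column (position) to zero mean and unit variance.
    mu = X.mean(axis=0, keepdims=True)
    var = X.var(axis=0, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def encoder(X, blocks):
    # blocks: list of (heads, W_U) pairs; each Block has its own parameters.
    for heads, W_U in blocks:
        X = layer_norm(X + encoder_block(X, heads, W_U))   # skip connection
    return X

# Toy sizes (assumed); the article uses element dimension 512 and 6 Blocks.
rng = np.random.default_rng(0)
d_model, d, l, m, n_blocks = 16, 4, 4, 5, 6

def random_block():
    heads = [tuple(rng.normal(scale=0.1, size=(d, d_model)) for _ in range(3))
             for _ in range(l)]
    return heads, rng.normal(scale=0.1, size=(d_model, l * d))

blocks = [random_block() for _ in range(n_blocks)]
X = rng.normal(size=(d_model, m))
U = encoder(X, blocks)
print(U.shape)  # (16, 5): the shape is preserved through all 6 Blocks
```

The skip connection is only possible because each Block returns a matrix of the same size as its input.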

3.2 Transformer’s Decoder

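The Decoder of the Transformer model consists of several Decoder Blocks, and each Decoder Block consists of a multi-head Self-Attention layer, a multi-head Attention layer and a fully connected layer.
As shown in the figure below, the first layer of a Decoder Block is a multi-head Self-Attention layer whose input is $[x_1^\prime,x_2^\prime,\cdots,x_t^\prime]$ and whose output is $[c_1,c_2,\cdots,c_t]$. The second layer is a multi-head Attention layer whose inputs are the Encoder output $[u_1,u_2,\cdots,u_m]$ and the first layer's output $[c_1,c_2,\cdots,c_t]$; its output is $Z=[z_1,z_2,\cdots,z_t]$. Feeding each element $z_i,~i=1,\cdots,t$ of $Z$ separately into the fully connected layer gives the Decoder Block output sequence $S=[s_1,s_2,\cdots,s_t]$, with $s_i=\mathrm{ReLU}(W_Sz_i),~i=1,\cdots,t$.
As in the Encoder Block, the $t$ Dense modules that map $z_i$ to $s_i$ in the Decoder Block are identical: all Dense modules in Figure 7 share the same parameter matrix $W_S$.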
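
The three layers of a Decoder Block can be sketched in NumPy as follows. This follows the article's description literally and applies no causal mask; the original Transformer paper masks the Self-Attention layer when the whole target sequence is fed in during training. Names, toy sizes and random weights are assumptions of this example.

```python
import numpy as np

def softmax(a, axis=0):
    # Numerically stable softmax along the given axis.
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_heads(X_kv, X_q, heads):
    # Keys and values come from X_kv, queries from X_q.
    # Passing the same sequence twice gives Self-Attention.
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = W_Q @ X_q, W_K @ X_kv, W_V @ X_kv
        outputs.append(V @ softmax(K.T @ Q, axis=0))
    return np.concatenate(outputs, axis=0)

def decoder_block(U, X_prime, self_heads, cross_heads, W_S):
    C = attention_heads(X_prime, X_prime, self_heads)   # layer 1: (l*d, t)
    Z = attention_heads(U, C, cross_heads)              # layer 2: (l*d, t)
    return np.maximum(W_S @ Z, 0.0)                     # layer 3: (d_model, t)

# Toy sizes (assumed); the article uses element dimension 512.
rng = np.random.default_rng(0)
d_model, d, l, m, t = 16, 4, 4, 5, 3
U = rng.normal(size=(d_model, m))          # stands in for the Encoder output
X_prime = rng.normal(size=(d_model, t))    # Decoder input sequence

self_heads = [tuple(rng.normal(size=(d, d_model)) for _ in range(3))
              for _ in range(l)]
# In layer 2 the queries are computed from C (dimension l*d),
# while keys and values are computed from U (dimension d_model).
cross_heads = [(rng.normal(size=(d, l * d)),
                rng.normal(size=(d, d_model)),
                rng.normal(size=(d, d_model))) for _ in range(l)]
W_S = rng.normal(size=(d_model, l * d))

S = decoder_block(U, X_prime, self_heads, cross_heads, W_S)
print(S.shape)  # (16, 3): same length t as X_prime
```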

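In the Transformer's Decoder, one Block contains one multi-head Self-Attention layer, one multi-head Attention layer and one fully connected layer. As shown in Figure 7, a Decoder Block takes the sequences $X$ and $X^\prime$ as input and outputs the sequence $S$. The sequence $X$ has length $m$, and the output sequence $S$ has the same length $t$ as $X^\prime$. Every element of all three sequences has dimension 512.

The Transformer's Decoder is a stack of 6 Decoder Blocks applied one after another. For every Decoder Block, the input sequence $X$ is the output of the Encoder, and $X^\prime$ is the output of the previous Decoder Block (for the first Decoder Block, $X^\prime$ is the Decoder's own input sequence).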
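
The stacking rule can be sketched as a simple loop: every Block receives the same Encoder output, while $X^\prime$ is replaced by the previous Block's output. As before, this is a toy sketch with assumed names, sizes and random weights, and without masking, skip connections or normalization.

```python
import numpy as np

def softmax(a, axis=0):
    # Numerically stable softmax along the given axis.
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_heads(X_kv, X_q, heads):
    # Keys and values come from X_kv, queries from X_q.
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = W_Q @ X_q, W_K @ X_kv, W_V @ X_kv
        outputs.append(V @ softmax(K.T @ Q, axis=0))
    return np.concatenate(outputs, axis=0)

def decoder_block(U, X_prime, self_heads, cross_heads, W_S):
    C = attention_heads(X_prime, X_prime, self_heads)
    Z = attention_heads(U, C, cross_heads)
    return np.maximum(W_S @ Z, 0.0)

def decoder(U, X_prime, blocks):
    # Every Block sees the same Encoder output U;
    # X_prime is replaced by the previous Block's output.
    for self_heads, cross_heads, W_S in blocks:
        X_prime = decoder_block(U, X_prime, self_heads, cross_heads, W_S)
    return X_prime

# Toy sizes (assumed); the article uses element dimension 512 and 6 Blocks.
rng = np.random.default_rng(0)
d_model, d, l, m, t, n_blocks = 16, 4, 4, 5, 3, 6

def w(rows, cols):
    return rng.normal(scale=0.25, size=(rows, cols))

def random_block():
    self_heads = [(w(d, d_model), w(d, d_model), w(d, d_model)) for _ in range(l)]
    cross_heads = [(w(d, l * d), w(d, d_model), w(d, d_model)) for _ in range(l)]
    return self_heads, cross_heads, w(d_model, l * d)

blocks = [random_block() for _ in range(n_blocks)]
U = rng.normal(size=(d_model, m))
X_prime = rng.normal(size=(d_model, t))
S = decoder(U, X_prime, blocks)
print(S.shape)  # (16, 3)
```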

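3.3 Transformer Model Structure

The Transformer's Encoder is a stack of 6 Encoder Blocks, and each Encoder Block has two layers: a multi-head Self-Attention layer and a fully connected layer. The input of the Encoder is a matrix $X$ of size $512\times m$, and its output $U$ is also a matrix of size $512\times m$, exactly the same size as the input matrix $X$.
Connecting the Transformer's Encoder and the Transformer's Decoder gives the Transformer model.

When explaining the structure of the Transformer's Decoder, this article fed the whole of $X^\prime$ in as input at once. When a Transformer-based Seq2Seq model generates an output sequence, the procedure is as follows: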

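1. Take $x_t^\prime$ as input and apply the three parameter matrices $W_{Q_{j1}}$, $W_{K_{j1}}$ and $W_{V_{j1}}$ to it as linear transformations, obtaining $q_t$, $k_t$ and $v_t$. Here $j$ refers to the $j$-th Self-Attention in the Multi-Head Self-Attention, and the subscript 1 indicates that the parameter matrix belongs to the first layer of the Decoder Block (the multi-head Self-Attention layer).
2. Compute the inner product of each vector $k_i$ with $q_t$, obtaining $\tilde{\alpha}_{ti}$:
   $\tilde{\alpha}_{ti}=k_i^Tq_t,\quad \text{for } i=1,\cdots,t$
3. Apply $\mathrm{Softmax}$ to $\tilde{\alpha}_{t1},\tilde{\alpha}_{t2},\cdots,\tilde{\alpha}_{tt}$ and denote the output by $\alpha_{t1},\alpha_{t2},\cdots,\alpha_{tt}$:
   $[\alpha_{t1},\alpha_{t2},\cdots,\alpha_{tt}]=\mathrm{Softmax}([\tilde{\alpha}_{t1},\tilde{\alpha}_{t2},\cdots,\tilde{\alpha}_{tt}])$
4. Take the weighted average of all $v_i$, obtaining the Context Vector $c_{tj}$:
   $c_{tj}=\alpha_{t1}v_1+\alpha_{t2}v_2+\cdots+\alpha_{tt}v_t$
5. Perform steps 1-4 for every Self-Attention in the multi-head Self-Attention layer, and concatenate $c_{t1},c_{t2},\cdots,c_{tl}$ to form $c_t$.
6. Apply the two parameter matrices $W_{K_{j2}}$ and $W_{V_{j2}}$ as linear transformations to every element of the Encoder output sequence, obtaining $k_i$ and $v_i$:
   $k_i=W_{K_{j2}}\cdot u_i,\quad \text{for } i=1,\cdots,m$
   $v_i=W_{V_{j2}}\cdot u_i,\quad \text{for } i=1,\cdots,m$
   Here $j$ refers to the $j$-th Attention in the Multi-Head Attention, and the subscript 2 indicates that the parameter matrix belongs to the second layer of the Decoder Block (the multi-head Attention layer).
7. Apply the parameter matrix $W_{Q_{j2}}$ as a linear transformation to the $c_t$ obtained in step 5, obtaining $q_t$.
8. Compute the inner product of each vector $k_i$ with $q_t$, obtaining $\tilde{\alpha}_{ti}$:
   $\tilde{\alpha}_{ti}=k_i^Tq_t,\quad \text{for } i=1,\cdots,m$
9. Apply $\mathrm{Softmax}$ to $\tilde{\alpha}_{t1},\tilde{\alpha}_{t2},\cdots,\tilde{\alpha}_{tm}$ and denote the output by $\alpha_{t1},\alpha_{t2},\cdots,\alpha_{tm}$:
   $[\alpha_{t1},\alpha_{t2},\cdots,\alpha_{tm}]=\mathrm{Softmax}([\tilde{\alpha}_{t1},\tilde{\alpha}_{t2},\cdots,\tilde{\alpha}_{tm}])$
10. Take the weighted average of all $v_i$ and output $z_{tj}$:
    $z_{tj}=\alpha_{t1}v_1+\alpha_{t2}v_2+\cdots+\alpha_{tm}v_m$
11. Perform steps 6-10 for every Attention in the multi-head Attention layer, and concatenate $z_{t1},z_{t2},\cdots,z_{tl}$ to form $z_t$.
12. Feed $z_t$ into the Dense module, obtaining $s_t$.
13. Feed $s_t$ into a $\mathrm{Softmax}$ classifier and determine $x_{t+1}^\prime$ from the result.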
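
The generation steps above can be sketched as a single function plus a greedy loop. Several things here are assumptions made for illustration and are not specified in this article: only one Decoder Block is used instead of six, the Softmax classifier is a single matrix `W_P`, tokens are embedded with a random matrix `E`, the next token is picked by argmax, and keys and values are recomputed at every step rather than cached.

```python
import numpy as np

def softmax(a):
    # Numerically stable softmax of a 1-D vector.
    e = np.exp(a - a.max())
    return e / e.sum()

def next_token_probs(X_prime, U, self_heads, cross_heads, W_S, W_P):
    # X_prime: (d_model, t) holds x'_1..x'_t; U: (d_model, m) is the Encoder output.
    x_t = X_prime[:, -1]
    c_parts = []
    for W_Q, W_K, W_V in self_heads:
        q_t = W_Q @ x_t                        # step 1 (query for position t)
        K, V = W_K @ X_prime, W_V @ X_prime    # k_i, v_i for i = 1..t
        alpha = softmax(K.T @ q_t)             # steps 2-3
        c_parts.append(V @ alpha)              # step 4
    c_t = np.concatenate(c_parts)              # step 5
    z_parts = []
    for W_Q, W_K, W_V in cross_heads:
        K, V = W_K @ U, W_V @ U                # step 6
        q_t = W_Q @ c_t                        # step 7
        alpha = softmax(K.T @ q_t)             # steps 8-9
        z_parts.append(V @ alpha)              # step 10
    z_t = np.concatenate(z_parts)              # step 11
    s_t = np.maximum(W_S @ z_t, 0.0)           # step 12
    return softmax(W_P @ s_t)                  # step 13: distribution over tokens

def greedy_decode(U, E, start_id, n_steps, params):
    ids = [start_id]
    for _ in range(n_steps):
        X_prime = E[:, ids]                    # embeddings of the tokens so far
        probs = next_token_probs(X_prime, U, *params)
        ids.append(int(np.argmax(probs)))      # choose x'_{t+1}
    return ids

# Toy sizes and random parameters (all hypothetical).
rng = np.random.default_rng(0)
d_model, d, l, m, vocab = 16, 4, 4, 5, 10

def w(rows, cols):
    return rng.normal(scale=0.25, size=(rows, cols))

self_heads = [(w(d, d_model), w(d, d_model), w(d, d_model)) for _ in range(l)]
cross_heads = [(w(d, l * d), w(d, d_model), w(d, d_model)) for _ in range(l)]
params = (self_heads, cross_heads, w(d_model, l * d), w(vocab, d_model))
E = w(d_model, vocab)                          # token embedding matrix
U = rng.normal(size=(d_model, m))              # stands in for the Encoder output

ids = greedy_decode(U, E, start_id=0, n_steps=4, params=params)
print(ids)
```

With untrained random weights the generated ids are meaningless; the point is only the data flow, in which each new token is appended to $X^\prime$ before the next step.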