Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

刘皮狠

Categories: Paper Reading, NLP. Tags: transformer, language model, deep learning

Article link: https://blog.csdn.net/weixin_43938099/article/details/128079723

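Transformer-XL

Main Work

Research Problem

How a fixed-length context limits a language model.

Contributions

Introduces a segment-level recurrence mechanism.
Proposes a new positional encoding scheme.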

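Model Overview

Vanilla Transformer

Problem to solve: how to encode a context of arbitrary length into a fixed-size representation.

Approach 1: given unlimited memory and compute, process the entire context sequence with an unconditional Transformer, similar to a feed-forward neural network.

Approach 2: the feasible alternative is to split the whole corpus into shorter segments of manageable size and train the model only within each segment, ignoring all context from earlier segments.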

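Problem: information cannot flow across segment boundaries, in either the forward or the backward pass.

Limitation 1: the longest dependency the model can capture is bounded by the segment length.

Limitation 2: simply chopping the sequence into fixed-length segments causes the context fragmentation problem, because the first tokens of each segment are predicted with little or no preceding context.

During evaluation, prediction uses a sliding-window-like procedure: the window shifts by one position at each step, and only one word is predicted per step.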
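
The two procedures above (fixed-length segmentation for training, one-token sliding window for evaluation) can be sketched in a few lines of Python. This is only an illustration: the function names `split_into_segments` and `sliding_window_contexts` are hypothetical, and the "model" is omitted entirely so that only the data layout is shown.

```python
def split_into_segments(tokens, seg_len):
    """Cut a token list into consecutive fixed-length segments.

    The last segment may be shorter. No information is shared
    between segments, which is the source of context fragmentation.
    """
    return [tokens[i:i + seg_len] for i in range(0, len(tokens), seg_len)]


def sliding_window_contexts(tokens, seg_len):
    """Yield (context, target) pairs for evaluation.

    The window moves one token per step and only the token right
    after the window is predicted, so every context is rebuilt
    from scratch at each step.
    """
    for t in range(1, len(tokens)):
        start = max(0, t - seg_len)
        yield tokens[start:t], tokens[t]


tokens = list("abcdefghij")          # toy corpus of 10 tokens
segments = split_into_segments(tokens, 4)
pairs = list(sliding_window_contexts(tokens, 4))
```

Note that in `segments` the token `e` opens a new segment and therefore sees no context during training, while in `pairs` consecutive contexts overlap in all but one token, which is why sliding-window evaluation is costly.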

Segment-level Recurrence with state reuse

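Problem addressed: the limitations of using a fixed-length context.

Method: introduce a recurrence mechanism into the Transformer architecture.

During training, the hidden state sequence computed for the previous segment is fixed and cached, and then reused as extended context when the model processes the next segment. This lets the model exploit information from earlier segments.

Computation: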
$\begin{aligned} &\widetilde{\mathbf{h}}_{\tau+1}^{n-1}=\left[\mathrm{SG}\left(\mathbf{h}_\tau^{n-1}\right) \circ \mathbf{h}_{\tau+1}^{n-1}\right] \\ &\mathbf{q}_{\tau+1}^n, \mathbf{k}_{\tau+1}^n, \mathbf{v}_{\tau+1}^n=\mathbf{h}_{\tau+1}^{n-1} \mathbf{W}_q^{\top}, \widetilde{\mathbf{h}}_{\tau+1}^{n-1} \mathbf{W}_k^{\top}, \widetilde{\mathbf{h}}_{\tau+1}^{n-1} \mathbf{W}_v^{\top}, \\ &\mathbf{h}_{\tau+1}^n=\text { Transformer-Layer }\left(\mathbf{q}_{\tau+1}^n, \mathbf{k}_{\tau+1}^n, \mathbf{v}_{\tau+1}^n\right) . \end{aligned}$
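where:

$n$: the layer index (the $n$-th layer);
$\tau$: the segment index;
$h$: the hidden state;
$S G (\cdot)$: stop-gradient;
$[h_u\circ h_v]$: concatenation of two hidden state sequences along the length dimension.

The main difference from the standard Transformer is that $k$ and $v$ are computed from the extended context, while $q$ is computed from the current segment only.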
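
As a rough illustration of the equations above, the following NumPy sketch builds $q$ from the current segment and $k, v$ from the concatenation of the cached and current hidden states. Several things here are assumptions for the sake of a runnable example: a single attention head, standard $1/\sqrt{d}$ scaling, random weights, and the hypothetical name `attend_with_memory`. NumPy has no autograd, so $\mathrm{SG}(\cdot)$ is a no-op here; in a deep learning framework the cached states would be detached from the graph.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seg_len, mem_len = 8, 4, 4
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))


def attend_with_memory(h_prev, h_cur):
    """Single-head attention where keys/values see the cached segment."""
    # [SG(h_tau) o h_{tau+1}]: concatenate along the length dimension.
    h_ext = np.concatenate([h_prev, h_cur], axis=0)
    q = h_cur @ W_q.T   # queries: current segment only
    k = h_ext @ W_k.T   # keys: extended context
    v = h_ext @ W_v.T   # values: extended context
    scores = q @ k.T / np.sqrt(d_model)
    n_q, n_k = scores.shape
    # Causal mask: query i sits at position (n_k - n_q + i) of the
    # extended context and must not attend to later positions.
    future = np.triu(np.ones((n_q, n_k), dtype=bool), k=n_k - n_q + 1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights, weights @ v


h_prev = rng.normal(size=(mem_len, d_model))  # cached states of segment tau
h_cur = rng.normal(size=(seg_len, d_model))   # states of segment tau + 1
weights, out = attend_with_memory(h_prev, h_cur)
```

The attention matrix has shape `(seg_len, mem_len + seg_len)`: each query can reach every cached position plus the current positions up to its own.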

During evaluation, the representations of previous segments can be reused instead of being recomputed.
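
One way to picture this reuse is a cache that is updated after every segment. The sketch below is an assumption-laden toy, not the authors' code: the helper `update_memory` and the cap `mem_len` on the number of cached states are hypothetical, and the "hidden states" are just a stand-in array rather than real model outputs.

```python
import numpy as np


def update_memory(memory, hidden, mem_len):
    """Append the new segment's states and keep the most recent mem_len rows."""
    extended = np.concatenate([memory, hidden], axis=0)
    return extended[-mem_len:]


d_model, seg_len, mem_len = 4, 3, 5
# Stand-in hidden states for a stream of 9 tokens (one row per token).
stream = np.arange(9 * d_model, dtype=float).reshape(9, d_model)

memory = np.zeros((0, d_model))  # empty cache before the first segment
for start in range(0, len(stream), seg_len):
    hidden = stream[start:start + seg_len]
    # A real model would attend over [memory, hidden] here.
    memory = update_memory(memory, hidden, mem_len)
```

Each token's state is computed once and then served from the cache, in contrast to the sliding-window procedure, which rebuilds the whole window for every predicted token.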

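Relative Positional Encoding

Problem: if the positional encoding of the original Transformer is kept, the model cannot tell positions in $s_\tau$ apart from positions in $s_{\tau+1}$, because both segments receive the same encodings $U_{1:L}$: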
$\begin{aligned} \mathbf{h}_{\tau+1} &=f\left(\mathbf{h}_\tau, \mathbf{E}_{\mathbf{s}_{\tau+1}}+\mathbf{U}_{1: L}\right) \\ \mathbf{h}_\tau &=f\left(\mathbf{h}_{\tau-1}, \mathbf{E}_{\mathbf{s}_\tau}+\mathbf{U}_{1: L}\right) \end{aligned}$
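Solution: encode only relative positional information in the hidden states.

Inject this information into the attention score of every layer, instead of statically adding a bias to the initial embeddings.
For $q_{\tau,i}$ and $k_{\tau,j}$, only their relative distance matters, i.e. $i - j$.
Absolute positions can be recovered recursively from relative distances.

With absolute positional encoding, the attention score between query vector $q_i$ and key vector $k_j$ decomposes as: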
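
A toy index computation makes the contrast concrete. It assumes one cached segment and one current segment of equal length `seg_len`; all variable names are illustrative.

```python
import numpy as np

seg_len = 4

# Absolute scheme: every segment is indexed 1..L, so the cached
# segment and the current segment get identical position ids.
abs_pos_prev = np.arange(1, seg_len + 1)
abs_pos_cur = np.arange(1, seg_len + 1)
same_absolute = np.array_equal(abs_pos_prev, abs_pos_cur)

# Relative scheme: keys cover the cached + current segment,
# queries cover the current segment only.
key_pos = np.arange(2 * seg_len)
query_pos = np.arange(seg_len, 2 * seg_len)
rel_dist = query_pos[:, None] - key_pos[None, :]  # entry (i, j) holds i - j
```

For the first query of the current segment, `rel_dist[0]` is `[4, 3, 2, 1, 0, -1, -2, -3]`: the first cached token is at distance 4 and the query itself at distance 0, so the two segments are no longer confused. Negative distances point to future tokens, which a causal mask would hide.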
$\begin{aligned} \mathbf{A}_{i, j}^{\mathrm{abs}} &=\underbrace{\mathbf{E}_{x_i}^{\top} \mathbf{W}_q^{\top} \mathbf{W}_k \mathbf{E}_{x_j}}_{(a)}+\underbrace{\mathbf{E}_{x_i}^{\top} \mathbf{W}_q^{\top} \mathbf{W}_k \mathbf{U}_j}_{(b)} +\underbrace{\mathbf{U}_i^{\top} \mathbf{W}_q^{\top} \mathbf{W}_k \mathbf{E}_{x_j}}_{(c)}+\underbrace{\mathbf{U}_i^{\top} \mathbf{W}_q^{\top} \mathbf{W}_k \mathbf{U}_j}_{(d)} . \end{aligned}$
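
The four-term expansion follows from writing the query as $\mathbf{W}_q(\mathbf{E}_{x_i}+\mathbf{U}_i)$ and the key as $\mathbf{W}_k(\mathbf{E}_{x_j}+\mathbf{U}_j)$ and multiplying out. A quick numerical check with random vectors (all values below are arbitrary placeholders) confirms it:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
E_i, E_j, U_i, U_j = (rng.normal(size=d) for _ in range(4))
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))

# Direct score: q_i^T k_j with position added to the embedding.
q_i = W_q @ (E_i + U_i)
k_j = W_k @ (E_j + U_j)
full = q_i @ k_j

# The four terms of the decomposition.
term_a = E_i @ W_q.T @ W_k @ E_j  # content-content
term_b = E_i @ W_q.T @ W_k @ U_j  # content-position
term_c = U_i @ W_q.T @ W_k @ E_j  # position-content
term_d = U_i @ W_q.T @ W_k @ U_j  # position-position

decomposed = term_a + term_b + term_c + term_d
```

Here `full` and `decomposed` agree up to floating-point error.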
With relative positional encoding, the score becomes:
$\begin{aligned} \mathbf{A}_{i, j}^{\mathrm{rel}} &=\underbrace{\mathbf{E}_{x_i}^{\top} \mathbf{W}_q^{\top} \mathbf{W}_{k, E} \mathbf{E}_{x_j}}_{(a)}+\underbrace{\mathbf{E}_{x_i}^{\top} \mathbf{W}_q^{\top} \mathbf{W}_{k, R} \mathbf{R}_{i-j}}_{(b)} +\underbrace{u^{\top} \mathbf{W}_{k, E} \mathbf{E}_{x_j}}_{(c)}+\underbrace{v^{\top} \mathbf{W}_{k, R} \mathbf{R}_{i-j}}_{(d)} . \end{aligned}$

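$R_{i-j}$ denotes the relative positional encoding.

The changes are:

The absolute positional encoding $U_j$ is replaced by the relative positional encoding $R_{i-j}$;
The query terms $U_i^\top W_q^\top$ are replaced by trainable parameters: $u$ in term $(c)$ and $v$ in term $(d)$;
$W_k$ is split into $W_{k,E}$ and $W_{k,R}$, which produce the content-based and the position-based key vectors respectively.

Meaning of each term:

$(a)$: content-based addressing;
$(b)$: a content-dependent positional bias;
$(c)$: a global content bias;
$(d)$: a global positional bias.

The attention computation of Transformer-XL with relative positional encoding: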
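
The four terms can be computed directly for a small example. The sketch below is a deliberately naive double loop over one segment without memory. It assumes that $R$ is a sinusoidal encoding of the offset $i-j$ (the notes above do not specify how $R$ is built), uses random placeholder weights, and the names `sinusoid_embedding` and `relative_scores` are hypothetical. `d_model` must be even.

```python
import numpy as np


def sinusoid_embedding(offsets, d_model):
    """Sinusoidal encoding of integer offsets; shape (len(offsets), d_model)."""
    inv_freq = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
    angles = np.outer(offsets, inv_freq)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


def relative_scores(E, W_q, W_kE, W_kR, u, v):
    """Unnormalised relative attention scores A[i, j] = (a) + (b) + (c) + (d)."""
    L, d_model = E.shape
    q = E @ W_q.T           # row i is W_q E_{x_i}
    k_content = E @ W_kE.T  # row j is W_{k,E} E_{x_j}
    A = np.zeros((L, L))
    for i in range(L):
        for j in range(L):
            r = W_kR @ sinusoid_embedding(np.array([i - j]), d_model)[0]
            term_a = q[i] @ k_content[j]  # content-based addressing
            term_b = q[i] @ r             # content-dependent positional bias
            term_c = u @ k_content[j]     # global content bias
            term_d = v @ r                # global positional bias
            A[i, j] = term_a + term_b + term_c + term_d
    return A


rng = np.random.default_rng(0)
L, d_model = 4, 8
W_q, W_kE, W_kR = (rng.normal(size=(d_model, d_model)) for _ in range(3))
u, v = rng.normal(size=d_model), rng.normal(size=d_model)

# Same token embedding at every position, so only position can vary.
E = np.tile(rng.normal(size=d_model), (L, 1))
A = relative_scores(E, W_q, W_kE, W_kR, u, v)
```

Because every position holds the same token here, `A[i, j]` depends only on `i - j` (for example `A[1, 0]` equals `A[2, 1]`), which is the shift invariance that relative encoding is meant to provide. Entries with `j > i` would be removed by the causal mask in practice.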
$\begin{aligned} \widetilde{\mathbf{h}}_\tau^{n-1}=& {\left[\mathrm{SG}\left(\mathbf{m}_\tau^{n-1}\right) \circ \mathbf{h}_\tau^{n-1}\right] } \\ \mathbf{q}_\tau^n, \mathbf{k}_\tau^n, \mathbf{v}_\tau^n=& \mathbf{h}_\tau^{n-1} \mathbf{W}_q^{n \top}, \widetilde{\mathbf{h}}_\tau^{n-1} \mathbf{W}_{k, E}^n{ }^{\top}, \widetilde{\mathbf{h}}_\tau^{n-1} \mathbf{W}_v^{n \top} \\ \mathbf{A}_{\tau, i, j}^n=& \mathbf{q}_{\tau, i}^n{ }^{\top} \mathbf{k}_{\tau, j}^n+\mathbf{q}_{\tau, i}^n{ }^{\top} \mathbf{W}_{k, R}^n \mathbf{R}_{i-j} \\ &+u^{\top} \mathbf{k}_{\tau, j}+v^{\top} \mathbf{W}_{k, R}^n \mathbf{R}_{i-j} \\ \mathbf{a}_\tau^n=& \text { Masked-Softmax }\left(\mathbf{A}_\tau^n\right) \mathbf{v}_\tau^n \\ \mathbf{o}_\tau^n=& \text { LayerNorm }\left(\text { Linear }\left(\mathbf{a}_\tau^n\right)+\mathbf{h}_\tau^{n-1}\right) \\ \mathbf{h}_\tau^n=& \text { Positionwise-Feed-Forward }\left(\mathbf{o}_\tau^n\right) \end{aligned}$
where