Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (Paper Walkthrough)

K24B;

Article link: https://blog.csdn.net/weixin_64017116/article/details/133177447

Background

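A sequence model's ability to capture long-term dependencies is crucial for any NLP task. LSTMs use gating to extend the range of dependencies an RNN can capture to roughly 200 words, and the Transformer strengthened this ability further. But is the Transformer's ability to capture long-term dependencies unlimited? Clearly not.

This paper, too, revolves around the core question of how sequence models capture long-term dependencies.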

Abstract

We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem.

The core of Transformer-XL consists of two parts:

Segment-level recurrence mechanism
Relative positional encoding

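The improvements Transformer-XL brings include:

Better modeling of long-range dependencies in a sequence (capturing longer-term dependency)
Resolving the context fragmentation problem
Faster evaluation and higher prediction accuracy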

Model

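NLP tasks inevitably have to handle variable-length input, and there are generally two ways to deal with it. The first is to feed the whole input into a model that processes it in one pass, much like a feed-forward network, to obtain a fixed-size feature vector; this is usually too expensive in compute and memory to be practical. The second is to split the input into fixed-length segments by truncation or padding. The Transformer uses the second approach, with a segment length L that is typically 512.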
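To make the second approach concrete, here is a minimal sketch of splitting a token sequence into fixed-length segments. The function name `chunk_into_segments` and the `pad_token` parameter are hypothetical names chosen for illustration, not taken from the paper or from any library.

```python
def chunk_into_segments(tokens, segment_len, pad_token=None):
    """Split a token list into consecutive fixed-length segments.

    If pad_token is given, the final segment is padded up to segment_len;
    otherwise a shorter final segment is kept as it is.
    """
    segments = [tokens[i:i + segment_len]
                for i in range(0, len(tokens), segment_len)]
    if pad_token is not None and segments and len(segments[-1]) < segment_len:
        missing = segment_len - len(segments[-1])
        segments[-1] = segments[-1] + [pad_token] * missing
    return segments


tokens = list(range(10))  # stand-in for token ids
segments = chunk_into_segments(tokens, segment_len=4, pad_token=-1)
print(segments)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, -1, -1]]
```

Note that the split ignores sentence boundaries: token 4 starts a new segment with no access to tokens 0 to 3.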

3.1 Vanilla Transformer Language Models

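The vanilla Transformer language model is a variant of the Transformer that differs little from the original. It takes the second approach described above.

This approach, however, has two shortcomings in the training phase: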

First, the largest possible dependency length is upper bounded by the segment length, which is a few hundred on character-level language modeling.

Second, though it is possible to use padding to respect the sentence or other semantic boundaries, in practice it has been standard practice to simply chunk long text into fixed-length segments due to improved efficiency. However, simply chunking a sequence into fixed-length segments will lead to the context fragmentation problem as discussed in Section 1.

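First, the dependency length is short: it can never exceed the segment length L.

Second, cutting long text into separate segments causes the context fragmentation problem.

There are problems in the evaluation phase as well: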

During evaluation, at each step, the vanilla model also consumes a segment of the same length as in training, but only makes one prediction at the last position. Then, at the next step, the segment is shifted to the right by only one position, and the new segment has to be processed all from scratch.

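Although this approach addresses the model's training-phase shortcomings, since every prediction sees the longest context used in training, it has a major drawback of its own: evaluation is extremely expensive.
[Figure omitted]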
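The cost of this sliding-window evaluation can be seen by listing the windows it processes. The sketch below is only an illustration under simple assumptions (half-open index ranges, shorter windows at the very start of the text); `vanilla_eval_windows` is a hypothetical helper name.

```python
def vanilla_eval_windows(num_tokens, segment_len):
    """Return the (start, end) window processed at each prediction step.

    Each window holds at most segment_len tokens and ends at the position
    where the single prediction is made; successive windows are shifted
    right by one token and are re-encoded from scratch.
    """
    windows = []
    for end in range(1, num_tokens + 1):
        start = max(0, end - segment_len)
        windows.append((start, end))
    return windows


windows = vanilla_eval_windows(num_tokens=6, segment_len=3)
total_processed = sum(end - start for start, end in windows)
print(windows)          # [(0, 1), (0, 2), (0, 3), (1, 4), (2, 5), (3, 6)]
print(total_processed)  # 15 token positions encoded for only 6 tokens
```

For long texts the work grows roughly as the number of tokens times the segment length, because almost every token is re-encoded in `segment_len` different windows.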

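Transformer-XL is designed to solve the problems of the model above, using a segment-level recurrence mechanism and relative positional encoding, respectively.

Before going further, let us look at what absolute and relative positional encodings are.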

Absolute positional encoding

[Figure omitted]
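Since the figure is not available, the following sketch recalls the sinusoidal absolute encoding of the original Transformer, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). It is an illustration rather than the content of the missing image; the helper name `sinusoidal_encoding` is hypothetical and `d_model` is assumed to be even.

```python
import math


def sinusoidal_encoding(position, d_model):
    """Sinusoidal encoding of a single position (d_model assumed even)."""
    vec = []
    for i in range(0, d_model, 2):          # i is the even dimension index 2k
        angle = position / (10000 ** (i / d_model))
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec


segment_len, d_model = 4, 8
# Positions restart at 0 in every segment, so two different segments
# receive exactly the same absolute encodings.
seg_a = [sinusoidal_encoding(p, d_model) for p in range(segment_len)]
seg_b = [sinusoidal_encoding(p, d_model) for p in range(segment_len)]
print(seg_a == seg_b)  # True
```

If hidden states from a previous segment are reused, the model therefore cannot tell position 0 of one segment from position 0 of the next by looking at absolute encodings alone.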

Relative positional encoding

[Figure omitted]
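A relative scheme encodes the distance between the querying position and the position being attended to, rather than each position's index. The sketch below only builds the matrix of distances i - j for a current segment that also attends to cached positions; it does not reproduce the paper's full attention-score formulation. The names `relative_distances`, `mem_len`, and `seg_len` are hypothetical.

```python
def relative_distances(mem_len, seg_len):
    """Distance i - j from each query i in the current segment to each key j.

    Keys cover the cached positions followed by the current segment.
    None marks keys that lie in the future of the query.
    """
    rows = []
    for q in range(seg_len):
        q_abs = mem_len + q                 # query index within memory + segment
        row = [q_abs - k if k <= q_abs else None
               for k in range(mem_len + seg_len)]
        rows.append(row)
    return rows


for row in relative_distances(mem_len=2, seg_len=3):
    print(row)
# [2, 1, 0, None, None]
# [3, 2, 1, 0, None]
# [4, 3, 2, 1, 0]
```

The matrix depends only on offsets, so it is identical for every segment, which is what keeps positional information consistent when states are reused across segments.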

3.2 Segment-Level Recurrence with State Reuse

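In the training phase, Transformer-XL, like the Transformer, is trained one segment at a time, but it adds a recurrence mechanism: the hidden states computed for the current segment are cached, and when the next segment is trained it reuses them. This overcomes the Transformer's inability to model dependencies longer than one segment, as illustrated in the figure:
[Figure omitted]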

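Clearly, unlike in Transformer training, the segments are no longer cut off from one another: the current segment uses the hidden states of the previous segment, which both strengthens long-term dependency modeling and avoids context fragmentation.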
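The idea can be sketched with a single attention layer in plain Python. This is a simplified illustration, not the authors' implementation: it uses one head, omits relative positional encoding, residual connections, and the feed-forward sublayer, and uses random weights. All function and variable names are hypothetical. In a real training setup the cached states would be treated as constants, with no gradient flowing into them.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend_with_memory(h_cur, memory, w_q, w_k, w_v):
    """Causal self-attention whose keys and values also cover the cache."""
    mem_len = len(memory)
    extended = memory + h_cur           # concatenate along the length axis
    q = matmul(h_cur, w_q)              # queries: current segment only
    k = matmul(extended, w_k)           # keys: cached + current states
    v = matmul(extended, w_v)           # values: cached + current states
    d = len(q[0])
    out = []
    for i, q_i in enumerate(q):
        visible = mem_len + i + 1       # whole cache plus positions <= i
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d)
                  for k_j in k[:visible]]
        weights = softmax(scores)
        out.append([sum(w * v_j[c] for w, v_j in zip(weights, v[:visible]))
                    for c in range(d)])
    return out


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d_model, seg_len = 4, 3
w_q, w_k, w_v = (rand_matrix(d_model, d_model) for _ in range(3))
segments = [rand_matrix(seg_len, d_model) for _ in range(3)]  # layer inputs

memory = []                             # nothing is cached before segment 1
outputs = []
for h_cur in segments:
    outputs.append(attend_with_memory(h_cur, memory, w_q, w_k, w_v))
    memory = h_cur                      # cache this segment's layer input
```

Queries come only from the current segment, while keys and values span the cache as well, so each position can attend to the previous segment without that segment being recomputed.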

The precise formulas are:
[Figure omitted]
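The formula image is missing here. As a reconstruction from Section 3.2 of the Transformer-XL paper (worth checking against the paper itself), the recurrence for layer n of segment τ+1 can be written as below, where SG(·) denotes stop-gradient, [· ∘ ·] denotes concatenation of two hidden-state sequences along the length dimension, and W_q, W_k, W_v are the projection matrices:

```latex
\begin{aligned}
\tilde{h}_{\tau+1}^{\,n-1} &= \left[\mathrm{SG}\!\left(h_{\tau}^{\,n-1}\right) \circ h_{\tau+1}^{\,n-1}\right] \\
q_{\tau+1}^{\,n},\; k_{\tau+1}^{\,n},\; v_{\tau+1}^{\,n} &= h_{\tau+1}^{\,n-1} W_q^{\top},\; \tilde{h}_{\tau+1}^{\,n-1} W_k^{\top},\; \tilde{h}_{\tau+1}^{\,n-1} W_v^{\top} \\
h_{\tau+1}^{\,n} &= \text{Transformer-Layer}\!\left(q_{\tau+1}^{\,n},\, k_{\tau+1}^{\,n},\, v_{\tau+1}^{\,n}\right)
\end{aligned}
```

The key point is that the keys and values are computed from the extended context, which includes the cached states of the previous segment, while the queries come from the current segment alone.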

In the evaluation phase:

Specifically, during evaluation, the representations from the previous segments can be reused instead of being computed from scratch as in the case of the vanilla model.

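During evaluation, the vanilla Transformer shifts right by only one token at a time and recomputes everything from scratch. Transformer-XL instead reuses the representations of the previous segment rather than recomputing them, and it advances one whole segment per step. This greatly improves efficiency.
[Figure omitted]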
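A rough way to see the saving is to write out which positions are newly encoded and which are reused at each step. The sketch below is a counting illustration only; `cached_eval_schedule` and `mem_len` (the number of cached positions kept) are hypothetical names, and index ranges are half-open.

```python
def cached_eval_schedule(num_tokens, segment_len, mem_len):
    """Return (reused_range, computed_range) for each evaluation step."""
    schedule = []
    for start in range(0, num_tokens, segment_len):
        end = min(start + segment_len, num_tokens)
        reused = (max(0, start - mem_len), start)   # read from the cache
        schedule.append((reused, (start, end)))     # encoded exactly once
    return schedule


schedule = cached_eval_schedule(num_tokens=10, segment_len=4, mem_len=4)
for reused, computed in schedule:
    print("reuse", reused, "compute", computed)
# reuse (0, 0) compute (0, 4)
# reuse (0, 4) compute (4, 8)
# reuse (4, 8) compute (8, 10)

computed_positions = sum(end - start for _, (start, end) in schedule)
# Sliding a window one token at a time would instead re-encode up to
# segment_len positions for every single prediction.
sliding_positions = sum(min(t, 4) for t in range(1, 11))
print(computed_positions, sliding_positions)  # 10 34
```

Even in this tiny example, each token is encoded once with caching, versus more than three times on average with the sliding window; the gap widens as the segment length grows.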