Abstract
The Transformer architecture has the potential to learn long-range dependencies, but in language modeling it is limited by a fixed-length context. The Transformer-XL (extra long) architecture proposed in the paper breaks this limitation, learning longer-range dependencies while preserving coherence. Transformer-XL introduces two main innovations:
- a segment-level recurrence mechanism introduced into the deep self-attention network (see the sketch after this list)
- a novel relative positional encoding scheme
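The segment-level recurrence idea can be illustrated with a minimal sketch, assuming single-head attention and toy dimensions: hidden states computed for the previous segment are cached, detached from the graph (the SG(·) operation in the paper), and reused as extra context when the next segment is processed. The names `attend_with_memory`, `seg_len`, and `mem_len` are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

d_model, seg_len, mem_len = 16, 4, 4

def attend_with_memory(h_curr, mem):
    """Single-head self-attention where keys/values span [memory, current segment]."""
    # Queries come only from the current segment.
    q = h_curr
    # Keys/values also cover the cached previous segment (the "memory").
    kv = torch.cat([mem, h_curr], dim=0) if mem is not None else h_curr
    attn = F.softmax(q @ kv.t() / d_model ** 0.5, dim=-1)
    return attn @ kv

mem = None
for step in range(3):                      # three consecutive segments
    h = torch.randn(seg_len, d_model, requires_grad=True)
    out = attend_with_memory(h, mem)
    # Cache the current segment's states for the next segment,
    # with gradients stopped so the graph does not grow across segments.
    mem = h.detach()[-mem_len:]
    print(step, out.shape)                 # torch.Size([4, 16])
```

Because each segment can attend to the cached states of the previous one, the effective context grows with depth instead of being cut off at the segment boundary, which is also why cached states can be reused at evaluation time for the speed-up noted below.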
The main effects achieved:
- captures longer-range dependencies (80% longer than RNNs and 450% longer than vanilla Transformers)
- resolves the context fragmentation problem
- faster evaluation (1,800+ times faster than vanilla Transformers)
Note: the vanilla Transformer used for comparison in the paper is the character-level Transformer language model of Al-Rfou et al. (2018).