Transformer-XL Explained (Paper + PyTorch Source Code)
Preface
In NLP, two state-of-the-art architectures dominate language modeling: RNNs and the Transformer. An RNN processes the input word by word (or character by character) in sequence order and learns the relations between them, while the Transformer takes in an entire segment at once and uses self-attention to learn the dependencies within it. Both have achieved impressive results, but both are limited in how well they capture long-term dependencies.
To address this, CMU and Google Brain published a paper in January 2019, 《Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context》, which combines the strengths of RNN-style sequence modeling and the Transformer's self-attention: it applies Transformer attention within each segment of the input and uses a recurrence mechanism to learn dependencies between consecutive segments. Transformer-XL reaches state-of-the-art results on several language modeling datasets (such as the word-level WikiText-103 and the character-level enwik8 and text8), and it is also much faster at inference, 300~1800x faster than the previous best Transformer-based language models. The authors also released the accompanying source code (both TensorFlow and PyTorch), pretrained models, and the hyperparameters used on each dataset, which is very generous and a real gift for those of us who just want to use it.
This post walks through the model and its PyTorch implementation side by side. My understanding is limited, so for anything not covered in enough detail, please follow the links at the end of the post; corrections are welcome.
I. Revisiting the Transformer
In NLP, the most common model for language modeling used to be the RNN, which can capture dependencies between words. But RNNs are hard to train because of vanishing and exploding gradients, and LSTM cells plus gradient clipping do not fully solve the problem. RNNs are also slow to compute, and their ability to learn long-term dependencies is limited (the paper notes that LSTM language models effectively use only about 200 context words on average).
In June 2017, Google Brain proposed the Transformer in the paper 《Attention Is All You Need》. It drops the recurrence of RNNs entirely and processes the sequence globally with self-attention: it takes an entire segment as input and uses three trainable weight matrices, for Query, Key, and Value, to learn the dependencies between all parts of the input at once. The Transformer is built from stacked layers, each consisting of multi-head attention followed by a feed-forward network. Because attention is computed globally, it ignores position, which matters a great deal in a sequence; so the Transformer adds a positional encoding (Positional Encoding) to the input, built from sine functions, producing one fixed, non-learned vector per position to give the network information about where each token sits. The architecture is sketched in the figure below:
For a more in-depth discussion of the Transformer, see my earlier post.
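As a quick refresher, here is a minimal, self-contained sketch of single-head scaled dot-product self-attention plus the fixed sinusoidal positional encoding. It uses toy dimensions and random weights and is purely illustrative, not the paper's code:

```python
import math
import torch
import torch.nn.functional as F

def sinusoidal_encoding(seq_len, d_model):
    """Fixed (non-learned) sine/cosine positional encoding, one vector per position."""
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)              # seq_len x 1
    inv_freq = 1 / (10000 ** (torch.arange(0.0, d_model, 2.0) / d_model))    # d_model/2
    angles = pos * inv_freq                                                  # seq_len x d_model/2
    return torch.cat([angles.sin(), angles.cos()], dim=-1)                   # seq_len x d_model

def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a whole segment."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))                 # seq_len x seq_len
    return F.softmax(scores, dim=-1) @ v

seq_len, d_model = 6, 16
x = torch.randn(seq_len, d_model) + sinusoidal_encoding(seq_len, d_model)
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)   # seq_len x d_model
```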
II. Vanilla Transformer
Why mention this model? Because Transformer-XL is an improvement built directly on top of it.
Al-Rfou et al. proposed a way to train a language model with a Transformer ( https://arxiv.org/abs/1808.04444 ): predict the next character of a segment from the characters before it. For example, it uses $x_1, x_2, ..., x_{n-1}$ to predict the character $x_n$, while the sequence after $x_n$ is masked out. The paper uses a 64-layer model restricted to relatively short inputs of 512 characters, so it splits the input into segments and learns from each segment separately, as shown in the figure below. At test time, to handle longer inputs, the model shifts the input one character to the right at every step, predicting a single character at a time.
The model beats RNNs on the usual benchmarks such as enwik8 and text8, but it still has the following three drawbacks:
a. Limited context length: the maximum dependency distance between characters is bounded by the input length, so the model cannot see words that appeared a few sentences earlier.
b. Context fragmentation: any text longer than 512 characters is split into segments that are each trained from scratch, with no contextual dependency between segments; this hurts both training efficiency and model quality.
c. Slow inference: at test time, every next-character prediction requires rebuilding the context and recomputing it from scratch, which is extremely slow.
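To see why (c) is so costly, here is a schematic sketch of the vanilla evaluation loop (with a hypothetical `model` and pre-tokenized `text_ids`; this is not the paper's code): the full 512-character window is re-encoded at every step just to predict one character.

```python
import torch

def sliding_eval(model, text_ids, seg_len=512):
    """Vanilla-Transformer-style evaluation: shift the window by one position
    per prediction and recompute the whole context from scratch each time."""
    log_probs = []
    for i in range(seg_len, len(text_ids)):
        ctx = text_ids[i - seg_len:i].unsqueeze(0)        # 1 x seg_len, rebuilt every step
        logits = model(ctx)                               # 1 x seg_len x vocab (assumed shape)
        lp = torch.log_softmax(logits[0, -1], dim=-1)     # distribution for position i
        log_probs.append(lp[text_ids[i]])
    return -torch.stack(log_probs).mean()                 # average NLL over predicted characters
```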
III. Transformer-XL
Transformer-XL introduces two innovations on top of the vanilla Transformer to overcome its drawbacks: a recurrence mechanism (Recurrence Mechanism) and relative positional encoding (Relative Positional Encoding). Compared with the vanilla Transformer, another advantage of Transformer-XL is that it can be used for both word-level and character-level language modeling.
1. The recurrence mechanism
Like the vanilla Transformer, Transformer-XL still models the input segment by segment, but the essential difference is the recurrence it introduces between segments: when modeling the current segment, the model can reuse information from previous segments to build long-term dependencies. As shown below:
During training, when processing a later segment, each hidden layer receives two inputs:
- the output of the previous hidden layer for the current segment, as in the vanilla Transformer (the grey lines in the figure above);
- the output of the same hidden layer for the previous segment (the green lines in the figure above), which is what allows the model to build long-term dependencies.
These two inputs are concatenated and then used to compute the Key and Value matrices of the current segment. For one layer of one segment, the computation is:
where $\tau$ indexes the segment, $n$ indexes the layer, and $h$ denotes a hidden-layer output. $SG(\cdot)$ means stop-gradient, $[h_u \circ h_v]$ denotes the concatenation of two hidden states along the length dimension, and $W_\cdot$ are model parameters. At first glance this looks just like the standard Transformer computation; the one crucial difference lies in the Key and Value matrices, $k_{\tau+1}^n$ and $v_{\tau+1}^n$: they are computed from the extended context hidden state $\tilde{h}_{\tau+1}^{n-1}$, where $h_{\tau}^{n-1}$ is the cached hidden state of the previous segment.
In principle, as long as GPU memory allows, this method can draw on as many previous segments as desired, and at test time it likewise gives access to longer dependencies.
At test time it is also faster than the vanilla Transformer. The vanilla Transformer can only advance one step at a time and must rebuild the segment and recompute everything from scratch, whereas Transformer-XL advances a whole segment at a time and reuses the cached data of previous segments when predicting the current segment's outputs.
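A minimal sketch of the recurrence idea (not the repo code; `ToyLayer` is a stand-in attention layer with causal masking omitted, and the shapes are toy-sized): the previous segment's hidden states are cached, detached from the graph (the $SG(\cdot)$ above), and concatenated in front of the current segment before computing keys and values.

```python
import torch
import torch.nn as nn

class ToyLayer(nn.Module):
    """Stand-in for one Transformer-XL layer: queries come from the current
    segment h, keys/values come from the extended context [SG(mem) ∘ h]."""
    def __init__(self, d):
        super().__init__()
        self.q = nn.Linear(d, d, bias=False)
        self.kv = nn.Linear(d, 2 * d, bias=False)

    def forward(self, h, ctx):
        q = self.q(h)                                    # qlen x bsz x d
        k, v = self.kv(ctx).chunk(2, dim=-1)             # klen x bsz x d each
        attn = torch.softmax(torch.einsum('ibd,jbd->ijb', q, k) / q.size(-1) ** 0.5, dim=1)
        return torch.einsum('ijb,jbd->ibd', attn, v)     # qlen x bsz x d

def segment_forward(layers, seg_input, mems):
    """Run one segment; return its output and the detached caches for the next segment."""
    if mems is None:
        mems = [None] * len(layers)
    hids, h = [], seg_input
    for layer, mem in zip(layers, mems):
        hids.append(h)                                   # what gets cached for this layer
        ctx = h if mem is None else torch.cat([mem, h], dim=0)   # [SG(mem) ∘ h]
        h = layer(h, ctx)
    new_mems = [x.detach() for x in hids]                # SG(·): no gradient flows into the cache
    return h, new_mems

# two consecutive segments of length 4, batch size 2, hidden dim 8
layers = nn.ModuleList([ToyLayer(8) for _ in range(2)])
mems = None
for seg in torch.randn(2, 4, 2, 8):                     # iterate over segments
    out, mems = segment_forward(layers, seg, mems)
```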
2. Relative positional encoding
An important ingredient of the Transformer is that it encodes the positions in the sequence. With segment-level modeling, if we simply kept the Transformer's positional encoding within each segment, the same position in different segments would get the same encoding, and that causes problems. For example, the first position of segment $i-2$ and the first position of segment $i-1$ would share the same positional encoding, yet their importance for modeling segment $i$ is clearly different (the first position of segment $i-2$ presumably matters less). So these positions need to be distinguished.
To solve this, the paper proposes a new positional encoding scheme based on the relative distance between words rather than the Transformer's absolute positions. In the Transformer, the attention score between query $q_i^T$ and key $k_j$ at the first layer is computed as:
where $E_{x_i}$ is the embedding of word $i$, $E_{x_j}$ is the embedding of word $j$, and $U_i$ and $U_j$ are the position vectors. This expression is just the expansion of $(W_q(E_{x_i}+U_i))^T \cdot (W_k(E_{x_j}+U_j))$, i.e. the standard Transformer form.
In Transformer-XL, this attention computation is transformed into a relative-position form, and it is applied not only at the first layer but at every layer.
Compared with the original, there are three main changes:
- In terms (b) and (d), every absolute position vector $U_j$ is replaced by a relative position vector $R_{i-j}$. As in the Transformer, this is a fixed encoding vector, not learned.
- In term (c), the query-side $U_i^T W_q^T$ is replaced by a learnable parameter vector $u$: once positions are relative, the query's absolute position $i$ no longer matters, so the same vector can be used for every $i$. Likewise, in term (d) the query-side $U_i^T W_q^T$ is replaced by another learnable parameter vector $v$.
- The key projection matrix $W_k$ is split into $W_{k,E}$ and $W_{k,R}$, serving as the content-based key vectors and the location-based key vectors respectively.
Read another way, the attention score decomposes into four parts:
a. content-based addressing, i.e. the raw score without any positional encoding;
b. a content-dependent positional bias, relative to the current content's position;
c. a global content bias, measuring the importance of the key itself;
d. a global positional bias, adjusting the importance according to the distance between query and key.
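A minimal single-head sketch of this decomposition (toy dimensions, no memory, and randomly initialized `u`, `v`, and projections that are never trained; it only shows the shapes and the four terms, it is not the paper's code):

```python
import torch

d, L = 16, 5                              # head dim, segment length (no memory, for brevity)
E = torch.randn(L, d)                     # word embeddings E_{x_i}
R = torch.randn(L, d)                     # R[k] plays the role of R_{i-j} with i-j = k
W_q, W_kE, W_kR = (torch.randn(d, d) for _ in range(3))
u, v = torch.randn(d), torch.randn(d)     # learnable vectors replacing U_i^T W_q^T

q, k_E = E @ W_q, E @ W_kE                # queries and content-based keys
A = torch.full((L, L), float('-inf'))     # unfilled entries = masked future positions
for i in range(L):
    for j in range(i + 1):                # causal: only attend to j <= i
        k_R = W_kR @ R[i - j]             # location-based key for offset i-j
        a = q[i] @ k_E[j]                 # (a) content-based addressing
        b = q[i] @ k_R                    # (b) content-dependent positional bias
        c = u @ k_E[j]                    # (c) global content bias
        d_ = v @ k_R                      # (d) global positional bias
        A[i, j] = a + b + c + d_
attn = torch.softmax(A / d ** 0.5, dim=-1)
```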
3. The full set of equations
Putting the two innovations together, the overall computation of Transformer-XL is summarized below for an N-layer model with a single attention head:
where $\tau$ indexes the segment, $n$ indexes the layer, and $h_\tau^0 := E_{s_\tau}$ is defined as the word embedding sequence of segment $\tau$. One point worth noting: when computing the $A$ matrix, we need $W_{k,R}^n R_{i-j}$ for every pair $(i, j)$. Done naively from the formula, this costs $O(\text{length})^2$ matrix-vector products, but $i-j$ only ranges over $0 \sim \text{length}$, so we can precompute these length vectors once and simply look them up when filling in $A$.
Concretely, let $M$ and $L$ be the lengths of the memory and of the current segment respectively, so $i-j$ ranges over $0 \sim M+L-1$. Each row of the matrix $Q$ below corresponds to one possible value of $i-j$ in $W_{k,R} R_{i-j}$; specifically, $Q_k = W_{k,R}\, R_{M+L-1-k}$.
For term (b) above, $q_i^T W_{k,R} R_{i-j}$, collecting all the needed entries gives a matrix $B$ of shape $L \times (M+L)$; this is the (b)-term attention result we ultimately want.
We further define the matrix $\tilde{B}$ as follows:
As can be seen, each row of the matrix $B$ we need is just a left shift of the corresponding row of $\tilde{B}$, so it suffices to compute $\tilde{B}$ directly with a single matrix multiplication and then shift. Let $R_{i-j}$ have dimension $d_R$, $q_i$ have dimension $d_q$, and $W_{k,R}$ have shape $d_q \times d_R$. Then computing $B$ directly costs $2 \cdot d_q \cdot d_R \cdot L \cdot (M+L)$ operations, while computing $\tilde{B}$ costs only $L \cdot d_q \cdot (M+L) + d_q \cdot d_R \cdot (M+L)$, which is clearly a different order of magnitude (the latter is much faster).
Similarly, for term (d), we can define the required matrix $D$ over all values of $i-j$, of shape $L \times (M+L)$:
It can be obtained by shifting the following $\tilde{d}$:
Here the matrix $Q$ has already been computed above, so this step also saves computation.
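A small numerical check of this shift trick (toy sizes, random tensors, my own variable names; illustrative only): computing $\tilde{B} = q\,Q^T$ with one matmul and then left-shifting row $i$ by $L-1-i$ reproduces exactly the entries $B_{i,j} = q_i^T W_{k,R} R_{i-j}$ obtained by the naive per-pair computation.

```python
import torch

M, L, d_q, d_R = 3, 4, 8, 8
q    = torch.randn(L, d_q)             # queries of the current segment
R    = torch.randn(M + L, d_R)         # R[k] stands for R_{i-j} with i-j = k
W_kR = torch.randn(d_q, d_R)

# naive: one matrix-vector product per (i, j) pair
B = torch.zeros(L, M + L)
for i in range(L):
    for j in range(M + L):
        rel = (M + i) - j               # relative distance between query i and key j
        if rel >= 0:                    # negative offsets are future keys (masked anyway)
            B[i, j] = q[i] @ (W_kR @ R[rel])

# fast: one matmul, then shift each row left by L-1-i
Q       = R.flip(0) @ W_kR.T            # Q[k] = W_kR @ R[M+L-1-k]
B_tilde = q @ Q.T                        # L x (M+L)
B_fast  = torch.zeros_like(B)
for i in range(L):
    shift = L - 1 - i
    B_fast[i, : M + L - shift] = B_tilde[i, shift:]

print(torch.allclose(B, B_fast, atol=1e-5))   # True: the two agree
```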
IV. PyTorch Implementation
Here I focus on the core model and dissect the key implementation details; readers who want the complete code can find it via the repository link at the end of this post.
- First, the RelativePositionalEmbedding part.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F   # shared by all snippets in this section


class PositionalEmbedding(nn.Module):
    def __init__(self, demb):
        super(PositionalEmbedding, self).__init__()

        self.demb = demb

        inv_freq = 1 / (10000 ** (torch.arange(0.0, demb, 2.0) / demb))
        # registered as a buffer (as in the original repo) so that `self.inv_freq`
        # is available in forward() and moves with the module's device
        self.register_buffer('inv_freq', inv_freq)

    def forward(self, pos_seq):
        sinusoid_inp = torch.ger(pos_seq, self.inv_freq)
        pos_emb = torch.cat([sinusoid_inp.sin(), sinusoid_inp.cos()], dim=-1)

        return pos_emb[:, None, :]
```
Here demb is the dimension of the relative positional encoding, and pos_seq is the sequence of positions, which in the code is torch.arange(klen-1, -1, -1.0), where klen is mlen+qlen. From the names and the earlier discussion, mlen is the memory length and qlen is the query length, and together they make up the key length. The returned tensor is the matrix of $R$ vectors; note that it is not learned.
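For example, a quick usage sketch (toy sizes; the variable names are mine, not the repo's):

```python
import torch

mlen, qlen, demb = 4, 3, 8
klen = mlen + qlen
pos_emb_layer = PositionalEmbedding(demb)
pos_seq = torch.arange(klen - 1, -1, -1.0)   # klen-1, ..., 1, 0 (largest distance first)
r = pos_emb_layer(pos_seq)                   # klen x 1 x demb: the R matrix fed to attention
print(r.shape)                               # torch.Size([7, 1, 8])
```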
- Next, the MultiHeadAttention part. For convenience of exposition, the MultiHeadAttn here is a merge of RelMultiHeadAttn and RelPartialLearnableMultiHeadAttn from the source code, i.e. the computation of one self-attention layer.
```python
class MultiHeadAttn(nn.Module):
    def __init__(self, n_head, d_model, d_head, dropout, dropatt=0,
                 tgt_len=None, ext_len=None, mem_len=None, pre_lnorm=False):
        super(MultiHeadAttn, self).__init__()

        self.n_head = n_head
        self.d_model = d_model
        self.d_head = d_head
        self.dropout = dropout

        self.qkv_net = nn.Linear(d_model, 3 * n_head * d_head, bias=False)

        self.drop = nn.Dropout(dropout)
        self.dropatt = nn.Dropout(dropatt)
        self.o_net = nn.Linear(n_head * d_head, d_model, bias=False)

        self.layer_norm = nn.LayerNorm(d_model)

        self.scale = 1 / (d_head ** 0.5)

        self.pre_lnorm = pre_lnorm

        self.r_net = nn.Linear(self.d_model, self.n_head * self.d_head, bias=False)

    def _rel_shift(self, x, zero_triu=False):
        zero_pad = torch.zeros((x.size(0), 1, *x.size()[2:]),
                               device=x.device, dtype=x.dtype)
        x_padded = torch.cat([zero_pad, x], dim=1)

        x_padded = x_padded.view(x.size(1) + 1, x.size(0), *x.size()[2:])

        x = x_padded[1:].view_as(x)

        if zero_triu:
            ones = torch.ones((x.size(0), x.size(1)))
            x = x * torch.tril(ones, x.size(1) - x.size(0))[:, :, None, None]

        return x

    def forward(self, w, r, r_w_bias, r_r_bias, attn_mask=None, mems=None):
        qlen, rlen, bsz = w.size(0), r.size(0), w.size(1)

        if mems is not None:
            cat = torch.cat([mems, w], 0)
            if self.pre_lnorm:
                w_heads = self.qkv_net(self.layer_norm(cat))
            else:
                w_heads = self.qkv_net(cat)
            r_head_k = self.r_net(r)

            w_head_q, w_head_k, w_head_v = torch.chunk(w_heads, 3, dim=-1)
            w_head_q = w_head_q[-qlen:]
        else:
            if self.pre_lnorm:
                w_heads = self.qkv_net(self.layer_norm(w))
            else:
                w_heads = self.qkv_net(w)
            r_head_k = self.r_net(r)

            w_head_q, w_head_k, w_head_v = torch.chunk(w_heads, 3, dim=-1)

        klen = w_head_k.size(0)

        w_head_q = w_head_q.view(qlen, bsz, self.n_head, self.d_head)  # qlen x bsz x n_head x d_head
        w_head_k = w_head_k.view(klen, bsz, self.n_head, self.d_head)  # klen x bsz x n_head x d_head
        w_head_v = w_head_v.view(klen, bsz, self.n_head, self.d_head)  # klen x bsz x n_head x d_head

        r_head_k = r_head_k.view(rlen, self.n_head, self.d_head)       # rlen x n_head x d_head

        #### compute attention score
        rw_head_q = w_head_q + r_w_bias                                 # qlen x bsz x n_head x d_head
        AC = torch.einsum('ibnd,jbnd->ijbn', (rw_head_q, w_head_k))     # qlen x klen x bsz x n_head

        rr_head_q = w_head_q + r_r_bias
        BD = torch.einsum('ibnd,jnd->ijbn', (rr_head_q, r_head_k))      # qlen x klen x bsz x n_head
        BD = self._rel_shift(BD)

        # [qlen x klen x bsz x n_head]
        attn_score = AC + BD
        attn_score.mul_(self.scale)

        #### compute attention probability
        if attn_mask is not None and attn_mask.any().item():
            if attn_mask.dim() == 2:
                attn_score = attn_score.float().masked_fill(
                    attn_mask[None, :, :, None], -float('inf')).type_as(attn_score)
            elif attn_mask.dim() == 3:
                attn_score = attn_score.float().masked_fill(
                    attn_mask[:, :, :, None], -float('inf')).type_as(attn_score)

        # [qlen x klen x bsz x n_head]
        attn_prob = F.softmax(attn_score, dim=1)
        attn_prob = self.dropatt(attn_prob)

        #### compute attention vector
        attn_vec = torch.einsum('ijbn,jbnd->ibnd', (attn_prob, w_head_v))

        # [qlen x bsz x n_head x d_head]
        attn_vec = attn_vec.contiguous().view(
            attn_vec.size(0), attn_vec.size(1), self.n_head * self.d_head)

        ##### linear projection
        attn_out = self.o_net(attn_vec)
        attn_out = self.drop(attn_out)

        if self.pre_lnorm:
            ##### residual connection
            output = w + attn_out
        else:
            ##### residual connection + layer normalization
            output = self.layer_norm(w + attn_out)

        return output
```
Here n_head, d_model, and d_head are the number of attention heads, the model's hidden dimension, and the per-head hidden dimension. qkv_net is the parameter matrix for the query, key, and value transforms $W_q, W_{k,E}, W_v$, just as in the standard Transformer; o_net is the parameter matrix that maps the concatenation of all heads back to the model dimension; layer_norm is the LayerNormalization layer; and r_net is the parameter matrix $W_{k,R}$ used to transform the relative position embedding.
In the forward pass, w and r are the previous layer's output and the RelativePositionEmbedding respectively, and r_w_bias and r_r_bias are the $u$ and $v$ vectors. AC corresponds to terms (a) and (c) of the formula above, and BD to terms (b) and (d). Following the fast computation of the relative-position terms described earlier, BD has to be shifted, which is what _rel_shift does. Working through it by hand, I found that the BD produced by this function is not exactly the $B$ matrix we want: it still has entries to the upper right of the (M+1)-th diagonal of $B$ (taking the main diagonal as 0 and positive values as shifts toward the upper right), but those entries are masked out immediately afterwards. The attn_mask here is torch.triu(word_emb.new_ones(qlen, klen), diagonal=1+mlen).byte()[:,:,None]. After that comes the standard Transformer add & norm step, which needs no further explanation.
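To make the masking concrete, here is a tiny sketch of what that attn_mask looks like for toy qlen/mlen (it just builds and prints the mask, it is not part of the repo code): each query position can attend to all mlen memory slots plus the current-segment positions up to and including itself.

```python
import torch

qlen, mlen = 3, 4
klen = mlen + qlen
attn_mask = torch.triu(torch.ones(qlen, klen), diagonal=1 + mlen).byte()
print(attn_mask)
# tensor([[0, 0, 0, 0, 0, 1, 1],
#         [0, 0, 0, 0, 0, 0, 1],
#         [0, 0, 0, 0, 0, 0, 0]], dtype=torch.uint8)
# 1 = masked (future positions); each query sees the 4 memory slots and its own prefix.
```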
- Finally, the memory update:
```python
def _update_mems(self, hids, mems, qlen, mlen):
    # does not deal with None
    if mems is None:
        return None

    # mems is not None
    assert len(hids) == len(mems), 'len(hids) != len(mems)'

    # There are `mlen + qlen` steps that can be cached into mems
    # For the next step, the last `ext_len` of the `qlen` tokens
    # will be used as the extended context. Hence, we only cache
    # the tokens from `mlen + qlen - self.ext_len - self.mem_len`
    # to `mlen + qlen - self.ext_len`.
    with torch.no_grad():
        new_mems = []
        end_idx = mlen + max(0, qlen - 0 - self.ext_len)
        beg_idx = max(0, end_idx - self.mem_len)
        for i in range(len(hids)):
            cat = torch.cat([mems[i], hids[i]], dim=0)
            new_mems.append(cat[beg_idx:end_idx].detach())

    return new_mems
```
Here hids are the per-layer outputs of the current segment, mems is the per-layer memory the current segment depends on, qlen is the sequence length, and mlen is the length of the memory the current segment uses.
Looking at the code, the earlier recurrence diagram seems slightly off: during training, shouldn't every position in a segment from the second one onward also connect back to the earliest memory position that the first position connects to? After all, the same length of memory is used for all of them.
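Putting the pieces together, the training loop threads the memory from one segment to the next roughly like this (a condensed sketch of how the repo's MemTransformerLM is driven; the names follow the repo but the loop itself is simplified):

```python
mems = tuple()                               # empty memory before the first segment
for data, target, seq_len in train_iter:     # each iteration yields one segment
    ret = model(data, target, *mems)         # forward over the segment, reusing cached mems
    loss, mems = ret[0], ret[1:]             # the model returns [loss] + new_mems
    loss = loss.float().mean()
    optimizer.zero_grad()
    loss.backward()                          # gradients flow only through the current segment
    optimizer.step()
```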
V. Experimental Results
1. Language modeling metrics
On the metric we care most about, language modeling quality, the paper compares word-level and character-level performance on several datasets against both RNNs and the (vanilla) Transformer. Transformer-XL reaches state of the art on all of them: on the large word-level dataset WikiText-103 it lowers perplexity from 20.5 to 18.3; on enwik8, a 12-layer Transformer-XL reaches 1.06 bpc, matching the bpc of the Al-Rfou model ( https://arxiv.org/abs/1808.04444 ) that has 6x as many parameters, and a 24-layer Transformer-XL goes further to 0.99 bpc; it also sets new SoTA on One Billion Word (which contains only short sentences) and on Penn Treebank (small, only 1M tokens), improving perplexity from 23.7 to 21.8 on the former and from 55.3 to 54.5 on the latter. All of this shows that Transformer-XL is competitive across very different datasets.
2. The benefit of the two innovations
The figure below compares perplexity with and without the recurrence mechanism, and with and without the new positional encoding, across different context (i.e. memory) lengths. Transformer-XL with recurrence and relative positional encoding clearly outperforms the other variants and makes effective use of long-term dependency: it captures dependencies about 80% longer than RNNs and about 450% longer than the vanilla Transformer.
3. Inference speed
Transformer-XL is also much faster than the vanilla Transformer at inference, especially with long contexts: with a context length of 800 it is 363x faster, and with a context length of 3,800 it is 1,874x faster!
VI. Summary
1. Model highlights
Two innovations on top of the vanilla Transformer of Al-Rfou et al.:
- a recurrence mechanism (Recurrence Mechanism);
- relative positional encoding (Relative Positional Encoding).
2. Strengths
- State-of-the-art language modeling results across several datasets (large/small, character-level/word-level, etc.).
- Combines two key ideas of deep learning, recurrence and attention, allowing the model to learn long-term dependencies; this may extend to other domains that need the same ability, such as audio analysis (e.g. speech data at 16k samples per second).
- Very fast at inference: 300~1800x faster than the previous best Transformer-based language models.
- A thorough open-source release, with both TensorFlow and PyTorch versions, TensorFlow pretrained models, and the full hyperparameter settings for every dataset.
3. Weaknesses
- Not yet applied to concrete NLP tasks such as sentiment analysis or QA.
- No comparison against other Transformer-based models such as BERT, so its advantages there are unclear.
- The GitHub repo notes that the current SoTA results were trained on large TPU clusters; those of us with modest machines can only play with the base configuration.
Links
Paper: https://arxiv.org/pdf/1901.02860.pdf
Code: https://github.com/kimiyoung/transformer-xl
Reference: https://www.lyrn.ai/2019/01/16/transformer-xl-sota-language-model