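Absolute position encoding
Used by models such as BERT and ALBERT.
Absolute position encoding trains one embedding vector per position. The embedding table has shape (max_position_embeddings, hidden_size).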
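# Initialization: one learnable vector per position
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
# Look up the position embeddings and add them to the input embeddings
position_embeddings = self.position_embeddings(position_ids)
embeddings += position_embeddings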
Advantage: it works well as long as extrapolation is not required.
Disadvantage: it cannot extrapolate, because max_position_embeddings is fixed at pretraining time.
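To make the extrapolation limit concrete, here is a minimal pure-Python sketch of a learned position table. The class name `LearnedPositionTable` and the random initialization are illustrative assumptions and not BERT's actual implementation; the only point is that a table with `max_position_embeddings` rows has no entry for any later position.

```python
import random


class LearnedPositionTable:
    """Toy stand-in for nn.Embedding(max_position_embeddings, hidden_size)."""

    def __init__(self, max_position_embeddings, hidden_size, seed=0):
        rng = random.Random(seed)
        self.max_position_embeddings = max_position_embeddings
        # One row per position; random values stand in for trained ones.
        self.weight = [
            [rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
            for _ in range(max_position_embeddings)
        ]

    def __call__(self, position_ids):
        for pos in position_ids:
            if not 0 <= pos < self.max_position_embeddings:
                raise IndexError(f"position {pos} is outside the trained table")
        return [self.weight[pos] for pos in position_ids]


table = LearnedPositionTable(max_position_embeddings=8, hidden_size=4)
inside = table([0, 1, 7])  # every position has a trained row
try:
    table([8])  # one past the end: there is no row to look up
    extrapolated = True
except IndexError:
    extrapolated = False
```

Handling longer inputs therefore requires enlarging the table and training the new rows.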
Sinusoidal position encoding
Sinusoidal position encoding was proposed in the Transformer paper. The formulas are:
$$
\begin{array}{l}
PE_{(pos, 2i)} = \sin\left(pos / base^{2i/d}\right) \\
PE_{(pos, 2i+1)} = \cos\left(pos / base^{2i/d}\right)
\end{array}
$$
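Here pos is the position index and i indexes the dimensions of the hidden vector (dimensions 2i and 2i+1 share one frequency). Typically base = 10000; as base increases, the periods become noticeably longer.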
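The formulas translate almost directly into code. The following is a minimal pure-Python sketch; the function names `sinusoidal_pe` and `period` are chosen here for illustration, and `d` is assumed to be even.

```python
import math


def sinusoidal_pe(pos, d, base=10000.0):
    """Return the d-dimensional sinusoidal encoding of position `pos` (d even)."""
    pe = [0.0] * d
    for i in range(d // 2):
        angle = pos / base ** (2 * i / d)
        pe[2 * i] = math.sin(angle)      # even dimensions use sin
        pe[2 * i + 1] = math.cos(angle)  # odd dimensions use cos
    return pe


def period(i, d, base=10000.0):
    """Wavelength, in positions, of dimension pair i: 2 * pi * base^(2i/d)."""
    return 2 * math.pi * base ** (2 * i / d)
```

For any pair with i > 0, `period(i, d, base)` grows with `base`, which is the sense in which a larger base lengthens the periods.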
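Advantages:
1. It can express relative position, so the model can learn relative positions: for a fixed offset k, PE(t+k) is a linear function of PE(t), with coefficients u = PE(k, 2i+1) and v = PE(k, 2i) that depend only on k.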
$$
\begin{aligned}
PE(t, 2i) &= \sin\left(t * w_{2i}\right) \\
PE(t, 2i+1) &= \cos\left(t * w_{2i}\right) \\
w_{2i} &= 1 / 10000^{2i/d}
\end{aligned}
$$

$$
\begin{aligned}
PE(t+k, 2i) &= \sin\left(t * w_{2i} + k * w_{2i}\right) \\
&= \sin\left(t * w_{2i}\right)\cos\left(k * w_{2i}\right) + \cos\left(t * w_{2i}\right)\sin\left(k * w_{2i}\right) \\
&= PE(t, 2i)\,PE(k, 2i+1) + PE(t, 2i+1)\,PE(k, 2i) \\
&= PE(t, 2i)\,u + PE(t, 2i+1)\,v
\end{aligned}
$$

$$
\begin{aligned}
PE(t+k, 2i+1) &= \cos\left(t * w_{2i} + k * w_{2i}\right) \\
&= \cos\left(t * w_{2i}\right)\cos\left(k * w_{2i}\right) - \sin\left(t * w_{2i}\right)\sin\left(k * w_{2i}\right) \\
&= PE(t, 2i+1)\,PE(k, 2i+1) - PE(t, 2i)\,PE(k, 2i) \\
&= PE(t, 2i+1)\,u - PE(t, 2i)\,v
\end{aligned}
$$
2. The inner product of two position vectors depends only on the relative offset k.
$$
\begin{aligned}
PE(t) \cdot PE(t+k) &= \sum_{i=0}^{d/2-1} PE(t, 2i)\,PE(t+k, 2i) + \sum_{i=0}^{d/2-1} PE(t, 2i+1)\,PE(t+k, 2i+1) \\
&= \sum_{i=0}^{d/2-1} \sin\left(t * w_{2i}\right)\sin\left[(t+k) * w_{2i}\right] + \sum_{i=0}^{d/2-1} \cos\left(t * w_{2i}\right)\cos\left[(t+k) * w_{2i}\right] \\
&= \sum_{i=0}^{d/2-1} \cos\left(k * w_{2i}\right)
\end{aligned}
$$
3. As k increases, the inner product tends to decrease, i.e. the encoding exhibits long-range decay.
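The three properties can be checked numerically. The following is a minimal pure-Python sketch, assuming base = 10000 and d = 64; the helper names `pe`, `shift`, and `dot` are illustrative. It does not prove anything, it only evaluates the formulas at a few sample positions.

```python
import math


def pe(t, d, base=10000.0):
    """Sinusoidal encoding of position t: [sin, cos] for each frequency."""
    out = []
    for i in range(d // 2):
        w = 1.0 / base ** (2 * i / d)
        out += [math.sin(t * w), math.cos(t * w)]
    return out


def shift(pe_t, k, d, base=10000.0):
    """Compute PE(t+k) from PE(t) alone, using u = cos(k*w) and v = sin(k*w)."""
    out = []
    for i in range(d // 2):
        w = 1.0 / base ** (2 * i / d)
        u, v = math.cos(k * w), math.sin(k * w)
        s, c = pe_t[2 * i], pe_t[2 * i + 1]
        out += [s * u + c * v, c * u - s * v]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


d = 64
# Property 1: PE(5 + 7) obtained linearly from PE(5) matches PE(12).
max_err = max(abs(a - b) for a, b in zip(shift(pe(5, d), 7, d), pe(12, d)))
# Property 2: for a fixed offset k = 7, the inner product is the same for every t.
same_k = [dot(pe(t, d), pe(t + 7, d)) for t in (0, 10, 300)]
# Property 3: the inner product is d/2 at k = 0 and smaller at larger offsets.
by_offset = {k: dot(pe(0, d), pe(k, d)) for k in (0, 1, 10, 100)}
```

The decay in `by_offset` is a trend and not strictly monotonic in k, because the sum of cosines oscillates.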
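Disadvantage:
In the actual attention computation, the encoded vectors are also multiplied by the attention projection weights W. Once the W matrices are taken into account, the inner product no longer depends only on the relative distance k.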
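A small numerical sketch makes this concrete. It compares the inner product of PE(t) with PE(t+k) against the same inner product after a projection is applied to the key side. The `swap_pairs` matrix is a hypothetical projection chosen only because its effect is easy to see (the result becomes a sum of sin((2t+k) w) terms); real projection matrices are learned, and a real model also projects the token embedding, which is omitted here.

```python
import math


def pe(t, d, base=10000.0):
    out = []
    for i in range(d // 2):
        w = 1.0 / base ** (2 * i / d)
        out += [math.sin(t * w), math.cos(t * w)]
    return out


def matvec(W, x):
    return [sum(a * b for a, b in zip(row, x)) for row in W]


def identity(d):
    return [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]


def swap_pairs(d):
    """Hypothetical key projection that swaps the sin and cos entry of each pair."""
    W = [[0.0] * d for _ in range(d)]
    for r in range(0, d, 2):
        W[r][r + 1] = 1.0
        W[r + 1][r] = 1.0
    return W


def score(t, k, W_k, d=16):
    """Inner product of PE(t) with the projected PE(t+k)."""
    key = matvec(W_k, pe(t + k, d))
    return sum(a * b for a, b in zip(pe(t, d), key))


plain = [score(t, 3, identity(16)) for t in (0, 5, 50)]        # same value for every t
projected = [score(t, 3, swap_pairs(16)) for t in (0, 5, 50)]  # varies with t
```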
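Rotary position encoding (RoPE)
Rotary position encoding addresses this limitation of sinusoidal encoding: the inner product of Q and K depends on position only through the relative position.
Basic formulas
Trigonometric sum and difference formulas: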
$$
\begin{array}{l}
\sin(A+B) = \sin A \cos B + \cos A \sin B \\
\sin(A-B) = \sin A \cos B - \sin B \cos A \\
\cos(A+B) = \cos A \cos B - \sin A \sin B \\
\cos(A-B) = \cos A \cos B + \sin A \sin B \\
\tan(A+B) = (\tan A + \tan B) / (1 - \tan A \tan B) \\
\tan(A-B) = (\tan A - \tan B) / (1 + \tan A \tan B) \\
\cot(A+B) = (\cot A \cot B - 1) / (\cot B + \cot A) \\
\cot(A-B) = (\cot A \cot B + 1) / (\cot B - \cot A)
\end{array}
$$
Euler's formula
$$
e^{ix} = \cos x + i \sin x
$$
Complex conjugate rules
$$
\begin{array}{l}
\overline{\alpha + \beta} = \overline{\alpha} + \overline{\beta} \\
\overline{\alpha \times \beta} = \overline{\alpha} \times \overline{\beta}
\end{array}
$$
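These identities are easy to spot-check with Python's standard `cmath` module. The sample values below are arbitrary and chosen only for illustration.

```python
import cmath
import math

x = 0.7
# Euler's formula: e^{ix} = cos x + i sin x
euler_gap = abs(cmath.exp(1j * x) - complex(math.cos(x), math.sin(x)))

alpha, beta = complex(1.5, -2.0), complex(-0.25, 3.0)
# The conjugate of a sum equals the sum of the conjugates.
sum_gap = abs((alpha + beta).conjugate() - (alpha.conjugate() + beta.conjugate()))
# The conjugate of a product equals the product of the conjugates.
prod_gap = abs((alpha * beta).conjugate() - alpha.conjugate() * beta.conjugate())
```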
Derivation of the rotation
We want to find a transformation such that the following condition holds:
$$
\left\langle f_{q}\left(\boldsymbol{q}_{m}, m\right), f_{k}\left(\boldsymbol{k}_{n}, n\right) \right\rangle = g\left(\boldsymbol{q}_{m}, \boldsymbol{k}_{n}, m-n\right)
$$
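Here $\boldsymbol{x}_{m}$ is the hidden vector at position $m$; the query $\boldsymbol{q}_{m}$ and key $\boldsymbol{k}_{n}$ are computed from $\boldsymbol{x}_{m}$ and $\boldsymbol{x}_{n}$. In other words, after the Q and K vectors pass through the transformation, the inner product of the resulting vectors depends only on the original Q and K vectors and on $m-n$.
The two-dimensional form of rotary position encoding is given below: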
$$
\begin{array}{l}
f_{q}\left(\boldsymbol{q}_{m}, m\right) = \boldsymbol{q}_{m} e^{i m \theta} \\
f_{k}\left(\boldsymbol{k}_{n}, n\right) = \boldsymbol{k}_{n} e^{i n \theta} \\
\begin{aligned}
g\left(\boldsymbol{q}_{m}, \boldsymbol{k}_{n}, m-n\right) &= \operatorname{Re}\left[ f_{q}\left(\boldsymbol{q}_{m}, m\right) \times \overline{f_{k}\left(\boldsymbol{k}_{n}, n\right)} \right] \\
&= \operatorname{Re}\left[ \boldsymbol{q}_{m} e^{i m \theta} \times \overline{\boldsymbol{k}_{n}}\, \overline{e^{i n \theta}} \right] \\
&= \operatorname{Re}\left[ \boldsymbol{q}_{m} e^{i m \theta} e^{-i n \theta}\, \overline{\boldsymbol{k}_{n}} \right] \\
&= \operatorname{Re}\left[ \boldsymbol{q}_{m} e^{i (m-n) \theta}\, \overline{\boldsymbol{k}_{n}} \right]
\end{aligned}
\end{array}
$$
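The two-dimensional case can be checked directly with complex numbers. The following is a minimal sketch in which a 2-D vector is stored as one Python `complex`; the function names `f` and `g` mirror the symbols above, and the sample values of `q`, `k`, and `theta` are arbitrary assumptions.

```python
import cmath


def f(vec, pos, theta):
    """Rotate a 2-D vector, stored as a complex number, by the angle pos * theta."""
    return vec * cmath.exp(1j * pos * theta)


def g(q, k, m, n, theta):
    """Inner product of the rotated vectors: Re[f_q * conj(f_k)]."""
    return (f(q, m, theta) * f(k, n, theta).conjugate()).real


q, k, theta = complex(0.3, -1.2), complex(2.0, 0.5), 0.1
# Three different absolute positions, all with the same offset m - n = 4.
scores = [g(q, k, m, m - 4, theta) for m in (4, 10, 1000)]
# The last line of the derivation: Re[q * e^{i(m-n)theta} * conj(k)].
closed_form = (q * cmath.exp(1j * 4 * theta) * k.conjugate()).real
```

All entries of `scores` agree with `closed_form` up to floating-point error, regardless of the absolute position.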
The matrix form is derived below. Here $\boldsymbol{q}_{m}$ has only two elements and is written in complex form as $(q_{1} + i q_{2})$:
$$
\begin{array}{l}
(q_{1} + i q_{2})\, e^{i m \theta} = (q_{1} \cos m\theta - q_{2} \sin m\theta) + i (q_{1} \sin m\theta + q_{2} \cos m\theta) \\
(k_{1} + i k_{2})\, e^{i n \theta} = (k_{1} \cos n\theta - k_{2} \sin n\theta) + i (k_{1} \sin n\theta + k_{2} \cos n\theta)
\end{array}
$$
$$
\begin{aligned}
&\operatorname{Re}\left\{ \left[ (q_{1} + i q_{2})\, e^{i m \theta} \right] \times \overline{\left[ (k_{1} + i k_{2})\, e^{i n \theta} \right]} \right\} \\
&= \operatorname{Re}\left\{ \left[ (q_{1} \cos m\theta - q_{2} \sin m\theta) + i (q_{1} \sin m\theta + q_{2} \cos m\theta) \right] \times \left[ (k_{1} \cos n\theta - k_{2} \sin n\theta) - i (k_{1} \sin n\theta + k_{2} \cos n\theta) \right] \right\} \\
&= (q_{1} \cos m\theta - q_{2} \sin m\theta)(k_{1} \cos n\theta - k_{2} \sin n\theta) + (q_{1} \sin m\theta + q_{2} \cos m\theta)(k_{1} \sin n\theta + k_{2} \cos n\theta) \\
&= q_{1} k_{1} \left( \cos m\theta \cos n\theta + \sin m\theta \sin n\theta \right) + \\
&\quad q_{1} k_{2} \left( -\cos m\theta \sin n\theta + \sin m\theta \cos n\theta \right) + \\
&\quad q_{2} k_{1} \left( \cos m\theta \sin n\theta - \sin m\theta \cos n\theta \right) + \\
&\quad q_{2} k_{2} \left( \sin m\theta \sin n\theta + \cos m\theta \cos n\theta \right) \\
&= q_{1} k_{1} \cos (m-n)\theta + q_{1} k_{2} \sin (m-n)\theta - q_{2} k_{1} \sin (m-n)\theta + q_{2} k_{2} \cos (m-n)\theta
\end{aligned}
$$
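The same result can be checked with real arithmetic only. Collecting terms, the real part equals (q1 k1 + q2 k2) cos((m-n)θ) + (q1 k2 - q2 k1) sin((m-n)θ). The sketch below rotates q and k with an explicit 2x2 rotation and compares the inner product with that closed form; the helper names and sample values are illustrative assumptions.

```python
import math


def rotate(x, pos, theta):
    """Apply the 2x2 rotation by the angle pos * theta to x = (x1, x2)."""
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x[0] * c - x[1] * s, x[0] * s + x[1] * c)


def rotated_dot(q, k, m, n, theta):
    """Inner product of q rotated to position m and k rotated to position n."""
    qm, kn = rotate(q, m, theta), rotate(k, n, theta)
    return qm[0] * kn[0] + qm[1] * kn[1]


def closed_form(q, k, rel, theta):
    """(q1 k1 + q2 k2) cos(rel * theta) + (q1 k2 - q2 k1) sin(rel * theta)."""
    c, s = math.cos(rel * theta), math.sin(rel * theta)
    return (q[0] * k[0] + q[1] * k[1]) * c + (q[0] * k[1] - q[1] * k[0]) * s


q, k, theta = (0.3, -1.2), (2.0, 0.5), 0.25
# Three (m, n) pairs that share the relative position m - n = 2.
gaps = [
    abs(rotated_dot(q, k, m, n, theta) - closed_form(q, k, 2, theta))
    for m, n in ((3, 1), (12, 10), (102, 100))
]
```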
Extending to the multi-dimensional case:
$$
\left[\begin{array}{ccccccc}
\cos m\theta_{0} & -\sin m\theta_{0} & 0 & 0 & \cdots & 0 & 0 \\
\sin m\theta_{0} & \cos m\theta_{0} & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & \cos m\theta_{1} & -\sin m\theta_{1} & \cdots & 0 & 0 \\
0 & 0 & \sin m\theta_{1} & \cos m\theta_{1} & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2-1} & -\sin m\theta_{d/2-1} \\
0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2-1} & \cos m\theta_{d/2-1}
\end{array}\right]
\left[\begin{array}{c}
q_{0} \\ q_{1} \\ q_{2} \\ q_{3} \\ \vdots \\ q_{d-2} \\ q_{d-1}
\end{array}\right]
$$
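As a sketch of the block-diagonal form, the code below builds the full d x d rotation matrix for a position m and applies it with a plain matrix-vector product. It assumes d is even and uses θ_i = base^(-2i/d) with base = 10000; the names `rope_matrix`, `matvec`, and `dot` are illustrative.

```python
import math


def rope_matrix(m, d, base=10000.0):
    """Build the d x d block-diagonal rotation matrix for position m (d even)."""
    R = [[0.0] * d for _ in range(d)]
    for i in range(d // 2):
        theta_i = base ** (-2 * i / d)
        c, s = math.cos(m * theta_i), math.sin(m * theta_i)
        r = 2 * i
        R[r][r], R[r][r + 1] = c, -s
        R[r + 1][r], R[r + 1][r + 1] = s, c
    return R


def matvec(R, x):
    return [sum(a * b for a, b in zip(row, x)) for row in R]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

Each 2x2 block is a rotation, so the matrix preserves vector norms, and the inner product of a rotated query and a rotated key depends on the two positions only through their difference.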
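Because this matrix is sparse, the product can be computed with element-wise multiplications instead ($\otimes$ denotes the element-wise product):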
$$
\left[\begin{array}{c}
q_{0} \\ q_{1} \\ q_{2} \\ q_{3} \\ \vdots \\ q_{d-2} \\ q_{d-1}
\end{array}\right]
\otimes
\left[\begin{array}{c}
\cos m\theta_{0} \\ \cos m\theta_{0} \\ \cos m\theta_{1} \\ \cos m\theta_{1} \\ \vdots \\ \cos m\theta_{d/2-1} \\ \cos m\theta_{d/2-1}
\end{array}\right]
+
\left[\begin{array}{c}
-q_{1} \\ q_{0} \\ -q_{3} \\ q_{2} \\ \vdots \\ -q_{d-1} \\ q_{d-2}
\end{array}\right]
\otimes
\left[\begin{array}{c}
\sin m\theta_{0} \\ \sin m\theta_{0} \\ \sin m\theta_{1} \\ \sin m\theta_{1} \\ \vdots \\ \sin m\theta_{d/2-1} \\ \sin m\theta_{d/2-1}
\end{array}\right]
$$
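The element-wise form avoids building the d x d matrix. The following is a minimal pure-Python sketch, assuming an even-length vector and θ_i = base^(-2i/d) with base = 10000. The helper name `swap_negate_pairs` is made up here for the vector [-q1, q0, -q3, q2, ...]; real implementations may lay out the pairs differently.

```python
import math


def swap_negate_pairs(q):
    """Return [-q1, q0, -q3, q2, ...], the second vector in the element-wise form."""
    out = []
    for i in range(0, len(q), 2):
        out += [-q[i + 1], q[i]]
    return out


def rope(q, m, base=10000.0):
    """Apply the rotation for position m using only element-wise products."""
    d = len(q)
    cos, sin = [], []
    for i in range(d // 2):
        theta_i = base ** (-2 * i / d)
        cos += [math.cos(m * theta_i)] * 2  # both elements of a pair share one angle
        sin += [math.sin(m * theta_i)] * 2
    rot = swap_negate_pairs(q)
    return [q[j] * cos[j] + rot[j] * sin[j] for j in range(d)]
```

For example, `rope([1.0, 0.0], 1)` rotates the unit vector by one radian and returns `[cos 1, sin 1]`. The cost is O(d) per vector, compared with O(d^2) for the dense matrix product.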
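For each word-embedding vector in the token sequence, first compute its query and key vectors. Then compute the rotary position encoding for each token position, and apply the rotation to the elements of that position's query and key vectors two at a time. Finally, compute the inner product of the query and key to obtain the self-attention result.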
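The procedure can be sketched end to end on a toy sequence. The code below is a simplified illustration under several assumptions: the query and key projections are skipped (queries and keys are passed in directly), scores are neither scaled nor masked, and base = 10000. Using the same query and key content at every position isolates the positional effect.

```python
import cmath


def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[2i], x[2i+1]) by the angle pos * theta_i."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta_i = base ** (-2 * i / d)
        z = complex(x[2 * i], x[2 * i + 1]) * cmath.exp(1j * pos * theta_i)
        out += [z.real, z.imag]
    return out


def attention_scores(queries, keys):
    """Raw scores after rotating each q_m and k_n by its own position."""
    rq = [rope(q, m) for m, q in enumerate(queries)]
    rk = [rope(k, n) for n, k in enumerate(keys)]
    return [[sum(a * b for a, b in zip(qm, kn)) for kn in rk] for qm in rq]


q, k = [0.5, -1.0, 2.0, 0.1], [1.0, 0.3, -0.7, 0.9]
scores = attention_scores([q] * 5, [k] * 5)
```

With identical content at every position, `scores[m][n]` is constant along each diagonal, that is, it depends only on m - n. It is not symmetric in general: the score for offset +1 differs from the score for offset -1.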
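For a detailed code walkthrough, see: Meta最新模型LLaMA细节与代码详解 (a walkthrough of the details and code of Meta's LLaMA model).
With the following choice of $\theta_{i}$, the inner product tends to decay as the relative distance increases: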
$$
\theta_{i} = 10000^{-2i/d}
$$
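The decay trend can be illustrated with a small calculation. The sketch below assumes q = k = all-ones vectors, for which each rotated pair contributes 2 cos(rel * θ_i) to the score at relative distance rel; the function names are illustrative, and other choices of q and k give different curves.

```python
import math


def thetas(d, base=10000.0):
    """Frequencies theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2 * i / d) for i in range(d // 2)]


def relative_score(rel, d, base=10000.0):
    """Score between q = k = all-ones vectors at relative distance rel."""
    return sum(2 * math.cos(rel * t) for t in thetas(d, base))


# Scores at a few relative distances for d = 64.
trend = [relative_score(rel, 64) for rel in (0, 10, 100)]
```

The values in `trend` decrease over these three distances, although the curve oscillates locally and is not strictly monotonic.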