RoPE and Length Extrapolation in Large Models

暮海星辰

Last modified: 2024-04-17 21:58:38


Tags: algorithms

First published: 2024-04-16 12:59:49

Article link: https://blog.csdn.net/haixiao0720/article/details/137812722


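RoPE and NTK Extrapolation in Large Models

Absolute position encoding
Sinusoidal position encoding
- Advantages:
- Disadvantages:
Rotary position embedding (RoPE)
- Basic formulas
NTK (to be continued)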

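Absolute position encoding

Used by BERT, ALBERT, and similar models.
Absolute position encoding trains one embedding vector per position, so the embedding table has shape (max_position_embeddings, hidden_size).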

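# Initialization (in the embedding module's __init__)
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
# Look up the position embeddings and add them to the token embeddings (in forward)
position_embeddings = self.position_embeddings(position_ids)
embeddings += position_embeddings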

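Advantage: works well as long as no extrapolation is required.
Disadvantage: cannot extrapolate, because max_position_embeddings is fixed at pretraining time and positions beyond it have no trained vector.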
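
To make the extrapolation limit concrete, here is a minimal pure-Python sketch of a learned position table. The names `MAX_POSITIONS`, `HIDDEN_SIZE`, `position_table`, and `lookup` are illustrative stand-ins rather than BERT's actual code; the table plays the role of `nn.Embedding`, and the values are random instead of trained.

```python
import random

MAX_POSITIONS = 8   # stands in for config.max_position_embeddings (assumed value)
HIDDEN_SIZE = 4     # stands in for config.hidden_size (assumed value)

rng = random.Random(0)
# One vector per position; in a real model these would be trained parameters.
position_table = [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)]
                  for _ in range(MAX_POSITIONS)]

def lookup(position_ids):
    """Return the position vector for each id; fails for ids >= MAX_POSITIONS."""
    out = []
    for p in position_ids:
        if not 0 <= p < MAX_POSITIONS:
            raise IndexError(f"position {p} has no trained embedding")
        out.append(position_table[p])
    return out

in_range = lookup([0, 1, 7])   # fine: every position exists in the table
try:
    lookup([8])                # one past the table: there is nothing to look up
    extrapolated = True
except IndexError:
    extrapolated = False
```

A longer input would need a larger table, and the extra rows would have to be trained, which is why this scheme is tied to the pretraining length.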

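Sinusoidal position encoding

Sinusoidal position encoding was introduced in the Transformer paper. The formulas are:
$\begin{array}{l} PE_{(pos, 2i)}=\sin \left(pos / base^{2i/d}\right) \\ PE_{(pos, 2i+1)}=\cos \left(pos / base^{2i/d}\right) \end{array}$
Here pos is the position index and i indexes the dimension pairs of the hidden vector (i = 0, ..., d/2-1). Typically base = 10000. The wavelength of pair i is $2\pi \cdot base^{2i/d}$, so a larger base makes the periods noticeably longer.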
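
The formula can be sketched in a few lines of plain Python. The function names `sinusoidal_pe` and `wavelength` are chosen here for illustration; the layout (sine at even indices, cosine at odd indices) follows the formula above.

```python
import math

def sinusoidal_pe(pos, d, base=10000.0):
    """Return the d-dimensional sinusoidal encoding of position `pos` (d even)."""
    pe = [0.0] * d
    for i in range(d // 2):
        angle = pos / base ** (2 * i / d)
        pe[2 * i] = math.sin(angle)       # even index: sine
        pe[2 * i + 1] = math.cos(angle)   # odd index: cosine
    return pe

def wavelength(i, d, base=10000.0):
    """Period, in positions, of dimension pair i: 2*pi*base^(2i/d)."""
    return 2 * math.pi * base ** (2 * i / d)

pe0 = sinusoidal_pe(0, 8)   # at position 0 every pair is (sin 0, cos 0) = (0, 1)
```

Calling `wavelength(3, 8, base=20000.0)` gives a larger value than `wavelength(3, 8)`, which is the "larger base, longer period" effect described above.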

Advantages:

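1. It can express relative position: for a fixed offset k, PE(t+k) is a linear function of PE(t). In the derivation below, the coefficients $u=\cos(k w_{2i})$ and $v=\sin(k w_{2i})$ depend only on k, not on t.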
$\begin{array}{l} \begin{array}{l} P E(t, 2 i)=\sin \left(t * w_{2 i}\right) \\ P E(t, 2 i+1)=\cos \left(t * w_{2 i}\right) \\ w_{2 i}=1 / 10000^{2 i / d} \end{array}\\ \begin{aligned} P E(t+k, 2 i) & =\sin \left(t * w_{2 i}+k * w_{2 i}\right) \\ & =\sin \left(t * w_{2 i}\right) \cos \left(k * w_{2 i}\right)+\cos \left(t * w_{2 i}\right) \sin \left(k * w_{2 i}\right) \\ & =P E(t, 2 i) P E(k, 2 i+1)+P E(t, 2 i+1) P E(k, 2 i) \\ & =P E(t, 2 i) u+P E(t, 2 i+1) v \end{aligned}\\ \begin{aligned} P E(t+k, 2 i+1) & =\cos \left(t * w_{2 i}+k * w_{2 i}\right) \\ & =\cos \left(t * w_{2 i}\right) \cos \left(k * w_{2 i}\right)-\sin \left(t * w_{2 i}\right) \sin \left(k * w_{2 i}\right) \\ & =P E(t, 2 i+1) P E(k, 2 i+1)-P E(t, 2 i) P E(k, 2 i) \\ & =P E(t, 2 i+1) u-P E(t, 2 i) v \end{aligned} \end{array}$
2. The inner product of two position vectors depends only on the relative offset k.
$\begin{aligned} P E(t) P E(t+k) & =\sum_{i=0}^{d / 2-1} P E(t, 2 i) P E(t+k, 2 i)+\sum_{i=0}^{d / 2-1} P E(t, 2 i+1) P E(t+k, 2 i+1) \\ & =\sum_{i=0}^{d / 2-1} \sin \left(t * w_{2 i}\right) \sin \left[(t+k) * w_{2 i}\right]+\sum_{i=0}^{d / 2-1} \cos \left(t * w_{2 i}\right) \cos \left[(t+k) * w_{2 i}\right] \\ & =\sum_{i=0}^{d / 2-1} \cos \left(k * w_{2 i}\right) \end{aligned}$
3. As k grows, the inner product tends to shrink (with oscillation, not strictly monotonically), i.e. there is long-range decay.
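
The three properties can be checked numerically. The sketch below is a plain-Python illustration under assumed values (d = 128, k = 7, a handful of positions t); the helper names `shift` and `window_mean` are made up for this example.

```python
import math

def sinusoidal_pe(pos, d, base=10000.0):
    """Sinusoidal encoding with (sin, cos) stored in consecutive pairs."""
    pe = []
    for i in range(d // 2):
        angle = pos / base ** (2 * i / d)
        pe += [math.sin(angle), math.cos(angle)]
    return pe

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def shift(pe, k, base=10000.0):
    """Property 1: build PE(t+k) from PE(t) with coefficients that depend only on k."""
    d = len(pe)
    out = []
    for i in range(d // 2):
        w = 1.0 / base ** (2 * i / d)
        u, v = math.cos(k * w), math.sin(k * w)
        s, c = pe[2 * i], pe[2 * i + 1]
        out += [s * u + c * v, c * u - s * v]
    return out

d, k = 128, 7
# Property 1: the shifted vector matches the directly computed one.
shift_error = max(abs(a - b) for a, b in
                  zip(shift(sinusoidal_pe(17, d), k), sinusoidal_pe(17 + k, d)))

# Property 2: the inner product is the same for every t and equals sum cos(k*w).
dots = [dot(sinusoidal_pe(t, d), sinusoidal_pe(t + k, d)) for t in (0, 3, 50, 400)]
closed_form = sum(math.cos(k / 10000.0 ** (2 * i / d)) for i in range(d // 2))

# Property 3: averaged over a short window of offsets, the inner product shrinks.
def window_mean(start, width=10):
    return sum(dot(sinusoidal_pe(0, d), sinusoidal_pe(kk, d))
               for kk in range(start, start + width)) / width

near, mid, far = window_mean(1), window_mean(100), window_mean(1000)
```

The window average is used for property 3 because individual values oscillate; the downward trend is clearer after smoothing.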

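Disadvantages:

In an actual attention computation, however, the position-augmented inputs are first multiplied by the attention projection weights W. Once the W matrices are taken into account, the inner product no longer depends only on the relative distance k.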
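
A small numeric check illustrates the point. In the sketch below, `M` is a hypothetical stand-in for the product of the query and key projection matrices (identity plus a single off-diagonal entry, chosen by hand); it is not taken from any real model.

```python
import math

def sinusoidal_pe(pos, d=4, base=10000.0):
    pe = []
    for i in range(d // 2):
        angle = pos / base ** (2 * i / d)
        pe += [math.sin(angle), math.cos(angle)]
    return pe

def bilinear(a, M, b):
    """Compute a^T M b for a square matrix M given as nested lists."""
    return sum(a[r] * M[r][c] * b[c] for r in range(len(a)) for c in range(len(b)))

identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
# Hypothetical stand-in for W_q^T W_k: identity plus one off-diagonal entry.
M = [row[:] for row in identity]
M[0][1] = 2.0

k = 3
# Without W the value is identical for every t; with M it varies with t.
plain = [bilinear(sinusoidal_pe(t), identity, sinusoidal_pe(t + k)) for t in (0, 1, 10)]
mixed = [bilinear(sinusoidal_pe(t), M, sinusoidal_pe(t + k)) for t in (0, 1, 10)]
```

With the identity matrix the three entries of `plain` coincide, while the entries of `mixed` differ even though the offset k is the same.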

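Rotary position embedding (RoPE)

RoPE addresses this weakness of sinusoidal encoding. The rotation is applied to Q and K after the projection, so the inner product of Q and K depends on position only through the relative offset.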

Basic formulas

Angle sum and difference identities:

$\begin{array}{l} \sin(A+B)=\sin A\cos B+\cos A\sin B\\ \sin(A-B)=\sin A\cos B-\cos A\sin B\\ \cos(A+B)=\cos A\cos B-\sin A\sin B\\ \cos(A-B)=\cos A\cos B+\sin A\sin B\\ \tan(A+B)=(\tan A+\tan B)/(1-\tan A\tan B)\\ \tan(A-B)=(\tan A-\tan B)/(1+\tan A\tan B) \\ \cot(A+B)=(\cot A\cot B-1)/(\cot B+\cot A)\\ \cot(A-B)=(\cot A\cot B+1)/(\cot B-\cot A) \end{array}$

Euler's formula

$e^{ix}=\cos x+i\sin x$

Complex conjugate rules

$\begin{array}{l} \overline{\alpha + \beta } = \overline{\alpha} + \overline{\beta}\\ \overline{\alpha \times \beta} = \overline{\alpha} \times \overline{\beta} \end{array}$

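Deriving the rotation

We want a transformation for which the following holds:
$\left\langle f_{q}\left(\boldsymbol{q}_{m}, m\right), f_{k}\left(\boldsymbol{k}_{n}, n\right)\right\rangle=g\left(\boldsymbol{q}_{m}, \boldsymbol{k}_{n}, m-n\right)$
where $\boldsymbol{q}_{m}$ and $\boldsymbol{k}_{n}$ are the query and key vectors computed from the hidden vectors at positions m and n. In other words, after the Q and K vectors are transformed, the inner product of the new vectors depends only on the original Q and K vectors and on m-n.
The two-dimensional form of RoPE is given below.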
$\begin{array}{l} f_{q}\left(\boldsymbol{q}_{m}, m\right)=\boldsymbol{q}_{m} e^{i m \theta} \\ f_{k}\left(\boldsymbol{k}_{n}, n\right)=\boldsymbol{k}_{n} e^{i n \theta} \\ \begin{aligned} g\left(\boldsymbol{q}_{m}, \boldsymbol{k}_{n}, m-n\right)&=\operatorname{Re}\left [ f_{q}\left(\boldsymbol{q}_{m}, m\right) \times \overline{f_{k}\left(\boldsymbol{k}_{n}, n\right)} \right ] \\ &=\operatorname{Re}\left [ \boldsymbol{q}_{m} e^{i m \theta} \times \overline{\boldsymbol{k}_{n}}\, \overline{ e^{i n \theta}} \right ] \\ &=\operatorname{Re}\left [ \boldsymbol{q}_{m} e^{i m \theta} e^{-i n \theta} \overline{\boldsymbol{k}_{n}}\right ] \\ &=\operatorname{Re}\left [ \boldsymbol{q}_{m} e^{i (m-n) \theta} \overline{\boldsymbol{k}_{n}}\right ] \end{aligned} \end{array}$
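
The complex form is easy to try out with Python's `cmath`. This is a minimal sketch with arbitrary example values for q, k, and theta; `rope_2d` and `score` are names introduced here for illustration.

```python
import cmath

def rope_2d(vec, pos, theta):
    """Rotate a 2-D vector, viewed as a complex number, by the angle pos*theta."""
    return vec * cmath.exp(1j * pos * theta)

def score(q, m, k, n, theta):
    """Re[f_q * conj(f_k)], the real inner product of the two rotated vectors."""
    return (rope_2d(q, m, theta) * rope_2d(k, n, theta).conjugate()).real

q, k, theta = complex(0.3, -1.2), complex(0.8, 0.5), 0.1   # arbitrary example values
# Three different absolute positions, all with the same offset m - n = 4.
same_offset = [score(q, m, k, m - 4, theta) for m in (4, 10, 250)]
expected = (q * cmath.exp(1j * 4 * theta) * k.conjugate()).real
```

All entries of `same_offset` agree with `expected` up to floating-point error, since only m-n enters the result.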
Next we derive the matrix form. Here ${q}_{m}$ has only two elements and is written as the complex number ${q}_{1}+ i {q}_{2}$; likewise ${k}_{n}$ is written as ${k}_{1}+ i {k}_{2}$.
$\begin{array}{l} ({q}_{1}+i {q}_{2}) e^{i m \theta}=({q}_{1} \cos m \theta- {q}_{2} \sin m \theta)+i({q}_{1} \sin m \theta+ {q}_{2} \cos m \theta)\\ ({k}_{1}+i {k}_{2}) e^{i n \theta}=({k}_{1} \cos n \theta- {k}_{2} \sin n \theta)+i({k}_{1} \sin n \theta+ {k}_{2} \cos n \theta) \end{array}$
$\begin{aligned} &\operatorname{Re}\left \{ \left [({q}_{1}+i {q}_{2}) e^{i m \theta} \right ] \times \overline{\left [ ({k}_{1}+i {k}_{2}) e^{i n \theta} \right ]} \right \} \\ & = \operatorname{Re} \left \{ \left [ ({q}_{1} \cos m \theta- {q}_{2} \sin m \theta)+i({q}_{1} \sin m \theta+ {q}_{2} \cos m \theta) \right ] \times \left [ ({k}_{1} \cos n \theta- {k}_{2} \sin n \theta)-i({k}_{1} \sin n \theta+ {k}_{2} \cos n \theta) \right ] \right \}\\ & = \left [ \left ( {q}_{1} \cos m \theta- {q}_{2} \sin m \theta \right ) \times \left ( {k}_{1} \cos n \theta- {k}_{2} \sin n \theta \right ) \right ] + \left [ \left ( {q}_{1} \sin m \theta+ {q}_{2} \cos m \theta \right ) \times \left ( {k}_{1} \sin n \theta+ {k}_{2} \cos n \theta \right ) \right ]\\ & = {q}_{1}{k}_{1}\left ( \cos m \theta \cos n \theta + \sin m \theta \sin n \theta \right ) + \\ & \quad {q}_{1}{k}_{2}\left ( - \cos m \theta \sin n \theta + \sin m \theta \cos n \theta \right ) + \\ & \quad {q}_{2}{k}_{1}\left ( \cos m \theta \sin n \theta - \sin m \theta \cos n \theta \right ) + \\ & \quad {q}_{2}{k}_{2}\left ( \sin m \theta \sin n \theta + \cos m \theta \cos n \theta \right ) \\ & = {q}_{1}{k}_{1} \cos\left ( m-n \right ) \theta + {q}_{1}{k}_{2} \sin\left ( m-n \right ) \theta - {q}_{2}{k}_{1} \sin \left ( m-n \right ) \theta +{q}_{2}{k}_{2} \cos\left ( m-n \right ) \theta \end{aligned}$
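
As a sanity check on the expansion, the sketch below compares the explicitly rotated inner product with the closed form in m-n, in which the cross terms $q_1 k_2$ and $q_2 k_1$ enter with opposite signs. The vectors and theta are arbitrary example values, and the function names are introduced only for this illustration.

```python
import math

def rotate(x1, x2, angle):
    """Apply the 2x2 rotation matrix for `angle` to the vector (x1, x2)."""
    c, s = math.cos(angle), math.sin(angle)
    return x1 * c - x2 * s, x1 * s + x2 * c

def score_explicit(q, m, k, n, theta):
    """Rotate q by m*theta and k by n*theta, then take the real inner product."""
    qr = rotate(q[0], q[1], m * theta)
    kr = rotate(k[0], k[1], n * theta)
    return qr[0] * kr[0] + qr[1] * kr[1]

def score_closed(q, k, offset, theta):
    """Closed form in the offset m - n only."""
    c, s = math.cos(offset * theta), math.sin(offset * theta)
    return (q[0] * k[0] + q[1] * k[1]) * c + (q[0] * k[1] - q[1] * k[0]) * s

q, k, theta = (0.3, -1.2), (0.8, 0.5), 0.1   # arbitrary example values
explicit = score_explicit(q, 9, k, 2, theta)
closed = score_closed(q, k, 9 - 2, theta)
```

The two values agree up to floating-point error for any choice of m and n with the same difference.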
Extending to the multi-dimensional case:
$\left[\begin{array}{ccccccc} \cos m \theta_{0} & -\sin m \theta_{0} & 0 & 0 & \cdots & 0 & 0 \\ \sin m \theta_{0} & \cos m \theta_{0} & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos m \theta_{1} & -\sin m \theta_{1} & \cdots & 0 & 0 \\ 0 & 0 & \sin m \theta_{1} & \cos m \theta_{1} & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos m \theta_{d / 2-1} & -\sin m \theta_{d / 2-1} \\ 0 & 0 & 0 & 0 & \cdots & \sin m \theta_{d / 2-1} & \cos m \theta_{d / 2-1} \end{array}\right]\left[\begin{array}{c} q_{0} \\ q_{1} \\ q_{2} \\ q_{3} \\ \vdots \\ q_{d-2} \\ q_{d-1} \end{array}\right]$
Because the rotation matrix is sparse, the product can be computed with element-wise operations instead:
$\left[\begin{array}{c} q_{0} \\ q_{1} \\ q_{2} \\ q_{3} \\ \vdots \\ q_{d-2} \\ q_{d-1} \end{array}\right] \otimes\left[\begin{array}{c} \cos m \theta_{0} \\ \cos m \theta_{0} \\ \cos m \theta_{1} \\ \cos m \theta_{1} \\ \vdots \\ \cos m \theta_{d / 2-1} \\ \cos m \theta_{d / 2-1} \end{array}\right]+\left[\begin{array}{c} -q_{1} \\ q_{0} \\ -q_{3} \\ q_{2} \\ \vdots \\ -q_{d-1} \\ q_{d-2} \end{array}\right] \otimes\left[\begin{array}{c} \sin m \theta_{0} \\ \sin m \theta_{0} \\ \sin m \theta_{1} \\ \sin m \theta_{1} \\ \vdots \\ \sin m \theta_{d / 2-1} \\ \sin m \theta_{d / 2-1} \end{array}\right]$
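
The equivalence of the two forms can be checked directly. The following plain-Python sketch builds the block-diagonal matrix densely and compares it with the element-wise version; it assumes the interleaved pairing (q0, q1), (q2, q3), ... shown above and the frequencies $\theta_i = 10000^{-2i/d}$. The function names and the example vector are illustrative.

```python
import math

def thetas(d, base=10000.0):
    return [base ** (-2 * i / d) for i in range(d // 2)]

def rope_matrix(q, m):
    """Multiply q by the block-diagonal rotation matrix, built densely."""
    d = len(q)
    R = [[0.0] * d for _ in range(d)]
    for i, th in enumerate(thetas(d)):
        c, s = math.cos(m * th), math.sin(m * th)
        R[2 * i][2 * i], R[2 * i][2 * i + 1] = c, -s
        R[2 * i + 1][2 * i], R[2 * i + 1][2 * i + 1] = s, c
    return [sum(R[r][j] * q[j] for j in range(d)) for r in range(d)]

def rope_elementwise(q, m):
    """Same result as q*cos + swapped(q)*sin, without forming a d x d matrix."""
    d = len(q)
    cos = [math.cos(m * th) for th in thetas(d) for _ in range(2)]
    sin = [math.sin(m * th) for th in thetas(d) for _ in range(2)]
    swapped = []
    for i in range(d // 2):
        swapped += [-q[2 * i + 1], q[2 * i]]   # (-q1, q0), (-q3, q2), ...
    return [q[j] * cos[j] + swapped[j] * sin[j] for j in range(d)]

q = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.1, -0.3]   # arbitrary example vector
dense = rope_matrix(q, 11)
fast = rope_elementwise(q, 11)
```

The element-wise form needs O(d) work per vector instead of O(d^2), which is the practical reason for preferring it.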
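For each token embedding in the sequence, first compute its query and key vectors, then compute the rotary position encoding for each token position, then apply the rotation to the elements of each position's query and key vectors in pairs, and finally compute the inner product between query and key to obtain the self-attention scores.
For a detailed code walkthrough, see: Meta最新模型LLaMA细节与代码详解 ("Details and Code Walkthrough of Meta's Latest Model LLaMA").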
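
The procedure can be sketched end to end for a single query/key pair. This is a simplified illustration, not the LLaMA implementation: `q` and `k` are hypothetical, already-projected vectors, and `apply_rope` and `attention_score` are names introduced here.

```python
import math

def apply_rope(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[2i], x[2i+1]) by the angle pos * theta_i."""
    d = len(x)
    out = []
    for i in range(d // 2):
        angle = pos * base ** (-2 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[2 * i] * c - x[2 * i + 1] * s, x[2 * i] * s + x[2 * i + 1] * c]
    return out

def attention_score(q, m, k, n):
    """Inner product of the rotated query at position m and key at position n."""
    return sum(a * b for a, b in zip(apply_rope(q, m), apply_rope(k, n)))

q = [0.2, -0.4, 1.0, 0.3]    # hypothetical query vector (already projected)
k = [0.7, 0.1, -0.5, 0.9]    # hypothetical key vector (already projected)
# Same relative offset (3) at very different absolute positions.
scores = [attention_score(q, m, k, m - 3) for m in (3, 20, 1000)]
unrotated = sum(a * b for a, b in zip(q, k))
```

The entries of `scores` coincide up to floating-point error, and `attention_score(q, 5, k, 5)` reduces to the plain dot product `unrotated` because the offset is zero.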
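With the rotation frequencies chosen as below (the same as in sinusoidal encoding), the inner product tends to decay as the relative distance grows.
$\theta_{i} = 10000^{-2i/d}, \quad i = 0, 1, \ldots, d/2-1$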
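
A rough way to see the decay is to fix q and k to all-ones vectors, in which case each dimension pair contributes $2\cos((m-n)\theta_i)$ to the score. The sketch below assumes d = 128 and averages over short windows of offsets to smooth out the oscillation; `rope_score_ones` and `window_mean` are illustrative names.

```python
import math

def rope_score_ones(offset, d=128, base=10000.0):
    """RoPE score between all-ones q and k whose positions differ by `offset`."""
    thetas = [base ** (-2 * i / d) for i in range(d // 2)]
    # For q = k = (1, ..., 1) each pair contributes 2*cos(offset*theta_i).
    return sum(2 * math.cos(offset * th) for th in thetas)

def window_mean(start, width=10):
    """Average the score over `width` consecutive offsets."""
    return sum(rope_score_ones(o) for o in range(start, start + width)) / width

near, mid, far = window_mean(1), window_mean(100), window_mean(1000)
```

For real query and key vectors the curve depends on their content, so this only illustrates the trend for one simple choice.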