大模型基础之位置编码RoPE

RoPE

RoPE(Rotary Position Embedding,旋转式位置编码)是一种配合Attention机制能达到“通过绝对位置编码的方式实现相对位置编码”的设计。在2021年2月由苏剑林提出,是现在的大模型最常用的位置编码。

q \mathbf{q} q k \mathbf{k} k为self-attention的q和k向量, j j j是元素索引。假设有 0 < θ ≤ π 2 N 0 < \theta\leq \frac{\pi}{2N} 0<θ2Nπ,N是最大序列长度。 i表示复数的虚数单位, ⟨ ⟩ \langle \rangle 为内积符号, z ‾ \overline{z} z表示复数z的共轭复数(注:有些地方会将共轭复数记作 z ∗ z^* z)。RoPE可表示为下述过程:
RoPE ( x , m ) = x e i m θ ⟨ RoPE ( q j , m ) , RoPE ( k j , n ) ⟩ = ⟨ q j e i m θ , k j e i n θ ⟩ = q j k j e i m θ e i n θ ‾ = q j k j e i ( m − n ) θ = RoPE ( q j k j , m − n ) \begin {aligned} \text{RoPE}(\mathbf{x}, m) &= \mathbf{x} e^{im\theta} \\ \langle \text{RoPE}(q_j, m), \text{RoPE}(k_j,n) \rangle &= \langle q_j e^{im\theta}, k_j e^{in\theta} \rangle \\ &= q_jk_j e^{im\theta} \overline{e^{in \theta}} \\ &= q_jk_j e^{i(m-n)\theta} \\ &= \text{RoPE}(q_j k_j, m-n) \end {aligned} RoPE(x,m)RoPE(qj,m),RoPE(kj,n)⟩=xeimθ=qjeimθ,kjeinθ=qjkjeimθeinθ=qjkjei(mn)θ=RoPE(qjkj,mn)

RoPE的示意图如下图(来自RoFormer论文)
在这里插入图片描述

准备知识

  • 复数的笛卡尔积形式(Cartesian form): z = a + i b z = a + ib z=a+ib

  • 复数的极坐标形式(polar form): z = r ( cos ⁡ θ + i sin ⁡ θ ) z=r(\cos \theta + i \sin \theta) z=r(cosθ+isinθ), 其中 r = ∣ z ∣ = a 2 + b 2 r=|z|=\sqrt{a^2 + b^2} r=z=a2+b2 θ = arg ⁡ ( x ) = tan ⁡ − 1 b a \theta = \arg(x) = \tan^{-1} \frac{b}{a} θ=arg(x)=tan1ab

  • 欧拉公式: e i x = cos ⁡ ( x ) + i sin ⁡ ( x ) e^{ix} = \cos(x) + i \sin(x) eix=cos(x)+isin(x)

  • 复数的指数形式(exponential form): z = r e i θ z = r \text{e}^{i\theta} z=reiθ, 其中 r = ∣ z ∣ = a 2 + b 2 r=|z|=\sqrt{a^2 + b^2} r=z=a2+b2 θ = arg ⁡ ( x ) = tan ⁡ − 1 b a \theta = \arg(x) = \tan^{-1} \frac{b}{a} θ=arg(x)=tan1ab

  • 在复数的指数形式下复数 z = r e i θ z = r \text{e}^{i\theta} z=reiθ的共轭复数为 z ‾ = r e − i θ \overline{z} = r \text{e}^{-i\theta} z=reiθ, 两个复数 z = r e i θ z = r \text{e}^{i\theta} z=reiθ w = t e i ϕ w = t \text{e}^{i\phi} w=teiϕ的乘积为 z w = r t e i ( θ + ϕ ) zw = rt \text{e}^{i(\theta + \phi)} zw=rtei(θ+ϕ)

  • 在复数的极坐标形式下两个复数 z = r ( cos ⁡ θ + i sin ⁡ θ ) z=r(\cos \theta + i \sin \theta) z=r(cosθ+isinθ) w = t ( cos ⁡ ϕ + i sin ⁡ ϕ ) w=t(\cos \phi + i \sin \phi) w=t(cosϕ+isinϕ)的乘积为 z w = r t ( cos ⁡ ( θ + ϕ ) + i sin ⁡ ( θ + ϕ ) ) zw=rt(\cos(\theta+\phi) + i\sin(\theta + \phi)) zw=rt(cos(θ+ϕ)+isin(θ+ϕ))

  • 复数 z = a + i b z = a + ib z=a+ib表示成矩阵时的形式为 ( a − b b a ) \left(\begin{array}{cc} a & -b \\ b & a \end{array}\right) (abba),而旋转矩阵形式为 [ cos ⁡ θ − sin ⁡ θ sin ⁡ θ cos ⁡ θ ] \left[\begin{array}{cc} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \end{array}\right] [cosθsinθsinθcosθ],所以复数乘法的几何意义为将向量逆时针旋转 θ \theta θ(也可以从复数极坐标形式的乘法来理解旋转)

    在这里插入图片描述

RoPE的推导

假设有函数f(x, l)给位置l处的元素x添加绝对位置信息,对于 q \mathbf{q} q k \mathbf{k} k用函数f编码后有:
q ~ m = f ( q , m ) , k ~ n = f ( k , n ) ( 1 ) \tilde{\mathbf{q}}_m = f(\mathbf{q}, m), \qquad \tilde{\mathbf{k}}_n = f(\mathbf{k}, n) \qquad (1) q~m=f(q,m),k~n=f(k,n)(1)
即我们希望为 q \mathbf{q} q k \mathbf{k} k设计函数 f ( ⋅ , m ) f(\cdot, m) f(,m) f ( ⋅ , n ) f(\cdot, n) f(,n),使用函数编码后, q ~ m \tilde{\mathbf{q}}_m q~m k ~ n \tilde{\mathbf{k}}_n k~n带有了位置m和n的绝对位置信息。因为self-attention的核心运算是内积,我们希望 q \mathbf{q} q k \mathbf{k} k内积的结果有相对位置信息,即我们假设存在有如下恒等关系:
⟨ f ( q , m ) , f ( k , n ) ⟩ = g ( q , k , m − n ) ( 2 ) \langle f(\mathbf{q},m),f(\mathbf{k}, n) \rangle = g(\mathbf{q}, \mathbf{k}, m-n) \qquad (2) f(q,m),f(k,n)⟩=g(q,k,mn)(2)
我们的目标是求出这个恒等式的一个解。先将求解过程中的初始条件设为 f ( q , 0 ) = q f(\mathbf{q}, 0) = \mathbf{q} f(q,0)=q f ( k , 0 ) = k f(\mathbf{k}, 0) = \mathbf{k} f(k,0)=k,这可以理解为没有位置编码信息加入的情形。

我们先考虑二维情形,并借助复数来求解。用复数的指数形式来表示函数:

f ( q , m ) = R f ( q , m ) e i Θ f ( q , m ) ( 3 a ) f ( k , n ) = R f ( k , n ) e i Θ f ( k , n ) ( 3 b ) g ( q , k , m − n ) = R g ( q , k , m − n ) e i Θ g ( q , k , m − n ) ( 3 c ) \begin{align*} f(\mathbf{q}, m) &= R_f(\mathbf{q}, m)e^{i\Theta_f(\mathbf{q}, m)} \qquad (3a)\\ f(\mathbf{k}, n) &= R_f(\mathbf{k}, n)e^{i\Theta_f(\mathbf{k}, n)} \qquad (3b) \\ g(\mathbf{q}, \mathbf{k}, m - n) &= R_g(\mathbf{q}, \mathbf{k}, m - n)e^{i\Theta_g(\mathbf{q}, \mathbf{k}, m - n)} \qquad (3c) \end{align*} f(q,m)f(k,n)g(q,k,mn)=Rf(q,m)eiΘf(q,m)(3a)=Rf(k,n)eiΘf(k,n)(3b)=Rg(q,k,mn)eiΘg(q,k,mn)(3c)

上式中的 R f R_f Rf R g R_g Rg 是f和g的径向分量(radial component), Θ f \Theta_f Θf Θ g \Theta_g Θg是f和g的幅角分量(angular components)。将它们代到恒等表达式(2)中,可以得到:

R f ( q , m ) R f ( k , n ) = R g ( q , k , m − n ) ( 4 a ) Θ f ( q , m ) − Θ f ( k , n ) = Θ g ( q , k , m − n ) ( 4 b ) \begin{align*} R_f(\mathbf{q}, m) R_f(\mathbf{k}, n) &= R_g(\mathbf{q}, \mathbf{k}, m - n) \qquad (4a) \\ \Theta_f(\mathbf{q}, m) - \Theta_f(\mathbf{k}, n) &= \Theta_g(\mathbf{q}, \mathbf{k}, m - n) \qquad (4b) \\ \end{align*} Rf(q,m)Rf(k,n)Θf(q,m)Θf(k,n)=Rg(q,k,mn)(4a)=Θg(q,k,mn)(4b)

由初始条件 f ( q , 0 ) = q f(\mathbf{q}, 0) = \mathbf{q} f(q,0)=q f ( k , 0 ) = k f(\mathbf{k}, 0) = \mathbf{k} f(k,0)=k我们有( ∣ ∣ q ∣ ∣ ||\mathbf{q}|| ∣∣q∣∣ ∣ ∣ k ∣ ∣ ||\mathbf{k}|| ∣∣k∣∣ θ q \theta_q θq θ k \theta_k θk是向量 q \mathbf{q} q k \mathbf{k} k在二维平面上的径向和幅角分量):
q = ∣ ∣ q ∣ ∣ e i θ q = R q ( q , 0 ) e i Θ q ( q , 0 ) k = ∣ ∣ k ∣ ∣ e i θ k = R q ( k , 0 ) e i Θ k ( k , 0 ) ( 5 ) \begin{aligned} \mathbf{q} = ||\mathbf{q}|| e^{i\theta_q}=R_q(\mathbf{q}, 0) e^{i \Theta_q(\mathbf{q}, 0)} \\ \mathbf{k} = ||\mathbf{k}|| e^{i\theta_k}=R_q(\mathbf{k}, 0) e^{i \Theta_k(\mathbf{k}, 0)} \end{aligned} \qquad (5) q=∣∣q∣∣eiθq=Rq(q,0)eiΘq(q,0)k=∣∣k∣∣eiθk=Rq(k,0)eiΘk(k,0)(5)
设m=n,以及考虑到初始条件 f ( x , 0 ) = x f(\mathbf{x},0)=\mathbf{x} f(x,0)=x, 由式(4a)可以得到:
R f ( q , m ) R f ( k , m ) = R g ( q , k , 0 ) = R f ( q , 0 ) R f ( k , 0 ) = ∥ q ∥ ∥ k ∥ ( 6 ) R_f(\mathbf{q}, m) R_f(\mathbf{k}, m) = R_g(\mathbf{q}, \mathbf{k}, 0) = R_f(\mathbf{q}, 0) R_f(\mathbf{k}, 0) = \parallel\mathbf{q}\parallel \parallel \mathbf{k}\parallel \qquad (6) \\ Rf(q,m)Rf(k,m)=Rg(q,k,0)=Rf(q,0)Rf(k,0)=∥q∥∥k(6)

由上式我们可以有 R f ( q , m ) = R f ( q , 0 ) = ∣ ∣ q ∣ ∣ R_f(\mathbf{q}, m)=R_f(\mathbf{q}, 0)=|| \mathbf{q} || Rf(q,m)=Rf(q,0)=∣∣q∣∣ R f ( k , m ) = R f ( k , 0 ) = ∣ ∣ k ∣ ∣ R_f(\mathbf{k}, m)=R_f(\mathbf{k}, 0)=|| \mathbf{k} || Rf(k,m)=Rf(k,0)=∣∣k∣∣ R g ( q , k , m − n ) = R g ( q , k , 0 ) = ∣ ∣ q ∣ ∣ ∣ ∣ k ∣ ∣ R_g(\mathbf{q}, \mathbf{k}, m-n)=R_g(\mathbf{q}, \mathbf{k}, 0)=|| \mathbf{q} || || \mathbf{k} || Rg(q,k,mn)=Rg(q,k,0)=∣∣q∣∣∣∣k∣∣ R f R_f Rf R g R_g Rg不依赖于位置信息。

类似地,设m=n,以及考虑到初始条件 Θ ( x , 0 ) = Θ ( x ) \Theta(\mathbf{x},0)=\Theta(\mathbf{x}) Θ(x,0)=Θ(x),由式(4b)可以得到( Θ ( q ) \Theta(\mathbf{q}) Θ(q) Θ ( k ) \Theta(\mathbf{k}) Θ(k)是向量 q \mathbf{q} q k \mathbf{k} k的幅角):
Θ f ( q , m ) − Θ f ( k , m ) = Θ g ( q , k , 0 ) = Θ f ( q , 0 ) − Θ f ( k , 0 ) = Θ ( q ) − Θ ( k ) ( 7 ) \Theta_f(\mathbf{q}, m) - \Theta_f(\mathbf{k}, m) = \Theta_g(\mathbf{q}, \mathbf{k}, 0) = \Theta_f(\mathbf{q}, 0) - \Theta_f(\mathbf{k}, 0) = \Theta(\mathbf{q}) - \Theta(\mathbf{k}) \qquad (7) Θf(q,m)Θf(k,m)=Θg(q,k,0)=Θf(q,0)Θf(k,0)=Θ(q)Θ(k)(7)
将上式的第一项和最后一项移位我们可得 Θ f ( q , m ) − Θ ( q ) = Θ f ( k , m ) − Θ ( k ) \Theta_f(\mathbf{q}, m)- \Theta(\mathbf{q}) = \Theta_f(\mathbf{k}, m) - \Theta(\mathbf{k}) Θf(q,m)Θ(q)=Θf(k,m)Θ(k),所以 Θ f ( q , m ) − Θ ( q ) \Theta_f(\mathbf{q}, m)- \Theta(\mathbf{q}) Θf(q,m)Θ(q)是一个只与m有关与 q \mathbf{q} q无关的函数,将其记为 φ ( m ) \varphi(m) φ(m),则有 Θ f ( q , m ) = Θ ( q ) + φ ( m ) \Theta_f(\mathbf{q}, m)= \Theta(\mathbf{q}) + \varphi(m) Θf(q,m)=Θ(q)+φ(m)。令n=m-1,将其代入到式(4b)并移项可得
φ ( m ) − φ ( m − 1 ) = Θ g ( q , k , 1 ) + Θ ( k ) − Θ ( q ) \varphi(m)-\varphi(m-1) = \Theta_g(\mathbf{q}, \mathbf{k}, 1) + \Theta(\mathbf{k}) - \Theta(\mathbf{q}) φ(m)φ(m1)=Θg(q,k,1)+Θ(k)Θ(q)
因为上式右侧与m无关,所以上式左侧也必须与m无关,因此 φ \varphi φ是一个等差数列(arithmetic progression),如果我们设等差数列的初始值 φ ( 0 ) = 0 \varphi(0)=0 φ(0)=0 φ ( 1 ) = θ \varphi(1)=\theta φ(1)=θ,那么就可得 φ ( m ) = m θ \varphi(m) = m \theta φ(m)=mθ

综上,我们得到了二维情况下用复数表示的RoPE, 它是满足恒等式2的一个解:
f ( q , m ) = R f ( q , m ) e i Θ f ( q , m ) = ∣ ∣ q ∣ ∣ e i ( Θ ( q ) + m θ ) = q e i m θ f(\mathbf{q}, m) = R_f(\mathbf{q}, m)e^{i\Theta_f(\mathbf{q}, m)}=||\mathbf{q}||e^{i(\Theta(\mathbf{q})+m\mathbf{\theta})} = \mathbf{q} e^{im \theta} f(q,m)=Rf(q,m)eiΘf(q,m)=∣∣q∣∣ei(Θ(q)+mθ)=qeimθ
根据复数乘法的几何意义,这个变换对应着向量的旋转,所以RoPE作者将其称之为”旋转式位置编码“。

将上式表示成矩阵形式:
f ( q , m ) = ( cos ⁡ m θ − sin ⁡ m θ sin ⁡ m θ cos ⁡ m θ ) ( q 0 q 1 ) \boldsymbol{f}(\boldsymbol{q}, m)=\left(\begin{array}{cc} \cos m \theta & -\sin m \theta \\ \sin m \theta & \cos m \theta \end{array}\right)\binom{q_0}{q_1} f(q,m)=(cosmθsinmθsinmθcosmθ)(q1q0)
由于内积满足线性叠加性,所以任意偶数维的RoPE都可以表示为二维情形的拼接。
f ( q , m ) = ( M 0 M 1 ⋱ M d / 2 − 1 ) ( q 0 q 1 ⋮ q d − 1 ) = R m q m = R m W q x m f(\mathbf{q}, m) = \begin{pmatrix} M_0 & & & \\ & M_1 & & \\ & & \ddots & \\ & & & M_{d/2-1} \end{pmatrix} \begin{pmatrix} q_0\\ q_1\\ \vdots\\ q_{d-1} \end{pmatrix} = \mathbf{R_m q_m} = \mathbf{R_m W_q x_m} f(q,m)= M0M1Md/21 q0q1qd1 =Rmqm=RmWqxm
上式中 M j = ( cos ⁡ m θ j − sin ⁡ m θ j s i n m θ j cos ⁡ m θ j ) M_j=\begin{pmatrix}\cos m\theta_j & -\sin m\theta_j \\sin m\theta_j & \cos m\theta_j\end{pmatrix} Mj=(cosmθjsinmθjsinmθjcosmθj) Θ = { θ i = 1000 0 − 2 ( i − 1 ) / d , i ∈ [ 0 , 1 , 2 , … , d / 2 − 1 ] } \Theta = \{\theta_i=10000^{-2(i-1)/d}, i \in[0,1,2, \ldots, d/2-1] \} Θ={θi=100002(i1)/d,i[0,1,2,,d/21]} R m \mathbf{R_m} Rm是对角旋转矩阵,它是一个正交矩阵, W q \mathbf{W_q} Wq是待学习的query权重, x m \mathbf{x_m} xm则是m处的token的embedding。

也就是说,给位置m的向量 q \mathbf{q} q乘上矩阵 R m \mathbf{R_m} Rm,位置为n的向量 k \mathbf{k} k乘上矩阵 R n \mathbf{R_n} Rn, 用变换后的序列做Attention,Attention就自动包含相对位置了,因为有如下恒等式:
( R m q ) T ( R n k ) = q T R m T R n k = q T R m − n k (\mathbf{R_m q})^T(\mathbf{R_n k}) = \mathbf{q^T R_m^T R_n k} = \mathbf{q^T R_{m-n} k} (Rmq)T(Rnk)=qTRmTRnk=qTRmnk
因为 R m \mathbf{R_m} Rm的稀疏性,直接用矩阵乘法来实现很浪费算法,所以RoPE作者推荐用如下方式来实现RoPE:
( q 0 q 1 q 2 q 3 ⋮ q d − 2 q d − 1 ) ⊗ ( cos ⁡ m θ 0 cos ⁡ m θ 0 cos ⁡ m θ 1 cos ⁡ m θ 1 ⋮ cos ⁡ m θ d / 2 − 1 cos ⁡ m θ d / 2 − 1 ) + ( − q 1 q 0 − q 3 q 2 ⋮ − q d − 1 q d − 2 ) ⊗ ( sin ⁡ m θ 0 sin ⁡ m θ 0 sin ⁡ m θ 1 sin ⁡ m θ 1 ⋮ sin ⁡ m θ d / 2 − 1 sin ⁡ m θ d / 2 − 1 ) \left(\begin{array}{c} q_0 \\ q_1 \\ q_2 \\ q_3 \\ \vdots \\ q_{d-2} \\ q_{d-1} \end{array}\right) \otimes\left(\begin{array}{c} \cos m \theta_0 \\ \cos m \theta_0 \\ \cos m \theta_1 \\ \cos m \theta_1 \\ \vdots \\ \cos m \theta_{d / 2-1} \\ \cos m \theta_{d / 2-1} \end{array}\right)+\left(\begin{array}{c} -q_1 \\ q_0 \\ -q_3 \\ q_2 \\ \vdots \\ -q_{d-1} \\ q_{d-2} \end{array}\right) \otimes\left(\begin{array}{c} \sin m \theta_0 \\ \sin m \theta_0 \\ \sin m \theta_1 \\ \sin m \theta_1 \\ \vdots \\ \sin m \theta_{d / 2-1} \\ \sin m \theta_{d / 2-1} \end{array}\right) q0q1q2q3qd2qd1 cosmθ0cosmθ0cosmθ1cosmθ1cosmθd/21cosmθd/21 + q1q0q3q2qd1qd2 sinmθ0sinmθ0sinmθ1sinmθ1sinmθd/21sinmθd/21
上式中的 ⊗ \otimes 是逐位对应相乘,是Numpy等中的*运算。 从这个实现也可以来解释为什么RoPE可以视为是乘性位置编码的变体。

RoPE的代码实现

按照上面RoPE作者推荐的方式实现RoPE的示例如下(来自参考资料4)

import torch
import math

def rotary_position_embedding(q, k):
    """
    Rotary Position Embedding (RoPE) for queries and keys.
    
    Args:
        q: tensor for queries of shape (batch_size, num_heads, seq_len, dim)
        k: tensor for keys of shape (batch_size, num_heads, seq_len, dim)
        
    Returns:
        Rotated queries and keys
    """
    batch_size, num_heads, seq_len, dim = q.size()
    
    # Begin of sinusoidal_position_embedding content
    # 序列对应的位置序号
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(-1).to(q.device)
    # q维度上的theta值
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float) * -(math.log(10000.0) / dim)).to(q.device)
    
    pos_emb = position * div_term
    pos_emb = torch.stack([torch.sin(pos_emb), torch.cos(pos_emb)], dim=-1).flatten(-2, -1)
    pos_emb = pos_emb.unsqueeze(0).unsqueeze(1)
    pos_emb = pos_emb.expand(batch_size, num_heads, -1, -1)
    # End of sinusoidal_position_embedding content

    # Extract and duplicate cosine and sine embeddings
    cos_emb = pos_emb[..., 1::2].repeat_interleave(2, dim=-1)
    sin_emb = pos_emb[..., ::2].repeat_interleave(2, dim=-1)

    # Create alternate versions of q and k
    q_alternate = torch.stack([-q[..., 1::2], q[..., ::2]], dim=-1).reshape(q.size())
    k_alternate = torch.stack([-k[..., 1::2], k[..., ::2]], dim=-1).reshape(k.size())

    # Rotate queries and keys
    q_rotated = q * cos_emb + q_alternate * sin_emb
    k_rotated = k * cos_emb + k_alternate * sin_emb

    return q_rotated, k_rotated

llama实现RoPE的方式是先将向量转到复数域,再对两个向量进行旋转,接着将向量转回到实数域。

# 以长度为4,dim维度为6的q示意llama是如何实现RoPE的
q = torch.tensor([[1, 2, 4, 5, 6, 7], [1, 2, 5, 6, 7, 8], [2, 5, 4, 6, 7, 8], [1, 3, 5, 6, 7, 9]])
seq_len, dim = q.shape  # [4,6]

# 将q在其embedding维度分为一对一对的形式
q_per_token_split_into_pairs = q.float().view(q.shape[0], -1, 2)
q_per_token_split_into_pairs.shape  # [4,3,2]

# 计算复数域的cos和sin的频率
zero_to_one_split_into_dim_parts = torch.tensor(range(dim//2))/(dim//2)
rope_theta = 10000.0
freqs = 1.0 / (rope_theta ** zero_to_one_split_into_dim_parts)
freqs_for_each_token = torch.outer(torch.arange(seq_len), freqs)
freqs_cis = torch.polar(torch.ones_like(freqs_for_each_token), freqs_for_each_token)
freqs_cis

# 将q转到复数域
q_per_token_as_complex_numbers = torch.view_as_complex(q_per_token_split_into_pairs)

# 进行dot product来按位置旋转q向量
q_per_token_as_complex_numbers_rotated = q_per_token_as_complex_numbers * freqs_cis
q_per_token_as_complex_numbers_rotated

# 将旋转了的q向量转回到实数域
q_per_token_split_into_pairs_rotated = torch.view_as_real(q_per_token_as_complex_numbers_rotated)
# 将维度还原
q_per_token_rotated = q_per_token_split_into_pairs_rotated.view(q.shape)

参考资料

  1. RoPE论文:Su, Jianlin, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. 2021. “RoFormer: Enhanced Transformer with Rotary Position Embedding.” Cornell University - arXiv,Cornell University - arXiv, April.

  2. RoPE作者苏剑林的博客: 让研究人员绞尽脑汁的Transformer位置编码Transformer升级之路:2、博采众长的旋转式位置编码

  3. eleuther.ai关于RoPE的博客

  4. 知乎文章 位置编码|RoPE|ALiBi

  5. 复数 wikipedia, 旋转矩阵wikipedia

  6. rotary_embedding-torch github

  7. llama3 from scratch

  • 14
    点赞
  • 9
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值