Study Notes: Position Embedding

Position Encodings in Hung-yi Lee's Lectures

1. Attention Is All You Need

$$
\boldsymbol{p}_{i}^{(1)}[j]=
\begin{cases}
\sin\left(i \cdot c^{\frac{j}{d}}\right) & \text{if } j \text{ is even} \\
\cos\left(i \cdot c^{\frac{j-1}{d}}\right) & \text{if } j \text{ is odd}
\end{cases}
$$
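Below is a minimal NumPy sketch of this lookup table. It assumes $c = 1/10000$, so that $c^{j/d}$ reproduces the $1/10000^{j/d}$ frequencies of the original paper; the function name and arguments are illustrative.

```python
import numpy as np

def sinusoidal_position_encoding(num_positions: int, d: int, c: float = 1.0 / 10000) -> np.ndarray:
    """Build p_i[j] = sin(i * c^(j/d)) for even j, cos(i * c^((j-1)/d)) for odd j."""
    P = np.zeros((num_positions, d))
    for i in range(num_positions):
        for j in range(d):
            if j % 2 == 0:
                P[i, j] = np.sin(i * c ** (j / d))
            else:
                P[i, j] = np.cos(i * c ** ((j - 1) / d))
    return P

# Example: a table for 50 positions in a 128-dimensional model.
pe = sinusoidal_position_encoding(50, 128)
print(pe.shape)  # (50, 128)
```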

2. Universal Transformers

$P^{t} \in \mathbb{R}^{m \times d}$ above are fixed, constant, two-dimensional (position, time) coordinate embeddings, obtained by computing the sinusoidal position embedding vectors as defined in (Vaswani et al., 2017) for the positions $1 \leq i \leq m$ and the time-step $1 \leq t \leq T$ separately for each vector dimension $1 \leq j \leq d$, and summing:

$$
\begin{aligned}
P_{i, 2j}^{t} &= \sin\left(i / 10000^{2j/d}\right) + \sin\left(t / 10000^{2j/d}\right) \\
P_{i, 2j+1}^{t} &= \cos\left(i / 10000^{2j/d}\right) + \cos\left(t / 10000^{2j/d}\right).
\end{aligned}
$$

or, in the notation of Section 1:
$$
\boldsymbol{p}_{i}^{(n)}[j]=
\begin{cases}
\sin\left(i \cdot c^{\frac{j}{d}}\right)+\sin\left(n \cdot c^{\frac{j}{d}}\right) & \text{if } j \text{ is even} \\
\cos\left(i \cdot c^{\frac{j-1}{d}}\right)+\cos\left(n \cdot c^{\frac{j-1}{d}}\right) & \text{if } j \text{ is odd}
\end{cases}
$$
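A small sketch of this two-coordinate scheme, assuming both the position index and the time-step index use the standard base-10000 sinusoid; the helper names and the 1-based indexing follow the text above, everything else is illustrative.

```python
import numpy as np

def sinusoid_table(length: int, d: int) -> np.ndarray:
    """Vaswani-style table: sin(x / 10000^(2j/d)) in even dims, cos(...) in odd dims."""
    table = np.zeros((length, d))
    x = np.arange(1, length + 1)[:, None]              # 1-based, matching 1 <= i <= m above
    div = 10000 ** (2 * (np.arange(d) // 2) / d)       # paired dims share the same frequency
    table[:, 0::2] = np.sin(x / div[0::2])
    table[:, 1::2] = np.cos(x / div[1::2])
    return table

def universal_transformer_encoding(m: int, T: int, d: int) -> np.ndarray:
    """P[t, i, :] = position term for i plus time-step term for t."""
    pos = sinusoid_table(m, d)                         # (m, d) position part
    step = sinusoid_table(T, d)                        # (T, d) time-step part
    return pos[None, :, :] + step[:, None, :]          # (T, m, d)

P = universal_transformer_encoding(m=10, T=6, d=32)
print(P.shape)  # (6, 10, 32)
```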

3. Learnable Position Encodings

4. Learning to Encode Position for Transformer with Continuous Dynamical Model

$$
\boldsymbol{p}^{(n)}(t)=\boldsymbol{p}^{(n)}(s)+\int_{s}^{t} \boldsymbol{h}^{(n)}\left(\tau, \boldsymbol{p}^{(n)}(\tau) ; \boldsymbol{\theta}_{h}^{(n)}\right) \mathrm{d}\tau
$$
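As a rough illustration only, the integral can be approximated numerically; the sketch below uses a fixed-step Euler solver with a tiny MLP standing in for $\boldsymbol{h}^{(n)}$. The network size, step count, and class name are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ContinuousPositionEncoder(nn.Module):
    """Toy continuous-dynamics encoder: p(t) = p(s) + integral of h(tau, p(tau)) dtau,
    approximated with fixed-step Euler integration between consecutive positions."""

    def __init__(self, d_model: int, hidden: int = 64, euler_steps: int = 4):
        super().__init__()
        # h maps (tau, p) -> dp/dtau; a small MLP stands in for h^{(n)}.
        self.h = nn.Sequential(nn.Linear(d_model + 1, hidden), nn.Tanh(), nn.Linear(hidden, d_model))
        self.p0 = nn.Parameter(torch.zeros(d_model))   # encoding of the first position
        self.euler_steps = euler_steps

    def forward(self, num_positions: int) -> torch.Tensor:
        encodings, p, t = [], self.p0, 0.0
        dt = 1.0 / self.euler_steps                    # integrate over unit intervals between positions
        for _ in range(num_positions):
            encodings.append(p)
            for _ in range(self.euler_steps):          # Euler step: p <- p + h(t, p) * dt
                tau = torch.tensor([t], dtype=p.dtype)
                p = p + self.h(torch.cat([tau, p])) * dt
                t += dt
        return torch.stack(encodings)                  # (num_positions, d_model)

pe = ContinuousPositionEncoder(d_model=32)(num_positions=10)
print(pe.shape)  # torch.Size([10, 32])
```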


Position Encodings in Papers with Code

1. Absolute Position Encodings

Absolute Position Encodings are a type of position embeddings for Transformer-based models where positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{\text{model}}$ as the embeddings, so that the two can be summed. In the original implementation, sine and cosine functions of different frequencies are used:
$$
\begin{gathered}
\mathrm{PE}(pos, 2i)=\sin\left(pos / 10000^{2i/d_{\text{model}}}\right) \\
\mathrm{PE}(pos, 2i+1)=\cos\left(pos / 10000^{2i/d_{\text{model}}}\right)
\end{gathered}
$$
where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. This function was chosen because the authors hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $\mathrm{PE}_{pos+k}$ can be represented as a linear function of $\mathrm{PE}_{pos}$.
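That linearity claim is easy to verify numerically: within each frequency pair, shifting the position by $k$ amounts to multiplying by a fixed $2 \times 2$ rotation matrix. The short check below (dimensions and offsets chosen arbitrarily) confirms it for one pair.

```python
import numpy as np

d_model, base = 64, 10000.0

def pe(pos: int) -> np.ndarray:
    """PE(pos, 2i) = sin(pos / base^(2i/d)), PE(pos, 2i+1) = cos(pos / base^(2i/d))."""
    i = np.arange(d_model // 2)
    angles = pos / base ** (2 * i / d_model)
    out = np.empty(d_model)
    out[0::2], out[1::2] = np.sin(angles), np.cos(angles)
    return out

pos, k, i = 7, 5, 3                        # arbitrary position, offset, and dimension pair
theta = k / base ** (2 * i / d_model)      # rotation angle for this frequency
R = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])
pair = pe(pos)[2 * i:2 * i + 2]            # (sin, cos) components at position pos
shifted = pe(pos + k)[2 * i:2 * i + 2]     # same components at position pos + k
print(np.allclose(R @ pair, shifted))      # True: PE(pos+k) is a linear map of PE(pos)
```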


2. Relative Position Encodings

Relative Position Encodings are a type of position embeddings for Transformer-based models that attempts to exploit pairwise, relative positional information. Relative positional information is supplied to the model on two levels: values and keys. This becomes apparent in the two modified self-attention equations shown below. First, relative positional information is supplied to the model as an additional component to the keys:
$$
e_{ij}=\frac{x_{i} W^{Q}\left(x_{j} W^{K}+a_{ij}^{K}\right)^{T}}{\sqrt{d_{z}}}
$$
Here $a_{ij}$ is an edge representation for the inputs $x_{i}$ and $x_{j}$. The softmax operation remains unchanged from vanilla self-attention. Then relative positional information is supplied again as a sub-component of the values matrix:
$$
z_{i}=\sum_{j=1}^{n} \alpha_{ij}\left(x_{j} W^{V}+a_{ij}^{V}\right)
$$
In other words, instead of simply combining semantic embeddings with absolute positional ones, relative positional information is added to keys and values on the fly during attention calculation.
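A minimal single-head sketch of these two equations, with the relative embeddings clipped to a maximum distance as in Shaw et al.; the tensor names follow the formulas above, while the shapes and clipping distance are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def relative_self_attention(x, Wq, Wk, Wv, a_k, a_v, max_dist: int):
    """x: (n, d_x); Wq/Wk/Wv: (d_x, d_z); a_k/a_v: (2*max_dist+1, d_z) relative embeddings."""
    n, d_z = x.size(0), Wq.size(1)
    q, k, v = x @ Wq, x @ Wk, x @ Wv                      # (n, d_z) each

    # Relative distance j - i, clipped to [-max_dist, max_dist], then shifted into table indices.
    rel = torch.arange(n)[None, :] - torch.arange(n)[:, None]
    rel = rel.clamp(-max_dist, max_dist) + max_dist       # (n, n) indices into a_k / a_v
    a_ij_k, a_ij_v = a_k[rel], a_v[rel]                   # (n, n, d_z)

    # e_ij = x_i W^Q (x_j W^K + a_ij^K)^T / sqrt(d_z)
    e = (q @ k.T + torch.einsum("id,ijd->ij", q, a_ij_k)) / d_z ** 0.5
    alpha = F.softmax(e, dim=-1)                          # softmax is unchanged

    # z_i = sum_j alpha_ij (x_j W^V + a_ij^V)
    return alpha @ v + torch.einsum("ij,ijd->id", alpha, a_ij_v)

n, d_x, d_z, max_dist = 6, 16, 8, 4
z = relative_self_attention(torch.randn(n, d_x),
                            torch.randn(d_x, d_z), torch.randn(d_x, d_z), torch.randn(d_x, d_z),
                            torch.randn(2 * max_dist + 1, d_z), torch.randn(2 * max_dist + 1, d_z),
                            max_dist)
print(z.shape)  # torch.Size([6, 8])
```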


3. Rotary Position Embedding

Rotary Position Embedding, or RoPE, is a type of position embedding which encodes absolute positional information with a rotation matrix and naturally incorporates explicit relative position dependency in the self-attention formulation. Notably, RoPE comes with valuable properties such as the flexibility to extend to any sequence length, decaying inter-token dependency with increasing relative distance, and the capability of equipping linear self-attention with relative position encoding.
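A compact sketch of the rotation itself, applied identically to queries and keys before their dot product. The pairing of adjacent dimensions and the base of 10000 follow the common RoFormer convention; treat the exact layout as an assumption.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate consecutive dimension pairs of x (seq_len, d) by position-dependent angles."""
    seq_len, d = x.shape
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float) / d)       # (d/2,) frequencies
    angles = torch.arange(seq_len, dtype=torch.float)[:, None] * theta    # (seq_len, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                                       # paired dims (x_{2i}, x_{2i+1})
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin     # 2-D rotation of each pair by m * theta_i
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q, k = torch.randn(5, 16), torch.randn(5, 16)
q_rot, k_rot = apply_rope(q), apply_rope(k)
# The rotated q.k dot products now depend on positions only through their difference.
print((q_rot @ k_rot.T).shape)  # torch.Size([5, 5])
```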


4. Conditional Positional Encoding

Conditional Positional Encoding, or CPE, is a type of positional encoding for vision transformers. Unlike previous fixed or learnable positional encodings, which are predefined and independent of the input tokens, CPE is dynamically generated and conditioned on the local neighborhood of the input tokens. As a result, CPE can generalize to input sequences longer than those the model has seen during training. CPE also keeps the desired translation invariance in image classification tasks. CPE can be implemented with a Position Encoding Generator (PEG) and incorporated into the current Transformer framework.
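A rough sketch of a PEG along the lines of that description: a depthwise 3x3 convolution over the 2-D token grid whose output is added back to the tokens. The class name, kernel size, and the omission of a class token are assumptions rather than the official implementation.

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Position Encoding Generator: conditions the encoding on each token's local neighborhood
    via a depthwise convolution over the 2-D token grid, then adds it back to the tokens."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim, kernel_size, stride=1,
                              padding=kernel_size // 2, groups=dim)   # depthwise conv

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (batch, h*w, dim) -> reshape to the image grid, convolve, add residually.
        b, n, c = tokens.shape
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)
        return tokens + self.proj(feat).flatten(2).transpose(1, 2)

tokens = torch.randn(2, 14 * 14, 192)            # e.g. a 14x14 grid of patch tokens
out = PEG(192)(tokens, h=14, w=14)
print(out.shape)  # torch.Size([2, 196, 192])
```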

5. Attention with Linear Biases

ALiBi, or Attention with Linear Biases, is an alternative to position embeddings that lets Transformer models extrapolate to longer sequences at inference time. When computing the attention scores for each head, ALiBi adds a constant, distance-proportional bias to each query-key score $\mathbf{q}_{i} \cdot \mathbf{k}_{j}$. As in the unmodified attention sublayer, the softmax function is then applied to these scores, and the rest of the computation is left unchanged. The slope $m$ is a head-specific scalar that is set before training and not learned. When using ALiBi, no positional embeddings are added at the bottom of the network.
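A minimal causal-attention sketch with the linear bias added to the query-key scores. The geometric slope schedule below matches the commonly cited values for 8 heads (1/2, 1/4, ..., 1/256), but the exact numbers and function signature are illustrative.

```python
import torch
import torch.nn.functional as F

def alibi_attention(q, k, v):
    """q, k, v: (heads, seq_len, d_head). Adds a head-specific linear bias m * (j - i)
    to each attention score before the softmax; no position embeddings are used."""
    n_heads, seq_len, d = q.shape

    # Head-specific slopes: a geometric sequence, e.g. 1/2, 1/4, ..., 1/256 for 8 heads.
    slopes = torch.tensor([2 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])

    # Bias is proportional to the distance of each key to the left of the query.
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    bias = slopes[:, None, None] * (j - i).clamp(max=0)                  # (heads, seq, seq)
    causal = torch.zeros(seq_len, seq_len).masked_fill(j > i, float("-inf"))

    scores = q @ k.transpose(-2, -1) / d ** 0.5 + bias + causal
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(8, 10, 16)
print(alibi_attention(q, k, v).shape)  # torch.Size([8, 10, 16])
```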

References:

[1] Self-Attention and Positional Encoding

[2] How Positional Embeddings work in Self-Attention (code in Pytorch) | AI Summer (theaisummer.com)

[3] An Overview of Position Embeddings | Papers With Code

Position embedding is a technique that encodes the information of each position in a sequence as a vector, used when handling sequence data in natural language processing (NLP). It captures the positional relationships between elements of the sequence, which is important for sequence tasks. Some learnable position embedding techniques include:

1. Embedding-based position encoding: a learnable position embedding layer maps each position to a vector, which is added to the word embedding to form the position-encoded representation.
2. Attention-based position encoding: an attention mechanism generates a vector for each position as an attention-weighted sum over the words in the sequence, and these vectors carry the positional information.
3. Convolutional neural network-based position encoding: a CNN learns a set of convolution kernels that capture positional information in the sequence; the resulting position vectors are added to the word embeddings to form the position-encoded representation.

These learnable position embedding techniques are widely used in NLP tasks such as machine translation, text classification, and sentiment analysis.
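A minimal sketch of the first, embedding-based option: a trainable lookup table of position vectors added to the word embeddings; all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Embedding-based position encoding: one trainable vector per position, added to word embeddings."""

    def __init__(self, vocab_size: int, max_len: int, d_model: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)   # learned jointly with the rest of the model

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.tok(token_ids) + self.pos(positions)[None, :, :]

emb = LearnedPositionalEmbedding(vocab_size=1000, max_len=512, d_model=64)
print(emb(torch.randint(0, 1000, (2, 20))).shape)  # torch.Size([2, 20, 64])
```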
