Study Notes: Position Encoding (Position Embedding)

This post surveys position encoding methods for Transformer models, covering the encodings used in Attention Is All You Need, Universal Transformers, and related work, and explaining the principles and characteristics of absolute position encodings, relative position encodings, rotary position embeddings, and other schemes, as a reference for understanding position encoding in Transformers.


Position Encodings in Hung-yi Lee's Lectures

1. Attention Is All You Need

$$
\boldsymbol{p}_{i}^{(1)}[j]=
\begin{cases}
\sin\left(i \cdot c^{\frac{j}{d}}\right) & \text{if } j \text{ is even} \\
\cos\left(i \cdot c^{\frac{j-1}{d}}\right) & \text{if } j \text{ is odd}
\end{cases}
$$
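As a concrete illustration, here is a minimal PyTorch sketch of this table. The choice $c = 1/10000$ is an assumption (it recovers the frequencies of the original Transformer); positions are 0-indexed and $d$ is assumed even:

```python
import torch

def sinusoidal_position_encoding(max_len: int, d: int, c: float = 1.0 / 10000) -> torch.Tensor:
    """Table of p_i[j]: sin(i * c^(j/d)) on even dims, cos(i * c^((j-1)/d)) on odd dims."""
    pe = torch.zeros(max_len, d)
    i = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # positions i
    even_j = torch.arange(0, d, 2, dtype=torch.float32)           # even dimensions j
    freq = c ** (even_j / d)                                      # c^(j/d)
    pe[:, 0::2] = torch.sin(i * freq)                             # even dims
    pe[:, 1::2] = torch.cos(i * freq)                             # odd dims use c^((j-1)/d)
    return pe

print(sinusoidal_position_encoding(max_len=50, d=128).shape)  # torch.Size([50, 128])
```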

2. Universal Transformers

$P^{t} \in \mathbb{R}^{m \times d}$ above are fixed, constant, two-dimensional (position, time) coordinate embeddings, obtained by computing the sinusoidal position embedding vectors as defined in (Vaswani et al., 2017) for the positions $1 \leq i \leq m$ and the time-step $1 \leq t \leq T$ separately for each vector dimension $1 \leq j \leq d$, and summing:

$$
\begin{aligned}
P_{i, 2j}^{t} &= \sin\left(i / 10000^{2j/d}\right) + \sin\left(t / 10000^{2j/d}\right) \\
P_{i, 2j+1}^{t} &= \cos\left(i / 10000^{2j/d}\right) + \cos\left(t / 10000^{2j/d}\right).
\end{aligned}
$$

or
$$
\boldsymbol{p}_{i}^{(n)}[j]=
\begin{cases}
\sin\left(i \cdot c^{\frac{j}{d}}\right) + \sin\left(n \cdot c^{\frac{j}{d}}\right) & \text{if } j \text{ is even} \\
\cos\left(i \cdot c^{\frac{j-1}{d}}\right) + \cos\left(n \cdot c^{\frac{j-1}{d}}\right) & \text{if } j \text{ is odd}
\end{cases}
$$
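These position-plus-time tables can be built by broadcasting and summing; the sketch below reuses the hypothetical `sinusoidal_position_encoding` helper from the previous section, and the shapes are arbitrary:

```python
def universal_transformer_pe(m: int, T: int, d: int):
    """P^t with shape (T, m, d): sinusoid over positions plus sinusoid over time-steps."""
    pos = sinusoidal_position_encoding(m, d)     # (m, d), table over positions i
    time = sinusoidal_position_encoding(T, d)    # (T, d), table over time-steps t
    return time[:, None, :] + pos[None, :, :]    # broadcast sum -> (T, m, d)

print(universal_transformer_pe(m=50, T=6, d=128).shape)  # torch.Size([6, 50, 128])
```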

3. Learnable Position Encodings
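In its simplest form, a learnable (absolute) position encoding is just a trainable lookup table of shape (max_len, d_model) that is added to the token embeddings. A minimal PyTorch sketch, with hypothetical module name and sizes:

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Trainable position table added to token embeddings."""
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.pos = nn.Embedding(max_len, d_model)

    def forward(self, token_emb: torch.Tensor) -> torch.Tensor:
        # token_emb: (batch, seq_len, d_model)
        positions = torch.arange(token_emb.size(1), device=token_emb.device)
        return token_emb + self.pos(positions)      # broadcast over the batch

x = torch.randn(2, 10, 64)
print(LearnedPositionalEmbedding(max_len=512, d_model=64)(x).shape)  # torch.Size([2, 10, 64])
```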

4. Learning to Encode Position for Transformer with Continuous Dynamical Model

$$
\boldsymbol{p}^{(n)}(t)=\boldsymbol{p}^{(n)}(s)+\int_{s}^{t} \boldsymbol{h}^{(n)}\left(\tau, \boldsymbol{p}^{(n)}(\tau) ; \boldsymbol{\theta}_{h}^{(n)}\right) \mathrm{d}\tau
$$
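A toy sketch of this idea, using a fixed-step Euler discretization in place of the ODE solver used in the paper; the dynamics network `h` below is a hypothetical stand-in, not the paper's architecture:

```python
import torch
import torch.nn as nn

class ContinuousPositionEncoder(nn.Module):
    """Euler integration of dp/dtau = h(tau, p; theta_h) to produce one encoding per position."""
    def __init__(self, d_model: int, hidden: int = 128):
        super().__init__()
        self.d_model = d_model
        self.h = nn.Sequential(nn.Linear(d_model + 1, hidden), nn.Tanh(),
                               nn.Linear(hidden, d_model))

    def forward(self, seq_len: int, dt: float = 0.25, steps_per_pos: int = 4):
        p = torch.zeros(1, self.d_model)                 # p(0)
        tau = 0.0
        out = []
        for _ in range(seq_len):
            for _ in range(steps_per_pos):               # integrate from s to t
                inp = torch.cat([torch.tensor([[tau]]), p], dim=-1)
                p = p + dt * self.h(inp)                 # Euler step
                tau += dt
            out.append(p)
        return torch.cat(out, dim=0)                     # (seq_len, d_model)

print(ContinuousPositionEncoder(d_model=64)(seq_len=10).shape)  # torch.Size([10, 64])
```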


Position Encodings in Papers with Code

1. Absolute Position Encodings

Absolute Position Encodings are a type of position embeddings for Transformer-based models where positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{\text{model}}$ as the embeddings, so that the two can be summed. In the original implementation, sine and cosine functions of different frequencies are used:
$$
\begin{gathered}
\mathrm{PE}(pos, 2i) = \sin\left(pos / 10000^{2i/d_{\text{model}}}\right) \\
\mathrm{PE}(pos, 2i+1) = \cos\left(pos / 10000^{2i/d_{\text{model}}}\right)
\end{gathered}
$$
where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. This function was chosen because the authors hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $\mathrm{PE}_{pos+k}$ can be represented as a linear function of $\mathrm{PE}_{pos}$.
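The fixed-offset property can be checked numerically: for each sinusoid frequency, the (sin, cos) pair at position $pos+k$ is a 2x2 rotation of the pair at position $pos$, so $\mathrm{PE}_{pos+k}$ is a linear function of $\mathrm{PE}_{pos}$. A small sketch, with arbitrary dimension, position, and offset:

```python
import torch

d_model, pos, k = 8, 5, 3
i = torch.arange(0, d_model, 2, dtype=torch.float32)
w = 1.0 / 10000 ** (i / d_model)                       # per-pair angular frequency

def pe(p):                                             # (sin, cos) pair at position p
    return torch.stack([torch.sin(p * w), torch.cos(p * w)], dim=-1)  # (d/2, 2)

# For each frequency, PE(pos + k) = R(k * w) @ PE(pos) with a 2x2 rotation matrix.
rot = torch.stack([torch.stack([torch.cos(k * w), torch.sin(k * w)], dim=-1),
                   torch.stack([-torch.sin(k * w), torch.cos(k * w)], dim=-1)], dim=-2)
lhs = pe(pos + k)
rhs = torch.einsum('fij,fj->fi', rot, pe(pos))
print(torch.allclose(lhs, rhs, atol=1e-6))             # True
```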


2. Relative Position Encodings

Relative Position Encodings are a type of position embeddings for Transformer-based models that attempts to exploit pairwise, relative positional information. Relative positional information is supplied to the model on two levels: values and keys. This becomes apparent in the two modified self-attention equations shown below. First, relative positional information is supplied to the model as an additional component to the keys
$$
e_{ij}=\frac{x_{i} W^{Q}\left(x_{j} W^{K}+a_{ij}^{K}\right)^{T}}{\sqrt{d_{z}}}
$$
Here $a$ is an edge representation for the inputs $x_{i}$ and $x_{j}$. The softmax operation remains unchanged from vanilla self-attention. Then relative positional information is supplied again as a sub-component of the values matrix:
$$
z_{i}=\sum_{j=1}^{n} \alpha_{ij}\left(x_{j} W^{V}+a_{ij}^{V}\right)
$$
In other words, instead of simply combining semantic embeddings with absolute positional ones, relative positional information is added to keys and values on the fly during attention calculation.
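A single-head sketch of these two equations, with relative distances clipped to a maximum value; the clipping distance, sizes, and module names below are illustrative choices rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSelfAttention(nn.Module):
    """Single-head self-attention with relative position terms on keys and values."""
    def __init__(self, d_model: int, d_z: int, max_rel: int = 8):
        super().__init__()
        self.wq = nn.Linear(d_model, d_z, bias=False)
        self.wk = nn.Linear(d_model, d_z, bias=False)
        self.wv = nn.Linear(d_model, d_z, bias=False)
        self.a_k = nn.Embedding(2 * max_rel + 1, d_z)    # a^K_{ij}, one vector per clipped distance
        self.a_v = nn.Embedding(2 * max_rel + 1, d_z)    # a^V_{ij}
        self.max_rel = max_rel
        self.d_z = d_z

    def forward(self, x):                                # x: (seq_len, d_model)
        n = x.size(0)
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        rel = torch.arange(n)[None, :] - torch.arange(n)[:, None]    # j - i
        rel = rel.clamp(-self.max_rel, self.max_rel) + self.max_rel  # bucket index
        ak, av = self.a_k(rel), self.a_v(rel)            # (n, n, d_z)
        # e_ij = q_i . (k_j + a^K_ij) / sqrt(d_z)
        e = (q @ k.T + torch.einsum('id,ijd->ij', q, ak)) / self.d_z ** 0.5
        alpha = F.softmax(e, dim=-1)
        # z_i = sum_j alpha_ij (v_j + a^V_ij)
        return alpha @ v + torch.einsum('ij,ijd->id', alpha, av)

x = torch.randn(10, 32)
print(RelativeSelfAttention(d_model=32, d_z=16)(x).shape)  # torch.Size([10, 16])
```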


3. Rotary Position Embedding

Rotary Position Embedding, or RoPE, is a type of position embedding which encodes absolute positional information with a rotation matrix and naturally incorporates explicit relative position dependency in the self-attention formulation. Notably, RoPE comes with valuable properties such as the flexibility to extend to any sequence length, decaying inter-token dependency with increasing relative distance, and the capability of equipping linear self-attention with relative position encoding.
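A minimal sketch of applying rotary embeddings to queries and keys, rotating consecutive channel pairs by position-dependent angles; the base frequency and shapes are assumptions:

```python
import torch

def rotary_embed(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply RoPE to queries or keys. x: (seq_len, d) with d even.
    Each consecutive channel pair is rotated by an angle m * theta_i."""
    seq_len, d = x.shape
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # (d/2,)
    m = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)         # positions m
    angles = m * theta                                                  # (seq_len, d/2)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                                     # channel pairs
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(10, 64)
k = torch.randn(10, 64)
scores = rotary_embed(q) @ rotary_embed(k).T
print(scores.shape)  # torch.Size([10, 10])
```

Because only the angle difference survives the dot product, the $(m, n)$ entry of `rotary_embed(q) @ rotary_embed(k).T` depends on the positions only through the relative offset $m - n$.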


4. Conditional Positional Encoding

Conditional Positional Encoding, or CPE, is a type of positional encoding for vision transformers. Unlike previous fixed or learnable positional encodings, which are predefined and independent of the input tokens, CPE is dynamically generated and conditioned on the local neighborhood of the input tokens. As a result, CPE aims to generalize to input sequences longer than those the model has seen during training. CPE also preserves the desired translation invariance for image classification. CPE can be implemented with a Position Encoding Generator (PEG) and incorporated into the current Transformer framework.
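A sketch of a PEG in this spirit, here as a 3x3 depthwise convolution over the 2D grid of patch tokens with a residual connection, which is one way to condition each position on its local neighborhood; the module name and sizes are illustrative:

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Position Encoding Generator sketch: depthwise conv over the token grid, added back."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Conv2d(d_model, d_model, kernel_size=3, padding=1,
                              groups=d_model)            # depthwise 3x3

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (batch, h*w, d_model) — patch tokens only, no class token
        b, n, d = tokens.shape
        feat = tokens.transpose(1, 2).reshape(b, d, h, w)
        return tokens + self.proj(feat).flatten(2).transpose(1, 2)

x = torch.randn(2, 14 * 14, 192)
print(PEG(d_model=192)(x, h=14, w=14).shape)  # torch.Size([2, 196, 192])
```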

5. Attention with Linear Biases

ALiBi, or Attention with Linear Biases, is an alternative to position embeddings aimed at inference-time extrapolation in Transformer models. When computing the attention scores for each head, ALiBi adds a static bias to each attention score $\mathbf{q}_{i} \cdot \mathbf{k}_{j}$ that is proportional to the distance between the query and key positions. As in the unmodified attention sublayer, the softmax function is then applied to these scores, and the rest of the computation is left unmodified. The slope $m$ is a head-specific scalar that is set before training and not learned. When using ALiBi, no positional embeddings are added at the bottom of the network.
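A sketch of the bias computation for a causal model, using the geometric slope sequence the ALiBi paper describes for power-of-two head counts; the head count, lengths, and dimensions below are arbitrary:

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear distance biases added to attention scores."""
    # Slopes 2^(-8/n), 2^(-16/n), ... for n heads (power-of-two head counts).
    slopes = torch.tensor([2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)])
    dist = torch.arange(seq_len)[None, :] - torch.arange(seq_len)[:, None]  # j - i
    bias = slopes[:, None, None] * dist[None, :, :]      # m * (j - i), <= 0 for j <= i
    return bias                                          # (num_heads, seq_len, seq_len)

q = torch.randn(8, 10, 64)                               # (heads, seq, head_dim)
k = torch.randn(8, 10, 64)
scores = q @ k.transpose(-2, -1) / 64 ** 0.5 + alibi_bias(8, 10)
causal = torch.triu(torch.ones(10, 10, dtype=torch.bool), diagonal=1)
attn = torch.softmax(scores.masked_fill(causal, float('-inf')), dim=-1)
print(attn.shape)  # torch.Size([8, 10, 10])
```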

References:

[1] Self-Attention and Positional Encoding

[2] How Positional Embeddings work in Self-Attention (code in Pytorch) | AI Summer (theaisummer.com)

[3] An Overview of Position Embeddings | Papers With Code
