Attention Mechanism

University of Waterloo CS480/680 Lecture 19: Attention and Transformer Networks


1 Attention Overview

1.1 Challenges of RNNs

  • Long-range dependencies are hard to capture
    A common remedy is to combine the RNN with some attention mechanism.
  • Vanishing and exploding gradients
  • Large number of training steps
    The RNN must be unrolled for as many steps as the length of the sequence, so it is an arbitrarily deep network. Because the layers are correlated with each other, training tends to require many steps and more time than convolutional networks.
  • Recurrence prevents parallel computation
    GPUs have become key to training large networks because they allow parallel computation, but in an RNN the computation has to be done sequentially.

1.2 Transformer Network

  • Facilitates long-range dependencies
    Attention allows us to draw connections between any two parts of the sequence, so long-range dependencies are as likely to be taken into account as short-range ones.
  • No vanishing or exploding gradients
    The transformer processes the entire sequence simultaneously rather than step by step, and in practice it does not need very many layers.
  • Fewer training steps
  • No recurrence, which facilitates parallel computation



2 Attention Mechanism

2.1 Basics

Mimics the retrieval of a value $v_i$ for a query $q$ based on a key $k_i$ in a database.

$$attention(q,k,v)=\sum_i similarity(q, k_i)\times v_i$$
The similarity returns a weight, and the output is simply the weighted combination of all the values in our database.
In an ordinary database lookup the similarity is effectively one-hot: the query matches exactly one key and that single value is returned. Attention replaces this hard match with a soft similarity, so we can backpropagate through it.


2.2 Neural architecture

[Figure: Attention mechanism]

The first layer:
Compute a similarity measure between the query $q$ and each key $k_i$:
$$s_i=f(q, k_i)$$
We could choose many functions for $f$ (a short sketch of these options follows the list):

  • Dot product: $q^T k_i$
  • Scaled dot product: $\frac{q^T k_i}{\sqrt d}$, where $d$ is the dimensionality of each key. This keeps the dot product at a consistent scale regardless of the dimensionality.
  • General dot product: $q^T W k_i$, which projects the query into a new space using a weight matrix $W$.
  • Additive similarity: $w^T_q q + w^T_k k_i$
  • Kernel methods.
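
A minimal NumPy sketch of these similarity options for a single query $q$ and a key matrix $K$ whose rows are the keys $k_i$; the matrix $W$ and vectors $w_q$, $w_k$ are random illustrative parameters, not values from the lecture.

```python
import numpy as np

d = 4
q = np.random.randn(d)                 # query q
K = np.random.randn(5, d)              # 5 keys k_i as rows

dot        = K @ q                     # q^T k_i
scaled_dot = K @ q / np.sqrt(d)        # q^T k_i / sqrt(d)

W          = np.random.randn(d, d)     # illustrative weight matrix
general    = K @ (W.T @ q)             # q^T W k_i for each key

w_q, w_k   = np.random.randn(d), np.random.randn(d)
additive   = w_q @ q + K @ w_k         # w_q^T q + w_k^T k_i
```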

The second layer:
Compute the attention weights by a softmax over the similarities (a fully connected layer with softmax activation):
$$a_i=\frac{\exp(s_i)}{\sum_j \exp(s_j)}$$

The third layer:
The attention value is just the weighted sum of the values:
$$AttentionValue = \sum_i a_i v_i$$
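
Putting the three layers together, here is a minimal NumPy sketch of single-query attention with the scaled dot-product similarity; the shapes are illustrative assumptions.

```python
import numpy as np

def attention(q, K, V):
    """Single-query attention: q has shape (d,), keys K (n, d), values V (n, d_v)."""
    s = K @ q / np.sqrt(K.shape[1])      # layer 1: similarities s_i
    a = np.exp(s) / np.exp(s).sum()      # layer 2: softmax weights a_i
    return a @ V                         # layer 3: sum_i a_i v_i

q = np.random.randn(4)
K = np.random.randn(5, 4)
V = np.random.randn(5, 3)
print(attention(q, K, V).shape)          # (3,)
```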





3 Transformer Network Structure

Encoder

  • Input: a sequence
  • Positional Encoding: since position matters in sequential data, we need to mark each distinct position.
  • Multi-head attention: compute the attention between each position and every other position.
    N×: the block is repeated N times. First we look at pairs, then pairs of pairs, and so on, until every word in the sentence is included. The result is a sequence of embeddings, one per position, each capturing both the original word and the other words in the sentence.
  • Add & Norm: add a residual connection around the multi-head attention and normalize the result.
  • Feed-forward network (a sketch of one encoder block follows this list).
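
As referenced above, a minimal sketch of one encoder block using standard PyTorch components (torch.nn.MultiheadAttention, LayerNorm, Linear). The sizes d_model=512, n_heads=8, d_ff=2048 follow the original Transformer paper and are assumptions here, not values fixed by these notes.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)     # multi-head self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)         # Add (residual) & Norm
        x = self.norm2(x + self.ffn(x))      # feed-forward, then Add & Norm
        return x

x = torch.randn(2, 10, 512)                  # 2 sentences, 10 positions each
print(EncoderBlock()(x).shape)               # torch.Size([2, 10, 512])
```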

Decoder

The decoder produces the output sequence.

  • Masked multi-head attention: mask the future words so that each position only attends to previous positions.

[Figure: Transformer model architecture]

3.1 Multi-head Attention

[Figure: Multi-head attention]

  1. Linear layers for V, K, Q: think of them as linear projections.
  2. Use three different linear projections (one each for Q, K, V).
    For each set of projections we compute a separate scaled dot-product attention; these can be seen as multiple feature maps.
  3. Concat layer: concatenate the different scaled dot-product attentions and apply a final linear layer; the result is the multi-head attention.
    Here $h$ is the number of heads in multi-head attention.

$$multihead(Q,K,V)=W^O\, concat(head_1,head_2,\dots,head_h)$$

$$head_i = attention(W_i^Q Q,\, W_i^K K,\, W_i^V V)$$

$$attention(Q,K,V)=softmax\left(\frac{Q^T K}{\sqrt{d_k}}\right)V$$
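
A minimal NumPy sketch of the formulas above. It uses the rows-as-positions convention, so Q @ K.T here plays the role of $Q^T K$ in the column convention of the formula; the sizes (n=5, d_model=8, h=2) and random projection matrices are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V          # scaled dot-product attention

def multihead(Q, K, V, params):
    heads = [attention(Q @ Wq, K @ Wk, V @ Wv)           # one head per projection triple
             for (Wq, Wk, Wv) in params["heads"]]
    return np.concatenate(heads, axis=-1) @ params["W_o"]  # W^O applied to the concat

n, d_model, h = 5, 8, 2
d_k = d_model // h
params = {
    "heads": [tuple(np.random.randn(d_model, d_k) for _ in range(3)) for _ in range(h)],
    "W_o": np.random.randn(h * d_k, d_model),
}
X = np.random.randn(n, d_model)                          # self-attention: Q = K = V = X
print(multihead(X, X, X, params).shape)                  # (5, 8)
```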


3.2 Masked Multi-head Attention

  • Multi-head attention where some values are masked, i.e., the probabilities of masked values are nullified to prevent them from being selected.
  • For example, when decoding, an output value should only depend on previous outputs (not future outputs), hence we mask future outputs.
  • Think of $M$ as another form of dropout: adding it before the softmax zeroes out the masked entries while the remaining weights still form a valid probability distribution. $M$ has $0$'s on and below the diagonal (the lower-triangular part) and $-\infty$'s on the upper-triangular part.
  • No recurrence: with teacher forcing there is no recurrence relation during training, so the computation can also be parallelized.

$$maskedAttention(Q,K,V)=softmax\left(\frac{Q^T K+M}{\sqrt{d_k}}\right)V$$
where $M$ is a mask matrix of $0$'s and $-\infty$'s.
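
A minimal NumPy sketch of the mask $M$ and of masked attention, under the same rows-as-positions convention as the earlier sketch; the shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V):
    n, d_k = Q.shape
    # M: 0 on and below the diagonal, -inf strictly above it (future positions)
    M = np.where(np.arange(n)[None, :] > np.arange(n)[:, None], -np.inf, 0.0)
    return softmax((Q @ K.T + M) / np.sqrt(d_k)) @ V

X = np.random.randn(4, 8)
out = masked_attention(X, X, X)      # position t only attends to positions <= t
```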


3.3 Layer Normalization

Normalize the values in each layer to have mean $0$ and variance $1$ (if $g=1$).

3.3.1 Steps

For each hidden unit $h_i$, compute $h_i=\frac{g}{\sigma}(h_i-\mu)$, where $g$ is a variable (gain), $\mu=\frac{1}{H}\sum^H_{i=1}h_i$ and $\sigma=\sqrt{\frac{1}{H}\sum^H_{i=1}(h_i-\mu)^2}$.
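
A minimal NumPy sketch of this normalization step for one vector of hidden units; the small eps added to $\sigma$ for numerical stability is an implementation detail, not part of the formula above.

```python
import numpy as np

def layer_norm(h, g=1.0, eps=1e-5):
    mu = h.mean()                               # mu = (1/H) sum_i h_i
    sigma = np.sqrt(((h - mu) ** 2).mean())     # sigma = sqrt((1/H) sum_i (h_i - mu)^2)
    return g / (sigma + eps) * (h - mu)

h = np.random.randn(6) * 3 + 2                  # arbitrary hidden-unit values
out = layer_norm(h)
print(out.mean().round(6), out.std().round(6))  # ~0.0, ~1.0 when g = 1
```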


3.3.2 Why

  • Layer normalization sits on top of each multi-head attention and feed-forward block, and it essentially reduces the number of gradient-descent steps needed to optimize the network. The weights of each layer are trained by gradient descent, and the gradient for a layer depends on the outputs of the layers below and above it.
  • The problem is that while the weights below and above are still being adjusted, the gradient we compute for the layer in the middle is not stable; ideally we would wait until those other layers had stabilized before optimizing it. Since all gradients are computed simultaneously, convergence is slow. There is no way to remove these inter-layer dependencies entirely, because doing so would mean breaking the network into separate parts.
  • Normalizing ensures that the output of each layer has the same mean and variance regardless of the weights.
  • This reduces 'covariate shift' (the gradient dependencies between layers), and therefore fewer training iterations are needed.
  • $g$ compensates for the fact that we have just normalized: if $g=1$, $h$ is always normalized to mean 0 and variance 1. Any gradient computation that depends on the output of that layer then sees outputs at the same scale, so the other gradients do not have to be adjusted simply because the scale changed. This reduces the dependency between layers.
  • This is closely related to batch normalization. The main difference is that layer normalization normalizes across the units of a single layer, whereas batch normalization normalizes each hidden unit across a batch of inputs, which only works well with large batches; layer normalization also works with small mini-batches.

3.4 Positional Encoding

  • Used in both encoder and decoder

  • A vector that differs depending on the position and has the same dimension as the word embedding; it is added to the embedding of the word.

  • Even dimensions use $\sin$, odd dimensions use $\cos$:

    $$PE_{position,2i}=\sin\left(position/10000^{\frac{2i}{d}}\right)$$

    $$PE_{position,2i+1}=\cos\left(position/10000^{\frac{2i}{d}}\right)$$

  • Simply adding it to the word embedding could interfere with the information in the original embedding, and concatenating the two might seem safer, but adding is what the authors chose and it works well in practice (a small sketch follows this list).
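
As mentioned above, a minimal NumPy sketch of the sinusoidal positional encoding; max_len and d are illustrative choices, and the result is added to the word embeddings.

```python
import numpy as np

def positional_encoding(max_len, d):
    PE = np.zeros((max_len, d))
    position = np.arange(max_len)[:, None]               # positions 0 .. max_len-1
    two_i = np.arange(0, d, 2)[None, :]                   # even dimensions 2i
    PE[:, 0::2] = np.sin(position / 10000 ** (two_i / d)) # PE[pos, 2i]
    PE[:, 1::2] = np.cos(position / 10000 ** (two_i / d)) # PE[pos, 2i+1]
    return PE

PE = positional_encoding(max_len=50, d=8)
# x = word_embeddings + PE[:seq_len]   # added to the word embeddings
```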


4 Comparison

  • Self-attention network: each layer has $n$ positions (a sentence of length $n$), each with an embedding of dimension $d$. Every position attends to every other position, so all pairs require $n^2$ computations, each of cost $d$, giving $O(n^2 \cdot d)$ per layer.
  • No sequential operations, because all $n$ words of the sentence are processed simultaneously, so the computation can be parallelized.
  • The maximum path length between any two positions is $1$, because every pair is connected directly in one step; no information is lost along the way as in an RNN.
| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
| --- | --- | --- | --- |
| Self-Attention | $O(n^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Recurrent | $O(n \cdot d^2)$ | $O(n)$ | $O(n)$ |
| Convolutional | $O(k \cdot n \cdot d^2)$ | $O(1)$ | $O(\log_k(n))$ |
| Self-Attention (restricted) | $O(r \cdot n \cdot d)$ | $O(1)$ | $O(\frac{n}{r})$ |

5 Results of attention

BLEU (BiLingual Evaluation Understudy) is a metric for automatically evaluating machine-translated text. The BLEU score is a number between zero and one that measures the similarity of the machine-translated text to a set of high quality reference translations.
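
For concreteness, a rough sketch of sentence-level BLEU with a single reference, using clipped n-gram precisions up to 4-grams and a brevity penalty. Real evaluations are corpus-level, use multiple references, and typically apply smoothing (e.g. via sacrebleu or nltk), so treat this only as an illustration of the definition.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())  # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))   # brevity penalty
    return brevity * geo_mean

cand = "the transformer outperforms previous models on translation".split()
ref = "the transformer outperforms all previous models on translation".split()
print(bleu(cand, ref))   # a score strictly between 0 and 1
```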

[Figure: BLEU score results]


6 Other state-of-the-art techniques

  • GPT & GPT-2
  • BERT