Attention Mechanism

University of Waterloo CS480/680 Lecture 19: Attention and Transformer Networks


1 Attention Overview

1.1 Challenges of RNNs

  • Long-range dependencies are hard to capture
    A common remedy is to combine the RNN with some attention mechanism.
  • Vanishing and exploding gradients
  • Large number of training steps
    The RNN must be unrolled for as many steps as the length of the sequence, so it is an arbitrarily deep network. Because the layers are correlated with each other, training tends to require many steps and more time than convolutional networks.
  • Recurrence prevents parallel computation
    GPUs have become key to training large networks because they allow parallel computation, but in an RNN the computation has to be done sequentially.

1.2 Transformer Network

  • Facilitates long-range dependencies
    Attention allows us to draw connections between any two parts of the sequence, so long-range dependencies are as likely to be taken into account as short-range ones.
  • No vanishing or exploding gradients
    The transformer processes the entire sequence simultaneously rather than step by step, and in practice it does not need very many layers.
  • Fewer training steps
  • No recurrence, which facilitates parallel computation



2 Attention Mechanism

2.1 Basics

Mimics the retrieval of a value $v_i$ for a query $q$ based on a key $k_i$ in a database.

$$attention(q,k,v)=\sum_i similarity(q, k_i)\times v_i$$
The similarity returns a weight, and the output is simply the weighted combination of all the values in our database.
In an ordinary database lookup the similarity is effectively one-hot: the query matches exactly one key and that single value is returned. Attention replaces this hard match with a soft similarity, so we can backpropagate through it.


2.2 Neural architecture

[Figure: Attention mechanism]

The first layer:
Compute a similarity measure between the query $q$ and each key $k_i$:
$$s_i=f(q, k_i)$$
We could choose many functions for $f$ (a short sketch of these options follows the list):

  • Dot product: $q^T k_i$
  • Scaled dot product: $\frac{q^T k_i}{\sqrt d}$, where $d$ is the dimensionality of each key. This keeps the dot product at a consistent scale regardless of the dimensionality.
  • General dot product: $q^T W k_i$, which projects the query into a new space using a weight matrix $W$.
  • Additive similarity: $w^T_q q + w^T_k k_i$
  • Kernel methods.
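
A minimal NumPy sketch of these similarity options for a single query $q$ and a key matrix $K$ whose rows are the keys $k_i$; the matrix $W$ and vectors $w_q$, $w_k$ are random illustrative parameters, not values from the lecture.

```python
import numpy as np

d = 4
q = np.random.randn(d)                 # query q
K = np.random.randn(5, d)              # 5 keys k_i as rows

dot        = K @ q                     # q^T k_i
scaled_dot = K @ q / np.sqrt(d)        # q^T k_i / sqrt(d)

W          = np.random.randn(d, d)     # illustrative weight matrix
general    = K @ (W.T @ q)             # q^T W k_i for each key

w_q, w_k   = np.random.randn(d), np.random.randn(d)
additive   = w_q @ q + K @ w_k         # w_q^T q + w_k^T k_i
```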

The second layer:
Compute the attention weights by a softmax over the similarities (a fully connected layer with softmax activation):
$$a_i=\frac{\exp(s_i)}{\sum_j \exp(s_j)}$$

The third layer:
The attention value is just the weighted sum of the values:
$$AttentionValue = \sum_i a_i v_i$$
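
Putting the three layers together, here is a minimal NumPy sketch of single-query attention with the scaled dot-product similarity; the shapes are illustrative assumptions.

```python
import numpy as np

def attention(q, K, V):
    """Single-query attention: q has shape (d,), keys K (n, d), values V (n, d_v)."""
    s = K @ q / np.sqrt(K.shape[1])      # layer 1: similarities s_i
    a = np.exp(s) / np.exp(s).sum()      # layer 2: softmax weights a_i
    return a @ V                         # layer 3: sum_i a_i v_i

q = np.random.randn(4)
K = np.random.randn(5, 4)
V = np.random.randn(5, 3)
print(attention(q, K, V).shape)          # (3,)
```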





3 Transformer Network Structure

Encoder

  • Input: a sequence
  • Positional Encoding: since position matters in sequential data, we need to mark each distinct position.
  • Multi-head attention: compute the attention between each position and every other position.
    N×: the block is repeated N times. First we look at pairs, then pairs of pairs, and so on, until every word in the sentence is included. The result is a sequence of embeddings, one per position, each capturing both the original word and the other words in the sentence.
  • Add & Norm: add a residual connection around the multi-head attention and normalize the result.
  • Feed-forward network (a sketch of one encoder block follows this list).
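
As referenced above, a minimal sketch of one encoder block using standard PyTorch components (torch.nn.MultiheadAttention, LayerNorm, Linear). The sizes d_model=512, n_heads=8, d_ff=2048 follow the original Transformer paper and are assumptions here, not values fixed by these notes.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)     # multi-head self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)         # Add (residual) & Norm
        x = self.norm2(x + self.ffn(x))      # feed-forward, then Add & Norm
        return x

x = torch.randn(2, 10, 512)                  # 2 sentences, 10 positions each
print(EncoderBlock()(x).shape)               # torch.Size([2, 10, 512])
```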

Decoder

The decoder produces the output sequence.

  • Masked multi-head attention: mask the future words so that each position only attends to previous positions.

[Figure: Transformer model architecture]

3.1 Multi-head Attention

[Figure: Multi-head attention]

  1. Linear layers for V, K, Q: think of them as linear projections.
  2. Use three different linear projections (one each for Q, K, V).
    For each set of projections we compute a separate scaled dot-product attention; these can be seen as multiple feature maps.
  3. Concat layer: concatenate the different scaled dot-product attentions and apply a final linear layer; the result is the multi-head attention.
    Here $h$ is the number of heads in multi-head attention.

$$multihead(Q,K,V)=W^O\, concat(head_1,head_2,\dots,head_h)$$

$$head_i = attention(W_i^Q Q,\, W_i^K K,\, W_i^V V)$$

$$attention(Q,K,V)=softmax\left(\frac{Q^T K}{\sqrt{d_k}}\right)V$$
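
A minimal NumPy sketch of the formulas above. It uses the rows-as-positions convention, so Q @ K.T here plays the role of $Q^T K$ in the column convention of the formula; the sizes (n=5, d_model=8, h=2) and random projection matrices are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V          # scaled dot-product attention

def multihead(Q, K, V, params):
    heads = [attention(Q @ Wq, K @ Wk, V @ Wv)           # one head per projection triple
             for (Wq, Wk, Wv) in params["heads"]]
    return np.concatenate(heads, axis=-1) @ params["W_o"]  # W^O applied to the concat

n, d_model, h = 5, 8, 2
d_k = d_model // h
params = {
    "heads": [tuple(np.random.randn(d_model, d_k) for _ in range(3)) for _ in range(h)],
    "W_o": np.random.randn(h * d_k, d_model),
}
X = np.random.randn(n, d_model)                          # self-attention: Q = K = V = X
print(multihead(X, X, X, params).shape)                  # (5, 8)
```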


3.2 Masked Multi-head Attention

  • Multi-head attention where some values are masked, i.e., the probabilities of masked values are nullified to prevent them from being selected.
  • For example, when decoding, an output value should only depend on previous outputs (not future outputs), hence we mask future outputs.
  • Think of $M$ as another form of dropout: adding it before the softmax zeroes out the masked entries while the remaining weights still form a valid probability distribution. $M$ has $0$'s on and below the diagonal (the lower-triangular part) and $-\infty$'s on the upper-triangular part.
  • No recurrence: with teacher forcing there is no recurrence relation during training, so the computation can also be parallelized.

$$maskedAttention(Q,K,V)=softmax\left(\frac{Q^T K+M}{\sqrt{d_k}}\right)V$$
where $M$ is a mask matrix of $0$'s and $-\infty$'s.
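
A minimal NumPy sketch of the mask $M$ and of masked attention, under the same rows-as-positions convention as the earlier sketch; the shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V):
    n, d_k = Q.shape
    # M: 0 on and below the diagonal, -inf strictly above it (future positions)
    M = np.where(np.arange(n)[None, :] > np.arange(n)[:, None], -np.inf, 0.0)
    return softmax((Q @ K.T + M) / np.sqrt(d_k)) @ V

X = np.random.randn(4, 8)
out = masked_attention(X, X, X)      # position t only attends to positions <= t
```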


3.3 Layer Normalization

Normalize the values in each layer to have mean $0$ and variance $1$ (if $g=1$).

3.3.1 Steps

For each hidden unit $h_i$, compute $h_i=\frac{g}{\sigma}(h_i-\mu)$, where $g$ is a variable (gain), $\mu=\frac{1}{H}\sum^H_{i=1}h_i$ and $\sigma=\sqrt{\frac{1}{H}\sum^H_{i=1}(h_i-\mu)^2}$.
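
A minimal NumPy sketch of this normalization step for one vector of hidden units; the small eps added to $\sigma$ for numerical stability is an implementation detail, not part of the formula above.

```python
import numpy as np

def layer_norm(h, g=1.0, eps=1e-5):
    mu = h.mean()                               # mu = (1/H) sum_i h_i
    sigma = np.sqrt(((h - mu) ** 2).mean())     # sigma = sqrt((1/H) sum_i (h_i - mu)^2)
    return g / (sigma + eps) * (h - mu)

h = np.random.randn(6) * 3 + 2                  # arbitrary hidden-unit values
out = layer_norm(h)
print(out.mean().round(6), out.std().round(6))  # ~0.0, ~1.0 when g = 1
```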


3.3.2 Why

  • Layer normalization sits on top of each multi-head attention and feed-forward block, and it essentially reduces the number of gradient-descent steps needed to optimize the network. The weights of each layer are trained by gradient descent, and the gradient for a layer depends on the outputs of the layers below and above it.
  • The problem is that while the weights below and above are still being adjusted, the gradient we compute for the layer in the middle is not stable; ideally we would wait until those other layers had stabilized before optimizing it. Since all gradients are computed simultaneously, convergence is slow. There is no way to remove these inter-layer dependencies entirely, because doing so would mean breaking the network into separate parts.
  • Normalizing ensures that the output of each layer has the same mean and variance regardless of the weights.
  • This reduces 'covariate shift' (the gradient dependencies between layers), and therefore fewer training iterations are needed.
  • $g$ compensates for the fact that we have just normalized: if $g=1$, $h$ is always normalized to mean 0 and variance 1. Any gradient computation that depends on the output of that layer then sees outputs at the same scale, so the other gradients do not have to be adjusted simply because the scale changed. This reduces the dependency between layers.
  • This is closely related to batch normalization. The main difference is that layer normalization normalizes across the units of a single layer, whereas batch normalization normalizes each hidden unit across a batch of inputs, which only works well with large batches; layer normalization also works with small mini-batches.

3.4 Positional Encoding

  • Used in both encoder and decoder

  • A vector that differs depending on the position and has the same dimension as the word embedding; it is added to the embedding of the word.

  • Even dimensions use $\sin$, odd dimensions use $\cos$:

    $$PE_{position,2i}=\sin\left(position/10000^{\frac{2i}{d}}\right)$$

    $$PE_{position,2i+1}=\cos\left(position/10000^{\frac{2i}{d}}\right)$$

  • Simply adding it to the word embedding could interfere with the information in the original embedding, and concatenating the two might seem safer, but adding is what the authors chose and it works well in practice (a small sketch follows this list).
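
As mentioned above, a minimal NumPy sketch of the sinusoidal positional encoding; max_len and d are illustrative choices, and the result is added to the word embeddings.

```python
import numpy as np

def positional_encoding(max_len, d):
    PE = np.zeros((max_len, d))
    position = np.arange(max_len)[:, None]               # positions 0 .. max_len-1
    two_i = np.arange(0, d, 2)[None, :]                   # even dimensions 2i
    PE[:, 0::2] = np.sin(position / 10000 ** (two_i / d)) # PE[pos, 2i]
    PE[:, 1::2] = np.cos(position / 10000 ** (two_i / d)) # PE[pos, 2i+1]
    return PE

PE = positional_encoding(max_len=50, d=8)
# x = word_embeddings + PE[:seq_len]   # added to the word embeddings
```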


4 Comparison

  • Self-attention network: each layer has $n$ positions (a sentence of length $n$), each with an embedding of dimension $d$. Every position attends to every other position, so all pairs require $n^2$ computations, each of cost $d$, giving $O(n^2 \cdot d)$ per layer.
  • No sequential operations, because all $n$ words of the sentence are processed simultaneously, so the computation can be parallelized.
  • The maximum path length between any two positions is $1$, because every pair is connected directly in one step; no information is lost along the way as in an RNN.
| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
| --- | --- | --- | --- |
| Self-Attention | $O(n^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Recurrent | $O(n \cdot d^2)$ | $O(n)$ | $O(n)$ |
| Convolutional | $O(k \cdot n \cdot d^2)$ | $O(1)$ | $O(\log_k(n))$ |
| Self-Attention (restricted) | $O(r \cdot n \cdot d)$ | $O(1)$ | $O(\frac{n}{r})$ |

5 Results of attention

BLEU (BiLingual Evaluation Understudy) is a metric for automatically evaluating machine-translated text. The BLEU score is a number between zero and one that measures the similarity of the machine-translated text to a set of high quality reference translations.
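
For concreteness, a rough sketch of sentence-level BLEU with a single reference, using clipped n-gram precisions up to 4-grams and a brevity penalty. Real evaluations are corpus-level, use multiple references, and typically apply smoothing (e.g. via sacrebleu or nltk), so treat this only as an illustration of the definition.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())  # clipped counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))   # brevity penalty
    return brevity * geo_mean

cand = "the transformer outperforms previous models on translation".split()
ref = "the transformer outperforms all previous models on translation".split()
print(bleu(cand, ref))   # a score strictly between 0 and 1
```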

[Figure: BLEU score results]


6 Other state-of-the-art techniques

  • GPT & GPT-2
  • BERT