- These are notes on Hung-yi Lee's 2021 Machine Learning course.
Sophisticated Input
- Input is a set of vectors (e.g. a sequence of word embeddings, an audio signal, …)
What is the output?
- Every vector has a label. (Sequence Labeling) (e.g. predicting the part of speech of each input word, …)
- The whole sequence has a label. (e.g. sentiment analysis: given a review, decide whether its sentiment is positive or negative)
- The model decides the number of labels itself. (seq2seq) (e.g. translation)
Sequence Labeling
- Take part-of-speech (POS) tagging as an example. We could use a fully-connected network (FC) for the tagging, but there is an obvious problem: in the example sentence "I saw a saw", the same FC can never output different POS tags for the two occurrences of "saw".
Is it possible to consider the context?
- The FC can consider the neighbors. (Use a window to consider the context.)
How to consider the whole sequence?
- A window that covers the whole sequence?
- The input sequence has variable length; if we make the window as large as the longest possible sequence, the FC needs many more parameters, which increases computation and makes overfitting more likely.
- Solution: Self-attention!
Self-attention
- As shown in the figure below, self-attention first outputs one vector per input; each output vector (the vectors with black borders in the figure) takes the information of the whole sequence into account. These sequence-aware vectors are then fed into an FC to produce the final outputs.
- Self-attention and FC layers can also be used alternately.
- The best-known use of self-attention is the Transformer: Attention Is All You Need
Self-attention
Self-Attention: attention of a sequence to itself, i.e. computing the relation between each element of a sequence and every other element.
- The input of self-attention is a set of vectors $a^i$; they can be the original inputs or the outputs of some hidden layer. The output is a set of vectors $b^i$, where every $b^i$ takes all of the inputs $a^j\ (1 \leq j \leq 4)$ into account.
How is $b^1$ produced?
- Find the relevant vectors in a sequence: to produce the first output $b^1$, we first want to find all input vectors relevant to $a^1$; the relevance between any two vectors is expressed by $\alpha$ (the attention score).
How is the attention score $\alpha$ computed?
- Method 1 (assumed by default in the rest of these notes, and also the most common one): dot-product
- Method 2: additive
- …
Computing the attention score with the dot-product
A vector's relevance to itself is usually computed as well, i.e. after computing $k^1$, take its dot product with $q^1$ to obtain $\alpha_{1,1}$.
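A minimal NumPy sketch of the dot-product score for $a^1$ (the dimensions and random weights are made-up toy values, not the course's):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_k = 8, 4                      # toy dimensions (assumption)
a = rng.standard_normal((4, d))    # the four input vectors a^1..a^4

W_q = rng.standard_normal((d_k, d))
W_k = rng.standard_normal((d_k, d))

q1 = W_q @ a[0]        # query q^1 = W^q a^1
k = a @ W_k.T          # keys k^1..k^4, one per row
alpha_1 = k @ q1       # dot-product scores alpha_{1,1}..alpha_{1,4}
```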
- After computing the relevance between $a^1$ and every vector, a Softmax is applied:
$\alpha_{1,i}' = \exp(\alpha_{1,i}) / \sum_{j} \exp(\alpha_{1,j})$
Softmax is not mandatory here; other activation functions, e.g. ReLU, also work.
Extract information based on attention scores
- Take a weighted average based on the attention scores (the relevance between vectors) to extract the global information:
$b^1 = \sum_i \alpha_{1,i}' v^i$
- Note that $b^1 \sim b^4$ can all be produced in parallel.
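Continuing the toy sketch above: normalize the scores with softmax and take the weighted sum of the value vectors to get $b^1$:

```python
W_v = rng.standard_normal((d_k, d))
v = a @ W_v.T                  # values v^1..v^4, one per row

alpha_1p = np.exp(alpha_1)
alpha_1p /= alpha_1p.sum()     # alpha'_{1,i}: softmax over the scores
b1 = alpha_1p @ v              # b^1 = sum_i alpha'_{1,i} v^i
```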
Implementing self-attention with matrix operations
Computing $q, k, v$ (as matrix operations)
- Concatenate the input vectors into an input matrix; multiplying it by $W^q$ / $W^k$ / $W^v$ directly gives the corresponding $q$, $k$, $v$.
Computing the attention scores
Computing the output $b$
Summary
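Putting the three steps together, the whole layer is just a few matrix products. A self-contained sketch using the row-vector convention (one input vector per row); the $\sqrt{d_k}$ scaling is the Transformer paper's choice and is omitted in the lecture's formulation:

```python
import numpy as np

def self_attention(A, W_q, W_k, W_v):
    """A: (n, d) input matrix, one vector a^i per row; returns B, one b^i per row."""
    Q, K, V = A @ W_q, A @ W_k, A @ W_v        # all q, k, v at once
    scores = Q @ K.T / np.sqrt(K.shape[1])     # (n, n) matrix of attention scores
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)    # softmax over each row
    return attn @ V                            # each b^i is a weighted sum of the v^j

rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 4
A = rng.standard_normal((n, d))
W_q, W_k, W_v = rng.standard_normal((3, d, d_k))
B = self_attention(A, W_q, W_k, W_v)           # b^1..b^4, computed in parallel
```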
Multi-head Self-attention (Different types of relevance)
2 heads as an example
- Different $q$, $k$, $v$ are responsible for capturing different kinds of relevance.
$q^i = W^q a^i, \quad q^{i,1} = W^{q,1} q^i, \quad q^{i,2} = W^{q,2} q^i$
$k$ and $v$ are computed in the same way as $q$.
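A sketch of the two-head factorization above (toy dimensions; only the queries are shown, since $k$ and $v$ follow the same pattern):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 4
a = rng.standard_normal((n, d))

W_q = rng.standard_normal((d_k, d))
q = a @ W_q.T                        # q^i = W^q a^i

# one extra matrix per head: q^{i,1} = W^{q,1} q^i and q^{i,2} = W^{q,2} q^i
W_q1, W_q2 = rng.standard_normal((2, d_k, d_k))
q1, q2 = q @ W_q1.T, q @ W_q2.T      # the two heads' queries
# Each head then runs the single-head attention of the previous section on its
# own (q, k, v); the per-head outputs b^{i,1}, b^{i,2} are concatenated and
# usually multiplied by one more matrix to give the final b^i.
```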
Positional Encoding
- No position information in self-attention (the self-attention described so far ignores position completely: two vectors far apart in the sequence are treated no differently from adjacent ones)
- $\Rightarrow$ Each position has a unique positional vector $e^i$ (hand-crafted or learned from data), which is added to the corresponding $a^i$.
Visualizing $e^i$ (hand-crafted)
- Each column represents a positional vector $e^i$. (the positional vectors used in the paper "Attention Is All You Need")
There are many different ways to generate positional encodings.
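For instance, a sketch of the hand-crafted sinusoidal encoding from "Attention Is All You Need" (the visualization above plots exactly these vectors; the base 10000 is the paper's choice):

```python
import numpy as np

def sinusoidal_pe(n_positions, d):
    """Return the positional vectors e^i, one per row (d assumed even)."""
    pos = np.arange(n_positions)[:, None]     # position index i
    dim = np.arange(0, d, 2)[None, :]         # even feature dimensions
    angle = pos / 10000 ** (dim / d)
    e = np.zeros((n_positions, d))
    e[:, 0::2] = np.sin(angle)                # sine on even dimensions
    e[:, 1::2] = np.cos(angle)                # cosine on odd dimensions
    return e

a = np.zeros((4, 8))                # stand-in inputs a^i
a = a + sinusoidal_pe(4, 8)         # e^i is simply added to a^i
```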
Applications
Widely used in Natural Language Processing (NLP)!
- Transformer: Attention Is All You Need
- BERT: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Bert is a yellow Muppet character on the long-running PBS and HBO children's television show Sesame Street.
Self-attention for Speech
- Truncated Self-attention: Transformer-Transducer: End-to-End Speech Recognition with Self-Attention
- In speech recognition, the sequence is very long (speech is a very long vector sequence), which makes the attention matrix huge, so we can let each position attend to only a part of the input (a sketch of this masking idea follows below).
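A sketch of truncated self-attention: mask the score matrix so that each frame only attends within a fixed window around itself (the window size and the $-\infty$ masking are illustrative choices, not the paper's exact recipe):

```python
import numpy as np

def truncated_mask(n, window):
    """Boolean (n, n) mask: position i may only attend to |i - j| <= window."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

n = 6
scores = np.random.default_rng(0).standard_normal((n, n))
scores = np.where(truncated_mask(n, window=1), scores, -np.inf)
attn = np.exp(scores)
attn /= attn.sum(axis=1, keepdims=True)    # softmax sees only the window
```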
Self-attention for Image
- An image can also be considered as a vector set.
- e.g. Self-Attention GAN: Self-Attention Generative Adversarial Networks; DEtection Transformer (DETR)
- Non-local Neural Networks: for an overview, see the Zhihu article "计算机视觉中的 Non-local-Block 以及其他注意力机制"
Performance
Self-attention vs. CNN
- CNN: self-attention that can only attend within a receptive field
- Self-attention: a CNN with a learnable receptive field (self-attention considers the whole image rather than only a receptive field; in effect it learns a very complex receptive field by itself)
- CNN is simplified self-attention. Self-attention is the complex version of CNN.
- Recommended paper (2019.11): On the Relationship between Self-Attention and Convolutional Layers (a mathematically rigorous proof that a CNN is a special case of self-attention: with suitable parameters, self-attention can do exactly what a CNN does)
- Self-attention is more flexible and therefore overfits more easily (with a large amount of data, self-attention performs better)
- The experimental results below are from the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (the title comes from splitting an image into 16x16-pixel patches and treating each patch as a word)
Self-attention vs. RNN
RNNs can by now largely be replaced by self-attention: self-attention processes all positions in parallel and relates distant positions directly, whereas an RNN must consume the sequence step by step.
Figure: Recurrent Neural Network (RNN) vs. self-attention
Self-attention for Graph
- Consider the edges: only attend to connected nodes (when computing attention scores, consider only node pairs joined by an edge, i.e. compute only the attention scores for the blue cells in the figure below; a sketch follows)
- This is one type of Graph Neural Network (GNN)
Each node is a vector; such a graph could, for example, be a social network.
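The same masking trick as in truncated self-attention works here: the adjacency matrix decides which attention scores are computed at all (the graph below is a made-up toy example):

```python
import numpy as np

# made-up undirected graph over 4 nodes (self-loops kept on the diagonal)
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]], dtype=bool)

scores = np.random.default_rng(0).standard_normal((4, 4))
scores = np.where(adj, scores, -np.inf)    # only connected nodes attend
attn = np.exp(scores)
attn /= attn.sum(axis=1, keepdims=True)    # attention restricted to edges
```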
To learn more
- Long Range Arena: A Benchmark for Efficient Transformers: compares many variants of self-attention (reducing the huge computational cost of self-attention is a key direction for future work)
- In the figure below, the horizontal axis represents speed and the vertical axis represents performance.
Self-attention was first used in the Transformer, so "Transformer" often just refers to self-attention; the later variants of self-attention are likewise all named xx-former.
- Efficient Transformers: A Survey: introduces the many variants of self-attention