- These are notes on Hung-yi Lee's 2021 Machine Learning course.
Sophisticated Input
- Input is a set of vectors (e.g. a sequence of word embeddings, an audio signal, …)
What is the output?
- Every vector has a label. (Sequence Labeling) (e.g. predicting the part of speech of each input word, …)
- The whole sequence has a label. (e.g. sentiment analysis: given a review, decide whether its sentiment is positive or negative)
- The model decides the number of labels itself. (seq2seq) (e.g. translation)
Sequence Labeling
- Take part-of-speech (POS) tagging as an example. We could use a fully-connected network (FC) for the tagging, but there is an obvious problem: in the example sentence "I saw a saw", the same FC can never output different POS tags for the two occurrences of "saw".
Is it possible to consider the context?
- The FC can consider the neighbors. (Use a window to consider the context.)
How to consider the whole sequence?
- A window that covers the whole sequence?
- The input sequence has variable length; if we make the window as large as the longest possible sequence, the FC needs many more parameters, which increases computation and makes overfitting more likely.
- Solution: Self-attention!
Self-attention
- As shown in the figure below, self-attention first outputs one vector per input; each output vector (the vectors with black borders in the figure) takes the information of the whole sequence into account. These sequence-aware vectors are then fed into an FC to produce the final outputs.
- Self-attention and FC layers can also be used alternately.
- The best-known use of self-attention is the Transformer: Attention Is All You Need
Self-attention
Self-Attention: attention of a sequence to itself, i.e. computing the relation between each element of a sequence and every other element.
- The input of self-attention is a set of vectors $a^i$; they can be the original inputs or the outputs of some hidden layer. The output is a set of vectors $b^i$, where every $b^i$ takes all of the inputs $a^j\ (1 \leq j \leq 4)$ into account.
How is $b^1$ produced?
- Find the relevant vectors in a sequence: to produce the first output $b^1$, we first want to find all input vectors relevant to $a^1$; the relevance between any two vectors is expressed by $\alpha$ (the attention score).
How is the attention score $\alpha$ computed?
- Method 1 (assumed by default in the rest of these notes, and also the most common one): dot-product
- Method 2: additive
- …
Computing the attention score with the dot-product
A vector's relevance to itself is usually computed as well, i.e. after computing $k^1$, take its dot product with $q^1$ to obtain $\alpha_{1,1}$.
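A minimal NumPy sketch of the dot-product score for $a^1$ (the dimensions and random weights are made-up toy values, not the course's):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_k = 8, 4                      # toy dimensions (assumption)
a = rng.standard_normal((4, d))    # the four input vectors a^1..a^4

W_q = rng.standard_normal((d_k, d))
W_k = rng.standard_normal((d_k, d))

q1 = W_q @ a[0]        # query q^1 = W^q a^1
k = a @ W_k.T          # keys k^1..k^4, one per row
alpha_1 = k @ q1       # dot-product scores alpha_{1,1}..alpha_{1,4}
```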
- After computing the relevance between $a^1$ and every vector, a Softmax is applied:
$\alpha_{1,i}' = \exp(\alpha_{1,i}) / \sum_{j} \exp(\alpha_{1,j})$
Softmax is not mandatory here; other activation functions, e.g. ReLU, also work.
Extract information based on attention scores
- Take a weighted average based on the attention scores (the relevance between vectors) to extract the global information:
$b^1 = \sum_i \alpha_{1,i}' v^i$
- Note that $b^1 \sim b^4$ can all be produced in parallel.
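Continuing the toy sketch above: normalize the scores with softmax and take the weighted sum of the value vectors to get $b^1$:

```python
W_v = rng.standard_normal((d_k, d))
v = a @ W_v.T                  # values v^1..v^4, one per row

alpha_1p = np.exp(alpha_1)
alpha_1p /= alpha_1p.sum()     # alpha'_{1,i}: softmax over the scores
b1 = alpha_1p @ v              # b^1 = sum_i alpha'_{1,i} v^i
```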
Implementing self-attention with matrix operations
Computing $q, k, v$ (as matrix operations)
- Concatenate the input vectors into an input matrix; multiplying it by $W^q$ / $W^k$ / $W^v$ directly gives the corresponding $q$, $k$, $v$.
Computing the attention scores
Computing the output $b$
Summary
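Putting the three steps together, the whole layer is just a few matrix products. A self-contained sketch using the row-vector convention (one input vector per row); the $\sqrt{d_k}$ scaling is the Transformer paper's choice and is omitted in the lecture's formulation:

```python
import numpy as np

def self_attention(A, W_q, W_k, W_v):
    """A: (n, d) input matrix, one vector a^i per row; returns B, one b^i per row."""
    Q, K, V = A @ W_q, A @ W_k, A @ W_v        # all q, k, v at once
    scores = Q @ K.T / np.sqrt(K.shape[1])     # (n, n) matrix of attention scores
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)    # softmax over each row
    return attn @ V                            # each b^i is a weighted sum of the v^j

rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 4
A = rng.standard_normal((n, d))
W_q, W_k, W_v = rng.standard_normal((3, d, d_k))
B = self_attention(A, W_q, W_k, W_v)           # b^1..b^4, computed in parallel
```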
Multi-head Self-attention (Different types of relevance)
2 heads as an example
- Different $q$, $k$, $v$ are responsible for capturing different kinds of relevance.
$q^i = W^q a^i, \quad q^{i,1} = W^{q,1} q^i, \quad q^{i,2} = W^{q,2} q^i$
$k$ and $v$ are computed in the same way as $q$.
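A sketch of the two-head factorization above (toy dimensions; only the queries are shown, since $k$ and $v$ follow the same pattern):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 4
a = rng.standard_normal((n, d))

W_q = rng.standard_normal((d_k, d))
q = a @ W_q.T                        # q^i = W^q a^i

# one extra matrix per head: q^{i,1} = W^{q,1} q^i and q^{i,2} = W^{q,2} q^i
W_q1, W_q2 = rng.standard_normal((2, d_k, d_k))
q1, q2 = q @ W_q1.T, q @ W_q2.T      # the two heads' queries
# Each head then runs the single-head attention of the previous section on its
# own (q, k, v); the per-head outputs b^{i,1}, b^{i,2} are concatenated and
# usually multiplied by one more matrix to give the final b^i.
```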
Positional Encoding
- No position information in self-attention (the self-attention described so far ignores position completely: two vectors far apart in the sequence are treated no differently from adjacent ones)
- $\Rightarrow$ Each position has a unique positional vector $e^i$ (hand-crafted or learned from data), which is added to the corresponding $a^i$.
Visualizing $e^i$ (hand-crafted)
- Each column represents a positional vector $e^i$. (the positional vectors used in the paper "Attention Is All You Need")
There are many different ways to generate positional encodings.
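For instance, a sketch of the hand-crafted sinusoidal encoding from "Attention Is All You Need" (the visualization above plots exactly these vectors; the base 10000 is the paper's choice):

```python
import numpy as np

def sinusoidal_pe(n_positions, d):
    """Return the positional vectors e^i, one per row (d assumed even)."""
    pos = np.arange(n_positions)[:, None]     # position index i
    dim = np.arange(0, d, 2)[None, :]         # even feature dimensions
    angle = pos / 10000 ** (dim / d)
    e = np.zeros((n_positions, d))
    e[:, 0::2] = np.sin(angle)                # sine on even dimensions
    e[:, 1::2] = np.cos(angle)                # cosine on odd dimensions
    return e

a = np.zeros((4, 8))                # stand-in inputs a^i
a = a + sinusoidal_pe(4, 8)         # e^i is simply added to a^i
```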
Applications
Widely used in Natural Language Processing (NLP)!
- Transformer: Attention Is All You Need
- BERT: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Bert is a yellow Muppet character on the long-running PBS and HBO children's television show Sesame Street.
Self-attention for Speech
- Truncated Self-attention: Transformer-Transducer: End-to-End Speech Recognition with Self-Attention
- In speech recognition, the sequence is very long (speech is a very long vector sequence), which makes the attention matrix huge, so we can let each position attend to only a part of the input (a sketch of this masking idea follows below).
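A sketch of truncated self-attention: mask the score matrix so that each frame only attends within a fixed window around itself (the window size and the $-\infty$ masking are illustrative choices, not the paper's exact recipe):

```python
import numpy as np

def truncated_mask(n, window):
    """Boolean (n, n) mask: position i may only attend to |i - j| <= window."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

n = 6
scores = np.random.default_rng(0).standard_normal((n, n))
scores = np.where(truncated_mask(n, window=1), scores, -np.inf)
attn = np.exp(scores)
attn /= attn.sum(axis=1, keepdims=True)    # softmax sees only the window
```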
Self-attention for Image
- An image can also be considered as a vector set.
- e.g. Self-Attention GAN: Self-Attention Generative Adversarial Networks; DEtection Transformer (DETR)
- Non-local Neural Networks: for an overview, see the Zhihu article "计算机视觉中的 Non-local-Block 以及其他注意力机制"
Performance
Self-attention vs. CNN
- CNN: self-attention that can only attend within a receptive field
- Self-attention: a CNN with a learnable receptive field (self-attention considers the whole image rather than only a receptive field; in effect it learns a very complex receptive field by itself)
- CNN is simplified self-attention. Self-attention is the complex version of CNN.
- Recommended paper (2019.11): On the Relationship between Self-Attention and Convolutional Layers (a mathematically rigorous proof that a CNN is a special case of self-attention: with suitable parameters, self-attention can do exactly what a CNN does)
- Self-attention is more flexible and therefore overfits more easily (with a large amount of data, self-attention performs better)
- The experimental results below are from the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (the title comes from splitting an image into 16x16-pixel patches and treating each patch as a word)
Self-attention vs. RNN
RNNs can by now largely be replaced by self-attention: self-attention processes all positions in parallel and relates distant positions directly, whereas an RNN must consume the sequence step by step.
Figure: Recurrent Neural Network (RNN) vs. self-attention
Self-attention for Graph
- Consider the edges: only attend to connected nodes (when computing attention scores, consider only node pairs joined by an edge, i.e. compute only the attention scores for the blue cells in the figure below; a sketch follows)
- This is one type of Graph Neural Network (GNN)
Each node is a vector; such a graph could, for example, be a social network.
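The same masking trick as in truncated self-attention works here: the adjacency matrix decides which attention scores are computed at all (the graph below is a made-up toy example):

```python
import numpy as np

# made-up undirected graph over 4 nodes (self-loops kept on the diagonal)
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]], dtype=bool)

scores = np.random.default_rng(0).standard_normal((4, 4))
scores = np.where(adj, scores, -np.inf)    # only connected nodes attend
attn = np.exp(scores)
attn /= attn.sum(axis=1, keepdims=True)    # attention restricted to edges
```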
To learn more
- Long Range Arena: A Benchmark for Efficient Transformers: compares many variants of self-attention (reducing the huge computational cost of self-attention is a key direction for future work)
- In the figure below, the horizontal axis represents speed and the vertical axis represents performance.
Self-attention was first used in the Transformer, so "Transformer" often just refers to self-attention; the later variants of self-attention are likewise all named xx-former.
- Efficient Transformers: A Survey: introduces the many variants of self-attention