Transformer in Numbers: Self-Attention and a Summary of the Transformer Formulas

Original post · last modified 2024-02-15 14:55:42


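Copyright notice: this is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Tags: #transformer #deep-learning #artificial-intelligence

First published 2024-02-08 20:17:33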

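Never underestimate the transformer from any angle. It is important enough to say four times:

Never underestimate the transformer from any angle

Never underestimate the transformer from any angle

Never underestimate the transformer from any angle

"Attention Is All You Need" as a whole is a work of uncanny craftsmanship, a true stroke of genius.

It is more masterful than the later work that approaches the topic from other angles, such as BERT, GPT x.y, and flash attention.

This post follows the paper "Attention Is All You Need" and focuses on the mathematical formulation of the algorithm.

Let the sentence length be n, for example n=1024 or n=2048; that is, a sentence may contain at most 1024 words, or at most 2048 words.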

1. Positional encoding

$\mathbf{E}=[e_1\,e_2\,\cdots\,e_{n}]$

$e_{pos}(2i) = PE(pos, 2i) = \sin(pos/10000^{2i/d_{model}})$

$e_{pos}(2i+1) = PE(pos, 2i+1) = \cos(pos/10000^{2i/d_{model}})$

where $pos \in \{1,2,\cdots,n\}$, $n=1024=max\_sentence\_length$, $i \in [0, d_{model}/2-1]$ (so that the row indices $2i$ and $2i+1$ together cover all $d_{model}$ rows), and $d_{model}=512=word\_embedding\_dimension$.

So $\mathbf{E}$ is a $512 \times 1024$ matrix made of n column vectors; the column with index $pos$, $\mathbf{E}(:,pos)$, is the positional-encoding vector for position $pos$.
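The construction of $\mathbf{E}$ can be sketched in NumPy as follows. This is only a minimal illustration: the function name `positional_encoding` is made up for this example, and positions start at 1 as in the formulas above (many implementations count from 0 instead).

```python
import numpy as np

def positional_encoding(n: int = 1024, d_model: int = 512) -> np.ndarray:
    """Return E with shape (d_model, n); column pos-1 encodes position pos."""
    pos = np.arange(1, n + 1)        # positions 1..n, following the text
    i = np.arange(d_model // 2)      # pair index i = 0 .. d_model/2 - 1
    # angle[i, pos-1] = pos / 10000^(2i / d_model)
    angle = pos[None, :] / np.power(10000.0, 2 * i[:, None] / d_model)
    E = np.empty((d_model, n))
    E[0::2, :] = np.sin(angle)       # even rows 2i
    E[1::2, :] = np.cos(angle)       # odd rows 2i+1
    return E

E = positional_encoding()
print(E.shape)  # (512, 1024)
```

Every entry lies in [-1, 1], and the first column starts with sin(1), cos(1) because pos = 1 and i = 0 give an angle of 1.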

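2. Input vectors

Suppose the word-embedding vector of the first word in the sentence is $x_1$, that of the second word is $x_2$, and so on, up to at most $x_n$.

If the sentence has fewer than n words, the positions with no corresponding word get $x_i = \mathbf{0}$.

Let $X=(x_1\,x_2\,\cdots\,x_n)$ be the word-embedding matrix of the sentence. To make each word carry position information, the positional-encoding vector is added directly to each word's embedding vector: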

$x_i = x_i + e_i$

In matrix form:

$\mathbf{X=X+E}$

$\mathbf{X}=(x_1+e_1 \,\,x_2+e_2\,\,\cdots\,\,x_n+e_n)$

This is the input to the self-attention module of the first layer.
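A small sketch of this step, with toy sizes instead of 512 and 1024 so the arrays stay readable. The 4-word sentence and the random embeddings are made up for illustration; only the padding with zero columns and the addition $\mathbf{X}+\mathbf{E}$ come from the text.

```python
import numpy as np

d_model, n = 8, 6                        # toy sizes (the text uses 512 and 1024)
rng = np.random.default_rng(0)

# Hypothetical embeddings of a 4-word sentence, one column per word.
words = rng.normal(size=(d_model, 4))
X = np.zeros((d_model, n))
X[:, :words.shape[1]] = words            # remaining columns stay 0 (no word there)

# Positional encodings, same formula as in section 1 (positions 1..n).
pos = np.arange(1, n + 1)
i = np.arange(d_model // 2)
angle = pos[None, :] / 10000 ** (2 * i[:, None] / d_model)
E = np.zeros((d_model, n))
E[0::2, :] = np.sin(angle)
E[1::2, :] = np.cos(angle)

X_in = X + E                             # column i is x_i + e_i
```

Note that after the addition a padded column is no longer zero; it equals $e_i$. Practical implementations usually keep a separate padding mask for that reason, which this post does not cover.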

3. The complete computation of one encoder layer

$\mathbf{X}=(x_1\,\,x_2\,\, \cdots\,\,x_n)$

$[q_1\,q_2\cdots\,q_n] = Q = W_qX=W_q[x_1\,\,x_2\,\,\cdots\,\,x_n]$

$[k_1\,k_2\,\cdots\,k_n]=K=W_kX=W_k[x_1\,\,x_2\,\,\cdots\,\,x_n]$

$[v_1\,v_2\,\cdots\,v_n]=V=W_vX=W_v[x_1\,\,x_2\,\,\cdots\,\,x_n]$

$\left[ \begin{array}{cccc} a_{1,1} & a_{2,1} & \cdots &a_{n,1}\\ a_{1,2} & a_{2,2} & \cdots &a_{n,2}\\ \vdots & \vdots & \ddots & \vdots\\ a_{1,n} & a_{2,n} & \cdots &a_{n,n}\\ \end{array} \right] = A =K^TQ= \left[ \begin{array}{c} k_1^T\\ k_2^T\\ \vdots\\ k_n^T\\ \end{array} \right] [q_1\,q_2\, \cdots \,q_n]$

$\left[ \begin{array}{cccc} a_{1,1}^{'} & a_{2,1}^{'} & \cdots &a_{n,1}^{'}\\ a_{1,2}^{'} & a_{2,2}^{'} & \cdots &a_{n,2}^{'}\\ \vdots & \vdots & \ddots & \vdots\\ a_{1,n}^{'} & a_{2,n}^{'} & \cdots &a_{n,n}^{'}\\ \end{array} \right] = A^{'} = \mathbf{softmax}_{column}\left(\frac{\mathbf{A}}{\sqrt{d_k}}\right)$

Here $d_k$ is the number of rows of $Q$ and $K$. The paper divides the scores by $\sqrt{d_k}$ before the softmax; the softmax is taken over each column separately, so every column of $A^{'}$ sums to 1.

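$\mathbf{Y}_1=\mathbf{V}\mathbf{A}^{'}=[v_1\,v_2\,\cdots\,v_n]\mathbf{A}^{'}$

Suppose $\mathbf{Y}_1$ is the output matrix produced by $\mathbf{Head}_1$ of the multi-head attention. In the paper each of the 8 heads has its own $W_q, W_k, W_v$ of size $64 \times 512$ (since $d_k = d_v = d_{model}/8 = 64$), so each head output $\mathbf{Y}_h$ is $64 \times n$.

$\mathbf{Y} = W_o \left[ \begin{array}{c} \mathbf{Y}_1\\ \mathbf{Y}_2\\ \vdots\\ \mathbf{Y}_8\\ \end{array} \right]$

The outputs of the 8 heads are concatenated along the row (feature) dimension, which gives a $512 \times n$ matrix, and this is multiplied by the output projection $W_o$ ($512 \times 512$) to obtain $\mathbf{Y}$. The residual connection and the normalization are applied after this step, once per sub-layer rather than once per head:

$\mathbf{Y}=\mathbf{Y}+\mathbf{X}$

$\mathbf{Y}=Norm(\mathbf{Y})$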
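The attention step can be sketched in NumPy using the same column convention (one column per word). This is an illustrative sketch, not a reference implementation: the function names are made up, the weights are random, and it follows the paper's formulation, in which the scores are divided by $\sqrt{d_k}$, each head works in $d_k = d_{model}/h$ dimensions, and the head outputs are stacked along the feature dimension and passed through an output projection `W_o`.

```python
import numpy as np

def softmax_columns(A: np.ndarray) -> np.ndarray:
    """Softmax applied independently to each column of A."""
    A = A - A.max(axis=0, keepdims=True)   # subtract column max for numerical stability
    expA = np.exp(A)
    return expA / expA.sum(axis=0, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    """One head: X is (d_model, n); W_q, W_k, W_v are (d_k, d_model). Returns (d_k, n)."""
    Q, K, V = W_q @ X, W_k @ X, W_v @ X
    d_k = Q.shape[0]
    A = K.T @ Q / np.sqrt(d_k)             # (n, n); column j holds the scores for query q_j
    return V @ softmax_columns(A)          # column j is a weighted sum of the v_i

def multi_head(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) triples; W_o is (d_model, h * d_k)."""
    Y = np.concatenate([attention_head(X, *w) for w in heads], axis=0)
    return W_o @ Y                         # back to (d_model, n)

rng = np.random.default_rng(0)
d_model, n, h = 16, 5, 4                   # toy sizes; the paper uses 512 and 8 heads
d_k = d_model // h
X = rng.normal(size=(d_model, n))
heads = [tuple(rng.normal(size=(d_k, d_model)) for _ in range(3)) for _ in range(h)]
W_o = rng.normal(size=(d_model, h * d_k))
Y = multi_head(X, heads, W_o)
print(Y.shape)  # (16, 5)
```

The output has the same shape as the input, which is what makes the residual addition $\mathbf{Y}+\mathbf{X}$ well defined.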

Next comes this layer's feed-forward neural network (FFN):

$\mathbf{Z}=\mathbf{FFN}(\mathbf{Y})$

$\mathbf{Z} = \mathbf{Z}+\mathbf{Y}$

$\mathbf{Z}=Norm(\mathbf{Z})$

$\mathbf{Z}$ is then fed into the next encoder layer, which repeats the same computation; only the weights of $\mathbf{W_q, W_k, W_v, FFN}$ differ.

4. Details of the Norm() operation

Each layer contains two normalize() operations:

$\mathbf{Y}=Norm(\mathbf{Y})$

$\mathbf{Z}=Norm(\mathbf{Z})$

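Here the input and output $\mathbf{Y, Z}$ are both matrices whose number of rows is the word-embedding dimension $d_{model}=512$.

The number of columns of Y is the maximum sentence length, max_sentence_length.

The number of columns of Z is also max_sentence_length: the 8 heads are concatenated along the row (feature) dimension and the FFN acts on each column separately, so neither step changes the number of columns.

In either case, the normalize operation acts on each column of Y and Z separately, rescaling that column to zero mean and unit variance. This is commonly reported to speed up training and help the model converge. (Layer normalization as usually implemented also applies a learned gain and bias afterwards; they are omitted here.)

For $\mathbf{Y, Z}$ the computation is as follows.

Let the matrix to be normalized by Norm() be denoted abstractly by $\mathbf{X}$.

step1, treating each column of the matrix separately, compute the mean of its elements: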

$u_j=\frac{1}{m} \mathop{\sum} ^{m}_{i=1}x_{ij}\\ \\where \,\, m=d_{model} = 512\\ j =1,2,\cdots ,max\_sentence\_length=1024$

step2, again column by column, compute the variance of each column:

$\sigma ^2_j=\frac{1}{m}\sum^m_{i=1}(x_{ij}-u_j)^2\\\\ where\,\,m=d_{model} = 512 \\j=1,2,\cdots ,max\_sentence\_length=1024$

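step3, normalize each element: subtract the column mean, then divide by the column standard deviation (the square root of the variance):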

$x_{ij}=\frac{x_{ij}-u_j}{\sqrt{\sigma^2_j + \epsilon}}$

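The $\epsilon$ in the denominator is there only to handle the very rare case $\sigma^2_j = 0$, which would otherwise make the denominator 0.

Summary:

The overall effect of the 3 steps above is:

$Norm(\mathbf{X})_{ij}=\frac{x_{ij} - u_j}{\sqrt{\sigma^2_j+\epsilon}}$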
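The three steps map directly onto a few NumPy lines. A minimal sketch: the name `norm_columns` and the value `eps=1e-5` are assumptions made for this example, and the learned gain and bias of a full layer normalization are left out.

```python
import numpy as np

def norm_columns(X: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each column of X (shape (m, n)) to zero mean and unit variance."""
    u = X.mean(axis=0, keepdims=True)                  # step 1: per-column mean u_j
    var = ((X - u) ** 2).mean(axis=0, keepdims=True)   # step 2: per-column variance
    return (X - u) / np.sqrt(var + eps)                # step 3: subtract mean, divide by std

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(512, 4))      # 4 columns with mean near 3, std near 2
Z = norm_columns(X)
```

After the call every column of `Z` has mean 0 and variance very close to 1 (slightly below 1 because of `eps`).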

5. Details of the FFN computation
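This section is not filled in yet in the post. As a placeholder sketch taken from the paper rather than from this post: the paper defines the position-wise feed-forward network as $FFN(x)=\max(0, xW_1+b_1)W_2+b_2$ with an inner dimension $d_{ff}=2048$. In the column convention used here this reads $\mathbf{Z}=W_2\max(0, W_1\mathbf{Y}+b_1)+b_2$. The function name `ffn`, the toy sizes, and the random weights below are assumptions for illustration only.

```python
import numpy as np

def ffn(Y, W1, b1, W2, b2):
    """Position-wise FFN: every column of Y is transformed independently."""
    H = np.maximum(0.0, W1 @ Y + b1)   # (d_ff, n), ReLU
    return W2 @ H + b2                 # (d_model, n)

rng = np.random.default_rng(0)
d_model, d_ff, n = 8, 32, 5            # toy sizes; the paper uses d_model=512, d_ff=2048
Y = rng.normal(size=(d_model, n))
W1, b1 = rng.normal(size=(d_ff, d_model)), np.zeros((d_ff, 1))
W2, b2 = rng.normal(size=(d_model, d_ff)), np.zeros((d_model, 1))
Z = ffn(Y, W1, b1, W2, b2)
```

Because the same weights are applied to every column, running the FFN on a single column gives the same result as the matching column of the full output, and the number of columns is unchanged.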

6. Further reading

Original paper:

Attention Is All You Need: https://arxiv.org/abs/1706.03762

The Illustrated Transformer, Jay Alammar: http://jalammar.github.io/illustrated-transformer/

图解Transformer（完整版）(The Illustrated Transformer, complete edition): https://mp.weixin.qq.com/s?__biz=MzI4MDYzNzg4Mw==&mid=2247515317&idx=3&sn=d06f49715290c8f8c56144031d1e60b3&chksm=ebb78461dcc00d77b57d12d4ec9388054ffa0e06fa1b2454e9c7f4b785f114983fe4708ecf0a&scene=27

自然语言处理Transformer模型最详细讲解（图解版）(a detailed illustrated walkthrough of the Transformer for NLP, CSDN blog): https://blog.csdn.net/m0_47256162/article/details/127339899

Transformer详解 (Transformer explained in detail), mathor: https://wmathor.com/index.php/archives/1438/

Transformers from scratch, peterbloem.nl: https://peterbloem.nl/blog/transformers

To be continued ...
