Transformer Summary

Latest recommended article published on 2024-06-02 15:17:56

iTensor


Article link: https://blog.csdn.net/wshixinshouaaa/article/details/100501602


Model Architecture

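1. Encoder

The Encoder is a stack of six identical layers. Each layer contains two sub-layers: Multi-Head Attention and Feed Forward. Each sub-layer is wrapped in a residual connection and followed by Layer Normalization.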
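The "residual connection followed by Layer Normalization" pattern can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names (`layer_norm`, `add_and_norm`, `encoder_layer`, `encoder`) are hypothetical, the layer norm has no learned gain or bias, and the two sub-layers are passed in as arbitrary functions rather than real attention or feed-forward networks.

```python
import math

def layer_norm(vec, eps=1e-6):
    """Normalize one vector to zero mean and (roughly) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add_and_norm(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) row by row."""
    out = sublayer(x)
    return [layer_norm([a + b for a, b in zip(row_x, row_out)])
            for row_x, row_out in zip(x, out)]

def encoder_layer(x, attention, feed_forward):
    """One Encoder layer: two sub-layers, each wrapped by add_and_norm."""
    x = add_and_norm(x, attention)
    return add_and_norm(x, feed_forward)

def encoder(x, layers):
    """Apply a stack of (attention, feed_forward) pairs in order."""
    for attention, feed_forward in layers:
        x = encoder_layer(x, attention, feed_forward)
    return x

# Toy stand-ins for the real sub-layers (assumptions for illustration only).
identity = lambda x: x
double = lambda x: [[2.0 * v for v in row] for row in x]

x = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, -1.0]]
y = encoder(x, [(identity, double)] * 6)  # six identical layers
```

The output keeps the input shape, which is what allows six such layers to be stacked.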

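2. Decoder

The Decoder is also a stack of six identical layers. Each layer contains three sub-layers: Masked Multi-Head Attention, Multi-Head Attention, and Feed Forward. That is one more sub-layer than the Encoder; the rest of the structure is the same.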
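The "Masked" part of the first Decoder sub-layer is not spelled out above. In the original Transformer design, the mask stops each position from attending to later positions. The sketch below shows one way to express that; `causal_mask` and `masked_softmax` are hypothetical names, and the all-zero score matrix is just a toy input.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to masked-out positions."""
    weights = []
    for row, allowed in zip(scores, mask):
        # Subtract the largest allowed score for numerical stability.
        peak = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - peak) if ok else 0.0
                for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# With equal scores, each position spreads its weight evenly over
# itself and the positions before it.
scores = [[0.0] * 3 for _ in range(3)]
weights = masked_softmax(scores, causal_mask(3))
```

The first row puts all of its weight on position 0, the second row splits it between positions 0 and 1, and so on.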

3. Multi-Head Attention

Multi-Head Attention concatenates the outputs of several Self-Attention heads; at its core it is still the Self-Attention computation.
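The concatenation step can be sketched as follows. The numbers are assumptions chosen to match the dimensions used later in this post (8 heads of 64 dimensions each give 512), `concat_heads` is a hypothetical helper, and the per-head outputs are dummy values rather than real attention results. The original Transformer also applies a final linear projection after the concatenation, which is left out here.

```python
def concat_heads(head_outputs):
    """Concatenate per-head outputs along the feature axis.

    head_outputs: list of heads, each a seq_len x d_head list of lists.
    Returns a seq_len x (num_heads * d_head) list of lists.
    """
    seq_len = len(head_outputs[0])
    merged = []
    for pos in range(seq_len):
        row = []
        for head in head_outputs:
            row.extend(head[pos])
        merged.append(row)
    return merged

num_heads, d_head, seq_len = 8, 64, 3
# Dummy head outputs: head h is filled with the value h.
heads = [[[float(h)] * d_head for _ in range(seq_len)]
         for h in range(num_heads)]
merged = concat_heads(heads)  # shape: 3 x 512
```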


Self-Attention

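As introduced earlier, Attention can essentially be viewed as using a Query to look up key-value pairs <Key, Value>, i.e. Attention(Q, K, V). Self-Attention, as the name suggests, is simply Attention(X, X, X): the Query, Key, and Value all come from the same input X.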
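For reference, the scaled dot-product form of Attention from the original Transformer paper can be written as below. The symbol $d_k$ (the Key dimension) and the $\sqrt{d_k}$ scaling come from that paper rather than from the text above.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
$$

In Self-Attention, $Q$, $K$, and $V$ are all derived from the same input $X$, each through its own learned projection $W^{Q}$, $W^{K}$, $W^{V}$:

$$
Q = XW^{Q}, \qquad K = XW^{K}, \qquad V = XW^{V}
$$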

The structure of Attention is shown in the figure below. Note that the Transformer Encoder does not use the Mask operation.

[Figure: Attention structure diagram]

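1. First, obtain the Query, Key, and Value:

The Embedding dimension used by the Transformer is 512. Using the embeddings directly as Query, Key, and Value would be computationally expensive. To reduce the dimension, three matrices $W^{Q}, W^{K}, W^{V}$ are learned, which project the Query, Key, and Value down to 64 dimensions.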
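A minimal sketch of this projection, assuming each of $W^{Q}, W^{K}, W^{V}$ has shape 512 × 64 so that multiplying a (sentence length × 512) input gives a (sentence length × 64) result. The helpers `matmul` and `random_matrix` are hypothetical, and the matrices are filled with random numbers in place of trained weights.

```python
import random

def matmul(A, B):
    """Multiply an (n x m) matrix by an (m x p) matrix, both lists of lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols]
            for row in A]

def random_matrix(rows, cols, rng):
    """Stand-in for a learned matrix: small random Gaussian entries."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_k, seq_len = 512, 64, 3

X = random_matrix(seq_len, d_model, rng)   # one 512-dim embedding per word
W_Q = random_matrix(d_model, d_k, rng)     # in practice these are learned
W_K = random_matrix(d_model, d_k, rng)
W_V = random_matrix(d_model, d_k, rng)

Q = matmul(X, W_Q)  # seq_len x 64
K = matmul(X, W_K)  # seq_len x 64
V = matmul(X, W_V)  # seq_len x 64
```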


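2. Compute Attention between each word in the input sentence and every word in the sentence (itself included), then sum the Value vectors weighted by the resulting attention weights to get each word's new representation.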
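The step above can be sketched in plain Python. This is a hedged illustration: the function names are hypothetical, the division by the square root of the Key dimension follows the original Transformer paper, and the tiny 2-dimensional inputs are toy values, not real embeddings.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Return (weights, output) for scaled dot-product attention."""
    d_k = len(K[0])
    all_weights, output = [], []
    for q in Q:
        # Score this word's Query against every word's Key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # New representation: weighted sum of the Value vectors.
        new_vec = [sum(w * v[j] for w, v in zip(weights, V))
                   for j in range(len(V[0]))]
        all_weights.append(weights)
        output.append(new_vec)
    return all_weights, output

# Toy example with two "words"; Q, K, V would normally come from
# projecting the embeddings with learned matrices.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
weights, output = attention(Q, K, V)
```

Each row of `weights` sums to 1, so every new representation is a weighted average of the Value vectors.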


3. Some details

Start with how the matrices $W^{Q}, W^{K}, W^{V}$ in step 1 reduce the dimension of the Query, Key, and Value to 64.

The input X has size (sentence length × Embedding_size), while $W^{Q}, W^{K}, W^{V}$
