Notes on Attention Is All You Need

Latest recommended article published on 2024-05-23 19:01:25

望non

Tags: deep learning, transformer, natural language processing

Article link: https://blog.csdn.net/qq_42936876/article/details/121909370

Attention Is All You Need

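Abstract

We propose a new, simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show that these models are superior in quality while being more parallelizable and requiring significantly less time to train.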

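Introduction

RNNs, LSTMs, and gated RNNs have been established as the state-of-the-art approaches to sequence modeling and transduction, and much work has continued to push their boundaries. They align positions with steps in computation time: the hidden state $h_t$ at position t is produced from the previous hidden state $h_{t-1}$ and the input at position t. This inherently sequential nature prevents parallelization within a training example, which becomes critical at longer sequence lengths, because memory constraints limit batching across examples. Recent work has achieved significant gains in computational efficiency through factorization tricks [21] and conditional computation [32], with the latter also improving model performance. The fundamental constraint of sequential computation, however, remains.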

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.

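(Note: in these earlier models, attention is mainly used to pass information from the encoder to the decoder effectively.)
In this work we propose the Transformer, a model architecture that avoids recurrence and relies entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.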

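Model Architecture

First paragraph:
Explains the encoder and the decoder.
auto-regressive: the outputs of past time steps are used as input at the current time step.
Second paragraph:
Omitted.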
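
The auto-regressive idea can be sketched as a loop in which every generated token is appended to the input of the next step. In the sketch below, `toy_next_token` is a hypothetical stand-in for a trained decoder (it simply returns the previous token plus one), so only the structure of the loop is meaningful.

```python
from typing import Callable, List


def toy_next_token(prefix: List[int], vocab_size: int = 10) -> int:
    """Hypothetical stand-in for a decoder: predicts (last token + 1) mod vocab_size."""
    return (prefix[-1] + 1) % vocab_size


def generate(next_token: Callable[[List[int]], int], start: int, steps: int) -> List[int]:
    """Auto-regressive generation: each output is fed back as part of the next input."""
    tokens = [start]
    for _ in range(steps):
        tokens.append(next_token(tokens))  # past outputs become the current input
    return tokens


print(generate(toy_next_token, start=0, steps=4))  # [0, 1, 2, 3, 4]
```

Because step t needs the result of step t-1, this loop cannot be parallelized at generation time; during training the whole target sequence is known, which is why masking is used instead.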

3.1 Encoder and Decoder Stacks

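Encoder:
Each layer contains two sublayers, and the output of each sublayer is LayerNorm(x + Sublayer(x)).
LayerNorm: normalizes each sample across its features.
BatchNorm: normalizes each feature across the samples in a batch.
Decoder:
Each layer contains three sublayers. The self-attention sublayer is masked to keep the decoder auto-regressive: the vectors at later positions are hidden, so only the vectors already obtained are used.
The inputs to the Multi-Head Attention block are, in order, K, V, and the query. In the encoder-decoder attention sublayer, K and V come from the encoder output and the query comes from the preceding decoder sublayer.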
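
The difference between the two normalizations is just the axis over which the mean and variance are taken. A minimal NumPy sketch, assuming a 2-D input of shape (samples, features) and leaving out the learnable gain and bias that real layers have; the `0.5 * h` sublayer is a made-up placeholder:

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each sample (row) over its own features."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def batch_norm(x, eps=1e-5):
    """Normalize each feature (column) over the samples in the batch."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def sublayer_output(x, sublayer):
    """LayerNorm(x + Sublayer(x)): residual connection, then layer norm."""
    return layer_norm(x + sublayer(x))


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                  # 4 samples, 8 features (toy sizes)
ln = layer_norm(x)                           # every row has mean ~0
bn = batch_norm(x)                           # every column has mean ~0
out = sublayer_output(x, lambda h: 0.5 * h)  # hypothetical sublayer
print(ln.mean(axis=1).round(6), bn.mean(axis=0).round(6))
```

LayerNorm does not depend on the other samples in the batch, which is convenient when sequences have different lengths.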

3.2 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output,
where the query, keys, values, and output are all vectors. The output is computed as a weighted sum
of the values, where the weight assigned to each value is computed by a compatibility function of the
query with the corresponding key.

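query, key, value
The query is compared with every key to obtain the weight of the value that belongs to that key; the values are then summed according to these weights, which gives the output for the query.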

3.2.1 Scaled Dot-Product Attention

$$
\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

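A single query q is dotted with every key in the sentence, which gives the similarity between q and each word of the sentence. Dividing by $\sqrt{d_k}$ and then applying softmax gives the weight on each value; the weight vector is then multiplied with V (that is, the values are summed according to their weights). In practice the n queries are stacked into a matrix Q with n rows and $d_k$ columns.
Shapes: $Q$ is $(n, d_k)$, $K$ is $(m, d_k)$, $V$ is $(m, d_v)$, $QK^T$ is $(n, m)$, and the output (written $QK^TV$ for short, ignoring the scaling and the softmax) is $(n, d_v)$.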
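
The shape bookkeeping can be checked with a small NumPy sketch of the formula. The sizes n, m, d_k, d_v are arbitrary toy values and the inputs are random, so only the shapes and the row sums of the weights are meaningful.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, m): similarity of each query to each key
    weights = softmax(scores, axis=-1)   # (n, m): each row sums to 1
    return weights @ V, weights          # (n, d_v): weighted sum of the values


n, m, d_k, d_v = 3, 5, 4, 6
rng = np.random.default_rng(0)
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(m, d_k))
V = rng.normal(size=(m, d_v))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)  # (3, 6) (3, 5)
```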
The passage below explains why the dot products are divided by $\sqrt{d_k}$:

While for small values of dk the two mechanisms perform similarly, additive attention outperforms
dot product attention without scaling for larger values of dk [3]. We suspect that for large values of
dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has
extremely small gradients. To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$.

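When $d_k$ is large, the entries of $QK^T$ become large in magnitude, so the softmax output is pushed toward the extremes (close to 0 or 1), where its gradient becomes small.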
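
A quick numerical check of this effect, assuming the components of q and k are independent with mean 0 and variance 1 (an assumption made for the experiment, not a property of trained models). Under that assumption the dot product has variance about $d_k$, and dividing by $\sqrt{d_k}$ brings it back to about 1. The second part shows how large scores saturate softmax; `p * (1 - p)` is the diagonal of the softmax Jacobian.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


rng = np.random.default_rng(0)
results = {}
for d_k in (4, 64, 512):
    q = rng.normal(size=(10000, d_k))
    k = rng.normal(size=(10000, d_k))
    dots = (q * k).sum(axis=1)                    # 10000 sampled dot products
    results[d_k] = (dots.var(), (dots / np.sqrt(d_k)).var())
    print(d_k, results[d_k])                      # raw variance ~ d_k, scaled ~ 1

scores = np.array([1.0, 2.0, 3.0])
p_small = softmax(scores)         # moderate scores: weights are spread out
p_large = softmax(scores * 20.0)  # large scores: nearly one-hot
print(p_small.round(3), (p_small * (1 - p_small)).round(3))
print(p_large.round(3), (p_large * (1 - p_large)).round(3))
```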
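Figure 2:
Left:
MatMul denotes matrix multiplication.
Mask: the query at step t may only use the keys of positions that do not come after it. Because the decoder input is shifted right by one position, these correspond to the outputs generated before step t. The attention scores of all later keys are replaced with a very large negative number, so their softmax weights are practically 0.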
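
A minimal sketch of the masking step in NumPy. It assumes the common indexing convention in which query i may attend to keys j <= i (the decoder input being shifted right by one), and it uses -1e9 as the "very large negative number"; both are choices of this sketch.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def causal_mask(n):
    """Boolean (n, n) matrix: True where query i is allowed to see key j (j <= i)."""
    return np.tril(np.ones((n, n), dtype=bool))


def masked_self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = np.where(causal_mask(len(Q)), scores, -1e9)  # hide later positions
    weights = softmax(scores, axis=-1)
    return weights @ V, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # 4 positions, d_k = d_v = 8 (toy sizes)
out, weights = masked_self_attention(x, x, x)
print(weights.round(2))              # upper triangle is 0
```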
Right:

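$$
\begin{aligned}
\mathrm{MultiHead}(Q,K,V) &= \mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)W^O \\
\text{where}\quad \mathrm{head}_i &= \mathrm{Attention}(QW^Q_i,KW^K_i,VW^V_i)
\end{aligned}
$$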

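Q, K, and V are each projected h times, down to dimension $d_k=d_v=\frac{d_{model}}{h}$.
Multiple heads are similar to the multiple channels of an image in CV.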
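
The formula can be written out directly in NumPy. This is a sketch with random, untrained projection matrices and toy sizes (d_model = 16, h = 4); it loops over the heads for readability, whereas practical implementations usually batch them into one tensor operation.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V


def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v: lists of h per-head projections; W_o: (h * d_v, d_model)."""
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o   # Concat(head_1..head_h) W^O


d_model, h = 16, 4
d_k = d_v = d_model // h                           # 4 dimensions per head
rng = np.random.default_rng(0)
W_q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
W_o = rng.normal(size=(h * d_v, d_model))

x = rng.normal(size=(5, d_model))                  # 5 positions
y = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o)  # self-attention: Q = K = V = x
print(y.shape)  # (5, 16)
```

Since $d_k = d_{model}/h$, the total cost is close to that of a single head with full dimension, while each head can learn a different projection.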

3.3 Position-wise Feed-Forward Networks

$$
\mathrm{FFN}(x)=\max(0,xW_1+b_1)W_2+b_2
$$
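
"Position-wise" means the same two-layer network is applied to every position independently. A short NumPy sketch with random weights and toy sizes (the paper uses $d_{model}=512$ and an inner dimension of 2048):

```python
import numpy as np


def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each row (position) of x."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2


d_model, d_ff = 8, 32                    # toy sizes for illustration
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(5, d_model))        # 5 positions
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (5, 8)

# Applying the network to one position alone gives the same row as the batched call.
print(np.allclose(y[2], ffn(x[2:3], W1, b1, W2, b2)[0]))  # True
```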

3.4 Embeddings and Softmax

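In the embedding layers, we multiply those weights by $\sqrt{d_{model}}$. (When the embedding dimension is large, each entry of the learned vector tends to be small; after the multiplication, the embedding is on a scale similar to the positional encoding.)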

3.5 Positional Encoding

Attention by itself carries no position information; the details of how it is added are omitted here.
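
For reference, the paper's Section 3.5 uses fixed sinusoids, $PE_{(pos,2i)}=\sin(pos/10000^{2i/d_{model}})$ and $PE_{(pos,2i+1)}=\cos(pos/10000^{2i/d_{model}})$, added to the scaled embeddings. The sketch below follows those formulas and assumes an even $d_{model}$; the embedding table and token ids are made-up placeholders.

```python
import numpy as np


def positional_encoding(max_len, d_model):
    """Sinusoidal encoding; assumes d_model is even."""
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]       # the even indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions
    return pe


vocab_size, d_model, seq_len = 100, 16, 6
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, d_model))  # hypothetical embedding table
tokens = np.array([3, 14, 15, 9, 2, 6])             # hypothetical token ids

pe = positional_encoding(seq_len, d_model)
x = embedding[tokens] * np.sqrt(d_model) + pe       # scale, then add position
print(x.shape)  # (6, 16)
```

Every entry of the encoding lies in [-1, 1], which is why the embedding is scaled up before the two are added.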
