[paper] Attention Is All You Need: A Brief Analysis

Base

Title: "Attention Is All You Need" (2017)

Paper: arXiv (https://arxiv.org/abs/1706.03762)

GitHub: None

Abstract

This paper proposes a new, simple network architecture, the Transformer, based solely on attention mechanisms.

Model Architecture

The model follows an encoder-decoder structure. The encoder maps an input sequence (x_1, x_2, x_3, ...) to a sequence of continuous representations (z_1, z_2, z_3, ...). Given z, the decoder generates an output sequence (y_1, y_2, y_3, ...) one element at a time. At each step the model is auto-regressive [10]: the previously generated symbols are consumed as additional input when generating the next symbol.
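To make the auto-regressive behaviour concrete, a greedy decoding loop might look like the following sketch (PyTorch is assumed; `model.encode`, `model.decode`, `start_id`, and `end_id` are hypothetical names, not from the paper):

```python
import torch

def greedy_decode(model, src, max_len, start_id, end_id):
    """Auto-regressive generation: the symbols produced so far are fed back
    into the decoder as additional input when generating the next symbol.
    `model.encode` / `model.decode` are hypothetical stand-ins for the
    encoder and decoder stacks."""
    memory = model.encode(src)                                # z = (z_1, z_2, z_3, ...)
    ys = torch.full((src.size(0), 1), start_id, dtype=torch.long)
    for _ in range(max_len - 1):
        logits = model.decode(ys, memory)                     # (batch, cur_len, vocab)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_token], dim=1)               # append y_i and continue
        if (next_token == end_id).all():
            break
    return ys
```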

The overall Transformer architecture is shown in Figure 1; both the encoder and the decoder use stacked self-attention and point-wise, fully connected layers.

Note:

[10] Generating Sequences with Recurrent Neural Networks, https://arxiv.org/abs/1308.0850

Encoder and Decoder Stacks

Encoder:

The encoder is a stack of 6 identical layers. Each layer has two sub-layers: the first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. A residual connection is applied around each sub-layer, followed by layer normalization, so the output of each sub-layer is LayerNorm(x + Sublayer(x)). All sub-layers produce outputs of dimension d_{model} = 512.
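A minimal sketch of the LayerNorm(x + Sublayer(x)) pattern, assuming PyTorch; the class name and arguments are illustrative rather than the paper's reference implementation:

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Residual connection followed by layer normalization:
    output = LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        # `sublayer` is a callable: the self-attention or feed-forward sub-layer.
        return self.norm(x + sublayer(x))
```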

Decoder:

The decoder is also a stack of 6 identical layers. In addition to the two sub-layers found in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder. The self-attention sub-layer is also modified (masked) so that attention cannot reach positions after the current one; combined with the fact that the output embeddings are offset by one position, this ensures that the predictions for position i can depend only on the known outputs at positions less than i.

Attention:

Attention can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is a weighted sum of the values, where the weight assigned to each value is computed by a function of the query and the corresponding key.


Scaled Dot-Product Attention

This attention is called "Scaled Dot-Product Attention". The input consists of queries and keys of dimension d_k and values of dimension d_v. The dot products of the query with all keys are computed, each result is divided by \sqrt{d_k}, and a softmax function is applied to obtain the weights. The formula is:

Attention(Q,K,V)=softmax(\frac{QK^T}{\sqrt{d_k}})V

where Q, K, and V are the matrices formed by packing the queries, keys, and values together.

The two most commonly used attention functions are additive attention and dot-product attention; this paper uses dot-product attention, which is faster and more space-efficient in practice. For small values of d_k the two mechanisms perform similarly, but for larger d_k additive attention outperforms unscaled dot-product attention: large d_k makes the dot products grow large in magnitude, pushing the softmax into regions with extremely small gradients. To counteract this effect, the dot products are scaled by \frac{1}{\sqrt{d_k}}.
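A minimal sketch of the scaled dot-product attention formula above, assuming PyTorch tensors with queries/keys of dimension d_k and values of dimension d_v in the last axis; the optional `mask` argument anticipates the decoder masking discussed later:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V.
    q, k: (..., seq_len, d_k); v: (..., seq_len, d_v)."""
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)  # (..., len_q, len_k)
    if mask is not None:
        # Disallowed positions (mask == 0) receive a score of -inf before the softmax.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v), weights
```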

Multi-Head Attention

Instead of performing a single attention function over the full queries, keys, and values, it is beneficial to linearly project the queries, keys, and values h times with different learned projections to d_k, d_k, and d_v dimensions, respectively. Attention is performed on each projected version in parallel, yielding d_v-dimensional output values; these are concatenated and once again projected.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions, which a single attention head cannot do as effectively:

MultiHead(Q,K,V)=Concat(head_1,...,head_h)W^O

where head_i=Attention(QW_i^Q, KW_i^K, VW_i^V)
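A sketch of these two formulas, assuming d_model = 512 and h = 8 as in the paper; for brevity the h per-head projections W_i^Q, W_i^K, W_i^V are fused into single linear layers, which is equivalent after splitting the result into heads:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Project Q, K, V for h heads, attend in parallel, concatenate, apply W^O."""
    def __init__(self, d_model: int = 512, h: int = 8):
        super().__init__()
        assert d_model % h == 0
        self.d_k = d_model // h
        self.h = h
        self.w_q = nn.Linear(d_model, d_model)   # fuses W_1^Q ... W_h^Q
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # W^O

    def forward(self, q, k, v, mask=None):
        batch = q.size(0)

        def split(x, lin):
            # (batch, len, d_model) -> (batch, h, len, d_k)
            return lin(x).view(batch, -1, self.h, self.d_k).transpose(1, 2)

        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        heads = torch.matmul(torch.softmax(scores, dim=-1), v)      # (batch, h, len, d_k)
        # Concatenate the h heads and project once more.
        concat = heads.transpose(1, 2).contiguous().view(batch, -1, self.h * self.d_k)
        return self.w_o(concat)
```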

Applications of Attention in our Model

  1. In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence.

  2. The encoder contains self-attention layers. In a self-attention layer all of the keys, values, and queries come from the same place, in this case the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.

  3. Self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. Leftward information flow in the decoder must be prevented to preserve the auto-regressive property. This is implemented inside scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax that correspond to illegal connections; a sketch of this mask follows the list.
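A sketch of the mask from item 3: entries marked False are the "illegal connections" whose attention scores are set to −∞ before the softmax (the function name is illustrative):

```python
import torch

def subsequent_mask(size: int) -> torch.Tensor:
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return torch.tril(torch.ones(size, size)).bool()

# subsequent_mask(4):
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```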

Position-wise Feed-Forward Networks

In addition to the attention sub-layers, each layer of the encoder and decoder contains a fully connected feed-forward network, applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between: FFN(x) = max(0, xW_1 + b_1)W_2 + b_2, with input/output dimension d_{model} = 512 and inner dimension d_{ff} = 2048.
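A minimal sketch of this position-wise feed-forward network, using the paper's dimensions d_model = 512 and d_ff = 2048:

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """FFN(x) = max(0, x W_1 + b_1) W_2 + b_2, applied to each position identically."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); nn.Linear acts on the last dimension,
        # so every position is transformed by the same two linear maps.
        return self.linear2(torch.relu(self.linear1(x)))
```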

Embeddings and Softmax

This paper uses learned embeddings to convert the input tokens and output tokens to vectors of dimension d_{model}. The usual learned linear transformation and softmax function are used to convert the decoder output to predicted next-token probabilities. The same weight matrix is shared between the two embedding layers and the pre-softmax linear transformation, similar to [30]. In the embedding layers, those weights are multiplied by \sqrt{d_{model}}.
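A sketch of the embedding scaling and of sharing the embedding matrix with the pre-softmax linear transformation; the classes and the tying shown below are one illustrative way to realize what the paragraph describes, not the paper's own code:

```python
import math
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """Learned embedding multiplied by sqrt(d_model)."""
    def __init__(self, vocab_size: int, d_model: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.embed(tokens) * math.sqrt(self.d_model)

# Sharing one weight matrix with the pre-softmax linear transformation:
vocab_size, d_model = 37000, 512           # 37000 ~ the paper's shared BPE vocabulary
embedding = TokenEmbedding(vocab_size, d_model)
generator = nn.Linear(d_model, vocab_size, bias=False)
generator.weight = embedding.embed.weight  # both layers now use the same matrix
```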

Positional Encoding
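The paper injects position information by adding fixed sinusoidal encodings to the embeddings, PE(pos, 2i) = sin(pos / 10000^{2i/d_{model}}) and PE(pos, 2i+1) = cos(pos / 10000^{2i/d_{model}}). A minimal sketch of these encodings:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int = 512) -> torch.Tensor:
    """Fixed sine/cosine positional encodings, shape (max_len, d_model)."""
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)       # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                         * (-math.log(10000.0) / d_model))                 # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe  # added to the (scaled) token embeddings before the first layer
```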

Source Code

To be filled in later...
