Attention Is All You Need

This paper proposes the Transformer, a new approach to machine translation that relies solely on attention mechanisms, replacing the traditional RNNs and CNNs. The Transformer removes the RNN's obstacle of not being parallelizable during training: self-attention layers handle global dependencies, combined with multi-head attention and fully connected feed-forward networks. Positional encodings account for sequence order, improving the model's efficiency and interpretability.

Abstract

Task: machine translation
Traditional approaches: RNN/CNN, possibly combined with an attention mechanism
This paper's approach: the Transformer, using attention only

Introduction

RNNs' inherently sequential nature precludes parallelization within training examples.
In prior work, attention mechanisms are typically used in conjunction with a recurrent network.
Transformer: relying entirely on an attention mechanism to draw global dependencies between input and output

Background

Self-attention (previously covered in the SAGAN post): intra-attention, relating different positions of a single sequence in order to compute a representation of the sequence

Model Architecture

encoder-decoder structure
encoder: $(x_1, \dots, x_n) \mapsto (z_1, \dots, z_n)$
auto-regressive: in the Transformer decoder, the previously generated symbols are consumed as additional input when generating the next one; in classical time-series terms, an auto-regressive model is one whose output depends linearly on its own previous values plus a stochastic (imperfectly predictable) term:

$$X_t = c + \sum_{i=1}^{p} \phi_i X_{t-i} + \epsilon_t$$
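
A minimal sketch of this feedback loop as greedy decoding, assuming a hypothetical `model(src, tgt_prefix)` callable that returns scores over the next token (names and interface are illustrative, not from the paper):

```python
import numpy as np

def greedy_decode(model, src, bos_id, eos_id, max_len=50):
    """Generate one token at a time, feeding previous outputs back in."""
    tgt = [bos_id]                                 # start from a begin-of-sequence token
    for _ in range(max_len):
        next_id = int(np.argmax(model(src, tgt)))  # pick the most likely next symbol
        tgt.append(next_id)
        if next_id == eos_id:                      # stop at end-of-sequence
            break
    return tgt
```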

Scaled Dot-Product Attention: queries $Q$, keys $K$, values $V$, dimension of keys $d_k$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
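
A minimal NumPy sketch of this formula; shapes and variable names are illustrative, not from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k): scaled similarity scores
    weights = softmax(scores, axis=-1)   # each query's weights sum to 1
    return weights @ V                   # (n_q, d_v): weighted sum of values
```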

Multi-Head Attention: concatenate, project

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
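
A sketch building on `scaled_dot_product_attention` above: each head gets its own learned projections, the head outputs are concatenated and projected by `W_O`. The per-head weight matrices are assumed here to be passed in as plain NumPy arrays:

```python
import numpy as np

def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
    # W_Q, W_K, W_V: lists of h per-head projection matrices; W_O: (h * d_v, d_model)
    heads = [
        scaled_dot_product_attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i])
        for i in range(len(W_Q))
    ]
    return np.concatenate(heads, axis=-1) @ W_O  # concatenate all heads, then project
```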

fully connected feed-forward network: two fully connected layers with a ReLU in between, applied identically at every position

$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)W_2 + b_2$$
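
The same formula as a small NumPy function (weight matrices again assumed to be passed in directly):

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # ReLU(x W1 + b1) W2 + b2, applied row-wise, i.e. identically at every position
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```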

Positional Encoding: to make use of the order of the sequence, inject information about the relative or absolute position of the tokens in the sequence.

$$PE_{(pos, 2i)} = \sin\!\left(pos / 10000^{2i/d_{\mathrm{model}}}\right), \qquad PE_{(pos, 2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\mathrm{model}}}\right)$$

the sinusoidal form allows the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$
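
A sketch that builds the full encoding table, assuming an even `d_model`:

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    # Returns an (n_positions, d_model) table; assumes d_model is even.
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 0, 2, 4, ...
    angle = pos / np.power(10000.0, i / d_model)   # (n_positions, d_model / 2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)                    # sine on even dimensions
    pe[:, 1::2] = np.cos(angle)                    # cosine on odd dimensions
    return pe
```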

Why Self-Attention

  1. the total computational complexity per layer
  2. the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required
  3. the path length between long-range dependencies in the network (see the comparison below)
  4. as a side benefit, self-attention could yield more interpretable models
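
From Table 1 of the paper, with $n$ the sequence length, $d$ the representation dimension, and $k$ the convolution kernel width:

  - Self-Attention: $O(n^2 \cdot d)$ per layer, $O(1)$ sequential operations, $O(1)$ maximum path length
  - Recurrent: $O(n \cdot d^2)$ per layer, $O(n)$ sequential operations, $O(n)$ maximum path length
  - Convolutional: $O(k \cdot n \cdot d^2)$ per layer, $O(1)$ sequential operations, $O(\log_k(n))$ maximum path length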