“Attention Is All You Need” (personal notes)

“We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.” In short: the Transformer is built entirely on attention, with no recurrence and no convolution.

“1 Introduction”

“Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences.” Attention lets a model relate two positions regardless of how far apart they are in the input or output sequence.

“The Transformer allows for significantly more parallelization” than recurrent models, since it does not have to process the sequence step by step.

“2 Background”

“In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.” Here “this” is the number of operations required to relate signals from two arbitrary input or output positions; the price is reduced effective resolution from averaging attention-weighted positions, which Multi-Head Attention counteracts.

  • “Self-attention”

    “an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.” That is, attention applied within one sequence to compute a representation of that sequence.

  • “End-to-end memory networks”

    “based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks” End-to-end memory networks replace sequence-aligned recurrence with a recurrent attention mechanism and do well on simple-language question answering and language modeling.

“3 Model Architecture”

“the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time.” The encoder turns the input symbols into continuous representations z; the decoder then emits the output symbols one at a time, conditioned on z.

“3.1 Encoder and Decoder Stacks”

“Encoder”

“The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.” Six identical layers, each with a multi-head self-attention sub-layer followed by a position-wise fully connected feed-forward sub-layer.
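
The paper additionally wraps each sub-layer in a residual connection followed by layer normalization, i.e. LayerNorm(x + Sublayer(x)). A schematic sketch of one encoder layer, where the callables `self_attn`, `ffn`, `norm1`, `norm2` are hypothetical stand-ins rather than anything the paper names:

```python
def encoder_layer(x, self_attn, ffn, norm1, norm2):
    """One encoder layer: two sub-layers, each wrapped in a residual
    connection followed by layer normalization."""
    x = norm1(x + self_attn(x))   # sub-layer 1: multi-head self-attention
    x = norm2(x + ffn(x))         # sub-layer 2: position-wise feed-forward
    return x
```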

“Decoder”

“the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack” The decoder also stacks N = 6 identical layers and adds a third sub-layer that performs multi-head attention over the output of the encoder stack.

“3.2 Attention”

“An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.” The output is a weighted sum of the values, where the weight given to each value comes from a compatibility function of the query with the corresponding key.

“3.2.1 Scaled Dot-Product Attention”

“The input consists of queries and keys of dimension dk, and values of dimension dv. We compute the dot products of the query with all keys, divide each by √dk, and apply a softmax function to obtain the weights on the values.” Dot the query against every key, scale by 1/√dk, apply softmax to get the weights over the values, and output the weighted sum of the values.
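
In matrix form this is Attention(Q, K, V) = softmax(QKᵀ/√dk) V. A minimal NumPy sketch (function and argument names are mine, not the paper's):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (..., len_q, d_k), K: (..., len_k, d_k), V: (..., len_k, d_v).
    `mask` (optional) is True where attention is allowed.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)      # (..., len_q, len_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)           # block disallowed positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)               # softmax over the keys
    return w @ V                                        # (..., len_q, d_v)
```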

“3.2.2 Multi-Head Attention”

“we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to dk, dk and dv dimensions, respectively.” The h projected copies are attended to in parallel, and the resulting dv-dimensional outputs are concatenated and projected once more; in the paper h = 8 and dk = dv = dmodel / h = 64.
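
A sketch of multi-head attention built on the function above. The weight dictionary and its keys are hypothetical names for the learned projection matrices, not the paper's notation:

```python
def multi_head_attention(X_q, X_kv, W, h):
    """Project queries/keys/values h times, attend in parallel,
    concatenate the heads, and apply a final output projection.

    X_q: (len_q, d_model), X_kv: (len_kv, d_model).
    W: dict of learned matrices (hypothetical names):
       W["q"], W["k"]: (d_model, h * d_k); W["v"]: (d_model, h * d_v);
       W["o"]: (h * d_v, d_model).
    """
    len_q, len_kv = X_q.shape[0], X_kv.shape[0]
    d_k = W["q"].shape[1] // h
    d_v = W["v"].shape[1] // h

    # Linear projections, then split into h heads: (h, len, d_k or d_v)
    Q = (X_q  @ W["q"]).reshape(len_q,  h, d_k).transpose(1, 0, 2)
    K = (X_kv @ W["k"]).reshape(len_kv, h, d_k).transpose(1, 0, 2)
    V = (X_kv @ W["v"]).reshape(len_kv, h, d_v).transpose(1, 0, 2)

    heads = scaled_dot_product_attention(Q, K, V)       # (h, len_q, d_v)
    concat = heads.transpose(1, 0, 2).reshape(len_q, h * d_v)
    return concat @ W["o"]                              # (len_q, d_model)
```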

“3.2.3 Applications of Attention in our Model”

  • “In "encoder-decoder attention" layers,”

  • “The encoder contains self-attention layers”

  • “self-attention layers in the decoder”
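
To make the last bullet concrete, a causal mask can be passed to the scaled dot-product sketch from 3.2.1; the sizes below are arbitrary placeholders:

```python
n = 5                                                 # decoder sequence length
causal_mask = np.tril(np.ones((n, n), dtype=bool))    # position i may attend to j <= i

rng = np.random.default_rng(0)
x = rng.normal(size=(n, 64))                          # stand-in decoder states, d_k = d_v = 64
out = scaled_dot_product_attention(x, x, x, mask=causal_mask)
print(out.shape)                                      # (5, 64)
```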

“3.3 Position-wise Feed-Forward Networks”

“In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically.” The same small network is applied independently at every position; its parameters are shared across positions but differ from layer to layer.
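
In the paper this network is FFN(x) = max(0, xW1 + b1)W2 + b2, with d_model = 512 and inner dimension d_ff = 2048. A minimal sketch (parameter names are mine):

```python
def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position independently.

    x: (seq_len, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model).
    """
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```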

“3.4 Embeddings and Softmax”

“we use learned embeddings to convert the input tokens and output tokens to vectors of dimension dmodel. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities.” In the paper the two embedding layers and the pre-softmax linear transformation share the same weight matrix, and the embeddings are multiplied by √dmodel.
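
A small sketch of this weight sharing; the vocabulary size, random initializer and stand-in decoder states below are illustrative assumptions, not values from the paper:

```python
d_model, vocab_size = 512, 32000
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, d_model)) * 0.02   # shared embedding matrix

tokens = np.array([5, 17, 42])
x = E[tokens] * np.sqrt(d_model)                    # input embeddings, scaled by sqrt(d_model)

h = rng.normal(size=(len(tokens), d_model))         # stand-in for decoder output states
logits = h @ E.T                                    # pre-softmax linear transform reuses E
logits -= logits.max(axis=-1, keepdims=True)
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # next-token probabilities
```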

“3.5 Positional Encoding”

“Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks.” Without recurrence or convolution the model has no notion of order, so position information is injected by adding positional encodings to the input embeddings at the bottom of both stacks.

The encodings used are sinusoids of different frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i/dmodel))
PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel))

where pos is the position and i is the dimension.
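
A sketch of the sinusoidal encoding (assumes d_model is even; the result is added to the embeddings):

```python
def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(seq_len)[:, None]                     # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                  # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)     # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                          # even dimensions
    pe[:, 1::2] = np.cos(angles)                          # odd dimensions
    return pe
```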

“4 Why Self-Attention”

“One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required. The third is the path length between long-range dependencies in the network.” Three criteria for comparing self-attention with recurrent and convolutional layers: total per-layer computation, how much of it can be parallelized (the minimum number of sequential operations), and the path length between long-range dependencies (shorter paths make such dependencies easier to learn).

“5 Training”

“5.1 Training Data and Batching”

“5.2 Hardware and Schedule”

“5.3 Optimizer”

“5.4 Regularization”

“6 Results”

“6.1 Machine Translation”

“6.2 Model Variations”

“6.3 English Constituency Parsing”

“7 Conclusion”

In multimodal self-attention, each modality first applies self-attention within itself to extract modality-internal information, and cross-modal attention is then used to fuse information across modalities. Deferring the multimodal fusion to later layers of the model lets each modality's internal information be extracted more fully, since different modalities have very different data structures and distributions and treating them identically may not be reasonable. Within a single modality the original self-attention mechanism is still used, while the cross-modal fusion performs cross-attention using only part of each modality's information. In addition, the flow of attention between modalities within a layer can be restricted by introducing a small set of latent fusion units that form an “attention bottleneck”: all cross-modal interaction has to pass through these units. This reduces computation, copes with partially redundant information, and lets the model concentrate on the most relevant inputs within each modality while sharing only what is necessary with the other modalities [1][2][3]. A minimal sketch of the bottleneck idea follows the references.

References

[1] 【多模态】《Attention Bottlenecks for Multimodal Fusion》论文阅读笔记, https://blog.csdn.net/qq_36643449/article/details/124968439
[2] 【论文阅读】Attention Bottlenecks for Multimodal Fusion---多模态融合,音视频分类,注意力机制, https://blog.csdn.net/me_yundou/article/details/121070837
[3] Attention is all you need:关于transformer中的self-attention, https://blog.csdn.net/hands_up_down/article/details/122022802
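
The following is a rough NumPy sketch of that bottleneck idea, in the spirit of the fusion papers cited above; the two-modality setup, the function name and the per-modality attention callables are all hypothetical, not an API from any of those papers:

```python
def bottleneck_fusion_layer(x_a, x_b, z, attn_a, attn_b):
    """One hypothetical fusion layer: each modality attends only over its own
    tokens plus a small set of shared bottleneck tokens z, so all cross-modal
    interaction must pass through z.

    x_a: (n_a, d), x_b: (n_b, d), z: (n_z, d).
    attn_a / attn_b: per-modality self-attention callables taking one sequence,
    e.g. lambda s: multi_head_attention(s, s, W_a, h) from the sketch above.
    """
    n_z = z.shape[0]
    out_a = attn_a(np.concatenate([x_a, z], axis=0))   # modality A attends over [x_a; z]
    out_b = attn_b(np.concatenate([x_b, z], axis=0))   # modality B attends over [x_b; z]
    # Keep each modality's own tokens; average the two updated bottleneck copies.
    new_z = 0.5 * (out_a[-n_z:] + out_b[-n_z:])
    return out_a[:-n_z], out_b[:-n_z], new_z
```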