Xiao Zhou Reads Papers with You, Part 2: a fresh take on an old classic, "Attention is all you need" (1), a Transformer walkthrough even a paramecium can follow

This article is a personal reading of the classic paper "Attention is all you need", aiming to introduce the Transformer architecture in a simple, accessible way. It explains how the Transformer addresses the sequential-computation problem of traditional Seq2Seq models, compares the training setups of GPT and BERT to explore why decoder-only architectures have gradually become mainstream, and discusses the expressiveness and advantages of decoder-only models. It also touches on the low-rank problem and zero-shot evaluation to argue for the strengths of decoder-only models.

This paper hardly needs an introduction; I'd guess more than 70% of my readers have already read it.

But as usual, 1, 2, 3, here's the link:

      1706.03762.pdf (arxiv.org)

If I just lectured straight through "Attention is all you need" it might get a bit dry, since so many people have already read it. But this is a paper you simply cannot skip if you want to work with LLMs, so I'll read it from my own angle, smuggle in a few personal takes, and promise you a different, richer reading of "Attention is all you need".

My goal is to make absolutely sure everyone understands, so I'll go into a lot of detail; hopefully this ends up as a Transformer paper walkthrough that even a paramecium could follow.

I'll paste only one part of the original text, the Background section:

Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22].

End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks.
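The key claim in that paragraph is that self-attention relates any two positions in a constant number of operations, versus a path length that grows linearly for ConvS2S and logarithmically for ByteNet. Here is a minimal NumPy sketch of scaled dot-product self-attention to make that concrete; it is my own illustration rather than the paper's reference code, and the function name `self_attention` and the toy dimensions are made up for the example. The single `Q @ K.T` product scores every pair of positions at once, which is why distance no longer matters; the "averaging" the authors mention is the softmax-weighted sum over values, and multi-head attention simply runs several such maps in parallel on lower-dimensional projections and concatenates the results.

```python
# Minimal sketch of scaled dot-product self-attention (illustration only,
# not the paper's reference implementation).
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """x: (seq_len, d_model); W_q/W_k/W_v: (d_model, d_k) projection matrices."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v              # project inputs to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (seq_len, seq_len): all position pairs at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # attention-weighted average of value vectors

# Toy usage with made-up sizes: 5 tokens, d_model=8, d_k=4
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)
print(out.shape)  # (5, 4)
```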
