This is a technical piece; its hand-drawn "napkin" diagrams in particular offer a fresh, math-flavored way to look at the architecture of large language models.
Original post: https://dugas.ch/artificial_curiosity/GPT_architecture.html
Author: Daniel Dugas
Translation/editing: liyane, translated with the help of an LLM Chat API.
Now let's follow the author on an enjoyable journey:
The GPT-3 Architecture, on a Napkin
There are so many brilliant posts on GPT-3, demonstrating what it can do, pondering its consequences, visualizing how it works. With all these out there, it still took a crawl through several papers and blogs before I was confident that I had grasped the architecture.
So the goal for this page is humble, but simple: help others build as detailed an understanding of the GPT-3 architecture as possible.
Or if you’re impatient, jump straight to the full-architecture sketch.
Original Diagrams
As a starting point, the original transformer and GPT papers[1][2][3] provide us with the following diagrams:
Not bad as far as diagrams go, but if you’re like me, not enough to understand the full picture. So let’s dig in!
In / Out
Before we can understand anything else, we need to know: what are the inputs and outputs of GPT?
The input is a sequence of N words (a.k.a. tokens). The output is a guess for the word most likely to be put at the end of the input sequence.
That’s it! All the impressive GPT dialogues, stories and examples you see posted around are made with this simple input-output scheme: give it an input sequence – get the next word.
Not all heroes wear -> capes
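To make that contract concrete, here is a minimal Python sketch. Everything in it is hypothetical: gpt_next_word is just a stand-in for the whole model (a hardcoded lookup so the snippet runs), not a real API.

```python
# Toy stand-in for GPT's input/output contract: a sequence of tokens goes in,
# a single guessed next word comes out. The "model" here is a hardcoded lookup
# so the example runs; the rest of the article explains what really happens inside.

def gpt_next_word(tokens: list[str]) -> str:
    canned_answers = {
        ("Not", "all", "heroes", "wear"): "capes",
        ("Not", "all", "heroes", "wear", "capes"): "but",
        ("Not", "all", "heroes", "wear", "capes", "but"): "all",
        ("Not", "all", "heroes", "wear", "capes", "but", "all"): "villains",
        ("Not", "all", "heroes", "wear", "capes", "but", "all", "villains"): "do",
    }
    return canned_answers.get(tuple(tokens), "<unknown>")

print(gpt_next_word(["Not", "all", "heroes", "wear"]))  # -> capes
```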
Of course, we often want to get more than one word, but that’s not a problem: after we get the next word, we add it to the sequence, and get the following word.
Not all heroes wear capes -> but
Not all heroes wear capes but -> all
Not all heroes wear capes but all -> villains
Not all heroes wear capes but all villains -> do
Repeat as much as desired, and you end up with long generated texts.
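This "append the word and ask again" procedure is the usual autoregressive generation loop. A minimal sketch, reusing the hypothetical gpt_next_word stand-in from above:

```python
# Greedy autoregressive generation: repeatedly ask the (hypothetical)
# single-step model for the next word and append it to the running sequence.

def generate(prompt: list[str], n_words: int) -> list[str]:
    tokens = list(prompt)  # don't mutate the caller's prompt
    for _ in range(n_words):
        tokens.append(gpt_next_word(tokens))
    return tokens

print(" ".join(generate(["Not", "all", "heroes", "wear"], n_words=5)))
# -> "Not all heroes wear capes but all villains do"
```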
Actually, to be precise, we need to correct the above in two aspects.
- The input sequence is actually fixed to 2048 words (for GPT-3). We can still pass short sequences as input: we simply fill all extra positions with “empty” values.
- The GPT output is not just a single guess, it's a sequence (length 2048) of guesses: a probability for each likely word, one for every 'next' position in the sequence. When generating text, though, we typically only look at the guess for the last word of the sequence (see the sketch below).
That's it! Sequences in, sequences out.
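Both corrections can also be sketched in code. This is a shape-level illustration only, assuming NumPy: the 2048 context length and last-position read-out come from the text above, while the vocabulary size (50257, the BPE vocabulary GPT-3 shares with GPT-2), the padding id of 0, the made-up token ids, and the random "model output" are placeholders, not GPT-3's real API.

```python
import numpy as np

N_CTX = 2048      # GPT-3's fixed input length (correction 1)
N_VOCAB = 50257   # GPT-3's BPE vocabulary size (taken as given for illustration)

def pad_to_context(token_ids: list[int], n_ctx: int = N_CTX, pad_id: int = 0) -> np.ndarray:
    """Correction 1: shorter inputs are padded with 'empty' values
    up to the fixed context length."""
    padded = np.full(n_ctx, pad_id, dtype=np.int64)
    padded[: len(token_ids)] = token_ids
    return padded

def pick_next_token(probs: np.ndarray, n_input: int) -> int:
    """Correction 2: the model outputs one probability distribution over the
    vocabulary for every position (an (n_ctx, n_vocab) array); when generating
    we only read the distribution at the last real input position."""
    return int(np.argmax(probs[n_input - 1]))

# Usage, with made-up token ids and a small random stand-in for the model's
# output so the example stays light (the real shapes would be 2048 x 50257):
token_ids = [101, 102, 103, 104]                      # pretend: "Not all heroes wear"
inputs = pad_to_context(token_ids, n_ctx=16)          # shape: (16,)
fake_probs = np.random.rand(16, 100)                  # stand-in for (2048, 50257)
fake_probs /= fake_probs.sum(axis=-1, keepdims=True)  # rows sum to 1, like probabilities
next_id = pick_next_token(fake_probs, n_input=len(token_ids))
```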