Build a Large Language Model (From Scratch), Chapter 1

1 Understanding Large Language Models

This chapter covers

  • High-level explanations of the fundamental concepts behind large language models (LLMs)
  • Insights into the transformer architecture from which LLMs, such as the ones used on the ChatGPT platform, are derived
  • A plan for building an LLM from scratch

Large language models (LLMs), such as those offered in OpenAI’s ChatGPT, are deep neural network models that have been developed over the past few years. They ushered in a new era for Natural Language Processing (NLP). Before the advent of large language models, traditional methods excelled at categorization tasks such as email spam classification and straightforward pattern recognition that could be captured with handcrafted rules or simpler models. However, they typically underperformed in language tasks that demanded complex understanding and generation abilities, such as parsing detailed instructions, conducting contextual analysis, or creating coherent and contextually appropriate original text. For example, previous generations of language models could not write an email from a list of keywords—a task that is trivial for contemporary LLMs.

LLMs have remarkable capabilities to understand, generate, and interpret human language. However, it’s important to clarify that when we say language models “understand,” we mean that they can process and generate text in ways that appear coherent and contextually relevant, not that they possess human-like consciousness or comprehension.

Enabled by advancements in deep learning, which is a subset of machine learning and artificial intelligence (AI) focused on neural networks, LLMs are trained on vast quantities of text data. This allows LLMs to capture deeper contextual information and subtleties of human language compared to previous approaches. As a result, LLMs have significantly improved performance in a wide range of NLP tasks, including text translation, sentiment analysis, question answering, and many more.

Another important distinction between contemporary LLMs and earlier NLP models is that the earlier models were typically designed for specific tasks, such as text categorization or language translation. Whereas those earlier NLP models excelled in their narrow applications, LLMs demonstrate a broader proficiency across a wide range of NLP tasks.

The success behind LLMs can be attributed to the transformer architecture that underpins many LLMs and to the vast amounts of data they are trained on, which allow them to capture a wide variety of linguistic nuances, contexts, and patterns that would be challenging to encode manually.

This shift towards implementing models based on the transformer architecture and using large training datasets to train LLMs has fundamentally transformed NLP, providing more capable tools for understanding and interacting with human language.

Beginning with this chapter, we set the foundation to accomplish the primary objective of this book: understanding LLMs by implementing a ChatGPT-like LLM based on the transformer architecture step by step in code.

1.1 What is an LLM?

An LLM, a large language model, is a neural network designed to understand, generate, and respond to human-like text. These models are deep neural networks trained on massive amounts of text data, sometimes encompassing large portions of the entire publicly available text on the internet.

The “large” in large language model refers to both the model’s size in terms of parameters and the immense dataset on which it’s trained. Models like this often have tens or even hundreds of billions of parameters, which are the adjustable weights in the network that are optimized during training to predict the next word in a sequence. Next-word prediction is sensible because it harnesses the inherent sequential nature of language to train models on understanding context, structure, and relationships within text. Yet, it is a very simple task and so it is surprising to many researchers that it can produce such capable models. We will discuss and implement the next-word training procedure in later chapters step by step.
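
To make the notion of "parameters" concrete, here is a minimal PyTorch sketch (a toy network, not an LLM) that counts the adjustable weights of a model; the layer sizes are arbitrary and chosen only for illustration.

```python
import torch.nn as nn

# A tiny, hypothetical network -- not an LLM -- used only to illustrate what
# "parameters" means: the adjustable weights that training optimizes.
toy_model = nn.Sequential(
    nn.Linear(8, 16),  # 8*16 weights + 16 biases = 144 parameters
    nn.ReLU(),
    nn.Linear(16, 4),  # 16*4 weights + 4 biases  = 68 parameters
)

num_params = sum(p.numel() for p in toy_model.parameters())
print(num_params)  # 212
```

An LLM applies the same idea at a vastly larger scale; GPT-3, discussed later in this chapter, has 175 billion such parameters.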

LLMs utilize an architecture called the transformer (covered in more detail in section 1.4), which allows them to pay selective attention to different parts of the input when making predictions, making them especially adept at handling the nuances and complexities of human language.

Since LLMs are capable of generating text, LLMs are also often referred to as a form of generative artificial intelligence (AI), often abbreviated as generative AI or GenAI. As illustrated in Figure 1.1, AI encompasses the broader field of creating machines that can perform tasks requiring human-like intelligence, including understanding language, recognizing patterns, and making decisions, and includes subfields like machine learning and deep learning.

Figure 1.1 As this hierarchical depiction of the relationship between the different fields suggests, LLMs represent a specific application of deep learning techniques, leveraging their ability to process and generate human-like text. Deep learning is a specialized branch of machine learning that focuses on using multi-layer neural networks. And machine learning and deep learning are fields aimed at implementing algorithms that enable computers to learn from data and perform tasks that typically require human intelligence.

The algorithms used to implement AI are the focus of the field of machine learning. Specifically, machine learning involves the development of algorithms that can learn from and make predictions or decisions based on data without being explicitly programmed. To illustrate this, imagine a spam filter as a practical application of machine learning. Instead of manually writing rules to identify spam emails, a machine learning algorithm is fed examples of emails labeled as spam and legitimate emails. By minimizing the error in its predictions on a training dataset, the model then learns to recognize patterns and characteristics indicative of spam, enabling it to classify new emails as either spam or legitimate.
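
As a rough sketch of this workflow (with a made-up four-email dataset and scikit-learn chosen purely for illustration; neither is part of this book's code), the spam filter below learns from labeled examples rather than hand-written rules.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled examples (1 = spam, 0 = legitimate)
emails = [
    "Win a FREE prize now!!!",
    "Meeting notes attached for tomorrow",
    "Claim your free reward, click this link",
    "Can you review my draft report?",
]
labels = [1, 0, 1, 0]

# The model learns word patterns from the labeled data instead of relying on manual rules.
spam_filter = make_pipeline(CountVectorizer(), LogisticRegression())
spam_filter.fit(emails, labels)

print(spam_filter.predict(["Free prize! Click now to win"]))  # most likely [1]
```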

As illustrated in Figure 1.1, deep learning is a subset of machine learning that focuses on utilizing neural networks with three or more layers (also called deep neural networks) to model complex patterns and abstractions in data. In contrast to deep learning, traditional machine learning requires manual feature extraction. This means that human experts need to identify and select the most relevant features for the model.

While the field of AI is nowadays dominated by machine learning and deep learning, it also includes other approaches, for example, using rule-based systems, genetic algorithms, expert systems, fuzzy logic, or symbolic reasoning.

Returning to the spam classification example, in traditional machine learning, human experts might manually extract features from email text such as the frequency of certain trigger words (“prize,” “win,” “free”), the number of exclamation marks, use of all uppercase words, or the presence of suspicious links. This dataset, created based on these expert-defined features, would then be used to train the model. In contrast to traditional machine learning, deep learning does not require manual feature extraction. This means that human experts do not need to identify and select the most relevant features for a deep learning model. (However, in both traditional machine learning and deep learning for spam classification, you still require the collection of labels, such as spam or non-spam, which need to be gathered either by an expert or users.)
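
To make the contrast concrete, a hand-crafted feature extractor of the kind described above might look like the following minimal sketch (the specific features and helper name are illustrative assumptions, not the book's code).

```python
import re

def extract_spam_features(email_text: str) -> dict:
    """Expert-defined features for traditional machine learning."""
    trigger_words = ("prize", "win", "free")
    words = email_text.split()
    return {
        "num_trigger_words": sum(w.lower().strip("!.,") in trigger_words for w in words),
        "num_exclamation_marks": email_text.count("!"),
        "num_all_caps_words": sum(w.isupper() and len(w) > 1 for w in words),
        "has_suspicious_link": bool(re.search(r"http://\S+", email_text)),
    }

print(extract_spam_features("WIN a FREE prize!!! Claim it at http://suspicious.example"))
# {'num_trigger_words': 3, 'num_exclamation_marks': 3, 'num_all_caps_words': 2, 'has_suspicious_link': True}
```

A deep learning model, by contrast, learns useful features directly from the raw email text during training.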

The upcoming sections will cover some of the problems LLMs can solve today, the challenges that LLMs address, and the general LLM architecture, which we will implement in this book.

1.2 Applications of LLMs

Owing to their advanced capabilities to parse and understand unstructured text data, LLMs have a broad range of applications across various domains. Today, LLMs are employed for machine translation, generation of novel texts (see Figure 1.2), sentiment analysis, text summarization, and many other tasks. LLMs have recently been used for content creation, such as writing fiction, articles, and even computer code.

Figure 1.2 LLM interfaces enable natural language communication between users and AI systems. This screenshot shows ChatGPT writing a poem according to a user’s specifications.

LLMs can also power sophisticated chatbots and virtual assistants, such as OpenAI’s ChatGPT or Google’s Gemini (formerly called Bard), which can answer user queries and augment traditional search engines such as Google Search or Microsoft Bing.

Moreover, LLMs may be used for effective knowledge retrieval from vast volumes of text in specialized areas such as medicine or law. This includes sifting through documents, summarizing lengthy passages, and answering technical questions.

In short, LLMs are invaluable for automating almost any task that involves parsing and generating text. Their applications are virtually endless, and as we continue to innovate and explore new ways to use these models, it’s clear that LLMs have the potential to redefine our relationship with technology, making it more conversational, intuitive, and accessible.

In this book, we will focus on understanding how LLMs work from the ground up, coding an LLM that can generate texts. We will also learn about techniques that allow LLMs to carry out queries, ranging from answering questions to summarizing text, translating text into different languages, and more. In other words, in this book, we will learn how complex LLM assistants such as ChatGPT work by building one step by step.

1.3 Stages of building and using LLMs

Why should we build our own LLMs? Coding an LLM from the ground up is an excellent exercise to understand its mechanics and limitations. Also, it equips us with the required knowledge for pretraining or finetuning existing open-source LLM architectures to our own domain-specific datasets or tasks.

Research has shown that when it comes to modeling performance, custom-built LLMs—those tailored for specific tasks or domains—can outperform general-purpose LLMs, such as those provided by ChatGPT, which are designed for a wide array of applications. Examples of this include BloombergGPT, which is specialized for finance, and LLMs that are tailored for medical question answering (please see the Further Reading and References section in Appendix B for more details).

Using custom-built LLMs offers several advantages, particularly regarding data privacy. For instance, companies may prefer not to share sensitive data with third-party LLM providers like OpenAI due to confidentiality concerns. Additionally, developing custom LLMs enables deployment directly on customer devices, such as laptops and smartphones, which is something companies like Apple are currently exploring. This local implementation can significantly decrease latency and reduce server-related costs. Furthermore, custom LLMs grant developers complete autonomy, allowing them to control updates and modifications to the model as needed.

The general process of creating an LLM includes pretraining and finetuning. The term “pre” in “pretraining” refers to the initial phase where a model like an LLM is trained on a large, diverse dataset to develop a broad understanding of language. This pretrained model then serves as a foundational resource that can be further refined through finetuning, a process where the model is specifically trained on a narrower dataset that is more specific to particular tasks or domains. This two-stage training approach consisting of pretraining and finetuning is depicted in Figure 1.3.

As illustrated in Figure 1.3, the first step in creating an LLM is to train it on a large corpus of text data, sometimes referred to as raw text. Here, “raw” refers to the fact that this data is just regular text without any labeling information. (Filtering may be applied, such as removing formatting characters or documents in unknown languages.)

This first training stage of an LLM is also known as pretraining, creating an initial pretrained LLM, often called a base or foundation model. A typical example of such a model is the GPT-3 model (the precursor of the original model offered in ChatGPT). This model is capable of text completion, that is, finishing a half-written sentence provided by a user. It also has limited few-shot capabilities, which means it can learn to perform new tasks based on only a few examples instead of needing extensive training data. This is further illustrated in the next section, Introducing the transformer architecture.

After obtaining a pretrained LLM from training on large text datasets, where the LLM is trained to predict the next word in the text, we can further train the LLM on labeled data, also known as finetuning.

The two most popular categories of finetuning LLMs include instruction-finetuning and finetuning for classification tasks. In instruction-finetuning, the labeled dataset consists of instruction and answer pairs, such as a query to translate a text accompanied by the correctly translated text. In classification finetuning, the labeled dataset consists of texts and associated class labels, for example, emails associated with spam and non-spam labels.
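
For a concrete sense of these two data formats, the records below sketch what a single training example might look like in each case; the field names are illustrative assumptions, and later chapters define the formats actually used in this book.

```python
# Instruction-finetuning: instruction-and-answer pairs
instruction_example = {
    "instruction": "Translate the following sentence into German.",
    "input": "This is an example.",
    "output": "Das ist ein Beispiel.",
}

# Classification finetuning: a text and its associated class label
classification_example = {
    "text": "Congratulations, you have won a free prize! Click here to claim it.",
    "label": "spam",
}
```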

In this book, we will cover both code implementations for pretraining and finetuning an LLM, and we will delve deeper into the specifics of instruction-finetuning and finetuning for classification later in this book after pretraining a base LLM.

Figure 1.3 Pretraining an LLM involves next-word prediction on large text datasets. A pretrained LLM can then be finetuned using a smaller labeled dataset.

1.4 Introducing the transformer architecture

Most modern LLMs rely on the transformer architecture, which is a deep neural network architecture introduced in the 2017 paper Attention Is All You Need. To understand LLMs we briefly have to go over the original transformer, which was originally developed for machine translation, translating English texts to German and French. A simplified version of the transformer architecture is depicted in Figure 1.4.

Figure 1.4 A simplified depiction of the original transformer architecture, which is a deep learning model for language translation. The transformer consists of two parts, an encoder that processes the input text and produces an embedding representation (a numerical representation that captures many different factors in different dimensions) of the text that the decoder can use to generate the translated text one word at a time. Note that this figure shows the final stage of the translation process where the decoder has to generate only the final word (“Beispiel”), given the original input text (“This is an example”) and a partially translated sentence (“Das ist ein”), to complete the translation.

The transformer architecture depicted in Figure 1.4 consists of two submodules, an encoder and a decoder. The encoder module processes the input text and encodes it into a series of numerical representations or vectors that capture the contextual information of the input. Then, the decoder module takes these encoded vectors and generates the output text from them. In a translation task, for example, the encoder would encode the text from the source language into vectors, and the decoder would decode these vectors to generate text in the target language. Both the encoder and decoder consist of many layers connected by a so-called self-attention mechanism. You may have many questions regarding how the inputs are preprocessed and encoded. These will be addressed in a step-by-step implementation in the subsequent chapters.
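
As a rough preview of this encoder-decoder data flow, the sketch below uses PyTorch's built-in nn.Transformer purely for illustration; the tensor sizes are arbitrary, and the book implements the relevant GPT components from scratch rather than using this class.

```python
import torch
import torch.nn as nn

# Illustrative sizes; the original transformer used d_model=512 and six layers each.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.rand(1, 5, 512)  # stand-in for an embedded source sentence, e.g., "This is an example"
tgt = torch.rand(1, 4, 512)  # stand-in for the partial translation so far, e.g., "Das ist ein"

out = model(src, tgt)  # the encoder encodes src; the decoder attends to that encoding
print(out.shape)       # torch.Size([1, 4, 512]): one output vector per target position
```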

A key component of transformers and LLMs is the self-attention mechanism (not shown), which allows the model to weigh the importance of different words or tokens in a sequence relative to each other. This mechanism enables the model to capture long-range dependencies and contextual relationships within the input data, enhancing its ability to generate coherent and contextually relevant output. However, due to its complexity, we will defer the explanation to chapter 3, where we will discuss and implement it step by step. Moreover, we will also discuss and implement the data preprocessing steps to create the model inputs in chapter 2, Working with Text Data.
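
Although the full mechanism is deferred to chapter 3, its core computation can be previewed in a few lines. The following is a minimal scaled dot-product attention sketch over random toy tensors, not the implementation developed later in the book.

```python
import torch

torch.manual_seed(123)
d = 4                 # toy embedding dimension
x = torch.rand(6, d)  # six token embeddings for one input sequence

queries, keys, values = x, x, x          # in self-attention, all three come from the same input
scores = queries @ keys.T / d ** 0.5     # how relevant each token is to every other token
weights = torch.softmax(scores, dim=-1)  # each row sums to 1: the attention weights
context = weights @ values               # each token becomes a weighted mix of all tokens

print(weights.shape, context.shape)      # torch.Size([6, 6]) torch.Size([6, 4])
```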

Later variants of the transformer architecture, such as the so-called BERT (short for bidirectional encoder representations from transformers) and the various GPT models (short for generative pretrained transformers), built on this concept to adapt this architecture for different tasks. (References can be found in Appendix B.)

BERT, which is built upon the original transformer’s encoder submodule, differs in its training approach from GPT. While GPT is designed for generative tasks, BERT and its variants specialize in masked word prediction, where the model predicts masked or hidden words in a given sentence as illustrated in Figure 1.5. This unique training strategy equips BERT with strengths in text classification tasks, including sentiment prediction and document categorization. As an application of its capabilities, as of this writing, Twitter uses BERT to detect toxic content.
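
The difference between the two objectives is easy to show with a made-up sentence (the [MASK] placeholder follows BERT's convention):

```python
# BERT-style masked word prediction: fill in a hidden token anywhere in the sentence.
masked_input = "This is an [MASK] of how masked word prediction works."
expected_fill = "example"

# GPT-style next-word prediction: only ever continue the text from left to right.
prompt = "This is an"
expected_next_word = "example"
```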

Figure 1.5 A visual representation of the transformer’s encoder and decoder submodules. On the left, the encoder segment exemplifies BERT-like LLMs, which focus on masked word prediction and are primarily used for tasks like text classification. On the right, the decoder segment showcases GPT-like LLMs, designed for generative tasks and producing coherent text sequences.

GPT, on the other hand, focuses on the decoder portion of the original transformer architecture and is designed for tasks that require generating texts. This includes machine translation, text summarization, fiction writing, writing computer code, and more. We will discuss the GPT architecture in more detail in the remaining sections of this chapter and implement it from scratch in this book.

GPT models, primarily designed and trained to perform text completion tasks, also show remarkable versatility in their capabilities. These models are adept at executing both zero-shot and few-shot learning tasks. Zero-shot learning refers to the ability to generalize to completely unseen tasks without any prior specific examples. On the other hand, few-shot learning involves learning from a minimal number of examples the user provides as input, as shown in Figure 1.6.
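
The difference is easiest to see in the prompts themselves; the strings below are hypothetical zero-shot and few-shot inputs of the kind shown in Figure 1.6.

```python
zero_shot_prompt = (
    "Translate the following English sentence into German: "
    "This is an example."
)

few_shot_prompt = (
    "Translate English to German:\n"
    "cheese => Käse\n"
    "house => Haus\n"
    "example =>"
)
```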

Figure 1.6 In addition to text completion, GPT-like LLMs can solve various tasks based on their inputs without needing retraining, finetuning, or task-specific model architecture changes. Sometimes, it is helpful to provide examples of the target within the input, which is known as a few-shot setting. However, GPT-like LLMs are also capable of carrying out tasks without a specific example, which is called zero-shot setting.

1.5 Utilizing large datasets

The large training datasets for popular GPT- and BERT-like models represent diverse and comprehensive text corpora encompassing billions of words, which include a vast array of topics and natural and computer languages. To provide a concrete example, Table 1.1 summarizes the dataset used for pretraining GPT-3, which served as the base model for the first version of ChatGPT.

Table 1.1 The pretraining dataset of the popular GPT-3 LLM

Dataset name             Dataset description          Number of tokens   Proportion in training data
CommonCrawl (filtered)   Web crawl data               410 billion        60%
WebText2                 Web crawl data               19 billion         22%
Books1                   Internet-based book corpus   12 billion         8%
Books2                   Internet-based book corpus   55 billion         8%
Wikipedia                High-quality text            3 billion          3%

Table 1.1 reports the number of tokens, where a token is a unit of text that the model reads; the number of tokens in a dataset is roughly equivalent to the number of words and punctuation characters in the text. We will discuss tokenization, the process of converting text into tokens, in more detail in the next chapter.
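
As a brief preview of tokenization, the sketch below uses the open-source tiktoken library with its GPT-2 byte pair encoding; the library choice is an assumption for illustration here, and chapter 2 covers the tokenization approach used in this book in detail.

```python
import tiktoken  # pip install tiktoken

tokenizer = tiktoken.get_encoding("gpt2")  # byte pair encoding used by GPT-2-style models

text = "Large language models read text as tokens."
token_ids = tokenizer.encode(text)

print(token_ids)                    # a short list of integer IDs, roughly one per word or punctuation mark
print(tokenizer.decode(token_ids))  # reconstructs the original text
```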

The main takeaway is that the scale and diversity of this training dataset allows these models to perform well on diverse tasks including language syntax, semantics, and context, and even some requiring general knowledge.

GPT-3 DATASET DETAILS

Table 1.1 displays the dataset used for GPT-3. The proportions column in the table sums up to 100% of the sampled data, adjusted for rounding errors. Although the subsets in the “Number of Tokens” column total 509 billion, the model was trained on only 300 billion tokens. The authors of the GPT-3 paper did not specify why the model was not trained on all 509 billion tokens.

For context, consider the size of the CommonCrawl dataset, which alone consists of 410 billion tokens and requires about 570 GB of storage. In comparison, later iterations of models like GPT-3, such as Meta’s LLaMA, have expanded their training scope to include additional data sources like Arxiv research papers (92 GB) and StackExchange’s code-related Q&As (78 GB).

The authors of the GPT-3 paper did not share the training dataset, but a comparable, publicly available dataset is Dolma: An Open Corpus of Three Trillion Tokens for LLM Pretraining Research by Soldaini et al. (2024), https://arxiv.org/abs/2402.00159. However, the collection may contain copyrighted works, and the exact usage terms may depend on the intended use case and country.

The pretrained nature of these models makes them incredibly versatile for further finetuning on downstream tasks, which is why they are also known as base or foundation models. Pretraining LLMs requires access to significant resources and is very expensive. For example, the GPT-3 pretraining cost is estimated to be $4.6 million in terms of cloud computing credits.

The good news is that many pretrained LLMs, available as open-source models, can be used as general purpose tools to write, extract, and edit texts that were not part of the training data. Also, LLMs can be finetuned on specific tasks with relatively smaller datasets, reducing the computational resources needed and improving performance on the specific task.

In this book, we will implement the code for pretraining and use it to pretrain an LLM for educational purposes. All computations will be executable on consumer hardware. After implementing the pretraining code we will learn how to reuse openly available model weights and load them into the architecture we will implement, allowing us to skip the expensive pretraining stage when we finetune LLMs later in this book.

1.6 A closer look at the GPT architecture

Previously in this chapter, we mentioned the terms GPT-like models, GPT-3, and ChatGPT. Let’s now take a closer look at the general GPT architecture. First, GPT stands for Generative Pretrained Transformer and was originally introduced in the following paper:

  • Improving Language Understanding by Generative Pre-Training (2018) by Radford et al. from OpenAI, http://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

GPT-3 is a scaled-up version of this model that has more parameters and was trained on a larger dataset. And the original model offered in ChatGPT was created by finetuning GPT-3 on a large instruction dataset using a method from OpenAI’s InstructGPT paper, which we will cover in more detail in chapter 7, Finetuning with Human Feedback to Follow Instructions. As we have seen earlier in Figure 1.6, these models are competent text completion models and can carry out other tasks such as spelling correction, classification, or language translation. This is actually very remarkable given that GPT models are pretrained on a relatively simple next-word prediction task, as illustrated in Figure 1.7.

Figure 1.7 In the next-word pretraining task for GPT models, the system learns to predict the upcoming word in a sentence by looking at the words that have come before it. This approach helps the model understand how words and phrases typically fit together in language, forming a foundation that can be applied to various other tasks.

The next-word prediction task is a form of self-supervised learning, which is a form of self-labeling. This means that we don’t need to collect labels for the training data explicitly but can leverage the structure of the data itself: we can use the next word in a sentence or document as the label that the model is supposed to predict. Since this next-word prediction task allows us to create labels “on the fly,” it is possible to leverage massive unlabeled text datasets to train LLMs as previously discussed in section 1.5, Utilizing large datasets.
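
A minimal sketch of how such labels arise "on the fly" from raw text follows; a simple whitespace split stands in for real tokenization here.

```python
text = "LLMs learn to predict one word at a time"
tokens = text.split()  # placeholder for real tokenization (see chapter 2)

# Each training example pairs a context with the next word as its label.
for i in range(1, len(tokens)):
    context, label = tokens[:i], tokens[i]
    print(context, "-->", label)
# ['LLMs'] --> learn
# ['LLMs', 'learn'] --> to
# ... and so on, with no manually collected labels required
```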

Compared to the original transformer architecture we covered in section 1.4, the general GPT architecture is relatively simple. Essentially, it’s just the decoder part without the encoder as illustrated in Figure 1.8. Since decoder-style models like GPT generate text by predicting text one word at a time, they are considered a type of autoregressive model. Autoregressive models incorporate their previous outputs as inputs for future predictions. Consequently, in GPT, each new word is chosen based on the sequence that precedes it, which improves coherence of the resulting text.
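
The autoregressive loop itself is short. The sketch below assumes a hypothetical model object that returns next-token probabilities and a hypothetical tokenizer; it uses greedy selection purely for illustration, and the book's actual generation code appears in later chapters.

```python
def generate(model, tokenizer, prompt, max_new_tokens=10):
    """Pseudocode-style sketch of one-word-at-a-time (autoregressive) generation."""
    token_ids = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        probs = model(token_ids)  # probability of each vocabulary token coming next
        next_id = max(range(len(probs)), key=probs.__getitem__)  # greedy: pick the most likely token
        token_ids.append(next_id)  # the model's own output becomes part of the next input
    return tokenizer.decode(token_ids)
```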

Architectures such as GPT-3 are also significantly larger than the original transformer model. For instance, the original transformer repeated the encoder and decoder blocks six times. GPT-3 has 96 transformer layers and 175 billion parameters in total.

Figure 1.8 The GPT architecture employs only the decoder portion of the original transformer. It is designed for unidirectional, left-to-right processing, making it well-suited for text generation and next-word prediction tasks to generate text in iterative fashion one word at a time.

GPT-3 was introduced in 2020, which, by the standards of deep learning and large language model (LLM) development, is considered a long time ago. However, more recent architectures, such as Meta’s LLaMA models, are still based on the same underlying concepts, introducing only minor modifications. Hence, understanding GPT remains as relevant as ever, and this book focuses on implementing the prominent architecture behind GPT while providing pointers to specific tweaks employed by alternative LLMs.

Lastly, it’s interesting to note that although the original transformer model, consisting of encoder and decoder blocks, was explicitly designed for language translation, GPT models—despite their larger yet simpler decoder-only architecture aimed at next-word prediction—are also capable of performing translation tasks. This capability was initially unexpected to researchers, as it emerged from a model primarily trained on a next-word prediction task, which is a task that did not specifically target translation.

The ability to perform tasks that the model wasn’t explicitly trained to perform is called an “emergent behavior.” This capability isn’t explicitly taught during training but emerges as a natural consequence of the model’s exposure to vast quantities of multilingual data in diverse contexts. The fact that GPT models can “learn” the translation patterns between languages and perform translation tasks even though they weren’t specifically trained for it demonstrates the benefits and capabilities of these large-scale, generative language models. We can perform diverse tasks without using diverse models for each.

1.7 Building a large language model

In this chapter, we laid the groundwork for understanding LLMs. In the remainder of this book, we will be coding one from scratch. We will take the fundamental idea behind GPT as a blueprint and tackle this in three stages, as outlined in Figure 1.9.

Figure 1.9 The stages of building LLMs covered in this book include implementing the LLM architecture and data preparation process, pretraining an LLM to create a foundation model, and finetuning the foundation model to become a personal assistant or text classifier.

First, we will learn about the fundamental data preprocessing steps and code the attention mechanism that is at the heart of every LLM.

Next, in stage 2, we will learn how to code and pretrain a GPT-like LLM capable of generating new texts. And we will also go over the fundamentals of evaluating LLMs, which is essential for developing capable NLP systems.

Note that pretraining an LLM from scratch is a significant endeavor, demanding thousands to millions of dollars in computing costs for GPT-like models. Therefore, the focus of stage 2 is on implementing training for educational purposes using a small dataset. In addition, the book will also provide code examples for loading openly available model weights.

Finally, in stage 3, we will take a pretrained LLM and finetune it to follow instructions such as answering queries or classifying texts – the most common tasks in many real-world applications and research.

I hope you are looking forward to embarking on this exciting journey!

1.8 Summary

LLMs have transformed the field of natural language processing, which previously mostly relied on explicit rule-based systems and simpler statistical methods. The advent of LLMs introduced new deep learning-driven approaches that led to advancements in understanding, generating, and translating human language.

Modern LLMs are trained in two main steps.

  • First, they are pretrained on a large corpus of unlabeled text by using the prediction of the next word in a sentence as a “label.”
  • Then, they are finetuned on a smaller, labeled target dataset to follow instructions or perform classification tasks.

LLMs are based on the transformer architecture. The key idea of the transformer architecture is an attention mechanism that gives the LLM selective access to the whole input sequence when generating the output text.

The original transformer architecture consists of an encoder for parsing text and a decoder for generating text.

LLMs for generating text and following instructions, such as GPT-3 and ChatGPT, only implement decoder modules, simplifying the architecture.

Large datasets consisting of billions of words are essential for pretraining LLMs. In this book, we will implement and train LLMs on small datasets for educational purposes but also see how we can load openly available model weights.

While the general pretraining task for GPT-like models is to predict the next word in a sentence, these LLMs exhibit “emergent” properties such as capabilities to classify, translate, or summarize texts.

Once an LLM is pretrained, the resulting foundation model can be finetuned more efficiently for various downstream tasks.

LLMs finetuned on custom datasets can outperform general LLMs on specific tasks.

Readers with a background in machine learning may note that labeling information is typically required for traditional machine learning models and deep neural networks trained via the conventional supervised learning paradigm. However, this is not the case for the pretraining stage of LLMs. In this phase, LLMs leverage self-supervised learning, where the model generates its own labels from the input data. This concept is covered later in this chapter.

GPT-3, The $4,600,000 Language Model, https://www.reddit.com/r/MachineLearning/comments/h0jwoz/d_gpt3_the_4600000_language_model/
