Scaling laws and compute-optimal models
In the last video, you explored some of the computational challenges of training large language models. Here you'll learn about research that has explored the relationship between model size, training configuration, and performance in an effort to determine just how big models need to be. Remember, the goal during pre-training is to maximize the model's performance on its learning objective, which is minimizing the loss when predicting tokens. Two options you have to achieve better performance are increasing the size of the dataset you train your model on and increasing the number of parameters in your model. In theory, you could scale either or both of these quantities to improve performance. However, another issue to take into consideration is your compute budget, which includes factors like the number of GPUs you have access to and the time you have available for training models.
To help you understand some of the discussion ahead, let's first define a unit of compute that quantifies the required resources. A petaFLOP per second day is a measurement of the number of floating point operations performed at a rate of one petaFLOP per second, running for an entire day. Note, one petaFLOP per second corresponds to one quadrillion floating point operations per second. When specifically thinking about training transformers, one petaFLOP per second day is approximately equivalent to eight NVIDIA V100 GPUs operating at full efficiency for one full day. If you have a more powerful processor that can carry out more operations at once, then a petaFLOP per second day requires fewer chips. For example, two NVIDIA A100 GPUs give equivalent compute to the eight V100 chips.
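To make the unit concrete, here is the arithmetic behind the definition, written out as a quick sanity check (the only extra fact used is the number of seconds in a day):

$$
1\ \text{petaFLOP/s-day} = 10^{15}\ \tfrac{\text{FLOP}}{\text{s}} \times 86{,}400\ \text{s} \approx 8.64 \times 10^{19}\ \text{floating point operations}
$$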
To give you an idea of the scale of these compute budgets, this chart shows a comparison of the petaFLOP per second days required to pre-train different variants of BERT and RoBERTa, which are both encoder-only models; T5, an encoder-decoder model; and GPT-3, which is a decoder-only model. The difference between the models in each family is the number of parameters that were trained, ranging from a few hundred million for BERT base to 175 billion for the largest GPT-3 variant. Note that the y-axis is logarithmic: each increment vertically is a power of 10. Here we see that T5 XL with three billion parameters required close to 100 petaFLOP per second days, while the larger GPT-3 175 billion parameter model required approximately 3,700 petaFLOP per second days. This chart makes it clear that a huge amount of compute is required to train the largest models. You can see that bigger models take more compute resources to train and generally also require more data to achieve good performance. It turns out that there are actually well-defined relationships between these three scaling choices.
Researchers have explored the trade-offs between training dataset size, model size and compute budget. Here's a figure from a paper by researchers at OpenAI that explores the impact of compute budget on model performance. The y-axis is the test loss, which you can consider as a proxy for model performance, where smaller values are better. The x-axis is the compute budget in units of petaFLOP per second days. As you just saw, larger numbers can be achieved by using more compute power, training for longer, or both. Each thin blue line here shows the model loss over a single training run. Looking at where the loss starts to decline more slowly for each run reveals a clear relationship between the compute budget and the model's performance. This can be approximated by a power-law relationship, shown by this pink line. A power law is a mathematical relationship between two variables, where one is proportional to the other raised to some power. When plotted on a graph where both axes are logarithmic, power-law relationships appear as straight lines. The relationship here holds as long as model size and training dataset size don't inhibit the training process. Taken at face value, this would suggest that you can just increase your compute budget to achieve better model performance. In practice, however, the compute resources you have available for training will generally be a hard constraint set by factors such as the hardware you have access to, the time available for training and the financial budget of the project. If you hold your compute budget fixed, the two levers you have to improve your model's performance are the size of the training dataset and the number of parameters in your model.
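As a rough sketch of what such a power law looks like, one common way to write the fitted relationship between compute C and test loss L is the following (C_c and α_C are fitted constants, shown here only to illustrate the general form, not values from the figure):

$$
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
\quad\Longrightarrow\quad
\log L \approx \alpha_C\left(\log C_c - \log C\right)
$$

Taking logs of both sides turns the power law into a linear relationship, which is why the fit appears as a straight line on log-log axes, with a slope set by the exponent α_C.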
The OpenAI researchers found that these two quantities also show a power-law relationship with the test loss in the case where the other two variables are held fixed. This is another figure from the paper, exploring the impact of training dataset size on model performance. Here, the compute budget and model size are held fixed and the size of the training dataset is varied. The graph shows that as the volume of training data increases, the performance of the model continues to improve. In the second graph, the compute budget and training dataset size are held constant and models of varying numbers of parameters are trained. As the model increases in size, the test loss decreases, indicating better performance.
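If you wanted to check a power-law fit like this yourself, a minimal sketch in Python looks like the following; the dataset sizes and losses are made-up placeholder numbers standing in for points read off a scaling curve, not values from the paper:

```python
import numpy as np

# Hypothetical (dataset size, test loss) pairs, purely illustrative.
dataset_sizes = np.array([1e8, 1e9, 1e10, 1e11])  # training tokens
test_losses = np.array([4.2, 3.4, 2.8, 2.3])      # test loss

# A power law L = a * D^(-b) becomes a straight line after taking logs,
# so a degree-1 polynomial fit in log-log space recovers the exponent.
slope, intercept = np.polyfit(np.log(dataset_sizes), np.log(test_losses), 1)
a, b = np.exp(intercept), -slope
print(f"fitted power law: L ≈ {a:.2f} * D^(-{b:.3f})")
```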
At this point you might be asking, what's the ideal balance between these three quantities? Well, it turns out a lot of people are interested in this question. Both research and industry communities have published a lot of empirical data for pre-training compute-optimal models. In a paper published in 2022, a group of researchers led by Jordan Hoffmann, Sebastian Borgeaud and Arthur Mensch carried out a detailed study of the performance of language models of various sizes and quantities of training data. The goal was to find the optimal number of parameters and volume of training data for a given compute budget. The authors named the resulting compute-optimal model Chinchilla. This paper is often referred to as the Chinchilla paper.
Let's take a look at some of their findings. The Chinchilla paper hints that many of the 100 billion parameter large language models like GPT-3 may actually be over-parameterized, meaning they have more parameters than they need to achieve a good understanding of language, and undertrained, so that they would benefit from seeing more training data. The authors hypothesized that smaller models may be able to achieve the same performance as much larger ones if they are trained on larger datasets. In this table, you can see a selection of models along with their size and information about the dataset they were trained on.
One important takeaway from the Chinchilla paper is that the optimal training dataset size for a given model is about 20 times larger than the number of parameters in the model. Chinchilla was determined to be compute optimal: for a 70 billion parameter model, the ideal training dataset contains 1.4 trillion tokens, or 20 times the number of parameters. The last three models in the table were trained on datasets that are smaller than the Chinchilla-optimal size; these models may actually be undertrained. In contrast, LLaMA was trained on a dataset of 1.4 trillion tokens, which is close to the Chinchilla-recommended number. Another important result from the paper is that the compute-optimal Chinchilla model outperforms non-compute-optimal models such as GPT-3 on a large range of downstream evaluation tasks. With the results of the Chinchilla paper in hand, teams have recently started to develop smaller models that achieve similar, if not better, results than larger models that were trained in a non-optimal way. Moving forward, you can probably expect to see a deviation from the bigger-is-always-better trend of the last few years as more teams or developers like you start to optimize their model design.
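As a quick illustration of the roughly 20-tokens-per-parameter rule of thumb, here is a minimal sketch; the helper name and the fixed ratio of 20 are simplifications for illustration, not part of the paper itself:

```python
# Rule of thumb from the Chinchilla paper: compute-optimal training uses
# roughly 20 tokens for every model parameter.
TOKENS_PER_PARAMETER = 20

def chinchilla_optimal_tokens(num_parameters: float) -> float:
    """Approximate compute-optimal number of training tokens for a model."""
    return TOKENS_PER_PARAMETER * num_parameters

# A 70 billion parameter model -> about 1.4 trillion tokens,
# matching the Chinchilla training setup described above.
print(f"{chinchilla_optimal_tokens(70e9):.1e} tokens")
```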
The last model shown on this slide, BloombergGPT, is a really interesting model. It was trained in a compute-optimal way following the Chinchilla approach, and so achieves good performance with a size of 50 billion parameters. It's also an interesting example of a situation where pre-training a model from scratch was necessary to achieve good task performance. Let's move on to the last video of this week to discuss why.