Scaling laws and compute-optimal models
In the last video, you explored some of the computational challenges of training large language models. Here you'll learn about research that has explored the relationship between model size, training configuration, and performance, in an effort to determine just how big models need to be. Remember, the goal during pre-training is to maximize the model's performance on its learning objective, which is minimizing the loss when predicting tokens. Two options you have to achieve better performance are increasing the size of the dataset you train your model on and increasing the number of parameters in your model. In theory, you could scale either or both of these quantities to improve performance. However, another issue to take into consideration is your compute budget, which includes factors like the number of GPUs you have access to and the time you have available for training models.
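To make that learning objective concrete, here is a minimal sketch (not from the video) of the next-token cross-entropy loss that pre-training minimizes. The shapes and values are made up for illustration:

```python
import torch
import torch.nn.functional as F

# Toy illustration of the pre-training objective: predict the next token,
# scored with cross-entropy loss. Sizes here are hypothetical.
vocab_size, seq_len = 50_000, 8

logits = torch.randn(seq_len, vocab_size)           # model outputs, one row per position
targets = torch.randint(0, vocab_size, (seq_len,))  # the actual next tokens

loss = F.cross_entropy(logits, targets)  # lower is better; training minimizes this
print(f"per-token loss: {loss.item():.3f}")  # ~ln(50000) ≈ 10.8 for a random model
```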

To help you understand some of the discussion ahead, let's first define a unit of compute that quantifies the required resources. A petaFLOP per second day is a measurement of the number of floating point operations performed at a rate of one petaFLOP per second, running for an entire day. Note, one petaFLOP corresponds to one quadrillion floating point operations. When specifically thinking about training transformers, one petaFLOP per second day is approximately equivalent to eight NVIDIA V100 GPUs operating at full efficiency for one full day. If you have a more powerful processor that can carry out more operations at once, then a petaFLOP per second day requires fewer chips. For example, two NVIDIA A100 GPUs give equivalent compute to the eight V100 chips.
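As a sanity check of this unit, here is a small sketch of the conversion. The per-GPU throughput numbers are assumptions (peak tensor-core throughput varies with precision and utilization), chosen to be consistent with the rough equivalences above:

```python
PFLOP_PER_SEC = 1e15  # 10**15 floating point operations per second

def petaflop_s_days(num_gpus: int, flop_per_sec_per_gpu: float, days: float = 1.0) -> float:
    """Total compute delivered by a cluster, in petaFLOP/s-days."""
    return num_gpus * flop_per_sec_per_gpu / PFLOP_PER_SEC * days

# Assumed sustained throughputs: ~125 teraFLOP/s per V100, ~500 teraFLOP/s per A100.
print(petaflop_s_days(8, 125e12))  # eight V100s for one day -> 1.0 petaFLOP/s-day
print(petaflop_s_days(2, 500e12))  # two A100s for one day   -> 1.0 petaFLOP/s-day
```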


To give you an idea of the scale of these compute budgets, this chart shows a comparison of the petaFLOP per second days required to pre-train different variants of BERT and RoBERTa, which are both encoder-only models; T5, an encoder-decoder model; and GPT-3, which is a decoder-only model. The difference between the models in each family is the number of parameters that were trained, ranging from a few hundred million for the smallest BERT variant to 175 billion for the largest version of GPT-3.
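For context on where numbers of this magnitude come from, a common back-of-the-envelope approximation (not given in the video) estimates total training compute as roughly 6 FLOPs per parameter per training token, i.e. C ≈ 6ND. A sketch using GPT-3's commonly cited figures, treated here as assumptions:

```python
SECONDS_PER_DAY = 24 * 60 * 60

def training_petaflop_s_days(n_params: float, n_tokens: float) -> float:
    """Estimate training compute via the C ≈ 6*N*D rule, in petaFLOP/s-days."""
    total_flops = 6 * n_params * n_tokens
    return total_flops / (1e15 * SECONDS_PER_DAY)

# GPT-3: ~175 billion parameters, commonly reported ~300 billion training tokens.
print(round(training_petaflop_s_days(175e9, 300e9)))  # -> ~3646 petaFLOP/s-days
```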

Research has revealed the relationship between model size, dataset size, and compute budget, showing that bigger is not always better; instead, there is a compute-optimal model size. The Chinchilla model demonstrates that, with limited resources, a moderately sized model trained on a large dataset can achieve strong performance.
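A widely cited rule of thumb distilled from the Chinchilla results is roughly 20 training tokens per model parameter. Here is a minimal sketch of that heuristic (the exact compute-optimal ratio comes from fitted scaling laws and is only approximately 20):

```python
TOKENS_PER_PARAM = 20  # rough Chinchilla-style heuristic, not an exact law

def compute_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal number of training tokens for a model size."""
    return TOKENS_PER_PARAM * n_params

# Chinchilla itself: ~70 billion parameters trained on ~1.4 trillion tokens.
print(f"{compute_optimal_tokens(70e9):.1e} tokens")  # -> 1.4e+12 tokens
```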