Generative AI with Large Language Models Study Notes, 1.2.4 LLM Pre-training and Scaling Laws: Scaling Laws and Compute-Optimal Models

Research has revealed the relationship between model size, dataset size, and compute budget, showing that bigger is not always better and that compute-optimal models exist. The Chinchilla model demonstrates that, with limited resources, a moderately sized model trained on a large dataset can achieve strong performance.


Scaling laws and compute-optimal models

In the last video, you explored some of the computational challenges of training large language models. Here you'll learn about research that has explored the relationship between model size, training configuration, and performance in an effort to determine just how big models need to be. Remember, the goal during pre-training is to maximize the model's performance on its learning objective, which means minimizing the loss when predicting tokens. Two options you have to achieve better performance are increasing the size of the dataset you train your model on and increasing the number of parameters in your model. In theory, you could scale either or both of these quantities to improve performance. However, another issue to take into consideration is your compute budget, which includes factors like the number of GPUs you have access to and the time you have available for training models.
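
To make the learning objective concrete, here is a minimal sketch of next-token cross-entropy loss in PyTorch. The vocabulary size, sequence length, and the random logits standing in for model outputs are illustrative assumptions, not values from the lecture.

```python
import torch
import torch.nn.functional as F

# Toy dimensions, assumed for illustration only.
vocab_size, seq_len = 50_000, 8

# Stand-ins for the model's output logits and the true next tokens.
logits = torch.randn(seq_len, vocab_size)
targets = torch.randint(0, vocab_size, (seq_len,))

# Pre-training aims to minimize this quantity, averaged over the corpus.
loss = F.cross_entropy(logits, targets)
print(f"token-prediction loss: {loss.item():.3f}")
```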

To help you understand some of the discussion ahead, let's first define a unit of compute that quantifies the required resources. A petaFLOP per second day is a measurement of the number of floating point operations performed at a rate of one petaFLOP per second, running for an entire day. Note that one petaFLOP corresponds to one quadrillion floating point operations. When specifically thinking about training transformers, one petaFLOP per second day is approximately equivalent to eight NVIDIA V100 GPUs operating at full efficiency for one full day. If you have a more powerful processor that can carry out more operations at once, then a petaFLOP per second day requires fewer chips. For example, two NVIDIA A100 GPUs give compute equivalent to the eight V100 chips.
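
As a sanity check on these numbers, the sketch below works through the arithmetic: one petaFLOP per second sustained for a day, and the rough GPU counts implied by the V100 and A100 equivalences above. The per-GPU throughput figures are assumptions chosen to match those equivalences, not official benchmarks.

```python
# FLOPs contained in one petaFLOP/s-day.
PETAFLOP_PER_S = 1e15
SECONDS_PER_DAY = 24 * 60 * 60
pf_s_day_flops = PETAFLOP_PER_S * SECONDS_PER_DAY
print(f"1 petaFLOP/s-day is about {pf_s_day_flops:.2e} floating point operations")

# Assumed sustained throughput per GPU (mixed-precision training),
# picked so that 8 V100s or 2 A100s add up to one petaFLOP per second.
v100_flops_per_s = 0.125e15
a100_flops_per_s = 0.5e15
print(f"V100s for 1 petaFLOP/s: {PETAFLOP_PER_S / v100_flops_per_s:.0f}")
print(f"A100s for 1 petaFLOP/s: {PETAFLOP_PER_S / a100_flops_per_s:.0f}")
```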

To give you an idea of the scale of these compute budgets, this chart shows a comparison of the petaFLOP per second days required to pre-train different variants of BERT and RoBERTa, which are both encoder-only models; T5, an encoder-decoder model; and GPT-3, which is a decoder-only model. The difference between the models in each family is the number of parameters that were trained, ranging from a few hundred million for the smallest BERT variant up to 175 billion for the largest version of GPT-3.
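
To connect the chart to the compute unit defined above, here is a hedged back-of-the-envelope estimate for GPT-3, using the widely cited approximation that training compute is roughly 6 x (number of parameters) x (number of training tokens). The 175-billion-parameter and 300-billion-token figures are the commonly reported values for the largest GPT-3 model, not numbers read off this chart.

```python
# Rough training-compute estimate: FLOPs ~= 6 * params * tokens.
params = 175e9   # largest GPT-3 model
tokens = 300e9   # commonly reported training-token count

train_flops = 6 * params * tokens
pf_s_day = 1e15 * 24 * 60 * 60   # FLOPs in one petaFLOP/s-day

print(f"estimated budget: {train_flops / pf_s_day:,.0f} petaFLOP/s-days")
# Roughly 3,600 petaFLOP/s-days, consistent with GPT-3 sitting far
# above the BERT, RoBERTa, and T5 bars on this kind of chart.
```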
