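[LLM Primer] Let's reproduce GPT-2 (124M) | Section 3: Training and evaluating your large model [LLM training techniques: hyperparameters, learning rate, batch size, gradient accumulation, DDP, evaluation]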

Latest recommended article published on 2024-08-18 14:11:52

Author: 接深度学习联系丝信

Category column: [LLM Primer]; tags: gpt, learning

Article link: https://blog.csdn.net/weixin_43154149/article/details/141283280

Source video: https://www.youtube.com/watch?v=l8pRSuU81PU&t=8098s

Introduction

468,654 views, June 10, 2024
We reproduce the GPT-2 (124M) from scratch. This video covers the whole process: First we build the GPT-2 network, then we optimize its training to be really fast, then we set up the training run following the GPT-2 and GPT-3 paper and their hyperparameters, then we hit run, and come back the next morning to see our results, and enjoy some amusing model generations. Keep in mind that in some places this video builds on the knowledge from earlier videos in the Zero to Hero Playlist (see my channel). You could also see this video as building my nanoGPT repo, which by the end is about 90% similar.

02:14:55 SECTION 3: hyperparameters, AdamW, gradient clipping
02:21:06 learning rate scheduler: warmup + cosine decay
02:26:21 batch size schedule, weight decay, FusedAdamW, 90ms
02:34:09 gradient accumulation
02:46:52 distributed data parallel (DDP)
03:10:21 datasets used in GPT-2, GPT-3, FineWeb (EDU)
03:23:10 validation data split, validation loss, sampling revive
03:28:23 evaluation: HellaSwag, starting the run

02:14:55 SECTION 3: hyperparameters, AdamW, gradient clipping

gradient clipping
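As far as I recall from the video, the global gradient norm is clipped at 1.0, which in PyTorch is one call to `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)`. The plain-Python sketch below only illustrates what that kind of global-norm clipping computes. The helper name `clip_grad_norm` and the list-of-lists gradient layout are assumptions made for illustration, not the video's code.

```python
import math

def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients in place so their combined L2 norm is at most max_norm.

    grads: list of lists of floats, one inner list per parameter tensor
           (hypothetical layout, chosen to avoid a PyTorch dependency).
    Returns the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = max_norm / (total_norm + eps)
    if scale < 1.0:  # only shrink, never enlarge
        for tensor in grads:
            for i in range(len(tensor)):
                tensor[i] *= scale
    return total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]  # global norm is 5.0
norm_before = clip_grad_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for t in grads for g in t))
print(f"norm before: {norm_before:.4f}, after: {norm_after:.4f}")
```

Logging the returned pre-clip norm at every step is a cheap way to spot instabilities: a sudden spike usually means a bad batch or a too-high learning rate.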

02:21:06 learning rate scheduler: warmup + cosine decay

A hand-written learning-rate schedule function
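Since the screenshot of the schedule is missing here, the following is a minimal sketch of a linear warmup followed by cosine decay down to a floor. The constants (`max_lr = 6e-4`, a floor of 10% of it, 10 warmup steps, 50 total steps) are what I remember from the short test run in the video and should be treated as assumptions.

```python
import math

max_lr = 6e-4          # peak learning rate (assumed value)
min_lr = max_lr * 0.1  # floor after decay (assumed value)
warmup_steps = 10      # assumed, short debugging run
max_steps = 50         # assumed, short debugging run

def get_lr(it):
    """Learning rate for optimizer step `it` (0-indexed)."""
    # 1) linear warmup; (it + 1) avoids a learning rate of exactly zero at step 0
    if it < warmup_steps:
        return max_lr * (it + 1) / warmup_steps
    # 2) past the end of the schedule, stay at the floor
    if it > max_steps:
        return min_lr
    # 3) in between, cosine decay from max_lr down to min_lr
    decay_ratio = (it - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)

for step in (0, 9, 10, 30, 50, 60):
    print(step, f"{get_lr(step):.2e}")
```

In a training loop the returned value would be written into every optimizer parameter group before calling `optimizer.step()`.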

02:26:21 batch size schedule, weight decay, FusedAdamW, 90ms
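For the weight decay part, my understanding of the video is that decay is applied only to tensors with two or more dimensions (matmul weights and embeddings), while biases and LayerNorm parameters are exempt. The sketch below shows that split on parameter shapes alone. The parameter names, the helper `split_decay_groups`, and the decay value 0.1 are assumptions for illustration.

```python
def split_decay_groups(param_shapes, weight_decay=0.1):
    """Split parameters into a decayed and a non-decayed group by dimensionality.

    param_shapes: dict mapping a parameter name to its shape tuple.
    Returns two optimizer-style group dicts (names stand in for tensors here).
    """
    decay = [name for name, shape in param_shapes.items() if len(shape) >= 2]
    no_decay = [name for name, shape in param_shapes.items() if len(shape) < 2]
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# Hypothetical parameter names with GPT-2 (124M)-like shapes.
param_shapes = {
    "wte.weight": (50257, 768),
    "attn.c_attn.weight": (2304, 768),
    "attn.c_attn.bias": (2304,),
    "ln_1.weight": (768,),
    "ln_1.bias": (768,),
}
groups = split_decay_groups(param_shapes)
for g in groups:
    print(g["weight_decay"], g["params"])
```

With real tensors in place of names, groups of this form can be handed to `torch.optim.AdamW`, which in recent PyTorch versions also accepts a `fused` flag for the faster fused kernel on CUDA.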

Listening to Andrej Karpathy explain the optimization process gets a little philosophical!

02:34:09 gradient accumulation [simulating a larger batch]
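Two things matter for gradient accumulation: how many micro-steps are needed to reach the target batch, and the fact that each micro-batch loss must be divided by that number so the accumulated gradient matches the full-batch mean. The sketch below checks both on a one-parameter toy model. The batch numbers (524288 tokens, B = 16, T = 1024) are what I recall from the video and are assumptions; the toy loss is mine.

```python
total_batch_size = 524288  # 2**19 tokens per optimizer step (assumed value)
B, T = 16, 1024            # micro-batch: sequences x tokens per sequence (assumed)
assert total_batch_size % (B * T) == 0
grad_accum_steps = total_batch_size // (B * T)
print("grad_accum_steps:", grad_accum_steps)

def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)**2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Toy data: the full batch has 8 examples, processed as 4 micro-batches of 2.
w = 0.5
xs = [float(i) for i in range(1, 9)]
ys = [2.0 * x for x in xs]
full_grad = mse_grad(w, xs, ys)

micro_size = 2
steps = len(xs) // micro_size
accum = 0.0
for s in range(steps):
    lo, hi = s * micro_size, (s + 1) * micro_size
    # Dividing by `steps` plays the role of `loss = loss / grad_accum_steps`;
    # without it the accumulated gradient would be `steps` times too large.
    accum += mse_grad(w, xs[lo:hi], ys[lo:hi]) / steps

print(full_grad, accum)
```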

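02:46:52 distributed data parallel (DDP) [multi-GPU training: picture 8 processes, one per GPU, running the same code at the same time and differing only in ddp_rank. You have to think through the code logic carefully, because there are 8 copies of it!]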

Initialization: keep in mind that all 8 processes run this same block of code at the same time!
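A launcher such as `torchrun` starts one process per GPU and passes each one its identity through the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables. The sketch below shows how a script can detect whether it is running under DDP and fall back to a single process otherwise. The helper `read_ddp_config` and its returned dict are assumptions for illustration; a real run would additionally call `torch.distributed.init_process_group` and select the CUDA device.

```python
import os

def read_ddp_config(env=os.environ):
    """Work out this process's role from the launcher's environment variables."""
    ddp = int(env.get("RANK", -1)) != -1  # RANK is only set under a DDP launcher
    if ddp:
        rank = int(env["RANK"])              # global index of this process
        local_rank = int(env["LOCAL_RANK"])  # GPU index on this machine
        world_size = int(env["WORLD_SIZE"])  # total number of processes
    else:
        rank, local_rank, world_size = 0, 0, 1
    return {
        "ddp": ddp,
        "ddp_rank": rank,
        "ddp_local_rank": local_rank,
        "ddp_world_size": world_size,
        "device": f"cuda:{local_rank}" if ddp else "cpu",
        "master_process": rank == 0,  # only this one logs and saves checkpoints
    }

# Simulate what process 3 of 8 would see, then a plain single-process run.
cfg_ddp = read_ddp_config({"RANK": "3", "LOCAL_RANK": "3", "WORLD_SIZE": "8"})
cfg_single = read_ddp_config({})
print(cfg_ddp)
print(cfg_single)
```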

The only thing that differs between the 8 processes is ddp_rank.

Debug output

Modifying the dataloader

Remember: each process only knows its ddp_rank and ddp_world_size, and the process with ddp_rank == 0 is the master process!
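The dataloader change boils down to giving every process a different starting offset and a stride that skips over the chunks owned by the other processes. Here is a minimal list-based sketch of that idea. The class name `ShardedLoader` and the use of Python lists instead of tensors are assumptions for illustration.

```python
class ShardedLoader:
    """Each rank reads its own B*T chunk, then jumps ahead by B*T*world_size."""

    def __init__(self, tokens, B, T, rank, world_size):
        self.tokens, self.B, self.T = tokens, B, T
        self.rank, self.world_size = rank, world_size
        self.pos = B * T * rank  # ranks start at staggered offsets

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.pos : self.pos + B * T + 1]  # +1 for the shifted targets
        x = [buf[i * T : (i + 1) * T] for i in range(B)]          # inputs
        y = [buf[i * T + 1 : (i + 1) * T + 1] for i in range(B)]  # next-token targets
        self.pos += B * T * self.world_size  # skip the other ranks' chunks
        # Wrap around if the next read would run past the end of the data.
        if self.pos + (B * T * self.world_size + 1) > len(self.tokens):
            self.pos = B * T * self.rank
        return x, y

tokens = list(range(100))
loader0 = ShardedLoader(tokens, B=2, T=3, rank=0, world_size=2)
loader1 = ShardedLoader(tokens, B=2, T=3, rank=1, world_size=2)
x0, y0 = loader0.next_batch()
x1, y1 = loader1.next_batch()
x0_second, _ = loader0.next_batch()
print(x0, x1, x0_second)
```

Because the offsets never overlap within a step, the processes jointly consume `B * T * world_size` distinct tokens per step without any communication.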

Continuing through the rest of the code!

With every GPU put to work under DDP, training is fast: 1.5 million tokens/sec.
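To see where a figure like that comes from, throughput can be computed from the tokens processed per optimizer step and the wall-clock time of the step. In the sketch below, B = 64, T = 1024, and 8 GPUs are what I recall from the video, while the 0.35 s step time is a hypothetical number picked only to land near the 1.5M figure; both helper functions are assumptions for illustration.

```python
def grad_accum_steps_for(total_batch_size, B, T, world_size):
    """Micro-steps per optimizer step once the batch is split across processes."""
    tokens_per_pass = B * T * world_size
    assert total_batch_size % tokens_per_pass == 0
    return total_batch_size // tokens_per_pass

def tokens_per_second(B, T, grad_accum_steps, world_size, step_seconds):
    """Tokens processed per second across all processes."""
    return B * T * grad_accum_steps * world_size / step_seconds

steps = grad_accum_steps_for(524288, B=64, T=1024, world_size=8)
tps = tokens_per_second(64, 1024, steps, 8, step_seconds=0.35)  # hypothetical timing
print(steps, f"{tps:,.0f} tokens/sec")
```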

03:10:21 datasets used in GPT-2, GPT-3, FineWeb (EDU)

Saving the data in shards
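The sharding idea is to stream tokenized documents into fixed-size chunks so that no single file has to hold the whole dataset, with the first shard set aside for validation. The generator below sketches that on tiny lists. The function name `shard_tokens` and the tiny shard size are assumptions for illustration; as far as I recall, the video uses shards of about 100M tokens saved as NumPy arrays.

```python
def shard_tokens(docs, shard_size):
    """Yield (split, shard_index, tokens) tuples.

    Every shard holds exactly shard_size tokens except possibly the last one.
    Shard 0 is labelled "val", all later shards "train".
    """
    buf = []
    idx = 0
    for doc in docs:
        buf.extend(doc)
        while len(buf) >= shard_size:
            split = "val" if idx == 0 else "train"
            yield split, idx, buf[:shard_size]
            buf = buf[shard_size:]  # leftover tokens spill into the next shard
            idx += 1
    if buf:  # final, partially filled shard
        split = "val" if idx == 0 else "train"
        yield split, idx, buf

docs = [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]  # three tokenized "documents"
shards = list(shard_tokens(docs, shard_size=4))
for split, idx, toks in shards:
    print(split, idx, toks)
```

Note that a document can straddle a shard boundary, which is harmless for language-model pretraining because the loader treats each shard as one long token stream.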

03:23:10 validation data split, validation loss, sampling revive

Sampling: model output
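Sampling in the video is, as far as I recall, top-k sampling with k = 50 over the logits of the last position. The sketch below shows the mechanism on a five-entry logit list with the standard library only. The function name `sample_top_k` and the small k are assumptions for illustration.

```python
import math
import random

def sample_top_k(logits, k, rng):
    """Sample one token index from the k highest-scoring logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # softmax numerators, shifted for stability
    # Indices of the k largest logits; everything else gets probability zero.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = [exps[i] for i in top]  # rng.choices normalizes the weights itself
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(42)  # seeded generator so runs are repeatable
logits = [2.0, 0.5, -1.0, 3.0, 0.0]
samples = [sample_top_k(logits, k=2, rng=rng) for _ in range(200)]
greedy = sample_top_k(logits, k=1, rng=rng)
print(sorted(set(samples)), greedy)
```

Cutting the distribution to the top k keeps the model from wandering onto very unlikely tokens, which is what makes short samples from a small model readable.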
03:28:23 evaluation: HellaSwag, starting the run [a new evaluation method]
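HellaSwag is a multiple-choice task: one context and four candidate endings. A small model cannot be asked for "A/B/C/D" directly, so the approach I recall from the video is to score each ending by the model's loss on the ending tokens only and pick the lowest. The sketch below works on per-token log-probabilities that are made up for illustration; the function names are assumptions too.

```python
def pick_by_sum_loss(ending_logprobs):
    """Choose the ending with the lowest total negative log-likelihood."""
    losses = [-sum(lp) for lp in ending_logprobs]
    return min(range(len(losses)), key=losses.__getitem__)

def pick_by_avg_loss(ending_logprobs):
    """Choose the ending with the lowest per-token average loss (length-normalized)."""
    losses = [-sum(lp) / len(lp) for lp in ending_logprobs]
    return min(range(len(losses)), key=losses.__getitem__)

# Made-up log-probabilities of each ending's tokens given the shared context.
endings = [
    [-0.9],               # short ending: sum loss 0.9, average 0.9
    [-0.5, -0.5, -0.5],   # longer ending: sum loss 1.5, average 0.5
    [-2.0, -2.0],
    [-1.5, -1.5, -1.5, -1.5],
]
by_sum = pick_by_sum_loss(endings)
by_avg = pick_by_avg_loss(endings)
print(by_sum, by_avg)
```

The two rules disagree here on purpose: summing favours short endings, while averaging removes that length bias, which is why length-normalized accuracy is usually reported alongside the raw one.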
