[LLM Basics] Let's reproduce GPT-2 (124M) | Reproducing GPT-2 from scratch (Section 1) | Andrej Karpathy

Author: 接深度学习联系丝信

Published 2024-08-16 17:50:41

Article link: https://blog.csdn.net/weixin_43154149/article/details/141262200

Source: https://www.youtube.com/watch?v=l8pRSuU81PU&t=983s

Video description:

465,246 views, June 10, 2024
We reproduce the GPT-2 (124M) from scratch. This video covers the whole process: First we build the GPT-2 network, then we optimize its training to be really fast, then we set up the training run following the GPT-2 and GPT-3 paper and their hyperparameters, then we hit run, and come back the next morning to see our results, and enjoy some amusing model generations. Keep in mind that in some places this video builds on the knowledge from earlier videos in the Zero to Hero Playlist (see my channel). You could also see this video as building my nanoGPT repo, which by the end is about 90% similar.

Links:

build-nanogpt GitHub repo, with all the changes in this video as individual commits: https://github.com/karpathy/build-nan…
nanoGPT repo: https://github.com/karpathy/nanoGPT
llm.c repo: https://github.com/karpathy/llm.c
my website: https://karpathy.ai
my twitter: / karpathy
our Discord channel: / discord

Supplementary links:

Attention is All You Need paper: https://arxiv.org/abs/1706.03762
OpenAI GPT-3 paper: https://arxiv.org/abs/2005.14165
OpenAI GPT-2 paper: https://d4mucfpksywv.cloudfront.net/b…

The GPU I’m training the model on is from Lambda GPU Cloud, I think the best and easiest way to spin up an on-demand GPU instance in the cloud that you can ssh to: https://lambdalabs.com

Chapters:
00:00:00 intro: Let’s reproduce GPT-2 (124M)
00:03:39 exploring the GPT-2 (124M) OpenAI checkpoint
00:13:47 SECTION 1: implementing the GPT-2 nn.Module
00:28:08 loading the huggingface/GPT-2 parameters
00:31:00 implementing the forward pass to get logits
00:33:31 sampling init, prefix tokens, tokenization
00:37:02 sampling loop
00:41:47 sample, auto-detect the device
00:45:50 let’s train: data batches (B,T) → logits (B,T,C)
00:52:53 cross entropy loss
00:56:42 optimization loop: overfit a single batch
01:02:00 data loader lite
01:06:14 parameter sharing wte and lm_head
01:13:47 model initialization: std 0.02, residual init
01:22:18 SECTION 2: Let’s make it fast. GPUs, mixed precision, 1000ms
01:28:14 Tensor Cores, timing the code, TF32 precision, 333ms
01:39:38 float16, gradient scalers, bfloat16, 300ms
01:48:15 torch.compile, Python overhead, kernel fusion, 130ms
02:00:18 flash attention, 96ms
02:06:54 nice/ugly numbers. vocab size 50257 → 50304, 93ms
02:14:55 SECTION 3: hyperparameters, AdamW, gradient clipping
02:21:06 learning rate scheduler: warmup + cosine decay
02:26:21 batch size schedule, weight decay, FusedAdamW, 90ms
02:34:09 gradient accumulation
02:46:52 distributed data parallel (DDP)
03:10:21 datasets used in GPT-2, GPT-3, FineWeb (EDU)
03:23:10 validation data split, validation loss, sampling revive
03:28:23 evaluation: HellaSwag, starting the run
03:43:05 SECTION 4: results in the morning! GPT-2, GPT-3 repro
03:56:21 shoutout to llm.c, equivalent but faster code in raw C/CUDA
03:59:39 summary, phew, build-nanogpt github repo

Corrections:
I will post all errata and followups to the build-nanogpt GitHub repo (link above)

SuperThanks:
I experimentally enabled them on my channel yesterday. Totally optional and only use if rich. All revenue goes to supporting my work in AI + Education.

00:03:39 exploring the GPT-2 (124M) OpenAI checkpoint

OpenAI's original GPT-2 code is written in TensorFlow.

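Here we use the Hugging Face transformers library instead, which provides a PyTorch implementation. [Hugging Face is a good reference when studying model code: it covers a very large number of models!]

Visualizing the positional embeddings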

Using the model

00:13:47 SECTION 1: implementing the GPT-2 nn.Module

network

block

MLP

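The MLP in each GPT-2 block is two linear layers (768 to 3072 and back to 768) with a GELU in between, and GPT-2 uses the tanh approximation of GELU. As a small sketch in plain Python (no PyTorch, to keep it self-contained), the approximation can be compared against the exact erf-based form. The function names gelu_tanh and gelu_exact are chosen here for illustration.

```python
import math


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU, the variant GPT-2 uses in its MLP."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_exact(x: float) -> float:
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  tanh={gelu_tanh(x):+.5f}  exact={gelu_exact(x):+.5f}")
```

The two curves agree closely over the whole range, which is why the approximation is mostly a historical detail that has to be matched when loading the original weights.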

Attention
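The key property of the attention module is the causal mask: a token may only look at itself and earlier tokens. The sketch below is a deliberately simplified illustration of that idea, with a single head, plain Python lists, and no learned query/key/value projections. The video's module is multi-headed and operates on batched PyTorch tensors, so treat this as an assumption-laden toy rather than the actual implementation.

```python
import math


def causal_attention(q, k, v):
    """Single-head causal self-attention over a sequence of T vectors.

    q, k, v are lists of T vectors (lists of floats) of equal dimension.
    Position t may only attend to positions 0..t (the causal mask).
    """
    T, d = len(q), len(q[0])
    out = []
    for t in range(T):
        # Scaled dot-product scores against the current and earlier positions only.
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible value vectors.
        out.append([sum(w * v[s][j] for s, w in enumerate(weights)) for j in range(d)])
    return out


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = causal_attention(x, x, x)
print(y[0])  # the first token can only see itself
```

Because of the mask, changing a later token never changes the outputs at earlier positions, which is what makes next-token training on every position of a sequence possible in one pass.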

GPTConfig
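A minimal sketch of what such a configuration object could look like, written as a plain Python dataclass. The field names and default values (the 124M model: 12 layers, 12 heads, 768 channels, context length 1024, vocabulary size 50257) follow the video; the head_size helper is a hypothetical addition for illustration only.

```python
from dataclasses import dataclass


@dataclass
class GPTConfig:
    block_size: int = 1024   # maximum sequence length (context window)
    vocab_size: int = 50257  # 50,000 BPE merges + 256 byte tokens + 1 <|endoftext|>
    n_layer: int = 12        # number of transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # embedding (channel) dimension

    def head_size(self) -> int:
        # Hypothetical helper (not in the video): channels handled by each head.
        assert self.n_embd % self.n_head == 0, "n_embd must divide evenly across heads"
        return self.n_embd // self.n_head


config = GPTConfig()
print(config.head_size())
```

Keeping every size in one object means the larger GPT-2 variants only need different field values, not different model code.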

00:28:08 loading the huggingface/GPT-2 parameters

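One detail worth knowing when copying the Hugging Face weights into an nn.Linear-based model: the Hugging Face GPT-2 implementation stores four weight matrices per block in a Conv1D module, so they are transposed relative to nn.Linear. The sketch below only does the bookkeeping on shape tuples instead of real tensors. The helper names are hypothetical, and the suffix list reflects my understanding of the loading code shown in the video.

```python
# Suffixes of the Hugging Face GPT-2 weights that are stored transposed
# (Conv1D layout: in_features x out_features) relative to nn.Linear.
TRANSPOSED = ("attn.c_attn.weight", "attn.c_proj.weight",
              "mlp.c_fc.weight", "mlp.c_proj.weight")


def needs_transpose(key: str) -> bool:
    """Hypothetical helper: True if this state-dict key must be transposed on copy."""
    return key.endswith(TRANSPOSED)


def target_shape(key: str, hf_shape: tuple) -> tuple:
    """Shape the parameter should have in an nn.Linear-based model."""
    return hf_shape[::-1] if needs_transpose(key) else hf_shape


print(target_shape("transformer.h.0.attn.c_attn.weight", (768, 2304)))
print(target_shape("transformer.wte.weight", (50257, 768)))
```

Checking shapes key by key like this is a cheap way to catch a missed transpose before any weights are copied.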

00:31:00 implementing the forward pass to get logits

00:33:31 sampling init, prefix tokens, tokenization

00:37:02 sampling loop

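The sampling loop repeatedly takes the logits for the next position, keeps only the top k candidates (the video uses k = 50), samples one of them, and appends it to the sequence. The sketch below shows that loop in plain Python. The next_logits callback and the toy_model function are stand-ins for the real model's forward pass and are assumptions made for illustration; the video does the same steps with PyTorch tensors.

```python
import math
import random


def sample_top_k(logits, k=50, rng=random):
    """Sample one token id from the k highest-scoring logits."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to those k entries (subtract the max for stability).
    m = logits[top[0]]
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


def generate(next_logits, prefix, max_length, k=50, seed=42):
    """Append sampled tokens to prefix until it holds max_length tokens.

    next_logits(tokens) is a stand-in for the model's forward pass: it must
    return the logits for the token that follows tokens.
    """
    rng = random.Random(seed)
    tokens = list(prefix)
    while len(tokens) < max_length:
        tokens.append(sample_top_k(next_logits(tokens), k=k, rng=rng))
    return tokens


# Toy "model" over a 5-token vocabulary that strongly prefers (last token + 1) mod 5.
def toy_model(tokens):
    return [8.0 if i == (tokens[-1] + 1) % 5 else 0.0 for i in range(5)]


print(generate(toy_model, prefix=[0], max_length=10, k=2))
```

Restricting sampling to the top k keeps very unlikely tokens from ever being drawn, which helps the generated text stay on track.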

00:45:50 let’s train: data batches (B,T) → logits (B,T,C) [build the labels by shifting the inputs by one token]

Debug in Jupyter first, then move the code into a .py file!

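The label shift mentioned in the heading of this section can be sketched as follows: take B*T + 1 consecutive tokens, use all but the last as inputs, and use the same tokens shifted by one position as targets. This version uses plain Python lists and a hypothetical make_batch helper; the video does the equivalent with a tensor and two views.

```python
def make_batch(tokens, B, T):
    """Turn a flat token list into inputs x and targets y, each B rows of T tokens."""
    buf = tokens[: B * T + 1]           # one extra token so the last input has a target
    assert len(buf) == B * T + 1, "not enough tokens for one batch"
    flat_x, flat_y = buf[:-1], buf[1:]  # y is x shifted left by one position
    x = [flat_x[i * T:(i + 1) * T] for i in range(B)]
    y = [flat_y[i * T:(i + 1) * T] for i in range(B)]
    return x, y


x, y = make_batch(list(range(100, 125)), B=4, T=6)
for row_x, row_y in zip(x, y):
    print(row_x, "->", row_y)
```

Every position in x has its target at the same position in y, so one forward pass yields B*T training examples.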

00:52:53 cross entropy loss

forward also returns the loss.

Then check the loss at random initialization.

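A useful sanity check for that initial loss: if a freshly initialized model spreads its probability roughly evenly over the vocabulary, the cross-entropy should be close to -ln(1/50257), about 10.82. The sketch below computes this in plain Python with a hand-written cross_entropy function (a hypothetical helper; the video uses PyTorch's built-in cross entropy on the flattened logits).

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy (negative log-likelihood) of one target index under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


vocab_size = 50257
uniform_logits = [0.0] * vocab_size     # every token equally likely
loss = cross_entropy(uniform_logits, target=123)
print(f"{loss:.4f}  vs  ln(vocab_size) = {math.log(vocab_size):.4f}")
```

A starting loss far above this value usually points to a bad initialization or a bug in how logits and targets are lined up.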

00:56:42 optimization loop: overfit a single batch

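Overfitting a single batch is a quick check that the loss, the gradients, and the update step are wired together correctly: on one fixed batch the loss should fall steadily toward zero. The sketch below illustrates the idea with a stand-in model, namely a table of bigram logits trained with plain gradient descent in pure Python. This is not the video's setup (which trains the full GPT with AdamW in PyTorch); the vocabulary size, batch, and learning rate are all assumptions chosen for the toy.

```python
import math

V = 8                                   # toy vocabulary size (assumption)
x = [0, 1, 2, 3, 4, 5, 6, 7]            # one fixed batch of inputs
y = [1, 2, 3, 4, 5, 6, 7, 0]            # targets: the next token for each input
W = [[0.0] * V for _ in range(V)]       # logits table: row = current token


def softmax(row):
    m = max(row)
    exps = [math.exp(z - m) for z in row]
    s = sum(exps)
    return [e / s for e in exps]


def step(lr=1.0):
    """One gradient-descent step on the same batch; returns the loss before the update."""
    loss = 0.0
    grads = [[0.0] * V for _ in range(V)]
    for xi, yi in zip(x, y):
        p = softmax(W[xi])
        loss += -math.log(p[yi]) / len(x)
        for j in range(V):
            # d(loss)/d(logit_j) = p_j - 1[j == target], averaged over the batch
            grads[xi][j] += (p[j] - (1.0 if j == yi else 0.0)) / len(x)
    for i in range(V):
        for j in range(V):
            W[i][j] -= lr * grads[i][j]
    return loss


losses = [step() for _ in range(50)]
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

The first loss equals ln(V) because the table starts at zero (uniform predictions), and every later step on the same batch lowers it.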

01:02:00 data loader lite

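A minimal sketch of a "lite" data loader in the spirit of this section: it walks through one flat token sequence in chunks of B*T tokens and wraps back to the start when the next chunk would not fit. The class and attribute names mirror the video as far as I recall, but this version uses plain Python lists and takes the tokens as an argument; the video's loader reads a text file, tokenizes it, and returns PyTorch tensors.

```python
class DataLoaderLite:
    """Serves consecutive (x, y) batches from one flat token list, wrapping at the end."""

    def __init__(self, tokens, B, T):
        assert len(tokens) >= B * T + 1, "need at least one full batch plus one token"
        self.tokens, self.B, self.T = tokens, B, T
        self.current_position = 0

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.current_position:self.current_position + B * T + 1]
        # Inputs are all but the last token; targets are the same tokens shifted by one.
        x = [buf[i * T:(i + 1) * T] for i in range(B)]
        y = [buf[i * T + 1:(i + 1) * T + 1] for i in range(B)]
        self.current_position += B * T
        # If the next batch would run past the end of the data, start over.
        if self.current_position + B * T + 1 > len(self.tokens):
            self.current_position = 0
        return x, y


loader = DataLoaderLite(tokens=list(range(20)), B=2, T=3)
for _ in range(4):
    print(loader.next_batch())
```

With 20 tokens and B*T = 6, three batches fit before the position resets, so the fourth batch printed is the same as the first.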

01:06:14 parameter sharing wte and lm_head [weight sharing, which saves about 30% of the parameters!]

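The "about 30%" figure can be checked by counting parameters from the architecture alone. The sketch below assumes the standard GPT-2 124M layout (token and position embeddings, 12 blocks each with two LayerNorms, attention, and a 4x MLP, plus a final LayerNorm) and a bias-free lm_head. The function name gpt2_param_count is hypothetical.

```python
def gpt2_param_count(vocab_size=50257, block_size=1024, n_layer=12, n_embd=768, tied=True):
    """Count GPT-2 parameters from the architecture alone (weights plus biases)."""
    C = n_embd
    wte = vocab_size * C                              # token embedding
    wpe = block_size * C                              # position embedding
    layer_norm = 2 * C                                # scale and shift
    attn = (C * 3 * C + 3 * C) + (C * C + C)          # c_attn + c_proj
    mlp = (C * 4 * C + 4 * C) + (4 * C * C + C)       # c_fc + c_proj
    block = 2 * layer_norm + attn + mlp
    total = wte + wpe + n_layer * block + layer_norm  # final ln_f
    if not tied:
        total += vocab_size * C                       # separate lm_head matrix (no bias)
    return total


tied = gpt2_param_count(tied=True)
untied = gpt2_param_count(tied=False)
wte = 50257 * 768
print(f"tied: {tied:,}  untied: {untied:,}  shared matrix: {wte / tied:.1%} of tied total")
```

The shared 50257 x 768 matrix is roughly 31% of the tied model's parameters, so without sharing the model would carry a second copy of that matrix in lm_head.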

01:13:47 model initialization: std 0.02, residual init [following the original GPT-2 code]

Residual scaling:

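The motivation for residual scaling is that each block adds to the residual stream twice (once from attention, once from the MLP), so the variance of the stream grows with depth unless each contribution is scaled down by 1/sqrt(2 * n_layer). The sketch below demonstrates the effect with random numbers in plain Python. It is a toy under the assumption that every contribution is independent unit Gaussian noise, not a measurement of the real model.

```python
import random
import statistics

random.seed(0)
n_layer = 12
n_residual = 2 * n_layer      # each block adds to the residual stream twice (attention, MLP)
dim = 768


def residual_std(scale):
    """Std of a residual stream after summing n_residual random contributions."""
    x = [0.0] * dim
    for _ in range(n_residual):
        x = [xi + scale * random.gauss(0.0, 1.0) for xi in x]
    return statistics.pstdev(x)


print("unscaled:", residual_std(1.0))                  # grows like sqrt(n_residual)
print("scaled:  ", residual_std(n_residual ** -0.5))   # stays near 1
print("std for residual projections:", 0.02 * n_residual ** -0.5)
```

For the 124M model this means the projections that write into the residual stream are initialized with std 0.02 / sqrt(24), roughly 0.004, instead of 0.02.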
