[LLM Basics] Let's reproduce GPT-2 (124M) | Reproducing GPT-2 from scratch (Section 1) | Andrej Karpathy


from: https://www.youtube.com/watch?v=l8pRSuU81PU&t=983s

Video description:

465,246 views, June 10, 2024
We reproduce the GPT-2 (124M) from scratch. This video covers the whole process: First we build the GPT-2 network, then we optimize its training to be really fast, then we set up the training run following the GPT-2 and GPT-3 paper and their hyperparameters, then we hit run, and come back the next morning to see our results, and enjoy some amusing model generations. Keep in mind that in some places this video builds on the knowledge from earlier videos in the Zero to Hero Playlist (see my channel). You could also see this video as building my nanoGPT repo, which by the end is about 90% similar.

Links:

  • build-nanogpt GitHub repo, with all the changes in this video as individual commits: https://github.com/karpathy/build-nan…
  • nanoGPT repo: https://github.com/karpathy/nanoGPT
  • llm.c repo: https://github.com/karpathy/llm.c
  • my website: https://karpathy.ai
  • my twitter: / karpathy
  • our Discord channel: / discord

Supplementary links:

  • Attention is All You Need paper: https://arxiv.org/abs/1706.03762
  • OpenAI GPT-3 paper: https://arxiv.org/abs/2005.14165
  • OpenAI GPT-2 paper: https://d4mucfpksywv.cloudfront.net/b…
  • The GPU I’m training the model on is from Lambda GPU Cloud, I think the best and easiest way to spin up an on-demand GPU instance in the cloud that you can ssh to: https://lambdalabs.com

Chapters:
00:00:00 intro: Let’s reproduce GPT-2 (124M)
00:03:39 exploring the GPT-2 (124M) OpenAI checkpoint
00:13:47 SECTION 1: implementing the GPT-2 nn.Module
00:28:08 loading the huggingface/GPT-2 parameters
00:31:00 implementing the forward pass to get logits
00:33:31 sampling init, prefix tokens, tokenization
00:37:02 sampling loop
00:41:47 sample, auto-detect the device
00:45:50 let’s train: data batches (B,T) → logits (B,T,C)
00:52:53 cross entropy loss
00:56:42 optimization loop: overfit a single batch
01:02:00 data loader lite
01:06:14 parameter sharing wte and lm_head
01:13:47 model initialization: std 0.02, residual init
01:22:18 SECTION 2: Let’s make it fast. GPUs, mixed precision, 1000ms
01:28:14 Tensor Cores, timing the code, TF32 precision, 333ms
01:39:38 float16, gradient scalers, bfloat16, 300ms
01:48:15 torch.compile, Python overhead, kernel fusion, 130ms
02:00:18 flash attention, 96ms
02:06:54 nice/ugly numbers. vocab size 50257 → 50304, 93ms
02:14:55 SECTION 3: hyperparameters, AdamW, gradient clipping
02:21:06 learning rate scheduler: warmup + cosine decay
02:26:21 batch size schedule, weight decay, FusedAdamW, 90ms
02:34:09 gradient accumulation
02:46:52 distributed data parallel (DDP)
03:10:21 datasets used in GPT-2, GPT-3, FineWeb (EDU)
03:23:10 validation data split, validation loss, sampling revive
03:28:23 evaluation: HellaSwag, starting the run
03:43:05 SECTION 4: results in the morning! GPT-2, GPT-3 repro
03:56:21 shoutout to llm.c, equivalent but faster code in raw C/CUDA
03:59:39 summary, phew, build-nanogpt github repo

Corrections:
I will post all errata and followups to the build-nanogpt GitHub repo (link above)

SuperThanks:
I experimentally enabled them on my channel yesterday. Totally optional and only use if rich. All revenue goes to supporting my work in AI + Education.


00:03:39 exploring the GPT-2 (124M) OpenAI checkpoint

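OpenAI's original GPT-2 release is written in TensorFlow.

Here we use the Hugging Face transformers library instead, which is PyTorch. (It is a good reference when studying code: Hugging Face hosts implementations of a great many models!)

Visualizing the position embeddings.

Using the model.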

00:13:47 SECTION 1: implementing the GPT-2 nn.Module


network


block
MLP
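
As a rough illustration of the MLP: it widens each token vector to 4x the embedding size, applies a GELU nonlinearity, and projects back down. The original GPT-2 uses the tanh approximation of GELU. The sketch below shows only that activation and the hidden width in plain Python; the function name `gelu_tanh` is made up for this example, and the real module would use two linear layers around it.

```python
import math

def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU, the variant used by the original GPT-2."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + math.tanh(inner))

n_embd = 768
hidden = 4 * n_embd  # the MLP expands to 4x the width, then projects back

for x in (-2.0, 0.0, 2.0):
    print(x, round(gelu_tanh(x), 4))
```

Unlike ReLU, the curve dips slightly below zero for small negative inputs instead of being flat, so those inputs still receive some gradient.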


Attention
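
To make the causal part of attention concrete, here is a minimal single-head sketch in plain Python. It is a simplification, not the module from the video: the real layer is batched, multi-head, and wraps the computation in learned projections (`c_attn`, `c_proj`), all of which are left out here. The function name `causal_attention` is hypothetical.

```python
import math

def causal_attention(q, k, v):
    """Single-head causal attention over a sequence of T vectors (lists of floats)."""
    T, d = len(q), len(q[0])
    out = []
    for t in range(T):
        # Scores against positions 0..t only: the causal mask hides the future.
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        m = max(scores)                      # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]      # softmax over the visible positions
        out.append([sum(w * v[s][j] for s, w in enumerate(weights))
                    for j in range(len(v[0]))])
    return out

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = causal_attention(x, x, x)
print(y)
```

Because position t only looks at positions up to t, changing a later token leaves all earlier outputs unchanged, which is what lets the model be trained on every position of a sequence at once.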

GPTConfig
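
For reference, the hyperparameters of the 124M checkpoint fit in a small dataclass. This is a sketch: the field names follow the convention used in nanoGPT, and the values are the published GPT-2 (124M) sizes; the code in the original screenshot may have differed in detail.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # maximum sequence length (number of positions)
    vocab_size: int = 50257  # 50,000 BPE merges + 256 byte tokens + <|endoftext|>
    n_layer: int = 12        # number of transformer blocks
    n_head: int = 12         # attention heads per block
    n_embd: int = 768        # embedding (channel) dimension

config = GPTConfig()
head_size = config.n_embd // config.n_head  # channels handled by each head
print(config, head_size)
```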

00:28:08 loading the huggingface/GPT-2 parameters
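
One detail worth noting when copying weights: the Hugging Face GPT-2 checkpoint stores four of its per-block matrices as (in_features, out_features), because it uses a Conv1D module, while `nn.Linear` expects (out_features, in_features). Those weights have to be transposed on load. The sketch below illustrates only that key-matching logic on a toy dictionary of nested lists; `needs_transpose`, `transpose`, and the tiny shapes are stand-ins invented for this example, not the actual loading code.

```python
# Suffixes of Hugging Face GPT-2 weights stored as (in_features, out_features)
# because that checkpoint uses a Conv1D module rather than nn.Linear.
TRANSPOSED = ("attn.c_attn.weight", "attn.c_proj.weight",
              "mlp.c_fc.weight", "mlp.c_proj.weight")

def needs_transpose(key: str) -> bool:
    return key.endswith(TRANSPOSED)

def transpose(matrix):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]

# Toy stand-in for a state dict: shapes shrunk so the example stays readable.
hf_state = {
    "transformer.h.0.attn.c_attn.weight": [[1, 2, 3], [4, 5, 6]],  # (in=2, out=3)
    "transformer.h.0.ln_1.weight": [1, 1],
}
ours = {k: transpose(w) if needs_transpose(k) else w for k, w in hf_state.items()}
print(ours)
```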


00:31:00 implementing the forward pass to get logits


00:33:31 sampling init, prefix tokens, tokenization


00:37:02 sampling loop
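
The sampling loop repeatedly asks the model for next-token logits, keeps only the k most likely tokens, samples one of them, and appends it to the sequence. Below is a minimal plain-Python sketch of that idea. The names `sample_top_k`, `generate`, and `toy_model` are hypothetical, the default k=50 is an assumption about the setting used in the video, and `toy_model` is a stand-in for the real network.

```python
import math
import random

def sample_top_k(logits, k, rng):
    """Sample one token id from the k highest-scoring logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]  # unnormalized softmax over the top k
    return rng.choices(top, weights=weights, k=1)[0]

def generate(next_logits, prefix, max_length, k=50, seed=42):
    """Append sampled tokens to `prefix` until it holds `max_length` tokens."""
    rng = random.Random(seed)
    tokens = list(prefix)
    while len(tokens) < max_length:
        tokens.append(sample_top_k(next_logits(tokens), k, rng))
    return tokens

# Hypothetical stand-in for the model: favors (last token + 1) mod 5.
def toy_model(tokens):
    return [5.0 if i == (tokens[-1] + 1) % 5 else 0.0 for i in range(5)]

out = generate(toy_model, prefix=[0], max_length=8, k=2)
greedy = generate(toy_model, prefix=[0], max_length=6, k=1)
print(out, greedy)
```

With k=1 the loop reduces to greedy decoding; larger k trades determinism for variety while still cutting off the long tail of unlikely tokens.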


00:45:50 let’s train: data batches (B,T) → logits (B,T,C) [building the labels: the inputs shifted by one position]

Debug in Jupyter first, then move the code into a .py file!
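
The label construction can be sketched in a few lines of plain Python. We take B*T+1 tokens, use all but the last as inputs and all but the first as targets, then cut both into B rows of T. The values of B and T and the fake token ids are illustrative only; with tensors the reshaping step would be a `view(B, T)`.

```python
B, T = 4, 8                      # batch size and sequence length (illustrative values)
tokens = list(range(100, 200))   # stand-in for real token ids from the tokenizer

buf = tokens[: B * T + 1]        # one extra token so the final position has a label
x_flat, y_flat = buf[:-1], buf[1:]  # targets are the inputs shifted left by one

# Reshape the flat lists into B rows of T tokens, like tensor.view(B, T).
x = [x_flat[i * T:(i + 1) * T] for i in range(B)]
y = [y_flat[i * T:(i + 1) * T] for i in range(B)]
print(x[0], y[0])
```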

00:52:53 cross entropy loss

forward now also returns the loss.

Checking the loss of the randomly initialized model.
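
A quick sanity check for the initial loss: a freshly initialized model should assign roughly uniform probability over the vocabulary, so the cross entropy should be close to -ln(1/50257), which is about 10.82. The sketch below computes that number and confirms it with a hand-written cross entropy on all-equal logits; the helper `cross_entropy` is written for this example and handles a single position only.

```python
import math

vocab_size = 50257

# A freshly initialized model should spread probability roughly uniformly,
# so each correct token gets probability about 1 / vocab_size.
expected_initial_loss = -math.log(1.0 / vocab_size)
print(round(expected_initial_loss, 2))

def cross_entropy(logits, target):
    """Cross entropy of one position: -log softmax(logits)[target]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

uniform = cross_entropy([0.0] * vocab_size, target=123)
print(round(uniform, 2))
```

If the first reported loss is far above this value, the initialization is probably putting too much confidence on some tokens.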

00:56:42 optimization loop: overfit a single batch


01:02:00 data loader lite
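
A minimal sketch of such a loader, in plain Python with lists standing in for tensors: it walks over the token stream in chunks of B*T, returns inputs and shifted targets, and wraps back to the start when the next chunk would run off the end. The class and method names mirror the chapter title but the body is a reconstruction under those assumptions, not a copy of the video's code.

```python
class DataLoaderLite:
    """Walks over a token list in chunks of B*T, wrapping around at the end."""

    def __init__(self, tokens, B, T):
        self.tokens, self.B, self.T = tokens, B, T
        self.pos = 0

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.pos : self.pos + B * T + 1]
        x = [buf[i * T:(i + 1) * T] for i in range(B)]          # inputs
        y = [buf[i * T + 1:(i + 1) * T + 1] for i in range(B)]  # targets, shifted by one
        self.pos += B * T
        # Reset if the next batch would run past the end of the data.
        if self.pos + B * T + 1 > len(self.tokens):
            self.pos = 0
        return x, y

loader = DataLoaderLite(list(range(50)), B=2, T=4)
batches = [loader.next_batch() for _ in range(7)]
print(batches[0])
```

With 50 tokens and B*T = 8, six batches fit before the position resets, so the seventh batch repeats the first.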


01:06:14 parameter sharing wte and lm_head [weight sharing saves about 30% of the parameters!]
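
The "about 30%" figure can be checked by counting parameters from the GPT-2 (124M) layer shapes. The arithmetic below is a plain-Python sketch: the token embedding table `wte` is 50257 x 768, and because `lm_head` reuses that same matrix it contributes nothing extra to the total. The variable names are chosen for this example.

```python
vocab_size, block_size, n_layer, n_embd = 50257, 1024, 12, 768

wte = vocab_size * n_embd            # token embedding table
wpe = block_size * n_embd            # position embedding table
layer_norm = 2 * n_embd              # weight + bias

per_block = (
    2 * layer_norm                               # ln_1, ln_2
    + n_embd * 3 * n_embd + 3 * n_embd           # attn.c_attn (q, k, v)
    + n_embd * n_embd + n_embd                   # attn.c_proj
    + n_embd * 4 * n_embd + 4 * n_embd           # mlp.c_fc
    + 4 * n_embd * n_embd + n_embd               # mlp.c_proj
)

# lm_head reuses the wte matrix, so it adds nothing to the tied total.
tied_total = wte + wpe + n_layer * per_block + layer_norm
untied_total = tied_total + wte      # a separate lm_head would cost another wte

print(f"tied: {tied_total:,}  wte share: {wte / tied_total:.1%}")
```

The shared matrix alone is roughly 38.6M of the roughly 124M parameters, which is where the 30% figure comes from.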

01:13:47 model initialization: std 0.02, residual init [following the original GPT-2 code]

Residual scaling
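
The reasoning behind residual scaling: every block adds to the residual stream twice (once from attention, once from the MLP), so after n_layer blocks the stream is a sum of 2*n_layer contributions and its standard deviation grows like the square root of that count. Scaling the init of the layers that write into the stream by 1/sqrt(2*n_layer) compensates. The sketch below checks this numerically with plain-Python random draws; `stream_std` is a helper invented for this example, and independent Gaussian draws are a simplifying assumption about the contributions.

```python
import math
import random

n_layer = 12
base_std = 0.02
# Each block adds to the residual stream twice (attention and MLP),
# so there are 2 * n_layer residual contributions in total.
residual_std = base_std * (2 * n_layer) ** -0.5

def stream_std(n_adds, scale, trials=2000, seed=0):
    """Empirical std of a sum of n_adds independent N(0, scale^2) draws."""
    rng = random.Random(seed)
    sums = [sum(rng.gauss(0.0, scale) for _ in range(n_adds)) for _ in range(trials)]
    mean = sum(sums) / trials
    return math.sqrt(sum((s - mean) ** 2 for s in sums) / trials)

unscaled = stream_std(2 * n_layer, 1.0)                          # grows like sqrt(24)
scaled = stream_std(2 * n_layer, 1.0 / math.sqrt(2 * n_layer))   # stays near 1
print(round(residual_std, 5), round(unscaled, 2), round(scaled, 2))
```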
