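GPT-2 Code
1. Preface
No boilerplate here: I wrote this post to deepen my own understanding of GPT and generative models, and to walk through GPT's model architecture and implementation details.
At its core, GPT is the decoder part of the Transformer: given a sequence of input text, it predicts the next token. To make the inner workings of GPT and LLMs easier to follow, I decided to use the more lightweight NanoGPT.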
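2. Getting started
2.1 Environment
You need a Python environment with PyTorch installed. If you are not comfortable setting one up, you can use Colab; at the end I will post the complete code, runnable on Colab.
!pip install transformers tiktoken
# imported libraries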
import os
import json
import regex as re
import requests
import numpy as np
import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader
import pickle
import math
import time
from collections import defaultdict
import tiktoken
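2.2 Downloading the data
This post uses the tinyshakespeare dataset as an example and tokenizes it with tiktoken, which implements the BPE algorithm. Tokenization converts text into a sequence of integers that can be fed to the model. In short, it is the foundation on which the model processes text data.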
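To make the idea behind BPE concrete, here is a minimal sketch of a single merge step in plain Python. It is a toy illustration, not tiktoken's implementation: the helper names `most_frequent_pair` and `merge_pair` and the sample string are made up for this example. A real tokenizer repeats this step many times to build its vocabulary.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the adjacent pair of ids that occurs most often."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"
ids = list(text.encode("utf-8"))      # start from raw bytes (values 0..255)
pair = most_frequent_pair(ids)        # the byte pair for "aa"
merged = merge_pair(ids, pair, 256)   # 256 is the first id outside the byte range
print(len(ids), "->", len(merged))    # the sequence gets shorter after the merge
```

Each merge shortens the sequence and adds one entry to the vocabulary, which is how a BPE vocabulary grows beyond the 256 raw byte values.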
data_dir = os.path.join('data', 'tinyshakespeare')
input_file_path = os.path.join(data_dir, 'input.txt')
if not os.path.exists(input_file_path):
    data_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
    # exist_ok avoids an error when the directory exists but the file does not
    os.makedirs(data_dir, exist_ok=True)
    with open(input_file_path, 'w') as f:
        f.write(requests.get(data_url).text)
with open(input_file_path, 'r') as f:
    data = f.read()
n = len(data)
# first 90% of the characters for training, the rest for validation
train_data = data[:int(n*0.9)]
val_data = data[int(n*0.9):]
# encode with tiktoken gpt2 bpe
enc = tiktoken.get_encoding("gpt2")
train_ids = enc.encode_ordinary(train_data)
val_ids = enc.encode_ordinary(val_data)
print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")
# export to bin files
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile(os.path.join(data_dir, 'train.bin'))
val_ids.tofile(os.path.join(data_dir, 'val.bin'))
2.3 Model parameters
class GPTConfig:
    def __init__(self, vocab_size, **kwargs):
        self.vocab_size = vocab_size
        for key, value in kwargs.items():
            setattr(self, key, value)

class CustomConfig(GPTConfig):
    n_layer = 8
    n_head = 8
    n_embd = 256
    embd_pdrop = 0.1
    resid_pdrop = 0.1
    attn_pdrop = 0.1
    dropout = 0.1
    compile = True
    device = 'cuda'
    num_workers = 0
    max_iters = 2e4
    batch_size = 4
    block_size = 64
    learning_rate = 6e-4
    betas = (0.9, 0.95)
    weight_decay = 1e-1
    grad_norm_clip = 1.0

# the vocabulary size comes from the tokenizer (50257 for GPT-2 BPE);
# len(train_ids) would be the number of tokens in the corpus, not the vocabulary size
vocab_size = enc.n_vocab
config = CustomConfig(vocab_size=vocab_size)
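- vocab_size: size of the vocabulary
- n_layer: number of layers (decoder blocks) in the model
- n_head: number of attention heads in each layer
- n_embd: size of the embedding dimension
- embd_pdrop: dropout probability applied to the embeddings
- resid_pdrop: dropout probability applied on the residual branches
- attn_pdrop: dropout probability applied to the attention weights
- dropout: global dropout probability
- compile: whether to use torch.compile, PyTorch's built-in way of speeding up model code (requires torch >= 2.0)
- device: device to train on ('cpu' or 'cuda')
- num_workers: number of worker processes used for data loading
- max_iters: maximum number of training iterations
- batch_size: batch size used during training
- block_size: maximum length of the input sequence
- learning_rate: learning rate of the optimizer
- betas: tuple of beta values for the Adam optimizer
- weight_decay: weight decay of the optimizer
- grad_norm_clip: maximum gradient norm during training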
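As a quick sanity check on these settings, the parameter count can be worked out by hand. The sketch below assumes the layer layout used by the model defined later in this post (biases on every linear layer, two LayerNorms per block, the output head not counted) and a vocabulary of 50257 tokens, the size of the GPT-2 BPE vocabulary. The function name `count_transformer_params` is made up for this example.

```python
def count_transformer_params(vocab_size, block_size, n_layer, n_embd):
    """Count parameters of the embeddings, decoder blocks and final LayerNorm."""
    wte = vocab_size * n_embd                      # token embedding table
    wpe = block_size * n_embd                      # position embedding table
    # attention: fused q/k/v projection plus output projection (weights + biases)
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4 * n_embd, then project back (weights + biases)
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    layer_norms = 2 * (2 * n_embd)                 # two LayerNorms, each with weight and bias
    per_block = attn + mlp + layer_norms
    ln_f = 2 * n_embd                              # final LayerNorm
    return wte + wpe + n_layer * per_block + ln_f

n_params = count_transformer_params(vocab_size=50257, block_size=64, n_layer=8, n_embd=256)
print(f"{n_params / 1e6:.2f}M parameters")
```

With n_embd = 256, about two thirds of these parameters sit in the token embedding table, which is why the vocabulary size matters so much for a model this small.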
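2.4 Defining the DataLoaders
To use a dataset in PyTorch we need to implement two methods (magic methods): `__len__`, which returns the number of samples in the dataset, and `__getitem__`, which returns the i-th sample. Once the dataset exists we can create a `DataLoader` object, which takes the dataset and loads the data in batches and can shuffle it; in the code below, each sample is moved to the GPU inside `__getitem__`. A `DataLoader` is an iterable that yields batches of training data. With PyTorch's `Dataset` and `DataLoader` we can train efficiently on large datasets without having to load all the data into memory at once.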
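The key detail of a language-modeling dataset is how the input and the target relate: the target is the same window shifted one token to the right. The sketch below shows this with a plain Python list. The helper `get_example` and the toy token values are made up for this example.

```python
def get_example(tokens, idx, block_size):
    """Return (x, y) where y is x shifted one position to the right."""
    x = tokens[idx : idx + block_size]
    y = tokens[idx + 1 : idx + 1 + block_size]
    return x, y

tokens = [10, 11, 12, 13, 14, 15, 16, 17]
block_size = 4
# the last valid start index must leave room for the shifted target
num_examples = len(tokens) - block_size

x, y = get_example(tokens, 0, block_size)
print(x, y)   # y[i] is the token that follows x[i]
```

At every position i the model sees x[0..i] and is trained to predict y[i], so one window of block_size tokens provides block_size training signals.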
# load the .bin files saved earlier
data_dir = os.path.join('data', 'tinyshakespeare')
train_data = np.memmap(os.path.join(data_dir, 'train.bin'), dtype=np.uint16, mode='r')
val_data = np.memmap(os.path.join(data_dir, 'val.bin'), dtype=np.uint16, mode='r')

class ShakespeareDataset(Dataset):
    def __init__(self, split, block_size=128, device_type='cuda'):
        assert split in {'train', 'test'}
        self.split = split
        self.block_size = block_size
        self.device_type = device_type
        self.data = train_data if split == 'train' else val_data

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        # x and y are slices of the same block_size length, but y is shifted one token to the right
        x = torch.from_numpy(self.data[idx : idx + self.block_size].astype(np.int64))
        y = torch.from_numpy(self.data[idx + 1 : idx + 1 + self.block_size].astype(np.int64))
        if self.device_type == 'cuda':
            # pin arrays x,y, which allows us to move them to GPU asynchronously (non_blocking=True)
            x, y = x.pin_memory().to('cuda', non_blocking=True), y.pin_memory().to('cuda', non_blocking=True)
        else:
            x, y = x.to('cpu'), y.to('cpu')
        return x, y

# create dataset and dataloader
train_dataset = ShakespeareDataset('train', config.block_size, config.device)
train_loader = DataLoader(train_dataset, batch_size=config.batch_size, num_workers=config.num_workers, drop_last=False)
test_dataset = ShakespeareDataset('test', config.block_size, config.device)
test_loader = DataLoader(test_dataset, batch_size=config.batch_size, num_workers=config.num_workers, drop_last=False)
2.5 GELU Activation Function
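GELU (Gaussian Error Linear Unit) is a nonlinear activation function introduced by Hendrycks and Gimpel in 2016. It can be seen as a smooth alternative to ReLU, and in some deep learning models it performs better than ReLU. GELU has several desirable properties: it is differentiable everywhere, and its output ranges from about -0.17 to positive infinity. Studies suggest that GELU can improve the training speed and accuracy of deep learning models, especially on natural language processing tasks.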
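For intuition, the sketch below compares the exact GELU, x · Φ(x) with Φ the standard normal CDF, against the tanh approximation used in this post, using only the standard library. The function names `gelu_exact` and `gelu_tanh` and the evaluation grid are choices made for this example.

```python
import math

def gelu_exact(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GELU, the formula used by the NewGELU module."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

xs = [i / 100 for i in range(-500, 501)]   # grid over [-5, 5]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
lowest = min(gelu_exact(x) for x in xs)

print(f"largest gap between the two formulas on the grid: {max_gap:.6f}")
print(f"lowest GELU value on the grid: {lowest:.4f}")   # about -0.17
```

The two curves are nearly indistinguishable on this grid, and the minimum shows that GELU dips only slightly below zero instead of cutting off sharply like ReLU.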
class NewGELU(nn.Module):
    """
    Implementation of the GELU activation function currently in Google BERT repo (identical to OpenAI GPT).
    Reference: Gaussian Error Linear Units (GELU) paper: https://arxiv.org/abs/1606.08415
    """
    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))
2.6 Causal Self Attention
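Causal self-attention is a variant of the self-attention mechanism used in the Transformer architecture, and it is one of the key components of the GPT model. The difference from ordinary self-attention is that causal self-attention restricts each position to attend only to earlier tokens in the sequence, which makes it suitable for generating text.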
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        # regularization
        self.attn_dropout = nn.Dropout(config.attn_pdrop)
        self.resid_dropout = nn.Dropout(config.resid_pdrop)
        self.dropout = config.dropout
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # flash attention make GPU go brrrrr but support is only in PyTorch >= 2.0
        self.flash = hasattr(
            torch.nn.functional,
            'scaled_dot_product_attention')
        if not self.flash:
            print(
                "WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0")
            # causal mask to ensure that attention is only applied to the left in the input sequence
            self.register_buffer(
                "mask",
                torch.tril(torch.ones(config.block_size, config.block_size)
                           ).view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        # batch_size, seq_len, emb_dim
        B, T, C = x.size()
        # (b, seq_len, emb_dim) --> (b, seq_len, emb_dim * 3) --> (b, seq_len, emb_dim)
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (b, h, seq_len, d_k)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (b, h, seq_len, d_k)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (b, h, seq_len, d_k)
        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        if self.flash:
            # efficient attention using Flash Attention CUDA kernels
            y = torch.nn.functional.scaled_dot_product_attention(
                q, k, v, attn_mask=None, dropout_p=self.dropout if self.training else 0, is_causal=True
            )
        else:
            # (b, h, seq_len, d_k) matmul (b, h, d_k, seq_len) --> (b, h, seq_len, seq_len)
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            # diagonal mask
            # fill 0 mask with super small number so it wont affect the softmax weight
            # (batch_size, h, seq_len, seq_len)
            att = att.masked_fill(self.mask[:,:,:T,:T] == 0, float('-inf'))
            att = F.softmax(att, dim=-1)
            att = self.attn_dropout(att)
            # (b, h, seq_len, seq_len) matmul (b, h, seq_len, d_k) --> (b, h, seq_len, d_k)
            y = att @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        # output projection
        y = self.resid_dropout(self.c_proj(y))
        return y
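This code deserves a slow, careful read.
- `forward` is the function invoked by `__call__`, and it holds the core of the whole forward pass. It takes an input x of shape (batch_size, seq_len, emb_dim).
- It splits the projected input into query, key and value tensors for all attention heads and reshapes them accordingly. It then computes attention either with flash attention (PyTorch >= 2.0) or with the slower explicit dot-product path, depending on the PyTorch version.
- In dot-product attention, the attention scores are the matrix product of the query and key tensors, divided by the square root of the per-head key dimension.
- The causal mask ensures that when the model produces each token it can only consider the tokens before it, never the ones after. This is an essential property of autoregressive language models: it lets the model predict the next token from the text generated so far. The mask is created with `torch.tril(torch.ones(n, n))`. Here `torch.ones(n, n)` creates an n×n matrix filled with ones, and `torch.tril(...)` sets the entries above the diagonal to 0, leaving a lower-triangular matrix. In this matrix, 1 marks the positions the current token is allowed to attend to, and 0 blocks attention (the model ignores those positions when computing the representation of the current token).
- The masked scores are then normalized with softmax and multiplied by the value tensor to obtain the output. All the steps above are a direct translation of the attention equation.
- Finally, the heads are concatenated and passed through the output projection (followed by dropout), which keeps the output at the same dimension as the input. The residual connection itself is added in the decoder block.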
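The steps listed above can be sketched for a single head in NumPy. This is an illustration under simplifying assumptions (no batch dimension, no dropout, no learned projections), not the module from this post. The function name `causal_attention` and the random inputs are made up for the example, and `np.tril` plays the role of `torch.tril`.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v have shape (seq_len, d_k).
    """
    T, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)                 # (T, T) query-key similarities
    mask = np.tril(np.ones((T, T)))                 # 1 on and below the diagonal
    scores = np.where(mask == 0, -np.inf, scores)   # block attention to future positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilise the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out, weights = causal_attention(q, k, v)
print(np.round(weights, 2))   # lower-triangular: each row only covers earlier positions
```

Because the first position can only attend to itself, its attention weight is 1 and its output equals its own value vector, which is a handy check when debugging a mask.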
2.7 Decoder Block
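GPT has no encoder block, only decoder blocks. This is because GPT is autoregressive and uses masked self-attention to predict the next token given the preceding tokens. Masked self-attention ensures that the model cannot look ahead in the sequence and can only use earlier tokens for its predictions. It also means there is no separate input sequence whose representation has to be encoded first, so an encoder is unnecessary.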
class Block(nn.Module):
    """ GPT decoder block"""
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = nn.ModuleDict(dict(
            c_fc = nn.Linear(config.n_embd, 4 * config.n_embd),
            act = NewGELU(),
            c_proj = nn.Linear(4 * config.n_embd, config.n_embd),
            dropout = nn.Dropout(config.resid_pdrop),
        ))
        m = self.mlp
        self.mlpf = lambda x: m.dropout(m.c_proj(m.act(m.c_fc(x))))

    def forward(self, x):
        # (batch_size, seq_len, emb_dim)
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlpf(self.ln_2(x))
        return x
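In short, this decoder block lets the GPT model generate new sequences autoregressively by predicting the probability distribution of the next token given the preceding tokens.
2.8 GPT Model
Having covered the individual components of the GPT model, we can now put them together into the final model. By stacking several decoder blocks, the GPT model can generate text that is coherent and consistent with its context.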
class GPT(nn.Module):
    """ GPT Language Model """
    def __init__(self, config):
        super().__init__()
        self.block_size = config.block_size
        self.transformer = nn.ModuleDict(dict(
            # nn.Embedding works like a lookup table: conceptually, a one-hot vector selects the row holding that token's vector
            # compared with an nn.Linear(vocab_size, n_embd) applied to one-hot inputs, its weight matrix is the transpose
            wte = nn.Embedding(config.vocab_size, config.n_embd),
            # it is now common to learn the position encoding as parameters as well:
            # an nn.Embedding is indexed by position to look up each position's vector
            # if this is unclear, see the Medium post listed under Reference
            wpe = nn.Embedding(config.block_size, config.n_embd),
            drop = nn.Dropout(config.embd_pdrop),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        # init all weights, and apply a special scaled init to the residual projections, per GPT-2 paper
        self.apply(self._init_weights)
        for pn, p in self.named_parameters():
            if pn.endswith('c_proj.weight'):
                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))
        # report number of parameters (note we don't count the decoder parameters in lm_head)
        n_params = sum(p.numel() for p in self.transformer.parameters())
        print("number of parameters: %.2fM" % (n_params/1e6,))

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        elif isinstance(module, nn.LayerNorm):
            torch.nn.init.zeros_(module.bias)
            torch.nn.init.ones_(module.weight)
    def configure_optimizers(self, train_config):
        # separate out all parameters to those that will and won't experience regularizing weight decay
        decay = set()
        no_decay = set()
        whitelist_weight_modules = (torch.nn.Linear, )
        blacklist_weight_modules = (torch.nn.LayerNorm, torch.nn.Embedding)
        for mn, m in self.named_modules():
            for pn, p in m.named_parameters():
                fpn = '%s.%s' % (mn, pn) if mn else pn # full param name
                # random note: because named_modules and named_parameters are recursive
                # we will see the same tensors p many many times. but doing it this way
                # allows us to know which parent module any tensor p belongs to...
                if pn.endswith('bias'):
                    # all biases will not be decayed
                    no_decay.add(fpn)
                elif pn.endswith('weight') and isinstance(m, whitelist_weight_modules):
                    # weights of whitelist modules will be weight decayed
                    decay.add(fpn)
                elif pn.endswith('weight') and isinstance(m, blacklist_weight_modules):
                    # weights of blacklist modules will NOT be weight decayed
                    no_decay.add(fpn)
        # validate that we considered every parameter
        param_dict = {pn: p for pn, p in self.named_parameters()}
        inter_params = decay & no_decay
        union_params = decay | no_decay
        # create the pytorch optimizer object
        optim_groups = [
            {"params": [param_dict[pn] for pn in sorted(list(decay))], "weight_decay": train_config.weight_decay},
            {"params": [param_dict[pn] for pn in sorted(list(no_decay))], "weight_decay": 0.0},
        ]
        optimizer = torch.optim.AdamW(optim_groups, lr=train_config.learning_rate, betas=train_config.betas)
        return optimizer
    def forward(self, idx, targets=None):
        # in the forward pass, embeddings are computed for every position of the input,
        # so the input length determines how much memory the embeddings take
        # the q, k, v computations in self-attention also scale with the sequence length
        # so the longer the input sequence, the more GPU memory inference needs
        device = idx.device
        b, t = idx.size()
        assert t <= self.block_size, f"Cannot forward sequence of length {t}, block size is only {self.block_size}"
        # positional token, shape (1, t)
        # the integers from 0 to t-1
        # tensor([[0, 1, 2, ..., t-1]])
        pos = torch.arange(0, t, dtype=torch.long, device=device).unsqueeze(0)
        # forward the GPT model itself
        tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
        pos_emb = self.transformer.wpe(pos) # position embeddings of shape (1, t, n_embd)
        x = self.transformer.drop(tok_emb + pos_emb)
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        # (b, t, n_embd) --> (b, t, vocab_size)
        logits = self.lm_head(x)
        # if we are given some desired targets also calculate the loss
        # -1 at output will be ignored
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        return logits, loss
    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, do_sample=False, top_k=None):
        """
        Take a conditioning sequence of indices idx (LongTensor of shape (b, t)) and complete
        the sequence max_new_tokens times, feeding the predictions back into the model each time.
        Most likely you'll want to make sure to be in model.eval() mode of operation for this.
        """
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            idx_cond = idx if idx.size(1) <= self.block_size else idx[:, -self.block_size:]
            # forward the model to get the logits for the index in the sequence
            logits, _ = self(idx_cond)  # calling the module goes through __call__, which invokes forward; with no targets this is plain inference
            # pluck the logits at the final step and scale by desired temperature
            logits = logits[:, -1, :] / temperature
            # optionally crop the logits to only the top k options
            if top_k is not None:
                v, _ = torch.topk(logits, top_k)
                logits[logits < v[:, [-1]]] = -float('Inf')
            # apply softmax to convert logits to (normalized) probabilities
            probs = F.softmax(logits, dim=-1)
            # either sample from the distribution or take the most likely element
            if do_sample:
                # torch.multinomial draws random samples from a given probability distribution.
                # It is very useful for probability-based sampling,
                # especially in tasks such as text generation, where the next output is chosen at random according to the predicted probabilities.
                idx_next = torch.multinomial(probs, num_samples=1)
            else:
                # torch.topk Returns the k largest elements of the given input tensor along a given dimension.
                _, idx_next = torch.topk(probs, k=1, dim=-1)
            # append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
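- The forward method computes the forward pass of the GPT model. It takes a tensor of token indices (idx) and, optionally, a tensor of target indices (targets). It first applies the token embedding layer to the token indices and the position embedding layer to the position indices, adds the two, and feeds the result through the decoder blocks.
- It then uses the language model head to obtain the logits, the unnormalized distribution over the next token at each position.
- When targets are given, cross-entropy loss measures the difference between the predicted distribution and the target distribution.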
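To get a feel for the loss values printed during training, the cross-entropy for a single position can be computed by hand. The sketch below uses only the standard library. The function name `cross_entropy` and the example logits are made up here, and the vocabulary size of 50257 is an assumption matching the GPT-2 BPE vocabulary.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)                                   # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# a model that knows nothing assigns equal logits to every token
vocab_size = 50257
uniform_loss = cross_entropy([0.0] * vocab_size, target=123)
print(f"uniform guess: {uniform_loss:.2f}")           # ln(50257), about 10.82

# a confident and correct prediction gives a loss close to zero
confident_loss = cross_entropy([0.0, 5.0, 0.0], target=1)
print(f"confident and correct: {confident_loss:.4f}")
```

A freshly initialized model should therefore start near ln(vocab_size); a first loss far above that usually points to a bug in the data pipeline or initialization.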
2.9 Training code
The model can be trained with the following code.
class Trainer:
    def __init__(self, config, model, train_dataset):
        self.config = config
        self.model = model
        self.optimizer = None
        self.train_dataset = train_dataset
        self.callbacks = defaultdict(list)
        self.device = config.device
        self.model = self.model.to(self.device)
        # variables that will be assigned to trainer class later for logging and etc
        self.iter_num = 0
        self.iter_time = 0.0
        self.iter_dt = 0.0

    def add_callback(self, onevent: str, callback):
        self.callbacks[onevent].append(callback)

    def set_callback(self, onevent: str, callback):
        self.callbacks[onevent] = [callback]

    def trigger_callbacks(self, onevent: str):
        for callback in self.callbacks.get(onevent, []):
            callback(self)

    def run(self):
        model, config = self.model, self.config
        # setup the optimizer
        self.optimizer = model.configure_optimizers(config)
        # setup the dataloader
        train_loader = DataLoader(
            self.train_dataset,
            sampler=torch.utils.data.RandomSampler(self.train_dataset, replacement=True, num_samples=int(1e10)),
            shuffle=False,
            # pin_memory=True,
            batch_size=config.batch_size,
            num_workers=config.num_workers,
        )
        model.train()
        self.iter_num = 0
        self.iter_time = time.time()
        data_iter = iter(train_loader)
        while True:
            # fetch the next batch (x, y) and re-init iterator if needed
            try:
                batch = next(data_iter)
            except StopIteration:
                data_iter = iter(train_loader)
                batch = next(data_iter)
            batch = [t.to(self.device) for t in batch]
            x, y = batch
            # forward the model
            logits, self.loss = model(x, y)
            # backprop and update the parameters
            model.zero_grad(set_to_none=True)
            self.loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), config.grad_norm_clip)
            self.optimizer.step()
            self.trigger_callbacks('on_batch_end')
            self.iter_num += 1
            tnow = time.time()
            self.iter_dt = tnow - self.iter_time
            self.iter_time = tnow
            # termination conditions
            if config.max_iters is not None and self.iter_num >= config.max_iters:
                break
model = GPT(config).to(config.device)
if config.compile:
    model = torch.compile(model)
trainer = Trainer(config, model, train_dataset)

def batch_end_callback(trainer):
    if trainer.iter_num % 500 == 0:
        print(f"iter_dt {trainer.iter_dt * 1000:.2f}ms; iter {trainer.iter_num}: train loss {trainer.loss.item():.5f}")

trainer.set_callback('on_batch_end', batch_end_callback)
trainer.run()
2.10 Generating text
text = 'Lord:\nRise! My people, conquer the north!'
sample_ids = torch.tensor(enc.encode_ordinary(text), dtype=torch.long)
sample_ids = torch.unsqueeze(sample_ids, 0).to(config.device)
model.eval()  # switch off dropout for generation, as the generate docstring recommends
result = model.generate(sample_ids, max_new_tokens=50, temperature=1, do_sample=False, top_k=None)
print(enc.decode(result.detach().cpu().tolist()[0]))
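3. The overall flow in plain language
GPT is an autoregressive language model: it takes a piece of text as the condition and then produces new text one token at a time. Each token is generated from the tokens that precede it in the sequence.
`generate` is a method of the GPT class that produces new text from a given input sequence.
- It takes a conditioning sequence of indices idx with shape (batch size, sequence length) and extends it max_new_tokens times, feeding each prediction back into the model.
- It runs a forward pass of the model to obtain the logits for each position in the sequence. The logits are an unnormalized probability distribution over the vocabulary of possible tokens.
- Next, it takes the logits of the last step and scales them by the desired temperature. The temperature controls the randomness of the generated output: a higher temperature gives more diverse and random output, while a lower temperature gives more conservative and predictable output. (I will cover this effect in more detail later.)
- It then applies softmax to turn the logits into normalized probabilities. These represent how likely each token in the vocabulary is to be the next token of the generated sequence.
- Finally, it either samples from the probability distribution with `torch.multinomial()` or, when do_sample is False, takes the most likely token with `torch.topk`. It then appends the chosen index to the running sequence and keeps looping until max_new_tokens tokens have been generated.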
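The effect of temperature and top-k on the next-token distribution can be sketched in plain Python. This is an illustration, not the `generate` method itself: the function name `next_token_probs` and the four example logits are made up here, and `random.choices` stands in for `torch.multinomial`.

```python
import math
import random

def next_token_probs(logits, temperature=1.0, top_k=None):
    """Turn logits into probabilities, with optional temperature and top-k filtering."""
    scaled = [z / temperature for z in logits]
    if top_k is not None:
        # keep only logits at least as large as the k-th largest one
        kth = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [z if z >= kth else float("-inf") for z in scaled]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]   # exp(-inf) is 0.0, so masked tokens vanish
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]
sharp = next_token_probs(logits, temperature=0.5)   # low temperature: more peaked
flat = next_token_probs(logits, temperature=2.0)    # high temperature: more uniform
top2 = next_token_probs(logits, top_k=2)            # only the two best tokens survive

random.seed(0)
sampled = random.choices(range(len(logits)), weights=top2, k=1)[0]   # like do_sample=True
greedy = max(range(len(logits)), key=lambda i: top2[i])              # like do_sample=False
print(sharp, flat, top2, sampled, greedy)
```

Greedy decoding always picks the same token for the same context, while sampling trades some predictability for variety; temperature and top-k control how far that trade goes.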
4. Complete Colab code
Reference:
https://ai.plainenglish.io/creating-and-exploring-gpt-from-scratch-ffe84ac415a9