A Discussion and First Practice of LLMs (minGPT Source Code Walkthrough)

Article link: https://blog.csdn.net/jinliuxiacm/article/details/136295692

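Looking ahead

Given how widely large language models are now used, it is natural to wonder what state this trend will eventually settle into in practice.

One way to see it: an LLM is in some respects like a mobile phone, and its end state should be portability for individuals and integration for very large enterprises. Early on, general-purpose AI generalized poorly and was tied closely to specific downstream tasks; manual training and annotation were extremely costly, and building a specialized domain knowledge base required many experts from that domain. Over the long run, AI has therefore remained expert-system AI. But if one domain can connect to others, an expert in that domain may well be a scholar across many domains. Rather than calling today's AI general-purpose, it may be more accurate to say that language itself is the general-purpose domain of knowledge. For now, the most practical form of a general-purpose AI system is probably integration with mobile phones: smart mobile devices will be adapted to run large models locally, and lightweight models will greatly improve the efficiency and quality of human-computer interaction.

Below I walk through the "hello world" code of large language models. Note: this is only my own understanding, so please point out any mistakes. The walkthrough is closer to personal study notes.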

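BPE text processing

BPE (byte pair encoding) is a Huffman-like algorithm that offers a balanced way to feed text to a model. Mapping every character to a token makes the granularity too fine and the model too slow to run. Mapping every word to a token lets rare words occupy rows of the embedding matrix that are almost never used, which lowers overall efficiency, and the vocabulary also becomes too large. BPE instead splits all text into subword units according to how frequently they occur, so the mapping matrix is used fully.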
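To make the frequency-based splitting concrete, here is a minimal sketch of how BPE merges could be learned from a toy corpus. The corpus, the function name `learn_merges` and the number of merges are illustrative assumptions; minGPT itself does not train merges, it only loads the table already published by OpenAI.

```python
from collections import Counter

def learn_merges(words, num_merges):
    """Toy BPE training: repeatedly merge the most frequent adjacent pair."""
    # every word starts out as a tuple of single characters
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # count how often each adjacent pair of symbols occurs
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # rewrite every word with the chosen pair fused into one symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

merges = learn_merges(["low", "low", "lower", "lowest"], num_merges=2)
```

The order in which merges are learned is what later becomes the rank of each pair: earlier merges are more frequent and are applied first at encoding time.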

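First comes the BPE encoder in minGPT.

At a high level, a piece of input text is turned into the sequence of integers fed to the network through the following stages:

1. Split the text into chunks with a regular expression and convert each chunk's bytes into printable stand-in characters; for example ' ' becomes 'Ġ'.

2. Call the converted string 's1'. Split 's1' into characters, then merge them according to the BPE merge table into substrings 's2'.

3. Look up each substring in the existing BPE vocabulary to get its integer index.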

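def bytes_to_unicode():
    # bytes that already print nicely keep their own character
    bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            # remaining bytes (whitespace, control characters) are shifted past U+00FF
            bs.append(b)
            cs.append(2**8+n)
            n += 1
    cs = [chr(n) for n in cs]
    d = dict(zip(bs, cs))
    return d
# maps bytes whose usual representation is awkward to printable, easy-to-read characters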
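A quick check of this mapping, as a sketch: the function is repeated here with the code points written as numbers so that the snippet runs on its own. The variable names `space_char` and `encoded` are only for illustration.

```python
def bytes_to_unicode():
    # printable bytes keep their own character: '!'..'~', '¡'..'¬', '®'..'ÿ'
    bs = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # shift the remaining bytes past U+00FF
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

byte_encoder = bytes_to_unicode()
space_char = byte_encoder[ord(" ")]  # the stand-in character for a space
encoded = "".join(byte_encoder[b] for b in " learner".encode("utf-8"))
```

The space (byte 32) is the 33rd byte without a printable form, so it lands on U+0120, which is the 'Ġ' seen in GPT-2 vocabularies. The mapping is one-to-one over all 256 byte values, which is what makes decoding possible.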

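Below is the main body of the encoder. Pay attention to how the regular expression splits the text. Also note that running the source needs working network access (a proxy may be required in some regions): the BPE files are downloaded from OpenAI's site, and once saved locally they can be used offline.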

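import regex as re  # third-party `regex` package; the standard `re` module does not support \p{L} and \p{N}

class Encoder: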

    def __init__(self, encoder, bpe_merges):
        self.byte_encoder = bytes_to_unicode()
        self.byte_decoder = {v:k for k, v in self.byte_encoder.items()}
        self.encoder = encoder
        self.decoder = {v:k for k,v in self.encoder.items()}
        self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
        self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
        self.cache = {}

    def bpe(self, token):
        if token in self.cache:
            return self.cache[token]
        word = tuple(token) 
        # tuple() splits the token into individual characters
        pairs = get_pairs(word) 

        if not pairs:
            return token

        while True:
            bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))
            # key tells min() what to compare: the rank of each pair. A pair missing from bpe_ranks gets infinity. Note that bpe_ranks is never updated here.
            if bigram not in self.bpe_ranks:
                break 
            first, second = bigram
            new_word = []
            i = 0
            while i < len(word):

                # find the next occurrence of first in the sequence of current words
                try:
                    j = word.index(first, i)
                    new_word.extend(word[i:j])
                    i = j
                except ValueError:  # `first` does not occur again in word[i:]
                    new_word.extend(word[i:])
                    break
                if word[i] == first and i < len(word)-1 and word[i+1] == second:
                    new_word.append(first+second)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1

            # all occurrences of (first, second) have been merged to first_second
            new_word = tuple(new_word)
            word = new_word
            if len(word) == 1:
                break
            else:
                pairs = get_pairs(word)
        word = ' '.join(word)
        self.cache[token] = word
        return word

    def encode(self, text):
        bpe_idx = []
        # split text into a list of tokens with the regular expression
        tokens = re.findall(self.pat, text)
        for token in tokens:
            token_bytes = token.encode('utf-8')
            # str.encode is a built-in Python method that converts the string to bytes
            token_translated = ''.join(self.byte_encoder[b] for b in token_bytes)
            token_merged = self.bpe(token_translated).split(' ')
            token_ix = [self.encoder[bpe_token] for bpe_token in token_merged]
            # extend our running list of all output integers
            bpe_idx.extend(token_ix)
        return bpe_idx
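The `bpe` method relies on a helper `get_pairs` that is not shown above. Below is a minimal sketch of such a helper together with the merge loop, reduced to plain functions. The merge table `toy_ranks` and the helper `merge_once` are made up for illustration and do not reflect the real GPT-2 merges.

```python
def get_pairs(word):
    """Return the set of adjacent symbol pairs in a tuple of symbols."""
    return set(zip(word, word[1:]))

def merge_once(word, bigram):
    """Fuse every occurrence of `bigram` in `word` into a single symbol."""
    first, second = bigram
    out, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and word[i] == first and word[i + 1] == second:
            out.append(first + second)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

# hypothetical merge ranks: a lower rank means the pair is merged earlier
toy_ranks = {("e", "r"): 0, ("l", "e"): 1, ("a", "r"): 2}

word = tuple("learner")
while len(word) > 1:
    pairs = get_pairs(word)
    bigram = min(pairs, key=lambda p: toy_ranks.get(p, float("inf")))
    if bigram not in toy_ranks:
        break  # no remaining pair is mergeable
    word = merge_once(word, bigram)
```

With this toy table the word ends up as the four symbols 'le', 'ar', 'n', 'er'. A different merge table would give a different split, which is why the result depends entirely on the trained merges.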

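Let us take a piece of text as an example:

word="I am llm learner"

After the regular-expression split

it becomes

word=["I"," am"," llm"," learner"]

Note that every chunk except the first keeps its leading space.

Next comes the byte-level conversion.

For example [" learner"]

is converted to ["Ġlearner"] (the Ġ, a G with a dot above, stands in for the space and is distinct from the ordinary letter G).

After that comes the BPE merge.

For simplicity, take ["learner"] without the Ġ: split by character it is ['l','e','a','r','n','e','r']

After merging, one possible result is ["learn","er"]

Mapped to integers this gives [32,3421]

This is only an illustration; it does not mean that learner is really converted to [32,3421].

Understanding the procedure is enough.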
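The splitting step of this example can be tried out directly. minGPT uses the third-party `regex` package because of the `\p{L}` and `\p{N}` classes; the sketch below assumes an ASCII-only approximation of that pattern so that the standard `re` module is enough. It is not the exact GPT-2 pattern.

```python
import re

# ASCII-only approximation: \p{L} -> A-Za-z, \p{N} -> 0-9
pat = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+"
)

tokens = pat.findall("I am llm learner")
# every chunk after the first carries its leading space
```

Joining the chunks back together reproduces the input exactly, which is the property that lets the decoder recover the original text.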

The model

Self-attention

import math
import torch
import torch.nn as nn
from torch.nn import functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        # output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        # regularization
        self.attn_dropout = nn.Dropout(config.attn_pdrop)
        self.resid_dropout = nn.Dropout(config.resid_pdrop)
        # causal mask to ensure that attention is only applied to the left in the input sequence
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                     .view(1, 1, config.block_size, config.block_size))
        self.n_head = config.n_head
        self.n_embd = config.n_embd

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k ,v  = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        att = self.attn_dropout(att)
        y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side
        # (the transpose above leaves the tensor non-contiguous, so .contiguous() is required before .view())
        # output projection
        y = self.resid_dropout(self.c_proj(y))
        return y

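The self-attention mechanism used here comes from the Transformer, which a Google team proposed in 2017 for natural language processing (it was carried over to computer vision later). In GPT it analyses the segments produced by BPE.

We can define things as follows: q is the query vector of the current position and k (key) holds the key vectors of all tokens. Their dot products give the attention scores, which after normalization become the attention weight assigned to each token.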

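Take str=[I,am,a,learner]

When predicting at the current position, we generate a query vector and a key vector for every token.

To simplify, I use single numbers in place of vectors:

q=[1,2,3,4]

k=[5,6,7,8]

So the attention scores of the fourth position are 4*[5,6,7,8]=[20,24,28,32]

We also assume each token has a vector describing its own content, v (value).

Once we have the attention scores, we multiply them with the v (value) of each position and sum the products to get the attention-weighted sum for that position.

v=[9,10,11,12]

products=[180,240,308,384], which sum to 1112

In the actual code the scores are first scaled and passed through softmax, which shrinks them into a probability distribution, before they are multiplied with v.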
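The arithmetic of this example can be reproduced in a few lines. The sketch keeps the scalar stand-ins for q, k and v and also shows the softmax normalization applied in the real code. The scaling by 1/sqrt(head size) is left out because the toy "vectors" are one-dimensional, which makes that factor 1.

```python
import math

q = [1, 2, 3, 4]
k = [5, 6, 7, 8]
v = [9, 10, 11, 12]

# raw scores of the fourth position against every position
scores = [q[3] * kj for kj in k]

# unnormalized score-times-value products, as in the hand calculation
products = [s * vj for s, vj in zip(scores, v)]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax(scores)  # non-negative and sums to 1
output = sum(w * vj for w, vj in zip(weights, v))  # weighted average of v
```

After the softmax the output is a weighted average of the values, so it always stays between the smallest and largest entry of v. Here almost all of the weight falls on the last position because its score is the largest.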

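We can also see from the calculation that the attention weights carry no information about position. To such a model an input sentence is like a sandwich taken apart into loose ingredients, so the input has to be processed first by adding a position embedding.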
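A minimal sketch of that idea, in plain Python rather than PyTorch: each token id and each position index has its own vector, and the two are added. The table sizes, the random initialisation and the names `wte`, `wpe` and `embed` are assumptions for illustration; in a real model both tables are learned.

```python
import random

random.seed(0)
n_embd, vocab_size, block_size = 4, 10, 8

# hypothetical lookup tables, randomly initialised here
wte = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(vocab_size)]
wpe = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(block_size)]

def embed(idx):
    """Sum token and position embeddings for a list of token ids."""
    return [
        [t + p for t, p in zip(wte[tok], wpe[pos])]
        for pos, tok in enumerate(idx)
    ]

x = embed([3, 7, 3])  # token 3 appears at positions 0 and 2
```

Although token 3 occurs twice, its two input vectors differ because the position part differs, which is what lets attention tell the two occurrences apart.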

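On the other hand,

GPT was designed from the start for autoregressive modeling, so the elements after the current token must be masked out to hide information about the future. Fortunately this is not hard: a mask matrix sets the scores in the upper triangle to -inf before the softmax, so their weights become zero.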
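The masking step can be sketched without PyTorch. The function name `causal_softmax` and the 3 x 3 score matrix are illustrative assumptions; the logic mirrors the `masked_fill` followed by `softmax` in the class above.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax over a T x T score matrix with future positions masked."""
    T = len(scores)
    out = []
    for i in range(T):
        # positions j > i are set to -inf, so exp() turns them into exactly 0
        row = [scores[i][j] if j <= i else float("-inf") for j in range(T)]
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

att = causal_softmax([
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0],
])
```

The first row attends only to itself, the second row to the first two positions, and only the last row sees the whole sequence. Masking before the softmax, rather than zeroing afterwards, keeps every row summing to 1.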

Multilayer perceptron (MLP) block

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln2 = nn.LayerNorm(config.n_embd)
        self.mlp = nn.ModuleDict(dict(
            c_fc = nn.Linear(config.n_embd,4*config.n_embd),
            c_proj = nn.Linear(4*config.n_embd,config.n_embd),
            act = NewGELU(),
            dropout = nn.Dropout(config.resid_pdrop)
        ))
        m =self.mlp
        self.mlpf = lambda x: m.dropout(m.c_proj(m.act(m.c_fc(x))))

    def forward(self,x):
        x = x+self.attn(self.ln1(x))
        x = x+self.mlpf(self.ln2(x))
        return x

The activation function used here is

$act(x) = 0.5\,x\left(1+\tanh\left(\sqrt{\frac{2}{\pi}}\left(x+0.044715\,x^{3}\right)\right)\right)$
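For reference, the tanh approximation of GELU can be written as a scalar function. This is a sketch in plain Python; the name `new_gelu` is my own, and the `NewGELU` module in the code applies the same expression element-wise to tensors.

```python
import math

def new_gelu(x):
    """tanh approximation of the GELU activation for a single float."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

samples = [new_gelu(x) for x in (-10.0, 0.0, 1.0, 10.0)]
```

For large positive inputs the function behaves like the identity, for large negative inputs it goes to zero, and around zero it bends smoothly instead of having the sharp corner of ReLU.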

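Layer normalization is applied to the input.

It is computed as (x-avg(x))/(std(x)) over the embedding dimension. nn.LayerNorm additionally adds a small epsilon to the variance for numerical stability and applies a learnable scale and shift.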
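A small sketch of the normalization for one vector, in plain Python. It assumes the biased variance and an epsilon of 1e-5, and it omits the learnable scale and shift, so it corresponds to a freshly initialised layer rather than a trained one.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a list of floats to zero mean and (almost) unit variance."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)  # biased variance
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

y = layer_norm([1.0, 2.0, 3.0, 4.0])
```

The normalization is applied to each token's embedding vector independently, so it does not mix information across positions or across the batch.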

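The data flow can be written as

x -> x + att(ln1(x))

x -> x + mlpf(ln2(x))

The MLP expands the width to four times n_embd, applies the activation, and then shrinks back to the original width. Presumably the wider layer gives the features more room before they are projected back.

Finally, the residual connections add the outputs of both the self-attention and the feed-forward layer back onto the input. This gives the signal a direct path through deep stacks of blocks and, from the backpropagation point of view, helps mitigate vanishing gradients. Exploding gradients are handled separately in the trainer by gradient clipping.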

Generation loop

    def generate(self, idx, max_new_tokens, temperature=1.0, do_sample=False, top_k=None):
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            idx_cond = idx if idx.size(1) <= self.block_size else idx[:, -self.block_size:]
            # forward the model to get the logits for the index in the sequence
            logits, _ = self(idx_cond)
            # pluck the logits at the final step and scale by desired temperature
            logits = logits[:, -1, :] / temperature
            # optionally crop the logits to only the top k options
            if top_k is not None:
                v, _ = torch.topk(logits, top_k)
                logits[logits < v[:, [-1]]] = -float('Inf')
            # apply softmax to convert logits to (normalized) probabilities
            probs = F.softmax(logits, dim=-1)
            # either sample from the distribution or take the most likely element
            if do_sample:
                idx_next = torch.multinomial(probs, num_samples=1)
            else:
                _, idx_next = torch.topk(probs, k=1, dim=-1)
            # append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)

        return idx

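Each step appends the next predicted token to the sequence. temperature controls how "creative" the output is: the larger the temperature, the more uniform the probabilities of all candidate tokens become, and the more varied the possible sentences are. Separately, in the later reinforcement learning stage, different outputs generated by the model are ranked, a model is trained on those rankings under supervision, and that model then drives the reinforcement learning.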
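The effect of temperature can be checked numerically. This sketch divides the logits by the temperature before the softmax, as `generate` does; the logits and the two temperature values are arbitrary choices for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
```

With the low temperature the most likely token takes most of the probability mass, while with the high temperature the three probabilities move closer together. The ranking of the tokens never changes, only how strongly the top one is preferred.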

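Note that top_k here works per token, not per sentence: at every step only the k highest-scoring candidate tokens are kept and the logits of all others are set to -inf, so sampling chooses only among the better candidates.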
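A plain-Python sketch of the top-k filter for a single row of logits. The function name `top_k_filter` is an assumption; the comparison against the k-th largest value follows the same idea as the `logits < v[:, [-1]]` line in `generate`.

```python
def top_k_filter(logits, k):
    """Keep the k largest logits; set the rest to -inf so softmax gives them 0."""
    threshold = sorted(logits, reverse=True)[k - 1]  # the k-th largest value
    return [l if l >= threshold else float("-inf") for l in logits]

filtered = top_k_filter([1.0, 3.0, 0.5, 2.0], k=2)
```

Because the test is "strictly below the threshold", logits tied with the k-th largest value all survive, so slightly more than k candidates can remain when there are ties.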

The trainer

Default configuration

    @staticmethod
    def get_default_config():
        C = CN()
        # device to train on
        C.device = 'auto'
        # dataloader parameters
        C.num_workers = 4
        # optimizer parameters
        C.max_iters = None
        C.batch_size = 64
        C.learning_rate = 3e-4
        C.betas = (0.9, 0.95)
        C.weight_decay = 0.1 # only applied on matmul weights
        C.grad_norm_clip = 1.0
        return C

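In this respect minGPT provides many default configurations of this kind. Its interface stays close to the original OpenAI settings, so the OpenAI model configurations available on Hugging Face can be used to configure the current model. The standard settings and code conventions here are worth learning from.

Callbacks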

    def add_callback(self, onevent: str, callback):
        self.callbacks[onevent].append(callback)

    def set_callback(self, onevent: str, callback):
        self.callbacks[onevent] = [callback]

    def trigger_callbacks(self, onevent: str):
        for callback in self.callbacks.get(onevent, []):
            callback(self)

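Here the author provides a few functions for managing callbacks. In the main training loop, the 'on_batch_end' callbacks are triggered after every batch, exposing the live training state.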
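To show how such callbacks are used, here is a stripped-down sketch. `TinyTrainer` and its `run` method are hypothetical stand-ins that keep only the callback machinery and replace real training with a list of made-up loss values.

```python
from collections import defaultdict

class TinyTrainer:
    """Hypothetical stand-in for the trainer, keeping only the callback logic."""

    def __init__(self):
        self.callbacks = defaultdict(list)
        self.iter_num = 0
        self.loss = None

    def add_callback(self, onevent, callback):
        self.callbacks[onevent].append(callback)

    def trigger_callbacks(self, onevent):
        for callback in self.callbacks.get(onevent, []):
            callback(self)

    def run(self, fake_losses):
        for loss in fake_losses:  # pretend each value is one training batch
            self.loss = loss
            self.trigger_callbacks("on_batch_end")
            self.iter_num += 1

log = []
trainer = TinyTrainer()
trainer.add_callback("on_batch_end", lambda t: log.append((t.iter_num, t.loss)))
trainer.run([2.5, 2.1, 1.8])
```

Each callback receives the trainer itself, so it can read whatever state it needs (iteration number, loss, timing) without the training loop knowing anything about logging or checkpointing.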

Training loop

Training data is loaded from the dataset through a DataLoader, and an iterator over that loader is then created.

    def run(self):
        model, config = self.model, self.config

        # setup the optimizer
        self.optimizer = model.configure_optimizers(config)

        # setup the dataloader
        train_loader = DataLoader(
            self.train_dataset,
            sampler=torch.utils.data.RandomSampler(self.train_dataset, replacement=True, num_samples=int(1e10)),
            shuffle=False,
            pin_memory=True,
            batch_size=config.batch_size,
            num_workers=config.num_workers,
        )

        model.train()
        self.iter_num = 0
        self.iter_time = time.time()
        data_iter = iter(train_loader)
        while True:

            # fetch the next batch (x, y) and re-init iterator if needed
            try:
                batch = next(data_iter)
            except StopIteration:
                data_iter = iter(train_loader)
                batch = next(data_iter)
            batch = [t.to(self.device) for t in batch]
            x, y = batch

            # forward the model
            logits, self.loss = model(x, y)

            # backprop and update the parameters
            model.zero_grad(set_to_none=True)
            self.loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), config.grad_norm_clip)
            self.optimizer.step()

            self.trigger_callbacks('on_batch_end')
            self.iter_num += 1
            tnow = time.time()
            self.iter_dt = tnow - self.iter_time
            self.iter_time = tnow

            # termination conditions
            if config.max_iters is not None and self.iter_num >= config.max_iters:
                break
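The control flow of this loop, fetching batches, restarting the iterator when it is exhausted and stopping after a fixed number of steps, can be isolated in a few lines. The function below is a sketch with an ordinary list standing in for the DataLoader; the forward pass, backward pass and optimizer step are left out.

```python
def run(batches, max_iters):
    """Loop over `batches` repeatedly until `max_iters` steps have been taken."""
    seen = []
    iter_num = 0
    data_iter = iter(batches)
    while True:
        try:
            batch = next(data_iter)
        except StopIteration:
            data_iter = iter(batches)  # start a new pass over the data
            batch = next(data_iter)
        seen.append(batch)  # a real trainer would do forward/backward here
        iter_num += 1
        if max_iters is not None and iter_num >= max_iters:
            break
    return seen

seen = run(["b0", "b1", "b2"], max_iters=5)
```

Training length is therefore measured in iterations rather than epochs. If max_iters is left as None the loop never terminates on its own, so a caller would have to stop it from outside.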