How Does Llama 2 Understand Natural Language?

Harden_J_

Article tags: artificial intelligence

Article link: https://blog.csdn.net/Harden_J_/article/details/139952876

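Since the Transformer architecture was introduced, large language models and AIGC technology have developed rapidly. Taking the Llama 2 7B model as an example, here are some notes from my own study.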

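1 What is Llama 2?

First, a fairly official description: LLaMA 2 (Large Language Model Meta AI 2) is the second-generation large language model developed by Meta (formerly Facebook), designed to offer stronger natural language processing capabilities and a wider range of applications. For a beginner, the most important thing is to understand how Llama 2 interprets natural language and generates an answer.

Llama 2's task is to predict the next token of a sentence given the previous n tokens. The key property of this prediction is that it depends only on past and current input, never on future input: at every step the model uses only the tokens already present (the prompt plus whatever has been generated so far). Generation therefore proceeds left to right, one token at a time, much like a person writing or speaking.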

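2 Processing natural-language input

2.1 token_id

A model can only process numbers, so the input text must first be converted to numbers. Llama 2 handles this with a Tokenizer class built on SentencePiece. The input is typically a piece of text such as: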

["你好世界"]

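The text is split into pieces (subwords, words, or single characters), forming a token sequence. The token sequence is stored as a list or array, and each token is then looked up in the vocabulary and mapped to a unique integer index (token_id) that the model can compute with.

Tokenization -> ['BOS', '你', '好', '世', '界', 'EOS']
Hypothetical vocabulary indexing -> ['BOS', '20', '00', '09', '30', 'EOS']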
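
The two steps above can be sketched in a few lines of plain Python. This is only a toy: the `toy_vocab` table, the `BOS_ID`/`EOS_ID` values, and the `toy_encode` helper are hypothetical and simply mirror the made-up indices above; the real tokenizer uses a trained SentencePiece model.

```python
# Toy illustration only: a hypothetical vocabulary, not the real Llama 2 one.
BOS_ID, EOS_ID = 1, 2
toy_vocab = {"你": 20, "好": 0, "世": 9, "界": 30}


def toy_encode(text, bos=True, eos=True):
    """Map each character to its id and optionally wrap with BOS/EOS ids."""
    ids = [toy_vocab[ch] for ch in text]
    if bos:
        ids = [BOS_ID] + ids
    if eos:
        ids = ids + [EOS_ID]
    return ids


print(toy_encode("你好世界"))  # [1, 20, 0, 9, 30, 2]
```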

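How is this implemented in code?

First load the tokenizer model, then read a few important attributes from it: the vocabulary size (n_words), the beginning-of-sentence id (BOS), and the end-of-sentence id (EOS).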

    import os
    from typing import List

    from sentencepiece import SentencePieceProcessor

    TOKENIZER_MODEL = "tokenizer.model"  # assumed default path to the SentencePiece model file

    # Constructor of the Tokenizer class (class header omitted in this excerpt)
    def __init__(self, tokenizer_model=None):
        model_path = tokenizer_model if tokenizer_model else TOKENIZER_MODEL
        assert os.path.isfile(model_path), model_path
        self.sp_model = SentencePieceProcessor(model_file=model_path)
        self.model_path = model_path

        # vocabulary size and special token IDs (BOS / EOS / PAD)
        self.n_words: int = self.sp_model.vocab_size()
        self.bos_id: int = self.sp_model.bos_id()
        self.eos_id: int = self.sp_model.eos_id()
        self.pad_id: int = self.sp_model.pad_id()
        #print(f"#words: {self.n_words} - BOS ID: {self.bos_id} - EOS ID: {self.eos_id}")
        assert self.sp_model.vocab_size() == self.sp_model.get_piece_size()

With the tokenizer set up, passing text to encode returns the token_ids:

    def encode(self, s: str, bos: bool, eos: bool) -> List[int]:
        assert type(s) is str
        t = self.sp_model.encode(s)
        if bos:
            t = [self.bos_id] + t
        if eos:
            t = t + [self.eos_id]
        return t

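2.2 Token embedding

Token embedding converts the input sequence of integers into high-dimensional feature vectors.

In Llama 2 the vocabulary size is 32000 and the embedding dimension is 4096. As described above, the input text has already been split into pieces and each piece has been looked up in the vocabulary, giving a sequence of token IDs. The embedding layer then maps each integer token to a real-valued embedding vector. Every token's vector has the same fixed dimension dim (4096), and each element (a real number) can be thought of as one learned feature of the token in the embedding space.

The embedding output can be written as a two-dimensional array (a matrix) with one row per token in the sequence, where each row is a vector of fixed dimension. After the embedding layer, the tokens are represented as follows: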

'BOS' -> [p_{0,0}, p_{0,1}, p_{0,2}, ..., p_{0,4095}]
'20'  -> [p_{1,0}, p_{1,1}, p_{1,2}, ..., p_{1,4095}]
'00'  -> [p_{2,0}, p_{2,1}, p_{2,2}, ..., p_{2,4095}]
'09'  -> [p_{3,0}, p_{3,1}, p_{3,2}, ..., p_{3,4095}]
'30'  -> [p_{4,0}, p_{4,1}, p_{4,2}, ..., p_{4,4095}]
'EOS' -> [p_{5,0}, p_{5,1}, p_{5,2}, ..., p_{5,4095}]

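The resulting embedding matrix for the six tokens above has shape [6, 4096]; without the BOS and EOS tokens it would be [4, 4096].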
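
An embedding layer is essentially a table lookup, which can be sketched without any deep-learning library. The sizes (`VOCAB_SIZE = 8`, `DIM = 4`), the random table, and the token ids below are toy assumptions chosen to keep the output readable; in a real model the table has 32000 rows of 4096 learned values.

```python
import random

VOCAB_SIZE = 8   # toy value; the article states 32000 for Llama 2
DIM = 4          # toy value; the article states 4096 for Llama 2

random.seed(0)
# One row of DIM floats per token id; in a real model these rows are learned.
embedding_table = [[random.uniform(-1.0, 1.0) for _ in range(DIM)]
                   for _ in range(VOCAB_SIZE)]


def embed(token_ids):
    """Look up one embedding row per token id."""
    return [embedding_table[t] for t in token_ids]


token_ids = [1, 5, 0, 3, 7, 2]          # six hypothetical ids (BOS ... EOS)
vectors = embed(token_ids)
print(len(vectors), len(vectors[0]))    # 6 4
```

The output shape is (number of tokens, embedding dimension), which is why six tokens at dimension 4096 give a [6, 4096] matrix.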

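3 Transformer model structure

The main model structure of Llama 2 is shown in the figure below:

Most mainstream LLMs today are built on the Transformer, and Llama 2 is no exception. It uses only the decoder part of the Transformer, the so-called decoder-only architecture, and makes the following changes to the decoder:

Llama 2 uses RMSNorm (Root Mean Square Layer Normalization) for layer normalization. Unlike LayerNorm, RMSNorm rescales by the root mean square only and skips mean-centering, which makes the normalization simpler and cheaper to compute while still keeping activations at a stable scale during training.
Llama 2 uses Rotary Positional Embedding (RoPE), which encodes relative position information and is more flexible than the absolute positional encoding of the original Transformer.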

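3.1 RMS Norm

The first step inside a Transformer block is normalization. The input embedding matrix is normalized so that its values stay at a consistent scale (roughly unit variance for LayerNorm, unit root mean square for RMSNorm). This helps speed up training and reduces the risk of vanishing or exploding gradients.

Normalization layers in the Transformer usually use LayerNorm, whose formula is: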

$y = \frac{x-E(x)}{\sqrt{Var(x)+\epsilon }}*\gamma +\beta$

Llama 2 uses RMSNorm (Root Mean Square Layer Normalization) instead. It depends only on the root mean square of the input: the mean is not subtracted, and the bias β is dropped.

$y=\frac{x}{\sqrt{Mean(x^{2})+\epsilon }}*\gamma$

The code is as follows:

import torch
from torch import nn


class RMSNorm(torch.nn.Module):
    def __init__(self, dim: int, eps: float):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def _norm(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x):
        output = self._norm(x.float()).type_as(x)
        return output * self.weight
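
To see what the formula does to actual numbers, here is a small plain-Python sketch of the same computation. The function name `rms_norm`, the default `eps=1e-5`, and the sample vector are assumptions for illustration, and the weight is set to all ones as it is at initialization in the class above.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """Divide x by sqrt(mean(x^2) + eps), then scale elementwise by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


x = [1.0, 2.0, 3.0, 4.0]
y = rms_norm(x, weight=[1.0] * 4)
# mean(x^2) = (1 + 4 + 9 + 16) / 4 = 7.5, so each entry is divided by about sqrt(7.5)
print([round(v, 3) for v in y])
# After normalization the mean of the squares is (almost exactly) 1.
print(sum(v * v for v in y) / len(y))
```

Note that the mean of `y` is not zero: unlike LayerNorm, nothing is subtracted, only the scale changes.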

3.2 Attention

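When a sentence is fed to the computer, the program treats each word as a token, and every token has a word embedding. These initial embeddings carry no context, however. The idea of attention is to apply weights (similarity scores) so that each initial embedding absorbs information from the others, producing the final context-aware embeddings $Y$.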

$Y = softmax(\frac{QK^{T}}{\sqrt{d_{k}}})*V$

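The formula says: multiply the Q matrix by the transpose of the K matrix, divide by $\sqrt{d_{k}}$ to keep the dot products from growing too large, apply softmax, and multiply the result by the V matrix to get the output of the attention layer. The figure below shows a simple attention computation: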

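The figure above covers only basic attention. Multi-Head Attention (MHA) computes queries (Q), keys (K), and values (V) separately for several heads, so that several self-attention operations run in parallel. In Multi-Query Attention (MQA), the queries (Q) still have multiple heads, but there is only one key (K) head and one value (V) head, shared by all query heads. This greatly reduces the space needed for the KV cache, but the reduction in parameters can also lower accuracy.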
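
Before moving on to the grouped variant, the basic single-head formula $Y = softmax(\frac{QK^{T}}{\sqrt{d_{k}}})*V$ can be sketched in plain Python. The sketch also applies a causal mask (each position attends only to itself and earlier positions), matching the left-to-right prediction described in section 1; the helper names and the tiny Q, K, V matrices are made up for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Single-head attention; position i only attends to positions j <= i."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        # Scaled dot products against keys at positions 0..i (the causal mask).
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out.append([sum(w * V[j][c] for j, w in enumerate(weights))
                    for c in range(len(V[0]))])
    return out


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
Y = causal_attention(Q, K, V)
print(Y[0])  # the first position can only see itself: [1.0, 2.0]
```

Every output row is a weighted average of value rows, so later positions blend information from all earlier tokens.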

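Grouped-Query Attention (GQA) was proposed to balance accuracy against efficiency. In GQA the queries (Q) still have multiple heads, but the query heads are divided into groups, and each group shares one key (K) head and one value (V) head. This shrinks the KV cache while keeping most of the parameters, which keeps the loss in accuracy small.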
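
The difference between MHA, GQA, and MQA comes down to which K/V head each query head reads from. The sketch below assumes that consecutive query heads share a K/V head; the helper name `kv_head_for_query_head` and the head counts are hypothetical and only meant to show the grouping.

```python
def kv_head_for_query_head(q_head, n_heads, n_kv_heads):
    """Return the index of the K/V head shared by query head q_head."""
    assert n_heads % n_kv_heads == 0
    n_rep = n_heads // n_kv_heads      # query heads per K/V head
    return q_head // n_rep


n_heads = 8
for name, n_kv_heads in [("MHA", 8), ("GQA", 2), ("MQA", 1)]:
    mapping = [kv_head_for_query_head(h, n_heads, n_kv_heads) for h in range(n_heads)]
    print(name, mapping)
# MHA [0, 1, 2, 3, 4, 5, 6, 7]
# GQA [0, 0, 0, 0, 1, 1, 1, 1]
# MQA [0, 0, 0, 0, 0, 0, 0, 0]
```

With 8 query heads, the KV cache holds 8, 2, or 1 K/V heads respectively, which is where the memory saving comes from.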

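The figure below shows a simple implementation of multi-head attention:

In the figure above, more linear layers are used for the keys, queries, and values. These linear layers are trained in parallel and have independent weights. The values, keys, and queries now each produce 3 outputs instead of one. The 3 pairs of keys and queries yield 3 different sets of weights, which are matrix-multiplied with the 3 values to give 3 outputs. These 3 attention outputs are concatenated to give one final attention output.

The 3 in this demonstration is not a fixed value; it is just an arbitrary number chosen for illustration. In practice there can be any number of these linear layers, each called a "head". In other words, $h$ linear layers provide $h$ attention outputs, which are then concatenated. This is where the name multi-head attention comes from. Its code is as follows: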

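import math

import torch
import torch.nn.functional as F
from torch import nn

# ModelArgs, apply_rotary_emb and repeat_kv are assumed to be defined elsewhere in the same model file.


class Attention(nn.Module):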
    def __init__(self, args: ModelArgs):
        super().__init__()
        self.n_kv_heads = args.n_heads if args.n_kv_heads is None else args.n_kv_heads
        assert args.n_heads % self.n_kv_heads == 0
        model_parallel_size = 1
        self.n_local_heads = args.n_heads // model_parallel_size
        self.n_local_kv_heads = self.n_kv_heads // model_parallel_size
        self.n_rep = self.n_local_heads // self.n_local_kv_heads
        self.head_dim = args.dim // args.n_heads
        self.wq = nn.Linear(args.dim, args.n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(args.dim, self.n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(args.dim, self.n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(args.n_heads * self.head_dim, args.dim, bias=False)
        self.attn_dropout = nn.Dropout(args.dropout)
        self.resid_dropout = nn.Dropout(args.dropout)
        self.dropout = args.dropout

        # use flash attention or a manual implementation?
        self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
        if not self.flash:
            print("WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0")
            mask = torch.full((1, 1, args.max_seq_len, args.max_seq_len), float("-inf"))
            mask = torch.triu(mask, diagonal=1)
            self.register_buffer("mask", mask)

    def forward(
        self,
        x: torch.Tensor,
        freqs_cos: torch.Tensor,
        freqs_sin: torch.Tensor,
    ):
        bsz, seqlen, _ = x.shape

        # QKV
        xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)
        xq = xq.view(bsz, seqlen, self.n_local_heads, self.head_dim)
        xk = xk.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)
        xv = xv.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)

        # RoPE relative positional embeddings
        xq, xk = apply_rotary_emb(xq, xk, freqs_cos, freqs_sin)

        # grouped multiquery attention: expand out keys and values
        xk = repeat_kv(xk, self.n_rep)  # (bs, seqlen, n_local_heads, head_dim)
        xv = repeat_kv(xv, self.n_rep)  # (bs, seqlen, n_local_heads, head_dim)

        # make heads into a batch dimension
        xq = xq.transpose(1, 2)  # (bs, n_local_heads, seqlen, head_dim)
        xk = xk.transpose(1, 2)
        xv = xv.transpose(1, 2)

        # flash implementation
        if self.flash:
            output = torch.nn.functional.scaled_dot_product_attention(xq, xk, xv, attn_mask=None, dropout_p=self.dropout if self.training else 0.0, is_causal=True)
        else:
            # manual implementation
            scores = torch.matmul(xq, xk.transpose(2, 3)) / math.sqrt(self.head_dim)
            assert hasattr(self, 'mask')
            scores = scores + self.mask[:, :, :seqlen, :seqlen]   # (bs, n_local_heads, seqlen, seqlen)
            scores = F.softmax(scores.float(), dim=-1).type_as(xq)
            scores = self.attn_dropout(scores)
            output = torch.matmul(scores, xv)  # (bs, n_local_heads, seqlen, head_dim)

        # restore time as batch dimension and concat heads
        output = output.transpose(1, 2).contiguous().view(bsz, seqlen, -1)

        # final projection into the residual stream
        output = self.wo(output)
        output = self.resid_dropout(output)
        return output

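Note that Llama 2 applies Rotary Positional Embedding (RoPE) to the query (Q) and key (K) inside every attention layer; that is, each time attention is computed, the Q and K of the current layer are position-encoded. Why RoPE? A traditional absolute positional encoding only captures the absolute position of each element in the sequence, whereas with RoPE the attention score between two tokens depends on their relative position, which matters for many natural language processing tasks. In machine translation, for example, the relative positions of the words in a sentence are often more important than their absolute positions.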
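
The relative-position property can be checked numerically. The following is a minimal sketch of the rotation idea, not the article's `apply_rotary_emb`: the `rope` helper, the base of 10000, the pairing of adjacent elements, and the sample vectors are all assumptions for illustration.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair of elements by an angle proportional to pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # i is the element index of the pair, so the frequency is base**(-i/d).
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
# Same offset (3 positions apart) at two different absolute positions.
s1 = dot(rope(q, 5), rope(k, 2))
s2 = dot(rope(q, 105), rope(k, 102))
print(abs(s1 - s2) < 1e-9)  # True
```

Because both vectors are rotated, their dot product depends only on the difference between the two positions, so shifting both tokens by 100 positions leaves the attention score unchanged.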

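3.3 add + rms_norm

This layer combines a residual connection with RMS normalization (Root Mean Square Normalization); its main purpose is to stabilize training and improve model performance. The residual connection is a technique for deep neural networks that was first introduced by ResNet. It adds the input directly to the output through a shortcut (skip connection), which alleviates the vanishing-gradient problem in deep networks: gradients can flow through the network more easily, making it possible to train deeper models. In the forward pass below, the norm is applied to the input of each sublayer (attention, then feed-forward), and the sublayer output is added back to the un-normalized input.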

    def forward(self, x, freqs_cos, freqs_sin):
        h = x + self.attention.forward(self.attention_norm(x), freqs_cos, freqs_sin)
        out = h + self.feed_forward.forward(self.ffn_norm(h))
        return out

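3.4 FeedForward

A FeedForward layer usually consists of two linear transformations and a nonlinear activation function. This implementation uses three linear layers (w1, w2, w3) with F.silu (Sigmoid Linear Unit) as the activation: the output of F.silu(w1(x)) is multiplied elementwise by w3(x) and then projected back by w2. The nonlinear activation introduces nonlinearity, which increases the model's expressive power so it can learn and represent more complex patterns and features.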

$SiLU(x)=x*Sigmoid(x)=\frac{x}{1+e^{-x}}$

class FeedForward(nn.Module):
    def __init__(self, dim: int, hidden_dim: int, multiple_of: int, dropout: float):
        super().__init__()
        if hidden_dim is None:
            hidden_dim = 4 * dim
            hidden_dim = int(2 * hidden_dim / 3)
            hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.dropout(self.w2(F.silu(self.w1(x)) * self.w3(x)))
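
Two small calculations make this block more concrete: the value of SiLU at a couple of points, and the hidden size that the constructor derives when hidden_dim is None. In the sketch below, `dim = 4096` comes from the article, while `multiple_of = 256` and the helper names are assumptions for illustration.

```python
import math


def silu(x):
    """SiLU(x) = x * sigmoid(x) = x / (1 + e^(-x))."""
    return x / (1.0 + math.exp(-x))


def swiglu_hidden_dim(dim, multiple_of):
    """Reproduce the default hidden size used when hidden_dim is None."""
    hidden_dim = 4 * dim
    hidden_dim = int(2 * hidden_dim / 3)
    # Round up to the nearest multiple of multiple_of.
    return multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)


print(silu(0.0))                     # 0.0
print(swiglu_hidden_dim(4096, 256))  # 11008
```

The 2/3 factor keeps the parameter count of the three-matrix layer close to that of a two-matrix layer with hidden size 4 * dim: 4 * 4096 = 16384, two thirds of that is 10922, and rounding up to a multiple of 256 gives 11008.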

4 Autoregressive generation

Llama 2 generates text autoregressively, producing the next token over and over in a loop. The code is as follows:

    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
        Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
        the sequence max_new_tokens times, feeding the predictions back into the model each time.
        Most likely you'll want to make sure to be in model.eval() mode of operation for this.
        Also note this is a super inefficient version of sampling with no key/value cache.
        """
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            idx_cond = idx if idx.size(1) <= self.params.max_seq_len else idx[:, -self.params.max_seq_len:]
            # forward the model to get the logits for the index in the sequence
            logits = self(idx_cond)
            logits = logits[:, -1, :] # crop to just the final time step
            if temperature == 0.0:
                # "sample" the single most likely index
                _, idx_next = torch.topk(logits, k=1, dim=-1)
            else:
                # pluck the logits at the final step and scale by desired temperature
                logits = logits / temperature
                # optionally crop the logits to only the top k options
                if top_k is not None:
                    v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                    logits[logits < v[:, [-1]]] = -float('Inf')
                # apply softmax to convert logits to (normalized) probabilities
                probs = F.softmax(logits, dim=-1)
                idx_next = torch.multinomial(probs, num_samples=1)
            # append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)

        return idx

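In the generate function:

idx: the input conditioning sequence, with shape (b, t), where b is the batch size and t is the sequence length.
max_new_tokens: the number of new tokens to generate.
temperature: controls the randomness of generation; the default is 1.0.
top_k: restricts sampling to the k most probable tokens.

While reading this code I was curious what "controlling the randomness of generation" means. Put simply, it means adjusting the balance between determinism and diversity when the model chooses the next token: whether the output leans toward high-probability tokens (determinism) or explores more of the vocabulary (diversity). This is usually controlled through two main parameters, temperature and top-k sampling:

Temperature is a parameter that affects the randomness of generation. It works by scaling the logits (the model's unnormalized output scores) before softmax. The formula is: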

$logits= \frac{logits}{temperature}$

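High temperature (> 1):
- Makes the probability distribution after softmax flatter.
- Gives low-probability tokens a better chance of being chosen, which increases the diversity of the generated text.
Low temperature (< 1):
- Makes the probability distribution after softmax sharper.
- Favors high-probability tokens, which makes the generated text more deterministic.
temperature = 1:
- Leaves the logits unchanged, keeping the model's original probabilities.
temperature = 0:
- Removes randomness entirely and always picks the most probable token (greedy decoding). Because dividing by zero is undefined, the code handles this case separately with torch.topk(logits, k=1).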
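
The effect of temperature is easy to see on a small example. This is a plain-Python sketch with three hypothetical logits; the function name `softmax_with_temperature` is an assumption for illustration and is not part of the generate function above.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply a numerically stable softmax."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]   # hypothetical scores for three tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises from 0.5 to 2.0, the probability of the top token shrinks and the probabilities of the other tokens grow, which is the flattening described above.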

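Top-k sampling is another way to control the randomness of generation. At each step it restricts the model to the k most probable tokens and ignores all the others.

The top-k sampling procedure:
- Compute the logits for all tokens.
- Keep the k tokens with the highest logits and set the logits of all others to negative infinity (so they can never be chosen).
- Sample from the k remaining tokens.
Effect:
- Restricts the model's choices and avoids generating low-probability tokens.
- By adjusting k, a balance between determinism and diversity can be found.
- A smaller k increases determinism; a larger k increases diversity.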
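
The three steps of top-k sampling can be sketched in plain Python as follows. The `top_k_sample` helper and the five logits are hypothetical; as in the generate function above, tokens tied with the k-th largest logit are kept.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Keep the k largest logits, renormalize with softmax, and sample one index."""
    k = min(k, len(logits))
    threshold = sorted(logits, reverse=True)[k - 1]
    # Everything below the k-th largest logit gets probability zero.
    masked = [v if v >= threshold else float("-inf") for v in logits]
    m = max(masked)
    exps = [math.exp(v - m) for v in masked]   # exp(-inf) evaluates to 0.0
    return rng.choices(range(len(logits)), weights=exps, k=1)[0]


logits = [3.0, 1.5, 0.2, -1.0, -2.5]   # hypothetical scores for five tokens
rng = random.Random(0)
samples = {top_k_sample(logits, k=2, rng=rng) for _ in range(200)}
print(samples <= {0, 1})  # True: only the two highest-scoring tokens can appear
```

With k=1 this reduces to greedy decoding, and with k equal to the vocabulary size it is ordinary sampling from the full distribution.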