Why LLMs (large language models) use left padding

小鸡不简单

Last modified 2023-09-26 11:13:25

Tags: artificial intelligence

First published 2023-09-25 14:15:51

Article link: https://blog.csdn.net/qq_50097745/article/details/133269245

References: Why current LLM uses left padding? | Trace Logits

transformer - While fine-tuning a decoder only LLM like LLaMA on chat dataset, what kind of padding should one use? - Artificial Intelligence Stack Exchange

https://github.com/huggingface/transformers/issues/14521

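Conclusion first: why left padding?

1. Most generative models (causal language models such as the GPT series and LLaMA) are implemented this way, so you pad on the left to match them.

2. During generation, the decoding loop always uses the logits at the last position to predict the next token. With right padding, that last position holds [pad], so the next token is sampled from the logits of a pad token, which can corrupt the output.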

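For example, take the input: I like eating apples [pad] [pad]

Expected generation: I like eating apples, because they are delicious.

Output with right padding: I like eating apples [pad] [pad], because they are delicious.

Output with left padding: [pad] [pad] I like eating apples, because they are delicious.

With right padding, [pad] ends up stuck in the middle of the text, so the model's continuation can be poor.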
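The effect in the example above can be reproduced with a small sketch. It assumes a generation loop that simply concatenates new tokens to the end of each row; the helper names (pad_prompt, append_continuation) are made up for illustration and the continuation is hard-coded rather than produced by a model.

```python
PAD = "[pad]"

def pad_prompt(tokens, length, side):
    """Pad a token list to `length` on the given side ("left" or "right")."""
    padding = [PAD] * (length - len(tokens))
    return padding + tokens if side == "left" else tokens + padding

def append_continuation(padded_prompt, continuation):
    """Mimic a loop that concatenates generated tokens at the end of the row."""
    return padded_prompt + continuation

prompt = ["I", "like", "eating", "apples"]
# Hard-coded continuation: no model is involved in this sketch.
continuation = [",", "because", "they", "are", "delicious", "."]

right = append_continuation(pad_prompt(prompt, 6, "right"), continuation)
left = append_continuation(pad_prompt(prompt, 6, "left"), continuation)

print(" ".join(right))  # pads sit between the prompt and the continuation
print(" ".join(left))   # pads sit before the prompt; the text stays contiguous
```

With left padding the real tokens and the generated tokens form one unbroken span, which is what a causal model saw during training.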

Some implementations, such as FAIR's LLaMA 2 generation code, do use right padding, but the implementation is relatively more complex:

def generate(...):
    ...
    pad_id = self.tokenizer.pad_id
    tokens = torch.full((bsz, total_len), pad_id, dtype=torch.long, device="cuda")
    # right padding: each prompt is copied into the left part of its row
    for k, t in enumerate(prompt_tokens):
        tokens[k, : len(t)] = torch.tensor(t, dtype=torch.long, device="cuda")
    ...
        # only part of the prompt is fed to the model
        logits = self.model.forward(tokens[:, prev_pos:cur_pos], prev_pos)
        ...
        next_token = torch.argmax(logits[:, -1], dim=-1)
        ...
        next_token = next_token.reshape(-1)
        # only replace token if prompt has already been generated
        next_token = torch.where(
            input_text_mask[:, cur_pos], tokens[:, cur_pos], next_token
        )
        # so that only padded positions are replaced with newly generated tokens
        tokens[:, cur_pos] = next_token
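To make the idea behind the excerpt easier to follow, here is a pure-Python sketch of the same trick. It is not the LLaMA code: toy_next_token is a hypothetical stand-in for the model (it returns the last token plus one), and the names PAD_ID, is_prompt and generate_right_padded are assumptions chosen for this illustration. The loop starts at the end of the shortest prompt and only overwrites positions that hold padding.

```python
PAD_ID = -1

def toy_next_token(prefix):
    """Hypothetical stand-in for a model: the next token is last token + 1."""
    return prefix[-1] + 1

def generate_right_padded(prompts, max_new_tokens):
    min_len = min(len(p) for p in prompts)
    total_len = max(len(p) for p in prompts) + max_new_tokens
    # Right padding: each prompt occupies the left part of its row.
    tokens = [p + [PAD_ID] * (total_len - len(p)) for p in prompts]
    # True where a position holds a prompt token rather than padding.
    is_prompt = [[t != PAD_ID for t in row] for row in tokens]
    # Start at the end of the shortest prompt, so the prefix fed to the
    # "model" never contains a pad token.
    for cur_pos in range(min_len, total_len):
        for row, mask in zip(tokens, is_prompt):
            predicted = toy_next_token(row[:cur_pos])
            # Keep the prompt token while cur_pos is still inside the prompt;
            # only padded positions receive the prediction.
            if not mask[cur_pos]:
                row[cur_pos] = predicted
    return tokens

out = generate_right_padded([[10, 11, 12], [50]], max_new_tokens=2)
print(out)
```

The extra bookkeeping (a prompt mask plus a position-by-position loop over the whole batch) is the complexity that left padding avoids.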

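First, why is padding needed at all?

A neural network usually takes a matrix as input, and matrix operations greatly speed up training. If the sentences in a batch have different lengths, the batch is not a rectangular matrix, so the shorter sentences must be padded (for example with zeros) to a common length.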
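As a minimal sketch of that step, the function below pads lists of token ids to a rectangular batch on either side and builds the matching attention mask. The name pad_batch, the pad id 0 and the example ids are assumptions for illustration, not taken from any particular library.

```python
def pad_batch(sequences, pad_id=0, side="left"):
    """Pad variable-length token-id lists to a rectangular batch.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pads, ones = [pad_id] * n_pad, [1] * len(seq)
        if side == "left":
            input_ids.append(pads + list(seq))
            attention_mask.append([0] * n_pad + ones)
        else:
            input_ids.append(list(seq) + pads)
            attention_mask.append(ones + [0] * n_pad)
    return input_ids, attention_mask

batch = [[11, 12, 13, 14], [21, 22]]
left_ids, left_mask = pad_batch(batch, side="left")
right_ids, right_mask = pad_batch(batch, side="right")
print(left_ids, left_mask)
print(right_ids, right_mask)
```

Every row now has the same length, so the batch can be turned into a single tensor; the two sides differ only in where the real tokens end up.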

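How is a token sampled?

The precise answer is: it depends on the generation function you use.

In most implementations, however, you have to pad on the left.

Here is a GPT-2 implementation: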

import torch
import torch.nn.functional as F
from tqdm import trange

# Note: top_k_logits is not defined in this excerpt; it is assumed to keep
# only the k largest logits in each row.
def sample_sequence(model, length, start_token=None, batch_size=None, context=None, temperature=1, top_k=0, device='cuda', sample=True, enc=None):
    if start_token is None:
        # if start_token is None, use context
        assert context is not None, 'Specify exactly one of start_token and context!'
        context = torch.tensor(context, device=device, dtype=torch.long).unsqueeze(
            0).repeat(batch_size, 1)
    else:
        # if start_token isn't None, use start_token as the beginning of each sentence
        assert context is None, 'Specify exactly one of start_token and context!'
        context = torch.full((batch_size, 1), start_token,
                             device=device, dtype=torch.long)
    prev = context
    output = context
    # past is the KV cache
    past = None
    with torch.no_grad():
        for i in trange(length):  # generate `length` tokens for all sentences
            logits, past = model(prev, past=past)

            # logits.shape = [batch, text, vocab_size]; in a causal model the logits of the
            # last token in each sentence are used to predict the next token, so pick `-1` here
            logits = logits[:, -1, :] / temperature
            logits = top_k_logits(logits, k=top_k)
            # softmax returns probabilities (not log-probabilities)
            probs = F.softmax(logits, dim=-1)
            if sample:
                prev = torch.multinomial(probs, num_samples=1)
            else:
                _, prev = torch.topk(probs, k=1, dim=-1)

            # concatenate the sampled tokens to the original sentences,
            # e.g. output = [I have] and sampled `an`
            # output = [I have an]
            output = torch.cat((output, prev), dim=1)

    return output
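The sampling step inside the loop (temperature scaling, top-k filtering, softmax, then a random draw) can be sketched without any deep-learning library. The functions below are illustrative stand-ins written for this article, not the helpers from the GPT-2 script: top_k_filter, softmax and sample_next are assumed names, and they operate on a plain list of logits for a single row.

```python
import math
import random

def top_k_filter(logits, k):
    """Keep the k largest logits and set the rest to -inf; k == 0 disables it."""
    if k == 0:
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else float("-inf") for x in logits]

def softmax(logits):
    """Numerically stable softmax; -inf entries get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature=1.0, top_k=0, rng=random):
    """Draw one token index from the logits of the last position."""
    scaled = [x / temperature for x in logits]
    probs = softmax(top_k_filter(scaled, top_k))
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

last_logits = [1.0, 3.0, 2.0, 0.5]
probs = softmax(top_k_filter(last_logits, 2))
print(probs)
print(sample_next(last_logits, temperature=0.7, top_k=2))
```

Whatever row of logits is passed in fully determines the draw, which is why it matters that the row belongs to the last real token and not to a pad.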

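Note the line logits = logits[:, -1, :] in sample_sequence. With right padding, a row in the batch looks like: I have an apple. [pad] [pad] ...

The algorithm always takes the logits of the last position to predict the next token, so with right padding the model is effectively using the logits of [pad] to predict the next token. The attention mask only keeps other tokens from attending to the [pad] positions. As soon as sampling starts from a [pad] position, torch.multinomial is fed the wrong logits and the predicted next token is incorrect. A similar problem is discussed in the transformers GitHub issue listed in the references.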
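The indexing problem can be seen without a model. The sketch below uses made-up helper names (last_position_token, last_real_index) and token strings instead of ids; it only shows which token the position -1 refers to under each padding side.

```python
PAD = "[pad]"

def last_position_token(row):
    """The token at position -1, i.e. the one logits[:, -1, :] belongs to."""
    return row[-1]

def last_real_index(attention_mask_row):
    """Index of the last real token, read from a 1/0 attention mask."""
    return max(i for i, m in enumerate(attention_mask_row) if m == 1)

right_row = ["I", "have", "an", "apple", ".", PAD, PAD]
right_mask = [1, 1, 1, 1, 1, 0, 0]
left_row = [PAD, PAD, "I", "have", "an", "apple", "."]
left_mask = [0, 0, 1, 1, 1, 1, 1]

print(last_position_token(right_row))           # a pad token
print(last_position_token(left_row))            # the last real token
print(right_row[last_real_index(right_mask)])   # per-row lookup needed on the right
```

With right padding, a correct implementation has to look up a different index for every row; with left padding, position -1 is the last real token for the whole batch.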
