Preface
Before reading this article, you should be familiar with the commonly used torch operators, listed below (a short demo follows the list):

- `torch.where(condition, x, y)` — select elements from two tensors based on a condition
  - `condition`: the boolean condition
  - `x`: value taken where `condition` is True
  - `y`: value taken where `condition` is False
- `Tensor.scatter_(dim, index, src)` — in-place version of `torch.scatter`
- `torch.scatter(input, dim, index, src)` — writes the values from `src` into a copy of `input` at the positions given by `index`
  - `input`: the tensor to write into
  - `dim`: the dimension along which to index
  - `index`: the indices of the elements to write
  - `src`: the source values
- `torch.gather(input, dim, index, *, sparse_grad=False, out=None)` — the inverse operation of `torch.scatter`: reads values from `input` at the positions given by `index`
  - `input`: the source tensor
  - `dim`: the dimension along which to index
  - `index`: the indices of the elements to read
  - `out`: the output tensor
- `torch.sort(input, dim=-1, descending=False, stable=False, out=None)` — sort
  - `dim`: the dimension to sort along
  - `descending`: True for descending order, False for ascending
- `torch.softmax(input, dim)` — softmax over a dimension
- `torch.cumsum(input, dim, dtype=None)` — cumulative sum
  - `input`: the input tensor
  - `dim`: the dimension to accumulate along
- `Tensor.masked_fill(mask, value)` — fill positions with a value
  - `mask`: a boolean tensor (older versions also accepted 0/1 integer tensors)
  - `value`: the fill value
  - positions where `mask` is True are set to `value`
- `torch.topk(input, k, dim=None, largest=True, sorted=True, *, out=None)` — returns the k largest values (and their indices)
- `torch.multinomial(input, num_samples, replacement=False, *, generator=None, out=None)` — draw samples from a probability distribution
  - `num_samples`: the number of samples to draw; note that the highest-probability value is not necessarily the one drawn, which adds randomness
- `torch.div(input, other, *, rounding_mode=None, out=None)` — division
  - `input`: the input tensor
  - `other`: a scalar, or a tensor with the same shape as `input`
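A minimal sketch exercising these operators, with made-up values purely for illustration:

```python
import torch

scores = torch.tensor([[1.0, 3.0, 2.0, 0.5]])

# topk: the 2 largest scores and their indices
top_vals, top_idx = torch.topk(scores, k=2)          # [[3.0, 2.0]], [[1, 2]]

# gather / scatter are inverses along a dimension
picked = torch.gather(scores, 1, top_idx)            # [[3.0, 2.0]]
filled = torch.zeros_like(scores).scatter(1, top_idx, picked)

# where / masked_fill: conditional selection and filling
masked = scores.masked_fill(scores < 2.0, float("-inf"))
clipped = torch.where(scores > 2.0, scores, torch.zeros_like(scores))

# softmax + cumsum: cumulative probability mass, as used by top-p filtering
probs = torch.softmax(scores, dim=-1)
cum = probs.cumsum(dim=-1)

# multinomial: sample an index according to the probabilities
sample = torch.multinomial(probs, num_samples=1)
```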
An example of producing model output:
```python
response, history = self.model.chat(
    self.tokenizer,
    prompt,
    history=[],
    max_length=8192,
    top_p=1, do_sample=False,
    temperature=0.001)
```
In fact, every large model inherits from `PreTrainedModel`, which in turn mixes in `GenerationMixin` (alongside `nn.Module`), so at prediction time it is `GenerationMixin`'s `generate` method that gets called. The inheritance chain is: concrete model class → `PreTrainedModel` → `GenerationMixin`.
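A quick way to confirm this in an interactive session (note this holds for the transformers versions this post covers; later versions moved the mixin onto the model classes themselves):

```python
from transformers import PreTrainedModel
from transformers.generation import GenerationMixin

# True in the versions this post covers
print(issubclass(PreTrainedModel, GenerationMixin))
```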
When generating an answer, the model mainly uses one of the following search strategies:
Contrastive Search
Skipped here.
Multinomial sampling
In contrast to greedy search, which always picks the highest-probability token as the next token, multinomial sampling randomly selects the next token according to the probability distribution over the entire vocabulary given by the model.
All it takes is setting do_sample to True:
```python
outputs = model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
```
Reading the source
```python
while True:
    # forward pass to get next token
    outputs = self(
        **model_inputs,
        return_dict=True,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
    )
    # logits for the last position: (batch_size, vocab_size)
    next_token_logits = outputs.logits[:, -1, :]
    # apply logits processors (e.g. repetition penalty) and warpers (temperature, top-k, top-p)
    next_token_scores = logits_processor(input_ids, next_token_logits)
    next_token_scores = logits_warper(input_ids, next_token_scores)
    # turn scores into probabilities and sample one token per sequence
    probs = nn.functional.softmax(next_token_scores, dim=-1)
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
```
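For intuition, a standalone toy version of the same idea (the logits are made up):

```python
import torch

torch.manual_seed(0)
logits = torch.tensor([[2.0, 1.0, 0.5, 0.1]])      # toy next-token logits
probs = torch.softmax(logits, dim=-1)

greedy = torch.argmax(probs, dim=-1)               # always token 0
sampled = torch.multinomial(probs, num_samples=1)  # usually token 0, but not always
```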
Implementing Beam Search
Beam search is an improvement over the greedy strategy. The idea is simple: slightly widen the range of candidates under consideration. At each time step, instead of keeping only the single highest-scoring output, keep the top num_beams. When num_beams=1, beam search degenerates into greedy search.
The original post illustrates this with a figure: at each time step there are 5 possible outputs, A through E, and num_beams=2, meaning that at every step the 2 sequences with the best conditional probability so far are kept. A toy numeric version of one such step follows.
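A toy numeric sketch of one expansion step, assuming a 5-token vocabulary A–E and num_beams=2 (all scores made up):

```python
import torch

# log-probabilities of the 2 current beams over a 5-token vocabulary (made-up values)
num_beams, vocab_size = 2, 5
beam_scores = torch.tensor([-0.5, -1.0])                      # scores of the 2 kept beams
step_logprobs = torch.log_softmax(torch.randn(num_beams, vocab_size), dim=-1)

# total score of every (beam, token) continuation, flattened to one row
total = (beam_scores.unsqueeze(1) + step_logprobs).view(1, num_beams * vocab_size)

# keep the best 2 continuations overall
best_scores, best_flat = torch.topk(total, k=num_beams, dim=1)
beam_idx = torch.div(best_flat, vocab_size, rounding_mode="floor")  # which beam each winner extends
token_idx = best_flat % vocab_size                                  # which token (0=A ... 4=E) extends it
```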
To use beam search, the following call is all that is needed:

```python
outputs = model.generate(**inputs, num_beams=5, max_new_tokens=50)
```
Reading the source
The code is quite elegant.
```python
# expand the input so that each beam gets its own copy; num_beams results per step
input_ids = input_ids.repeat_interleave(expand_size, dim=0)

while True:
    outputs = self(
        **model_inputs,
        return_dict=True,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
    )
    next_token_logits = outputs.logits[:, -1, :]
    next_token_scores = nn.functional.log_softmax(
        next_token_logits, dim=-1
    )  # (batch_size * num_beams, vocab_size)

    # reshape for beam search: one row per batch item, num_beams * vocab_size candidates
    vocab_size = next_token_scores.shape[-1]
    next_token_scores = next_token_scores.view(batch_size, num_beams * vocab_size)

    # Sample 1 + len(eos_token_id) next tokens for each beam so we have at least 1 non eos token per beam.
    n_eos_tokens = len(eos_token_id) if eos_token_id else 0
    next_token_scores, next_tokens = torch.topk(
        next_token_scores, max(2, 1 + n_eos_tokens) * num_beams, dim=1, largest=True, sorted=True
    )
    # recover which beam each candidate came from, and which vocabulary token it is
    next_indices = torch.div(next_tokens, vocab_size, rounding_mode="floor")
    next_tokens = next_tokens % vocab_size

    # stateless: the scorer decides which beams survive this step
    beam_outputs = beam_scorer.process(
        input_ids,
        next_token_scores,
        next_tokens,
        next_indices,
        pad_token_id=pad_token_id,
        eos_token_id=eos_token_id,
        beam_indices=beam_indices,
        decoder_prompt_len=decoder_prompt_len,
    )
    beam_scores = beam_outputs["next_beam_scores"]
    beam_next_tokens = beam_outputs["next_beam_tokens"]
    beam_idx = beam_outputs["next_beam_indices"]

    # append the chosen token to each surviving beam
    input_ids = torch.cat([input_ids[beam_idx, :], beam_next_tokens.unsqueeze(-1)], dim=-1)

# once the sentence is fully predicted, merge the results of the beams
sequence_outputs = beam_scorer.finalize(
    input_ids,
    beam_scores,
    next_tokens,
    next_indices,
    pad_token_id=pad_token_id,
    eos_token_id=eos_token_id,
    max_length=stopping_criteria.max_length,
    beam_indices=beam_indices,
    decoder_prompt_len=decoder_prompt_len,
)
```
The flow (a compact toy version follows the list):

- Expand the input into `num_beams` copies:
  `input_ids = input_ids.repeat_interleave(expand_size, dim=0)`
- Run the model forward:
  `outputs = self(**model_inputs, return_dict=True, output_attentions=output_attentions, output_hidden_states=output_hidden_states)`
- Take the best candidates for each batch item:
  `next_token_logits = outputs.logits[:, -1, :]` ... `next_token_scores, next_tokens = torch.topk(next_token_scores, max(2, 1 + n_eos_tokens) * num_beams, dim=1, largest=True, sorted=True)`
- Hand the candidates to the scorer, which records them and decides which beams survive:
  `beam_outputs = beam_scorer.process(input_ids, next_token_scores, next_tokens, next_indices, pad_token_id=pad_token_id, eos_token_id=eos_token_id, beam_indices=beam_indices, decoder_prompt_len=decoder_prompt_len)`
- Append the best k tokens to `input_ids`:
  `input_ids = torch.cat([input_ids[beam_idx, :], beam_next_tokens.unsqueeze(-1)], dim=-1)`
- When prediction is finished, merge the beams:
  `sequence_outputs = beam_scorer.finalize(input_ids, beam_scores, next_tokens, next_indices, pad_token_id=pad_token_id, eos_token_id=eos_token_id, max_length=stopping_criteria.max_length, beam_indices=beam_indices, decoder_prompt_len=decoder_prompt_len)`
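Here is that toy version: a self-contained sketch tying the steps together. `fake_logits` stands in for a model forward pass; everything is illustrative, not the transformers implementation:

```python
import torch

def fake_logits(seqs: torch.Tensor, vocab_size: int) -> torch.Tensor:
    # stand-in for a model forward pass: random logits per sequence
    return torch.randn(seqs.size(0), vocab_size)

def toy_beam_search(prompt: torch.Tensor, num_beams: int = 2,
                    vocab_size: int = 5, steps: int = 3) -> torch.Tensor:
    # step 1: expand the prompt into num_beams copies
    seqs = prompt.repeat_interleave(num_beams, dim=0)   # (num_beams, prompt_len)
    beam_scores = torch.zeros(num_beams)
    beam_scores[1:] = -1e9   # only beam 0 is "live" at first, so beams don't start identical

    for _ in range(steps):
        # step 2: model forward, then log-probabilities
        logprobs = torch.log_softmax(fake_logits(seqs, vocab_size), dim=-1)
        # step 3: total score of every (beam, token) pair, then top candidates
        total = (beam_scores.unsqueeze(1) + logprobs).view(1, -1)
        best_scores, best_flat = torch.topk(total, k=num_beams, dim=1)
        beam_idx = torch.div(best_flat, vocab_size, rounding_mode="floor").squeeze(0)
        tokens = (best_flat % vocab_size).squeeze(0)
        # steps 4-5: keep the winning beams and append the chosen tokens
        beam_scores = best_scores.squeeze(0)
        seqs = torch.cat([seqs[beam_idx], tokens.unsqueeze(-1)], dim=-1)

    # step 6: "finalize" by returning the highest-scoring beam
    return seqs[beam_scores.argmax()]

print(toy_beam_search(torch.tensor([[0]])))
```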
Because multiple situations must be handled (there may be more than one batch item), the code below adds finished hypotheses (ones that produced eos_token_id) to _beam_hyps:
```python
class BeamSearchScorer(BeamScorer):
    def __init__(self):  # (excerpt: constructor arguments omitted)
        # one BeamHypotheses container per (batch item, beam group)
        self._beam_hyps = [
            BeamHypotheses(
                num_beams=self.group_size,
                length_penalty=self.length_penalty,
                early_stopping=self.do_early_stopping,
                max_length=max_length,
            )
            for _ in range(batch_size * self.num_beam_groups)
        ]

    def process(self) -> Dict[str, torch.Tensor]:  # (excerpt: arguments omitted)
        batch_size = len(self._beam_hyps) // self.num_beam_groups
        device = input_ids.device

        next_beam_scores = torch.zeros((batch_size, self.group_size), dtype=next_scores.dtype, device=device)
        next_beam_tokens = torch.zeros((batch_size, self.group_size), dtype=next_tokens.dtype, device=device)
        next_beam_indices = torch.zeros((batch_size, self.group_size), dtype=next_indices.dtype, device=device)

        if isinstance(eos_token_id, int):
            eos_token_id = [eos_token_id]

        for batch_idx in range(batch_size):
            # index of this batch item's hypothesis group (defined in the full source)
            batch_group_idx = batch_idx * self.num_beam_groups + self.group_index
            beam_idx = 0
            for beam_token_rank, (next_token, next_score, next_index) in enumerate(
                zip(next_tokens[batch_idx], next_scores[batch_idx], next_indices[batch_idx])
            ):
                batch_beam_idx = batch_idx * self.group_size + next_index
                # add to generated hypotheses if end of sentence
                if (eos_token_id is not None) and (next_token.item() in eos_token_id):
                    # if beam_token does not belong to top num_beams tokens, it should not be added
                    is_beam_token_worse_than_top_num_beams = beam_token_rank >= self.group_size
                    if is_beam_token_worse_than_top_num_beams:
                        continue
                    if beam_indices is not None:
                        beam_index = beam_indices[batch_beam_idx]
                        beam_index = beam_index + (batch_beam_idx,)
                    else:
                        beam_index = None
                    self._beam_hyps[batch_group_idx].add(
                        input_ids[batch_beam_idx].clone(),
                        next_score.item(),
                        beam_indices=beam_index,
                        generated_len=cur_len - decoder_prompt_len,
                    )
```
BeamHypotheses implements adding a beam; if the count exceeds the configured num_beams, the worst-scoring entry is dropped.
If you ever have a similar requirement, it can be implemented like this (a heap-based alternative is sketched after the excerpt):
```python
class BeamHypotheses:
    def __len__(self):
        return len(self.beams)

    def add(self):  # (excerpt: score, hyp, beam_indices arguments omitted)
        # accept the hypothesis if there is room, or if it beats the current worst
        if len(self) < self.num_beams or score > self.worst_score:
            self.beams.append((score, hyp, beam_indices))
            if len(self) > self.num_beams:
                # over capacity: drop the worst-scoring hypothesis
                sorted_next_scores = sorted([(s, idx) for idx, (s, _, _) in enumerate(self.beams)])
                del self.beams[sorted_next_scores[0][1]]
                self.worst_score = sorted_next_scores[1][0]
            else:
                self.worst_score = min(score, self.worst_score)
```
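For the same "keep the k best items" need, Python's heapq gives a minimal sketch (the BestK class here is hypothetical, not part of transformers):

```python
import heapq

class BestK:
    """Keep the k highest-scoring items using a min-heap."""

    def __init__(self, k: int):
        self.k = k
        self.heap: list[tuple[float, str]] = []   # smallest score sits at heap[0]

    def add(self, score: float, item: str) -> None:
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (score, item))
        elif score > self.heap[0][0]:
            # better than the current worst: replace it
            heapq.heapreplace(self.heap, (score, item))

best = BestK(k=2)
for s, h in [(-1.2, "A"), (-0.4, "B"), (-0.9, "C")]:
    best.add(s, h)
print(sorted(best.heap, reverse=True))   # [(-0.4, 'B'), (-0.9, 'C')]
```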
The code below shows how to collect several results when there are multiple batch items, where:
- i loops over the batch items
- j loops over the number of answers to keep for each question (num_beam_hyps_to_keep)
```python
def finalize(
    self,
    input_ids: torch.LongTensor,
    final_beam_scores: torch.FloatTensor,
    final_beam_tokens: torch.LongTensor,
    final_beam_indices: torch.LongTensor,
    max_length: int,
    pad_token_id: Optional[int] = None,
    eos_token_id: Optional[Union[int, List[int]]] = None,
    beam_indices: Optional[torch.LongTensor] = None,
    decoder_prompt_len: Optional[int] = 0,
) -> Tuple[torch.LongTensor]:
    batch_size = len(self._beam_hyps) // self.num_beam_groups

    # select the best hypotheses
    sent_lengths = input_ids.new(batch_size * self.num_beam_hyps_to_keep)
    best = []
    best_indices = []
    best_scores = torch.zeros(batch_size * self.num_beam_hyps_to_keep, device=self.device, dtype=torch.float32)

    # retrieve best hypotheses
    for i in range(batch_size):
        beam_hyps_in_batch = self._beam_hyps[i * self.num_beam_groups : (i + 1) * self.num_beam_groups]
        candidate_beams = [beam for beam_hyp in beam_hyps_in_batch for beam in beam_hyp.beams]
        sorted_hyps = sorted(candidate_beams, key=lambda x: x[0])
        for j in range(self.num_beam_hyps_to_keep):
            best_hyp_tuple = sorted_hyps.pop()   # highest remaining score
            best_score = best_hyp_tuple[0]
            best_hyp = best_hyp_tuple[1]
            best_index = best_hyp_tuple[2]
            sent_lengths[self.num_beam_hyps_to_keep * i + j] = len(best_hyp)

            # append hyp to lists
            best.append(best_hyp)
            # append indices to list
            best_indices.append(best_index)
            best_scores[i * self.num_beam_hyps_to_keep + j] = best_score

    # prepare for adding eos
    sent_lengths_max = sent_lengths.max().item() + 1
    sent_max_len = min(sent_lengths_max, max_length) if max_length is not None else sent_lengths_max
    decoded: torch.LongTensor = input_ids.new(batch_size * self.num_beam_hyps_to_keep, sent_max_len)

    # fill with hypotheses and eos_token_id if the latter fits in
    for i, (hypo, best_idx) in enumerate(zip(best, best_indices)):
        decoded[i, : sent_lengths[i]] = hypo
        if sent_lengths[i] < sent_max_len:
            decoded[i, sent_lengths[i]] = eos_token_id[0]

    return UserDict(
        {
            "sequences": decoded,
            "sequence_scores": best_scores,
            # "indices" is assembled from best_indices in the full source (omitted in this excerpt)
            "beam_indices": indices,
        }
    )
```
Below are some of the logits-processing methods. Typically you can set these parameters (a usage sketch follows the list):

- temperature
  A temperature of (effectively) 0 always produces the same output. The higher the temperature, the more randomness!
- top_p
  Dynamically sets the size of the candidate token list: the top tokens whose cumulative probability does not exceed a given value are shortlisted. top_p is usually set to a fairly high value (such as 0.75), the aim being to limit the tail of low-probability tokens that might otherwise be sampled.
- top_k
  Gives other high-scoring tokens a chance of being selected. The randomness this sampling introduces helps generation quality in many cases. Setting top-k to 3 means choosing among the top three tokens. If both k and p are enabled, p acts after k.
- repetition_penalty
  The repetition-penalty method adds a repetition penalty factor during model inference, lowering the scores of tokens that have already been generated.
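A usage sketch with these knobs; the values are purely illustrative, assuming `model` and `inputs` are loaded as in the earlier examples:

```python
# illustrative values only; assumes `model` and `inputs` as in the earlier examples
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,        # <1 sharpens the distribution, >1 flattens it
    top_k=50,               # keep only the 50 highest-scoring tokens
    top_p=0.9,              # then keep the smallest set with cumulative prob >= 0.9
    repetition_penalty=1.2, # down-weight tokens that already appeared
    max_new_tokens=100,
)
```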
An elegant way to implement the processing pipeline:

```python
class LogitsProcessorList(list):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> torch.FloatTensor:
        # apply each processor in order, feeding the output of one into the next
        for processor in self:
            scores = processor(input_ids, scores)
        return scores
```
RepetitionPenaltyLogitsProcessor

```python
def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
    # pull out the scores of tokens that already appear in input_ids
    score = torch.gather(scores, 1, input_ids)
    # if score < 0 then repetition penalty has to be multiplied to reduce the token probabilities
    score = torch.where(score < 0, score * self.penalty, score / self.penalty)
    # write the penalized scores back to their original positions
    scores.scatter_(1, input_ids, score)
    return scores
```
TopPLogitsWarper

```python
def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
    # sort ascending so the lowest-probability tokens come first
    sorted_logits, sorted_indices = torch.sort(scores, descending=False)
    cumulative_probs = sorted_logits.softmax(dim=-1).cumsum(dim=-1)

    # Remove tokens with cumulative top_p above the threshold (token with 0 are kept)
    sorted_indices_to_remove = cumulative_probs <= (1 - self.top_p)
    # Keep at least min_tokens_to_keep
    sorted_indices_to_remove[..., -self.min_tokens_to_keep :] = 0

    # scatter sorted tensors to original indexing
    indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
    scores = scores.masked_fill(indices_to_remove, self.filter_value)
    return scores
```
TopKLogitsWarper

```python
def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
    top_k = min(self.top_k, scores.size(-1))  # safety check against small vocabularies
    # mask out every score below the k-th largest
    indices_to_remove = scores < torch.topk(scores, top_k)[0][..., -1, None]
    scores = scores.masked_fill(indices_to_remove, self.filter_value)
    return scores
```
TemperatureLogitsWarper

```python
def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
    # dividing by temperature < 1 sharpens the distribution; > 1 flattens it
    scores = scores / self.temperature
    return scores
```
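To see the whole chain in action, here is a toy sketch applying the same transformations by hand (standalone re-implementations with made-up values, not the transformers classes):

```python
import torch

logits = torch.tensor([[4.0, 3.0, 1.0, 0.5, 0.1]])

# temperature: scale the logits
warped = logits / 0.7

# top-k (k=3): drop everything below the 3rd-largest score
kth = torch.topk(warped, 3)[0][..., -1, None]
warped = warped.masked_fill(warped < kth, float("-inf"))

# top-p (p=0.9): drop the low-probability tail of the remaining tokens
sorted_logits, sorted_idx = torch.sort(warped, descending=False)
cum = sorted_logits.softmax(dim=-1).cumsum(dim=-1)
remove = cum <= (1 - 0.9)
remove = remove.scatter(1, sorted_idx, remove)
warped = warped.masked_fill(remove, float("-inf"))

# sample from what survives the filters
probs = torch.softmax(warped, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```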