Understanding Text Generation Strategies for Large Language Models

1 Introduction

At first I only knew about top_p, top_k, temperature, num_beams, and repetition_penalty. As I ran into more parameters at work, such as penalty_alpha and do_sample, I sometimes found myself confused, so it was time to go through them systematically.

The first thing to do is read the source code of Hugging Face's transformers library. The class definitions carry extensive docstrings, which are far more useful than hunting for material on the web. class GenerationConfig(PushToHubMixin): and class GenerationMixin: are the ones worth reading carefully.

GenerationMixin:

transformers/src/transformers/generation/utils.py at add-chat-glm · huggingface/transformers (github.com)

GenerationConfig:

transformers/src/transformers/generation/configuration_utils.py at add-chat-glm · huggingface/transformers (github.com)

Next there is the official documentation of the Hugging Face library:

Text generation strategies (huggingface.co)

It covers roughly the same material as the docstrings in the code; if you prefer reading documentation, head to the official site.

Seven decoding strategies are described in total. At the end of 2022, Contrastive search was reported to be the SOTA strategy, but my company compared the options earlier and we still use Multinomial sampling today, i.e. do_sample=True with num_beams=1. The remaining important parameters are:

temperature=0.3
top_k=30
top_p=0.85
do_sample = True # master switch
num_beams = 1 # do_sample=True together with num_beams=1 enables multinomial sampling
repetition_penalty = 1.2
max_new_tokens = 1024
min_new_tokens = 20
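As a minimal sketch of how these parameters might be wired together (the gpt2 checkpoint and the prompt below are placeholders for illustration, not what we actually run in production):

>>> from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

>>> checkpoint = "gpt2"  # placeholder checkpoint
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)

>>> # Bundle the parameters above into a GenerationConfig
>>> generation_config = GenerationConfig(
...     do_sample=True,          # master switch
...     num_beams=1,             # with do_sample=True this selects multinomial sampling
...     temperature=0.3,
...     top_k=30,
...     top_p=0.85,
...     repetition_penalty=1.2,
...     max_new_tokens=1024,
...     min_new_tokens=20,
...     pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
... )

>>> inputs = tokenizer("The weather today is", return_tensors="pt")
>>> outputs = model.generate(**inputs, generation_config=generation_config)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)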

2 Decoding strategies

Certain combinations of the generate() parameters, and ultimately generation_config, can be used to enable specific decoding strategies. If you are new to this concept, we recommend reading this blog post that illustrates how common decoding strategies work.

Here, we’ll show some of the parameters that control the decoding strategies and illustrate how you can use them.

Greedy Search

generate uses greedy search decoding by default, so you don’t have to pass any parameters to enable it. This means that num_beams is set to 1 and do_sample=False.

>>> from transformers import AutoModelForCausalLM, AutoTokenizer
​
>>> prompt = "I look forward to"
>>> checkpoint = "distilgpt2"
​
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")
​
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> outputs = model.generate(**inputs)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['I look forward to seeing you all again!\n\n\n\n\n\n\n\n\n\n\n']

Contrastive search

The contrastive search decoding strategy was proposed in the 2022 paper A Contrastive Framework for Neural Text Generation. It demonstrates superior results for generating non-repetitive yet coherent long outputs. To learn how contrastive search works, check out this blog post. The two main parameters that enable and control the behavior of contrastive search are penalty_alpha and top_k:
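Roughly speaking (this is my own sketch of the scoring rule from the SimCTG paper, not the library's actual implementation), each of the top_k candidate tokens is scored by trading off the model's confidence against a degeneration penalty, and penalty_alpha weights the two terms:

>>> import torch
>>> import torch.nn.functional as F

>>> # Toy illustration: score(v) = (1 - alpha) * p(v | context) - alpha * max_j cos_sim(h_v, h_j)
>>> alpha = 0.6                                          # penalty_alpha
>>> candidate_probs = torch.tensor([0.5, 0.3, 0.2])      # p(v | context) for top_k=3 candidates
>>> candidate_hidden = torch.randn(3, 768)               # hidden state of each candidate continuation
>>> context_hidden = torch.randn(10, 768)                # hidden states of the 10 context tokens

>>> # Degeneration penalty: maximum cosine similarity to any previous context token
>>> sim = F.cosine_similarity(candidate_hidden.unsqueeze(1), context_hidden.unsqueeze(0), dim=-1)
>>> degeneration_penalty = sim.max(dim=-1).values        # shape (3,)

>>> # The candidate with the best confidence/penalty trade-off becomes the next token
>>> scores = (1 - alpha) * candidate_probs - alpha * degeneration_penalty
>>> next_candidate = scores.argmax()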

>>> from transformers import AutoTokenizer, AutoModelForCausalLM
​
>>> checkpoint = "gpt2-large"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
​
>>> prompt = "Hugging Face Company is"
>>> inputs = tokenizer(prompt, return_tensors="pt")
​
>>> outputs = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=100)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Hugging Face Company is a family owned and operated business. We pride ourselves on being the best
in the business and our customer service is second to none.\n\nIf you have any questions about our
products or services, feel free to contact us at any time. We look forward to hearing from you!']

Multinomial sampling

As opposed to greedy search that always chooses a token with the highest probability as the next token, multinomial sampling (also called ancestral sampling) randomly selects the next token based on the probability distribution over the entire vocabulary given by the model. Every token with a non-zero probability has a chance of being selected, thus reducing the risk of repetition.
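Conceptually, each step looks something like this toy sketch (my illustration with made-up logits, not the library code): the next token is drawn from the full softmax distribution instead of taking the argmax.

>>> import torch

>>> # Toy next-token logits over a 5-token vocabulary
>>> logits = torch.tensor([2.0, 1.5, 0.3, -1.0, -2.0])
>>> probs = torch.softmax(logits, dim=-1)

>>> # Greedy search would always pick the argmax
>>> probs.argmax()
tensor(0)

>>> # Multinomial sampling draws from the whole distribution, so any token
>>> # with non-zero probability can come out
>>> torch.multinomial(probs, num_samples=1)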

To enable multinomial sampling set do_sample=True and num_beams=1.

>>> from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
>>> set_seed(0)  # For reproducibility
​
>>> checkpoint = "gpt2-large"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
​
>>> prompt = "Today was an amazing day because"
>>> inputs = tokenizer(prompt, return_tensors="pt")
​
>>> outputs = model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Today was an amazing day because when you go to the World Cup and you don\'t, or when you don\'t get invited,
that\'s a terrible feeling."']

Beam-search decoding

Unlike greedy search, beam-search decoding keeps several hypotheses at each time step and eventually chooses the hypothesis that has the overall highest probability for the entire sequence. This has the advantage of identifying high-probability sequences that start with lower probability initial tokens and would’ve been ignored by the greedy search.
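As a toy sketch of the idea (my own illustration, not the transformers implementation): at every step each surviving hypothesis is extended by every candidate token, and only the num_beams highest-scoring extensions are kept.

>>> import torch

>>> num_beams, vocab_size = 2, 4
>>> # Running log-probability of each current hypothesis
>>> beam_scores = torch.tensor([-0.5, -1.2])
>>> # Toy next-token log-probabilities for each hypothesis, shape (num_beams, vocab_size)
>>> next_token_logprobs = torch.log_softmax(torch.randn(num_beams, vocab_size), dim=-1)

>>> # Score of every (hypothesis, next-token) pair
>>> candidate_scores = beam_scores.unsqueeze(-1) + next_token_logprobs

>>> # Keep only the num_beams best extensions across all pairs
>>> top_scores, top_indices = candidate_scores.view(-1).topk(num_beams)
>>> beam_indices = top_indices // vocab_size   # which hypothesis each survivor extends
>>> token_indices = top_indices % vocab_size   # which token it appends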

To enable this decoding strategy, specify num_beams (i.e. the number of hypotheses to keep track of) greater than 1.

>>> from transformers import AutoModelForCausalLM, AutoTokenizer
​
>>> prompt = "It is astonishing how one can"
>>> checkpoint = "gpt2-medium"
​
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")
​
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
​
>>> outputs = model.generate(**inputs, num_beams=5, max_new_tokens=50)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['It is astonishing how one can have such a profound impact on the lives of so many people in such a short period of
time."\n\nHe added: "I am very proud of the work I have been able to do in the last few years.\n\n"I have']

Beam-search multinomial sampling

As the name implies, this decoding strategy combines beam search with multinomial sampling. You need to set num_beams greater than 1 and do_sample=True to use this decoding strategy.

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, set_seed
>>> set_seed(0)  # For reproducibility
​
>>> prompt = "translate English to German: The house is wonderful."
>>> checkpoint = "t5-small"
​
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")
​
>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
​
>>> outputs = model.generate(**inputs, num_beams=5, do_sample=True)
>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
'Das Haus ist wunderbar.'

Diverse beam search decoding

The diverse beam search decoding strategy is an extension of the beam search strategy that allows for generating a more diverse set of beam sequences to choose from. To learn how it works, refer to Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models. This approach has three main parameters: num_beams, num_beam_groups, and diversity_penalty. The diversity penalty ensures the outputs are distinct across groups, and beam search is used within each group.

>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
​
>>> checkpoint = "google/pegasus-xsum"
>>> prompt = (
...     "The Permaculture Design Principles are a set of universal design principles "
...     "that can be applied to any location, climate and culture, and they allow us to design "
...     "the most efficient and sustainable human habitation and food production systems. "
...     "Permaculture is a design system that encompasses a wide variety of disciplines, such "
...     "as ecology, landscape design, environmental science and energy conservation, and the "
...     "Permaculture design principles are drawn from these various disciplines. Each individual "
...     "design principle itself embodies a complete conceptual framework based on sound "
...     "scientific principles. When we bring all these separate  principles together, we can "
...     "create a design system that both looks at whole systems, the parts that these systems "
...     "consist of, and how those parts interact with each other to create a complex, dynamic, "
...     "living system. Each design principle serves as a tool that allows us to integrate all "
...     "the separate parts of a design, referred to as elements, into a functional, synergistic, "
...     "whole system, where the elements harmoniously interact and work together in the most "
...     "efficient way possible."
... )
​
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")
​
>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
​
>>> outputs = model.generate(**inputs, num_beams=5, num_beam_groups=5, max_new_tokens=30, diversity_penalty=1.0)
>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
'The Design Principles are a set of universal design principles that can be applied to any location, climate and
culture, and they allow us to design the'

This guide illustrates the main parameters that enable various decoding strategies. More advanced parameters exist for the generate method, which gives you even further control over the generate method’s behavior. For the complete list of the available parameters, refer to the API documentation.

Assisted Decoding

Assisted decoding is a modification of the decoding strategies above that uses an assistant model with the same tokenizer (ideally a much smaller model) to greedily generate a few candidate tokens. The main model then validates the candidate tokens in a single forward pass, which speeds up the decoding process. Currently, assisted decoding only supports greedy search and sampling, and it doesn’t support batched inputs. To learn more about assisted decoding, check this blog post.

To enable assisted decoding, set the assistant_model argument with a model.

>>> from transformers import AutoModelForCausalLM, AutoTokenizer
​
>>> prompt = "Alice and Bob"
>>> checkpoint = "EleutherAI/pythia-1.4b-deduped"
>>> assistant_checkpoint = "EleutherAI/pythia-160m-deduped"
​
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")
​
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)
>>> outputs = model.generate(**inputs, assistant_model=assistant_model)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']

When using assisted decoding with sampling methods, you can use the temperature argument to control the randomness just like in multinomial sampling. However, in assisted decoding, reducing the temperature will help improve latency.
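For intuition, temperature simply rescales the logits before the softmax (a toy sketch with made-up numbers, not the assisted-decoding code itself); values below 1 sharpen the distribution, so sampling behaves closer to greedy search, which is also why the assistant's drafts get accepted more often:

>>> import torch

>>> logits = torch.tensor([2.0, 1.0, 0.5])
>>> torch.softmax(logits, dim=-1)          # temperature = 1.0
tensor([0.6285, 0.2312, 0.1402])
>>> torch.softmax(logits / 0.5, dim=-1)    # temperature = 0.5: sharper, closer to greedy
tensor([0.8438, 0.1142, 0.0420])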

>>> from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
>>> set_seed(42)  # For reproducibility
​
>>> prompt = "Alice and Bob"
>>> checkpoint = "EleutherAI/pythia-1.4b-deduped"
>>> assistant_checkpoint = "EleutherAI/pythia-160m-deduped"
​
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
>>> inputs = tokenizer(prompt, return_tensors="pt")
​
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint)
>>> assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)
>>> outputs = model.generate(**inputs, assistant_model=assistant_model, do_sample=True, temperature=0.5)
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Alice and Bob are going to the same party. It is a small party, in a small']

3 References

Reference: Generating Human-level Text with Contrastive Search in Transformers 🤗 (huggingface.co)
