Testing the parameters of the transformers library's generate function

Article link: https://blog.csdn.net/m0_62053105/article/details/134519983

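For my first post, I want to record how adjusting the parameters of the generate function changes the generated text while I was playing with the "gpt2" model from the transformers library.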

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "I am happy because"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# generate at most 40 tokens in total; max_length counts the prompt tokens too
outputs = model.generate(input_ids, do_sample=False, max_length=40, pad_token_id=tokenizer.eos_token_id)
output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(output_text)

1. do_sample (`bool`, optional, defaults to `False`)

Here do_sample=False is set, so greedy decoding is used: at every step the model picks the single most probable token.

The output is:

['I am happy because I am a good person. I am a good person. I am a good person. I am a good person. I am a good person. I am a good person.']

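As you can see, gpt2 runs into a severe "parrot problem" here and repeats the same sentence over and over. If do_sample=True is set instead, multinomial sampling is used and the following text is generated: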

["I am happy because it has been a challenge for me to make a great job. My life has been incredibly simple and I'm so grateful. I'm a student at a university where even though I"]

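Also, multinomial sampling gives a different result on every run unless the random seed is fixed, whereas greedy decoding is deterministic and returns the same text for the same prompt and model.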
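
To make the difference between the two decoding modes concrete, here is a minimal sketch on a toy next-token distribution. It does not use transformers at all; the tokens, their probabilities, and the helper names `greedy_pick` and `multinomial_pick` are made up for illustration.

```python
import random

# Toy next-token distribution; tokens and probabilities are invented for illustration.
probs = {"I": 0.5, "it": 0.3, "my": 0.15, "there": 0.05}

def greedy_pick(probs):
    """Greedy decoding: always take the single most probable token."""
    return max(probs, key=probs.get)

def multinomial_pick(probs, rng):
    """Multinomial sampling: draw one token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the sampled draws are repeatable
greedy_draws = [greedy_pick(probs) for _ in range(5)]
sampled_draws = [multinomial_pick(probs, rng) for _ in range(200)]

print(greedy_draws)                # the same token every time
print(sorted(set(sampled_draws)))  # several different tokens show up
```

Greedy decoding returns the same token no matter how often it is called, which is why the greedy output above never changes between runs, while sampling can land on any token with non-zero probability.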

2. temperature (`float`, optional, defaults to 1.0)

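The temperature parameter controls the randomness and diversity of the generated text. It works by dividing the model's output logits by temperature before the softmax, which reshapes the probability distribution over the next token. A higher temperature spreads probability more evenly across tokens, which makes the generated text more random and diverse; a lower temperature close to 0 concentrates probability on the most likely token, which makes the generated text more deterministic and focused.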

Note that this parameter only takes effect when sampling is used (do_sample=True).
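
The effect of temperature on the distribution can be sketched in a few lines of plain Python. This is only an illustration of the scaling idea, not the library's code; the function name `softmax_with_temperature` and the three example logits are my own assumptions.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    rounded = [round(p, 3) for p in softmax_with_temperature(logits, t)]
    print(f"temperature={t}: {rounded}")
```

The lower the temperature, the larger the share of probability that goes to the first token; the higher the temperature, the closer the three probabilities get to each other.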

With temperature=2.0, gpt2 already starts to talk nonsense:

['I am happy because your comments help get my campaign launched at my level of fame so as to take more heat from me for my \'flab policy\' being more "savage":\n\nFirst']

With temperature=0.5, the output is:

['I am happy because I am a proud American. I am proud that I am a proud American. I am proud that I am a proud American. I am proud that I am a proud American.']

So a low temperature can also produce the parrot problem.

3. top_p (`float`, optional, defaults to 1.0)

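top_p is similar to top_k (covered in section 4), but it uses cumulative probability: only the smallest set of most probable tokens whose cumulative probability reaches top_p is kept for sampling. The smaller top_p is, the fewer tokens can be sampled, and the more likely the parrot problem becomes. With top_p=0.5: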

['I am happy because I have a good partner and we have a good family. We are happy because we have a good partner and we have a good family. We have a good family. We have']

With top_p=0.1:

['I am happy because I am a woman. I am a woman. I am a woman. I am a woman. I am a woman. I am a woman. I am a woman. I']
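
The cumulative-probability cut can be sketched as follows. This is a simplified illustration under my own assumptions (a toy distribution and a hypothetical helper `top_p_filter`), not the filtering code that transformers actually runs.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most probable tokens whose cumulative
    probability reaches top_p, then renormalize the kept probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # enough probability mass collected
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Toy next-token distribution; tokens and probabilities are invented for illustration.
probs = {"I": 0.5, "it": 0.3, "my": 0.15, "there": 0.05}
for top_p in (0.1, 0.9, 1.0):
    print(top_p, list(top_p_filter(probs, top_p)))
```

In this sketch, a small top_p such as 0.1 leaves only the single most probable token whenever that token alone already exceeds the threshold, so sampling behaves almost like greedy decoding. That would be consistent with the repetition seen in the top_p=0.1 output.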

4. top_k (`int`, optional, defaults to 50)

This parameter restricts the model to the k most probable tokens when it generates the next token.
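
A minimal sketch of that restriction, again on a toy distribution. The helper name `top_k_filter` and the probabilities are assumptions for illustration and not the library's implementation.

```python
def top_k_filter(probs, top_k):
    """Keep only the top_k most probable tokens and renormalize them."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}

# Toy next-token distribution; tokens and probabilities are invented for illustration.
probs = {"I": 0.5, "it": 0.3, "my": 0.15, "there": 0.05}
filtered = top_k_filter(probs, 2)
print(list(filtered))  # only the two most probable tokens remain
```

Sampling then draws from the renormalized survivors, so low-probability tokens in the long tail can never be picked.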

With top_k=100:

['I am happy because there was no reason for us to be here," Karras said of his own life after passing his late father on.\n\nThe family in April shared an arrangement with E']

With top_k=10:

['I am happy because the people here are doing what it takes to win a championship in the NBA," said James, who will play for the Celtics. "I\'m a believer in basketball. I know']

In this single pair of samples, the output with top_k=10 looks considerably more coherent than the output with top_k=100.

5. repetition_penalty (`float`, optional, defaults to 1.0)

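This is a repetition penalty used to mitigate the parrot problem: values above 1.0 make tokens that have already appeared less likely to be chosen again.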

As observed above, setting top_p=0.1 causes an obvious parrot problem, so I test that case below.

With top_p=0.1 and repetition_penalty=2.0:

['I am happy because I have a lot of friends who are very good at it.\n"It\'s not like they\'re going to be able, you know? They\'ll just go out and do']

As you can see, the repetition penalty is clearly effective here.
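
For intuition, here is a small sketch of how such a penalty can act on the logits. As far as I understand, transformers follows the rule from the CTRL paper (divide positive scores by the penalty, multiply negative scores by it), but treat this as an assumption-laden illustration; the function name `apply_repetition_penalty` and the example values are my own.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Lower the score of every token id that has already appeared.

    Positive scores are divided by the penalty and negative scores are
    multiplied by it, so with penalty > 1 both become less likely.
    """
    penalized = list(logits)
    for token_id in set(generated_ids):
        score = penalized[token_id]
        penalized[token_id] = score * penalty if score < 0 else score / penalty
    return penalized

# Hypothetical logits over a 4-token vocabulary; ids 0 and 2 were already generated.
logits = [3.0, 1.0, -1.0, 0.5]
generated_ids = [0, 2, 0]
print(apply_repetition_penalty(logits, generated_ids, 2.0))
```

With a penalty of 2.0, the scores of the already-used tokens drop while unseen tokens keep their original scores, which makes an exact repeat of an earlier sentence much less likely.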