1. Single-GPU base-model inference:
[A base model is essentially just a completion engine that generates up to max_length tokens]
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def generate_text(prompt_text):
    model_name = "gpt2"
    model = GPT2LMHeadModel.from_pretrained(model_name)
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token, so reuse eos
    inputs = tokenizer(prompt_text, return_tensors='pt', padding=True, truncation=True)
    max_length = 100  # max_length = prompt tokens + generated tokens
    model.to('cuda')
    inputs = inputs.to('cuda')
    output = model.generate(inputs.input_ids,
                            attention_mask=inputs.attention_mask,
                            max_length=max_length,
                            pad_token_id=tokenizer.eos_token_id)
    print('The number of tokens:', len(output[0]))
    generated_text_1 = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text_1
prompt = "hello,how are you today?"
generated_text = generate_text(prompt)
print(generated_text)
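Since max_length also counts the prompt tokens, the number of newly generated tokens shrinks as the prompt grows. To bound only the continuation, generate also accepts max_new_tokens; a minimal sketch reusing the model/tokenizer/inputs from above:

# max_new_tokens caps only the newly generated tokens,
# whereas max_length = prompt tokens + generated tokens
output = model.generate(inputs.input_ids,
                        attention_mask=inputs.attention_mask,
                        max_new_tokens=100,  # up to 100 new tokens, whatever the prompt length
                        pad_token_id=tokenizer.eos_token_id)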
---------------------------------------------------------------------------------------------------------------------------------
2. Single-GPU chat-model inference:
Note:
The inference flow below uses the small gpt2 model only as an illustration. gpt2 is in fact a base model with no instruction tuning, so it cannot hold a chat Q&A and this pipeline is meaningless for it! (Recent transformers versions will even raise an error here, since the gpt2 tokenizer carries no chat_template.)
A base model cannot answer in Q&A format, and its output always runs to the max_new_tokens limit [without chat fine-tuning it never learned the special tokens <|im_start|>, <|im_end|>, <|endoftext|>; since it cannot emit/predict these tokens, it cannot use them as a stopping rule to terminate generation early [early_stop]! It only stops at the maximum output length! A sketch of early stopping with a genuinely chat-tuned model follows the code below.]
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')
model.to('cuda')
# With no dedicated chat function, use apply_chat_template + generate
# ------------------------ input format ---------------------
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# --------------------------- inference ----------------
model_inputs = tokenizer([text], return_tensors="pt").to('cuda')
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=128
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
print('The number of tokens:', len(generated_ids[0]))  # matches max_new_tokens above
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
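With a genuinely chat-tuned model, early stopping works because the end-of-turn token is in the vocabulary and the model learned to emit it. A minimal sketch of the mechanism, assuming a ChatML-style chat model (e.g. a Qwen1.5 chat checkpoint) already loaded as model/tokenizer:

# A chat-tuned tokenizer carries a chat_template and the <|im_end|> special token
assert tokenizer.chat_template is not None  # base models like gpt2 have none
im_end_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=128,
    eos_token_id=im_end_id  # generation can now terminate before max_new_tokens [early_stop]
)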
---------------------------------------------------------------------------------------------------------------------------------
3. Multi-GPU distributed inference
When inference on a single GPU hits an OOM error, the usual remedy is to shard the model across devices. transformers models ship with distributed inference built in,
via the device_map parameter of from_pretrained:
device_map='auto' automatically uses all visible GPUs, sharding storage and inference across them.
gpt2, for example, has 12 blocks; placement can also be given as a dict [the mapping must cover every parameter of the model]:
device_map = {"block1": 0,
"block2": 0,"block1": 1,
......}
For best efficiency, make sure the device map places parameters on the GPUs sequentially, to avoid shuttling data back and forth between GPUs.
- For example, don't put the first weights on GPU 0, then some weights on GPU 1, and the final weights back on GPU 0. (A sketch for inspecting the resulting placement follows below.)
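After loading with a device map, you can verify where each module actually landed via the hf_device_map attribute that transformers attaches to the model. A minimal sketch with gpt2 on two GPUs (the printed mapping below is illustrative):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2", device_map="auto")
print(model.hf_device_map)
# e.g. {'transformer.wte': 0, 'transformer.h.0': 0, ..., 'transformer.h.11': 1, 'lm_head': 1}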
from transformers import AutoModelForCausalLM, AutoTokenizer

def main():
    model = AutoModelForCausalLM.from_pretrained(
        "./Xunzi-Qwen1.5-7B_chat",
        torch_dtype="auto",   # dtype is set here (an fp16 flag is not a from_pretrained argument)
        device_map="auto",    # a device_map moves the model onto the GPUs automatically
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained("./Xunzi-Qwen1.5-7B_chat")
    tokenizer.pad_token = tokenizer.eos_token
    prompt = "Give me a short introduction to large language model."
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    print('text:\n', text)
    '''
    text:
    <|im_start|>system
    You are a helpful assistant.<|im_end|>
    <|im_start|>user
    Give me a short introduction to large language model.<|im_end|>
    <|im_start|>assistant
    '''
    # A single text gives shape [1, len] with or without the list wrapper; a batch gives [batch, len]
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    print('model_inputs\n', model_inputs)
    '''
    # The input also contains the <|im_start|>/<|im_end|> special tokens marking turn boundaries
    {'input_ids': tensor([[151644,   8948,    198,   2610,    525,    264,  10950,  17847,     13,
             151645,    198, 151644,    872,    198,  35127,    752,    264,   2805,
              16800,    311,   3460,   4128,   1614,     13, 151645,    198, 151644,
              77091,    198]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
             1, 1, 1, 1, 1]], device='cuda:0')}
    '''
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=512
    )
    generated_ids = [
        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(response)

if __name__ == '__main__':
    main()
Commonly used distributed-inference frameworks include accelerate; a sketch follows (more to be added later).
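A minimal sketch of the accelerate route (the max_memory values are placeholders; the checkpoint path is the same directory used above):

from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("./Xunzi-Qwen1.5-7B_chat")
with init_empty_weights():  # build the model skeleton without allocating real weights
    model = AutoModelForCausalLM.from_config(config)
device_map = infer_auto_device_map(model, max_memory={0: "12GiB", 1: "12GiB"})
model = load_checkpoint_and_dispatch(model, "./Xunzi-Qwen1.5-7B_chat", device_map=device_map)
# from here on, generate() works exactly as in the code above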