Install the libraries
Loading a 7B-class LLM for GPU inference, my 24 GB card could not complete even a single forward pass: Out of Memory.
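A back-of-the-envelope estimate shows why this OOMs (assuming roughly 8B parameters for the llama3-8b model used below): weight memory alone is parameters times bytes per parameter, before activations and the KV cache are even counted.

```python
def model_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Rough weight-only memory estimate in GiB."""
    return num_params * bytes_per_param / 1024**3

params = 8e9  # dolphin-2.9-llama3-8b: ~8B parameters (assumption)
print(round(model_memory_gb(params, 4), 1))  # fp32: ~29.8 GiB -> OOM on 24 GB
print(round(model_memory_gb(params, 2), 1))  # bf16: ~14.9 GiB
print(round(model_memory_gb(params, 1), 1))  # int8: ~7.5 GiB weights
```

This matches the observation below that int8-quantized inference fits comfortably under 13 GB.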
Here I use the Quanto library to quantize the model.
quanto==0.1.0 requires torch > 2.2.0, so upgrade torch first:
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118
Then install:
pip install quanto
pip install accelerate
Versions used here:
transformers==4.40.0
quanto==0.1.0
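To sanity-check the environment before quantizing, a small helper (hypothetical, not part of quanto) can compare an installed version string against the minimums above:

```python
def version_at_least(version: str, minimum: tuple) -> bool:
    """True if a dotted version string (suffixes like '+cu118' stripped) is >= minimum."""
    parts = tuple(int(p) for p in version.split("+")[0].split(".")[:len(minimum)])
    return parts >= minimum

# torch 2.2.2+cu118 satisfies quanto's "torch > 2.2.0" requirement
print(version_at_least("2.2.2+cu118", (2, 2, 1)))  # True
# In practice: version_at_least(torch.__version__, (2, 2, 1))
```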
-----------------------------------------------------------------------------
Quanto's quantization step needs gcc > 9.0.0 (upgrade it yourself if necessary).
----------------------------------------------------------------------------
Quantization
After quantizing the model, run text generation again. With the code below, inference completes using less than 13 GB of VRAM.
from transformers import AutoTokenizer,AutoModelForCausalLM, QuantoConfig
import torch
import os
os.environ["TOKENIZERS_PARALLELISM"] = "true"
def generate_text(model, input_text):
    # Tokenize the prompt and move it to the same device as the model
    inputs = tokenizer(input_text, return_tensors='pt').to(device)
    outputs = model.generate(**inputs, max_new_tokens=50)
    # Decode the generated ids back to text, dropping special tokens
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
tokenizer = AutoTokenizer.from_pretrained("cognitivecomputations/dolphin-2.9-llama3-8b", padding_side="left")
quantization_config = QuantoConfig(weights="int8")
quantized_model = AutoModelForCausalLM.from_pretrained("cognitivecomputations/dolphin-2.9-llama3-8b", device_map="cuda:1", quantization_config=quantization_config)
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
result = generate_text(quantized_model, "How many steps can put elephants into a refrigerator?")
print(result)
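QuantoConfig also accepts lower-bit weight settings (int8, int4, int2, and float8 are the options transformers documents). Swapping the config above for int4 roughly halves weight memory again, at some cost in quality; a sketch of the same load with 4-bit weights:

```python
# Same load as above, but with 4-bit instead of 8-bit weights
quantization_config = QuantoConfig(weights="int4")
quantized_model = AutoModelForCausalLM.from_pretrained(
    "cognitivecomputations/dolphin-2.9-llama3-8b",
    device_map="cuda:1",
    quantization_config=quantization_config,
)
```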
------------------------------------------------------------------------------------------
Hugging Face distributed loading
Alternatively, Hugging Face can load the model directly in bfloat16 and automatically shard it across the available GPUs for distributed inference, which is much more convenient than quantizing:
model = AutoModelForCausalLM.from_pretrained("cognitivecomputations/dolphin-2.9-llama3-8b",torch_dtype=torch.bfloat16,device_map="auto")
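If device_map="auto" packs too much onto one card, from_pretrained also accepts a per-device max_memory cap (an accelerate option; the limits below are illustrative, not measured):

```python
model = AutoModelForCausalLM.from_pretrained(
    "cognitivecomputations/dolphin-2.9-llama3-8b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "20GiB", 1: "20GiB"},  # keys are GPU indices
)
```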