Today I needed to use a profiler to analyze the performance of an LLM, so I gave it a try. I'm sharing my example code here, and I hope everyone's coding goes smoothly:
import time

import torch
import torch.profiler as profiler
from transformers import AutoTokenizer, AutoModel

model_name_or_path = 'THUDM/chatglm-6b'
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
# Load the model in fp16 on the GPU and put it in eval mode for inference.
model = AutoModel.from_pretrained(model_name_or_path, trust_remote_code=True).half().cuda()
model = model.eval()

prompt = "你好"
inputs = tokenizer([prompt], return_tensors="pt")
inputs = inputs.to("cuda")

# Profile CPU and CUDA activity. The schedule skips 1 step, warms up
# for 1 step, then records 2 active steps, and does this once.
prof = profiler.profile(
    activities=[
        profiler.ProfilerActivity.CPU,
        profiler.ProfilerActivity.CUDA,
    ],
    schedule=profiler.schedule(
        wait=1,
        warmup=1,
        active=2,
        repeat=1,
    ),
)

# Warm up: run a few forward passes first so CUDA kernel initialization
# and caching do not distort the measurements below.
with torch.no_grad():
    for i in range(5):
        result = model(**inputs)

# The profiler must be started before prof.step() is called; using it
# as a context manager handles start/stop automatically.
with prof, torch.no_grad():
    for i in range(10):
        start = time.perf_counter()
        # response, history = model.chat(tokenizer, "你好", history=[])
        # print(response)
        result = model(**inputs)
        # CUDA launches are asynchronous; synchronize before stopping the timer.
        torch.cuda.synchronize()
        forward_ms = (time.perf_counter() - start) * 1000
        print("Forward pass latency (ms):", forward_ms)
        prof.step()

print(prof.key_averages().table(sort_by="self_cpu_time_total"))
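If the summary table is not detailed enough, torch.profiler also accepts record_shapes and profile_memory flags. A minimal variant of the construction above (everything else stays the same):

prof = profiler.profile(
    activities=[
        profiler.ProfilerActivity.CPU,
        profiler.ProfilerActivity.CUDA,
    ],
    schedule=profiler.schedule(wait=1, warmup=1, active=2, repeat=1),
    record_shapes=True,   # attach input tensor shapes to each recorded op
    profile_memory=True,  # track tensor memory allocations and frees
)

With profile_memory=True, the key_averages() table can then also be sorted by memory columns such as "self_cuda_memory_usage".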
My example uses ChatGLM; if you need a different model you can swap it in, since the principle is the same.
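One more tip: if you prefer a visual timeline over the console table, you can pass an on_trace_ready callback that dumps a Chrome trace file for each recorded cycle. A minimal sketch assuming the same setup as above (the file name pattern is my own choice); open the resulting file in chrome://tracing or https://ui.perfetto.dev:

def trace_handler(p):
    # Called automatically at the end of each "active" phase of the schedule.
    p.export_chrome_trace("trace_step{}.json".format(p.step_num))

prof = profiler.profile(
    activities=[
        profiler.ProfilerActivity.CPU,
        profiler.ProfilerActivity.CUDA,
    ],
    schedule=profiler.schedule(wait=1, warmup=1, active=2, repeat=1),
    on_trace_ready=trace_handler,
)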