Concurrent Multi-GPU Inference for Large Models with tensor-parallel

With tensor-parallel, a model's training and inference workload can be distributed evenly across multiple GPUs. This speeds up inference, and the balanced VRAM usage across cards also lets long, complex prompts be handled without exhausting a single GPU's memory.
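As a rough intuition for what the library does under the hood, tensor parallelism splits a layer's weight matrix so that each GPU computes only a slice of the output, and the slices are then merged. This is a minimal pure-Python sketch of a column-parallel matrix multiply (the real library shards torch weights across CUDA devices; only the splitting logic is illustrated here):

```python
def matmul(x, w):
    # Plain matrix multiply: x is a list of input rows, w a weight matrix.
    rows, cols = len(w), len(w[0])
    return [[sum(xi[k] * w[k][j] for k in range(rows)) for j in range(cols)]
            for xi in x]

def split_columns(w, num_shards):
    # Column-parallel split: each "GPU" holds a slice of the output columns.
    cols = len(w[0])
    step = cols // num_shards
    return [[row[i * step:(i + 1) * step] for row in w]
            for i in range(num_shards)]

x = [[1.0, 2.0]]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# Each shard computes its partial output independently (in parallel on real
# GPUs); the partial outputs are then concatenated along the column dimension.
shards = split_columns(w, 2)
partials = [matmul(x, shard) for shard in shards]
merged = [sum((p[0] for p in partials), [])]

assert merged == matmul(x, w)  # sharded result matches the full matmul
```

Each shard's multiply only needs its own slice of the weights in memory, which is why the per-GPU VRAM footprint shrinks as more cards are added.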

Import the required libraries:

# torch version 2.0.0
import torch
# tensor-parallel version 1.0.22
from tensor_parallel import TensorParallelPreTrainedModel
# transformers version 4.28.0.dev0
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig

Load LLaMA-7B and convert it into a TensorParallelPreTrainedModel:

model = LlamaForCausalLM.from_pretrained("./llama-7b-hf", torch_dtype=torch.float16)
model = TensorParallelPreTrainedModel(model, ["cuda:0", "cuda:1"])

Load the tokenizer and run inference:

tokenizer = LlamaTokenizer.from_pretrained("./llama-7b-hf")

tokens = tokenizer("Hi, how are you?", return_tensors="pt")
tokenizer.decode(model.generate(tokens["input_ids"].cuda(0), attention_mask=tokens["attention_mask"].cuda(0))[0])

# Output:
#  'Hi, how are you? I'm a 20 year old girl from the Netherlands'

tokens = tokenizer("Once upon a time, there was a lonely computer ", return_tensors="pt")
tokenizer.decode(model.generate(tokens["input_ids"].cuda(0), attention_mask=tokens["attention_mask"].cuda(0), max_length=256)[0])

# Output:
#  'Once upon a time, there was a lonely computer. It was a very old computer, and it had been sitting in a box for a long time. It was very sad, because it had no friends.\nOne day, a little girl came to the computer. She was very nice, and she said, “Hello, computer. I’m going to be your friend.”\nThe computer was very happy. It said, “Thank you, little girl. I’m very happy to have you as my friend.”\nThe little girl said, “I’m going to call you ‘Computer.’”\n“That’s a good name,” said Computer.\nThe little girl said, “I’m going to teach you how to play games.”\n“That’s a good idea,” said Computer.\nThe little girl said, “I’m going to teach you how to do math.”\nThe little girl said, “I’m going to teach you how to write stories.”\nThe little girl said, “I’m going to teach you how to draw pictures.”\nThe little girl said, “I’m going to teach you how to play music.”\nThe little girl said, “I’m'

Here, the inference workload has been distributed evenly across the two GPUs.

Tensor parallelism is also well supported by mainstream inference frameworks; vLLM and lightllm are both good choices.
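For instance, vLLM exposes tensor parallelism through a single constructor argument. A usage sketch, assuming vLLM is installed and two GPUs are visible (the model path matches the one used above):

```python
from vllm import LLM, SamplingParams

# tensor_parallel_size shards the model across 2 GPUs, analogous to the
# TensorParallelPreTrainedModel wrapping shown earlier.
llm = LLM(model="./llama-7b-hf", tensor_parallel_size=2)

outputs = llm.generate(["Once upon a time, there was a lonely computer "],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```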
