Inference stage.
Under the hood, this is transparent to the user:
1. DeepSpeed replaces model modules with high-performance kernels (kernel injection), speeding up inference;
2. DeepSpeed places the model across multiple GPU cards according to mp_size, i.e. automatic model parallelism.
import os
import torch
import transformers
import deepspeed

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

# Create the model pipeline
pipe = transformers.pipeline(task="text2text-generation",
                             model="google/t5-v1_1-small",
                             device=local_rank)

# Initialize the DeepSpeed-Inference engine
pipe.model = deepspeed.init_inference(pipe.model,
                                      mp_size=world_size,
                                      dtype=torch.float)

output = pipe("Input String")
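The LOCAL_RANK and WORLD_SIZE environment variables above are set by the deepspeed launcher. For example, launching the script (here named inference_script.py, an illustrative filename) as

deepspeed --num_gpus 2 inference_script.py

gives world_size=2, so init_inference shards the model across two GPUs.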
DeepSpeed supports changing mp_size at inference time: a trained model can be served with a different mp_size than was used in training, even if it was trained without model parallelism at all.
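A minimal sketch of this (the model and the mp_size value are illustrative, not from the original note): a checkpoint trained without model parallelism is loaded normally, then sharded across two GPUs purely at inference time.

import torch
import transformers
import deepspeed

# The checkpoint was trained without model parallelism; mp_size=2 shards
# it across two GPUs at inference time (illustrative value; it should
# match the number of processes launched).
model = transformers.AutoModelForSeq2SeqLM.from_pretrained("google/t5-v1_1-small")
model = deepspeed.init_inference(model,
                                 mp_size=2,
                                 dtype=torch.float)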
Quantized inference: dtype also supports torch.int8.
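A minimal sketch, assuming the same pipeline setup as in the example above; only the dtype argument changes (int8 kernel support varies by model and DeepSpeed version):

import os
import torch
import transformers
import deepspeed

pipe = transformers.pipeline(task="text2text-generation",
                             model="google/t5-v1_1-small",
                             device=int(os.getenv("LOCAL_RANK", "0")))

# dtype=torch.int8 requests quantized inference kernels.
pipe.model = deepspeed.init_inference(pipe.model,
                                      mp_size=int(os.getenv("WORLD_SIZE", "1")),
                                      dtype=torch.int8)
output = pipe("Input String")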
Newer inference API:
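In more recent DeepSpeed releases, the mp_size argument is deprecated in favor of a tensor_parallel config; a hedged sketch assuming such a version (the exact config keys may differ across releases, so check the installed version's docs):

import os
import torch
import transformers
import deepspeed

world_size = int(os.getenv("WORLD_SIZE", "1"))
model = transformers.AutoModelForSeq2SeqLM.from_pretrained("google/t5-v1_1-small")

# Newer-style config: tensor_parallel replaces the deprecated mp_size,
# and replace_with_kernel_inject enables kernel injection explicitly.
# (Assumed API shape for recent DeepSpeed versions.)
model = deepspeed.init_inference(model,
                                 tensor_parallel={"tp_size": world_size},
                                 dtype=torch.float,
                                 replace_with_kernel_inject=True)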