DeepSpeed Inference for Transformer-Based Models

https://www.deepspeed.ai/tutorials/inference-tutorial/

DeepSpeed-Inference supports model parallelism to reduce latency and cost 【large GPUs are expensive, and running on CPU is slow】, and it also uses quantization techniques.

DeepSpeed provides a seamless inference mode for compatible transformer-based models trained using DeepSpeed, Megatron, and HuggingFace.

To run inference on multiple GPUs, you supply the model-parallel degree and the checkpoint information, or an already-loaded model; DeepSpeed does the rest. It automatically partitions the model, injects the matching CUDA kernels into the model, and manages the communication between GPUs. 【The currently compatible models are the ones covered by the layer policies in the DeepSpeed source:】

from .containers import HFGPT2LayerPolicy
from .containers import HFBertLayerPolicy
from .containers import BLOOMLayerPolicy
from .containers import HFGPTJLayerPolicy
from .containers import HFGPTNEOLayerPolicy
from .containers import GPTNEOXLayerPolicy
from .containers import HFOPTLayerPolicy
from .containers import MegatronLayerPolicy
from .containers import HFDistilBertLayerPolicy
from .containers import HFCLIPLayerPolicy
from .containers import LLAMALayerPolicy
from .containers import UNetPolicy
from .containers import VAEPolicy
from .containers import LLAMA2LayerPolicy
from .containers import InternLMLayerPolicy
【Model parallelism is not supported for every model, but data parallelism still works for the rest.】

Initializing for Inference

To run inference with DeepSpeed, load the model through the init_inference API. You can specify the model-parallel (MP) degree, and if the model has not been loaded yet, you can pass checkpoint information so that DeepSpeed loads it for you.

To use the high-performance kernels, you need to set replace_with_kernel_inject to True for the compatible models.

For models not supported by DeepSpeed, users can submit a PR that defines a new policy in the replace_policy class, specifying the different parameters of a Transformer layer.
【Following the author's policy pattern you can add your own model to the framework: implement a policy class that maps the parameters of your user-defined transformer layer onto DeepSpeed's optimized inference layer.】

if args.pre_load_checkpoint:
    model = model_class.from_pretrained(args.model_name_or_path)
else:
    model = model_class()
...

import torch
import deepspeed

# Initialize the DeepSpeed-Inference engine
ds_engine = deepspeed.init_inference(model,
                                 tensor_parallel={"tp_size": 2},
                                 dtype=torch.half,
                                 checkpoint=None if args.pre_load_checkpoint else args.checkpoint_json,
                                 replace_with_kernel_inject=True)
model = ds_engine.module
output = model('Input String')

To run model parallelism for the models with supported kernels, you pass an injection policy that points DeepSpeed at two specific linear layers inside the transformer encoder/decoder layer: 1) the attention output GeMM and 2) the layer output GeMM.
We need these parts of the layer to add the required all-reduce communication between GPUs to merge the partial results across model-parallel ranks. 【The all-reduce combines the partial outputs from all GPUs into the full result.】

# create the model
import os
import torch
import deepspeed
from transformers import pipeline
from transformers.models.t5.modeling_t5 import T5Block

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

pipe = pipeline("text2text-generation", model="google/t5-v1_1-small", device=local_rank)

# Initialize the DeepSpeed-Inference engine with an injection policy that names
# the attention-output and layer-output projections inside T5Block
pipe.model = deepspeed.init_inference(
    pipe.model,
    tensor_parallel={"tp_size": world_size},
    dtype=torch.float,
    injection_policy={T5Block: ('SelfAttention.o', 'EncDecAttention.o', 'DenseReluDense.wo')}
)
output = pipe('Input String')

How do you write such an injection policy for your own model? You have to look at the model's layer implementation to find the module names of these two output projections; a sketch follows below.
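As a minimal sketch, the injection policy is just a dictionary mapping your transformer block class to the attribute paths of its attention-output and layer-output projections. Here MyModel, MyTransformerBlock, 'attn.out_proj', and 'mlp.down_proj' are hypothetical placeholder names, not from any real model:

import os
import torch
import deepspeed
from my_model import MyModel, MyTransformerBlock   # hypothetical: your own model and block class

world_size = int(os.getenv('WORLD_SIZE', '1'))

# 'attn.out_proj' (attention output GeMM) and 'mlp.down_proj' (layer output GeMM)
# are placeholders -- substitute the real attribute paths from your model.
model = MyModel()
model = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": world_size},
    dtype=torch.float,
    injection_policy={MyTransformerBlock: ('attn.out_proj', 'mlp.down_proj')}
)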

Loading Checkpoints

For the models trained using HuggingFace, the model checkpoint can be pre-loaded using the from_pretrained API as shown above.
For Megatron-LM models trained with model parallelism, we require a list of all the model-parallel checkpoints, passed in via a JSON config.
【To be clear, we are loading the model this way precisely so that the DeepSpeed-Inference framework can run it with model parallelism.】

"checkpoint.json":
{
    "type": "Megatron",
    "version": 0.0,
    "checkpoints": [
        "mp_rank_00/model_optim_rng.pt",
        "mp_rank_01/model_optim_rng.pt"
    ]
}

For models that are trained with DeepSpeed, the checkpoint json file only requires storing the path to the model checkpoints.

"checkpoint.json":
{
    "type": "ds_model",
    "version": 0.0,
    "checkpoints": "path_to_checkpoints"
}
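As a sketch of how such a file is consumed (the tp_size and dtype values here are just placeholders), the path to the checkpoint JSON is passed through the checkpoint argument of init_inference, exactly as in the first snippet above:

import torch
import deepspeed

# model is the bare model definition built as in the first snippet (e.g. model_class());
# DeepSpeed loads the weights from the checkpoints listed in checkpoint.json at init time.
ds_engine = deepspeed.init_inference(model,
                                     tensor_parallel={"tp_size": 2},
                                     dtype=torch.half,
                                     checkpoint="checkpoint.json",
                                     replace_with_kernel_inject=True)
model = ds_engine.module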

DeepSpeed supports running a different MP degree for inference than the one used for training. For example, a model trained without any MP can be run with MP=2, and a model trained with MP=4 can be inferenced without any MP. DeepSpeed automatically merges or splits the checkpoints during initialization as necessary.
【When running under Slurm, you have to request a specific number of GPUs for the job; this limits how many GPUs are visible on the node, and therefore how many GPUs DeepSpeed can detect and use.】

End-to-End GPT NEO 2.7B Inference

DeepSpeed Inference can be used together with a HuggingFace pipeline.
Below is an end-to-end example that combines DeepSpeed Inference with the HuggingFace text-generation pipeline.

# Filename: gpt-neo-2.7b-generation.py
import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)

generator.model = deepspeed.init_inference(generator.model,
                                           tensor_parallel={"tp_size": world_size},
                                           dtype=torch.float,
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
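The script is launched with the DeepSpeed launcher, which sets the LOCAL_RANK and WORLD_SIZE environment variables read at the top of the file; for example, with two GPUs:

deepspeed --num_gpus 2 gpt-neo-2.7b-generation.py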

The text-generation pipeline first tokenizes the input text (for chat models it also applies the chat template first), runs it through the model, and then post-processes the output: decoding the generated token IDs back into text, handling special tokens, and undoing the preprocessing so that escaped or unprintable characters are turned back into readable text.
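Roughly speaking, and leaving out the pipeline's extra bookkeeping, this corresponds to the following sketch using the plain HuggingFace tokenizer/generate/decode APIs (an illustration of the steps, not the pipeline's actual implementation):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neo-2.7B')
model = AutoModelForCausalLM.from_pretrained('EleutherAI/gpt-neo-2.7B')

# tokenize -> generate -> decode: the three steps the pipeline wraps for us
inputs = tokenizer("DeepSpeed is", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, do_sample=True, min_length=50)
text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(text)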

The above script modifies the model in HuggingFace text-generation pipeline to use DeepSpeed inference. Note that here we can run the inference on multiple GPUs using the model-parallel tensor-slicing across GPUs even though the original model was trained without any model parallelism and the checkpoint is also a single GPU checkpoint.
