Deep Learning Series 69: Fundamentals of Model Deployment

Article link: https://blog.csdn.net/kittyzc/article/details/140607207

Reference: https://mp.weixin.qq.com/s?__biz=MzI4MDcxNTY2MQ==&mid=2247488952&idx=1&sn=880d3ad47a8fb3eab56514135f0e643b&chksm=ebb51d5adcc2944c276af19e8cff5e73c934f8811706be0a94c5f47f9e767c902939903e6b95&scene=21#wechat_redirect

1. Basic pipeline

1.1 Introduction

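To deploy a model to a target environment, developers first define the network structure in any deep learning framework and determine its parameters through training.
The structure and parameters are then converted into an intermediate representation that describes only the network itself, and graph-level optimizations are applied to this representation.
Finally, an inference engine, written in a hardware-oriented high-performance programming framework (such as CUDA or OpenCL) and able to execute deep learning operators efficiently, converts the intermediate representation into its own file format and runs the model efficiently on the corresponding hardware platform.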
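The three stages above can be mimicked in a few lines of plain Python. This is a toy sketch rather than a real deployment stack: the list-of-tuples IR, the scale/shift operators, and the names export_to_ir, fuse_scales and run_engine are all made up for illustration.

```python
from typing import Callable, Dict, List, Tuple

Op = Tuple[str, float]  # (operator name, parameter)

def export_to_ir(trained_ops: List[Op]) -> List[Op]:
    """Stage 2: freeze structure and parameters into a framework-neutral IR."""
    return list(trained_ops)

def fuse_scales(ir: List[Op]) -> List[Op]:
    """An IR-level optimization: merge consecutive 'scale' nodes into one."""
    fused: List[Op] = []
    for name, value in ir:
        if fused and name == "scale" and fused[-1][0] == "scale":
            fused[-1] = ("scale", fused[-1][1] * value)
        else:
            fused.append((name, value))
    return fused

# Stage 3: the "engine" owns one kernel per operator and walks the IR.
KERNELS: Dict[str, Callable[[float, float], float]] = {
    "scale": lambda x, v: x * v,
    "shift": lambda x, v: x + v,
}

def run_engine(ir: List[Op], x: float) -> float:
    for name, value in ir:
        x = KERNELS[name](x, value)
    return x

# Stage 1: pretend these operators and parameters came out of training.
trained = [("scale", 2.0), ("scale", 3.0), ("shift", 1.0)]
ir = export_to_ir(trained)
optimized = fuse_scales(ir)
print(optimized, run_engine(optimized, 5.0))
```

The optimized IR has fewer nodes but produces the same result as the unoptimized one, which is the property a real graph optimizer also has to preserve.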
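The following example builds a super-resolution network (SRCNN) in PyTorch, loads pretrained weights, and runs it on a test image: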

import os,cv2,requests,torch
import torch.onnx
import numpy as np
from torch import nn
from torch.nn.functional import interpolate

class SuperResolutionNet(nn.Module):
    def __init__(self, upscale_factor):
        super().__init__()
        self.upscale_factor = upscale_factor
        self.conv1 = nn.Conv2d(3,64,kernel_size=9,padding=4)
        self.conv2 = nn.Conv2d(64,32,kernel_size=1,padding=0)
        self.conv3 = nn.Conv2d(32,3,kernel_size=5,padding=2)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = interpolate(x,scale_factor=self.upscale_factor,mode='bicubic',align_corners=False)
        out = self.relu(self.conv1(x))
        out = self.relu(self.conv2(out))
        out = self.conv3(out)
        return out
        
# Download checkpoint and test image
urls = ['https://download.openmmlab.com/mmediting/restorers/srcnn/srcnn_x4k915_1x16_1000k_div2k_20200608-4186f232.pth',
    'https://raw.githubusercontent.com/open-mmlab/mmediting/master/tests/data/face/000001.png']
names = ['srcnn.pth', 'face.png']
for url, name in zip(urls, names):
    if not os.path.exists(name):
        # Use a context manager so the file is flushed and closed
        with open(name, 'wb') as f:
            f.write(requests.get(url).content)

def init_torch_model():
    torch_model = SuperResolutionNet(upscale_factor=3)
    state_dict = torch.load('srcnn.pth')['state_dict']
    # Adapt the checkpoint
    for old_key in list(state_dict.keys()):
        new_key = '.'.join(old_key.split('.')[1:])
        state_dict[new_key] = state_dict.pop(old_key)
    torch_model.load_state_dict(state_dict)
    torch_model.eval()
    return torch_model

model = init_torch_model()
input_img = cv2.imread('face.png').astype(np.float32)

# HWC to NCHW
input_img = np.transpose(input_img, [2, 0, 1])
input_img = np.expand_dims(input_img, 0)

# Inference
torch_output = model(torch.from_numpy(input_img)).detach().numpy()

# NCHW to HWC
torch_output = np.squeeze(torch_output, 0)
torch_output = np.clip(torch_output, 0, 255)
torch_output = np.transpose(torch_output, [1, 2, 0]).astype(np.uint8)

# Show image
cv2.imwrite("face_torch.png", torch_output)

1.2 Optimizing eager mode

PyTorch 2.0 introduced a simple function, torch.compile, that wraps your model and returns a compiled model.

compiled_model = torch.compile(model)

compiled_model holds a reference to your model and compiles its forward function into a more optimized version. Several parameters configure the compilation:

def torch.compile(
    model: Callable,
    *,
    mode: Optional[str] = "default",
    dynamic: bool = False,
    fullgraph:bool = False,
    backend: Union[str, Callable] = "inductor",
# advanced backend options go here as kwargs
    **kwargs
) -> torch._dynamo.NNOptimizedModule
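mode specifies what the compiler should optimize for.

The default mode tries to compile efficiently, that is, without taking too long to compile and without using extra memory.
Other modes trade compile time or memory for speed: reduce-overhead cuts framework overhead further at the cost of a small amount of extra memory, and max-autotune compiles for a long time in an attempt to produce the fastest code it can generate.

dynamic specifies whether to enable the code-generation path for dynamic shapes. Some compiler optimizations cannot be applied to programs with dynamic shapes, so stating explicitly whether you want a dynamic-shape or a static-shape program helps the compiler produce better-optimized code.

fullgraph is similar to Numba's nopython. It compiles the entire program into a single computation graph, or raises an error explaining why it cannot. Most users do not need this mode; it is worth trying if you care a great deal about performance.

backend specifies which compiler backend to use. TorchInductor is the default, but several others are available.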
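As a minimal sketch of how these options are passed, the snippet below compiles a small module on the CPU. The TinyNet module is hypothetical, and backend="eager" is chosen here only so that the example runs without a GPU or a C++ toolchain: it exercises graph capture but performs no kernel optimization, so the default "inductor" backend is the one to use for real speedups.

```python
import torch
from torch import nn

class TinyNet(nn.Module):  # hypothetical stand-in for a real model
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 4)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = TinyNet().eval()

compiled_model = torch.compile(
    model,
    backend="eager",   # debugging backend: capture the graph, run it unoptimized
    fullgraph=True,    # raise an error instead of silently splitting the graph
    dynamic=False,     # specialize on the input shapes seen at compile time
)

x = torch.randn(2, 8)
with torch.no_grad():
    expected = model(x)
    actual = compiled_model(x)   # the first call triggers compilation
```

Because the compiled module wraps the same parameters, its output should match the eager output for the same input.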
Complete example:

import torch
import torchvision.models as models

model = models.resnet18().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
compiled_model = torch.compile(model)

x = torch.randn(16, 3, 224, 224).cuda()
optimizer.zero_grad()
out = compiled_model(x)
out.sum().backward()
optimizer.step()

torch.compile offers several modes; choose one as needed:

# API NOT FINAL
# default: optimizes for large models, low compile-time
#          and no extra memory usage
torch.compile(model)

# reduce-overhead: optimizes to reduce the framework overhead
#                and uses some extra memory. Helps speed up small models
torch.compile(model, mode="reduce-overhead")

# max-autotune: optimizes to produce the fastest model,
#               but takes a very long time to compile
torch.compile(model, mode="max-autotune")

The state_dict of the optimized model can be saved to a .pt file:

torch.save(compiled_model.state_dict(), "foo.pt")
# both lines save the same parameter tensors
torch.save(model.state_dict(), "foo.pt")
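One detail is worth checking before relying on the first form. torch.compile returns a wrapper around the original module, and in the PyTorch 2.x releases this sketch assumes, the wrapper's state_dict keys typically carry an _orig_mod. prefix. The tensors are shared, but a checkpoint saved from the wrapper may not load directly into an uncompiled model. The nn.Linear model below is a hypothetical stand-in.

```python
import torch
from torch import nn

model = nn.Linear(4, 2)                 # hypothetical stand-in model
compiled_model = torch.compile(model)   # wrapping alone does not compile anything yet

plain_sd = model.state_dict()
wrapped_sd = compiled_model.state_dict()

# Strip the wrapper prefix (a no-op if the installed version does not add one).
prefix = "_orig_mod."
cleaned_sd = {
    (key[len(prefix):] if key.startswith(prefix) else key): value
    for key, value in wrapped_sd.items()
}

# Both dictionaries point at the same underlying parameter storage.
shares_storage = all(
    cleaned_sd[key].data_ptr() == plain_sd[key].data_ptr() for key in plain_sd
)
print(sorted(wrapped_sd), shares_storage)
```

Saving model.state_dict() keeps the checkpoint loadable whether or not the consumer compiles the model.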

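In addition, PyTorch is introducing a mode called torch.export, which carefully exports the entire model together with its guards for environments that need guaranteed, predictable latency.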

exported_model = torch._dynamo.export(model, input)
torch.save(exported_model, "foo.pt")

In addition, the backend argument accepts several compiler backends besides the default.
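The set of backends depends on the installed PyTorch build and on optional packages, so it may be easiest to query it at run time. A small sketch, assuming PyTorch 2.x, where torch._dynamo.list_backends is available:

```python
import torch
import torch._dynamo

# Names accepted by torch.compile(..., backend=<name>); debug and
# experimental backends are hidden by default.
available_backends = torch._dynamo.list_backends()
print(available_backends)
```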

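2. Intermediate representation layer

2.1 The ONNX framework

ONNX is an open format that represents a model as a static computation graph, and PyTorch ships with its own ONNX export tool.
The first approach is tracing: given a set of inputs, the model is actually executed once, and the computation graph for those inputs is recorded and saved in ONNX format. The export function uses this tracing approach, so it needs some input to run the model with. Our test image has three channels and a size of 256x256, so we construct a random tensor of the same shape.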

x = torch.randn(1, 3, 256, 256)
with torch.no_grad():
    torch.onnx.export(
        model,
        x,
        "srcnn.onnx",
        opset_version=11,
        input_names=['input'],
        output_names=['output'])

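opset_version is the version of the ONNX operator set. New operators keep appearing as deep learning evolves, and ONNX regularly publishes new operator sets to support them (15 versions at the time of writing). We set opset_version = 11, that is, we use ONNX operator set 11, because bicubic interpolation, which SRCNN uses, is only supported from opset 11 onward. The remaining two arguments, input_names and output_names, name the input and output tensors.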
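Why a traced graph can lose information becomes clearer with a toy tracer. The sketch below is plain Python, not PyTorch: the Sym class and the trace function are invented for illustration. A value that enters the function as a traced input stays a graph input, while a plain Python number is recorded as a constant.

```python
class Sym:
    """Stand-in for a traced tensor: records operations instead of computing them."""

    def __init__(self, name, graph):
        self.name = name
        self.graph = graph

    def resize(self, factor):
        # A Sym factor stays a graph input; a Python number is baked in as a constant.
        arg = factor.name if isinstance(factor, Sym) else f"const({factor})"
        out = Sym(f"t{len(self.graph)}", self.graph)
        self.graph.append(f"{out.name} = Resize({self.name}, {arg})")
        return out

def trace(fn, *input_names):
    """Run fn once on symbolic inputs and return the recorded operations."""
    graph = []
    fn(*[Sym(name, graph) for name in input_names])
    return graph

# The factor is a Python number: it disappears into the graph as a constant.
static_graph = trace(lambda x: x.resize(3), "input")

# The factor is itself a traced input: it survives as a second graph input.
dynamic_graph = trace(lambda x, f: x.resize(f), "input", "factor")

print(static_graph)
print(dynamic_graph)
```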

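2.2 Challenges in converting to a static graph

Not every model converts to ONNX smoothly. The main challenges are:

Dynamic behavior in the model.
Implementing new operators.
Compatibility between the intermediate representation and the inference engine.

In the ONNX model above, upscale_factor was 3 at export time, so the exported model can only upscale by a factor of 3. A model that upscales by 4 requires building the model again and repeating the ONNX conversion, and this conversion and the work that follows it are often time-consuming and tedious.
Continuing with the code above, we change upscale_factor from an argument fixed at model definition to a dynamic input at inference time: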

torch_output = model(torch.from_numpy(input_img),torch.tensor(3)).detach().numpy()

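Correspondingly, upscale_factor is removed from the model's __init__ and added as an argument of forward. Note that torch.tensor(3) must be passed instead of the Python integer 3; otherwise the ONNX export fails.
However, exporting to ONNX now reports a TracerWarning, which says that some values may not be traced correctly. Although the model now takes two inputs at inference time, the ONNX model looks exactly the same as before, with a single input named "input".
We need to look at what actually happens to this operator. Inspecting the ONNX model in Netron shows that, whether you use the older nn.Upsample or the later interpolate, PyTorch's interpolation is always converted to the Resize operator defined by ONNX. In other words, converting PyTorch to ONNX really means mapping each PyTorch operation to an operator defined by ONNX.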

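We therefore write a new interpolation operator, modeled on the original interpolate, that passes the scale factor to the ONNX Resize operator as an input. In the earlier 3x model this value was fixed to [1, 1, 3, 3]. So for the new operator we want the model's second input to be a tensor of the form [1, 1, h, w], where h and w are the scale factors for the image height and width (matching the NCHW layout).
An operator's inference behavior is defined by its forward method. The first parameter of this method must be ctx, and the remaining parameters are the operator's own inputs; here there are two, the image to operate on and the scale factors. For inference to be correct, the [1, 1, h, w] input has to be connected to the original interpolate function: we take the last two elements of the input tensor and pass them as a list to interpolate's scale_factor argument.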

class NewInterpolate(torch.autograd.Function):

    @staticmethod
    def symbolic(g, input, scales):
        return g.op("Resize",
                    input,
                    g.op("Constant",
                         value_t=torch.tensor([], dtype=torch.float32)),
                    scales,
                    coordinate_transformation_mode_s="pytorch_half_pixel",
                    cubic_coeff_a_f=-0.75,
                    mode_s='cubic',
                    nearest_mode_s="floor")

    @staticmethod
    def forward(ctx, input, scales):
        scales = scales.tolist()[-2:]
        return interpolate(input,
                           scale_factor=scales,
                           mode='bicubic',
                           align_corners=False)

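How the operator maps to ONNX is defined by its symbolic method. Its first parameter must be g, and the remaining parameters are the operator's own inputs, the same as in forward. The ONNX operator itself is created with g.op: positional arguments become the inputs of the ONNX node, and keyword arguments become its attributes, with the suffix giving the attribute type (_s for string, _f for float, _i for integer, _t for tensor).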
As expected, the exported ONNX model now has two inputs. The second input is the image scale factor.
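The model change itself is only described in words, so here is a hedged sketch of how the pieces might fit together in eager mode. NewInterpolate is repeated from the listing above so that the block is self-contained; the class name DynamicSuperResolutionNet and the random input are assumptions, and the weights are left untrained.

```python
import torch
from torch import nn
from torch.nn.functional import interpolate

class NewInterpolate(torch.autograd.Function):
    @staticmethod
    def symbolic(g, input, scales):
        # Used only during ONNX export: a Resize node whose scales are a graph input.
        return g.op("Resize", input,
                    g.op("Constant", value_t=torch.tensor([], dtype=torch.float32)),
                    scales,
                    coordinate_transformation_mode_s="pytorch_half_pixel",
                    cubic_coeff_a_f=-0.75, mode_s="cubic", nearest_mode_s="floor")

    @staticmethod
    def forward(ctx, input, scales):
        # Eager behavior: use the last two entries of [1, 1, h, w] as scale factors.
        return interpolate(input, scale_factor=scales.tolist()[-2:],
                           mode="bicubic", align_corners=False)

class DynamicSuperResolutionNet(nn.Module):  # hypothetical name
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=9, padding=4)
        self.conv2 = nn.Conv2d(64, 32, kernel_size=1, padding=0)
        self.conv3 = nn.Conv2d(32, 3, kernel_size=5, padding=2)
        self.relu = nn.ReLU()

    def forward(self, x, scales):
        x = NewInterpolate.apply(x, scales)
        out = self.relu(self.conv1(x))
        out = self.relu(self.conv2(out))
        return self.conv3(out)

model = DynamicSuperResolutionNet().eval()
image = torch.randn(1, 3, 8, 8)  # random stand-in for a real image
with torch.no_grad():
    out_x3 = model(image, torch.tensor([1, 1, 3, 3], dtype=torch.float))
    out_x4 = model(image, torch.tensor([1, 1, 4, 4], dtype=torch.float))
print(out_x3.shape, out_x4.shape)
```

For export, the two tensors would be passed together as a tuple, for example torch.onnx.export(model, (image, scales), "srcnn_dynamic.onnx", opset_version=11, input_names=['input', 'factor'], output_names=['output']), where the file name and the 'factor' input name are placeholders. This relies on the TorchScript-based exporter, which is the one that honors the symbolic method.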

2.3 TorchScript

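TorchScript is a format for serializing and optimizing PyTorch models. During optimization, a torch.nn.Module is converted into a TorchScript torch.jit.ScriptModule. Today TorchScript is also commonly used as an intermediate representation: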

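# Generate the IR by tracing (torchvision's resnet18 is assumed as the model,
# consistent with the graph printed below)
from torchvision.models import resnet18
model = resnet18().eval()
dummy_input = torch.rand(1, 3, 224, 224)
with torch.no_grad():
    jit_model = torch.jit.trace(model, dummy_input)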

So what does this IR actually contain? Let's print its layer1:

jit_layer1 = jit_model.layer1
print(jit_layer1.graph)

# graph(%self.6 : __torch__.torch.nn.modules.container.Sequential,
#       %4 : Float(1, 64, 56, 56, strides=[200704, 3136, 56, 1], requires_grad=0, device=cpu)):
#   %1 : __torch__.torchvision.models.resnet.___torch_mangle_10.BasicBlock = prim::GetAttr[name="1"](%self.6)
#   %2 : __torch__.torchvision.models.resnet.BasicBlock = prim::GetAttr[name="0"](%self.6)
#   %6 : Tensor = prim::CallMethod[name="forward"](%2, %4)
#   %7 : Tensor = prim::CallMethod[name="forward"](%1, %6)
#   return (%7)

print(jit_layer1.code)

# def forward(self,
#     argument_1: Tensor) -> Tensor:
#   _0 = getattr(self, "1")
#   _1 = (getattr(self, "0")).forward(argument_1, )
#   return (_0).forward(_1, )

# Run the inline pass to transform the graph
torch._C._jit_pass_inline(jit_layer1.graph)
print(jit_layer1.code)

# def forward(self,
#     argument_1: Tensor) -> Tensor:
#   _0 = getattr(self, "1")
#   _1 = getattr(self, "0")
#   _2 = _1.bn2
#   _3 = _1.conv2
#   _4 = _1.bn1
#   input = torch._convolution(argument_1, _1.conv1.weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 1, False, False, True, True)
#   _5 = _4.running_var
#   _6 = _4.running_mean
#   _7 = _4.bias
#   input0 = torch.batch_norm(input, _4.weight, _7, _6, _5, False, 0.10000000000000001, 1.0000000000000001e-05, True)
#   input1 = torch.relu_(input0)
#   input2 = torch._convolution(input1, _3.weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 1, False, False, True, True)
#   _8 = _2.running_var
#   _9 = _2.running_mean
#   _10 = _2.bias
#   out = torch.batch_norm(input2, _2.weight, _10, _9, _8, False, 0.10000000000000001, 1.0000000000000001e-05, True)
#   input3 = torch.add_(out, argument_1, alpha=1)
#   input4 = torch.relu_(input3)
#   _11 = _0.bn2
#   _12 = _0.conv2
#   _13 = _0.bn1
#   input5 = torch._convolution(input4, _0.conv1.weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 1, False, False, True, True)
#   _14 = _13.running_var
#   _15 = _13.running_mean
#   _16 = _13.bias
#   input6 = torch.batch_norm(input5, _13.weight, _16, _15, _14, False, 0.10000000000000001, 1.0000000000000001e-05, True)
#   input7 = torch.relu_(input6)
#   input8 = torch._convolution(input7, _12.weight, None, [1, 1], [1, 1], [1, 1], False, [0, 0], 1, False, False, True, True)
#   _17 = _11.running_var
#   _18 = _11.running_mean
#   _19 = _11.bias
#   out0 = torch.batch_norm(input8, _11.weight, _19, _18, _17, False, 0.10000000000000001, 1.0000000000000001e-05, True)
#   input9 = torch.add_(out0, input4, alpha=1)
#   return torch.relu_(input9)

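Besides trace, PyTorch provides another way to produce a TorchScript model: script. This approach parses the Python source of the network definition directly into an abstract syntax tree (AST), so it can handle cases that trace cannot, such as building the graph for control-flow statements like branches and loops.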
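The difference shows up as soon as the forward pass branches on its input. In this small sketch (the Gate module is hypothetical), trace records only the branch taken by the example input, while script keeps both:

```python
import warnings
import torch
from torch import nn

class Gate(nn.Module):  # hypothetical module with data-dependent control flow
    def forward(self, x):
        if x.sum() > 0:
            return x * 2
        return x - 10

model = Gate()
positive = torch.ones(3)
negative = -torch.ones(3)

with warnings.catch_warnings():
    # Tracing a data-dependent branch emits a TracerWarning; silence it here.
    warnings.simplefilter("ignore")
    traced = torch.jit.trace(model, positive)   # records only the "x * 2" path
scripted = torch.jit.script(model)              # compiles the if/else itself

traced_out = traced(negative)      # still multiplies by 2
scripted_out = scripted(negative)  # takes the "x - 10" path, like eager mode
print(traced_out, scripted_out)
```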
A TorchScript model created by either method can be serialized, for example:

# Serialize the model
jit_model.save('jit_model.pth')
# Load the serialized model
jit_model = torch.jit.load('jit_model.pth')

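ONNX export relies on TorchScript's trace tool. The steps are:

(1) Generate a TorchScript model by tracing. If the model being converted is already a TorchScript model, this step can be skipped.
(2) Apply a series of passes to the model from step (1). The most important pass for ONNX export is ToONNX, which maps operators in TorchScript's prim and aten namespaces to operators in the onnx namespace.
(3) Serialize the model in ONNX's proto format, which completes the export.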
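The aten namespace mentioned in the second step is easy to see in a traced graph. A minimal sketch, assuming a single ReLU module as the traced model:

```python
import torch
from torch import nn

traced = torch.jit.trace(nn.ReLU(), torch.randn(1, 3))

# inlined_graph expands calls into submodules so that only primitive ops remain.
op_kinds = [node.kind() for node in traced.inlined_graph.nodes()]
print(op_kinds)
```

During export, the ToONNX pass would rewrite such aten:: nodes into their onnx:: counterparts (here, onnx::Relu).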

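3. Inference engine

ONNX Runtime is a cross-platform machine learning inference accelerator maintained by Microsoft, that is, the "inference engine" mentioned earlier. It works with ONNX directly: ONNX Runtime can read and run .onnx files as they are.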

import onnxruntime
ort_session = onnxruntime.InferenceSession("srcnn.onnx")
ort_inputs = {'input': input_img}
ort_output = ort_session.run(['output'], ort_inputs)[0]

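The session's run method performs inference. Its first argument is a list of output tensor names and its second is a dictionary of inputs, whose keys are tensor names and whose values are NumPy arrays. The input and output names must match those set in torch.onnx.export.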

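4. A complete example

Point-based models in the PointNet family process point clouds directly, which reduces the loss of positional information but consumes so much compute that real-time operation is hard to achieve. Voxel-based models such as VoxelNet infer faster than point-based models, but because their backbone uses 3D convolutions, real-time operation remains difficult.
To address the speed of point cloud object detection, nuTonomy proposed PointPillars in 2019, and it is now one of the most widely used models for this task. Compared with other models, PointPillars is markedly faster at inference while keeping good accuracy. The network has three parts:
Pillar Feature Net: converts the input point cloud into a sparse pseudo-image of features.
Backbone (2D CNN): processes the pseudo-image with a 2D CNN to obtain high-dimensional features.
Detection Head (SSD): detects and regresses 3D bounding boxes.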
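To make the first stage more concrete, the sketch below bins a point cloud into pillars on an x-y grid and builds a one-channel pseudo-image. It is a heavy simplification: the real Pillar Feature Net learns a feature vector per pillar, whereas this toy version just counts points. The function name pillarize and the grid parameters are assumptions.

```python
import numpy as np

def pillarize(points, x_range, y_range, pillar_size):
    """Count points per vertical pillar on an x-y grid (z is ignored)."""
    nx = int(round((x_range[1] - x_range[0]) / pillar_size))
    ny = int(round((y_range[1] - y_range[0]) / pillar_size))

    # Grid index of every point along x and y.
    ix = np.floor((points[:, 0] - x_range[0]) / pillar_size).astype(int)
    iy = np.floor((points[:, 1] - y_range[0]) / pillar_size).astype(int)

    # Drop points that fall outside the detection range.
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)

    # Pseudo-image: rows follow y, columns follow x; most cells stay empty (sparse).
    canvas = np.zeros((ny, nx), dtype=np.int64)
    np.add.at(canvas, (iy[keep], ix[keep]), 1)
    return canvas

points = np.array([
    [0.1, 0.1, 0.0],
    [0.2, 0.3, 1.0],   # same pillar as the first point
    [1.5, 0.5, 0.0],
    [5.0, 5.0, 0.0],   # outside the range, discarded
])
pseudo_image = pillarize(points, x_range=(0.0, 2.0), y_range=(0.0, 2.0), pillar_size=1.0)
print(pseudo_image)
```

Because the result is an ordinary 2D array, the later stages can use plain 2D convolutions, which is where the speed advantage over 3D-convolution backbones comes from.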