Series index
Deep Learning Series: onnx, onnxruntime, netron
onnx
Official: https://onnx.ai/
Github: https://github.com/onnx/onnx
zhihu: https://zhuanlan.zhihu.com/p/346511883
onnxruntime
Official: https://onnxruntime.ai/
Github: https://github.com/microsoft/onnxruntime
zhihu: https://zhuanlan.zhihu.com/p/582974246
netron
Official: https://netron.app/
Github: https://github.com/lutzroeder/netron
zhihu: https://zhuanlan.zhihu.com/p/431445882
Environment setup: installing and testing onnxruntime-gpu for ONNX model deployment (Python)
https://blog.csdn.net/qq_40541102/article/details/130086491
In short:
# create a conda environment
conda create -p .conda python=3.10
# install pytorch based on cuda 11.3
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
# install cudnn
conda install cudnn==8.2.1
# install onnxruntime-gpu
pip install onnxruntime-gpu
# install onnx
conda install onnx
1. Installing onnxruntime
To run inference with an ONNX model on the CPU, install onnxruntime with pip directly in the conda environment:
pip install onnxruntime
2. Installing onnxruntime-gpu
To accelerate ONNX model inference on the GPU, you need to install onnxruntime-gpu. There are two approaches:
- rely on the CUDA and cuDNN versions already installed on the host machine
- do not rely on the CUDA and cuDNN versions installed on the host machine
Note: the versions of onnxruntime-gpu, CUDA, and cuDNN must match each other; otherwise you will get errors, or GPU inference will not be available.
See the official website for the onnxruntime-gpu / CUDA / cuDNN version compatibility matrix.
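Before going further, a quick way to confirm which onnxruntime build is installed and whether it can see the GPU is to query it from Python (a minimal check, not tied to any particular version):
import onnxruntime as ort

print(ort.__version__)                # installed onnxruntime / onnxruntime-gpu version
print(ort.get_device())               # 'GPU' when a CUDA-enabled build is active
print(ort.get_available_providers())  # should include 'CUDAExecutionProvider' for GPU inference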
2.1 Method 1: onnxruntime-gpu relies on the host's CUDA and cuDNN
- Check the installed CUDA and cuDNN versions
# cuda version
cat /usr/local/cuda/version.txt
# cudnn version
cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
- Then install the onnxruntime-gpu version that matches, according to the onnxruntime-gpu / CUDA / cuDNN compatibility table.
## cuda==10.2
## cudnn==8.0.3
## onnxruntime-gpu==1.5.0 or 1.6.0
pip install onnxruntime-gpu==1.6.0
2.2 Method 2: onnxruntime-gpu does not rely on the host's CUDA and cuDNN
Installing everything inside the conda environment avoids any dependence on the CUDA and cuDNN versions installed on the host, which is flexible and convenient. Two combinations that have been tested successfully:
- python3.6, cudatoolkit10.2.89, cudnn7.6.5, onnxruntime-gpu1.4.0
- python3.8, cudatoolkit11.3.1, cudnn8.2.1, onnxruntime-gpu1.14.1
If you need other versions, you can put together and test your own combination based on the onnxruntime-gpu / CUDA / cuDNN compatibility table.
The example below walks from creating the conda environment to running GPU-accelerated ONNX model inference.
2.2.1 Example: creating a conda environment with onnxruntime-gpu==1.14.1
## create the conda environment
conda create -n torch python=3.8
## activate the conda environment
source activate torch
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
conda install cudnn==8.2.1
pip install onnxruntime-gpu==1.14.1
## pip install ... (install any other packages you need)
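After the environment is created, it is worth confirming which CUDA and cuDNN versions the conda packages actually provide. The sketch below does this through PyTorch's reporting APIs and the onnxruntime provider list; it is a sanity check added here, not part of the original recipe:
import torch
import onnxruntime as ort

print(torch.version.cuda)              # CUDA runtime version bundled with this PyTorch build, e.g. 11.3
print(torch.backends.cudnn.version())  # cuDNN version loaded in this environment, as an integer
print(torch.cuda.is_available())       # True if the GPU is visible to PyTorch
print(ort.get_available_providers())   # 'CUDAExecutionProvider' should be listed for onnxruntime-gpu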
2.2.2 Example: a quick test
Open a terminal and run watch -n 0.1 nvidia-smi to monitor GPU usage in real time.
- Test code, excerpted from the onnxruntime Python API examples:
import numpy as np
import torch
import onnxruntime

MODEL_FILE = '.model.onnx'
DEVICE_NAME = 'cuda' if torch.cuda.is_available() else 'cpu'
DEVICE_INDEX = 0
DEVICE = f'{DEVICE_NAME}:{DEVICE_INDEX}'

# A simple model to calculate addition of two tensors
def model():
    class Model(torch.nn.Module):
        def __init__(self):
            super(Model, self).__init__()

        def forward(self, x, y):
            return x.add(y)

    return Model()

# Create an instance of the model and export it to ONNX graph format
def create_model(type: torch.dtype = torch.float32):
    sample_x = torch.ones(3, dtype=type)
    sample_y = torch.zeros(3, dtype=type)
    torch.onnx.export(model(), (sample_x, sample_y), MODEL_FILE,
                      input_names=["x", "y"], output_names=["z"],
                      dynamic_axes={"x": {0: "array_length_x"}, "y": {0: "array_length_y"}})

# Create an ONNX Runtime session with the provided model
def create_session(model: str) -> onnxruntime.InferenceSession:
    providers = ['CPUExecutionProvider']
    if torch.cuda.is_available():
        providers.insert(0, 'CUDAExecutionProvider')
    return onnxruntime.InferenceSession(model, providers=providers)

# Run the model on CPU consuming and producing numpy arrays
def run(x: np.array, y: np.array) -> np.array:
    session = create_session(MODEL_FILE)
    z = session.run(["z"], {"x": x, "y": y})
    return z[0]

def main():
    create_model()
    print(run(x=np.float32([1.0, 2.0, 3.0]), y=np.float32([4.0, 5.0, 6.0])))

if __name__ == "__main__":
    main()
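The run() function above moves numpy arrays through host memory. If you want to keep inputs and outputs on the GPU and avoid extra host/device copies, ONNX Runtime's OrtValue / IOBinding API can be used. This is only a minimal sketch reusing MODEL_FILE, DEVICE_NAME and DEVICE_INDEX from above, and it assumes onnxruntime-gpu with a working CUDAExecutionProvider:
# Run the model with inputs/outputs bound to the chosen device (sketch, assumes CUDA is available)
def run_with_data_on_device(x: np.array, y: np.array) -> onnxruntime.OrtValue:
    session = create_session(MODEL_FILE)

    # Wrap the numpy arrays as OrtValues that live on the target device
    x_ortvalue = onnxruntime.OrtValue.ortvalue_from_numpy(x, DEVICE_NAME, DEVICE_INDEX)
    y_ortvalue = onnxruntime.OrtValue.ortvalue_from_numpy(y, DEVICE_NAME, DEVICE_INDEX)

    # Bind inputs and the output to the device, then run without extra copies
    io_binding = session.io_binding()
    io_binding.bind_ortvalue_input('x', x_ortvalue)
    io_binding.bind_ortvalue_input('y', y_ortvalue)
    io_binding.bind_output('z', DEVICE_NAME, DEVICE_INDEX)
    session.run_with_iobinding(io_binding)
    return io_binding.get_outputs()[0]

# Usage: print(run_with_data_on_device(np.float32([1.0, 2.0, 3.0]), np.float32([4.0, 5.0, 6.0])).numpy())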
Model deployment for inference (1): ONNX Runtime in practice
https://zhuanlan.zhihu.com/p/582974246
Simply put, the life cycle of a machine learning model can be divided into two parts: training iteration and production deployment.
Training iteration means fixing a dataset, model architecture, loss function and evaluation metrics, then training the model parameters, aiming for results as close to SOTA (state of the art) as possible.
Production deployment means getting the trained model to run in a specific environment, with more attention paid to the deployment scenario, deployment method, throughput and latency.
In practice, deep learning models are usually built with frameworks such as PyTorch or TensorFlow, but running inference directly through these frameworks is not very efficient, especially in online scenarios with strict latency requirements. After years of exploration in industry and academia, a popular deployment pipeline has emerged: deep learning framework -> intermediate representation -> inference engine.
This pipeline solves two major problems in model deployment: with an intermediate representation that bridges deep learning frameworks and inference engines, developers no longer need to worry about getting each complex framework to run in a new environment; and through graph-level optimization of the intermediate representation plus low-level operator optimization in the inference engine, model execution becomes much more efficient.
Next, we will experience the deployment process through step-by-step practice.
1. A look at ONNX
ONNX (Open Neural Network Exchange) is a standard format for describing computational graphs, jointly released by Facebook and Microsoft in 2017. ONNX already interfaces with many deep learning frameworks (such as TensorFlow, PyTorch, Scikit-learn, MXNet) and many inference engines, so it is treated as the bridge from deep learning frameworks to inference engines, much like an intermediate language in a compiler. Because framework compatibility varies, ONNX is usually used only to represent static graphs, which are easier to deploy.
The ONNX file format
ONNX files are serialized with Protobuf. The key data structures to know from the onnx.proto3 schema are:
- ModelProto: the model definition, containing version information, producer information and a GraphProto.
- GraphProto: contains repeated NodeProto, initializer and ValueInfoProto entries, which together form a computational graph. Inside GraphProto these elements are stored as lists, and connectivity is expressed through the inputs and outputs of the nodes.
- NodeProto: the ONNX computational graph is a directed acyclic graph (DAG); NodeProto defines the operator type, the node's inputs and outputs, and its attributes.
- ValueInfoProto: defines the types of variables such as inputs and outputs.
- TensorProto: serialized weight data, including the data type, shape and so on.
- AttributeProto: a named attribute that can hold basic data types (int, float, string, vector, etc.) as well as ONNX-defined structures (TENSOR, GRAPH, etc.).
Building a simple ONNX graph
Below we construct a simple ONNX model with the onnx helper API:
- First, build ValueInfoProto objects describing the input and output tensors with helper.make_tensor_value_info. Three pieces of information are required: the tensor name, its element data type, and its shape.
- Then, build the operator node NodeProto by passing the operator type, the input tensor names, and the output tensor names to helper.make_node. Note that ONNX stores edge information inside the nodes, so there is no separate edge list: if a node's input name matches another node's output name, the two nodes are implicitly connected. In the example below, the Mul node defines output c and the Add node takes input c, so the Mul and Add nodes are connected.
- Next, build the computational graph GraphProto with helper.make_graph, which takes four arguments: the nodes, the graph name, the input tensor info, and the output tensor info. As the code below shows, we simply pass in the NodeProto and ValueInfoProto objects built earlier, in order.
- Finally, wrap the GraphProto into a ModelProto with helper.make_model, and the ONNX model is complete. make_model can also record metadata such as the producer and version.
After building the model we check its correctness, print it in text form, save it to a ".onnx" file, and load that file back. Using onnx.checker.check_model to verify that the model conforms to the ONNX standard is essential here; onnx.save and onnx.load are used to save and load the model.
import onnx
from onnx import helper
from onnx import TensorProto
# input and output
a = helper.make_tensor_value_info('a', TensorProto.FLOAT, [10, 10])
x = helper.make_tensor_value_info('x', TensorProto.FLOAT, [10, 10])
b = helper.make_tensor_value_info('b', TensorProto.FLOAT, [10, 10])
output = helper.make_tensor_value_info('output', TensorProto.FLOAT, [10, 10])
# Mul
mul = helper.make_node('Mul', ['a', 'x'], ['c'])
# Add
add = helper.make_node('Add', ['c', 'b'], ['output'])
# graph and model
graph = helper.make_graph([mul, add], 'linear_func', [a, x, b], [output])
model = helper.make_model(graph)
# save model
onnx.checker.check_model(model)
onnx.save(model, 'linear.onnx')
model_ = onnx.load('linear.onnx')
print(model_)
Here is the full printed content of the ONNX model:
ir_version: 8
graph {
  node {
    input: "a"
    input: "x"
    output: "c"
    op_type: "Mul"
  }
  node {
    input: "c"
    input: "b"
    output: "output"
    op_type: "Add"
  }
  name: "linear_func"
  input {
    name: "a"
    type {
      tensor_type {
        elem_type: 1
        shape {
          dim {
            dim_value: 10
          }
          dim {
            dim_value: 10
          }
        }
      }
    }
  }
  input {
    name: "x"
    type {
      tensor_type {
        elem_type: 1
        shape {
          dim {
            dim_value: 10
          }
          dim {
            dim_value: 10
          }
        }
      }
    }
  }
  input {
    name: "b"
    type {
      tensor_type {
        elem_type: 1
        shape {
          dim {
            dim_value: 10
          }
          dim {
            dim_value: 10
          }
        }
      }
    }
  }
  output {
    name: "output"
    type {
      tensor_type {
        elem_type: 1
        shape {
          dim {
            dim_value: 10
          }
          dim {
            dim_value: 10
          }
        }
      }
    }
  }
}
opset_import {
  version: 17
}
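The same information can also be accessed programmatically through the ModelProto / GraphProto / NodeProto fields listed earlier, for example on the linear.onnx file we just saved:
import onnx

m = onnx.load('linear.onnx')
print(m.ir_version, m.opset_import[0].version)        # model-level metadata
print([node.op_type for node in m.graph.node])        # ['Mul', 'Add']
print([i.name for i in m.graph.input])                # ['a', 'x', 'b']
print([o.name for o in m.graph.output])               # ['output']
print(m.graph.node[0].input, m.graph.node[0].output)  # ['a', 'x'] ['c']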
We can also view the ONNX model structure with Netron.
Next, we use ONNX Runtime to check that the result is correct:
import onnxruntime
import numpy as np
sess = onnxruntime.InferenceSession('linear.onnx', providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
a = np.random.rand(10, 10).astype(np.float32)
b = np.random.rand(10, 10).astype(np.float32)
x = np.random.rand(10, 10).astype(np.float32)
output = sess.run(['output'], {'a': a, 'b': b, 'x': x})[0]
assert np.allclose(output, a * x + b)
We can also modify a model after loading it, for example:
import onnx
model = onnx.load('linear.onnx')
node = model.graph.node
node[1].op_type = 'Sub'
onnx.checker.check_model(model)
onnx.save(model, 'linear_2.onnx')
This changes a * x + b into a * x - b.
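The modified graph can be verified with ONNX Runtime in the same way as before; a short check (regenerating random inputs rather than reusing the earlier ones):
import onnxruntime
import numpy as np

sess = onnxruntime.InferenceSession('linear_2.onnx', providers=['CPUExecutionProvider'])
a = np.random.rand(10, 10).astype(np.float32)
b = np.random.rand(10, 10).astype(np.float32)
x = np.random.rand(10, 10).astype(np.float32)
output = sess.run(['output'], {'a': a, 'b': b, 'x': x})[0]
assert np.allclose(output, a * x - b)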
Torch2ONNX
TorchScript is a format for serializing and optimizing PyTorch models; during this conversion a torch.nn.Module is turned into a TorchScript torch.jit.ScriptModule.
What torch.onnx.export actually needs is a torch.jit.ScriptModule. There are two ways to turn an ordinary PyTorch model into such a TorchScript model:
- Tracing: If torch.onnx.export() is called with a Module that is not already a ScriptModule, it first does the equivalent of torch.jit.trace(), which executes the model once with the given args and records all operations that happen during that execution. This means that if your model is dynamic, e.g., changes behavior depending on input data, the exported model will not capture this dynamic behavior. We recommend examining the exported model and making sure the operators look reasonable. Tracing will unroll loops and if statements, exporting a static graph that is exactly the same as the traced run. If you want to export your model with dynamic control flow, you will need to use scripting.
- Scripting: Compiling a model via scripting preserves dynamic control flow and is valid for inputs of different sizes. To use scripting:
- Use torch.jit.script() to produce a ScriptModule.
- Call torch.onnx.export() with the ScriptModule as the model. The args are still required, but they will be used internally only to produce example outputs, so that the types and shapes of the outputs can be captured. No tracing will be performed.
The following example illustrates the difference between the two.
In this code we define a model containing a loop; the parameter n controls how many times the input tensor is convolved. We then create two models, with n=2 and n=3, and export each of them with both tracing and scripting.
import torch

class Model(torch.nn.Module):
    def __init__(self, n):
        super().__init__()
        self.n = n
        self.conv = torch.nn.Conv2d(3, 3, 3)

    def forward(self, x):
        for i in range(self.n):
            x = self.conv(x)
        return x

models = [Model(2), Model(3)]
model_names = ['model_2', 'model_3']
for model, model_name in zip(models, model_names):
    dummy_input = torch.rand(1, 3, 10, 10)
    dummy_output = model(dummy_input)
    model_trace = torch.jit.trace(model, dummy_input)
    model_script = torch.jit.script(model)
    # Tracing: equivalent to calling torch.onnx.export(model, ...) directly
    torch.onnx.export(model_trace, dummy_input, f'{model_name}_trace.onnx')
    # Scripting: torch.jit.script must be called first
    torch.onnx.export(model_script, dummy_input, f'{model_name}_script.onnx')
For a model exported with tracing, the model has already been executed once during torch.jit.trace, so models with different n produce different exported graph structures (the loop is unrolled).
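One way to see this is to list the operator types in each exported file: the traced exports unroll the loop into repeated Conv nodes, while the scripted exports should preserve the loop as control flow. A small inspection sketch, to run after the export loop above:
import onnx

for name in ['model_2_trace', 'model_3_trace', 'model_2_script', 'model_3_script']:
    graph = onnx.load(f'{name}.onnx').graph
    # *_trace graphs contain 2 and 3 Conv nodes respectively; *_script graphs keep the loop (e.g. as a Loop node)
    print(name, [node.op_type for node in graph.node])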
2. Running BERT with ONNX Runtime
ONNX Runtime is a cross-platform machine learning inference accelerator maintained by Microsoft, i.e. an "inference engine". ONNX Runtime consumes ONNX directly: it can read and run .onnx files without converting them into any other format first. In other words, for the PyTorch -> ONNX -> ONNX Runtime deployment pipeline, once the .onnx file is on the target device and running under ONNX Runtime, the deployment is essentially done.
Below we use ONNX Runtime to run a BERT model.
2.1 Loading the data and model
import torch
import onnx
import onnxruntime
import transformers
import os
# Whether allow overwriting existing ONNX model and download the latest script from GitHub
enable_overwrite = True
# Total samples to inference, so that we can get average latency
total_samples = 1000
# ONNX opset version
opset_version=11
cache_dir = "./squad"
if not os.path.exists(cache_dir):
    os.makedirs(cache_dir)
predict_file_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json"
predict_file = os.path.join(cache_dir, "dev-v1.1.json")
if not os.path.exists(predict_file):
    import wget
    print("Start downloading predict file.")
    wget.download(predict_file_url, predict_file)
    print("Predict file downloaded.")
model_name_or_path = "bert-large-uncased-whole-word-masking-finetuned-squad"
max_seq_length = 128
doc_stride = 128
max_query_length = 64
from transformers import (BertConfig, BertForQuestionAnswering, BertTokenizer)
# Load pretrained model and tokenizer
config_class, model_class, tokenizer_class = (BertConfig, BertForQuestionAnswering, BertTokenizer)
config = config_class.from_pretrained(model_name_or_path, cache_dir=cache_dir)
tokenizer = tokenizer_class.from_pretrained(model_name_or_path, do_lower_case=True, cache_dir=cache_dir)
model = model_class.from_pretrained(model_name_or_path, from_tf=False, config=config, cache_dir=cache_dir)
# load some examples
from transformers.data.processors.squad import SquadV1Processor
processor = SquadV1Processor()
examples = processor.get_dev_examples(None, filename=predict_file)
from transformers import squad_convert_examples_to_features
features, dataset = squad_convert_examples_to_features(
    examples=examples[:total_samples],  # convert enough examples for this notebook
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    doc_stride=doc_stride,
    max_query_length=max_query_length,
    is_training=False,
    return_dataset='pt'
)
2.2 Exporting the ONNX model
output_dir = "./onnx"
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
export_model_path = os.path.join(output_dir, 'bert-base-cased-squad_opset{}.onnx'.format(opset_version))
import torch
use_gpu = torch.cuda.is_available()
device = torch.device("cuda" if use_gpu else "cpu")
# Get the first example data to run the model and export it to ONNX
data = dataset[0]
inputs = {
    'input_ids': data[0].to(device).reshape(1, max_seq_length),
    'attention_mask': data[1].to(device).reshape(1, max_seq_length),
    'token_type_ids': data[2].to(device).reshape(1, max_seq_length)
}
# Set model to inference mode, which is required before exporting the model because some operators behave differently in
# inference and training mode.
model.eval()
model.to(device)
if enable_overwrite or not os.path.exists(export_model_path):
    with torch.no_grad():
        symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}
        torch.onnx.export(model,                                       # model being run
                          args=tuple(inputs.values()),                 # model input (or a tuple for multiple inputs)
                          f=export_model_path,                         # where to save the model (can be a file or file-like object)
                          opset_version=opset_version,                 # the ONNX version to export the model to
                          do_constant_folding=True,                    # whether to execute constant folding for optimization
                          input_names=['input_ids',                    # the model's input names
                                       'input_mask',
                                       'segment_ids'],
                          output_names=['start', 'end'],               # the model's output names
                          dynamic_axes={'input_ids': symbolic_names,   # variable length axes
                                        'input_mask': symbolic_names,
                                        'segment_ids': symbolic_names,
                                        'start': symbolic_names,
                                        'end': symbolic_names})
        print("Model exported at ", export_model_path)
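Optionally, the exported file can be sanity-checked with the onnx checker before benchmarking; this extra step is not part of the original notebook and assumes the exported protobuf stays under the 2 GB limit:
import onnx

onnx_model = onnx.load(export_model_path)
onnx.checker.check_model(onnx_model)
print([i.name for i in onnx_model.graph.input])   # expected: ['input_ids', 'input_mask', 'segment_ids']
print([o.name for o in onnx_model.graph.output])  # expected: ['start', 'end']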
2.3 PyTorch inference
First, measure the baseline inference accuracy and latency with PyTorch.
import time
# Measure the latency. It is not accurate using Jupyter Notebook, it is recommended to use standalone python script.
latency = []
with torch.no_grad():
    for i in range(total_samples):
        data = dataset[i]
        inputs = {
            'input_ids': data[0].to(device).reshape(1, max_seq_length),
            'attention_mask': data[1].to(device).reshape(1, max_seq_length),
            'token_type_ids': data[2].to(device).reshape(1, max_seq_length)
        }
        start = time.time()
        outputs = model(**inputs)
        latency.append(time.time() - start)
print("PyTorch {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))
On a single A10 GPU, PyTorch reports: PyTorch cuda Inference time = 12.33 ms
2.4 Inference with ONNX Runtime
import psutil
import onnxruntime
import numpy
assert 'CUDAExecutionProvider' in onnxruntime.get_available_providers()
device_name = 'gpu'
sess_options = onnxruntime.SessionOptions()
# Optional: store the optimized graph and view it using Netron to verify that model is fully optimized.
# Note that this will increase session creation time so enable it for debugging only.
sess_options.optimized_model_filepath = os.path.join(output_dir, "optimized_model_{}.onnx".format(device_name))
# Please change the value according to best setting in Performance Test Tool result.
sess_options.intra_op_num_threads=psutil.cpu_count(logical=True)
session = onnxruntime.InferenceSession(export_model_path, sess_options, providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider'])
# onnxruntime.InferenceSession(onnx_path, providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider'])
latency = []
for i in range(total_samples):
    data = dataset[i]
    ort_inputs = {
        'input_ids': data[0].cpu().reshape(1, max_seq_length).numpy(),
        'input_mask': data[1].cpu().reshape(1, max_seq_length).numpy(),
        'segment_ids': data[2].cpu().reshape(1, max_seq_length).numpy()
    }
    start = time.time()
    ort_outputs = session.run(None, ort_inputs)
    latency.append(time.time() - start)
print("OnnxRuntime {} Inference time = {} ms".format(device_name, format(sum(latency) * 1000 / len(latency), '.2f')))
On the same A10 GPU, ONNX Runtime reports: OnnxRuntime gpu Inference time = 7.05 ms
Throughput improved by roughly 75% (12.33 ms / 7.05 ms ≈ 1.75x), i.e. average latency dropped by about 43%.
At the same time, we still need to verify the accuracy:
print("***** Verifying correctness *****")
for i in range(2):
    print('PyTorch and ONNX Runtime output {} are close:'.format(i), numpy.allclose(ort_outputs[i], outputs[i].cpu(), rtol=1e-02, atol=1e-02))
    diff = ort_outputs[i] - outputs[i].cpu().numpy()
    max_diff = numpy.max(numpy.abs(diff))
    avg_diff = numpy.average(numpy.abs(diff))
    print(f'maximum_diff={max_diff} average_diff={avg_diff}')
The output is as follows:
***** Verifying correctness *****
PyTorch and ONNX Runtime output 0 are close: True
maximum_diff=0.002591252326965332 average_diff=0.0004398506134748459
PyTorch and ONNX Runtime output 1 are close: True
maximum_diff=0.0033492445945739746 average_diff=0.00040213397005572915
The accuracy differences are within tolerance. This completes an end-to-end example of running a real model with ONNX Runtime.