8. Graph Optimizations in ONNX Runtime

Source document

Graph Optimizations in ONNX Runtime

ONNX Runtime provides various graph optimizations to improve model performance. Graph optimizations are essentially graph-level transformations, ranging from small graph simplifications and node eliminations to more complex node fusions and layout optimizations.

Graph optimizations are divided into several categories (or levels) based on their complexity and functionality. They can be performed either online or offline. In online mode, the optimizations are applied before inference is performed, while in offline mode, the runtime saves the optimized graph to disk. ONNX Runtime provides Python, C#, C++, and C APIs to enable the different optimization levels and to choose between offline and online mode.

Below we provide details on the optimization levels, the online/offline modes, and the various APIs to control them.

Graph Optimization Levels

Graph optimizations are divided into three levels:

  • Basic
  • Extended
  • Layout Optimizations

The optimizations belonging to one level are performed after the optimizations of the previous level have been applied (e.g., extended optimizations are applied only after basic optimizations have been applied).

All optimizations are enabled by default.

Basic Graph Optimizations

These are semantics-preserving graph rewrites which remove redundant nodes and redundant computation. They run before graph partitioning and thus apply to all execution providers. The available basic graph optimizations are as follows:

  • Constant Folding: Statically computes the parts of the graph that rely only on constant initializers. This eliminates the need to compute them at runtime.
  • Redundant node eliminations: Remove all redundant nodes without changing the graph structure (a runnable sketch follows this list). The following such optimizations are currently supported:
    • Identity Elimination
    • Slice Elimination
    • Unsqueeze Elimination
    • Dropout Elimination
  • Semantics-preserving node fusions: Fuse/fold multiple nodes into a single node. For example, Conv Add fusion folds the Add operator into the bias of the Conv operator. The following such optimizations are currently supported:
    • Conv Add Fusion
    • Conv Mul Fusion
    • Conv BatchNorm Fusion
    • Relu Clip Fusion
    • Reshape Fusion
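
To make the Basic level concrete, here is a minimal sketch (not from the original text; the toy model and file names are illustrative): it builds a tiny graph containing an Identity node, applies only the basic optimizations, and serializes the result so the eliminated node can be observed.

import numpy as np
import onnx
import onnxruntime as rt
from onnx import TensorProto, helper

# Tiny graph: X -> Identity -> Add(C) -> Y. Identity Elimination (a basic
# optimization) should remove the Identity node; the Add itself cannot be
# constant-folded because it depends on the runtime input X.
X = helper.make_tensor_value_info("X", TensorProto.FLOAT, [1, 4])
Y = helper.make_tensor_value_info("Y", TensorProto.FLOAT, [1, 4])
C = helper.make_tensor("C", TensorProto.FLOAT, [1, 4],
                       np.ones(4, dtype=np.float32).tolist())
graph = helper.make_graph(
    [
        helper.make_node("Identity", ["X"], ["X_id"]),
        helper.make_node("Add", ["X_id", "C"], ["Y"]),
    ],
    "toy", [X], [Y], initializer=[C],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])
onnx.save(model, "toy.onnx")

# Apply only basic optimizations and serialize the optimized graph to disk.
so = rt.SessionOptions()
so.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_BASIC
so.optimized_model_filepath = "toy_basic.onnx"
rt.InferenceSession("toy.onnx", so, providers=["CPUExecutionProvider"])

print([n.op_type for n in onnx.load("toy.onnx").graph.node])        # ['Identity', 'Add']
print([n.op_type for n in onnx.load("toy_basic.onnx").graph.node])  # expect ['Add']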

Extended Graph Optimizations

These optimizations include complex node fusions. They run after graph partitioning and are only applied to the nodes assigned to the CPU or CUDA execution provider. The available extended graph optimizations are as follows:

  Optimization                      Execution Provider   Comment
  GEMM Activation Fusion            CPU
  Matmul Add Fusion                 CPU
  Conv Activation Fusion            CPU
  GELU Fusion                       CPU, CUDA
  Layer Normalization Fusion        CPU, CUDA
  BERT Embedding Layer Fusion       CPU, CUDA            Fuses the BERT embedding layer, layer normalization, and attention mask length
  Attention Fusion                  CPU, CUDA            The attention mask uses an approximation in the CUDA execution provider
  Skip Layer Normalization Fusion   CPU, CUDA            Fuses the bias of the fully connected layer, the skip connection, and layer normalization
  Bias GELU Fusion                  CPU, CUDA            Fuses the bias of the fully connected layer and the GELU activation
  GELU Approximation                CUDA                 Erf is approximated by a formula using the tanh function

To optimize the inference performance of BERT models, approximations are used in GELU Approximation and Attention Fusion for the CUDA execution provider, so results may differ slightly. Based on our evaluation, the impact on accuracy is negligible: the F1 score of a BERT model on SQuAD v1.1 is almost the same (87.05 vs. 87.03).
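
To see which extended fusions actually fired on a given model, one option is to serialize the extended-optimized graph and scan it for contrib operators. This sketch assumes the fused nodes appear in the com.microsoft domain (e.g., Gelu, Attention, SkipLayerNormalization); model.onnx is a placeholder path.

import onnx
import onnxruntime as rt

so = rt.SessionOptions()
so.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
so.optimized_model_filepath = "model_extended.onnx"  # serialized after fusion
rt.InferenceSession("model.onnx", so, providers=["CPUExecutionProvider"])

# Extended fusions are emitted as contrib operators in the com.microsoft
# domain; an empty list means no extended fusion matched this model.
fused = [n.op_type for n in onnx.load("model_extended.onnx").graph.node
         if n.domain == "com.microsoft"]
print(sorted(set(fused)))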

Layout Optimizations

These optimizations change the data layout of applicable nodes to achieve higher performance. They run after graph partitioning and are only applied to nodes assigned to the CPU execution provider. The available layout optimizations are as follows:

  • NCHWc Optimizer: Optimizes the graph by using the NCHWc layout instead of the NCHW layout.
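
A minimal sketch of enabling layout optimizations (file names are placeholders): they only run at the ORT_ENABLE_ALL level, and, per the offline-mode caveats in the next section, a serialized layout-optimized model is tied to the hardware that produced it.

import onnxruntime as rt

so = rt.SessionOptions()
# Layout optimizations (e.g. the NCHWc optimizer) are only applied at
# ORT_ENABLE_ALL; lower levels leave the NCHW layout untouched.
so.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
# If serialized, this model is specific to the CPU that produced it
# (e.g. an AVX2-optimized layout requires an AVX2-capable CPU).
so.optimized_model_filepath = "model.nchwc.onnx"
rt.InferenceSession("model.onnx", so, providers=["CPUExecutionProvider"])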

Online/Offline Mode

All optimizations can be performed either online or offline. In online mode, when an inference session is initialized, all enabled graph optimizations are applied before model inference is performed. Applying all optimizations every time a session starts adds overhead to model startup (especially for complex models), which can be critical in production scenarios. This is where offline mode helps: after performing the graph optimizations, ONNX Runtime serializes the resulting model to disk. Subsequently, when a new inference session is created for this model, the already-optimized model can be used instead to reduce startup time (a sketch of this workflow follows the notes below).

Notes:

  • When running in offline mode, make sure to use exactly the same options (e.g., execution providers, optimization level) and hardware as the target machine the model inference will run on (e.g., you cannot run a model pre-optimized for a GPU execution provider on a machine equipped only with a CPU).
  • When layout optimizations are enabled, the offline model can only be used on hardware compatible with the environment in which it was saved. For example, if the model's layout is optimized for AVX2, the offline model requires CPUs that support AVX2.
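
A sketch of the resulting two-step workflow (paths are placeholders; disabling optimizations when reloading the pre-optimized model is an assumption made here to avoid re-running them, and is only valid when the saved model was optimized with the same configuration as the deployment machine):

import onnxruntime as rt

# One-time offline step: optimize and serialize the model to disk.
so = rt.SessionOptions()
so.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL
so.optimized_model_filepath = "model.opt.onnx"
rt.InferenceSession("model.onnx", so, providers=["CPUExecutionProvider"])

# Later startups: load the pre-optimized model and skip re-optimization,
# trading a larger serialized file for faster session initialization.
so2 = rt.SessionOptions()
so2.graph_optimization_level = rt.GraphOptimizationLevel.ORT_DISABLE_ALL
session = rt.InferenceSession("model.opt.onnx", so2,
                              providers=["CPUExecutionProvider"])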

Usage

General Note

Levels:
ONNX Runtime defines the GraphOptimizationLevel enum to determine which of the aforementioned optimization levels will be enabled. Choosing a level enables the optimizations of that level, as well as the optimizations of all preceding levels. For example, enabling extended optimizations also enables basic optimizations. The mapping of these levels to the enum is as follows:

  • GraphOptimizationLevel::ORT_DISABLE_ALL -> Disables all optimizations
  • GraphOptimizationLevel::ORT_ENABLE_BASIC -> Enables basic optimizations
  • GraphOptimizationLevel::ORT_ENABLE_EXTENDED -> Enables basic and extended optimizations
  • GraphOptimizationLevel::ORT_ENABLE_ALL -> Enables all available optimizations, including layout optimizations

Online/Offline Mode:
To enable serialization of the optimized model to disk, set the SessionOptions option for the optimized model file path (optimized_model_filepath in the Python API) to the desired path where the optimized model will be stored.

Python API Usage

import onnxruntime as rt

sess_options = rt.SessionOptions()

# Set graph optimization level
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_EXTENDED

# To enable model serialization after graph optimization set this
sess_options.optimized_model_filepath = "<model_output_path>/optimized_model.onnx"

session = rt.InferenceSession("<model_path>", sess_options)

C API Example:

  // Each OrtApi call returns an OrtStatus* which is ignored here for brevity;
  // production code should check and release these statuses.
  const OrtApi* g_ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);
  OrtEnv* env;
  g_ort->CreateEnv(ORT_LOGGING_LEVEL_WARNING, "test", &env);
  OrtSessionOptions* session_options;
  g_ort->CreateSessionOptions(&session_options);

  // Set graph optimization level
  g_ort->SetSessionGraphOptimizationLevel(session_options, ORT_ENABLE_EXTENDED);

  // To enable model serialization after graph optimization set this
  // (wide-character paths apply to Windows builds; other platforms use char*)
  const wchar_t* optimized_model_path = L"optimized_model_path";
  g_ort->SetOptimizedModelFilePath(session_options, optimized_model_path);

  OrtSession* session;
  const wchar_t* model_path = L"model_path";
  g_ort->CreateSession(env, model_path, session_options, &session);

C# API Example:

SessionOptions so = new SessionOptions();

// Set graph optimization level
so.GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_EXTENDED;

// To enable model serialization after graph optimization set this
// (a verbatim string avoids treating the backslash as an escape sequence)
so.OptimizedModelFilePath = @"model_output_path\optimized_model.onnx";

var session = new InferenceSession(modelPath, so);

C++ API Example:

// The Ort::Env must be created first and must outlive the session.
Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "test");
Ort::SessionOptions session_options;

// Set graph optimization level
session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);

// To enable model serialization after graph optimization set this
session_options.SetOptimizedModelFilePath("optimized_file_path");

auto session_ = Ort::Session(env, "model_file_path", session_options);