OpenVINO Series 9: Post-training Optimization Tool (POT) Simplified Mode Example
This chapter walks through an example of low precision optimization using the Simplified Mode of Intel's OpenVINO Post-training Optimization Tool.
Environment:
- Runtime environment for this example: Windows 10 on a laptop with a 10th-generation Intel Core i5
- IDE: VSCode
- OpenVINO version: 2022.1
- Code link:
5-pot-int8-simplifiedmode
Before going through the code, two concepts need explaining: the OpenVINO Post-Training Optimization Tool, and OpenVINO Low Precision Optimization.
1 Post-Training Optimization Tool (POT)
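The Post-training Optimization Tool (POT) is designed to accelerate deep learning inference by applying special methods that require no retraining or fine-tuning (this article treats its internals as a black box). As a result, the tool needs neither a training dataset nor a training pipeline. To apply POT, we need:
- A floating-point model, FP32 or FP16, that can be converted to the OpenVINO IR format and run on a CPU;
- A representative calibration dataset for the target use case, for example 300 images.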
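The tool aims to fully automate the model conversion process, with no changes to the model required on the user's side. Note that POT is Intel's model optimization tool targeting CPUs. The two figures below come from the benchmarking page.
The first figure shows Intel's comparison across four CPUs: after applying POT, inference speed generally improves.
The second figure shows Intel's comparison across the same four CPUs for accuracy: after applying POT, accuracy drops slightly relative to the original full-precision model.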
See this link for the official description of POT.
2 Low Precision Optimization with Simplified Mode
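Low precision means that the deep learning model runs inference at a precision below 32 or 16 bits (FLOAT32 and FLOAT16), for example INT8 (UINT8). The advantage of a low-precision model is much faster inference, at the cost of a possible drop in accuracy. Such models are represented as quantized models: models trained in floating-point precision and then converted to an integer representation through floating-point/fixed-point quantization operations between layers.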
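To make the floating-point to integer conversion concrete, here is a minimal sketch of generic affine (scale and zero-point) INT8 quantization in plain Python. It is an illustration of the general idea only, not POT's actual algorithm; the function names `choose_qparams`, `quantize`, and `dequantize` are hypothetical.

```python
def choose_qparams(x_min, x_max, q_min=-128, q_max=127):
    """Pick a scale and zero-point so that [x_min, x_max] maps onto [q_min, q_max]."""
    # Make sure 0.0 is inside the range so it can be represented exactly.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (q_max - q_min) or 1.0  # avoid a zero scale
    zero_point = round(q_min - x_min / scale)
    zero_point = max(q_min, min(q_max, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, q_min=-128, q_max=127):
    """Map floats to integers in [q_min, q_max] (round, shift, clamp)."""
    return [max(q_min, min(q_max, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate float values."""
    return [(q - zero_point) * scale for q in q_values]


# Illustrative values, not taken from a real model.
weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
scale, zero_point = choose_qparams(min(weights), max(weights))
q_weights = quantize(weights, scale, zero_point)
restored = dequantize(q_weights, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q_weights, max_error)
```

The round trip loses a little information per value (on the order of one quantization step), which is where the possible accuracy drop comes from.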
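For low-precision model optimization, Intel proposes a common workflow that is consistent with other DL frameworks. It has two main components: post-training quantization and Quantization-Aware Training (QAT). The first is the easiest way to obtain an optimized model; when it does not give sufficiently accurate results, the second can be used as an alternative or a complement.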
The figure below shows the optimization flow for a new model using OpenVINO and related tools.
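- Step 0: Model enabling. In this step, we make sure that the model trained on the target dataset can run in floating-point precision with the OpenVINO Inference Engine. This involves using the Model Optimizer to convert the model from its source framework (for example, a model trained in TensorFlow or PyTorch) to the OpenVINO IR (Intermediate Representation), and running it on the CPU with the Inference Engine.
- Step 1: POT. Intel's current official recommendation is to use POT's INT8 quantization, which yields an accurate quantized model in most cases. This step requires no retraining. All it needs is a representative dataset, typically a few hundred images, used to collect statistics during quantization. Post-training quantization is also fast, usually taking a few minutes depending on model size and hardware. In general, an ordinary desktop system is enough to quantize most of the OpenVINO Model Zoo.
- Step 2: Quantization-Aware Training. This step is optional; if step 1 gives satisfactory results, it is usually unnecessary.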
See this link for the official description of low precision optimization.
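Step 1 relies on statistics collected from a representative dataset. As a rough sketch of what "collecting statistics" can mean, the snippet below tracks the running minimum and maximum of activation values over a few calibration batches and derives a quantization step from them. This is a generic min/max observer written for illustration; the class name `MinMaxObserver` and the sample values are assumptions, and POT's real statistics collection is not shown in this article.

```python
class MinMaxObserver:
    """Tracks the running min/max of the values seen during calibration."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")
        self.num_batches = 0

    def update(self, activations):
        """Fold one batch of activation values into the running range."""
        self.min_val = min(self.min_val, min(activations))
        self.max_val = max(self.max_val, max(activations))
        self.num_batches += 1

    def scale(self, levels=256):
        """Quantization step size for `levels` integer levels (256 for 8 bits)."""
        return (self.max_val - self.min_val) / (levels - 1)


# Hypothetical activation values standing in for real calibration data.
calibration_batches = [
    [0.1, 0.9, -0.2],
    [1.5, 0.0, -0.7],
    [0.3, 0.4, 2.0],
]

observer = MinMaxObserver()
for batch in calibration_batches:
    observer.update(batch)

print(observer.min_val, observer.max_val, observer.scale())
```

The more representative the calibration data, the closer the observed range is to what the model sees in production, which is why a few hundred real images are used rather than random inputs.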
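Comparing the POT API and Simplified Mode: as the previous section suggests, POT is essentially a tool for accelerating inference on the CPU. There are two ways to use it. The first is Simplified Mode: with no extra configuration, a full-precision model goes in and an INT8 model comes out in a few lines of code (this is what the present example uses). The second is the POT API, which allows a custom optimization pipeline; it is covered in detail in 6_pot_objectdetection.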
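Finally, this example consists of the following steps:
- Download and save the CIFAR10 dataset
- Prepare the IR model
- Compress the prepared model
- Measure and compare the performance of the original and quantized models
- Demonstrate image classification with the quantized model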
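3 Downloading and Saving the CIFAR10 Dataset
First, we need to prepare the calibration dataset. Here we download CIFAR10 from torchvision.datasets and save a selected number of its elements as .png images in a separate folder. The code: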
import os
from pathlib import Path
import warnings
import torch
from torchvision import transforms as T
from torchvision.datasets import CIFAR10
import matplotlib.pyplot as plt
import numpy as np
from openvino.runtime import Core, Tensor
warnings.filterwarnings("ignore")
# Set the data and model directories
MODEL_DIR = 'model'
CALIB_DIR = 'calib'
CIFAR_DIR = 'cifar'
CALIB_SET_SIZE = 300
MODEL_NAME = 'resnet20'
os.makedirs(MODEL_DIR, exist_ok=True)
os.makedirs(CALIB_DIR, exist_ok=True)
transform = T.Compose([T.ToTensor()])
dataset = CIFAR10(root=CIFAR_DIR, train=False, transform=transform, download=True)
pil_converter = T.ToPILImage(mode="RGB")
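for idx, info in enumerate(dataset):
    im = info[0]
    # Stop once CALIB_SET_SIZE images have been saved
    if idx >= CALIB_SET_SIZE:
        break
    label = info[1]
    # Save the image as <label>_<idx>.png in the calibration folder
    pil_converter(im.squeeze(0)).save(Path(CALIB_DIR) / f'{label}_{idx}.png')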
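Before moving on, it can be worth checking that the calibration folder holds what we expect. The helper below is a small sketch that counts the saved images per class label, relying on the `<label>_<idx>.png` naming used above; the function name `summarize_calibration_dir` is hypothetical.

```python
from collections import Counter
from pathlib import Path


def summarize_calibration_dir(calib_dir):
    """Count calibration images per class label.

    Assumes file names of the form '<label>_<index>.png'.
    """
    counts = Counter()
    for png in Path(calib_dir).glob("*.png"):
        label = png.stem.split("_")[0]
        counts[label] += 1
    return counts


# 'calib' is the CALIB_DIR used in this example.
if Path("calib").is_dir():
    counts = summarize_calibration_dir("calib")
    print(f"{sum(counts.values())} images:", dict(sorted(counts.items())))
```

With CALIB_SET_SIZE = 300, the total should be 300, spread across the ten CIFAR10 classes.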
4 Preparing the IR Model
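Model preparation consists of the following steps:
- Download a pretrained PyTorch model (the code below loads it via torch.hub),
- Convert the model to ONNX format,
- Run the OpenVINO Model Optimizer to convert the ONNX model to OpenVINO IR (Intermediate Representation).
Note that the IR model must be ready before quantization. If the current model is in PyTorch, TensorFlow, or another format, it has to be converted to IR first. In this example, we download the PyTorch model, convert it to ONNX, and then convert that to IR.
The code: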
model = torch.hub.load("chenyaofo/pytorch-cifar-models", "cifar10_resnet20", pretrained=True)
dummy_input = torch.randn(1, 3, 32, 32)
onnx_model_path = Path(MODEL_DIR) / '{}.onnx'.format(MODEL_NAME)
torch.onnx.export(model, dummy_input, onnx_model_path)
# Convert this model into the OpenVINO IR using the Model Optimizer:
!mo --framework=onnx --data_type=FP32 --input_shape=[1,3,32,32] -m $onnx_model_path --output_dir $MODEL_DIR
5 Compressing the Prepared Model
ir_model_xml = onnx_model_path.with_suffix('.xml')
ir_model_bin = onnx_model_path.with_suffix('.bin')
!pot -q default -m $ir_model_xml -w $ir_model_bin --engine simplified --data-source $CALIB_DIR --output-dir compressed --direct-dump --name $MODEL_NAME
In Simplified Mode, low precision optimization comes down to a single command.
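The `!pot ...` line above uses Jupyter's shell syntax, so it only works inside a notebook. As a sketch of how the same call could be issued from a plain Python script, the helper below assembles the identical argument list for `subprocess.run`. It uses only the flags that appear in the command above; the function name `build_pot_command` and the example paths are assumptions.

```python
import subprocess
from pathlib import Path


def build_pot_command(ir_xml, ir_bin, calib_dir, output_dir="compressed", name="resnet20"):
    """Assemble the POT Simplified Mode command as an argument list."""
    return [
        "pot", "-q", "default",
        "-m", str(ir_xml),
        "-w", str(ir_bin),
        "--engine", "simplified",
        "--data-source", str(calib_dir),
        "--output-dir", str(output_dir),
        "--direct-dump",
        "--name", name,
    ]


# Example paths matching MODEL_DIR, MODEL_NAME, and CALIB_DIR in this article.
cmd = build_pot_command(Path("model/resnet20.xml"), Path("model/resnet20.bin"), "calib")
print(" ".join(cmd))

# To actually run it (requires the `pot` CLI to be installed and on PATH):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.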
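6 Model Comparison
Finally, we measure the inference performance of the FP32 and INT8 models using Benchmark Tool, OpenVINO's inference performance measurement tool.
Note: for more accurate results, we recommend running benchmark_app in a terminal/command prompt after closing other applications.
For the FP32 full-precision model: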
!benchmark_app -m $ir_model_xml -d CPU -api async
The results are as follows:
[Step 1/11] Parsing and validating input arguments
[ WARNING ] -nstreams default value is determined automatically for a device. Although the automatic selection usually provides a reasonable performance, but it still may be non-optimal for some cases, for more information look at README.
[Step 2/11] Loading OpenVINO
[ WARNING ] PerformanceMode was not explicitly specified in command line. Device CPU performance hint will be set to THROUGHPUT.
[ INFO ] OpenVINO:
API version............. 2022.1.0-7019-cdb9bec7210-releases/2022/1
[ INFO ] Device info
CPU
openvino_intel_cpu_plugin version 2022.1
Build................... 2022.1.0-7019-cdb9bec7210-releases/2022/1
[Step 3/11] Setting device configuration
[ WARNING ] -nstreams default value is determined automatically for CPU device. Although the automatic selection usually provides a reasonable performance, but it still may be non-optimal for some cases, for more information look at README.
[Step 4/11] Reading network files
[ INFO ] Read model took 17.00 ms
[Step 5/11] Resizing network to match image sizes and given batch
[ INFO ] Network batch size: 1
[Step 6/11] Configuring input of the model
[ INFO ] Model input 'input.1' precision u8, dimensions ([N,C,H,W]): 1 3 32 32
[ INFO ] Model output '208' precision f32, dimensions ([...]): 1 10
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 60.01 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] DEVICE: CPU
[ INFO ] AVAILABLE_DEVICES , ['']
[ INFO ] RANGE_FOR_ASYNC_INFER_REQUESTS , (1, 1, 1)
[ INFO ] RANGE_FOR_STREAMS , (1, 8)
[ INFO ] FULL_DEVICE_NAME , Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
[ INFO ] OPTIMIZATION_CAPABILITIES , ['FP32', 'FP16', 'INT8', 'BIN', 'EXPORT_IMPORT']
[ INFO ] CACHE_DIR ,
[ INFO ] NUM_STREAMS , 4
[ INFO ] INFERENCE_NUM_THREADS , 0
[ INFO ] PERF_COUNT , False
[ INFO ] PERFORMANCE_HINT_NUM_REQUESTS , 0
[Step 9/11] Creating infer requests and preparing input data
[ INFO ] Create 4 infer requests took 1.00 ms
[ WARNING ] No input files were given for input 'input.1'!. This input will be filled with random values!
[ INFO ] Fill input 'input.1' with random values
[Step 10/11] Measuring performance (Start inference asynchronously, 4 inference requests using 4 streams for CPU, inference only: True, limits: 60000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 5.95 ms
[Step 11/11] Dumping statistics report
Count: 104180 iterations
Duration: 60004.15 ms
Latency:
Median: 2.07 ms
AVG: 2.25 ms
MIN: 1.13 ms
MAX: 43.36 ms
Throughput: 1736.21 FPS
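For the INT8 quantized model:
# Assumption: with --output-dir compressed and --direct-dump, POT writes the optimized IR to compressed/optimized/
optimized_model_xml = Path('compressed') / 'optimized' / f'{MODEL_NAME}.xml'
!benchmark_app -m $optimized_model_xml -d CPU -api async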
The results are as follows:
[Step 1/11] Parsing and validating input arguments
[ WARNING ] -nstreams default value is determined automatically for a device. Although the automatic selection usually provides a reasonable performance, but it still may be non-optimal for some cases, for more information look at README.
[Step 2/11] Loading OpenVINO
[ WARNING ] PerformanceMode was not explicitly specified in command line. Device CPU performance hint will be set to THROUGHPUT.
[ INFO ] OpenVINO:
API version............. 2022.1.0-7019-cdb9bec7210-releases/2022/1
[ INFO ] Device info
CPU
openvino_intel_cpu_plugin version 2022.1
Build................... 2022.1.0-7019-cdb9bec7210-releases/2022/1
[Step 3/11] Setting device configuration
[ WARNING ] -nstreams default value is determined automatically for CPU device. Although the automatic selection usually provides a reasonable performance, but it still may be non-optimal for some cases, for more information look at README.
[Step 4/11] Reading network files
[ INFO ] Read model took 42.00 ms
[Step 5/11] Resizing network to match image sizes and given batch
[ INFO ] Network batch size: 1
[Step 6/11] Configuring input of the model
[ INFO ] Model input 'input.1' precision u8, dimensions ([N,C,H,W]): 1 3 32 32
[ INFO ] Model output '208' precision f32, dimensions ([...]): 1 10
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 176.01 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] DEVICE: CPU
[ INFO ] AVAILABLE_DEVICES , ['']
[ INFO ] RANGE_FOR_ASYNC_INFER_REQUESTS , (1, 1, 1)
[ INFO ] RANGE_FOR_STREAMS , (1, 8)
[ INFO ] FULL_DEVICE_NAME , Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
[ INFO ] OPTIMIZATION_CAPABILITIES , ['FP32', 'FP16', 'INT8', 'BIN', 'EXPORT_IMPORT']
[ INFO ] CACHE_DIR ,
[ INFO ] NUM_STREAMS , 4
[ INFO ] INFERENCE_NUM_THREADS , 0
[ INFO ] PERF_COUNT , False
[ INFO ] PERFORMANCE_HINT_NUM_REQUESTS , 0
[Step 9/11] Creating infer requests and preparing input data
[ INFO ] Create 4 infer requests took 0.00 ms
[ WARNING ] No input files were given for input 'input.1'!. This input will be filled with random values!
[ INFO ] Fill input 'input.1' with random values
[Step 10/11] Measuring performance (Start inference asynchronously, 4 inference requests using 4 streams for CPU, inference only: True, limits: 60000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 14.62 ms
[Step 11/11] Dumping statistics report
Count: 130440 iterations
Duration: 60002.63 ms
Latency:
Median: 1.60 ms
AVG: 1.78 ms
MIN: 0.84 ms
MAX: 46.28 ms
Throughput: 2173.90 FPS
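To compare the two runs without reading the logs by eye, the sketch below pulls the throughput and median latency out of a benchmark_app report with regular expressions and computes the relative change. The sample strings are excerpts of the two reports above; the function name `parse_benchmark_report` is hypothetical, and the patterns assume the report layout shown in this article.

```python
import re


def parse_benchmark_report(report):
    """Extract throughput (FPS) and median latency (ms) from benchmark_app output."""
    throughput = re.search(r"Throughput:\s*([\d.]+)\s*FPS", report)
    median = re.search(r"Median:\s*([\d.]+)\s*ms", report)
    return {
        "throughput_fps": float(throughput.group(1)),
        "median_latency_ms": float(median.group(1)),
    }


# Excerpts of the FP32 and INT8 reports shown above.
fp32_report = """Latency:
Median: 2.07 ms
AVG: 2.25 ms
Throughput: 1736.21 FPS"""

int8_report = """Latency:
Median: 1.60 ms
AVG: 1.78 ms
Throughput: 2173.90 FPS"""

fp32 = parse_benchmark_report(fp32_report)
int8 = parse_benchmark_report(int8_report)

speedup = int8["throughput_fps"] / fp32["throughput_fps"]
latency_drop = 1 - int8["median_latency_ms"] / fp32["median_latency_ms"]
print(f"Throughput speedup: {speedup:.2f}x, median latency reduction: {latency_drop:.1%}")
```

On the numbers reported here, the INT8 model reaches about 1.25x the FP32 throughput, with median latency about 22.7% lower. Both runs used random input data, so these figures describe speed only, not accuracy.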