OpenVINO Series 8: Using the Post-training Optimization Tool (POT) with an Object Detection Model

This article walks through a case study of quantizing/optimizing an object detection model with Intel's OpenVINO Post-training Optimization Tool (POT). POT can be thought of as a tool for accelerating inference on the CPU, and the POT API helps implement custom optimization pipelines for single or cascaded/composite deep learning models. The workflow for producing a custom optimized/quantized model is roughly as follows:

  • Import the full-precision IR model. If the model is in PyTorch or TensorFlow format, it first needs to be converted to the IR format;
  • Customize the DataLoader module (the DetectionDataLoader class): this module is responsible for loading the dataset, including data preprocessing;
  • Customize the Metric module (the MAPMetric class): this module computes the model's accuracy metric;
  • Customize the Engine module: this module runs model inference and provides statistics and accuracy metrics for the model;
  • Assemble and run the pipeline (Pipeline).

Environment:

  • Runtime environment for this example: Windows 10, 10th-gen Intel Core i5 laptop
  • IDE: VSCode
  • OpenVINO version: 2022.1
  • Code link: 6_pot_objectdetection


1 Background

1.1 Post-Training Optimization Tool (POT)

The Post-training Optimization Tool (POT) is designed to accelerate the inference of deep learning models by applying special methods that do not require retraining or fine-tuning the model (for our purposes we can simply treat it as a black box). As a result, the tool does not need a training dataset or a training pipeline. To apply POT, we need:

  • A floating-point precision model, e.g. FP32 or FP16, converted to OpenVINO's IR format so that it can run on the CPU;
  • A representative calibration dataset that reflects the use-case scenario, for example 300 images.

The tool is designed to fully automate the model conversion process without requiring any model changes on the user's side. Note that POT is Intel's model optimization tool targeting CPU inference. The figures below are taken from the benchmarking page.

The first figure compares four Intel CPUs: after applying POT, inference speed improves across the board:

[Figure: inference throughput of the original FP32 models vs. the POT-optimized models on four CPUs]

The second figure shows that on the same four CPUs, the POT-optimized models have only a slight accuracy drop compared with the original full-precision models:

[Figure: accuracy of the original FP32 models vs. the POT-optimized models on four CPUs]

The official description of POT is available at this link.

1.2 Post-training Optimization Tool API (POT API)

From the previous section we can see that POT is essentially a tool for accelerating inference on the CPU. So how do we use it? There are two ways. The first is Simplified Mode: with no configuration at all, a full-precision model goes in and an INT8 model comes out, done with a single command. We covered an example of this mode in 5-pot-int8-simplifiedmode. This mode is simple and direct, but the whole process in between remains a black box. The second way is the POT API, which is what this article introduces.

The POT API helps implement custom optimization pipelines for single or cascaded/composite deep learning models. The official description is available at this link.

[Figure: POT API architecture; the User API on the left consists of the Engine, Metric, and DataLoader modules]

From the figure above (focusing on the User API on the left), we can see that the POT API consists of three modules: Engine, Metric, and DataLoader.

  • Engine runs model inference and provides statistics and accuracy metrics for the model.
  • DataLoader loads the dataset, including data preprocessing.
  • Metric computes the model's accuracy metric.

The logic behind this is fairly clear. First, we use the DataLoader module to read and parse the calibration dataset (Dataset & Annotation); then we define a metric to evaluate the performance of the optimized model (Metric); then we configure the Engine; finally, we put these modules into the optimization pipeline, run it, and obtain the optimized model, as sketched below.
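As a preview, the minimal skeleton below (a sketch only, based on the compression.api interfaces that the later sections implement in full) shows which methods each module must provide and how they are eventually wired together:

from compression.api import DataLoader, Metric

class MyDataLoader(DataLoader):
    # Loads the calibration dataset and preprocesses each sample.
    def __len__(self): ...                 # number of calibration samples
    def __getitem__(self, index): ...      # (annotation, image) or (annotation, image, metadata)

class MyMetric(Metric):
    # Accuracy metric used to compare the FP32 and INT8 models.
    # The full interface also includes value, avg_value, reset and get_attributes.
    def update(self, output, target): ...  # accumulate one batch of predictions vs. annotations

# The modules are then wired into the optimization pipeline (see sections 5 and 6):
# engine = IEEngine(config=engine_config, data_loader=MyDataLoader(...), metric=MyMetric())
# pipeline = create_pipeline(algorithms, engine)
# compressed_model = pipeline.run(load_model(model_config))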

1.3 About the Imported Model

The goal of this example is to detect people, so the model we import is person-detection-retail-0013.

Intel provides an Open Model Zoo containing a large number of pretrained models, each with a detailed description, which is very helpful. About this model: it is a pedestrian detector for retail scenarios, based on a MobileNetV2-like backbone that uses depthwise convolutions to reduce the amount of computation in the 3x3 convolution blocks. A single SSD head from the 1/16-scale feature map has 12 prior boxes.

  • Model input: [1,3,320,544] in [B,C,H,W] format, i.e. [batch size, number of channels, image height, image width];
  • Model output: [1,1,N,7] with N=200, where N is the maximum number of detected bounding boxes. Each detection has the format [image_id, label, conf, x_min, y_min, x_max, y_max], i.e. [ID of the image in the batch, predicted class ID (1 = person), confidence of the predicted class, coordinates of the top-left box corner, coordinates of the bottom-right box corner]; the coordinates are relative values in the range [0, 1].

As described in the previous sections, the POT API consists of three modules: Engine, Metric, and DataLoader. In the following sections we walk through the steps of using the POT API.

2 Importing the IR Model

First, we need to import the IR model. If the model is in PyTorch or TensorFlow format, it has to be converted to the IR format first. The relevant code:

print("Download the model from Open Model Zoo.") 
ir_path = Path("intel/person-detection-retail-0013/FP32/person-detection-retail-0013.xml")
if not ir_path.exists():
    ! omz_downloader --name "person-detection-retail-0013" --precisions FP32
print("Load the IR model, and get information about network inputs and outputs.") 
ie = Core()
model = ie.read_model(model=ir_path)
compiled_model = ie.compile_model(model=model, device_name="CPU")
input_layer = compiled_model.input(0)
output_layer = compiled_model.output(0)
print("model input info: {}".format(input_layer))
print("model output info: {}".format(output_layer))
input_size = input_layer.shape
_, _, input_height, input_width = input_size

The terminal prints:

Download the model from Open Model Zoo.
Load the IR model, and get information about network inputs and outputs.
model input info: <ConstOutput: names[data] shape{1,3,320,544} type: f32>
model output info: <ConstOutput: names[detection_out] shape{1,1,200,7} type: f32>
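
To make the input/output layout concrete, the short sketch below runs the FP32 model on a single image and keeps only the confident detections (a minimal sketch; the image path test.jpg and the 0.5 confidence threshold are arbitrary assumptions):

import cv2
import numpy as np

image = cv2.imread("test.jpg")                             # any BGR image (path is an assumption)
resized = cv2.resize(image, (input_width, input_height))   # resize to the 544x320 model input
input_data = np.expand_dims(resized.transpose(2, 0, 1), axis=0).astype(np.float32)  # [1,3,320,544]

result = compiled_model([input_data])[output_layer]        # shape [1,1,200,7]
h, w = image.shape[:2]
for image_id, label, conf, xmin, ymin, xmax, ymax in result[0, 0]:
    if conf > 0.5:
        # Coordinates are relative values in [0, 1]; scale them back to the original image size.
        box = (int(xmin * w), int(ymin * h), int(xmax * w), int(ymax * h))
        print(f"person, confidence {conf:.2f}, box {box}")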

3 DataLoader (the DetectionDataLoader class)

3.1 Introduction

To implement the Metric and the DataLoader, we need to know the model's output format and the annotation format. The DataLoader is responsible for loading the dataset, including data preprocessing.

The dataset in this example uses annotations in JSON format, with the keys ['categories', 'annotations', 'images']. annotations is a list of dictionaries, one entry per annotation. Each entry contains a bbox key that stores the annotated box in [xmin, ymin, width, height] format. This dataset has only one label: "person".

Example of categories:

"categories": [
        {
            "id": 1,
            "name": "person",
            "supercategory": ""
        }
    ],

Example of annotations:

"annotations": [
        {
            "category_id": 1,
            "segmentation": null,
            "bbox": [
                1008,
                199,
                185,
                458
            ],
            "iscrowd": 0,
            "area": 84730,
            "id": 0,
            "image_id": 0,
            "attributes": {},
            "is_occluded": false
        },

Example of images:

"images": [
        {
            "date_captured": null,
            "flickr_url": null,
            "width": 1920,
            "dataset": "IOTG_RSD_Team_datasets",
            "file_name": "image_000000.jpg",
            "license": null,
            "image": "train/image_000000.jpg",
            "id": 0,
            "height": 1080,
            "coco_url": null
        },
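
Before wrapping the dataset in a DataLoader, it can help to load the annotation file directly and check its structure (a small sanity-check sketch; the path follows the data directory used in section 6):

import json
from collections import Counter

with open("data/annotation_person_train.json") as f:
    annotation = json.load(f)

print(list(annotation.keys()))    # expected keys: categories, annotations, images
print(f"{len(annotation['images'])} images, {len(annotation['annotations'])} annotated boxes")

# Number of boxes per image; the DetectionDataLoader below asserts that every image has at least one.
boxes_per_image = Counter(item["image_id"] for item in annotation["annotations"])
print("boxes on image 0:", boxes_per_image[0])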

3.2 The DetectionDataLoader Class

The DetectionDataLoader class follows POT's compression.api.DataLoader interface. It implements __init__, __getitem__ and __len__, where __getitem__ returns items as (annotation, image) or (annotation, image, metadata), with the annotation in the form (index, label); in this example, label is the list of box annotations for the image.

Note that when instantiating the DetectionDataLoader class, we need to pass:

  • basedir: the path of the folder containing the calibration dataset and the annotation file;
  • target_size: a Tuple[int, int], e.g. (input_width, input_height); this is the size the calibration images are resized to before being fed into the IR model, i.e. the IR model's input size.

The code of the DetectionDataLoader class is as follows:

import json
from pathlib import Path
from typing import Tuple

import cv2
import numpy as np

from compression.api import DataLoader


class DetectionDataLoader(DataLoader):
    def __init__(self, basedir: str, target_size: Tuple[int, int]):
        """
        :param basedir: Directory that contains images and annotation as "annotation.json"
        :param target_size: Tuple of (width, height) to resize images to.
        """
        self.images = sorted(Path(basedir).glob("*.jpg"))
        self.target_size = target_size
        with open(f"{basedir}/annotation_person_train.json") as f:
            self.annotations = json.load(f)
        self.image_ids = {
            Path(item["file_name"]).name: item["id"]
            for item in self.annotations["images"]
        }

        for image_filename in self.images:
            annotations = [
                item
                for item in self.annotations["annotations"]
                if item["image_id"] == self.image_ids[Path(image_filename).name]
            ]
            assert (
                len(annotations) != 0
            ), f"No annotations found for image id {image_filename}"

        print(
            f"Created dataset with {len(self.images)} items. Data directory: {basedir}"
        )

    def __getitem__(self, index):
        """
        Get an item from the dataset at the specified index.
        Detection boxes are converted from absolute coordinates to relative coordinates
        between 0 and 1 by dividing xmin, xmax by image width and ymin, ymax by image height.

        :return: (annotation, input_image, metadata) where annotation is (index, target_annotation)
                 with target_annotation as a dictionary with keys category_id, image_width, image_height
                 and bbox, containing the relative bounding box coordinates [xmin, ymin, xmax, ymax]
                 (with values between 0 and 1) and metadata a dictionary: {"filename": path_to_image}
        """
        image_path = self.images[index]
        image = cv2.imread(str(image_path))
        image = cv2.resize(image, self.target_size)
        image_id = self.image_ids[Path(image_path).name]

        # image_info contains height and width of the annotated image
        image_info = [
            image for image in self.annotations["images"] if image["id"] == image_id
        ][0]
        # image_annotations contains the boxes and labels for the image
        image_annotations = [
            item
            for item in self.annotations["annotations"]
            if item["image_id"] == image_id
        ]

        # annotations are in xmin, ymin, width, height format. Convert to
        # xmin, ymin, xmax, ymax and normalize to image width and height as
        # stored in the annotation
        target_annotations = []
        for annotation in image_annotations:
            xmin, ymin, width, height = annotation["bbox"]
            xmax = xmin + width
            ymax = ymin + height
            xmin /= image_info["width"]
            ymin /= image_info["height"]
            xmax /= image_info["width"]
            ymax /= image_info["height"]
            target_annotation = {}
            target_annotation["category_id"] = annotation["category_id"]
            target_annotation["image_width"] = image_info["width"]
            target_annotation["image_height"] = image_info["height"]
            target_annotation["bbox"] = [xmin, ymin, xmax, ymax]
            target_annotations.append(target_annotation)

        item_annotation = (index, target_annotations)
        input_image = np.expand_dims(image.transpose(2, 0, 1), axis=0).astype(
            np.float32
        )
        return (
            item_annotation,
            input_image,
            {"filename": str(image_path), "shape": image.shape},
        )

    def __len__(self):
        return len(self.images)
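
Before handing the loader to POT, a quick check of one item helps confirm the preprocessing and the annotation format (a small sketch, assuming the same data directory that section 6 uses):

data_loader = DetectionDataLoader(basedir="data", target_size=(input_width, input_height))
print("dataset size:", len(data_loader))

(index, target_annotations), input_image, metadata = data_loader[0]
print("input image shape:", input_image.shape)            # (1, 3, 320, 544)
print("number of boxes:", len(target_annotations))
print("first box (relative coords):", target_annotations[0]["bbox"])
print("source file:", metadata["filename"])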

4 Metric (the MAPMetric class)

We define a metric to evaluate the model's performance. If we use the default quantization algorithm, defining a metric is optional, but it allows us to compare the quantized INT8 model with the original full-precision IR model.

In this tutorial we use the MAP metric from TorchMetrics. The POT metric class inherits from compression.api.Metric.

The relevant code:

import torch
import torchmetrics

from compression.api import Metric


class MAPMetric(Metric):
    def __init__(self, map_value="map"):
        """
        Mean Average Precision Metric. Wraps torchmetrics implementation, see
        https://torchmetrics.readthedocs.io/en/latest/references/modules.html#map

        :map_value: specific metric to return. Default: "map"
                    Change to one of the values in the list below to return a different value
                    ['mar_1', 'mar_10', 'mar_100', 'mar_small', 'mar_medium', 'mar_large',
                     'map', 'map_50', 'map_75', 'map_small', 'map_medium', 'map_large']
                    See torchmetrics documentation for more details.
        """
        assert (
            map_value
            in torchmetrics.detection.map.MARMetricResults.__slots__
            + torchmetrics.detection.map.MAPMetricResults.__slots__
        )

        self._name = map_value
        self.metric = torchmetrics.detection.map.MAP()
        super().__init__()

    @property
    def value(self):
        """
        Returns metric value for the last model output.
        Possible format: {metric_name: [metric_values_per_image]}
        """
        # Per-image values are not tracked in this simple example; a placeholder is returned.
        return {self._name: [0]}

    @property
    def avg_value(self):
        """
        Returns average metric value for all model outputs.
        Possible format: {metric_name: metric_value}
        """
        return {self._name: self.metric.compute()[self._name].item()}

    def update(self, output, target):
        """
        Convert network output and labels to the format that torchmetrics' MAP
        implementation expects, and call `metric.update()`.

        :param output: model output
        :param target: annotations for model output
        """
        targetboxes = []
        targetlabels = []
        predboxes = []
        predlabels = []
        scores = []

        image_width = target[0][0]["image_width"]
        image_height = target[0][0]["image_height"]

        for single_target in target[0]:
            txmin, tymin, txmax, tymax = single_target["bbox"]
            category = single_target["category_id"]
            txmin *= image_width
            txmax *= image_width
            tymin *= image_height
            tymax *= image_height

            targetbox = [round(txmin), round(tymin), round(txmax), round(tymax)]
            targetboxes.append(targetbox)
            targetlabels.append(category)

        for single_output in output:
            for pred in single_output[0, 0, ::]:
                image_id, label, conf, xmin, ymin, xmax, ymax = pred
                xmin *= image_width
                xmax *= image_width
                ymin *= image_height
                ymax *= image_height

                predbox = [round(xmin), round(ymin), round(xmax), round(ymax)]
                predboxes.append(predbox)
                predlabels.append(label)
                scores.append(conf)

        preds = [
            dict(
                boxes=torch.Tensor(predboxes).float(),
                labels=torch.Tensor(predlabels).short(),
                scores=torch.Tensor(scores),
            )
        ]
        targets = [
            dict(
                boxes=torch.Tensor(targetboxes).float(),
                labels=torch.Tensor(targetlabels).short(),
            )
        ]
        self.metric.update(preds, targets)

    def reset(self):
        """
        Resets metric
        """
        self.metric.reset()

    def get_attributes(self):
        """
        Returns a dictionary of metric attributes {metric_name: {attribute_name: value}}.
        Required attributes: 'direction': 'higher-better' or 'higher-worse'
                             'type': metric type
        """
        return {self._name: {"direction": "higher-better", "type": "mAP"}}

5 Engine

At this point, the only remaining one of the three POT API modules is the Engine. The Engine module runs model inference and provides statistics and accuracy metrics for the model. First, we need some configuration:

  • model_config contains the IR model's name, its path (the variable ir_path points to the IR model's xml file), and the path to its weights file;
  • engine_config contains the configuration of the inference engine; here we only set it to run model inference on the CPU;
  • default_algorithms selects the optimization algorithm; here we use the DefaultQuantization algorithm.

See the Post-Training Optimization Best Practices and the official POT documentation pages for more information about the settings and best practices.

The relevant code:

import addict

# Model config specifies the model name and paths to model .xml and .bin file
model_config = addict.Dict(
    {
        "model_name": ir_path.stem,
        "model": ir_path,
        "weights": ir_path.with_suffix(".bin"),
    }
)

# Engine config
engine_config = addict.Dict({"device": "CPU"})

# Standard DefaultQuantization config. For this tutorial stat_subset_size is ignored
# because there are fewer than 300 images. For production use 300 is recommended.
default_algorithms = [
    {
        "name": "DefaultQuantization",
        "stat_subset_size": 300,
        "params": {
            "target_device": "ANY",
            "preset": "mixed",  # choose between "mixed" and "performance"
        },
    }
]

print(f"model_config: {model_config}")

6 Building and Running the Pipeline

Once the three POT API modules, Engine, Metric and DataLoader, are configured, we can assemble the pipeline. Setting up the pipeline takes the following steps:

  • Instantiate DetectionDataLoader. In the DataLoader section we wrote the DetectionDataLoader class; here we instantiate it as data_loader.
  • Load the model. load_model() loads the IR model specified in model_config.
  • Instantiate MAPMetric. In the Metric section we wrote the MAPMetric class; here we instantiate it as metric.
  • Initialize the Engine. IEEngine is the POT implementation of the inference engine, and it is passed into the POT pipeline created by create_pipeline(). IEEngine takes three inputs: engine_config, data_loader and metric. This brings us back to the figure at the beginning of the article: once the DataLoader and Metric are initialized, their instances are passed into IEEngine.
  • Initialize and run the pipeline. Creating and running the POT pipeline takes only two lines of code: we create the pipeline with the create_pipeline function and run it with pipeline.run().
  • Save the optimized model. To reuse the quantized model later, we compress the model weights and save the compressed model to disk.

The relevant code:

import time

from yaspin import yaspin

from compression.engines.ie_engine import IEEngine
from compression.graph import load_model, save_model
from compression.graph.model_utils import compress_model_weights
from compression.pipeline.initializer import create_pipeline

# Step 1: create data loader
data_loader = DetectionDataLoader(
    basedir="data", target_size=(input_width, input_height)
)

# Step 2: load model
ir_model = load_model(model_config=model_config)

# Step 3: initialize the metric
# For DefaultQuantization, specifying a metric is optional: metric can be set to None
metric = MAPMetric(map_value="map")

# Step 4: Initialize the engine for metric calculation and statistics collection.
engine = IEEngine(config=engine_config, data_loader=data_loader, metric=metric)

# Step 5: Create a pipeline of compression algorithms.
# algorithms is defined in the Config cell above this cell
pipeline = create_pipeline(default_algorithms, engine)

# Step 6: Execute the pipeline to quantize the model
algorithm_name = pipeline.algo_seq[0].name
with yaspin(
    text=f"Executing POT pipeline on {model_config['model']} with {algorithm_name}"
) as sp:
    start_time = time.perf_counter()
    compressed_model = pipeline.run(ir_model)
    end_time = time.perf_counter()
    sp.ok("✔")
print(f"Quantization finished in {end_time - start_time:.2f} seconds")

# Step 7 (Optional): Compress model weights to quantized precision
#                    in order to reduce the size of the final .bin file
compress_model_weights(compressed_model)

# Step 8: Save the compressed model to the desired path.
# Set save_path to the directory where the compressed model should be stored
preset = pipeline._algo_seq[0].config["preset"]
algorithm_name = pipeline.algo_seq[0].name
compressed_model_paths = save_model(
    model=compressed_model,
    save_path="optimized_model",
    model_name=f"{ir_model.name}_{preset}_{algorithm_name}",
)

compressed_model_path = compressed_model_paths[0]["model"]
print("The quantized model is stored at", compressed_model_path)

7 Model Comparison

Finally, we compare the full-precision model with the optimized/quantized model from the following angles:

                            mAP        Model size    Throughput (benchmark_app)
FP32 full-precision model   0.67329    2823.60 KB    38.51 FPS
INT8 quantized model        0.66534    806.62 KB     90.03 FPS

7.1 mAP

# Compute the mAP on the quantized model and compare it with the mAP on the FP32 IR model.
ir_model = load_model(model_config=model_config)
evaluation_pipeline = create_pipeline(algo_config=dict(), engine=engine)

with yaspin(text="Evaluating original IR model") as sp:
    original_metric = evaluation_pipeline.evaluate(ir_model)

with yaspin(text="Evaluating quantized IR model") as sp:
    quantized_metric = pipeline.evaluate(compressed_model)

if original_metric:
    for key, value in original_metric.items():
        print(f"The {key} score of the original FP32 model is {value:.5f}")

if quantized_metric:
    for key, value in quantized_metric.items():
        print(f"The {key} score of the quantized INT8 model is {value:.5f}")

The results are as follows:

The map score of the original FP32 model is 0.67329
The map score of the quantized INT8 model is 0.66534

7.2 Model Size

original_model_size = Path(ir_path).with_suffix(".bin").stat().st_size / 1024
quantized_model_size = (
    Path(compressed_model_path).with_suffix(".bin").stat().st_size / 1024
)

print(f"FP32 model size: {original_model_size:.2f} KB")
print(f"INT8 model size: {quantized_model_size:.2f} KB")

The results are as follows:

FP32 model size: 2823.60 KB
INT8 model size: 806.62 KB

7.3 Comparing the Original Full-precision Model and the Quantized Model with benchmark_app

To measure the inference performance of the FP32 and INT8 models, we use OpenVINO's benchmarking solution, the Benchmark Tool. In a notebook it can be run with ! benchmark_app or %sx benchmark_app.

To benchmark the FP32 model:

!benchmark_app -m $ir_path -d CPU -api async -t 15 -b 1 -cdir model_cache

We get the following result:

[Step 1/11] Parsing and validating input arguments
[ WARNING ]  -nstreams default value is determined automatically for a device. Although the automatic selection usually provides a reasonable performance, but it still may be non-optimal for some cases, for more information look at README. 
[Step 2/11] Loading OpenVINO
[ WARNING ] PerformanceMode was not explicitly specified in command line. Device CPU performance hint will be set to THROUGHPUT.
[ INFO ] OpenVINO:
         API version............. 2022.1.0-7019-cdb9bec7210-releases/2022/1
[ INFO ] Device info
         CPU
         openvino_intel_cpu_plugin version 2022.1
         Build................... 2022.1.0-7019-cdb9bec7210-releases/2022/1

[Step 3/11] Setting device configuration
[ WARNING ] -nstreams default value is determined automatically for CPU device. Although the automatic selection usually provides a reasonable performance, but it still may be non-optimal for some cases, for more information look at README.
[Step 4/11] Reading network files
[ INFO ] Read model took 55.00 ms
[Step 5/11] Resizing network to match image sizes and given batch
[ INFO ] Network batch size: 1
[Step 6/11] Configuring input of the model
[ INFO ] Model input 'data' precision u8, dimensions ([N,C,H,W]): 1 3 320 544
[ INFO ] Model output 'detection_out' precision f32, dimensions ([...]): 1 1 200 7
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 333.00 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] DEVICE: CPU
[ INFO ]   AVAILABLE_DEVICES  , ['']
...
    AVG:        103.64 ms
    MIN:        30.67 ms
    MAX:        555.79 ms
Throughput: 38.51 FPS

For the quantized INT8 model:

!benchmark_app -m $compressed_model_path -d CPU -api async -t 15 -b 1 -cdir model_cache

We get the following result:

[Step 1/11] Parsing and validating input arguments
[ WARNING ]  -nstreams default value is determined automatically for a device. Although the automatic selection usually provides a reasonable performance, but it still may be non-optimal for some cases, for more information look at README. 
[Step 2/11] Loading OpenVINO
[ WARNING ] PerformanceMode was not explicitly specified in command line. Device CPU performance hint will be set to THROUGHPUT.
[ INFO ] OpenVINO:
         API version............. 2022.1.0-7019-cdb9bec7210-releases/2022/1
[ INFO ] Device info
         CPU
         openvino_intel_cpu_plugin version 2022.1
         Build................... 2022.1.0-7019-cdb9bec7210-releases/2022/1

[Step 3/11] Setting device configuration
[ WARNING ] -nstreams default value is determined automatically for CPU device. Although the automatic selection usually provides a reasonable performance, but it still may be non-optimal for some cases, for more information look at README.
[Step 4/11] Reading network files
[ INFO ] Read model took 93.97 ms
[Step 5/11] Resizing network to match image sizes and given batch
[ INFO ] Network batch size: 1
[Step 6/11] Configuring input of the model
[ INFO ] Model input 'data' precision u8, dimensions ([N,C,H,W]): 1 3 320 544
[ INFO ] Model output 'detection_out' precision f32, dimensions ([...]): 1 1 200 7
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 681.53 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] DEVICE: CPU
[ INFO ]   AVAILABLE_DEVICES  , ['']
...
    AVG:        44.31 ms
    MIN:        32.49 ms
    MAX:        153.90 ms
Throughput: 90.03 FPS