OpenVINO Series 15: OpenVINO OCR

破浪会有时

Article link: https://blog.csdn.net/zyctimes/article/details/124761580

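This example shows how to use OpenVINO's OCR models for text detection and text recognition. In my tests, the OCR models that OpenVINO provides gave only mediocre results: they recognize only digits and letters, special characters reduce the recognition accuracy, and the text has to be reasonably well aligned and of sufficient resolution.

Text detection model: horizontal-text-detection-0001.
Text recognition model: text-recognition-0014.

Environment:

Runtime environment: Windows 10, laptop with a 10th-generation Intel Core i5
IDE: VSCode
OpenVINO version: 2022.1
Code link: 11-OCR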

1. Using the models

The OpenVINO Model Zoo provides many pretrained models.

1.1 Pretrained text detection models

For text detection, the Model Zoo provides the following models:

	horizontal-text-detection-0001	text-detection-0003	text-detection-0004
Description	based on FCOS architecture with MobileNetV2-like as a backbone	based on PixelLink architecture with MobileNetV2-like as a backbone	based on PixelLink architecture with MobileNetV2, depth_multiplier=1.4 as a backbone
Input	[1,3,704,704], layout [B,C,H,W]	[1,768,1280,3], layout [B,H,W,C]	[1,768,1280,3], layout [B,H,W,C]
Output 1	`boxes`: [N,5], where N is the number of detected bounding boxes. Each box has the format [x_min,y_min,x_max,y_max,conf]	`model/link_logits_/add`: [1,192,320,16], logits related to linkage between pixels and their neighbors	`model/link_logits_/add`: [1,192,320,16], logits related to linkage between pixels and their neighbors
Output 2	`labels`: [N], where N is the number of detected bounding boxes. For text detection, the value for every detected box is 0.	`model/segm_logits/add`: [1,192,320,2], logits related to text/no-text classification for each pixel	`model/segm_logits/add`: [1,192,320,2], logits related to text/no-text classification for each pixel

B - batch size; H - image height; W - image width; C - number of channels.
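
The two input layouts in the table differ only in axis order. As a minimal sketch (the zero-filled array is a stand-in for an image loaded with OpenCV, which is an assumption for illustration), an H x W x C image can be arranged into either layout with NumPy:

```python
import numpy as np

# Stand-in for an image in OpenCV's H x W x C layout (hypothetical data).
image_hwc = np.zeros((768, 1280, 3), dtype=np.uint8)

# [B, H, W, C]: only a batch axis is added (layout of text-detection-0003/0004).
nhwc = np.expand_dims(image_hwc, 0)

# [B, C, H, W]: channels move to the front, then a batch axis is added
# (layout of horizontal-text-detection-0001, which also expects 704 x 704).
nchw = np.expand_dims(image_hwc.transpose(2, 0, 1), 0)

print(nhwc.shape)
print(nchw.shape)
```

The same `transpose(2, 0, 1)` plus `expand_dims` pattern is what the detection code later in this article applies after resizing.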

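1.2 FCOS recap

The horizontal-text-detection-0001 model is based on the FCOS architecture. Here is a brief recap of FCOS (Fully Convolutional One-Stage Object Detection).

FCOS is an end-to-end, anchor-free, one-stage object detection algorithm. Its network consists of three parts:

the backbone network;
the feature pyramid;
the output heads (classification / regression / center-ness).

[Figure: FCOS network architecture]

Following FPN, objects of different sizes are detected on feature maps at different levels. Specifically, five levels of feature maps are used, defined as { $P_3$, $P_4$, $P_5$, $P_6$, $P_7$ }. $P_3$, $P_4$, $P_5$ are produced from the backbone CNN's feature maps $C_3$, $C_4$, $C_5$ through a 1x1 convolution with lateral connections. $P_6$ and $P_7$ are produced by applying a convolutional layer with stride=2 to $P_5$ and $P_6$ respectively. As a result, $P_3$, $P_4$, $P_5$, $P_6$, $P_7$ have strides 8, 16, 32, 64, 128.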
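
To make the strides concrete, the sketch below computes the spatial size of each pyramid level for the 704 x 704 input of horizontal-text-detection-0001. The helper name `fpn_level_sizes` is hypothetical, and rounding up with ceiling division is an assumption about how the stride-2 convolutions are padded, not something confirmed for this model.

```python
import math

def fpn_level_sizes(height, width, strides=(8, 16, 32, 64, 128)):
    """Return {level name: (rows, cols)} for the feature maps P3..P7."""
    sizes = {}
    for level, stride in enumerate(strides, start=3):
        # ASSUMPTION: sizes round up, as with padded stride-2 convolutions.
        sizes[f"P{level}"] = (math.ceil(height / stride), math.ceil(width / stride))
    return sizes

sizes = fpn_level_sizes(704, 704)
for name, size in sizes.items():
    print(name, size)
```

The coarse levels have very few cells, so they can only be responsible for large objects, while the fine levels handle small ones.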

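The head on the right of the figure is the key part of FCOS. The features at each level feed two branches: the upper branch performs classification and the lower branch regresses the position of the bounding box. The classification branch also carries a center-ness branch, which predicts how close a location is to the center of its object. Unlike the usual center point + width/height or corner-coordinate representations, FCOS describes a box by a location on the feature map plus a 4D vector (l, t, r, b), the distances from that location to the left, top, right and bottom sides of the box.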
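
The (l, t, r, b) representation can be sketched in a few lines. The helper names `encode_ltrb` and `decode_ltrb` and the sample numbers are hypothetical, chosen only to illustrate the geometry:

```python
def encode_ltrb(x, y, box):
    """Distances from location (x, y) to the four sides of box."""
    x_min, y_min, x_max, y_max = box
    return (x - x_min, y - y_min, x_max - x, y_max - y)

def decode_ltrb(x, y, l, t, r, b):
    """Recover the box corners from a location and its side distances."""
    return (x - l, y - t, x + r, y + b)

box = (100, 50, 300, 90)   # hypothetical ground-truth box
location = (120, 60)       # a point inside that box

ltrb = encode_ltrb(*location, box)
restored = decode_ltrb(*location, *ltrb)
print(ltrb)
print(restored)
```

Every location inside the box yields a different (l, t, r, b) target, but all of them decode to the same box.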

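Finally, note that FCOS treats every feature-map location that falls inside a ground-truth bounding box as a positive sample, so the number of positive samples available for training is very large.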
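
As a rough illustration, the sketch below counts the positive locations that a single box produces on the stride-8 level and evaluates the center-ness score from the FCOS paper, sqrt(min(l,r)/max(l,r) * min(t,b)/max(t,b)). The box, the helper names, and the simplification of ignoring the per-level size ranges that FCOS also applies are all assumptions made for brevity.

```python
import math

def positive_locations(box, stride, height, width):
    """Image-space centers of feature-map cells that fall inside box."""
    x_min, y_min, x_max, y_max = box
    points = []
    for row in range(math.ceil(height / stride)):
        for col in range(math.ceil(width / stride)):
            # Map the cell back to image coordinates (center of the cell).
            x = col * stride + stride // 2
            y = row * stride + stride // 2
            if x_min < x < x_max and y_min < y < y_max:
                points.append((x, y))
    return points

def centerness(l, t, r, b):
    """1.0 at the box center, approaching 0 near the box border."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

box = (100, 50, 300, 90)  # hypothetical ground-truth box
points = positive_locations(box, stride=8, height=704, width=704)
print(len(points))
print(centerness(10, 10, 10, 10), centerness(20, 10, 180, 30))
```

Even this small box yields more than a hundred positive locations on one level; the center-ness score is what lets the detector down-weight the ones far from the center.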

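The loss function is not covered here; this section only recaps the overall logic of FCOS.

1.3 PixelLink recap

text-detection-0003 and text-detection-0004 are based on PixelLink: Detecting Scene Text via Instance Segmentation. Here is a brief recap of PixelLink.

A typical deep-learning text detector decides whether a region is text and outputs the position and angle of the text box, as in the figure below:

[Figure: a typical text detection pipeline]

The FCOS model from the previous section is not designed specifically for text, but its overall logic is similar: it ends with one regression output and one classification output.

[Figure: PixelLink architecture]

PixelLink has two components: Pixel and Link. Built on a CNN, it predicts for each pixel whether it is text or non-text, and for each of the pixel's 8 neighbor directions whether a link exists (the eight heat maps inside the dashed box in the figure above, one per direction).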

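The original PixelLink network uses VGG16 as its backbone feature extractor, with the final fully connected layers fc6 and fc7 replaced by convolutional layers (the Model Zoo variants above use MobileNetV2-based backbones instead). Feature fusion and pixel prediction follow the FPN idea (feature pyramid network): the feature-map size halves from stage to stage while the number of convolution kernels doubles. The model has two separate heads, one for text/non-text prediction and one for link prediction. Both use softmax and output 1x2=2 channels (text/non-text classification) and 8x2=16 channels (link/no-link classification for each of the 8 neighbor directions).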
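
A minimal sketch of how the two outputs could be turned into masks is shown below. The random arrays only mimic the output shapes listed in section 1.1. Treating channel 1 of each pair as the positive class, pairing the 16 link channels as 8 consecutive (negative, positive) pairs, and the 0.5 thresholds are all assumptions for illustration, not the documented layout of these models.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
# Random stand-ins with the shapes of the two PixelLink outputs.
segm_logits = rng.normal(size=(1, 192, 320, 2)).astype(np.float32)
link_logits = rng.normal(size=(1, 192, 320, 16)).astype(np.float32)

# ASSUMPTION: channel 1 of each 2-channel pair is the positive class.
text_prob = softmax(segm_logits)[..., 1]
# ASSUMPTION: the 16 channels are 8 consecutive (negative, positive) pairs.
link_prob = softmax(link_logits.reshape(1, 192, 320, 8, 2))[..., 1]

pixel_mask = text_prob > 0.5                            # text pixels
link_mask = (link_prob > 0.5) & pixel_mask[..., None]   # links leaving text pixels

print(pixel_mask.shape, link_mask.shape)
```

Text instances are then obtained by grouping text pixels that are joined by positive links, and boxes are fitted around each group.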

1.4 Pretrained text recognition models

For text recognition, the Model Zoo provides the following models:

	text-recognition-0012	text-recognition-0014	text-recognition-resnet-fc
Description	VGG16-like backbone and bidirectional LSTM encoder-decoder	ResNext101-like backbone (stage-1-2) and bidirectional LSTM encoder-decoder.	model based on ResNet with Fully Connected text recognition head
Accuracy on the ICDAR13 dataset	88.18%	88.87%	92.96%
Input	[1,32,120,1], layout [B,H,W,C]	[1,1,32,128], layout [B,C,H,W]	[1,1,32,100], layout [B,C,H,W]
Note	source image should be tight aligned crop with detected text converted to grayscale.	source image should be tight aligned crop with detected text converted to grayscale.	source image should be tight aligned crop with detected text converted to grayscale. Mean values: [127.5, 127.5, 127.5], scale factor for each channel: 127.5.
Output	[30,1,37], layout [W,B,L], symbol order of L: `0123456789abcdefghijklmnopqrstuvwxyz#`	[16,1,37], layout [W,B,L], symbol order of L: `#0123456789abcdefghijklmnopqrstuvwxyz`	[1,26,37], layout [B,W,L], symbol order of L: `[s]0123456789abcdefghijklmnopqrstuvwxyz`

B - batch size; H - image height; W - image width; C - number of channels. In the output shapes, W is the output sequence length and L is the confidence distribution across alphanumeric symbols.

1.5 Final choice

We use the following models:

Text detection model: horizontal-text-detection-0001.
Text recognition model: text-recognition-0014.

2. Code

2.1 Downloading the models

As with the other models in this series, we start by downloading the models.

import shutil
import sys
from pathlib import Path

import cv2
import matplotlib.pyplot as plt
import numpy
import numpy as np
from IPython.display import Markdown, display
from openvino.runtime import Core
from PIL import Image, ImageOps
from yaspin import yaspin

ie = Core()
model_dir = Path("model")
precision = "FP16"
detection_model = "horizontal-text-detection-0001"
recognition_model = "text-recognition-0014"
# Alternative location: Path("~/open_model_zoo_models").expanduser()
base_model_dir = Path("./model/open_model_zoo_models").expanduser()
# Alternative location: Path("~/open_model_zoo_cache").expanduser()
omz_cache_dir = Path("./model/open_model_zoo_cache").expanduser()
model_dir.mkdir(exist_ok=True)

# Download the models
print("1 - Download text detection model: horizontal-text-detection-0001, and text recognition model: text-recognition-0014 from Open Model Zoo. Both models are already in IR format.")
ir_path_detection_model = Path(f"{base_model_dir}/intel/{detection_model}/{precision}/{detection_model}.xml")
ir_path_recognition_model = Path(f"{base_model_dir}/intel/{recognition_model}/{precision}/{recognition_model}.xml")

# Download when either IR file is missing. The original condition,
# `not a.exists() and b.exists()`, skipped the download when both were missing.
if not (ir_path_detection_model.exists() and ir_path_recognition_model.exists()):
    download_command = f"omz_downloader " \
                        f"--name {detection_model},{recognition_model} " \
                        f"--output_dir {base_model_dir} " \
                        f"--cache_dir {omz_cache_dir} " \
                        f"--precision {precision}"

    display(Markdown(f"Download command: `{download_command}`"))
    with yaspin(text=f"Downloading {detection_model}, {recognition_model}") as sp:
        # "!" is an IPython shell escape, so this cell must run in Jupyter/IPython
        download_result = !$download_command
        print(download_result)
        sp.text = f"Finished downloading {detection_model}, {recognition_model}"
        sp.ok("✔")
else:
    print("IR model already exists.")
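
The `!$download_command` line only works inside Jupyter/IPython. For a plain Python script, one possible alternative is to call the same CLI through `subprocess`. The sketch below reuses the flags from the command above; the helper names `build_download_command` and `download` are hypothetical, and it assumes `omz_downloader` is available on `PATH`.

```python
import shlex
import subprocess
from pathlib import Path

def build_download_command(names, output_dir, cache_dir, precision):
    """Build the omz_downloader argument list (same flags as in the cell above)."""
    return [
        "omz_downloader",
        "--name", ",".join(names),
        "--output_dir", str(output_dir),
        "--cache_dir", str(cache_dir),
        "--precision", precision,
    ]

def download(command):
    """Run the downloader; raises CalledProcessError if it fails."""
    # ASSUMPTION: the omz_downloader CLI is installed and on PATH.
    subprocess.run(command, check=True)

command = build_download_command(
    ["horizontal-text-detection-0001", "text-recognition-0014"],
    Path("model/open_model_zoo_models"),
    Path("model/open_model_zoo_cache"),
    "FP16",
)
print(shlex.join(command))
```

Calling `download(command)` then performs the actual download. Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.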

2.2 Text detection model

Load the detection model horizontal-text-detection-0001;
load the image and resize it to match the model's input size;
run inference and return the detection results.

First, we load the detection model and look at its inputs and outputs:

print("2 - Load detection Model: horizontal-text-detection-0001")

detection_model = ie.read_model(
    model=ir_path_detection_model, weights=ir_path_detection_model.with_suffix(".bin")
)
detection_compiled_model = ie.compile_model(model=detection_model, device_name="CPU")

detection_input_layer = detection_compiled_model.input(0)
detection_output_layer_box = detection_compiled_model.output('boxes')
detection_output_layer_label = detection_compiled_model.output('labels')

print("- Input of detection model shape: {}".format(detection_input_layer))
print("- Output `box` of detection model shape: {}".format(detection_output_layer_box))
print("- Output `label` of detection model shape: {}".format(detection_output_layer_label))

Terminal output:

2 - Load detection Model: horizontal-text-detection-0001
- Input of detection model shape: <ConstOutput: names[image] shape{1,3,704,704} type: f32>
- Output `box` of detection model shape: <ConstOutput: names[boxes] shape{..100,5} type: f32>
- Output `label` of detection model shape: <ConstOutput: names[labels] shape{..100} type: i64>

Next, we load the image and resize it to match the model's input size.

print("3 - Load Image and resize into model input shape.")

# Read the image (cv2.imread returns None if the file cannot be read)
image = cv2.imread("data/label4.png")
if image is None:
    raise FileNotFoundError("Cannot read data/label4.png")
print("- Input image size: {}".format(image.shape))
# N,C,H,W = batch size, number of channels, height, width
N, C, H, W = detection_input_layer.shape

# Resize image to meet network expected input sizes
# (plain resize: the aspect ratio is not preserved and no padding is added)
resized_image = cv2.resize(image, (W, H))

# Reshape to network input shape: HWC -> CHW, then add the batch axis
input_image = np.expand_dims(resized_image.transpose(2, 0, 1), 0)
print("- Input image is resized into: {}".format(input_image.shape))

plt.imshow(cv2.cvtColor(resized_image, cv2.COLOR_BGR2RGB));

Terminal output:

3 - Load Image and resize into model input shape.
- Input image size: (256, 644, 3)
- Input image is resized into: (1, 3, 704, 704)

[Figure: the input image resized to 704x704]
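
The plain `cv2.resize` call stretches the 644x256 image into a square. If the aspect ratio should be preserved, one option (not what this article's code does) is letterboxing: scale the image to fit, then pad the remainder. The sketch below only computes the geometry; the helper name `letterbox_geometry` is hypothetical.

```python
def letterbox_geometry(src_h, src_w, dst_h, dst_w):
    """Return (new_h, new_w, top, bottom, left, right) for an aspect-preserving fit."""
    scale = min(dst_h / src_h, dst_w / src_w)
    new_h, new_w = round(src_h * scale), round(src_w * scale)
    pad_h, pad_w = dst_h - new_h, dst_w - new_w
    top, left = pad_h // 2, pad_w // 2
    return new_h, new_w, top, pad_h - top, left, pad_w - left

# The 256 x 644 image from this article, fitted into the 704 x 704 model input.
geometry = letterbox_geometry(256, 644, 704, 704)
print(geometry)
```

With OpenCV, these values could be applied with `cv2.resize(image, (new_w, new_h))` followed by `cv2.copyMakeBorder(resized, top, bottom, left, right, cv2.BORDER_CONSTANT)`. Detected boxes would then have to be mapped back by subtracting the padding and dividing by the scale, instead of using the per-axis ratios in the code that follows.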

The inference code is as follows:

# Model inference
# Text boxes detected in the image are returned as an array of shape [N, 5], with N at most 100.
# Each detection has the format [x_min, y_min, x_max, y_max, conf].
print("4 - Detection model inference.")
output_key = detection_compiled_model.output("boxes")
boxes = detection_compiled_model([input_image])[output_key]

# Remove boxes that contain only zeros
boxes = boxes[~np.all(boxes == 0, axis=1)]
print("- Detect {} boxes.".format(boxes.shape[0]))

Terminal output:

4 - Detection model inference.
- Detect 4 boxes.
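
The box coordinates refer to the resized 704x704 image, so they have to be scaled back before cropping the original 644x256 image. The worked example below uses a hypothetical box and a hypothetical helper `rescale_box`; the image sizes are the ones printed above.

```python
def rescale_box(box, ratio_x, ratio_y):
    """Map [x_min, y_min, x_max, y_max, conf] to integer original-image coordinates."""
    x_min, y_min, x_max, y_max, _conf = box
    return (int(x_min * ratio_x), int(y_min * ratio_y),
            int(x_max * ratio_x), int(y_max * ratio_y))

real_h, real_w = 256, 644        # original image size
resized_h, resized_w = 704, 704  # model input size
ratio_x, ratio_y = real_w / resized_w, real_h / resized_h

box = [100.0, 200.0, 400.0, 300.0, 0.9]  # hypothetical detection
rescaled = rescale_box(box, ratio_x, ratio_y)
print(rescaled)
```

Because the image was stretched independently in each direction, x and y use different ratios (about 0.91 and 0.36 here).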

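2.3 Text recognition model

Loading and running the recognition model follows the same steps as for the detection model, so we go straight to the code: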

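def multiply_by_ratio(ratio_x, ratio_y, box):
    # Scale [x_min, y_min, x_max, y_max] from resized-image coordinates to
    # original-image coordinates; the trailing confidence value is dropped.
    # y values (odd indices) are clamped to a minimum of 10.
    return [
        max(shape * ratio_y, 10) if idx % 2 else shape * ratio_x
        for idx, shape in enumerate(box[:-1])
    ]


def run_preprocesing_on_crop(crop, net_shape):
    # Resize the grayscale crop to the network's (width, height) and add
    # batch and channel axes: (H, W) -> (1, 1, H, W)
    temp_img = cv2.resize(crop, net_shape)
    temp_img = temp_img.reshape((1,) * 2 + temp_img.shape)
    return temp_img


def convert_result_to_image(bgr_image, resized_image, boxes, threshold=0.3, conf_labels=True):
    # `boxes` is a list of (box, annotation) pairs
    # Define colors for boxes and descriptions
    colors = {"red": (255, 0, 0), "green": (0, 255, 0), "white": (255, 255, 255)}

    # Fetch image shapes to calculate ratio (use the argument, not the global `image`)
    (real_y, real_x), (resized_y, resized_x) = bgr_image.shape[:2], resized_image.shape[:2]
    ratio_x, ratio_y = real_x / resized_x, real_y / resized_y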

    # Convert base image from bgr to rgb format
    rgb_image = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)

    # Iterate through non-zero boxes
    for box, annotation in boxes:
        # Pick confidence factor from last place in array
        conf = box[-1]
        if conf > threshold:
            # Convert float to int and multiply position of each box by x and y ratio
            (x_min, y_min, x_max, y_max) = map(int, multiply_by_ratio(ratio_x, ratio_y, box))

            # Draw box based on position, parameters in rectangle function are: image, start_point, end_point, color, thickness
            cv2.rectangle(rgb_image, (x_min, y_min), (x_max, y_max), colors["green"], 3)

            # Add text to image based on position and confidence, parameters in putText function are: image, text, bottomleft_corner_textfield, font, font_scale, color, thickness, line_type
            if conf_labels:
                # Create background box based on annotation length
                (text_w, text_h), _ = cv2.getTextSize(
                    f"{annotation}", cv2.FONT_HERSHEY_TRIPLEX, 0.8, 1
                )
                image_copy = rgb_image.copy()
                cv2.rectangle(
                    image_copy,
                    (x_min, y_min - text_h - 10),
                    (x_min + text_w, y_min - 10),
                    colors["white"],
                    -1,
                )
                # Add weighted image copy with white boxes under text
                cv2.addWeighted(image_copy, 0.4, rgb_image, 0.6, 0, rgb_image)
                cv2.putText(
                    rgb_image,
                    f"{annotation}",
                    (x_min, y_min - 10),
                    cv2.FONT_HERSHEY_SIMPLEX,
                    0.8,
                    colors["red"],
                    1,
                    cv2.LINE_AA,
                )

    return rgb_image

print("5 - Load Recognition Model: text-recognition-0014")

recognition_model = ie.read_model(
    model=ir_path_recognition_model, weights=ir_path_recognition_model.with_suffix(".bin")
)

recognition_compiled_model = ie.compile_model(model=recognition_model, device_name="CPU")

recognition_output_layer = recognition_compiled_model.output(0)
recognition_input_layer = recognition_compiled_model.input(0)

# Get height and width of input layer
_, _, Hrecog, Wrecog = recognition_input_layer.shape

print("- Input of recognition model shape: {}".format(recognition_input_layer))
print("- Output of recognition model shape: {}".format(recognition_output_layer))

# Model inference
# Calculate scale for image resizing
(real_y, real_x), (resized_y, resized_x) = image.shape[:2], resized_image.shape[:2]
ratio_x, ratio_y = real_x / resized_x, real_y / resized_y

# Convert image to grayscale for text recognition model
grayscale_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Symbols used to decode the output; index 0 ("~") stands for the blank symbol
letters = "~0123456789abcdefghijklmnopqrstuvwxyz"

# Prepare empty lists for annotations and crops
annotations = list()
cropped_images = list()
# For each box given by the detection model, crop the text and read its annotation
for i, crop in enumerate(boxes):
    # Get coordinates of the crop corners in the original image
    (x_min, y_min, x_max, y_max) = map(int, multiply_by_ratio(ratio_x, ratio_y, crop))
    image_crop = run_preprocesing_on_crop(grayscale_image[y_min:y_max, x_min:x_max], (Wrecog, Hrecog))

    # Run inference with recognition model
    result = recognition_compiled_model([image_crop])[recognition_output_layer]

    # Squeeze output to remove unnecessary dimensions
    recognition_results_test = np.squeeze(result)

    # Read annotation based on probabilities from output layer
    annotation = list()
    for letter in recognition_results_test:
        parsed_letter = letters[letter.argmax()]

        # Empirical correction: with this model every decoded digit has to be
        # shifted by +1 (9 wraps around to 0)
        if parsed_letter.isnumeric():
            parsed_letter = str((int(parsed_letter) + 1) % 10)
        # Index 0 from argmax is the blank symbol: skip this time step
        if parsed_letter == letters[0]:
            continue
        annotation.append(parsed_letter)
    annotations.append("".join(annotation))
    cropped_image = Image.fromarray(image[y_min:y_max, x_min:x_max])
    cropped_images.append(cropped_image)

boxes_with_annotations = list(zip(boxes, annotations))
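
The loop above keeps every non-blank time step, so a character that spans two neighboring steps appears twice. If the model output is meant to be decoded with CTC (an assumption here, suggested by the blank symbol at index 0), greedy decoding also merges consecutive repeats before dropping blanks. The sketch below uses the symbol order listed for text-recognition-0014 in section 1.4, a hypothetical helper `ctc_greedy_decode`, and made-up per-step indices; it does not include the digit correction applied above.

```python
def ctc_greedy_decode(step_indices, alphabet, blank=0):
    """Collapse consecutive repeated indices, then drop blanks."""
    chars = []
    previous = None
    for index in step_indices:
        # Emit a symbol only when it differs from the previous step and is not blank.
        if index != previous and index != blank:
            chars.append(alphabet[index])
        previous = index
    return "".join(chars)

alphabet = "#0123456789abcdefghijklmnopqrstuvwxyz"
# Hypothetical argmax index for each of the 16 output time steps.
steps = [0, 18, 18, 0, 15, 0, 22, 22, 0, 22, 25, 0, 0, 0, 0, 0]
decoded = ctc_greedy_decode(steps, alphabet)
print(decoded)
```

With the real model, `steps` would come from `recognition_results_test.argmax(axis=1)`. The blank between the two groups of index 22 is what allows a genuine double letter to survive the merging.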