OpenVINO Series 15: OpenVINO OCR

This article explains how to use the OpenVINO OCR models for text detection and recognition. Overall, after trying it out, the OCR module provided by OpenVINO gives mediocre results: it can only recognize digits and Latin letters, special characters hurt the recognition accuracy, and it also has certain requirements on text angle and image resolution.

Environment:

  • Runtime environment for this example: Windows 10, 10th-generation Intel i5 laptop
  • IDE: VSCode
  • OpenVINO version: 2022.1
  • Code link: 11-OCR


1. About the Models

The OpenVINO Open Model Zoo provides many pre-trained models.

1.1 Pre-trained Text Detection Models

For text detection, the Model Zoo provides the following models:

horizontal-text-detection-0001
  • Description: based on the FCOS architecture with a MobileNetV2-like backbone
  • Input: [1,3,704,704], i.e. [1,C,H,W]
  • Output 1: boxes: [N,5], where N is the number of detected bounding boxes; each box has the format [x_min, y_min, x_max, y_max, conf]
  • Output 2: labels: [N], where N is the number of detected bounding boxes; for text detection, every value equals 0

text-detection-0003
  • Description: based on the PixelLink architecture with a MobileNetV2-like backbone
  • Input: [1,768,1280,3], i.e. [B,H,W,C]
  • Output 1: model/link_logits_/add: [1,192,320,16], logits related to linkage between pixels and their neighbors
  • Output 2: model/segm_logits/add: [1,192,320,2], logits related to text/no-text classification for each pixel

text-detection-0004
  • Description: based on the PixelLink architecture with MobileNetV2 (depth_multiplier=1.4) as the backbone
  • Input: [1,768,1280,3], i.e. [B,H,W,C]
  • Output 1: model/link_logits_/add: [1,192,320,16], logits related to linkage between pixels and their neighbors
  • Output 2: model/segm_logits/add: [1,192,320,2], logits related to text/no-text classification for each pixel

B: batch size; H: image height; W: image width; C: number of channels.

1.2 A Quick Review of FCOS

The horizontal-text-detection-0001 model is trained with FCOS. Here we give a brief review of FCOS (Fully Convolutional One-Stage Object Detection).

FCOS is an end-to-end, anchor-free, one-stage object detection algorithm. Its network structure, shown in the figure below, consists of the following three parts:

  1. a backbone network;
  2. a feature pyramid (FPN) structure;
  3. the output heads (classification / regression / center-ness).

(Figure: FCOS network architecture)

Following the FPN idea, objects of different sizes are detected on feature maps at different levels. Specifically, five levels of feature maps are used, denoted $\{P_3, P_4, P_5, P_6, P_7\}$. $P_3$, $P_4$, $P_5$ are obtained from the backbone feature maps $C_3$, $C_4$, $C_5$ through $1 \times 1$ lateral convolutions. $P_6$ and $P_7$ are obtained by applying a stride-2 convolution to $P_5$ and $P_6$, respectively. As a result, $P_3$, $P_4$, $P_5$, $P_6$, $P_7$ correspond to strides 8, 16, 32, 64, and 128.
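
In equation form, the construction described in the paragraph above is (standard FPN notation, restated for clarity):

$P_i = \mathrm{Conv}_{1 \times 1}(C_i)$, for $i \in \{3, 4, 5\}$
$P_6 = \mathrm{Conv}_{s=2}(P_5)$, $P_7 = \mathrm{Conv}_{s=2}(P_6)$
$\mathrm{stride}(P_3, P_4, P_5, P_6, P_7) = (8, 16, 32, 64, 128)$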

The Head on the right is the key part of FCOS. Each feature level is split into two branches: the upper branch performs classification and the lower branch regresses the bounding-box location. The classification branch also carries a Center-ness branch that predicts how close a location is to the object center. Unlike the traditional center-plus-width/height or corner-point representations, FCOS predicts the box location with a 4D vector (l, t, r, b) of distances measured from the location itself.

(Figure: FCOS box regression targets (l, t, r, b))
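
To make the (l, t, r, b) representation concrete, here is a small illustrative sketch (not part of the OpenVINO pipeline; the helper name and values are made up for this example) that converts the regressed distances at a feature-map location back into a box in image coordinates:

import numpy as np

def decode_fcos_box(cx, cy, ltrb, stride):
    """Convert FCOS regression targets (l, t, r, b), predicted at feature-map
    cell (cx, cy) on a level with the given stride, into [x_min, y_min, x_max, y_max]."""
    # Map the feature-map cell back to image coordinates (approximately its center)
    x = (cx + 0.5) * stride
    y = (cy + 0.5) * stride
    l, t, r, b = ltrb
    return np.array([x - l, y - t, x + r, y + b])

# Example: a location on the stride-8 level P3 predicting a 50x20 box
print(decode_fcos_box(cx=12, cy=7, ltrb=(25, 10, 25, 10), stride=8))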

Finally, note that in FCOS any feature-map location that falls inside a ground-truth bounding box is treated as a positive sample, so the number of positive samples available for training is very large.

We will not go through the cost function here; the goal is only to recall the overall logic of the FCOS algorithm.

1.3 A Quick Review of PixelLink

The algorithm behind text-detection-0003 and text-detection-0004 is based on PixelLink: Detecting Scene Text via Instance Segmentation. Here we give a brief review of PixelLink.

A typical deep-learning text detection model decides whether a region contains text and outputs the position and angle of the text box, as shown in the figure below:

(Figure: a typical text detection model: text/non-text classification plus box position and angle regression)

The FCOS model from the previous section is not specialized for text, but its overall logic is similar: it ends with one regression branch and one classification branch.

(Figure: PixelLink pixel and link predictions)

PixelLink has two main parts: pixels and links. Built on a CNN, it predicts, for each pixel, a text/non-text classification, and, for each of the pixel's eight neighbouring directions, whether a link exists (the eight heat maps inside the dashed box in the figure above are the link predictions for the eight directions).

(Figure: PixelLink network structure)

The backbone of PixelLink is VGG16 used as a feature extractor, with the last fully connected layers fc6 and fc7 replaced by convolutional layers. Feature fusion and pixel-wise prediction follow the FPN (feature pyramid network) idea: the feature maps are successively halved in size while the number of convolution kernels is successively doubled. The model has two independent heads, one for text/non-text prediction and one for link prediction. Both use softmax, producing 1x2 = 2 output channels (text/non-text classification) and 8x2 = 16 output channels (link/no-link classification for the eight neighbouring directions) respectively.
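
As a rough sketch of how these outputs can be interpreted (shapes taken from the text-detection-0003/0004 entries above; the logits here are random placeholders and the thresholding is a simplification, not the full PixelLink post-processing):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Dummy logits with the shapes reported for text-detection-0003/0004
segm_logits = np.random.randn(1, 192, 320, 2)    # text/non-text, 2 channels
link_logits = np.random.randn(1, 192, 320, 16)   # 8 directions x 2 channels

# Text/non-text probability per pixel
text_prob = softmax(segm_logits, axis=-1)[0, :, :, 1]                              # (192, 320)

# Link probability per pixel for each of the 8 neighbouring directions
link_prob = softmax(link_logits.reshape(1, 192, 320, 8, 2), axis=-1)[0, ..., 1]    # (192, 320, 8)

# Simplified thresholding; real PixelLink then groups linked text pixels into
# connected components and fits a box around each component.
text_mask = text_prob > 0.5
link_mask = link_prob > 0.5
print(text_mask.shape, link_mask.shape)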

1.4 Pre-trained Text Recognition Models

For text recognition, the Model Zoo provides the following models:

text-recognition-0012
  • Description: VGG16-like backbone with a bidirectional LSTM encoder-decoder
  • Accuracy on the ICDAR13 dataset: 88.18%
  • Input: [1,32,120,1], i.e. [B,H,W,C]
  • Note: the source image should be a tightly aligned crop of the detected text, converted to grayscale
  • Output: [30,1,37], i.e. [W,B,L]; symbol order: 0123456789abcdefghijklmnopqrstuvwxyz#

text-recognition-0014
  • Description: ResNext101-like backbone (stage-1-2) with a bidirectional LSTM encoder-decoder
  • Accuracy on the ICDAR13 dataset: 88.87%
  • Input: [1,1,32,128], i.e. [B,C,H,W]
  • Note: the source image should be a tightly aligned crop of the detected text, converted to grayscale
  • Output: [16,1,37], i.e. [W,B,L]; symbol order: #0123456789abcdefghijklmnopqrstuvwxyz

text-recognition-resnet-fc
  • Description: model based on ResNet with a fully connected text recognition head
  • Accuracy on the ICDAR13 dataset: 92.96%
  • Input: [1,1,32,100], i.e. [B,C,H,W]
  • Note: the source image should be a tightly aligned crop of the detected text, converted to grayscale; mean values: [127.5, 127.5, 127.5], scale factor for each channel: 127.5
  • Output: [1,26,37], i.e. [B,W,L]; symbol order: [s]0123456789abcdefghijklmnopqrstuvwxyz

B: batch size; H: image height; W: image width; C: number of channels. In the output shapes, W is the output sequence length and L is the confidence distribution across alphanumeric symbols.
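
To make the [W, B, L] output layout concrete, here is a minimal greedy-decoding sketch for text-recognition-0014 (the alphabet comes from the table above; the logits are random placeholders, and the full decoding actually used in this article is in section 2.3):

import numpy as np

alphabet = "#0123456789abcdefghijklmnopqrstuvwxyz"  # '#' acts as the blank symbol

# Dummy output with text-recognition-0014's shape: [W, B, L] = [16, 1, 37]
logits = np.random.randn(16, 1, 37)

decoded = []
for step in np.squeeze(logits):          # iterate over the W output steps
    symbol = alphabet[step.argmax()]     # most likely symbol at this step
    if symbol != "#":                    # skip the blank symbol
        decoded.append(symbol)
print("".join(decoded))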

1.5 Final Choice

In the end we use horizontal-text-detection-0001 for detection and text-recognition-0014 for recognition; both are already available in IR format in the Open Model Zoo.

2. Code

2.1 Downloading the Models

First, as with the other models, we download the models.

import shutil
import sys
from pathlib import Path
import cv2
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import Markdown, display
from PIL import Image
from openvino.runtime import Core
from yaspin import yaspin

ie = Core()
model_dir = Path("model")
precision = "FP16"
detection_model = "horizontal-text-detection-0001"
recognition_model = "text-recognition-0014"
#base_model_dir = Path("~/open_model_zoo_models").expanduser()
base_model_dir = Path("./model/open_model_zoo_models").expanduser()
#omz_cache_dir = Path("~/open_model_zoo_cache").expanduser()
omz_cache_dir = Path("./model/open_model_zoo_cache").expanduser()
model_dir.mkdir(exist_ok=True)
'''
Download the models
'''
print("1 - Download text detection model: horizontal-text-detection-0001, and text recognition model: text-recognition-0014 from Open Model Zoo. Both models are already in IR format.")
ir_path_detection_model = Path(f"{base_model_dir}/intel/{detection_model}/{precision}/{detection_model}.xml")
ir_path_recognition_model = Path(f"{base_model_dir}/intel/{recognition_model}/{precision}/{recognition_model}.xml")

# Download only if the IR files are not already present
if not (ir_path_detection_model.exists() and ir_path_recognition_model.exists()):
    download_command = f"omz_downloader " \
                        f"--name {detection_model},{recognition_model} " \
                        f"--output_dir {base_model_dir} " \
                        f"--cache_dir {omz_cache_dir} " \
                        f"--precision {precision}"

    display(Markdown(f"Download command: `{download_command}`"))
    with yaspin(text=f"Downloading {detection_model}, {recognition_model}") as sp:
        download_result = !$download_command
        print(download_result)
        sp.text = f"Finished downloading {detection_model}, {recognition_model}"
        sp.ok("✔")
else:
    print("IR model already exists.")

2.2 Text Detection Model

  • Load the detection model horizontal-text-detection-0001;
  • Load the image and resize it to match the model's input size;
  • Run inference and return the detection results.

First, we load the detection model and take a look at its inputs and outputs:

print("2 - Load detection Model: horizontal-text-detection-0001")

detection_model = ie.read_model(
    model=ir_path_detection_model, weights=ir_path_detection_model.with_suffix(".bin")
)
detection_compiled_model = ie.compile_model(model=detection_model, device_name="CPU")

detection_input_layer = detection_compiled_model.input(0)
detection_output_layer_box = detection_compiled_model.output('boxes')
detection_output_layer_label = detection_compiled_model.output('labels')

print("- Input of detection model shape: {}".format(detection_input_layer))
print("- Output `box` of detection model shape: {}".format(detection_output_layer_box))
print("- Output `label` of detection model shape: {}".format(detection_output_layer_label))

Terminal output:

2 - Load detection Model.
- Input of detection model shape: <ConstOutput: names[image] shape{1,3,704,704} type: f32>
- Output `box` of detection model shape: <ConstOutput: names[boxes] shape{..100,5} type: f32>
- Output `label` of detection model shape: <ConstOutput: names[labels] shape{..100} type: i64>

Next, we load the image and resize it to match the model's input size.

print("3 - Load Image and resize into model input shape.")

# Read the image
image = cv2.imread("data/label4.png")
print("- Input image size: {}".format(image.shape))
# N,C,H,W = batch size, number of channels, height, width
N, C, H, W = detection_input_layer.shape

# Resize image to meet network expected input sizes
resized_image = cv2.resize(image, (W, H))

# Reshape to network input shape
input_image = np.expand_dims(resized_image.transpose(2, 0, 1), 0)
print("- Input image is resized (with padding) into: {}".format(input_image.shape))

plt.imshow(cv2.cvtColor(resized_image, cv2.COLOR_BGR2RGB));

Terminal output:

3 - Load Image and resize into model input shape.
- Input image size: (256, 644, 3)
- Input image is resized and reshaped into: (1, 3, 704, 704)

(Figure: the resized input image)

The model inference code is as follows:

'''
### Model inference
Text boxes are detected in the image and returned as a blob of shape `[100, 5]`. Each detection has the format `[x_min, y_min, x_max, y_max, conf]`.
'''
print("4 - Detection model inference.")
output_key = detection_compiled_model.output("boxes")
boxes = detection_compiled_model([input_image])[output_key]

# Remove zero only boxes
boxes = boxes[~np.all(boxes == 0, axis=1)]
print("- Detect {} boxes.".format(boxes.shape[0]))

Terminal output:

4 - Detection model inference.
- Detect 4 boxes.

2.3 Text Recognition Model

Loading and running the text recognition model follows the same steps as the detection model, so here is the code directly:

def multiply_by_ratio(ratio_x, ratio_y, box):
    return [
        max(shape * ratio_y, 10) if idx % 2 else shape * ratio_x
        for idx, shape in enumerate(box[:-1])
    ]


def run_preprocesing_on_crop(crop, net_shape):
    temp_img = cv2.resize(crop, net_shape)
    temp_img = temp_img.reshape((1,) * 2 + temp_img.shape)
    return temp_img


def convert_result_to_image(bgr_image, resized_image, boxes, threshold=0.3, conf_labels=True):
    # Define colors for boxes and descriptions
    colors = {"red": (255, 0, 0), "green": (0, 255, 0), "white": (255, 255, 255)}

    # Fetch image shapes to calculate ratio
    (real_y, real_x), (resized_y, resized_x) = bgr_image.shape[:2], resized_image.shape[:2]
    ratio_x, ratio_y = real_x / resized_x, real_y / resized_y

    # Convert base image from bgr to rgb format
    rgb_image = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)

    # Iterate through non-zero boxes
    for box, annotation in boxes:
        # Pick confidence factor from last place in array
        conf = box[-1]
        if conf > threshold:
            # Convert float to int and multiply position of each box by x and y ratio
            (x_min, y_min, x_max, y_max) = map(int, multiply_by_ratio(ratio_x, ratio_y, box))

            # Draw box based on position, parameters in rectangle function are: image, start_point, end_point, color, thickness
            cv2.rectangle(rgb_image, (x_min, y_min), (x_max, y_max), colors["green"], 3)

            # Add text to image based on position and confidence, parameters in putText function are: image, text, bottomleft_corner_textfield, font, font_scale, color, thickness, line_type
            if conf_labels:
                # Create background box based on annotation length
                (text_w, text_h), _ = cv2.getTextSize(
                    f"{annotation}", cv2.FONT_HERSHEY_TRIPLEX, 0.8, 1
                )
                image_copy = rgb_image.copy()
                cv2.rectangle(
                    image_copy,
                    (x_min, y_min - text_h - 10),
                    (x_min + text_w, y_min - 10),
                    colors["white"],
                    -1,
                )
                # Add weighted image copy with white boxes under text
                cv2.addWeighted(image_copy, 0.4, rgb_image, 0.6, 0, rgb_image)
                cv2.putText(
                    rgb_image,
                    f"{annotation}",
                    (x_min, y_min - 10),
                    cv2.FONT_HERSHEY_SIMPLEX,
                    0.8,
                    colors["red"],
                    1,
                    cv2.LINE_AA,
                )

    return rgb_image

print("5 - Load Recognition Model: text-recognition-0014")

recognition_model = ie.read_model(
    model=ir_path_recognition_model, weights=ir_path_recognition_model.with_suffix(".bin")
)

recognition_compiled_model = ie.compile_model(model=recognition_model, device_name="CPU")

recognition_output_layer = recognition_compiled_model.output(0)
recognition_input_layer = recognition_compiled_model.input(0)

# Get height and width of input layer
_, _, Hrecog, Wrecog = recognition_input_layer.shape

print("- Input of recognition model shape: {}".format(recognition_input_layer))
print("- Output of recognition model shape: {}".format(recognition_output_layer))

'''
模型推理
'''
# Calculate scale for image resizing
(real_y, real_x), (resized_y, resized_x) = image.shape[:2], resized_image.shape[:2]
ratio_x, ratio_y = real_x / resized_x, real_y / resized_y

# Convert image to grayscale for text recognition model
grayscale_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Get dictionary to encode output, based on model documentation
letters = "~0123456789abcdefghijklmnopqrstuvwxyz"

# Prepare empty list for annotations
annotations = list()
cropped_images = list()
# fig, ax = plt.subplots(len(boxes), 1, figsize=(5,15), sharex=True, sharey=True)
# For each crop, based on boxes given by detection model we want to get annotations
for i, crop in enumerate(boxes):
    # Get coordinates on corners of crop
    (x_min, y_min, x_max, y_max) = map(int, multiply_by_ratio(ratio_x, ratio_y, crop))
    image_crop = run_preprocesing_on_crop(grayscale_image[y_min:y_max, x_min:x_max], (Wrecog, Hrecog))

    # Run inference with recognition model
    result = recognition_compiled_model([image_crop])[recognition_output_layer]

    # Squeeze output to remove unnecessary dimension
    recognition_results_test = np.squeeze(result)

    # Read annotation based on probabilities from output layer
    annotation = list()
    for letter in recognition_results_test:
        parsed_letter = letters[letter.argmax()]

        # Detected digits come out off by one, so shift each digit up by 1 and wrap 10 back to 0
        if parsed_letter.isnumeric():
            parsed_letter = int(parsed_letter)
            parsed_letter = parsed_letter + 1
            if parsed_letter == 10:
                parsed_letter = 0
            parsed_letter = str(parsed_letter)
        # Index 0 corresponds to the blank/padding symbol, so skip it
        if parsed_letter == letters[0]:
            continue
        annotation.append(parsed_letter)
    annotations.append("".join(annotation))
    cropped_image = Image.fromarray(image[y_min:y_max, x_min:x_max])
    cropped_images.append(cropped_image)

boxes_with_annotations = list(zip(boxes, annotations))
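
With the annotations paired to their boxes, the convert_result_to_image helper defined above can be used to draw the recognized text onto the original image. A minimal usage sketch:

# Draw the detected boxes and recognized text on the original image
plt.figure(figsize=(12, 8))
plt.axis("off")
plt.imshow(convert_result_to_image(image, resized_image, boxes_with_annotations, conf_labels=True));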

3. Results

I tried several images and, to be honest, the results are mediocre, not as good as Tesseract. See the figures below:

(Figures: OCR results on several test images)
