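OpenVINO Series 15: OpenVINO OCR
This example shows how to use the OpenVINO OCR models for text detection and text recognition. In my tests, the OCR models that OpenVINO provides performed only moderately well: they recognize digits and letters only, special characters reduce recognition accuracy, and they place some demands on the angle and resolution of the text.
- Model for the text detection task: horizontal-text-detection-0001.
- Model for the text recognition task: text-recognition-0014.
Environment:
- Runtime environment for this example: Win10, laptop with a 10th-generation i5
- IDE: VSCode
- OpenVINO version: 2022.1
- Code link: 11-OCR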
1. About the models
The OpenVINO Model Zoo provides many pretrained models.
1.1 Pretrained text detection models
For text detection, the Model Zoo provides the following models:
 | horizontal-text-detection-0001 | text-detection-0003 | text-detection-0004 |
---|---|---|---|
Description | based on FCOS architecture with MobileNetV2-like as a backbone | based on PixelLink architecture with MobileNetV2-like as a backbone | based on PixelLink architecture with MobileNetV2, depth_multiplier=1.4 as a backbone |
Input | [1,3,704,704], i.e. [B,C,H,W] | [1,768,1280,3], i.e. [B,H,W,C] | [1,768,1280,3], i.e. [B,H,W,C] |
Output 1 | boxes: [N,5], where N is the number of detected bounding boxes. Each detection has the format [x_min,y_min,x_max,y_max,conf] | model/link_logits_/add: [1,192,320,16], logits related to linkage between pixels and their neighbors | model/link_logits_/add: [1,192,320,16], logits related to linkage between pixels and their neighbors |
Output 2 | labels: [N], where N is the number of detected bounding boxes. For text detection, the value for every detected box is 0. | model/segm_logits/add: [1,192,320,2], logits related to text/no-text classification for each pixel | model/segm_logits/add: [1,192,320,2], logits related to text/no-text classification for each pixel |
B: batch size; H: image height; W: image width; C: number of channels.
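The two input layouts in the table differ only in where the channel axis sits. As a minimal sketch (using synthetic zero arrays in place of real images, and the hypothetical helper names `to_nchw` and `to_nhwc`), an OpenCV-style H x W x C image can be turned into either batch layout like this:

```python
import numpy as np


def to_nchw(image_hwc: np.ndarray) -> np.ndarray:
    """Turn an H x W x C image into a [1, C, H, W] batch."""
    return np.expand_dims(image_hwc.transpose(2, 0, 1), 0)


def to_nhwc(image_hwc: np.ndarray) -> np.ndarray:
    """Turn an H x W x C image into a [1, H, W, C] batch."""
    return np.expand_dims(image_hwc, 0)


# Synthetic stand-ins for images already resized to each model's input size.
fcos_input = to_nchw(np.zeros((704, 704, 3), dtype=np.float32))
pixellink_input = to_nhwc(np.zeros((768, 1280, 3), dtype=np.float32))

print(fcos_input.shape)       # (1, 3, 704, 704)
print(pixellink_input.shape)  # (1, 768, 1280, 3)
```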
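1.2 FCOS recap
The horizontal-text-detection-0001 model is based on FCOS (Fully Convolutional One-Stage Object Detection), so here is a brief recap of FCOS.
FCOS is an end-to-end, anchor-free, one-stage object detection algorithm. Its network structure, shown in the figure below, consists of three parts:
- the backbone network;
- the feature pyramid;
- the output heads (classification, regression, center-ness).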
Following FPN, objects of different sizes are detected on feature maps at different levels. Specifically, five levels of feature maps are used, denoted {$P_3$, $P_4$, $P_5$, $P_6$, $P_7$}. $P_3$, $P_4$ and $P_5$ are produced from the backbone CNN's feature maps $C_3$, $C_4$ and $C_5$ through a 1x1 convolution with lateral connections. $P_6$ and $P_7$ are produced by applying a stride-2 convolution to $P_5$ and $P_6$ respectively. As a result, $P_3$, $P_4$, $P_5$, $P_6$ and $P_7$ have strides 8, 16, 32, 64 and 128.
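To make the strides concrete, the sketch below computes the spatial size of each pyramid level for the 704x704 input used by horizontal-text-detection-0001. Rounding up with `ceil` is an assumption about how the convolutions are padded, and `feature_map_sizes` is a hypothetical helper, not part of any library.

```python
import math

# Stride of each pyramid level, as listed in the text.
STRIDES = {"P3": 8, "P4": 16, "P5": 32, "P6": 64, "P7": 128}


def feature_map_sizes(height, width):
    """Approximate (rows, cols) of every pyramid level for a given input size."""
    return {
        level: (math.ceil(height / stride), math.ceil(width / stride))
        for level, stride in STRIDES.items()
    }


sizes = feature_map_sizes(704, 704)
for level, (rows, cols) in sizes.items():
    print(level, rows, cols)
```

Under these assumptions the levels shrink from 88x88 at P3 to 6x6 at P7, which is why small text is mostly picked up on the early levels and large text on the late ones.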
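The head on the right of the figure is the key part of FCOS. Each feature level feeds two branches: the upper one performs classification and the lower one regresses the box position. The classification branch also has a center-ness branch, which predicts how close a location is to the center of its object. Unlike the traditional center point + width/height or corner-coordinate forms, FCOS predicts a box from a location and a 4D vector (l, t, r, b), the distances from that location to the left, top, right and bottom sides of the box.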
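The (l, t, r, b) parameterization can be illustrated with a few lines of arithmetic. This is only a sketch with hypothetical helper names and made-up numbers; it is not code from the OpenVINO model.

```python
def decode_ltrb(x, y, ltrb):
    """Turn a location (x, y) and distances (l, t, r, b) into a corner box."""
    l, t, r, b = ltrb
    return (x - l, y - t, x + r, y + b)


def encode_ltrb(x, y, box):
    """Inverse of decode_ltrb: distances from (x, y) to the four box sides."""
    x_min, y_min, x_max, y_max = box
    return (x - x_min, y - y_min, x_max - x, y_max - y)


location = (100, 50)
box = decode_ltrb(*location, (30, 10, 70, 20))
print(box)                          # (70, 40, 170, 70)
print(encode_ltrb(*location, box))  # (30, 10, 70, 20)
```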
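Finally, note that in FCOS any feature-map location that falls inside a ground-truth bbox is treated as a positive sample, so the number of positive samples available for training is very large.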
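The positive-sample rule and the center-ness score can be sketched as follows. The ground-truth box, the grid size and the helper names are all made up for illustration, and the sketch deliberately ignores the per-level size ranges that FCOS also applies. The center-ness formula is the one from the FCOS paper, sqrt(min(l,r)/max(l,r) * min(t,b)/max(t,b)).

```python
import math


def ltrb_targets(x, y, box):
    """Distances from location (x, y) to the left, top, right, bottom box sides."""
    x_min, y_min, x_max, y_max = box
    return (x - x_min, y - y_min, x_max - x, y_max - y)


def is_positive(x, y, box):
    """A location is positive when it lies strictly inside the ground-truth box."""
    return min(ltrb_targets(x, y, box)) > 0


def centerness(l, t, r, b):
    """Center-ness target: 1.0 at the box center, approaching 0 near the border."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))


def grid_locations(height, width, stride):
    """Image-space coordinates of every feature-map cell for one stride."""
    half = stride // 2
    return [
        (half + col * stride, half + row * stride)
        for row in range(height // stride)
        for col in range(width // stride)
    ]


gt_box = (16, 16, 80, 48)  # hypothetical ground-truth box
locations = grid_locations(64, 96, stride=8)
positives = [(x, y) for x, y in locations if is_positive(x, y, gt_box)]
print(len(locations), len(positives))          # 96 32
print(centerness(*ltrb_targets(48, 32, gt_box)))  # 1.0
```

Even for this small box a third of the stride-8 locations are positive, and center-ness is what down-weights the ones near the border.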
The cost function is not covered here; this is only a recap of the overall logic of FCOS.
1.3 PixelLink recap
text-detection-0003 and text-detection-0004 are based on PixelLink: Detecting Scene Text via Instance Segmentation. Here is a brief recap of PixelLink.
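A typical deep-learning text detection model decides whether a region is text and outputs the position and angle of the text box, as shown in the figure below:
The FCOS model from the previous section is not designed specifically for text, but its overall logic is similar: it ends with one regression output and one classification output.
PixelLink has two main parts: Pixel and Link. It uses a CNN to make a text/non-text classification for each pixel, plus a classification of whether a link exists in each of the pixel's 8 neighboring directions (the eight heat maps inside the dashed box in the figure above, one link prediction per direction).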
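The PixelLink backbone uses VGG16 as the feature extractor, with the final fully connected layers fc6 and fc7 replaced by convolutional layers. Feature fusion and pixel prediction follow the FPN (feature pyramid network) idea: the spatial size of the convolutional layers halves at each stage while the number of kernels doubles. The network has two separate heads, one for text/non-text prediction and one for link prediction. Both use softmax, producing 1x2=2 channels (text/non-text classification) and 8x2=16 channels (link/no-link classification for each of the 8 neighboring directions).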
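Once the two heads have been thresholded, text instances are obtained by joining positive pixels that are connected by a positive link. The sketch below shows that grouping step with a small union-find structure. The neighbor order in `NEIGHBOURS`, the helper name `group_pixels` and the toy masks are assumptions for illustration; the channel order of the real model may differ.

```python
import numpy as np

# Assumed (dy, dx) order of the 8 link channels; the real model may order them differently.
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def group_pixels(pixel_mask, link_mask):
    """Label connected text pixels.

    pixel_mask: [H, W] bool, True where a pixel is classified as text.
    link_mask:  [H, W, 8] bool, True where a pixel links to that neighbour.
    Returns an [H, W] int array; 0 is background, 1..K are instances.
    """
    height, width = pixel_mask.shape
    parent = {(int(y), int(x)): (int(y), int(x)) for y, x in zip(*np.nonzero(pixel_mask))}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Join two text pixels when either one predicts a link to the other.
    for y, x in list(parent):
        for k, (dy, dx) in enumerate(NEIGHBOURS):
            ny, nx = y + dy, x + dx
            inside = 0 <= ny < height and 0 <= nx < width
            if inside and pixel_mask[ny, nx] and link_mask[y, x, k]:
                union((y, x), (ny, nx))

    labels = np.zeros((height, width), dtype=np.int32)
    ids = {}
    for p in parent:
        root = find(p)
        labels[p] = ids.setdefault(root, len(ids) + 1)
    return labels


mask = np.array([[1, 1, 0, 0, 1, 1],
                 [1, 1, 0, 0, 1, 1]], dtype=bool)
all_links = np.ones(mask.shape + (8,), dtype=bool)
labels = group_pixels(mask, all_links)
print(labels)
```

With every link switched on, the toy mask splits into two instances because the blobs are separated by background. With every link switched off, each text pixel becomes its own instance.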
1.4 Pretrained text recognition models
For text recognition, the Model Zoo provides the following models:
 | text-recognition-0012 | text-recognition-0014 | text-recognition-resnet-fc |
---|---|---|---|
Description | VGG16-like backbone and bidirectional LSTM encoder-decoder | ResNext101-like backbone (stage-1-2) and bidirectional LSTM encoder-decoder | model based on ResNet with Fully Connected text recognition head |
Accuracy on the ICDAR13 dataset | 0.8818 | 0.8887 | 0.9296 |
Input | [1,32,120,1], i.e. [B,H,W,C] | [1,1,32,128], i.e. [B,C,H,W] | [1,1,32,100], i.e. [B,C,H,W] |
Notes | source image should be tight aligned crop with detected text converted to grayscale. | source image should be tight aligned crop with detected text converted to grayscale. | source image should be tight aligned crop with detected text converted to grayscale. Mean values: [127.5, 127.5, 127.5], scale factor for each channel: 127.5. |
Output | boxes: [30,1,37], i.e. [W,B,L]; order of L: 0123456789abcdefghijklmnopqrstuvwxyz# | [16,1,37], i.e. [W,B,L]; order of L: #0123456789abcdefghijklmnopqrstuvwxyz | [1,26,37], i.e. [B,W,L]; order of L: [s]0123456789abcdefghijklmnopqrstuvwxyz |
B: batch size; H: image height; W: image width; C: number of channels. In the output shapes, W is the output sequence length and L is the confidence distribution across alphanumeric symbols.
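A [W,B,L] output of this kind is commonly turned into a string with CTC greedy decoding: take the best symbol at each step, collapse repeats, then drop the blank. The sketch below assumes the symbol order listed for text-recognition-0014, with `#` at index 0 acting as the blank, and runs on a synthetic one-hot output rather than a real model. It is a generic decoder, not the decoding loop used in the code of section 2.3, which skips the blank but does not collapse repeats.

```python
import numpy as np

# Symbol order listed in the table for text-recognition-0014; '#' is assumed to be the blank.
ALPHABET = "#0123456789abcdefghijklmnopqrstuvwxyz"
BLANK_INDEX = 0


def ctc_greedy_decode(output, alphabet=ALPHABET, blank_index=BLANK_INDEX):
    """Decode a [W, B, L] score array (B == 1) into a string."""
    best = np.argmax(output[:, 0, :], axis=1)
    decoded = []
    previous = blank_index
    for index in best:
        # Emit a symbol only when it is not blank and not a repeat of the previous step.
        if index != blank_index and index != previous:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


def one_hot_sequence(indices, length=16, classes=len(ALPHABET)):
    """Build a synthetic [W, 1, L] output whose argmax follows `indices`."""
    scores = np.zeros((length, 1, classes), dtype=np.float32)
    padded = list(indices) + [BLANK_INDEX] * (length - len(indices))
    for step, index in enumerate(padded):
        scores[step, 0, index] = 1.0
    return scores


fake_output = one_hot_sequence([11, 11, 0, 12, 0, 2, 2])
print(ctc_greedy_decode(fake_output))  # ab1
```

A blank between two identical symbols keeps both of them, which is how a genuinely doubled letter survives the collapse step.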
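1.5 Final choice
In the end we use:
- Model for the text detection task: horizontal-text-detection-0001.
- Model for the text recognition task: text-recognition-0014.
2. Code
2.1 Downloading the models
As with the other models in this series, we start by downloading the models.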
import shutil
import sys
from pathlib import Path
import cv2
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import Markdown, display
from PIL import Image
from openvino.runtime import Core
from yaspin import yaspin
ie = Core()
model_dir = Path("model")
precision = "FP16"
detection_model = "horizontal-text-detection-0001"
recognition_model = "text-recognition-0014"
#base_model_dir = Path("~/open_model_zoo_models").expanduser()
base_model_dir = Path("./model/open_model_zoo_models").expanduser()
#omz_cache_dir = Path("~/open_model_zoo_cache").expanduser()
omz_cache_dir = Path("./model/open_model_zoo_cache").expanduser()
model_dir.mkdir(exist_ok=True)
# Download the models
print("1 - Download text detection model: horizontal-text-detection-0001, and text recognition model: text-recognition-0014 from Open Model Zoo. Both models are already in IR format.")
ir_path_detection_model = Path(f"{base_model_dir}/intel/{detection_model}/{precision}/{detection_model}.xml")
ir_path_recognition_model = Path(f"{base_model_dir}/intel/{recognition_model}/{precision}/{recognition_model}.xml")
# Download only if at least one of the two IR files is missing
if not (ir_path_detection_model.exists() and ir_path_recognition_model.exists()):
    download_command = f"omz_downloader " \
                       f"--name {detection_model},{recognition_model} " \
                       f"--output_dir {base_model_dir} " \
                       f"--cache_dir {omz_cache_dir} " \
                       f"--precision {precision}"
    display(Markdown(f"Download command: `{download_command}`"))
    with yaspin(text=f"Downloading {detection_model}, {recognition_model}") as sp:
        # IPython shell escape: this line only works in a notebook / IPython session
        download_result = !$download_command
        print(download_result)
        sp.text = f"Finished downloading {detection_model}, {recognition_model}"
        sp.ok("✔")
else:
    print("IR model already exists.")
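The `!$download_command` line above is IPython syntax, so the download step only runs inside a notebook. As a hedged alternative for a plain Python script, the same `omz_downloader` flags can be passed through `subprocess`. The helper names `build_omz_command` and `download_models` are hypothetical, and the sketch assumes `omz_downloader` is on the PATH.

```python
import shlex
import subprocess


def build_omz_command(models, output_dir, cache_dir, precision):
    """Assemble the omz_downloader call as an argument list."""
    return [
        "omz_downloader",
        "--name", ",".join(models),
        "--output_dir", str(output_dir),
        "--cache_dir", str(cache_dir),
        "--precision", precision,
    ]


def download_models(models, output_dir, cache_dir, precision="FP16"):
    """Run omz_downloader and return the CompletedProcess (not called here)."""
    command = build_omz_command(models, output_dir, cache_dir, precision)
    return subprocess.run(command, capture_output=True, text=True, check=False)


command = build_omz_command(
    ["horizontal-text-detection-0001", "text-recognition-0014"],
    "./model/open_model_zoo_models",
    "./model/open_model_zoo_cache",
    "FP16",
)
print(shlex.join(command))
```

Calling `download_models(...)` with the same arguments performs the actual download; checking `returncode` and `stderr` on the result shows whether it succeeded.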
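2.2 Text detection model
- Load the detection model horizontal-text-detection-0001;
- Load the image and resize it to match the model's input size;
- Run inference and return the detection results.
First, we load the detection model and look at its inputs and outputs: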
print("2 - Load detection Model: horizontal-text-detection-0001")
detection_model = ie.read_model(
model=ir_path_detection_model, weights=ir_path_detection_model.with_suffix(".bin")
)
detection_compiled_model = ie.compile_model(model=detection_model, device_name="CPU")
detection_input_layer = detection_compiled_model.input(0)
detection_output_layer_box = detection_compiled_model.output('boxes')
detection_output_layer_label = detection_compiled_model.output('labels')
print("- Input of detection model shape: {}".format(detection_input_layer))
print("- Output `box` of detection model shape: {}".format(detection_output_layer_box))
print("- Output `label` of detection model shape: {}".format(detection_output_layer_label))
Terminal output:
2 - Load detection Model.
- Input of detection model shape: <ConstOutput: names[image] shape{1,3,704,704} type: f32>
- Output `box` of detection model shape: <ConstOutput: names[boxes] shape{..100,5} type: f32>
- Output `label` of detection model shape: <ConstOutput: names[labels] shape{..100} type: i64>
Next, we load the image and resize it to match the model's input size.
print("3 - Load Image and resize into model input shape.")
# Read the image
image = cv2.imread("data/label4.png")
print("- Input image size: {}".format(image.shape))
# N,C,H,W = batch size, number of channels, height, width
N, C, H, W = detection_input_layer.shape
# Resize image to meet network expected input sizes
resized_image = cv2.resize(image, (W, H))
# Reshape to network input shape
input_image = np.expand_dims(resized_image.transpose(2, 0, 1), 0)
print("- Input image is resized (with padding) into: {}".format(input_image.shape))
plt.imshow(cv2.cvtColor(resized_image, cv2.COLOR_BGR2RGB));
Terminal output:
3 - Load Image and resize into model input shape.
- Input image size: (256, 644, 3)
- Input image is resized (with padding) into: (1, 3, 704, 704)
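Because the 256x644 image is stretched to 704x704, the boxes returned by the detector are expressed in resized coordinates and have to be scaled back before cropping the original image. A minimal sketch of that mapping, with a made-up box and a hypothetical helper `scale_box` (the post's own `multiply_by_ratio` in section 2.3 additionally floors y values at 10):

```python
def scale_box(box, original_hw, resized_hw):
    """Map [x_min, y_min, x_max, y_max, ...] from resized to original pixels."""
    real_y, real_x = original_hw
    resized_y, resized_x = resized_hw
    x_min, y_min, x_max, y_max = box[:4]
    return (
        round(x_min * real_x / resized_x),
        round(y_min * real_y / resized_y),
        round(x_max * real_x / resized_x),
        round(y_max * real_y / resized_y),
    )


# Hypothetical detection in the 704x704 resized image: x_min, y_min, x_max, y_max, conf.
box_in_resized = (176.0, 352.0, 528.0, 528.0, 0.9)
print(scale_box(box_in_resized, (256, 644), (704, 704)))  # (161, 128, 483, 192)
```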
Inference is done as follows:
# Model inference
# Text boxes detected in the image are returned as a blob of data with shape `[100, 5]`.
# Each detection has the format `[x_min, y_min, x_max, y_max, conf]`.
print("4 - Detection model inference.")
output_key = detection_compiled_model.output("boxes")
boxes = detection_compiled_model([input_image])[output_key]
# Remove zero only boxes
boxes = boxes[~np.all(boxes == 0, axis=1)]
print("- Detect {} boxes.".format(boxes.shape[0]))
Terminal output:
4 - Detection model inference.
- Detect 4 boxes.
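The zero-row filter can be tried on synthetic data without loading the model. The sketch below pads three made-up detections to 100 rows, removes the all-zero rows the same way as above, and then also applies a confidence threshold. The 0.3 value mirrors the default of `convert_result_to_image` in section 2.3, and `filter_boxes` is a hypothetical helper.

```python
import numpy as np


def filter_boxes(raw_boxes, threshold=0.3):
    """Drop all-zero padding rows, then rows whose confidence is too low."""
    non_zero = raw_boxes[~np.all(raw_boxes == 0, axis=1)]
    return non_zero[non_zero[:, 4] > threshold]


# Synthetic detector output: 3 made-up detections, the rest is zero padding.
raw = np.zeros((100, 5), dtype=np.float32)
raw[0] = [10, 20, 110, 60, 0.95]
raw[1] = [15, 80, 200, 120, 0.55]
raw[2] = [300, 10, 340, 30, 0.10]

kept = filter_boxes(raw)
print(kept.shape)  # (2, 5)
```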
2.3 Text recognition model
Loading and running the recognition model follows the same steps as the detection model, so we go straight to the code:
def multiply_by_ratio(ratio_x, ratio_y, box):
    # Scale a box back to original-image coordinates; y values (odd indices) are floored at 10
    return [
        max(shape * ratio_y, 10) if idx % 2 else shape * ratio_x
        for idx, shape in enumerate(box[:-1])
    ]


def run_preprocesing_on_crop(crop, net_shape):
    # Resize the crop to the network input size and add batch and channel dimensions
    temp_img = cv2.resize(crop, net_shape)
    temp_img = temp_img.reshape((1,) * 2 + temp_img.shape)
    return temp_img


def convert_result_to_image(bgr_image, resized_image, boxes, threshold=0.3, conf_labels=True):
    # Define colors for boxes and descriptions
    colors = {"red": (255, 0, 0), "green": (0, 255, 0), "white": (255, 255, 255)}
    # Fetch image shapes to calculate ratio (use the image passed in, not a global)
    (real_y, real_x), (resized_y, resized_x) = bgr_image.shape[:2], resized_image.shape[:2]
    ratio_x, ratio_y = real_x / resized_x, real_y / resized_y
    # Convert base image from bgr to rgb format
    rgb_image = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)
    # Iterate through non-zero boxes
    for box, annotation in boxes:
        # Pick confidence factor from last place in array
        conf = box[-1]
        if conf > threshold:
            # Convert float to int and multiply position of each box by x and y ratio
            (x_min, y_min, x_max, y_max) = map(int, multiply_by_ratio(ratio_x, ratio_y, box))
            # Draw box based on position, parameters in rectangle function are: image, start_point, end_point, color, thickness
            cv2.rectangle(rgb_image, (x_min, y_min), (x_max, y_max), colors["green"], 3)
            # Add text to image based on position and confidence, parameters in putText function are: image, text, bottomleft_corner_textfield, font, font_scale, color, thickness, line_type
            if conf_labels:
                # Create background box based on annotation length
                (text_w, text_h), _ = cv2.getTextSize(
                    f"{annotation}", cv2.FONT_HERSHEY_TRIPLEX, 0.8, 1
                )
                image_copy = rgb_image.copy()
                cv2.rectangle(
                    image_copy,
                    (x_min, y_min - text_h - 10),
                    (x_min + text_w, y_min - 10),
                    colors["white"],
                    -1,
                )
                # Add weighted image copy with white boxes under text
                cv2.addWeighted(image_copy, 0.4, rgb_image, 0.6, 0, rgb_image)
                cv2.putText(
                    rgb_image,
                    f"{annotation}",
                    (x_min, y_min - 10),
                    cv2.FONT_HERSHEY_SIMPLEX,
                    0.8,
                    colors["red"],
                    1,
                    cv2.LINE_AA,
                )
    return rgb_image


print("5 - Load Recognition Model: text-recognition-0014")
recognition_model = ie.read_model(
model=ir_path_recognition_model, weights=ir_path_recognition_model.with_suffix(".bin")
)
recognition_compiled_model = ie.compile_model(model=recognition_model, device_name="CPU")
recognition_output_layer = recognition_compiled_model.output(0)
recognition_input_layer = recognition_compiled_model.input(0)
# Get height and width of input layer
_, _, Hrecog, Wrecog = recognition_input_layer.shape
print("- Input of recognition model shape: {}".format(recognition_input_layer))
print("- Output of recognition model shape: {}".format(recognition_output_layer))
# Model inference
# Calculate scale for image resizing
(real_y, real_x), (resized_y, resized_x) = image.shape[:2], resized_image.shape[:2]
ratio_x, ratio_y = real_x / resized_x, real_y / resized_y
# Convert image to grayscale for text recognition model
grayscale_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Get dictionary to encode output, based on model documentation
letters = "~0123456789abcdefghijklmnopqrstuvwxyz"
# Prepare empty list for annotations
annotations = list()
cropped_images = list()
# fig, ax = plt.subplots(len(boxes), 1, figsize=(5,15), sharex=True, sharey=True)
# For each crop, based on boxes given by detection model we want to get annotations
for i, crop in enumerate(boxes):
    # Get coordinates on corners of crop
    (x_min, y_min, x_max, y_max) = map(int, multiply_by_ratio(ratio_x, ratio_y, crop))
    image_crop = run_preprocesing_on_crop(grayscale_image[y_min:y_max, x_min:x_max], (Wrecog, Hrecog))
    # Run inference with recognition model
    result = recognition_compiled_model([image_crop])[recognition_output_layer]
    # Squeeze output to remove unnecessary dimension
    recognition_results_test = np.squeeze(result)
    # Read annotation based on probabilities from output layer
    annotation = list()
    for letter in recognition_results_test:
        parsed_letter = letters[letter.argmax()]
        # Every recognized digit is shifted up by one, with 9 wrapping back to 0
        if parsed_letter.isnumeric():
            parsed_letter = int(parsed_letter)
            parsed_letter = parsed_letter + 1
            if parsed_letter == 10:
                parsed_letter = 0
            parsed_letter = str(parsed_letter)
        # Index 0 from argmax is the placeholder symbol letters[0]; skip it
        if parsed_letter == letters[0]:
            continue
        annotation.append(parsed_letter)
    annotations.append("".join(annotation))
    cropped_image = Image.fromarray(image[y_min:y_max, x_min:x_max])
    cropped_images.append(cropped_image)
boxes_with_annotations = list(zip(boxes, annotations))
3. Results
I tried a few images and the results were mediocre; to be honest, they were not as good as Tesseract's. See the figure below: