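OCR System Interface Design
1: Background on interface design
Six principles of interface design (the five SOLID principles plus the Law of Demeter)
- Single responsibility principle
- Open/closed principle
- Liskov substitution principle
- Law of Demeter (principle of least knowledge)
- Interface segregation principle
- Dependency inversion principle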
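As a rough illustration of how two of these principles (interface segregation and dependency inversion) might shape an OCR interface, here is a minimal sketch. Every class and method name in it is hypothetical and is not taken from any OCR library; the fake detector and recognizer exist only to exercise the pipeline.

```python
from typing import List, Protocol, Tuple

Box = List[float]  # flattened quadrilateral: x1, y1, x2, y2, x3, y3, x4, y4


class TextDetector(Protocol):
    """Narrow interface: detection only (interface segregation)."""

    def detect(self, image: bytes) -> List[Box]: ...


class TextRecognizer(Protocol):
    """Narrow interface: recognition only."""

    def recognize(self, image: bytes, box: Box) -> Tuple[str, float]: ...


class OcrPipeline:
    """Depends on the abstractions above rather than on a concrete model
    (dependency inversion), and has one job: chaining the two steps."""

    def __init__(self, detector: TextDetector, recognizer: TextRecognizer):
        self._detector = detector
        self._recognizer = recognizer

    def run(self, image: bytes) -> List[Tuple[Box, str, float]]:
        results = []
        for box in self._detector.detect(image):
            text, score = self._recognizer.recognize(image, box)
            results.append((box, text, score))
        return results


# Hypothetical stand-ins, used only to show the pipeline running.
class FakeDetector:
    def detect(self, image: bytes) -> List[Box]:
        return [[0.0, 0.0, 10.0, 0.0, 10.0, 5.0, 0.0, 5.0]]


class FakeRecognizer:
    def recognize(self, image: bytes, box: Box) -> Tuple[str, float]:
        return ("hello", 0.9)


pipeline = OcrPipeline(FakeDetector(), FakeRecognizer())
output = pipeline.run(b"")
print(output)
```

Swapping in a real model would then only require a class with a matching `detect` or `recognize` method; `OcrPipeline` itself would not change.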
2: Using sockets
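These notes do not fix a wire protocol, so the following is only a sketch of one common choice: each message is a JSON document preceded by a 4-byte big-endian length. The helper names `send_json` and `recv_json` are hypothetical, and `socket.socketpair()` stands in for a real client/server connection.

```python
import json
import socket
import struct


def send_json(sock: socket.socket, payload) -> None:
    """Send one JSON message prefixed with a 4-byte big-endian length."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    sock.sendall(struct.pack(">I", len(body)) + body)


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, because recv() may return fewer than asked."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("socket closed before message was complete")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_json(sock: socket.socket):
    """Receive one length-prefixed JSON message and decode it."""
    (length,) = struct.unpack(">I", _recv_exact(sock, 4))
    return json.loads(_recv_exact(sock, length).decode("utf-8"))


# A connected pair of sockets plays the roles of client and server here.
client, server = socket.socketpair()
send_json(client, {"type": "Text", "text": "股东名称", "score": 0.94})
message = recv_json(server)
client.close()
server.close()
print(message)
```

The length prefix is there because TCP is a byte stream: without some framing, the receiver cannot tell where one result ends and the next begins.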
3: Return values and their types
3.1 What is JSON
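JSON (JavaScript Object Notation) is a text format built from objects, arrays, strings, numbers, booleans and null. A small sketch of how one recognized line could be encoded with Python's standard `json` module follows; the field names `box`, `text` and `score` are assumptions made for illustration, not a format defined in these notes.

```python
import json

# One recognized line: a flattened box plus a (text, confidence) tuple.
box = [36.0, 437.0, 341.0, 437.0, 341.0, 446.0, 36.0, 447.0]
rec = ("股东名称", 0.93884647)

# float() is a precaution: confidences coming from a model may be NumPy
# scalars, and json.dumps can reject some of those types.
record = {"box": box, "text": rec[0], "score": float(rec[1])}

# ensure_ascii=False keeps Chinese characters readable instead of \uXXXX.
encoded = json.dumps(record, ensure_ascii=False)
decoded = json.loads(encoded)
print(encoded)
```

Note that JSON has no tuple type, so a `(text, confidence)` tuple passed to `json.dumps` directly would come back as a list after decoding.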
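3.2 Output of the Chinese/English recognition model
3.2.1 Full pipeline: detection + direction classifier + recognition
- The result is a list; each item contains the text box, the text, and the recognition confidence
3.2.2 Detection only
- The result is a list; each item contains only the text box
3.2.3 Recognition only
- The result is a list; each item contains only the recognized text and the recognition confidence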
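Because the three modes return items of different shapes, an interface may want to normalize them into one record type before returning them. The sketch below assumes one plausible nesting for each mode (`[box, (text, score)]`, `box`, and `(text, score)`); the exact nesting depends on the library version and should be checked against real output. The function name `normalize` and the sample values are hypothetical.

```python
from typing import Any, Dict, List


def normalize(items: List[Any], mode: str) -> List[Dict[str, Any]]:
    """Map each item to {"box", "text", "score"}; missing fields are None."""
    records = []
    for item in items:
        if mode == "full":      # assumed shape: [box, (text, score)]
            box, (text, score) = item
        elif mode == "det":     # assumed shape: box
            box, text, score = item, None, None
        elif mode == "rec":     # assumed shape: (text, score)
            box = None
            text, score = item
        else:
            raise ValueError(f"unknown mode: {mode!r}")
        records.append({"box": box, "text": text, "score": score})
    return records


# Hypothetical sample values, one item per mode.
sample_box = [0.0, 0.0, 10.0, 0.0, 10.0, 5.0, 0.0, 5.0]
full_result = [[sample_box, ("abc", 0.9)]]
det_result = [sample_box]
rec_result = [("abc", 0.9)]

for mode, result in (("full", full_result), ("det", det_result), ("rec", rec_result)):
    print(mode, normalize(result, mode))
```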
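3.3 Layout analysis output
- Regions of a document image are divided into 5 classes: text, title, list, figure, and table
- For the first three classes, the OCR model runs text detection and recognition on the region directly, and the results are saved to a txt file
- For table regions, table structure recognition converts the table image into an Excel file with the same table layout
- Figure regions are cropped out as separate images
PP-Structure returns a list of dicts, for example: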
[
{ 'type': 'Text',
'bbox': [34, 432, 345, 462],
'res': ([[36.0, 437.0, 341.0, 437.0, 341.0, 446.0, 36.0, 447.0], [41.0, 454.0, 125.0, 453.0, 125.0, 459.0, 41.0, 460.0]],
[('Tigure-6. The performance of CNN and IPT models using difforen', 0.90060663), ('Tent ', 0.465441)])
}
]
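The per-class handling described above could be expressed as a lookup on the `type` field. This is only a sketch: the handler labels are hypothetical names, and the label `'List'` is an assumption, since only `'Text'`, `'Title'`, `'Table'` and `'Figure'` appear in the output shown in these notes.

```python
# Region type -> hypothetical name of the processing step described above.
HANDLERS = {
    "Text": "ocr_to_txt",
    "Title": "ocr_to_txt",
    "List": "ocr_to_txt",       # assumed label for list regions
    "Table": "table_to_excel",
    "Figure": "crop_image",
}


def route(region: dict) -> str:
    """Return the handler name for a region, failing loudly on unknown types."""
    try:
        return HANDLERS[region["type"]]
    except KeyError:
        raise ValueError(f"unknown region type: {region.get('type')!r}") from None


example = {"type": "Text", "bbox": [34, 432, 345, 462]}
print(route(example))
```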
Actual output:
# Table
{'type': 'Table',
'bbox': [21, 108, 544, 518],
'res':
'<html><body><table><thead><tr><td colspan="7">前十名股东持股情况</td></tr><tr><td rowspan="2">股东名称</td><td rowspan="2">期末持股数量</td><td rowspan="2">比例</td><td rowspan="2">持有有限售条 件股份数量</td><td colspan="2">质押或冻结情况</td><td rowspan="2">股东性质</td></tr><tr><td>股份状态</td><td>数量</td></tr></thead><tbody><tr><td>成都交子金融控股集团 有限公司</td><td>0180613</td><td></td><td>000</td><td></td><td></td><td>国有法人</td></tr><tr><td>HONGLENGBANK Berhad</td><td>00000</td><td></td><td></td><td></td><td></td><td>境外法人</td></tr><tr><td>防海产业投资基金管理 有限公司</td><td>240000000</td><td>.64419:</td><td>240000000</td><td></td><td></td><td>内非国有法人</td></tr><tr><td>成都工投资产经营有限 公司</td><td>0242</td><td>4999796</td><td>o</td><td></td><td></td><td>国有法人</td></tr><tr><td>北京能源集团有限责任 公司</td><td>10000</td><td>4.429496</td><td>1000</td><td></td><td></td><td>国有法人</td></tr><tr><td>成都欣天颐投资有限责 任公司</td><td>4000</td><td>4381%</td><td>o</td><td></td><td></td><td>国有法人</td></tr><tr><td>上海东昌投资发展有限 公司</td><td>1200</td><td>32209</td><td>o)</td><td>陈结</td><td>50.000.00境内非国有法人</td><td></td></tr><tr><td>新华文轩出版传媒股份 有限公司</td><td>80</td><td>227</td><td>o</td><td></td><td></td><td>国有法人</td></tr><tr><td>四川新华发行集团有限 公司</td><td>800</td><td>9723</td><td></td><td></td><td></td><td>国有法人</td></tr><tr><td>成都市协成资产管理有 限责任公司</td><td>900</td><td>998</td><td></td><td></td><td></td><td>国有法人</td></tr></table></body></html>'
}
# Title
{'type': 'Title',
'bbox': [43, 3, 523, 52],
'res': ([
[73.0, 11.0, 515.0, 11.0, 515.0, 21.0, 73.0, 21.0], [47.0, 37.0, 143.0, 37.0, 143.0, 50.0, 47.0, 50.0]
],
[
('2.2截至报香期末的香通股股东总数,前十名普通股股东、前十名无限售条件的普进', 0.8316974), ('股股东的持股情况', 0.9863705)
]
)
}
# Text
{'type': 'Text',
'bbox': [24, 86, 94, 101],
'res': ([[27.0, 89.0, 88.0, 90.0, 87.0, 100.0, 26.0, 99.0]],
[('投东总数(户', 0.8357944)])
}
# Figure
# Note: the values below are not image pixel values; they are the coordinate boxes of the text (the ROIs inside the figure)
{'type': 'Figure',
'bbox': [21, 0, 542, 517],
'res': ([[71.0, 7.0, 516.0, 7.0, 516.0, 20.0, 71.0, 20.0], [47.0, 36.0, 145.0, 36.0, 145.0, 50.0, 47.0, 50.0], [428.0, 64.0, 472.0, 64.0, 472.0, 79.0, 428.0, 79.0], [25.0, 87.0, 95.0, 87.0, 95.0, 101.0, 25.0, 101.0], [506.0, 87.0, 541.0, 87.0, 541.0, 102.0, 506.0, 102.0], [236.0, 111.0, 331.0, 111.0, 331.0, 125.0, 236.0, 125.0], [259.0, 140.0, 327.0, 137.0, 328.0, 154.0, 260.0, 157.0], [356.0, 135.0, 434.0, 135.0, 434.0, 152.0, 356.0, 152.0], [59.0, 148.0, 102.0, 148.0, 102.0, 164.0, 59.0, 164.0], [140.0, 147.0, 205.0, 150.0, 205.0, 165.0, 140.0, 162.0], [220.0, 148.0, 246.0, 148.0, 246.0, 164.0, 220.0, 164.0], [265.0, 157.0, 322.0, 157.0, 322.0, 171.0, 265.0, 171.0], [481.0, 149.0, 525.0, 149.0, 525.0, 165.0, 481.0, 165.0], [335.0, 161.0, 382.0, 161.0, 382.0, 176.0, 335.0, 176.0], [416.0, 160.0, 439.0, 160.0, 439.0, 176.0, 416.0, 176.0], [25.0, 184.0, 131.0, 184.0, 131.0, 197.0, 25.0, 197.0], [147.0, 191.0, 260.0, 191.0, 260.0, 205.0, 147.0, 205.0], [263.0, 191.0, 324.0, 191.0, 324.0, 205.0, 263.0, 205.0], [480.0, 188.0, 526.0, 190.0, 525.0, 208.0, 479.0, 206.0], [26.0, 201.0, 69.0, 201.0, 69.0, 212.0, 26.0, 212.0], [25.0, 217.0, 110.0, 217.0, 110.0, 231.0, 25.0, 231.0], [148.0, 223.0, 264.0, 223.0, 264.0, 237.0, 148.0, 237.0], [258.0, 224.0, 324.0, 224.0, 324.0, 238.0, 258.0, 238.0], [482.0, 224.0, 525.0, 224.0, 525.0, 239.0, 482.0, 239.0], [26.0, 234.0, 61.0, 234.0, 61.0, 245.0, 26.0, 245.0], [25.0, 250.0, 133.0, 250.0, 133.0, 264.0, 25.0, 264.0], [148.0, 258.0, 206.0, 258.0, 206.0, 270.0, 148.0, 270.0], [217.0, 258.0, 260.0, 258.0, 260.0, 270.0, 217.0, 270.0], [264.0, 258.0, 324.0, 258.0, 324.0, 270.0, 264.0, 270.0], [25.0, 266.0, 70.0, 266.0, 70.0, 280.0, 25.0, 280.0], [464.0, 257.0, 539.0, 257.0, 539.0, 272.0, 464.0, 272.0], [25.0, 283.0, 134.0, 283.0, 134.0, 297.0, 25.0, 297.0], [216.0, 291.0, 258.0, 291.0, 258.0, 305.0, 216.0, 305.0], [317.0, 287.0, 330.0, 293.0, 324.0, 304.0, 312.0, 298.0], [150.0, 292.0, 206.0, 292.0, 206.0, 303.0, 150.0, 303.0], [26.0, 
299.0, 46.0, 299.0, 46.0, 311.0, 26.0, 311.0], [482.0, 291.0, 525.0, 291.0, 525.0, 306.0, 482.0, 306.0], [25.0, 317.0, 133.0, 317.0, 133.0, 330.0, 25.0, 330.0], [216.0, 324.0, 260.0, 324.0, 260.0, 338.0, 216.0, 338.0], [265.0, 323.0, 325.0, 323.0, 325.0, 337.0, 265.0, 337.0], [147.0, 325.0, 207.0, 323.0, 207.0, 337.0, 148.0, 339.0], [27.0, 331.0, 48.0, 334.0, 46.0, 347.0, 25.0, 344.0], [482.0, 324.0, 525.0, 324.0, 525.0, 339.0, 482.0, 339.0], [25.0, 350.0, 131.0, 350.0, 131.0, 364.0, 25.0, 364.0], [216.0, 356.0, 258.0, 356.0, 258.0, 371.0, 216.0, 371.0], [147.0, 357.0, 208.0, 357.0, 208.0, 372.0, 147.0, 372.0], [314.0, 351.0, 329.0, 351.0, 329.0, 376.0, 314.0, 376.0], [25.0, 363.0, 60.0, 366.0, 59.0, 381.0, 24.0, 378.0], [482.0, 357.0, 525.0, 357.0, 525.0, 373.0, 482.0, 373.0], [24.0, 383.0, 131.0, 384.0, 131.0, 398.0, 24.0, 397.0], [318.0, 390.0, 328.0, 390.0, 328.0, 405.0, 318.0, 405.0], [216.0, 391.0, 257.0, 391.0, 257.0, 405.0, 216.0, 405.0], [347.0, 390.0, 370.0, 390.0, 370.0, 407.0, 347.0, 407.0], [150.0, 393.0, 206.0, 393.0, 206.0, 404.0, 150.0, 404.0], [409.0, 391.0, 538.0, 391.0, 538.0, 405.0, 409.0, 405.0], [26.0, 401.0, 45.0, 401.0, 45.0, 413.0, 26.0, 413.0], [24.0, 416.0, 131.0, 417.0, 131.0, 431.0, 24.0, 430.0], [215.0, 424.0, 258.0, 424.0, 258.0, 438.0, 215.0, 438.0], [153.0, 425.0, 209.0, 425.0, 209.0, 439.0, 153.0, 439.0], [480.0, 421.0, 526.0, 423.0, 525.0, 442.0, 479.0, 439.0], [24.0, 433.0, 69.0, 431.0, 70.0, 445.0, 25.0, 448.0], [25.0, 450.0, 132.0, 450.0, 132.0, 463.0, 25.0, 463.0], [215.0, 457.0, 259.0, 457.0, 259.0, 472.0, 215.0, 472.0], [153.0, 459.0, 209.0, 459.0, 209.0, 471.0, 153.0, 471.0], [480.0, 456.0, 526.0, 456.0, 526.0, 475.0, 480.0, 475.0], [26.0, 468.0, 45.0, 468.0, 45.0, 480.0, 26.0, 480.0], [25.0, 484.0, 132.0, 484.0, 132.0, 498.0, 25.0, 498.0], [216.0, 491.0, 260.0, 491.0, 260.0, 506.0, 216.0, 506.0], [154.0, 493.0, 206.0, 493.0, 206.0, 504.0, 154.0, 504.0], [26.0, 501.0, 78.0, 501.0, 78.0, 512.0, 26.0, 512.0], [481.0, 492.0, 
525.0, 492.0, 525.0, 506.0, 481.0, 506.0]],
[('2.2截至报告期末的普通股股东总数,前十名普通股股东、前十名无限售条件的普通', 0.8929791), ('股股东的持股情况', 0.97501135), ('单位:股', 0.9774755), ('服东总数(户)', 0.92717403), ('103', 0.6561246), ('前十名股东持股情况', 0.96094346), ('持有有限售条', 0.9304914), ('质押或冻结情况', 0.9201856), ('股东名称', 0.93884647), ('期末持股数量', 0.9409886), ('比例', 0.9798088), ('件股份数量', 0.9649742), ('股东性质', 0.97255427), ('股份状态', 0.9820739), ('数量', 0.9324105), ('成都交子金融控股集团', 0.9262026), ('0613%', 0.52594924), ('1000', 0.5757624), ('国有法人', 0.97610223), ('有限公司', 0.99775326), ('HONG LEONG BANK', 0.90959585), ('650.000.00017.9943%', 0.64369446), ('650000', 0.5386675), ('境外法人', 0.9488512), ('BERHAD', 0.82195807), ('淄海产业投资基金管理', 0.8352188), ('240000000', 0.66087), ('6.6441%', 0.8387557), ('20000', 0.58459437), ('有限公司', 0.99939734), ('境内非国有法人', 0.9843875), ('成都工投资产经营有限', 0.99066544), ('4.99979', 0.68861526), ('0', 0.63635033), ('60.242', 0.5900027), ('公司', 0.99932647), ('国有法人', 0.97825503), ('北京能源集团有限责任', 0.9911963), ('4.4294%', 0.8332702), ('16000000', 0.55784523), ('160000000', 0.75301075), ('公司', 0.99832696), ('国有法人', 0.98357177), ('成都欣天颐投资有限责', 0.98289585), ('3.438196', 0.77821505), ('194000', 0.6864781), ('0', 0.12892546), ('任公司', 0.9738032), ('国有法人', 0.9918393), ('上海东昌投资发展有限', 0.98842794), ('0', 0.1811895), ('3.3220%', 0.8148122), ('陈结', 0.83211), ('10000000', 0.5448251), ('50.000.000境内非国有法人', 0.909541), ('公司', 0.9938568), ('新华文轩出版传媒股份', 0.979272), ('2.2147%', 0.8920428), ('800000', 0.5317555), ('国有法人', 0.9593512), ('有限公司', 0.99910784), ('四川新华发行集团有限', 0.99318725), ('1.97239', 0.73414296), ('4800', 0.6891037), ('国有法人', 0.98005706), ('公司', 0.9896196), ('成都市协成资产管理有', 0.9903468), ('1.96989', 0.7495075), ('115900', 0.6064301), ('限责任公司', 0.9836632), ('国有法人', 0.9811094)])
}
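In the output above, `res` is an HTML string for Table regions and a `(boxes, recognition results)` pair for the other types. A consumer therefore has to branch on that difference. The following sketch shows one way to do it; `region_lines` is a hypothetical helper, and the Table HTML in the sample is abbreviated to a placeholder.

```python
from typing import Any, Dict, List, Tuple

Line = Tuple[List[float], str, float]  # (flattened box, text, confidence)


def region_lines(region: Dict[str, Any]) -> List[Line]:
    """Return (box, text, score) triples for a region; empty for tables."""
    res = region["res"]
    if isinstance(res, str):
        # Table regions carry an HTML string instead of boxes and texts.
        return []
    boxes, recs = res
    return [(box, text, score) for box, (text, score) in zip(boxes, recs)]


sample = [
    # The HTML here is an abbreviated placeholder, not real output.
    {"type": "Table", "bbox": [21, 108, 544, 518], "res": "<html>...</html>"},
    {"type": "Text", "bbox": [24, 86, 94, 101],
     "res": ([[27.0, 89.0, 88.0, 90.0, 87.0, 100.0, 26.0, 99.0]],
             [("投东总数(户", 0.8357944)])},
]

for region in sample:
    print(region["type"], len(region_lines(region)))
```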
3.4 The key question is how to merge these results: which data should be used when?
The code below shows the general approach:
import numpy as np
from PIL import Image


def draw_structure_result(image, result, font_path):
    # image is the original picture that was recognized; if it arrives as an
    # np.ndarray, convert it to a PIL image first.
    if isinstance(image, np.ndarray):
        image = Image.fromarray(image)
    boxes, txts, scores = [], [], []
    # Collect every region type except Table, i.e. Title, Figure and Text.
    for region in result:
        if region['type'] == 'Table':
            continue
        # region['res'][0] holds the box coordinates and region['res'][1] the
        # recognition results; zip() pairs them up one box at a time.
        for box, rec_res in zip(region['res'][0], region['res'][1]):
            # Each box is a flat list of 8 numbers, for example
            #   [73.0, 11.0, 515.0, 11.0, 515.0, 21.0, 73.0, 21.0]
            # reshape(-1, 2) turns it into 4 (x, y) corner points:
            #   [[ 73.  11.]
            #    [515.  11.]
            #    [515.  21.]
            #    [ 73.  21.]]
            boxes.append(np.array(box).reshape(-1, 2))
            # Each rec_res is a single (text, confidence) tuple, for example
            #   ('股股东的持股情况', 0.9863705)
            # boxes, txts and scores stay aligned: 4 points per text box.
            txts.append(rec_res[0])
            # rec_res[1] is the recognition confidence.
            scores.append(rec_res[1])
    im_show = draw_ocr_box_txt(
        image, boxes, txts, scores, font_path=font_path, drop_score=0)
    return im_show
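import math
import random

import numpy as np
from PIL import Image, ImageDraw, ImageFont


def draw_ocr_box_txt(image,
                     boxes,
                     txts,
                     scores=None,
                     drop_score=0.5,
                     font_path="./doc/simfang.ttf"):
    h, w = image.height, image.width
    # Left panel: the source image with filled boxes. Right panel: the text
    # drawn on a white canvas at the same positions.
    img_left = image.copy()
    img_right = Image.new('RGB', (w, h), (255, 255, 255))

    random.seed(0)  # fixed seed so the box colors are reproducible
    draw_left = ImageDraw.Draw(img_left)
    draw_right = ImageDraw.Draw(img_right)
    for idx, (box, txt) in enumerate(zip(boxes, txts)):
        # Skip results whose confidence is below drop_score.
        if scores is not None and scores[idx] < drop_score:
            continue
        color = (random.randint(0, 255), random.randint(0, 255),
                 random.randint(0, 255))
        # Pass plain (x, y) tuples so Pillow receives an ordinary sequence.
        draw_left.polygon([tuple(point) for point in box], fill=color)
        draw_right.polygon(
            [
                box[0][0], box[0][1], box[1][0], box[1][1], box[2][0],
                box[2][1], box[3][0], box[3][1]
            ],
            outline=color)
        # Edge lengths: top-left to bottom-left, and top-left to top-right.
        box_height = math.sqrt((box[0][0] - box[3][0])**2 +
                               (box[0][1] - box[3][1])**2)
        box_width = math.sqrt((box[0][0] - box[1][0])**2 +
                              (box[0][1] - box[1][1])**2)
        if box_height > 2 * box_width:
            # Tall, narrow box: treat as vertical text, one character per row.
            font_size = max(int(box_width * 0.9), 10)
            font = ImageFont.truetype(font_path, font_size, encoding="utf-8")
            cur_y = box[0][1]
            for c in txt:
                # ImageFont.getsize was removed in Pillow 10; on newer
                # versions font.getbbox(c) is the replacement.
                char_size = font.getsize(c)
                draw_right.text(
                    (box[0][0] + 3, cur_y), c, fill=(0, 0, 0), font=font)
                cur_y += char_size[1]
        else:
            # Horizontal text: draw the whole string at the top-left corner.
            font_size = max(int(box_height * 0.8), 10)
            font = ImageFont.truetype(font_path, font_size, encoding="utf-8")
            draw_right.text(
                [box[0][0], box[0][1]], txt, fill=(0, 0, 0), font=font)
    # Blend the filled boxes with the original so the image stays visible.
    img_left = Image.blend(image, img_left, 0.5)
    # Put the two panels side by side.
    img_show = Image.new('RGB', (w * 2, h), (255, 255, 255))
    img_show.paste(img_left, (0, 0, w, h))
    img_show.paste(img_right, (w, 0, w * 2, h))
    return np.array(img_show)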
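After reading the code above, it is clear why the rendered result.jpg contains overlapping parts:
the same text is boxed more than once (the Figure region repeats lines that the Title and Text regions already report).
How can this be fixed?
How can only one result be kept?
Modify the code so that it returns only the Figure results: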
def draw_structure_result(image, result, font_path):
    if isinstance(image, np.ndarray):
        image = Image.fromarray(image)
    boxes, txts, scores = [], [], []
    for region in result:
        # Keep only Figure regions; Table, Title and Text are skipped.
        if region['type'] != 'Figure':
            continue
        for box, rec_res in zip(region['res'][0], region['res'][1]):
            boxes.append(np.array(box).reshape(-1, 2))
            txts.append(rec_res[0])
            scores.append(rec_res[1])
    im_show = draw_ocr_box_txt(
        image, boxes, txts, scores, font_path=font_path, drop_score=0)
    return im_show
[Figures: rendered result before and after the change]
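Keeping only the Figure region works here because that region happens to cover the whole page. A different option, sketched below and not the approach taken in these notes, is to merge all regions and drop boxes that overlap heavily with a higher-confidence box. The overlap measure is the IoU of the axis-aligned bounds, and the 0.5 threshold is an arbitrary assumption. The four sample lines are copied from the Title and Figure output above.

```python
def bounds(box):
    """Axis-aligned bounds (x_min, y_min, x_max, y_max) of a flattened quad."""
    xs, ys = box[0::2], box[1::2]
    return min(xs), min(ys), max(xs), max(ys)


def iou(a, b):
    """Intersection over union of the axis-aligned bounds of two boxes."""
    ax1, ay1, ax2, ay2 = bounds(a)
    bx1, by1, bx2, by2 = bounds(b)
    iw = min(ax2, bx2) - max(ax1, bx1)
    ih = min(ay2, by2) - max(ay1, by1)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union


def dedupe(lines, threshold=0.5):
    """lines: list of (box, text, score). Among overlapping boxes keep the
    one with the highest score; the original order is preserved."""
    by_score = sorted(range(len(lines)), key=lambda i: lines[i][2], reverse=True)
    kept = []
    for i in by_score:
        if all(iou(lines[i][0], lines[j][0]) < threshold for j in kept):
            kept.append(i)
    return [lines[i] for i in sorted(kept)]


lines = [
    # From the Title region
    ([73.0, 11.0, 515.0, 11.0, 515.0, 21.0, 73.0, 21.0],
     '2.2截至报香期末的香通股股东总数,前十名普通股股东、前十名无限售条件的普进', 0.8316974),
    ([47.0, 37.0, 143.0, 37.0, 143.0, 50.0, 47.0, 50.0], '股股东的持股情况', 0.9863705),
    # From the Figure region (same two lines, detected again)
    ([71.0, 7.0, 516.0, 7.0, 516.0, 20.0, 71.0, 20.0],
     '2.2截至报告期末的普通股股东总数,前十名普通股股东、前十名无限售条件的普通', 0.8929791),
    ([47.0, 36.0, 145.0, 36.0, 145.0, 50.0, 47.0, 50.0], '股股东的持股情况', 0.97501135),
]

unique = dedupe(lines)
for box, text, score in unique:
    print(score, text)
```

On this sample the four lines collapse to two, and each survivor is the higher-confidence reading of its line.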
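Why the drawn result looks correctly ordered:
- Each text is drawn directly at the position of its text box
- No logic inspects or compares the box positions
- Next step: how can box positions be reasoned about and sorted by some rule?
- Sort by coordinates, or do the results already come in coordinate order?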
[
# start ------------------------------------>
#                                           |
#                                           |
#                                           |
#                                           |
#                                           v
# end   <------------------------------------
# top-left      top-right      bottom-right      bottom-left
[71.0, 7.0, 516.0, 7.0, 516.0, 20.0, 71.0, 20.0],
('2.2截至报告期末的普通股股东总数,前十名普通股股东、前十名无限售条件的普通', 0.8929791)
[47.0, 36.0, 145.0, 36.0, 145.0, 50.0, 47.0, 50.0],
('股股东的持股情况', 0.97501135)
[428.0, 64.0, 472.0, 64.0, 472.0, 79.0, 428.0, 79.0],
('单位:股', 0.9774755)
[25.0, 87.0, 95.0, 87.0, 95.0, 101.0, 25.0, 101.0],
('服东总数(户)', 0.92717403)
[506.0, 87.0, 541.0, 87.0, 541.0, 102.0, 506.0, 102.0],
('103', 0.6561246)
[236.0, 111.0, 331.0, 111.0, 331.0, 125.0, 236.0, 125.0],
('前十名股东持股情况', 0.96094346)
[259.0, 140.0, 327.0, 137.0, 328.0, 154.0, 260.0, 157.0],
('持有有限售条', 0.9304914)
[356.0, 135.0, 434.0, 135.0, 434.0, 152.0, 356.0, 152.0],
('质押或冻结情况', 0.9201856)
[59.0, 148.0, 102.0, 148.0, 102.0, 164.0, 59.0, 164.0],
('股东名称', 0.93884647)
[140.0, 147.0, 205.0, 150.0, 205.0, 165.0, 140.0, 162.0],
('期末持股数量', 0.9409886)
[220.0, 148.0, 246.0, 148.0, 246.0, 164.0, 220.0, 164.0],
('比例', 0.9798088)
[265.0, 157.0, 322.0, 157.0, 322.0, 171.0, 265.0, 171.0],
('件股份数量', 0.9649742)
[481.0, 149.0, 525.0, 149.0, 525.0, 165.0, 481.0, 165.0],
('股东性质', 0.97255427)
[335.0, 161.0, 382.0, 161.0, 382.0, 176.0, 335.0, 176.0],
('股份状态', 0.9820739)
[416.0, 160.0, 439.0, 160.0, 439.0, 176.0, 416.0, 176.0],
('数量', 0.9324105)
[25.0, 184.0, 131.0, 184.0, 131.0, 197.0, 25.0, 197.0],
('成都交子金融控股集团', 0.9262026)
......
]
From the above it is fairly clear that:
- boxes and recognition results are ordered top to bottom, then left to right
- they correspond one to one
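The ordering can be checked directly on the data. The sketch below takes the first eight boxes of the listing above and looks for places where the top edge of a box lies above the top edge of the previous one. On this sample it finds one such place, which suggests the order is only approximately top to bottom, so any row logic would need some tolerance.

```python
# The first eight boxes of the listing above, in output order.
boxes = [
    [71.0, 7.0, 516.0, 7.0, 516.0, 20.0, 71.0, 20.0],
    [47.0, 36.0, 145.0, 36.0, 145.0, 50.0, 47.0, 50.0],
    [428.0, 64.0, 472.0, 64.0, 472.0, 79.0, 428.0, 79.0],
    [25.0, 87.0, 95.0, 87.0, 95.0, 101.0, 25.0, 101.0],
    [506.0, 87.0, 541.0, 87.0, 541.0, 102.0, 506.0, 102.0],
    [236.0, 111.0, 331.0, 111.0, 331.0, 125.0, 236.0, 125.0],
    [259.0, 140.0, 327.0, 137.0, 328.0, 154.0, 260.0, 157.0],
    [356.0, 135.0, 434.0, 135.0, 434.0, 152.0, 356.0, 152.0],
]


def top(box):
    """Smallest y coordinate among the four corners."""
    return min(box[1::2])


tops = [top(b) for b in boxes]
# Indices whose top edge lies above the previous box's top edge.
out_of_order = [i for i in range(1, len(tops)) if tops[i] < tops[i - 1]]
print(tops)
print(out_of_order)
```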
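How can this be used?
How do we know where a line break occurs?
- Compare the vertical (y) coordinates of the boxes
  - Different y coordinates: different rows; call append(\t) on the corresponding txt element
  - Same y coordinates: same row
- Caveat: testing y coordinates for exact equality does not work, because some boxes are tilted
Open questions:
- Can the hierarchy (levels) of the text still be told apart?
- How does the recognizer decide the order of its output?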
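def return_box_txt_inorder(boxes, txts, scores=None, drop_score=0.5):
    for idx, (box, txt) in enumerate(zip(boxes, txts)):
        # Skip results whose confidence is below drop_score.
        if scores is not None and scores[idx] < drop_score:
            continue
        # The row-ordering logic has not been written yet.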
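One possible way to finish this idea is sketched here, with the caveat that it is an assumption and not the method these notes settle on. To cope with tilted boxes it compares the center y of each box (the mean of its four corner y values) with a tolerance proportional to the box height, instead of testing for equality. The names `group_rows` and `to_text`, the `ratio` parameter, and the choice of tab within a row and newline between rows are all hypothetical.

```python
def center_y(box):
    """Mean y of the four corners; changes little when a box is tilted."""
    return sum(box[1::2]) / 4.0


def height(box):
    return max(box[1::2]) - min(box[1::2])


def left(box):
    return min(box[0::2])


def group_rows(lines, ratio=0.5):
    """lines: list of (flattened box, text). A line joins the current row
    when its center y is within ratio * height of the row's mean center."""
    rows = []
    for box, txt in sorted(lines, key=lambda item: center_y(item[0])):
        cy = center_y(box)
        if rows and abs(cy - rows[-1]["cy"]) <= ratio * height(box):
            row = rows[-1]
            row["items"].append((box, txt))
            # Update the running mean of the row's center y.
            row["cy"] += (cy - row["cy"]) / len(row["items"])
        else:
            rows.append({"cy": cy, "items": [(box, txt)]})
    # Within each row, order the texts from left to right.
    return [[txt for _, txt in sorted(row["items"], key=lambda it: left(it[0]))]
            for row in rows]


def to_text(rows):
    """Tab between cells of a row, newline between rows."""
    return "\n".join("\t".join(row) for row in rows)


# Four lines taken from the Figure output above.
lines = [
    ([428.0, 64.0, 472.0, 64.0, 472.0, 79.0, 428.0, 79.0], '单位:股'),
    ([25.0, 87.0, 95.0, 87.0, 95.0, 101.0, 25.0, 101.0], '服东总数(户)'),
    ([506.0, 87.0, 541.0, 87.0, 541.0, 102.0, 506.0, 102.0], '103'),
    ([236.0, 111.0, 331.0, 111.0, 331.0, 125.0, 236.0, 125.0], '前十名股东持股情况'),
]

rows = group_rows(lines)
print(to_text(rows))
```

A limitation to keep in mind: table header cells that wrap onto two lines (such as 持有有限售条 / 件股份数量 in the output above) would likely land in different rows under this scheme, so table regions probably still need the structured Table output.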