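Problem
While working on a project, I used the models provided by PaddleOCR to run layout recovery on images or PDFs and save the result as a .docx document. However, the official documentation only provides a Python script for layout recovery on images. For PDFs it only gives a command-line method, with no Python script. So I read the PaddleOCR source code and converted the command-line method into a Python script.
Preparation
For environment setup and documentation, see: ppstructure/docs/quickstart.md · PaddlePaddle/PaddleOCR - Gitee.com
Environment setup steps
1. Install Paddle (choose the GPU or the CPU build as needed; one of them is enough)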
# GPU
python3 -m pip install "paddlepaddle-gpu" -i https://mirror.baidu.com/pypi/simple
# CPU
python3 -m pip install "paddlepaddle" -i https://mirror.baidu.com/pypi/simple
2. Install PaddleOCR
# Install paddleocr
!pip3 install "paddleocr>=2.6.0.3"
3. Set up the layout recovery environment
!git clone https://gitee.com/paddlepaddle/PaddleOCR
!python3 -m pip install -r PaddleOCR/ppstructure/recovery/requirements.txt
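Before running the script, it can help to confirm that the packages installed above are importable in the current kernel. The sketch below is only a convenience check and is not part of PaddleOCR. The helper name find_missing_modules and the module list are assumptions made for illustration (fitz is the import name of PyMuPDF, docx that of python-docx), so adjust the list to your setup.

```python
import importlib.util

# Import names (not pip names) assumed to be needed by the script in this post.
# This list is an assumption for illustration; edit it to match your setup.
REQUIRED_MODULES = ["paddle", "paddleocr", "cv2", "numpy", "PIL", "fitz", "docx"]


def find_missing_modules(modules=REQUIRED_MODULES):
    """Return the top-level module names that cannot be located for import."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

An empty result only means the modules can be located; it does not check their versions.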
Code (Python script for PDF layout recovery)
import os
from copy import deepcopy

import cv2
import numpy as np
from paddle.utils import try_import
from PIL import Image
from paddleocr import PPStructure, save_structure_res
from paddleocr.ppstructure.recovery.recovery_to_doc import (
    convert_info_docx,
    sorted_layout_boxes,
)


def recovery(img_path, output):
    # PyMuPDF is imported under the name "fitz"
    fitz = try_import("fitz")

    # Render every page of the PDF to a BGR image
    imgs = []
    with fitz.open(img_path) as pdf:
        for pg in range(0, pdf.page_count):
            page = pdf[pg]
            mat = fitz.Matrix(2, 2)
            pm = page.get_pixmap(matrix=mat, alpha=False)
            # if width or height > 2000 pixels, don't enlarge the image
            if pm.width > 2000 or pm.height > 2000:
                pm = page.get_pixmap(matrix=fitz.Matrix(1, 1), alpha=False)
            img = Image.frombytes("RGB", [pm.width, pm.height], pm.samples)
            img = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
            imgs.append(img)

    # Save each page as <output>/<pdf name>/<pdf name>_<index>.jpg
    img_name = os.path.basename(img_path).split(".")[0]
    os.makedirs(os.path.join(output, img_name), exist_ok=True)
    img_paths = []
    for index, pdf_img in enumerate(imgs):
        pdf_img_path = os.path.join(
            output, img_name, img_name + "_" + str(index) + ".jpg"
        )
        cv2.imwrite(pdf_img_path, pdf_img)
        img_paths.append([pdf_img_path, pdf_img])

    # Run layout analysis page by page and collect the sorted results
    all_res = []
    engine = PPStructure(recovery=True)
    for index, (new_img_path, page_img) in enumerate(img_paths):
        print("processing {}/{} page:".format(index + 1, len(img_paths)))
        result = engine(page_img, img_idx=index)
        save_structure_res(result, output, img_name, index)
        h, w, _ = page_img.shape
        result_cp = deepcopy(result)
        result_sorted = sorted_layout_boxes(result_cp, w)
        all_res += result_sorted

    # Write the .docx only if something was recognized
    # (this also avoids using `img` when the PDF has no pages)
    if all_res:
        convert_info_docx(img, all_res, output, img_name)


recovery("work/PaddleOCR/ppstructure/docs/recovery/UnrealText.pdf", "./output/")
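The page-rendering rule used in the script (render at 2x, fall back to 1x when either side of the rendered page exceeds 2000 pixels) can be sketched as a small pure function. choose_zoom and its parameters are hypothetical names for illustration, not PaddleOCR or PyMuPDF API, and the sketch ignores the rounding PyMuPDF applies to pixmap sizes.

```python
def choose_zoom(page_width, page_height, zoom=2, max_side=2000):
    """Return the zoom factor the script's rule would end up using.

    page_width and page_height are the unscaled page dimensions,
    i.e. roughly what a 1x render would produce in pixels.
    """
    # Same condition as in the script: compare the enlarged size to the limit
    if page_width * zoom > max_side or page_height * zoom > max_side:
        return 1
    return zoom


# An A4 page is about 595 x 842 points, so 2x stays under 2000 pixels
print(choose_zoom(595, 842))   # 2
# A wide page of 1200 x 800 would become 2400 pixels wide at 2x
print(choose_zoom(1200, 800))  # 1
```

Keeping the limit as a parameter makes it easy to experiment with a different threshold than the 2000 pixels used here.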
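Notes
- The output argument must end with "/": "./output" does not work, it has to be "./output/", otherwise an error is raised
- The Python script for image layout recovery and the command-line usage for image and PDF layout recovery are both covered in ppstructure/recovery/README_ch.md · PaddlePaddle/PaddleOCR - Gitee.com
- The code above is adapted from the command-line implementation of PDF layout recovery, which is in PaddleOCR\paddleocr.py of the cloned repository
- I ran the code in the AI Studio workspace, with python=3.10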
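Since the trailing "/" on the output argument is easy to forget, one option is to normalize the path before calling recovery. This is a minimal sketch, assuming a hypothetical helper named with_trailing_sep. It relies only on os.path.join, which appends a separator when the last component is an empty string.

```python
import os


def with_trailing_sep(path):
    """Return `path` with a path separator at the end, adding one if missing."""
    # Joining with an empty last component appends a separator if needed
    return os.path.join(path, "")


# Hypothetical usage with the recovery function from this post:
# recovery("work/PaddleOCR/ppstructure/docs/recovery/UnrealText.pdf",
#          with_trailing_sep("./output"))
```

A path that already ends with a separator is returned unchanged, so the helper is safe to apply unconditionally.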