PaddleOCR layout recovery Python script: layout recovery for PDFs

Latest recommended article published on 2024-09-28 09:00:00

zsh669

Tags: paddlepaddle ocr Baidu python pdf

Article link: https://blog.csdn.net/weixin_62320093/article/details/140215783

Problem

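While working on a project, I used the models provided by PaddleOCR to run layout recovery on images and PDFs and save the result as a .docx document. The official documentation, however, only provides a Python script for layout recovery on images. For PDFs it only documents the command-line usage. So I read the PaddleOCR source code and converted the command-line method into a Python script.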

Preparation

For environment setup and documentation, see: ppstructure/docs/quickstart.md · PaddlePaddle/PaddleOCR - Gitee.com

Environment setup steps

1. Install Paddle (choose the GPU or the CPU build as needed; only one of them is required)

# GPU
python3 -m pip install "paddlepaddle-gpu" -i https://mirror.baidu.com/pypi/simple

# CPU
python3 -m pip install "paddlepaddle" -i https://mirror.baidu.com/pypi/simple

2. Install PaddleOCR

# Install paddleocr (the leading "!" is notebook syntax; omit it in a regular shell)
!pip3 install "paddleocr>=2.6.0.3"

3. Set up the layout recovery environment

!git clone https://gitee.com/paddlepaddle/PaddleOCR
!python3 -m pip install -r PaddleOCR/ppstructure/recovery/requirements.txt
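
Before running the script, it can help to confirm that the packages from the three steps above are importable. The following is a minimal sketch, not part of the original setup: the helper name find_missing is hypothetical, and the module list is an assumption (it assumes the recovery requirements provide PyMuPDF as fitz and python-docx as docx).

```python
import importlib.util

# Import names assumed from the setup steps; adjust to your environment.
REQUIRED_MODULES = ["paddle", "paddleocr", "cv2", "fitz", "docx"]


def find_missing(modules):
    """Return the module names that the import system cannot locate."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

If anything is reported as missing, rerunning the corresponding install step is the first thing to try.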

Code (Python script for PDF layout recovery)

import os
from copy import deepcopy

import cv2
import numpy as np
from paddle.utils import try_import
from paddleocr import PPStructure, save_structure_res


def recovery(img_path, output):
    """Run layout recovery on a PDF and save the result as a .docx under output."""
    fitz = try_import("fitz")  # PyMuPDF, used to render the PDF pages
    from PIL import Image

    # Render every page of the PDF to a BGR image.
    imgs = []
    with fitz.open(img_path) as pdf:
        for pg in range(0, pdf.page_count):
            page = pdf[pg]
            mat = fitz.Matrix(2, 2)  # render at 2x zoom
            pm = page.get_pixmap(matrix=mat, alpha=False)

            # if width or height > 2000 pixels, don't enlarge the image
            if pm.width > 2000 or pm.height > 2000:
                pm = page.get_pixmap(matrix=fitz.Matrix(1, 1), alpha=False)

            img = Image.frombytes("RGB", [pm.width, pm.height], pm.samples)
            img = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
            imgs.append(img)

    # Save each page as <output>/<pdf name>/<pdf name>_<index>.jpg.
    img_name = os.path.basename(img_path).split(".")[0]
    os.makedirs(os.path.join(output, img_name), exist_ok=True)
    img_paths = []
    for index, pdf_img in enumerate(imgs):
        pdf_img_path = os.path.join(
            output, img_name, img_name + "_" + str(index) + ".jpg"
        )
        cv2.imwrite(pdf_img_path, pdf_img)
        img_paths.append([pdf_img_path, pdf_img])

    from paddleocr.ppstructure.recovery.recovery_to_doc import (
        convert_info_docx,
        sorted_layout_boxes,
    )

    # Run layout analysis page by page and collect the sorted results.
    all_res = []
    engine = PPStructure(recovery=True)
    for index, (new_img_path, page_img) in enumerate(img_paths):
        print("processing {}/{} page:".format(index + 1, len(img_paths)))
        result = engine(page_img, img_idx=index)
        save_structure_res(result, output, img_name, index)

        h, w, _ = page_img.shape
        result_cp = deepcopy(result)
        result_sorted = sorted_layout_boxes(result_cp, w)
        all_res += result_sorted

    # Write all pages into one .docx; img is the last rendered page.
    # Skip this step when nothing was recognized (for example, an empty PDF).
    if all_res:
        convert_info_docx(img, all_res, output, img_name)


recovery("work/PaddleOCR/ppstructure/docs/recovery/UnrealText.pdf","./output/")
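
The script renders each page at 2x zoom and falls back to 1x when the enlarged image would exceed 2000 pixels on either side. The sketch below isolates that rule as a pure function so it can be checked without a PDF. The function name choose_zoom and its parameters are hypothetical, and it assumes the page size is given in PDF points, with 1x zoom producing roughly one pixel per point. PyMuPDF rounds pixmap dimensions to integers, so pages right at the limit may behave slightly differently.

```python
def choose_zoom(page_width, page_height, zoom=2, max_side=2000):
    """Pick the render zoom for a page whose size is given in PDF points.

    Mirrors the rule in recovery(): use `zoom`, but fall back to 1 when the
    enlarged image would be wider or taller than `max_side` pixels.
    """
    if page_width * zoom > max_side or page_height * zoom > max_side:
        return 1
    return zoom


print(choose_zoom(595, 842))   # A4 portrait: 1190 x 1684 at 2x, so zoom stays 2
print(choose_zoom(842, 1191))  # A3 portrait: 1684 x 2382 at 2x, so zoom drops to 1
```

In practice this means ordinary A4 pages are enlarged before recognition, while larger formats are processed at their native size.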

Notes

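The output argument must end with "/": it must be "./output/", not "./output", otherwise an error is raised.
The Python script for image layout recovery, and the command-line usage for both image and PDF layout recovery, can all be found in ppstructure/recovery/README_ch.md · PaddlePaddle/PaddleOCR - Gitee.com.
The code above is based on the command-line implementation of PDF layout recovery, which is in PaddleOCR\paddleocr.py in the cloned repository.
I ran the code in the AI Studio workspace with python=3.10.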
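
To avoid tripping over the trailing "/" requirement on output noted above, the path can be normalized before it is passed to recovery(). This is a small sketch using only the standard library; the helper name ensure_trailing_sep is hypothetical and not part of PaddleOCR.

```python
import os


def ensure_trailing_sep(path):
    """Return `path` with a trailing path separator added if it is missing."""
    # Joining with an empty string appends the separator only when needed.
    return os.path.join(path, "")


print(ensure_trailing_sep("./output"))
print(ensure_trailing_sep("./output" + os.sep))
```

With this helper, a call such as recovery(pdf_path, ensure_trailing_sep("./output")) passes the directory in the form the note asks for.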