MinerU PDF Document Extraction Demo (PDF Parsing)

shizidushu

Last modified: 2024-07-18 19:26:15

Tags: pdf, LLM, artificial intelligence

First published: 2024-07-17 19:06:24

Article link: https://blog.csdn.net/shizidushu/article/details/140503124

Notes:

First published: 2024-07-17
MinerU official repository: https://github.com/opendatalab/MinerU
MinerU official Chinese README: https://github.com/opendatalab/MinerU/blob/master/README_zh-CN.md

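Official introduction

Magic-PDF is a tool that converts PDF files to markdown. It can convert local documents as well as files stored in S3-compatible object storage.

Key features

Supports input from multiple front-end models
Removes headers, footers, footnotes, page numbers, and similar elements
Produces output that follows human reading order
Preserves the structure and formatting of the original document, including headings, paragraphs, and lists
Extracts images and tables and displays them in markdown
Converts formulas to LaTeX
Automatically detects and converts garbled PDFs
Supports CPU and GPU environments
Supports Windows, Linux, and macOS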

Create a Conda environment and install the packages

conda create -n py310torch python=3.10
conda activate py310torch
pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu121

# MinerU
## https://github.com/opendatalab/MinerU/blob/master/README_zh-CN.md
pip install "magic-pdf[full-cpu]"
pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/

Download the weights and configure

# Install dependencies
pip install -U "huggingface_hub[cli]"
# Set the environment variable
## Linux
## export HF_ENDPOINT=https://hf-mirror.com
## Windows
$env:HF_ENDPOINT = "https://hf-mirror.com"
huggingface-cli download wanderkid/PDF-Extract-Kit
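
The models-dir value in magic-pdf.json has to point at the models folder inside the downloaded snapshot. Below is a minimal sketch for locating that folder, assuming the usual Hugging Face cache layout (hub/models--<org>--<name>/snapshots/<revision>/models) and that you know where your hub cache lives. find_models_dir is a hypothetical helper written for this post; it is not part of huggingface_hub or magic-pdf.

```python
from pathlib import Path
from typing import Optional


def find_models_dir(hub_cache: Path,
                    repo_id: str = "wanderkid/PDF-Extract-Kit") -> Optional[Path]:
    """Return the most recently modified snapshots/<revision>/models folder.

    hub_cache is the Hugging Face "hub" cache directory (an assumption:
    pass whatever location your setup actually uses).
    Returns None when no matching folder exists.
    """
    # The cache stores a model repo "org/name" under "models--org--name".
    repo_dir = hub_cache / ("models--" + repo_id.replace("/", "--"))
    snapshots = repo_dir / "snapshots"
    if not snapshots.is_dir():
        return None

    # Each snapshot is a folder named after a revision hash.
    candidates = [
        snap / "models"
        for snap in snapshots.iterdir()
        if (snap / "models").is_dir()
    ]
    if not candidates:
        return None

    # If several revisions were downloaded, prefer the newest one on disk.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

For example, find_models_dir(Path("D:/16-LLM-Cache/huggingface/hub")) would look for the snapshot folder under that cache directory.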

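Go to your home directory (cd ~) and create a file named magic-pdf.json. Copy the contents of magic-pdf.template.json into it and change the path to the model weights. After the change, the file looks like this: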

{
    "bucket_info":{
        "bucket-name-1":["ak", "sk", "endpoint"],
        "bucket-name-2":["ak", "sk", "endpoint"]
    },
    "temp-output-dir":"tmp",
    "models-dir":"D:/16-LLM-Cache/huggingface/hub/models--wanderkid--PDF-Extract-Kit/snapshots/bbfd601d3dab736bf366e2119ec0bbe0f4e6f012/models",
    "device-mode":"cuda"
}

Here device-mode is set to cuda. If you are running on CPU, set it to cpu.
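
If you would rather generate the file than edit it by hand, the sketch below writes the same keys shown above. It is only an illustration: pick_device_mode and write_magic_pdf_config are hypothetical helper names, the bucket_info entries are the template placeholders, and the choice of "cuda" relies on PyTorch's torch.cuda.is_available().

```python
import json
from pathlib import Path


def pick_device_mode() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no way to use the GPU, so fall back to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def write_magic_pdf_config(models_dir: str, device_mode: str, home: Path) -> Path:
    """Write magic-pdf.json into `home` and return the path of the file."""
    config = {
        # Placeholder credentials copied from the template; only needed for S3.
        "bucket_info": {
            "bucket-name-1": ["ak", "sk", "endpoint"],
            "bucket-name-2": ["ak", "sk", "endpoint"],
        },
        "temp-output-dir": "tmp",
        # Forward slashes keep Windows paths free of JSON backslash escapes.
        "models-dir": models_dir.replace("\\", "/"),
        "device-mode": device_mode,
    }
    path = home / "magic-pdf.json"
    path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return path
```

A call such as write_magic_pdf_config(models_dir, pick_device_mode(), Path.home()) places the file in the home directory, which is where this post keeps it.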

Usage

Using the command line

First run magic-pdf pdf-command --help to see which options the command accepts:

Usage: magic-pdf pdf-command [OPTIONS]

Options:
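  --pdf PATH               Path to the PDF file  [required]
  --model PATH             Path to the model
  --method [ocr|txt|auto]  Parsing method. txt: parser for text-based PDFs,
                           ocr: optical character recognition, auto: the
                           program picks the method automatically
  --inside_model BOOLEAN   Use the built-in model for testing
  --model_mode TEXT        Built-in model choice. lite: fast parsing with
                           lower accuracy, full: high-accuracy parsing, slower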

Then run:

magic-pdf pdf-command --pdf "assets/***手册.pdf" --inside_model true --model_mode full
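
To run the same command from a script and measure how long it takes, something like the following sketch could be used. It only uses the options listed in the help output above; build_pdf_command and run_timed are hypothetical helper names, and --method is left out unless you pass one, since the command in this post does not set it.

```python
import subprocess
import time
from typing import List, Optional


def build_pdf_command(pdf_path: str, method: Optional[str] = None,
                      model_mode: str = "full") -> List[str]:
    """Build the argument list for `magic-pdf pdf-command` with the built-in model."""
    if method is not None and method not in ("ocr", "txt", "auto"):
        raise ValueError(f"unsupported method: {method}")
    if model_mode not in ("lite", "full"):
        raise ValueError(f"unsupported model_mode: {model_mode}")

    cmd = ["magic-pdf", "pdf-command", "--pdf", pdf_path,
           "--inside_model", "true", "--model_mode", model_mode]
    if method is not None:
        # Only add --method when the caller asked for a specific parser.
        cmd += ["--method", method]
    return cmd


def run_timed(cmd: List[str]) -> float:
    """Run the command and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    # check=True raises CalledProcessError if magic-pdf exits with an error.
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces or non-ASCII characters.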

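I tested with a 2.45 MB PDF of a hardware manual (6 pages, mostly Chinese, multi-column layout, with images) and found the following:

Speed: the command above took 47 seconds.
Strengths: the extracted text is better than what marker produces.
- marker's output had text in the wrong order, especially around images in the PDF
  - In this PDF, most of the images contain text of their own, which can be selected and copied
  - marker recognized the text inside these images, but mixed it into the body text and got the order wrong
Weaknesses:
- Tables in the document were not recognized
  - Update: someone has already filed an issue, and a reply says table recognition should be released in about a month
- Layout recognition treated a second-level heading in the second column as the main title and placed it on the first line (probably because it sits higher on the page than the main title in the first column)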

Calling through the API (Python code)

Reference code: https://github.com/opendatalab/MinerU/blob/master/demo/demo.py

from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
import magic_pdf.model as model_config

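# Use the model built into magic-pdf
model_config.__use_inside_model__ = True

# Read the PDF as bytes; the forward slash avoids the invalid "\*" escape sequence
with open("assets/***手册.pdf", "rb") as pdf_file:
    pdf_bytes = pdf_file.read()

# Directory where images extracted from the PDF are written
local_image_dir = "mineru_images"

image_writer = DiskReaderWriter(local_image_dir)

model_json = []  # passing an empty list as model_json makes the built-in model do the parsing
jso_useful_key = {"_pdf_type": "", "model_list": model_json}
pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
pipe.pipe_classify()  # classify the PDF (text-based or OCR)
pipe.pipe_parse()  # run the parsing step
md_content = pipe.pipe_mk_markdown(local_image_dir, drop_mode="none")  # build the markdown output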
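
The snippet above leaves the result in md_content without writing it anywhere. Below is a minimal sketch for saving it, assuming md_content is either a single string or a list of strings (the list case is a precaution, not something confirmed by this post). save_markdown is a hypothetical helper name.

```python
from pathlib import Path
from typing import List, Union


def save_markdown(md_content: Union[str, List[str]], out_path: Path) -> Path:
    """Write the markdown to out_path as UTF-8 and return the path."""
    # Join list output into one document; plain strings are written unchanged.
    if isinstance(md_content, str):
        text = md_content
    else:
        text = "\n".join(md_content)

    # Create the target folder if it does not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

Saving the file next to the image directory, for example save_markdown(md_content, Path("output.md")), should keep the relative image links in the markdown pointing at mineru_images.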