Extracting PDF tables with fitz

飞锡2024

Last modified: 2024-03-19 10:23:49

First published: 2024-03-07 17:13:41

Article link: https://blog.csdn.net/weixin_38235865/article/details/135369618

Reference:
https://github.com/pymupdf/PyMuPDF

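Fitz is the import name of PyMuPDF, a Python library for extracting text and table data from PDF files. It makes it easy to pull tables out of a PDF and convert them into usable data structures such as a pandas DataFrame. The library is built on the MuPDF toolkit, which parses PDF files and exposes their text and table information.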

Comparing fitz with pdfplumber and Camelot

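Fitz (PyMuPDF): Fitz is a powerful library with a broad feature set for working with PDF files. Beyond extracting text and tables, it can render pages, modify documents, and read metadata. One of its main strengths is performance: text extraction is fast. It also handles a wide range of PDF files, including encrypted ones when the password is supplied. However, Fitz focuses mainly on text extraction, so its table extraction may be less flexible or less accurate than dedicated table-extraction tools [1].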
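To make the features listed above concrete, here is a minimal sketch that reads metadata, extracts text, renders a page, and handles an encrypted file. The helper name `summarize_pdf` and its return shape are assumptions made for illustration; only the PyMuPDF calls themselves come from the library.

```python
def summarize_pdf(path, password=None):
    """Return basic facts about a PDF: page count, metadata, first-page text."""
    # Imported lazily so this helper can be defined without PyMuPDF installed.
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        # Encrypted files report needs_pass; authenticate() returns 0 on failure.
        if doc.needs_pass and not doc.authenticate(password or ""):
            raise ValueError("wrong or missing password")
        first_page = doc[0]
        pixmap = first_page.get_pixmap(dpi=72)  # render the page to a raster image
        return {
            "pages": doc.page_count,
            "metadata": doc.metadata,        # dict with keys such as title, author
            "text": first_page.get_text(),   # plain text of the first page
            "image_size": (pixmap.width, pixmap.height),
        }
```

Calling `summarize_pdf("some.pdf")` on an unencrypted file should return a dictionary with these four keys; the file name here is only a placeholder.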

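pdfplumber: pdfplumber is a Python library designed specifically for extracting text and table data from PDF documents. It offers a relatively simple way to access the information in a PDF file. Its standout feature is table extraction: it can detect table structure and return the data in structured form. pdfplumber also lets users visualize the page layout and tune the extraction strategy accordingly, which helps with complex documents. It may still struggle with PDFs whose formatting is very complex or irregular.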
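For comparison, table extraction with pdfplumber might look like the sketch below. The function names `clean_rows` and `extract_tables_with_pdfplumber` are hypothetical helpers, not part of pdfplumber; the library calls used are `pdfplumber.open`, `pdf.pages`, `page.page_number`, and `page.extract_tables()`.

```python
def clean_rows(rows):
    """Replace None cells with '' and strip surrounding whitespace."""
    return [[(cell or "").strip() for cell in row] for row in rows]


def extract_tables_with_pdfplumber(path):
    """Return a list of (page_number, rows) for every table pdfplumber finds."""
    # Imported lazily so the helpers can be defined without pdfplumber installed.
    import pdfplumber

    results = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # extract_tables() returns one list of rows per detected table;
            # each row is a list of cell strings, with None for empty cells.
            for rows in page.extract_tables():
                results.append((page.page_number, clean_rows(rows)))
    return results
```

Each `rows` value can be passed to `pandas.DataFrame` directly if a DataFrame is needed.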

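Camelot: Camelot is another Python library focused on extracting table data from PDF files. Like pdfplumber, it provides strong table detection and extraction. One of its main advantages is that users can choose between table-parsing methods (called flavors, such as Stream or Lattice) to better handle different table layouts. Camelot also supports export to several formats, including CSV, Excel, and HTML. However, Camelot concentrates on table extraction and does not offer broad text-extraction features.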
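The flavor choice described above can be sketched as follows. Lattice is meant for tables whose cells are separated by drawn lines, Stream for tables separated only by whitespace. `choose_flavor` and `extract_tables_with_camelot` are hypothetical helper names; `camelot.read_pdf` with its `pages` and `flavor` arguments and the `df` attribute of each table come from Camelot.

```python
def choose_flavor(has_ruling_lines):
    """Pick a Camelot flavor: 'lattice' for ruled tables, 'stream' otherwise."""
    return "lattice" if has_ruling_lines else "stream"


def extract_tables_with_camelot(path, pages="1", has_ruling_lines=True):
    """Return one pandas DataFrame per table Camelot detects."""
    # Imported lazily so the helpers can be defined without Camelot installed.
    import camelot

    tables = camelot.read_pdf(
        path,
        pages=pages,  # e.g. "1", "1,3" or "1-end"
        flavor=choose_flavor(has_ruling_lines),
    )
    return [table.df for table in tables]
```

Because each result is a plain DataFrame, it can be written out with pandas methods such as `to_csv`.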

Table extraction code

python
 
import time

import fitz  # PyMuPDF; find_tables() requires version 1.23.0 or later
import pandas as pd
from tqdm import tqdm

# Open the PDF file
pdf_file = r"xxxpdf"
pdf_document = fitz.open(pdf_file)
total_df = pd.DataFrame([], columns=['', '', ''])  # set the column headers here
start_time = time.time()
# Iterate over every page and extract its tables
for page in tqdm(pdf_document):
    tabs = page.find_tables()  # for plain UTF-8 text instead: text = page.get_text()
    # tab = tabs[0]
    # tab.bbox       (x0, y0, x1, y1) coordinates of the table rectangle
    # tab.col_count  number of columns
    # tab.row_count  number of rows
    page_number = page.number  # 0-based page index
    for tab in tabs:
        df = tab.to_pandas()  # the detected header row becomes the column names
        df['页数'] = page_number  # '页数' = page number
        # iloc[1:] additionally skips the first data row of each table;
        # remove the slice if that row holds real data.
        total_df = pd.concat([total_df, df.iloc[1:, :]], axis=0)
    # tab.extract()    all table text as a list of rows, each a list of cell strings
    # tab.to_pandas()  the table as a pandas DataFrame
    # tab.rows         the table rows, each with the cells it contains
    # col_name = tab.header.names
# Close the PDF file
pdf_document.close()
print(f"elapsed: {time.time() - start_time:.1f}s")
total_df.to_excel('xxx.xlsx')  # writing .xlsx needs an engine such as openpyxl
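
When a table continues across pages and every page repeats the header row, slicing off a fixed number of rows can drop real data. One possible alternative, offered here as a sketch rather than as the method used above, is to work on the list-of-lists output of `tab.extract()` and skip only rows that equal the header. The names `merge_tables` and `iter_page_tables` are hypothetical, and the sketch assumes all tables share the header of the first one.

```python
def merge_tables(page_tables):
    """Combine tables sharing one header row into (header, records).

    page_tables: iterable of (page_number, rows), where rows has the
    list-of-lists shape returned by Table.extract().
    """
    header = None
    records = []
    for page_number, rows in page_tables:
        if not rows:
            continue
        if header is None:
            header = list(rows[0])  # first row of the first table is the header
        for row in rows:
            if list(row) == header:
                continue  # skip the header wherever it is repeated
            records.append(list(row) + [page_number])
    if header is None:
        return [], []
    return header + ["page"], records


def iter_page_tables(pdf_path):
    """Yield (page_number, rows) for each table PyMuPDF detects."""
    # Imported lazily so merge_tables can be used without PyMuPDF installed.
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        for page in doc:
            for tab in page.find_tables().tables:
                yield page.number + 1, tab.extract()  # 1-based page number
```

The result can be turned into a DataFrame with `pd.DataFrame(records, columns=header)`, where `header, records = merge_tables(iter_page_tables(pdf_file))`.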