Extracting Table Data from PDFs with Python

achi010

Article link: https://blog.csdn.net/achi010/article/details/117828779

Contents

1. Environment
2. pdfplumber
- Pros and cons
3. camelot
- Pros and cons
4. Summary

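1. Environment

Operating system: Windows

Language: Python 3.8.5

IDE: PyCharm 2019.3.5 (Professional Edition)

Library 1: pdfplumber 0.5.28

Library 2: camelot 0.8.2

Note: both libraries were still reasonably active at the time of writing. Many other libraries had seen no activity for more than a year, so they were not considered.
PS: On Windows, installing dependencies into a PyCharm virtual environment may fail with an error. One workaround is to download the package as a .whl file and install it offline.

PS: Download address for the test PDF document.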

2. pdfplumber

Code

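import pandas as pd
import pdfplumber

# Open the PDF and extract the first table found on the first page
with pdfplumber.open("D:\\Cache\\foo.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()  # list of rows, or None if no table is found

# Write the rows to Excel without a header row or index column
df = pd.DataFrame(table)
df.to_excel("D:\\Cache\\foo.xlsx", header=False, index=False)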

Test result

[Image: screenshot of the test result]
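The script above reads only the first table on the first page. A minimal sketch of extending it to every table on every page might look like the following. It relies on pdfplumber's `Page.extract_tables()`, which returns a list of tables per page, and replaces empty cells (returned as `None`) with empty strings. The helper names `collect_tables` and `pdf_tables_to_excel`, and the sheet naming scheme, are assumptions made for illustration, not part of either library.

```python
def collect_tables(pages):
    """Gather every table on every page, replacing empty (None) cells with ''.

    `pages` is any iterable of objects that offer pdfplumber's
    Page.extract_tables() method, for example `pdf.pages`.
    """
    cleaned = []
    for page in pages:
        for table in page.extract_tables():
            cleaned.append(
                [[cell if cell is not None else "" for cell in row] for row in table]
            )
    return cleaned


def pdf_tables_to_excel(pdf_path, xlsx_path):
    """Hypothetical helper: write each extracted table to its own sheet."""
    # Imported lazily so collect_tables() can be used without these packages.
    import pandas as pd
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        tables = collect_tables(pdf.pages)
    if not tables:
        return 0  # nothing to write; an empty workbook cannot be saved

    with pd.ExcelWriter(xlsx_path) as writer:
        for number, table in enumerate(tables, start=1):
            pd.DataFrame(table).to_excel(
                writer, sheet_name=f"table_{number}", header=False, index=False
            )
    return len(tables)
```

Calling `pdf_tables_to_excel("D:\\Cache\\foo.pdf", "D:\\Cache\\foo.xlsx")` would then produce one sheet per table, assuming pandas has an Excel writer such as openpyxl available.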

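Pros and cons

Pros:
1. Because the library is not designed solely for table extraction, you can also search the page text and extract only the target data by keyword.
2. High accuracy (so far I have not found any incorrectly extracted data).
Cons:
1. When the table title is complex, it is not extracted completely (the title is usually not that important, so this is tolerable in most cases).
2. Exporting the extracted data to Excel or CSV requires pandas, which raises the learning cost a little, though it is still acceptable.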
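The first advantage, extracting only the target data by keyword, can be sketched roughly as follows. The idea is to check each page's text with `Page.extract_text()` and pull tables only from pages that mention the keyword. The function names and the simple substring match are assumptions for illustration; a real document may need a stricter match or table-level filtering.

```python
def tables_on_matching_pages(pages, keyword):
    """Return (page_index, table) pairs for pages whose text contains keyword.

    `pages` is any iterable of objects that offer pdfplumber's
    Page.extract_text() and Page.extract_tables() methods.
    """
    matches = []
    for index, page in enumerate(pages):
        # extract_text() may return None for a page without text
        text = page.extract_text() or ""
        if keyword in text:
            for table in page.extract_tables():
                matches.append((index, table))
    return matches


def extract_tables_by_keyword(pdf_path, keyword):
    """Hypothetical helper: open a PDF and collect tables from matching pages."""
    import pdfplumber  # imported lazily; only needed when reading a real file

    with pdfplumber.open(pdf_path) as pdf:
        return tables_on_matching_pages(pdf.pages, keyword)
```

The page index in each pair is zero-based, matching `pdf.pages`, so it can be used to go back to the source page if the result needs checking.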

3. camelot

Code

import camelot

# Read the PDF file and detect tables on every page
tables = camelot.read_pdf('D:\\Cache\\foo.pdf', pages='1-end')

# Export all tables; f can be csv, json, excel, html, or sqlite
tables.export('D:\\Cache\\foo.xlsx', f='excel')
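Before exporting, it can help to look at what camelot actually found. Each table in the returned list exposes a pandas DataFrame as `.df` and a `parsing_report` dictionary that includes an `accuracy` score. The sketch below keeps only tables above a threshold; the helper names and the 90.0 cutoff are arbitrary assumptions for illustration, not recommended values.

```python
def accurate_tables(tables, min_accuracy=90.0):
    """Keep tables whose parsing_report accuracy is at least min_accuracy.

    `tables` is any iterable of objects with a camelot-style
    `parsing_report` dict, such as the result of camelot.read_pdf().
    """
    return [t for t in tables if t.parsing_report["accuracy"] >= min_accuracy]


def read_accurate_tables(pdf_path, pages="1-end", min_accuracy=90.0):
    """Hypothetical helper: read a PDF and return DataFrames of the kept tables."""
    import camelot  # imported lazily; only needed when reading a real file

    tables = camelot.read_pdf(pdf_path, pages=pages)
    return [t.df for t in accurate_tables(tables, min_accuracy)]
```

For example, `read_accurate_tables('D:\\Cache\\foo.pdf')` would return a list of DataFrames that can be inspected or written out with pandas instead of `tables.export`.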