How to read and parse PDF files in Python with pdfplumber

ToMiky明明

Last modified 2023-07-14 11:24:49

Tags: python, pdf, programming languages

First published 2023-07-13 17:14:32

Article link: https://blog.csdn.net/weixin_43404930/article/details/131581987

1. First, install the pdfplumber library:

pip install pdfplumber

2. If the installation fails, upgrade pip first; an outdated pip can prevent pdfplumber from installing:

python -m pip install --upgrade pip

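# coding: utf-8

import pdfplumber

with pdfplumber.open('./test.pdf') as pdf:
    # Iterate over every page
    for page in pdf.pages:
        # Extract all text on the current page, including text inside tables;
        # an empty page yields None or '' depending on the pdfplumber version
        print(page.extract_text())
        # Extract all tables on the current page (call it once and reuse the result)
        tables = page.extract_tables()
        # No tables: []. Otherwise: [[[row1], [row2], ...], [[row1], [row2], ...], ...]
        print(tables)
        # Iterate over each extracted table
        for table in tables:
            print(table)  # [[row1], [row2], ...]
            # Iterate over each row
            for row in table:
                print(row)  # ['xxx', 'xxx', ...]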

3. If you do not use the with statement, open the PDF explicitly first (and call pdf.close() when you are done with it):

pdf = pdfplumber.open(pdfPath)
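
Without a with block, the file stays open until it is closed explicitly. Below is a minimal sketch of one way to guarantee the close, using try/finally. The helper name `count_pages_and_close` is hypothetical; it accepts any already-opened object that exposes `.pages` and `.close()`, as the object returned by pdfplumber.open does.

```python
def count_pages_and_close(pdf):
    """Return the number of pages, closing the PDF even if an error occurs.

    `pdf` is assumed to be an object returned by pdfplumber.open(...),
    i.e. something with a `.pages` list and a `.close()` method.
    """
    try:
        return len(pdf.pages)
    finally:
        # Runs on success and on failure, mirroring what `with` does
        pdf.close()


# Intended usage (requires pdfplumber and a real file):
#   import pdfplumber
#   pdf = pdfplumber.open(pdfPath)
#   print(count_pages_and_close(pdf))
```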

4. Get the largest table on the first page of the PDF. The result is a nested list (or None if no table is found), so you can reach any cell directly by index (e.g. tableValue[1][2]):

tableValue = pdf.pages[0].extract_table()
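
Because extract_table() returns a plain list of rows, ordinary list operations apply. The sketch below uses a made-up sample table (the values are illustrative, not taken from a real PDF) and a hypothetical helper, `table_to_records`, that treats the first row as the header. It also guards against `None`, which extract_table() returns when no table is found, and against `None` cells, which can appear where a cell is empty.

```python
def table_to_records(table):
    """Convert [[header...], [row...], ...] into a list of dicts.

    Returns [] when `table` is None or empty (no table found on the page).
    """
    if not table:
        return []
    header, *rows = table
    records = []
    for row in rows:
        # Replace None cells (empty cells) with empty strings
        cleaned = ["" if cell is None else cell for cell in row]
        records.append(dict(zip(header, cleaned)))
    return records


# Illustrative data in the shape extract_table() returns
tableValue = [
    ["name", "qty", "price"],
    ["apple", "3", "1.20"],
    ["pear", None, "0.80"],
]

print(tableValue[1][2])  # cell at row index 1, column index 2
print(table_to_records(tableValue))
```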

5. Here pdf.pages is the list of pages in the PDF: the first page is pdf.pages[0], the second is pdf.pages[1], the last is pdf.pages[-1], and so on.

6. Get all tables on the first page of the PDF, returned as a nested list (a list of tables, each of which is a list of rows):

tableValue = pdf.pages[0].extract_tables()
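
The result of extract_tables() is one level deeper than that of extract_table(): a list of tables, each a list of rows. As an illustration, the sketch below serializes such a structure to CSV text with the standard csv module. The sample data and the helper name `tables_to_csv` are made up for this example, and separating tables with a blank line is just one possible convention.

```python
import csv
import io


def tables_to_csv(tables):
    """Serialize a list of tables (each a list of rows) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for index, table in enumerate(tables):
        if index > 0:
            buffer.write("\n")  # blank line between tables
        for row in table:
            # None marks an empty cell; write it as an empty field
            writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Illustrative data in the shape extract_tables() returns
tableValue = [
    [["id", "name"], ["1", "foo"]],
    [["k", "v"], ["a", None]],
]
print(tables_to_csv(tableValue))
```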

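7. Get all text on the first page of the PDF, including the text inside tables, as a single string (you can then use regular expressions to pull out specific values):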

tableValue = pdf.pages[0].extract_text()
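
As a sketch of the regular-expression step, the example below runs on a made-up string standing in for real extract_text() output. The field names and patterns (an invoice number and a total) are assumptions for illustration only and would need adapting to the actual document.

```python
import re

# Stand-in for the string returned by pdf.pages[0].extract_text()
text = "Invoice No: INV-2023-0714\nDate: 2023-07-13\nTotal: 1,234.50 USD"


def find_field(pattern, text):
    """Return the first captured group for `pattern`, or None if absent."""
    # `text or ""` tolerates None, which some versions return for empty pages
    match = re.search(pattern, text or "")
    return match.group(1) if match else None


invoice_no = find_field(r"Invoice No:\s*(\S+)", text)
total = find_field(r"Total:\s*([\d,]+\.\d{2})", text)
print(invoice_no, total)
```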

8. The steps above cover basic usage; for more detail, see the following link:

pdfplumber/README-CN.md at stable · hbh112233abc/pdfplumber · GitHub
