[Python] Extracting text and tables from PDFs

Latest recommended article published on 2025-03-18 14:26:36

nencbskk


Category: python. Tags: python, pdf, java

Article link: https://blog.csdn.net/ebatudou/article/details/135669917


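This article shows how to read PDF files with the Python library pdfplumber: extracting the text of a single page, extracting the text of every page, and extracting a table from a single page.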


import re

Extracting PDF text

import pdfplumber

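def get_one_page_text(path, page=0):
    """
    Read the text of a single PDF page.
    :param path: path to the PDF file
    :param page: zero-based page index; defaults to 0 (the first page)
    :return: the text of that page
    """
    with pdfplumber.open(path) as pdf:
        page01 = pdf.pages[page]  # select the page by index
        text = page01.extract_text()  # extract the text
    return text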

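def get_all_text(path):
    """
    Read the text of every page in a PDF.
    :param path: path to the PDF file
    :return: the text of all pages, joined with newlines
    """
    texts = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # Collect each page's text; fall back to "" if a page yields nothing
            texts.append(page.extract_text() or "")
    return "\n".join(texts)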
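
Text extracted from a PDF often contains irregular spacing and empty lines, so it usually needs a little cleanup before further processing. Below is a minimal sketch of such a step using the `re` module. The helper name `normalize_text` and the sample string are assumptions made for illustration; they are not part of pdfplumber or of the original code.

```python
import re


def normalize_text(text):
    """Collapse runs of spaces/tabs and drop blank lines (hypothetical helper)."""
    if not text:
        # Covers both None and the empty string
        return ""
    lines = []
    for line in text.splitlines():
        # Replace any run of spaces or tabs with one space, then trim the ends
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:
            lines.append(line)
    return "\n".join(lines)


# Made-up sample standing in for the string returned by the functions above
sample = "Title   line\n\n\n  second\tline  \n"
cleaned = normalize_text(sample)
print(cleaned)
```

In practice the result of `get_one_page_text` or `get_all_text` could be passed to this helper directly. Which whitespace rules make sense depends on the layout of the PDF in question.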

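def get_table(path, page=0):
    """
    Extract a table from a single PDF page.
    :param path: path to the PDF file
    :param page: zero-based page index; defaults to 0 (the first page)
    :return: the table as a list of rows, or None if no table is found
    """
    with pdfplumber.open(path) as pdf:
        page01 = pdf.pages[page]  # select the page by index
        table1 = page01.extract_table()  # extract a single table
        # table2 = page01.extract_tables()  # extract every table on the page
    return table1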
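
An extracted table comes back as a list of rows, where each row is a list of cell values and empty cells may be `None`. A common next step is to save it as CSV. The following is a rough sketch using only the standard library. The helper `table_to_csv` and the sample data are hypothetical and stand in for the value returned by `get_table`.

```python
import csv
import io


def table_to_csv(table):
    """Serialize a list-of-rows table to CSV text (hypothetical helper)."""
    if not table:
        # get_table may return None when the page has no table
        return ""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Write empty cells (None) as empty strings
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Made-up data in the same shape as an extracted table
sample_table = [["name", "qty"], ["apple", "3"], ["pear", None]]
csv_text = table_to_csv(sample_table)
print(csv_text)
```

To write a file instead of a string, the same loop works with a file opened via `open(path, "w", newline="", encoding="utf-8")` in place of the `io.StringIO` buffer.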