Converting PDF Files to Tables with Python

Latest recommended article published 2024-04-07 23:12:55

繁梦溪

Category: Python. Tags: python, PDF

Article link: https://blog.csdn.net/fg24151110876/article/details/116987999

Method 1: tabula-py

tabula-py is a wrapper around tabula-java, so Java must be installed.
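
Because tabula-py depends on a Java runtime, it can help to check for one before calling the library. The following is a minimal sketch using only the standard library; `java_available` is a hypothetical helper name and is not part of tabula-py.

```python
import shutil


def java_available() -> bool:
    """Return True if a `java` executable can be found on PATH."""
    return shutil.which("java") is not None


if not java_available():
    print("Java was not found on PATH; tabula-py will not be able to run.")
```

This only checks that a `java` executable is discoverable; it does not verify the Java version.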

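# pip install tabula-py
import tabula

# In tabula-py 2.x, read_pdf returns a list of DataFrames, one per detected table
dfs = tabula.read_pdf("D:\\我的文档\\Python\\2019221145237597.pdf",
                      encoding='gbk', pages='all')
for df in dfs:
    print(df)
    for idx in df.index:
        # Print each row, leaving out the last column
        print(df.loc[idx].values[0:-1])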

import tabula

# Read pdf into a list of DataFrames
dfs = tabula.read_pdf("test.pdf", pages="all")

# Read remote pdf into DataFrame
df2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")

# convert PDF into CSV
tabula.convert_into("test.pdf", "output.csv", output_format="csv")

# convert all PDFs in a directory
tabula.convert_into_by_batch("input_directory", output_format='csv')

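Method 2: pdfplumber

pdfplumber is a library for extracting information from PDF files. It exposes detailed information about each text character, rectangle, and line, and it can also extract tables and supports visual debugging.

https://github.com/jsvine/pdfplumber

Basic usage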

import pdfplumber
with pdfplumber.open("path/file.pdf") as pdf:
    first_page = pdf.pages[0]  # get the first page
    print(first_page.chars[0])

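The pdfplumber.PDF class has two main properties, .metadata and .pages.
.metadata is a dictionary of the PDF's metadata.
.pages is a list with one pdfplumber.Page instance per page.

Each pdfplumber.Page instance has several main properties.
.page_number the page number
.width the page width
.height the page height
.objects/.chars/.lines/.rects .chars, .lines, and .rects are each a list containing one dictionary per object on the page (character, line, rectangle), describing its position and other properties; .objects collects these lists in a dictionary keyed by object type.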
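
To make the shape of these dictionaries concrete, here is a minimal sketch that groups character dictionaries into lines of text. It assumes each dictionary has the "text", "x0", and "top" keys that entries of page.chars carry; `chars_to_lines` and the sample data are hypothetical, and the bucketing by "top" is deliberately crude compared with what extract_text() does.

```python
from collections import defaultdict


def chars_to_lines(chars, y_tolerance=3):
    """Group char dicts (shaped like entries of page.chars) into text lines.

    Characters whose "top" values fall into the same y_tolerance-sized bucket
    are treated as one line and ordered left to right by "x0".
    """
    buckets = defaultdict(list)
    for ch in chars:
        buckets[round(ch["top"] / y_tolerance)].append(ch)
    lines = []
    for key in sorted(buckets):
        row = sorted(buckets[key], key=lambda c: c["x0"])
        lines.append("".join(c["text"] for c in row))
    return lines


# Hand-written sample data that mimics a few keys of real char dicts.
sample_chars = [
    {"text": "b", "x0": 20.0, "top": 100.2},
    {"text": "a", "x0": 10.0, "top": 100.0},
    {"text": "c", "x0": 10.0, "top": 120.0},
]
print(chars_to_lines(sample_chars))  # ['ab', 'c']
```

With a real file, the same function could be called as chars_to_lines(pdf.pages[0].chars).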

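Common methods

extract_text() extracts the page's text, collating all of the page's character objects into a single string
extract_words() returns a list of all words on the page together with their associated information
extract_tables() extracts the tables on the page
to_image() returns a PageImage instance, used for visual debugging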
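
As an illustration of how these methods fit together, the sketch below collects a few counts per page. `summarize_page` and `summarize_pdf` are hypothetical helper names; the first accepts any object that exposes extract_text(), extract_words(), and extract_tables(), which is what a pdfplumber page provides.

```python
def summarize_page(page):
    """Collect basic counts from an object exposing pdfplumber's page methods."""
    # extract_text() can return nothing for a page without a text layer
    text = page.extract_text() or ""
    words = page.extract_words()
    tables = page.extract_tables()
    return {
        "chars_in_text": len(text),
        "word_count": len(words),
        "table_count": len(tables),
    }


def summarize_pdf(path):
    """Open a PDF with pdfplumber and summarize every page."""
    import pdfplumber  # imported here so summarize_page has no hard dependency

    with pdfplumber.open(path) as pdf:
        return {page.page_number: summarize_page(page) for page in pdf.pages}
```

Calling summarize_pdf("path/file.pdf") would return a dictionary keyed by page number, which gives a quick check of which pages contain tables before extracting them.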

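Table-extraction settings

By default, extract_tables uses the page's vertical and horizontal lines (or rectangle edges) as cell separators. The method is highly customizable through the table_settings argument. The possible settings and their defaults: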

{
    "vertical_strategy": "lines", 
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "join_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "keep_blank_chars": False,
    "text_tolerance": 3,
    "text_x_tolerance": None,
    "text_y_tolerance": None,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": None,
    "intersection_y_tolerance": None,
}
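
Only the settings that differ from the defaults need to be changed. One way to keep that manageable is a small helper that merges overrides into a base dictionary and rejects misspelled keys. This is a sketch: `make_table_settings` is a hypothetical helper, the base dictionary is a subset of the listing above, and the strategy names are the ones pdfplumber documents ("lines", "lines_strict", "text", "explicit").

```python
BASE_TABLE_SETTINGS = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "snap_tolerance": 3,
    "join_tolerance": 3,
    "intersection_tolerance": 3,
}

VALID_STRATEGIES = {"lines", "lines_strict", "text", "explicit"}


def make_table_settings(**overrides):
    """Return a copy of the base settings with the given overrides applied."""
    unknown = set(overrides) - set(BASE_TABLE_SETTINGS)
    if unknown:
        raise KeyError("unknown table setting(s): %s" % sorted(unknown))
    settings = {**BASE_TABLE_SETTINGS, **overrides}
    for key in ("vertical_strategy", "horizontal_strategy"):
        if settings[key] not in VALID_STRATEGIES:
            raise ValueError("%s must be one of %s" % (key, sorted(VALID_STRATEGIES)))
    return settings


settings = make_table_settings(vertical_strategy="text", intersection_tolerance=20)
print(settings)
```

The result can be passed as page.extract_tables(table_settings=settings).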

Examples

Reading text

import pdfplumber

with pdfplumber.open("E:\\600aaa_2.pdf") as pdf:
    page_count = len(pdf.pages)
    print(page_count)  # number of pages
    for page in pdf.pages:
        print('---------- Page [%d] ----------' % page.page_number)
        # Get all text on the current page, including text inside tables
        print(page.extract_text())

Reading tables

import pdfplumber
import re

with pdfplumber.open("E:\\600aaa_1.pdf") as pdf:
    page_count = len(pdf.pages)
    print(page_count)  # number of pages
    for page in pdf.pages:
        print('---------- Page [%d] ----------' % page.page_number)

        # intersection_tolerance: how close orthogonal edges must be
        # to count as intersecting when cells are built
        for pdf_table in page.extract_tables(table_settings={"vertical_strategy": "text",
                                                             "horizontal_strategy": "lines",
                                                             "intersection_tolerance": 20}):

            # print(pdf_table)
            for row in pdf_table:
                # Strip line breaks and other whitespace from each cell
                print([re.sub(r'\s+', '', cell) if cell is not None else None for cell in row])
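
The cell clean-up in the loop above can be factored into reusable functions. The sketch below works on plain lists of rows, which is the shape extract_tables() returns; `clean_cell` and `clean_table` are hypothetical names, and dropping fully empty rows is an added assumption rather than something the original loop does.

```python
import re


def clean_cell(cell):
    """Strip all whitespace from a cell; None (an empty cell) becomes ''."""
    return "" if cell is None else re.sub(r"\s+", "", cell)


def clean_table(table):
    """Clean every cell and drop rows that end up completely empty."""
    cleaned = [[clean_cell(cell) for cell in row] for row in table]
    return [row for row in cleaned if any(row)]


raw = [["Name", "Dose\n(mg)"], [None, None], ["Drug A", " 10 "]]
print(clean_table(raw))  # [['Name', 'Dose(mg)'], ['DrugA', '10']]
```

Removing every whitespace character suits Chinese text, where spaces inside a cell are usually extraction noise; for English text it also joins words, as "DrugA" shows, so a gentler rule may be preferable there.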

Worked example

import pandas as pd

def to_table(pdf_table):
    # Use the first row of the extracted table as the DataFrame header
    df = pd.DataFrame(pdf_table)
    df.columns = df.iloc[0]
    df = df.drop(df.index[0])
    return df

import pdfplumber
import pandas as pd

tables = []  # one DataFrame per extracted table
with pdfplumber.open("/mnt/c/Users/admin/Downloads/202104291855528(file)附件：天津市医保药品支付范围信息维护明细表（2021年第四期）.pdf") as pdf:
    page_count = len(pdf.pages)
    print(page_count)  # number of pages
    for page in pdf.pages:
        print('---------- Page [%d] ----------' % page.page_number)

        # intersection_tolerance: how close orthogonal edges must be
        # to count as intersecting when cells are built
        for pdf_table in page.extract_tables(table_settings={"vertical_strategy": "text",
                                                             "horizontal_strategy": "lines",
                                                             "intersection_tolerance": 20}):
            table = to_table(pdf_table)
            # Record which PDF page each row came from
            table['PDF page'] = 'Page [{}]'.format(page.page_number)
            print(table)
            tables.append(table)

# pd.concat raises ValueError on an empty list, so only write when tables were found
if tables:
    ddf = pd.concat(tables)
    # Writing .xlsx files requires an Excel engine such as openpyxl
    ddf.to_excel('/mnt/c/Users/admin/Downloads/202104291855528(file)附件：天津市医保药品支付范围信息维护明细表（2021年第四期）.xlsx', index=False)
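
When a table continues across pages, the header row may be repeated on some pages and missing on others. The sketch below shows one possible way to handle that on the raw lists before building a DataFrame. It assumes the first row of the first table is the header and that a later table starting with an identical row is repeating it; `merge_page_tables` and the "PDF page" column name are hypothetical.

```python
def merge_page_tables(page_tables):
    """Merge (page_number, table) pairs into one header and a list of data rows.

    The first row of the first non-empty table is taken as the header. A later
    table whose first row equals that header has the row skipped as a repeat.
    The page number is appended to every data row.
    """
    header = None
    rows = []
    for page_number, table in page_tables:
        if not table:
            continue
        if header is None:
            header = list(table[0])
            body = table[1:]
        elif list(table[0]) == header:
            body = table[1:]  # repeated header on a continuation page
        else:
            body = table      # continuation page without a header row
        rows.extend(list(row) + [page_number] for row in body)
    if header is None:
        return [], []
    return header + ["PDF page"], rows


pages = [
    (1, [["id", "name"], ["1", "a"]]),
    (2, [["id", "name"], ["2", "b"]]),
    (3, [["3", "c"]]),
]
header, rows = merge_page_tables(pages)
print(header)  # ['id', 'name', 'PDF page']
print(rows)    # [['1', 'a', 1], ['2', 'b', 2], ['3', 'c', 3]]
```

The result can be turned into a DataFrame with pd.DataFrame(rows, columns=header) and then written out with to_excel as in the example above.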