Reading data from a PDF and filling it into Excel with Python

yangwj211

Last modified: 2024-03-04 12:51:49

Category: intern. Tags: python, pdf, excel

First published: 2024-02-27 13:18:03

Article link: https://blog.csdn.net/yangwj211/article/details/136319509

I. Extracting information from the PDF document with pdfplumber

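import pdfplumber
import pandas as pd

# Read the information from the PDF file
with pdfplumber.open("FS.pdf") as pdf:
    page01 = pdf.pages[1]  # pages is zero-based, so index 1 is the second page
    text = page01.extract_text()
    # Cut out the block between the "Top 10 Holdings" title and the disclaimer
    holdings_index1 = text.index('Top')
    holdings_index2 = text.index('The information provided')
    holdings = text[holdings_index1:holdings_index2]
    holdings_no_title = holdings.replace('Top 10 Holdings', '')
    holdings_splited = holdings_no_title.split('\n')

    holding_name = []
    holding_data = []
    # Index 0 is what remains of the title line; lines 1 to 10 are the holdings
    for i in range(1, 11):
        # Assumes the rate is the last five characters of the line (e.g. "12.34")
        data = holdings_splited[i][-5:]
        # Everything before the rate and its separating space is the name
        name = holdings_splited[i][0:-6]
        holding_data.append(data)
        holding_name.append(name)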
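
The fixed slices `[-5:]` and `[0:-6]` only work when every rate is exactly five characters long and is separated from the name by a single character. The following is a minimal sketch of a more tolerant alternative, assuming each holdings line ends with the rate as its last space-separated token. The helper name `parse_holdings` and the sample text are made up for illustration and do not come from the actual FS.pdf.

```python
def parse_holdings(lines):
    """Split lines such as 'Company A 9.87' into (name, rate) pairs."""
    names, rates = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines left over from the removed title
        # Split once from the right: everything before the last space is the name.
        name, _, rate = line.rpartition(" ")
        if not name:
            continue  # a line with a single token has no name/rate pair
        names.append(name)
        rates.append(rate)
    return names, rates


# Hypothetical sample, standing in for the text extracted from the PDF.
sample = "\nAlpha Corp 12.34\nBeta Holdings Ltd 9.8\nGamma Inc 10.00\n"
names, rates = parse_holdings(sample.split("\n"))
print(names)
print(rates)
```

With this approach a rate such as `9.8` or `100.00` no longer shifts characters between the name and the rate.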

II. Reading the data from the Excel file

xls = pd.ExcelFile('GreatLink_Top_10_holdings.xlsx')
sheets = {}

pd.ExcelFile('GreatLink_Top_10_holdings.xlsx') creates an ExcelFile object, which represents the Excel file to be read.

for sheet_name in xls.sheet_names:
    sheets[sheet_name] = xls.parse(sheet_name)

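xls.sheet_names returns the names of all worksheets in the Excel file.

The loop iterates over these names and handles one worksheet, sheet_name, at a time.

xls.parse(sheet_name) parses the worksheet with the given name and reads its contents into a DataFrame.

Each worksheet's DataFrame is stored in a dictionary whose keys are the worksheet names and whose values are the corresponding DataFrames.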
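
As a side note, pandas can build the same dictionary in a single call: `pd.read_excel` with `sheet_name=None` returns a dict that maps sheet names to DataFrames. The sketch below first writes a small throwaway workbook so that it runs on its own. The file name `demo.xlsx` and the sheet contents are hypothetical, and it assumes an Excel engine such as openpyxl is installed.

```python
import os
import tempfile

import pandas as pd

# Build a small throwaway workbook so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.xlsx")  # hypothetical file name
with pd.ExcelWriter(path) as writer:
    pd.DataFrame({"Top 10 holdings": ["Alpha Corp"], "Rate": ["12.34"]}).to_excel(
        writer, sheet_name="test", index=False
    )
    pd.DataFrame({"note": ["untouched"]}).to_excel(
        writer, sheet_name="other", index=False
    )

# sheet_name=None reads every sheet and returns {sheet name: DataFrame}.
sheets = pd.read_excel(path, sheet_name=None)
print(list(sheets))
print(sheets["test"])
```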

III. Appending the new data to the worksheet named "test"

Create a new DataFrame, new_data, and concatenate it with the existing DataFrame df to produce a combined DataFrame.

# Append the new data to the worksheet named "test"
df = sheets['test']
new_data = pd.DataFrame({'Top 10 holdings': holding_name, 'Rate': holding_data})
df = pd.concat([df, new_data], ignore_index=True)

pd.DataFrame({'Top 10 holdings': holding_name, 'Rate': holding_data}):
This line creates the new DataFrame new_data from a dictionary in which each key is a column name and each value is that column's data. The 'Top 10 holdings' column comes from holding_name and the 'Rate' column comes from holding_data.
pd.concat([df, new_data], ignore_index=True):
This line uses pd.concat() to join the existing DataFrame df with new_data. ignore_index=True discards the original row labels and numbers the rows of the result from 0. The result is a new DataFrame that contains the rows of df followed by the rows of new_data.
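
The effect of ignore_index is easiest to see on a tiny example. The following sketch uses made-up rows in place of the real holdings; it also shows one optional way to turn the rate strings into numbers with `pd.to_numeric`, which the original script does not do.

```python
import pandas as pd

existing = pd.DataFrame({"Top 10 holdings": ["Alpha Corp"], "Rate": ["12.34"]})
new_rows = pd.DataFrame(
    {"Top 10 holdings": ["Beta Holdings Ltd", "Gamma Inc"], "Rate": ["9.80", "10.00"]}
)

# Without ignore_index each frame keeps its own row labels.
kept = pd.concat([existing, new_rows])
# With ignore_index=True the rows are renumbered from 0.
renumbered = pd.concat([existing, new_rows], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())

# The rates extracted from the PDF are strings; convert them if numbers are needed.
renumbered["Rate"] = pd.to_numeric(renumbered["Rate"])
print(renumbered.dtypes)
```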

IV. Writing the updated data back to the Excel file, updating only the "test" worksheet and leaving the data in the other worksheets unchanged

with pd.ExcelWriter('GreatLink_Top_10_holdings.xlsx') as writer:
    for sheet_name, sheet_df in sheets.items():
        if sheet_name == 'test':
            df.to_excel(writer, sheet_name=sheet_name, index=False, header=True)
        else:
            sheet_df.to_excel(writer, sheet_name=sheet_name, index=False, header=True)

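pd.ExcelWriter opens the Excel file "GreatLink_Top_10_holdings.xlsx", and each DataFrame in the sheets dictionary is written to its corresponding worksheet. For the "test" worksheet the updated df is written instead of the DataFrame that was originally read.

Known issue: the Excel formatting is lost. pandas reads only the cell values into the DataFrames, and pd.ExcelWriter in its default write mode creates the file anew, so the styling of the original workbook is not carried over.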

import pdfplumber
import pandas as pd

# Read the holdings information from the PDF file
with pdfplumber.open("FS.pdf") as pdf:
    page01 = pdf.pages[1]  # pages is zero-based, so index 1 is the second page
    text = page01.extract_text()
    holdings_index1 = text.index('Top')
    holdings_index2 = text.index('The information provided')
    holdings = text[holdings_index1:holdings_index2]
    holdings_no_title = holdings.replace('Top 10 Holdings', '')
    holdings_splited = holdings_no_title.split('\n')

    holding_name = []
    holding_data = []
    for i in range(1, 11):
        data = holdings_splited[i][-5:]
        name = holdings_splited[i][0:-6]
        holding_data.append(data)
        holding_name.append(name)

# Read the data from the Excel file
xls = pd.ExcelFile('GreatLink_Top_10_holdings.xlsx')
sheets = {}
for sheet_name in xls.sheet_names:
    sheets[sheet_name] = xls.parse(sheet_name)

# Append the new data to the worksheet named "test"
df = sheets['test']
new_data = pd.DataFrame({'Top 10 holdings': holding_name, 'Rate': holding_data})
df = pd.concat([df, new_data], ignore_index=True)

# Write the updated data back to the Excel file; only the "test" worksheet gets new rows
with pd.ExcelWriter('GreatLink_Top_10_holdings.xlsx') as writer:
    for sheet_name, sheet_df in sheets.items():
        if sheet_name == 'test':
            df.to_excel(writer, sheet_name=sheet_name, index=False, header=True)
        else:
            sheet_df.to_excel(writer, sheet_name=sheet_name, index=False, header=True)
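
One possible way around the formatting loss is to skip the DataFrame round trip and append the rows to the existing workbook with openpyxl. This is a sketch of a different technique, not the approach used in the script above. It assumes openpyxl is installed, uses a throwaway `demo.xlsx` and made-up holdings so that it runs on its own, and relies on the fact that openpyxl keeps cell styles when a workbook is loaded and saved again. The openpyxl documentation notes that some items, such as images and charts, are not read and would be lost.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook
from openpyxl.styles import Font

# Build a small formatted workbook so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")  # hypothetical file name
wb = Workbook()
ws = wb.active
ws.title = "test"
ws.append(["Top 10 holdings", "Rate"])
ws["A1"].font = Font(bold=True)  # formatting we want to keep
wb.save(path)

# Hypothetical values standing in for holding_name / holding_data.
holding_name = ["Alpha Corp", "Beta Holdings Ltd"]
holding_data = ["12.34", "9.80"]

# Open the existing workbook, append rows to "test", and save in place.
wb = load_workbook(path)
ws = wb["test"]
for name, rate in zip(holding_name, holding_data):
    ws.append([name, rate])  # writes below the last used row
wb.save(path)

# Reload to confirm that the header style survived and the rows were added.
check = load_workbook(path)["test"]
print(check["A1"].font.bold, check.max_row)
```

Because the other worksheets are never rewritten, they keep both their data and their formatting.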

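Addendum:

Merging multiple Excel files with pandas in Python

1. Loop over the files, read each Excel file, and store the resulting DataFrames in a list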

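import os
import pandas as pd

filePath = './Top_10_Holdings/'
# List the files in the folder
list_filename = os.listdir(filePath)
file_num = len(list_filename)
list_data = []

# Read each Excel file into a DataFrame and collect the results
for i in range(0, file_num):
    fileName = list_filename[i]
    file_Name_With_Path = filePath + fileName
    df_data = pd.read_excel(file_Name_With_Path)
    list_data.append(df_data)
# Reverse the order in which the files were read
list_data.reverse()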

2. Call concat to join the DataFrames in the list, then save the result to a new Excel file.

df_data_merge = pd.concat(list_data, ignore_index=True)

fileName_merge = filePath + 'merge.xlsx'
df_data_merge.to_excel(fileName_merge)
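
Two details of the steps above are worth a sketch: os.listdir returns names in arbitrary order, and because merge.xlsx is written into the same folder, a second run would read it back in as input. The variant below, using pathlib, sorts the file names and skips the output file. The folder, the file names, and the sample rows are hypothetical and are created in a temporary directory so that the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build two small workbooks in a temporary folder so the example is self-contained.
folder = Path(tempfile.mkdtemp())
pd.DataFrame({"Top 10 holdings": ["Alpha Corp"], "Rate": [12.34]}).to_excel(
    folder / "2024-01.xlsx", index=False
)
pd.DataFrame({"Top 10 holdings": ["Beta Holdings Ltd"], "Rate": [9.8]}).to_excel(
    folder / "2024-02.xlsx", index=False
)

output = folder / "merge.xlsx"
# Sort for a reproducible order and skip the output file from a previous run.
files = sorted(p for p in folder.glob("*.xlsx") if p != output)
frames = [pd.read_excel(p) for p in files]

merged = pd.concat(frames, ignore_index=True)
merged.to_excel(output, index=False)  # index=False omits the extra index column
print(merged)
```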

For more on concat, see:

Merge, join, concatenate and compare — pandas 2.2.1 documentation (pydata.org)
