Extracting Data from a PDF Document into an Excel Sheet with Python

拓云者也

Last modified: 2022-10-16 21:33:37

Tags: python

First published: 2022-10-16 21:32:22

Article link: https://blog.csdn.net/tuoyunzhe/article/details/127353279

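Table of contents

Preface
I. What is pdfplumber? What is xlwt?
II. Steps
- 1. Import the libraries
- 2. Read the data
- 3. Create the Excel file, extract the data, and save
Summary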

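Preface

I am learning this as I go. I am keeping this code mainly to help myself learn and remember it, and also as a reference for others. My skill level is limited, so please be kind!

Note: extracting the data from the PDF document relies mainly on pdfplumber; xlwt is then used to create the Excel workbook and write the data into it.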

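I. What is pdfplumber? What is xlwt?

pdfplumber is a library for processing the content of PDF files. It can return detailed information about each text character, rectangle, and line on a page, and it can also extract tables and help you debug the extraction visually.

xlwt is a third-party Python library for working with Excel files. Its main job is writing Excel files in the legacy .xls format. It is often used together with xlrd and xlutils.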
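
Because xlwt writes the legacy .xls format, it can help to check the size of the data before writing. The sketch below assumes the usual .xls worksheet limits of 65,536 rows and 256 columns; the constant names and the fits_in_xls helper are made up for illustration and are not part of xlwt.

```python
# Assumed limits of a single worksheet in the legacy .xls format
XLS_MAX_ROWS = 65536
XLS_MAX_COLS = 256


def fits_in_xls(rows):
    """Return True if a list of rows fits into one .xls worksheet."""
    # Width of the widest row; 0 when there are no rows at all
    widest = max((len(row) for row in rows), default=0)
    return len(rows) <= XLS_MAX_ROWS and widest <= XLS_MAX_COLS


# Made-up sample: 100 rows with 4 columns each
sample_rows = [['a', 'b', 'c', 'd'] for _ in range(100)]
print(fits_in_xls(sample_rows))
```

If the check fails, one option is to split the rows across several sheets or to switch to a library that writes .xlsx.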

II. Steps

1. Import the libraries

The code is as follows (example):

Import the libraries you need:

import pdfplumber
import xlwt

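2. Read the data

The code is as follows (example):

Use with pdfplumber.open(...) to open the PDF file you want to extract data from:

# Open the PDF file
with pdfplumber.open('C:\\Users\\huain\\Desktop\\pnas.pdf') as pdf:
    # Get the list of all pages in the document
    pages = pdf.pages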
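
pdf.pages is an ordinary Python list, so it is indexed from 0 and a slice such as pages[21:31] selects pages 22 to 31 of the document. The sketch below shows one way to convert printed page numbers into such a slice. The page_slice helper is hypothetical, and a plain list of numbers stands in for pdf.pages so the example runs without a PDF file.

```python
def page_slice(first_page, last_page):
    """Convert 1-based, inclusive page numbers into a 0-based slice."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return slice(first_page - 1, last_page)


# Hypothetical stand-in for pdf.pages: printed page numbers 1 to 40
pages = list(range(1, 41))

# Selects the same pages as pages[21:31]
selected = pages[page_slice(22, 31)]
print(selected)
```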

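3. Create the Excel file, extract the data, and save

The code below continues inside the with block opened in step 2:

    # Collect the table rows from all selected pages
    item = []
    # pages[21:31] is a 0-based slice, i.e. pages 22 to 31 of the document
    for page in pages[21:31]:
        # Extract the table on this page (None if no table is found)
        table = page.extract_table()
        if table is None:
            continue
        for row in table:
            item.append(row)
    # Create the Excel workbook
    work_book = xlwt.Workbook(encoding='utf-8')
    # Add a new worksheet
    work_sheet = work_book.add_sheet('sheet1')
    # Use the first extracted row as the column names
    col = item[0]
    # Write col to the first row of the sheet, e.g. ['Province', 'City', 'Electricity usage category', 'Current-period value']
    for j, name in enumerate(col):
        work_sheet.write(0, j, name)
    # Write the remaining rows to the sheet, starting at the second row
    for i, data in enumerate(item[1:], start=1):
        for j in range(len(col)):
            work_sheet.write(i, j, data[j])
    # Save the workbook
    work_book.save('test31.xls')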
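
When a table spans several pages, some pages may contain no table at all (extract_table() then returns None) and others may repeat the header row. The sketch below shows one possible way to handle both cases. It assumes that the first row of the first table is the header; merge_page_tables and write_xls are hypothetical helpers, and the sample data is made up.

```python
def merge_page_tables(tables):
    """Merge per-page tables into (header, rows).

    `tables` holds one entry per page: a list of rows, or None when
    no table was found on that page.
    """
    header = None
    rows = []
    for table in tables:
        if not table:            # None or empty: no table on this page
            continue
        for row in table:
            if header is None:
                header = row     # the first row seen becomes the header
            elif row == header:
                continue         # skip a header repeated on a later page
            else:
                rows.append(row)
    return header, rows


def write_xls(header, rows, path):
    """Write the header and the data rows to an .xls file with xlwt."""
    import xlwt  # imported here so merge_page_tables works without xlwt

    if header is None:
        raise ValueError("no table rows were extracted")
    work_book = xlwt.Workbook(encoding='utf-8')
    work_sheet = work_book.add_sheet('sheet1')
    for r, row in enumerate([header] + rows):
        for c, value in enumerate(row):
            work_sheet.write(r, c, value)
    work_book.save(path)


# Made-up data standing in for three pages of extract_table() results
sample = [
    [['name', 'value'], ['a', '1']],
    None,
    [['name', 'value'], ['b', '2']],
]
header, rows = merge_page_tables(sample)
print(header, rows)
```

With a real PDF, the input could be built as [page.extract_table() for page in pages[21:31]] and the result passed to write_xls(header, rows, 'test31.xls').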

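The complete code is as follows:

import pdfplumber
import xlwt

# Open the PDF file
with pdfplumber.open('C:\\Users\\huain\\Desktop\\pnas.pdf') as pdf:
    # Get the list of all pages in the document
    pages = pdf.pages
    # Collect the table rows from all selected pages
    item = []
    # pages[21:31] is a 0-based slice, i.e. pages 22 to 31 of the document
    for page in pages[21:31]:
        # Extract the table on this page (None if no table is found)
        table = page.extract_table()
        if table is None:
            continue
        for row in table:
            item.append(row)
    # Create the Excel workbook
    work_book = xlwt.Workbook(encoding='utf-8')
    # Add a new worksheet
    work_sheet = work_book.add_sheet('sheet1')
    # Use the first extracted row as the column names
    col = item[0]
    # Write col to the first row of the sheet, e.g. ['Province', 'City', 'Electricity usage category', 'Current-period value']
    for j, name in enumerate(col):
        work_sheet.write(0, j, name)
    # Write the remaining rows to the sheet, starting at the second row
    for i, data in enumerate(item[1:], start=1):
        for j in range(len(col)):
            work_sheet.write(i, j, data[j])
    # Save the workbook
    work_book.save('test31.xls')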
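
The rows returned by extract_table() can contain None for empty cells and line breaks inside a cell, and a row may be shorter or longer than the header, in which case data[j] in the code above could raise an IndexError. The sketch below is one way to tidy the rows before writing them. clean_rows is a hypothetical helper and the sample rows are made up.

```python
def clean_rows(rows, width):
    """Return rows with exactly `width` cells of tidy text each."""
    cleaned = []
    for row in rows:
        # Replace None with an empty string and flatten line breaks
        cells = [
            "" if cell is None else str(cell).replace("\n", " ").strip()
            for cell in row
        ]
        # Cut rows that are too long and pad rows that are too short
        cells = cells[:width] + [""] * (width - len(cells))
        cleaned.append(cells)
    return cleaned


# Made-up rows: a line break, a None cell, a short row and a long row
raw = [['a\nb', None], ['x'], ['1', '2', '3']]
print(clean_rows(raw, 2))
```

In the script above this could be applied as item = clean_rows(item, len(item[0])) before the header and data rows are written.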

Taking the PDF I processed as an example: