Quick Series 04: Batch-Extracting Tables from Word, PDF, and Web Pages with Python

云晓-

Last modified: 2022-10-05 20:42:17

First published: 2022-09-18 20:54:06

Article link: https://blog.csdn.net/one_bird_/article/details/126922208

Contents

1 Batch-extracting tables from Word with Python
2 Batch-extracting tables from PDF with Python
3 Batch-extracting tables from web pages with Python

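1 Batch-extracting tables from Word with Python

Goal:

Save the contents of identically formatted tables in a Word document to Excel.

The tables in the Word document look like this:
[Image: table layout in the Word document]
The result saved to Excel:

[Image: the resulting Excel sheet]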

1.1 Introduction

python-docx is a third-party Python library for reading and writing Word files.

Source repository: https://github.com/python-openxml/python-docx
Official documentation: https://python-docx.readthedocs.io/en/latest/
Chinese documentation: https://www.osgeo.cn/python-docx/
Installation: pip install python-docx -i https://pypi.tuna.tsinghua.edu.cn/simple/

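1.2 Reading the contents of a Word document

To read data from an existing Word document with python-docx, drill down through the object hierarchy (document, table, cell), then read the text attribute of the object you need.
Important: the Word file must be in .docx format; python-docx cannot open legacy .doc files.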
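
Because only .docx files can be read, a batch run usually starts by filtering the input folder. The following is a minimal sketch, assuming all documents sit in one folder; the helper name find_docx_files is illustrative and not part of the original post. It also skips the "~$..." lock files that Word creates while a document is open.

```python
from pathlib import Path


def find_docx_files(folder):
    """Return the sorted .docx paths in `folder`, skipping Word lock files."""
    return sorted(
        path
        for path in Path(folder).iterdir()
        if path.is_file()
        and path.suffix.lower() == ".docx"      # ignores .doc, .txt, ...
        and not path.name.startswith("~$")      # temporary lock file of an open document
    )
```

Each returned path can then be passed to Document(str(path)). Legacy .doc files would first need converting to .docx, for example with Word's "Save As".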

Method 1

Address each cell by its exact position: cell(0,0), cell(0,1), and so on.

# Import libraries
from docx import Document
import pandas as pd

# Open the Word file
document = Document('./例子.docx')

# Get all tables in the document
tables = document.tables

# One list per field; each collects the same cell from every table
xuhao_list = []
zuoye_list = []
riqi_list = []
julebu_list = []
zihao_list = []
jindu_list = []
tongzhi_list = []
qianshou_list = []

# Read the same cell positions from each table
for table in tables:
    xuhao_list.append(table.cell(0, 0).text)
    zuoye_list.append(table.cell(0, 1).text)
    riqi_list.append(table.cell(0, 2).text)
    julebu_list.append(table.cell(0, 3).text)
    zihao_list.append(table.cell(0, 4).text)
    jindu_list.append(table.cell(1, 1).text)
    tongzhi_list.append(table.cell(1, 2).text)
    qianshou_list.append(table.cell(2, 1).text)

# Assemble the lists into a dict
# (column names are kept in Chinese to match the sample document)
info_dict = {
    '序号': xuhao_list,
    '作业': zuoye_list,
    '日期': riqi_list,
    '俱乐部': julebu_list,
    '字号': zihao_list,
    '进度': jindu_list,
    '通知': tongzhi_list,
    '签收': qianshou_list
}

# Build the DataFrame (displayed when run in a notebook)
pd.DataFrame(info_dict)

# Write to an Excel file
pd.DataFrame(info_dict).to_excel('./例子的内容.xlsx', index=False)
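
The eight parallel lists of Method 1 can also be expressed as a single mapping from column name to cell position. This is a sketch under the same assumption that every table shares one layout; FIELD_POSITIONS, table_to_record and tables_to_records are illustrative names, and the functions rely only on table.cell(row, col).text.

```python
# Column name -> (row, col) position inside each Word table.
# The positions are copied from Method 1; the names here are illustrative.
FIELD_POSITIONS = {
    "序号": (0, 0),
    "作业": (0, 1),
    "日期": (0, 2),
    "俱乐部": (0, 3),
    "字号": (0, 4),
    "进度": (1, 1),
    "通知": (1, 2),
    "签收": (2, 1),
}


def table_to_record(table, positions=FIELD_POSITIONS):
    """Read one table into a dict by looking up each field's cell."""
    return {
        name: table.cell(row, col).text
        for name, (row, col) in positions.items()
    }


def tables_to_records(tables, positions=FIELD_POSITIONS):
    """Return one dict per table, in document order."""
    return [table_to_record(table, positions) for table in tables]
```

Passing the result to pandas, as in pd.DataFrame(tables_to_records(document.tables)), should yield one row per table with the same columns as Method 1. If the layout changes, only the mapping needs editing.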

Method 2

Loop over every cell and de-duplicate the collected text, so that the contents of merged cells are not output more than once.

# Import libraries
from docx import Document
import pandas as pd

# Open the Word file
document = Document('./例子.docx')
# Get all tables in the document
tables = document.tables

# Number of rows in one table.
# The second table is used here; all tables share one layout, so any would do.
row = len(tables[1].column_cells(0))
# Number of columns in that table
col = len(tables[1].row_cells(0))

table_all = []
for table in tables:
    table_one = []
    # Read every cell of the current table
    for i in range(row):
        for j in range(col):
            cell_text = table.cell(i, j).text
            # cell(i, j) visits every grid position, so a merged cell is returned
            # once for each position it spans; skip text that was already collected.
            # This only works if no two distinct cells in a table hold the same text.
            if cell_text not in table_one:
                table_one.append(cell_text)
    table_all.append(table_one)
df = pd.DataFrame(table_all)
# Set the column names
df.columns = ['序号', '作业', '日期', '俱乐部', '字号', '进度', '通知', '签收']
df

# Write to an Excel file
df.to_excel('./例子的内容.xlsx', index=False)
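
The text-based de-duplication in Method 2 drops a value whenever two different cells happen to contain the same text. One possible alternative, sketched here, is to de-duplicate by the underlying cell element instead. It assumes that python-docx returns the same underlying element for every grid position a merged cell spans, and it reads the private attribute cell._tc, which may change between python-docx versions. The function name unique_cell_texts is illustrative.

```python
def unique_cell_texts(table):
    """Return cell texts in reading order, counting each merged cell once.

    Merged cells are recognised by the identity of their underlying XML
    element (`cell._tc`, a private python-docx attribute), not by their text.
    """
    seen = []    # underlying elements that were already emitted
    texts = []
    for i in range(len(table.rows)):
        for j in range(len(table.columns)):
            cell = table.cell(i, j)
            if any(cell._tc is tc for tc in seen):
                continue  # same merged cell reached through another grid position
            seen.append(cell._tc)
            texts.append(cell.text)
    return texts
```

With this helper, table_all could be built as [unique_cell_texts(t) for t in document.tables], and two distinct cells with identical text are both kept.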

Resource download: https://download.csdn.net/download/one_bird_/86543886

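2 Batch-extracting tables from PDF with Python

Goal:

Read a PDF file with pdfplumber, automatically extract the tables on each page, and write each table to a new Excel file in a loop.

Source PDF:
2022 semi-annual report of 保利发展控股集团股份有限公司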

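2.1 Introduction

pdfplumber is a third-party Python library for extracting text and tables from PDF files.

2.2 Reading tables from a PDF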

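# Environment setup:
pip install pdfplumber -i https://pypi.tuna.tsinghua.edu.cn/simple

# Import libraries
import pdfplumber
import pandas as pd
import os

def get_pdf_tables(file_path, start_num, end_num):
    # File name without directory or extension, used to name the output files
    file_name = os.path.splitext(os.path.basename(file_path))[0]
    # Open the PDF; the with-block closes it when done
    with pdfplumber.open(file_path) as p:
        # Loop over the requested pages (1-based, inclusive)
        for num in range(start_num - 1, end_num):
            # Select the page
            page = p.pages[num]
            # Extract all tables on the page
            tables = page.extract_tables()
            if len(tables) == 0:
                print(f'Page {num+1} has no tables')
                continue
            print(f'Page {num+1} has {len(tables)} table(s)')
            # Create the output folder if it does not exist yet
            os.makedirs('./data', exist_ok=True)
            # Write each table to its own Excel file
            for index_num, table in enumerate(tables):
                # Load the raw table into a DataFrame
                df = pd.DataFrame(table)
                # Take the first row as the header
                df_header = df.loc[0, :]
                # Keep the remaining rows as data
                df = df.loc[1:, :]
                df.columns = df_header
                df.to_excel(f'./data/{file_name}_第{num+1}页_第{index_num+1}张表.xlsx', index=False)

# Run the function
get_pdf_tables("./保利地产.pdf", 13, 16)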
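
The function takes 1-based, inclusive page numbers and converts them to 0-based indices, and a range that runs past the last page would raise an IndexError. A small sketch of that conversion with bounds checking follows; page_indices is a hypothetical helper, and page_count is assumed to come from len(p.pages).

```python
def page_indices(start_page, end_page, page_count):
    """Convert an inclusive 1-based page range into 0-based indices.

    The range is clamped to the number of pages in the document.
    """
    if start_page < 1 or end_page < start_page:
        raise ValueError("expected 1 <= start_page <= end_page")
    return range(start_page - 1, min(end_page, page_count))
```

For the call above, pages 13 to 16 of a document with at least 16 pages map to indices 12 to 15.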

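Note

A DataFrame built from the raw table gets default numeric row and column labels (0, 1, 2, ...). To get rid of them, promote the first row of the table to the header, and leave out the left-most index column when saving (index=False).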

Solution

# Take the first row as the header
df_header = df.loc[0, :]
# Keep the remaining rows as data
df = df.loc[1:, :]
df.columns = df_header
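
The same header promotion can be done on the raw table before it becomes a DataFrame. page.extract_tables() returns each table as a list of rows, and empty cells come back as None. The sketch below is one way to tidy that structure; split_header and its line_joiner parameter are illustrative, and joining wrapped lines with an empty string is an assumption that suits Chinese text (a space may be preferable for English).

```python
def split_header(table, line_joiner=""):
    """Split a raw table (a list of rows) into (header, body).

    None cells become "", and line breaks inside a cell are replaced
    by `line_joiner`.
    """
    def clean(cell):
        if cell is None:
            return ""
        return str(cell).replace("\n", line_joiner).strip()

    cleaned = [[clean(cell) for cell in row] for row in table]
    if not cleaned:
        return [], []
    return cleaned[0], cleaned[1:]
```

A DataFrame can then be built directly with header, body = split_header(table) followed by pd.DataFrame(body, columns=header).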

3 Batch-extracting tables from web pages with Python
