[Learning Python] Dong Fuguo Micro-Lesson: Scraping Tables from a Web Page with Python and Saving Them to an Excel File

Latest recommended article published on 2024-01-06 13:32:08

anITfish

Category: Python. Tags: python, learning

Article link: https://blog.csdn.net/qq_31949641/article/details/128546762

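Learned from the Python小屋 WeChat official account, micro-lesson "Scraping tables from a web page with Python and saving them to an Excel file": https://mp.weixin.qq.com/s/yx6ryvAaMAdF9Sa04NIuIw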

Program output:
[Screenshot omitted]
The code is as follows:

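from re import findall, sub, S
from urllib.request import urlopen
from openpyxl import Workbook

# Open the official-account article on a phone and copy its link
url = 'https://mp.weixin.qq.com/s/RtFzEm2TnGHnLTHMz9T4Aw'
with urlopen(url) as fp:
    # Open the target page in a browser first and confirm that it is UTF-8 encoded:
    # right-click a blank area, view the page source, and look for charset=utf-8
    content = fp.read().decode('utf-8')

# Create an empty workbook and remove the default blank worksheet;
# worksheets are created below as needed
wb = Workbook()
wb.remove(wb.worksheets[0])

# Always read the page source in the browser alongside these regular expressions.
# Find one complete table on the page: press Ctrl+F, search for "<table",
# and check on the page that it is the table you want.
# The S flag lets "." match newlines in case a table's HTML spans several lines.
table_pattern = '<table.*?><tbody>(.+?)</tbody></table>'
# Extracts each row of a table
row_pattern = '<tr.*?>(.+?)</tr>'
# Extracts each cell of a row; (.*?) keeps empty cells so columns stay aligned
cell_pattern = '<td.*?>(.*?)</td>'

# print(findall(table_pattern, content, S))
# Find every table that matches the pattern, counting from 1
for index, table in enumerate(findall(table_pattern, content, S), start=1):
    # Create one worksheet per table in the page
    # Equivalent: ws = wb.create_sheet(f'Sheet{index}')
    ws = wb.create_sheet('Sheet{}'.format(index))
    for row in findall(row_pattern, table, S):
        # Each <td> still contains markup such as <p><span>text</span></p>
        cells = findall(cell_pattern, row, S)
        # Replace every tag (<.+?>) and &nbsp; with an empty string, keeping only the text
        cells = [sub('<.+?>|&nbsp;', '', cell, flags=S) for cell in cells]
        # Append as one row: the row gets as many cells as cells has elements
        ws.append(cells)
wb.save('网页中的表格信息.xlsx')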
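
The three-level extraction above (table, then row, then cell) can be tried without any network access or Excel file. The following is a minimal sketch under stated assumptions: SAMPLE_HTML is made-up markup standing in for a downloaded page, and extract_tables is a hypothetical helper name, not something from the original lesson. It uses only the standard re module and returns nested lists, so the patterns can be checked before pointing them at a real page.

```python
import re

# Hypothetical sample markup standing in for a downloaded page (an assumption,
# not taken from the original article).
SAMPLE_HTML = """
<table class="demo"><tbody>
<tr><td><p><span>Name</span></p></td><td>Score</td></tr>
<tr><td>Alice&nbsp;</td><td><span>90</span></td></tr>
</tbody></table>
"""

# re.S makes "." match newlines, so tables spread over several lines still match.
# \s* tolerates whitespace between <table ...> and <tbody>.
TABLE_RE = re.compile(r'<table.*?>\s*<tbody>(.+?)</tbody>\s*</table>', re.S)
ROW_RE = re.compile(r'<tr.*?>(.+?)</tr>', re.S)
# (.*?) rather than (.+?) so that empty cells are kept as empty strings.
CELL_RE = re.compile(r'<td.*?>(.*?)</td>', re.S)
# Matches any remaining tag or a non-breaking-space entity.
TAG_RE = re.compile(r'<.+?>|&nbsp;', re.S)


def extract_tables(html):
    """Return a list of tables; each table is a list of rows of cell strings."""
    tables = []
    for table in TABLE_RE.findall(html):
        rows = []
        for row in ROW_RE.findall(table):
            # Strip inner tags and &nbsp; from every cell, keeping only the text.
            cells = [TAG_RE.sub('', cell).strip() for cell in CELL_RE.findall(row)]
            rows.append(cells)
        tables.append(rows)
    return tables


tables = extract_tables(SAMPLE_HTML)
print(tables)  # [[['Name', 'Score'], ['Alice', '90']]]
```

Each inner list has the shape that ws.append expects for one row, so the result could be written to a worksheet row by row. Regular expressions are adequate for simple, regular markup like this; for nested tables or irregular HTML, a real HTML parser would likely be the safer choice.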