Python: fetching web page source data and formatting it into Excel

Latest recommended article published on 2024-07-07 06:42:20

Snail_cz

Category: python. Tags: python, requests_html, BeautifulSoup, fetching website data, working with Excel in Python

Article link: https://blog.csdn.net/Cow_cz/article/details/81481769

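Preface:
This post covers the requests-html and BeautifulSoup libraries, both of which work well for scraping web page data. I have only recently started working in this area, so this post is a set of personal study notes.

requests-html is better suited to fetching pages from the internet and parsing them on demand.
If you only need to read local HTML data and reformat it, BeautifulSoup is enough.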

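1. The requests-html library

One point deserves special attention: **according to several sources, requests-html only supports Python 3.6 and above (I could not install it on Python 2.x)**, so anyone on an older Python needs to upgrade first.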

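1.1 Installation

Linux:

pip install requests-html

Windows (go to the …/Scripts directory under the Python installation, open a terminal there, and run the following command):

pip3 install requests-html

Wait for the installation to finish.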

1.2 Verifying the installation

Run python in a terminal, then enter:

>>> from requests_html import HTMLSession
>>>

If no error is reported, the installation succeeded.
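The same check can be scripted. The sketch below is only an illustration: the helper names is_installed and python_is_new_enough are hypothetical. It uses importlib.util.find_spec from the standard library to locate a module without importing it, together with the Python 3.6 requirement noted earlier.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def python_is_new_enough(version=sys.version_info, minimum=(3, 6)):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version[:2]) >= minimum


if __name__ == "__main__":
    print("Python 3.6+:", python_is_new_enough())
    print("requests_html found:", is_installed("requests_html"))
```

Finding the module only shows that it is present; the interactive import above is still the real test that it loads.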

1.3 Basic usage

Fetching a page

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.baidu.com')
# Print the text content of the page
print(r.html.text)

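Getting all links on a page

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.baidu.com')
# r.html.links is the set of links found on the page
# (r.html.html would print the raw HTML instead)
print(r.html.links)

Usage guides for the library are easy to find online, so I won't record more here. The rest of this post concentrates on the problems I ran into and on one case study.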

1.4 Problems encountered

[Problem 1]
Running pip install requests-html fails with: Cannot uninstall 'chardet'

...
Installing collected packages: chardet, urllib3, certifi, idna, requests, cssselect, pyquery, requests-html
  Found existing installation: chardet 2.0.1
Cannot uninstall 'chardet'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.

Solution: find the chardet*.egg-info entry and delete it, then run the installation again. The following commands can serve as a reference:

cd /
find * -name 'chardet*'
rm -rf usr/lib/python2.7/site-packages/chardet-2.0.1-py2.7.egg-info
/usr/bin/pip install requests-html
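The search step can also be done from Python. This is a sketch under assumptions: find_metadata is a hypothetical helper, the directory to scan is whatever site-packages path applies on your system, and it only lists matching entries rather than deleting anything.

```python
from pathlib import Path


def find_metadata(site_packages, package):
    """List the *.egg-info entries for a package under a site-packages directory."""
    root = Path(site_packages)
    return sorted(root.glob(package + "*.egg-info"))


if __name__ == "__main__":
    # Path taken from the commands above; adjust it for your own system
    for entry in find_metadata("/usr/lib/python2.7/site-packages", "chardet"):
        print(entry)
```

Listing first and deleting by hand keeps the destructive step explicit.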

[Problem 2]
After writing a script and running it, the __init__ method in the requests-html library file raises an error similar to the following:

 File "/usr/lib/python2.7/site-packages/requests_html.py", line 21
    def __init__(self, element, html=None, url):
    ^
IndentationError: unexpected indent

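Solution: edit the line named in the error and remove the * from the __init__ parameter list, i.e. turn the keyword-only parameters into ordinary parameters; the file then compiles. The path in the traceback shows the script running under Python 2.7, which does not understand keyword-only parameters, so running under Python 3.6+ as noted at the start of this section is probably the cleaner fix.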
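For context, a bare * in a parameter list marks every parameter after it as keyword-only. The sketch below uses a hypothetical Element class (not the library's real code) to show the behaviour on Python 3.

```python
class Element:
    """Hypothetical class; every parameter after the bare * is keyword-only."""

    def __init__(self, *, element, html=None, url):
        self.element = element
        self.html = html
        self.url = url


# Keyword arguments are accepted
ok = Element(element="<p>", url="https://example.com")

# Positional arguments are rejected with a TypeError
try:
    Element("<p>", None, "https://example.com")
    rejected = False
except TypeError:
    rejected = True

print(ok.url, rejected)
```

If the * is removed from a signature like this one, parameters without defaults have to come before those with defaults, otherwise Python reports a syntax error.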

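2. Case study: reading a local web page with BeautifulSoup and formatting its data

[Note] The goal is to format the data from the following web page into Excel:
[Image]
The rows and columns in Excel correspond as follows:
[Image]
The steps are described below.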

2.1 Installing bs4

BeautifulSoup is provided by the bs4 package, so install bs4:

 pip install bs4

Wait for the installation to finish.

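2.2 Installing xlwt

xlwt is used to write Excel files; in this case it exports the formatted data to Excel. If it is not installed yet, run:

 pip install xlwt

If either installation fails, a quick search online should turn up a fix; it is usually straightforward.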

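2.3 Walking through the case

2.3.1 Open the local HTML file and create a BeautifulSoup object

import bs4

url = r'E:\pySpace\pythonpro\venv\181\index.html'

# Open the HTML file and parse it (the 'lxml' parser requires the lxml package)
with open(url, 'r', encoding='utf-8') as f:
    Soup = bs4.BeautifulSoup(f.read(), 'lxml')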

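2.3.2 Use the select method to get node data

Open the local page in a browser (I used Google Chrome), hover over the name text "服务正在运行中" ("Service is running") on the page, then right-click and choose Inspect to show the page source. On the highlighted line in the right-hand panel, right-click and choose Copy => Copy selector, then paste the result into a text editor:

#vuln_distribution > tbody > tr.odd.vuln_high > td:nth-child(2) > span

[Image]
Print the node data

sel = "#vuln_distribution > tbody > tr.odd.vuln_high > td:nth-child(2) > span"
title = list(Soup.select(sel))
for node in title:
    # Strip surrounding whitespace from each node's text
    print("val = ", node.text.strip())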

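Important: a selector copied this way is not always usable as-is! When it does not work, it needs adjusting.
For example, to get the data in the "主机" ("Host") column, the copied selector is:

#table_4_1_1 > td > table > tbody > tr:nth-child(1) > th

[Image]
This does not return the expected data. Looking more closely, the element with id "table_4_1_1" sits under another element that also has an id, so I tried anchoring the selector one level higher:

#vuln_distribution > tbody > #table_4_1_1 > td > table > tr > th

Printing again now returns the expected data. Note that the working selector also drops the inner tbody and the :nth-child(1) filter. Browsers add tbody to the DOM automatically even when the HTML file does not contain it, so that removal, rather than the extra parent, is likely what made the difference.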
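The effect of a browser-added tbody can be reproduced on a small document. This is a sketch with made-up HTML (the id demo and the cell text are hypothetical), parsed with Python's built-in html.parser, which keeps the markup as written.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the table is written without a tbody element
html = """
<table id="demo">
  <tr><th>Host</th><td>192.168.0.1</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Selector as a browser would copy it, including the tbody it added to the DOM
with_tbody = soup.select("#demo > tbody > tr > th")
# Same selector with tbody removed, matching the markup in the file
without_tbody = soup.select("#demo > tr > th")

print(len(with_tbody), len(without_tbody))
print([th.text for th in without_tbody])
```

When a copied selector returns nothing, removing tbody (or checking the raw file for it) is a cheap first thing to try.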

2.3.3 Getting data from the different sections

In this case, "服务正在运行中" ("Service is running") and "Apache连接器" ("Apache connector") sit in different tables, so the next step is to get the data of every table:

for ititle in range(0, len(title)):
    # Select the attribute names (th) and values (td) under each sub-heading
    strsel2 = "#vuln_distribution > tbody > #table_4_1_{} > td > table > tr > th".format(ititle+1)
    strsel3 = "#vuln_distribution > tbody > #table_4_1_{} > td > table > tr > td".format(ititle+1)
    listsel2 = list(Soup.select(strsel2))
    listsel3 = list(Soup.select(strsel3))
    print("listsel2 = ", listsel2)
    print("listsel3 = ", listsel3)
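Once the th and td nodes of a table are selected, they can be paired up. A minimal sketch, assuming each row holds exactly one th and one td; the HTML and the helper table_pairs are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for one of the nested tables
html = """
<table id="t1">
  <tr><th>Port</th><td> 80 </td></tr>
  <tr><th>Protocol</th><td>tcp</td></tr>
</table>
"""


def table_pairs(soup, table_id):
    """Return (attribute, value) pairs from the th/td cells of one table."""
    names = soup.select("#{} > tr > th".format(table_id))
    values = soup.select("#{} > tr > td".format(table_id))
    return [(n.text.strip(), v.text.strip()) for n, v in zip(names, values)]


soup = BeautifulSoup(html, "html.parser")
pairs = table_pairs(soup, "t1")
print(pairs)
```

zip stops at the shorter list, so a row that lacks a td would silently shift the pairing; checking that both lists have the same length is a reasonable safeguard.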

2.3.4 Formatting the data into Excel

Collect the data into dic

# Columns of the main table
sel4 = "#vuln_distribution > tbody > tr > td:nth-of-type(3)"
sel5 = "#vuln_distribution > tbody > tr > td:nth-of-type(4)"
sel6 = "#vuln_distribution > tbody > tr > td:nth-of-type(5)"

fi4 = list(Soup.select(sel4))
fi5 = list(Soup.select(sel5))
fi6 = list(Soup.select(sel6))

# Header row: name, host, percentage, count, attribute, attribute value
dic = {0: ["名称", "主机", "百分比", "次数", "属性", "属性值"]}
# sel is the name selector from section 2.3.2
title = list(Soup.select(sel))
for ititle in range(0, len(title)):
    key = title[ititle].text.strip()
    dic[ititle+1] = [key]
    print("val = ", key)
    # Host, percentage, count
    dic[ititle+1].append(fi4[ititle].text)
    dic[ititle+1].append(fi5[ititle].text)
    dic[ititle+1].append(fi6[ititle].text)

The above only shows how the host, percentage, and count values are formatted; the other columns work the same way.
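The same numbered dict can be built from parallel lists with zip instead of indexing. This is a sketch only: build_dic is a hypothetical helper, the header names are English stand-ins, and the sample values are made up.

```python
def build_dic(names, hosts, percents, counts):
    """Build {row_number: [name, host, percentage, count]} from parallel lists."""
    dic = {0: ["name", "host", "percentage", "count"]}
    rows = zip(names, hosts, percents, counts)
    for number, row in enumerate(rows, start=1):
        # Strip whitespace that often surrounds text taken from HTML cells
        dic[number] = [cell.strip() for cell in row]
    return dic


# Made-up sample values
dic_demo = build_dic(
    [" service A ", "service B"],
    ["2", "1"],
    ["66.7%", "33.3%"],
    ["4", "2"],
)
print(dic_demo)
```

In the real script the four lists would hold the .text of the nodes returned by Soup.select.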
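Write the contents of dic to Excel
Dict ordering should not be relied on in older Python versions (insertion order is only guaranteed from Python 3.7 onward), so the keys are sorted before writing to Excel. The code below is adapted from an online source (I can no longer find which article), and it is very practical: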

from xlwt import Workbook

# Write to Excel
file = Workbook(encoding='utf-8')
table = file.add_sheet('data')

ldata = []
# Collect the keys of dic and sort them so rows are written in order
num = sorted(dic)

# Build one row per key: the key itself followed by its values
for x in num:
    t = [int(x)]
    for a in dic[x]:
        t.append(a)
    ldata.append(t)

# Write each cell; i and j are the row and column indexes from enumerate()
for i, p in enumerate(ldata):
    for j, q in enumerate(p):
        table.write(i, j, q)
# xlwt produces the legacy .xls format, so save with an .xls extension
file.save('报表汇总.xls')
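The sorting and row-building steps can be separated from the Excel writing so they are easy to check on their own. A sketch, with dict_to_rows as a hypothetical helper name and made-up sample data:

```python
def dict_to_rows(dic):
    """Turn {row_number: [values...]} into a list of rows sorted by row number.

    Each row starts with its row number, followed by the values.
    """
    return [[int(key)] + list(dic[key]) for key in sorted(dic)]


# Made-up sample data, deliberately out of order
sample = {2: ["b", "20%"], 0: ["name", "percentage"], 1: ["a", "80%"]}
rows = dict_to_rows(sample)
for row in rows:
    print(row)
```

Each rows[i][j] can then be passed to table.write(i, j, value) in the same way as the nested loop above.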

That fulfils the requirement we started with.

References:
Highly recommended post: https://blog.csdn.net/nkwshuyi/article/details/79435248
https://www.jb51.net/article/81971.htm
