Python Web Scraping Study Notes (Check-in Day 4)

科男小林

Last modified: 2024-04-25 00:41:23

Category: Python. Tags: python, web scraping study

First published: 2024-04-23 00:52:24

Article link: https://blog.csdn.net/TPLNKOYXL/article/details/138098740

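1. The bs4 module

Preface: when parsing data, what we need is usually the text inside tags. Extracting it with regular expressions is painful, because a page contains a great many identical tags and a regex tends to pull out a pile of results we did not want. So we need to learn other data-parsing modules as well. This part introduces Python's bs4 module.

I. Basic introduction

bs4 is the package name of Beautiful Soup 4, one of the libraries most commonly used when writing Python crawlers. It parses HTML and XML documents and is mainly used to pick apart HTML tags.

II. Choosing a parser

BeautifulSoup supports the HTML parser in Python's standard library, and it also supports several third-party parsers.

Commonly used parsers: html.parser, lxml, xml (also provided by the lxml package), and html5lib. I chose lxml here, which has to be installed first (for example through PyCharm's package manager). If you choose html.parser, nothing needs to be installed, because it ships with Python.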
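A minimal sketch of choosing a parser at runtime, assuming lxml may or may not be installed. The helper name make_soup is made up for this illustration; FeatureNotFound is the exception bs4 raises when the requested parser is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse with lxml if it is installed, otherwise fall back to html.parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml is not installed; the built-in parser needs no extra package.
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<p>hello <span>world</span></p>")
print(soup.p.span.text)
```

Either parser handles this small snippet the same way; differences mostly show up on malformed HTML.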

III. Using it in Python

Here is a sample HTML document:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8" />
    <title>Title</title>
</head>
<body>
    <div>
        <p>真的吗<span>一个很厉害的人</span></p>
        <ol>
            <li id="10086" class="name">周大强</li>
            <li id="10010" class="name">周芷若</li>
            <li class="name">周杰伦</li>
            <li class="name">蔡依林</li>
            <ol>
                <li>阿信</li>
                <li>信</li>
                <li>信不信</li>
            </ol>
        </ol>
    </div>
    <hr />
    <ul>
        <li><a href="http://www.baidu.com">百度</a></li>
        <li><a href="http://www.google.com">谷歌</a></li>
        <li><a href="http://www.sogou.com">搜狗</a></li>
    </ul>
    <ol>
        <li><a href="feiji">飞机</a></li>
        <li><a href="dapao">大炮</a></li>
        <li><a href="huoche">火车</a></li>
    </ol>
    <div class="job">李嘉诚</div>
    <div class="common">胡辣汤</div>
</body>
</html>

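Now parse this HTML with the bs4 module in Python.

bs4 provides two methods for locating elements by tag:

find: looks for a single result in the page and returns as soon as it finds one
find_all: looks for every matching result in the page and returns after the whole document has been searched

The two methods share the same parameter structure:

find(tag, attrs={attribute: value})

Once an element is located, we may need a URL inside the tag, an img src, and so on, as well as the text inside the tag. bs4 provides ways to get these too:

Get the value of an attribute on the tag: get(attribute_name)
Get the text inside the tag: text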
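A small self-contained sketch of find, get, and text. The inline HTML is a cut-down version of the sample document, and html.parser is used here only so the snippet runs without extra packages.

```python
from bs4 import BeautifulSoup

html = """
<ul>
    <li id="10086" class="name">周大强</li>
    <li><a href="http://www.baidu.com">百度</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first_li = soup.find("li", attrs={"class": "name"})
link = soup.find("a")
missing = soup.find("table")  # there is no <table> in this snippet

print(first_li.text)        # text inside the tag
print(first_li.get("id"))   # value of the id attribute
print(link.get("href"))     # URL stored in href
print(missing)              # find returns None when nothing matches
```

Because find can return None, it is worth checking the result before calling .text or .get on it when scraping real pages.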

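Taking the sample HTML above as an example:

from bs4 import BeautifulSoup

# Read the sample HTML saved next to this script
with open("测试用html.html", "r", encoding="utf-8") as f:
    html = f.read()

soup = BeautifulSoup(html, "lxml")
section = soup.find("li")  # first <li> in the document
print(section.text)
print(section.get("id"))

Note that find returns as soon as it reaches the first matching tag, so it stops at the first li, 周大强.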

Next, use find_all to find every li tag and print its contents:

from bs4 import BeautifulSoup

with open("测试用html.html", "r", encoding="utf-8") as f:
    html = f.read()

soup = BeautifulSoup(html, "lxml")
sections = soup.find_all("li")
for section in sections:
    print(section.text, section.get("id"))

The output shows that find_all found all the li tags and stored them in a list. Looping over the list gives access to each element's content. If an li has no id attribute, get returns None.
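As a short illustration of the None case, get also accepts a default value, and has_attr tests whether an attribute is present. The two-item HTML and the "no-id" placeholder are made up for this sketch.

```python
from bs4 import BeautifulSoup

html = '<ol><li id="10086">周大强</li><li>周杰伦</li></ol>'
soup = BeautifulSoup(html, "html.parser")

rows = []
for li in soup.find_all("li"):
    # The second argument is returned when the attribute is missing
    rows.append((li.text, li.get("id", "no-id")))

# Keep only the items that actually carry an id attribute
with_id = [li.text for li in soup.find_all("li") if li.has_attr("id")]

print(rows)
print(with_id)
```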

We can also pass attribute values to find_all to pinpoint exactly the elements we need:

from bs4 import BeautifulSoup

with open("测试用html.html", "r", encoding="utf-8") as f:
    html = f.read()

soup = BeautifulSoup(html, "lxml")
sections = soup.find_all("li", attrs={"class": "name"})
for section in sections:
    print(section.text, section.get("id"))

Passing the second argument restricts the match by attribute, so only the li items of the first list (those with class="name") are returned.
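For comparison, a sketch of two related find_all options: the class_ keyword (an alternative spelling of the attrs filter) and limit, which caps the number of results. The inline HTML is a shortened copy of the sample list.

```python
from bs4 import BeautifulSoup

html = """
<ol>
    <li id="10086" class="name">周大强</li>
    <li id="10010" class="name">周芷若</li>
    <li class="name">周杰伦</li>
    <li>阿信</li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.find_all("li", attrs={"class": "name"})
# class_ has a trailing underscore because class is a reserved word in Python
by_keyword = soup.find_all("li", class_="name")
# limit stops the search after the given number of matches
first_two = soup.find_all("li", class_="name", limit=2)

print([li.text for li in by_attrs])
print([li.text for li in by_keyword])
print([li.text for li in first_two])
```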

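That is all there is to know about find() and find_all() for now.

IV. Introduction to CSS selectors

Some commonly used selectors:

1. ID selector              #id-value
2. Tag selector             tag
3. Class selector           .class-name
4. Selector grouping        selector1, selector2
5. Descendant selector      ancestor descendant (separated by a space)
6. Child selector           parent > child
7. Attribute selector       [attribute=value]

For more selectors, see: CSS selectors - CSS: Cascading Style Sheets | MDN (mozilla.org)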
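The selector types in the list can be tried out with bs4's select method, which takes a CSS selector string and returns the matching tags. A rough sketch follows; the HTML and the helper name texts are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div id="box">
    <p class="intro">intro</p>
    <ul>
        <li class="name">A</li>
        <li class="name">B</li>
        <li><a href="feiji">plane</a></li>
    </ul>
</div>
<div class="job">job</div>
"""
soup = BeautifulSoup(html, "html.parser")


def texts(selector):
    """Return the stripped text of every element matched by a CSS selector."""
    return [tag.get_text(strip=True) for tag in soup.select(selector)]


print(texts("#box > p"))          # ID selector combined with a child selector
print(texts("li"))                # tag selector
print(texts(".name"))             # class selector
print(texts("p, .job"))           # selector grouping
print(texts("#box li"))           # descendant selector
print(texts("ul > li"))           # child selector
print(texts("a[href='feiji']"))   # attribute selector
```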

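Using CSS selectors to get page content

By analogy with find and find_all above, there are also two methods for locating data with selectors:

select_one(selector): uses a selector to get tags from the HTML document, returning one
select(selector): uses a selector to get tags from the HTML document, returning all of them

Examples:

sections = soup.select("li")
for section in sections:
    print(section.text, section.get("id"))

The output is the same as that of the find_all("li") example above.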

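sections = soup.select("ol li")
for section in sections:
    print(section.text, section.get("id"))

This selects every li that sits anywhere inside an ol, including the items of the nested ol.

sections = soup.select("ol > ol > li")
for section in sections:
    print(section.text, section.get("id"))

This selects only the li items that are direct children of an ol which is itself a direct child of another ol, that is, the items of the nested list.

Note the difference: the descendant selector picks up every target tag beneath the current tag, no matter how many levels down, while the child selector only looks one level down.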
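The depth difference can be seen on a tiny made-up document, where one span is a direct child of the div and another is a grandchild:

```python
from bs4 import BeautifulSoup

html = """
<div id="box">
    <span>direct child</span>
    <p><span>grandchild</span></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Descendant selector: any span below #box, at any depth
descendants = [s.text for s in soup.select("#box span")]
# Child selector: only spans whose parent is #box itself
children = [s.text for s in soup.select("#box > span")]

print(descendants)
print(children)
```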

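Some more selector statements, with explanations:

sections = soup.select("li[class='name']")
# All li elements whose class attribute is 'name'
sections = soup.select("li[class='name'],div[class='job']")
# All li tags with class "name" and all div tags with class "job"
section = soup.select_one("#10086")
# Intended meaning: the element whose id is 10086
### Note, however, that this statement raises an error: an ID selector must be
### followed by a valid CSS identifier, and an identifier cannot start with a digit.
### This HTML was written casually; a properly written page would not have this problem.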
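If a page does use an id that starts with a digit, two possible workarounds are sketched below: an attribute selector, which takes the id as a quoted string, and find, which does not parse CSS at all. The one-item HTML is taken from the sample document.

```python
from bs4 import BeautifulSoup

html = '<ol><li id="10086" class="name">周大强</li></ol>'
soup = BeautifulSoup(html, "html.parser")

# The value is a quoted string here, so a leading digit is allowed
by_attribute = soup.select_one('li[id="10086"]')
# find filters on the attribute directly, with no CSS involved
by_find = soup.find("li", id="10086")

print(by_attribute.text)
print(by_find.text)
```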

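A big pitfall I ran into:

Do not add spaces carelessly when writing a selector. A space means the descendant selector, so a stray space changes the selector's meaning and you get no results. I spent a long time unable to find what was wrong.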
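A minimal reproduction of this pitfall, using two items from the sample list: the only difference between the two selectors is one space.

```python
from bs4 import BeautifulSoup

html = '<ol><li class="name">周大强</li><li class="name">周芷若</li></ol>'
soup = BeautifulSoup(html, "html.parser")

# "li.name": li elements that have the class name
no_space = soup.select("li.name")
# "li .name": elements with the class name that are INSIDE an li
with_space = soup.select("li .name")

print(len(no_space))
print(len(with_space))
```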

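2. Scraping Beike rental listings

Target page: Chengdu Wuhou rental listings, Chengdu Beike Rentals (ke.com)

First, look at the page and decide which information to extract. Then view the page source to check whether that information is present there.

It clearly is in the page source, so we can start scraping.

A global search of the source for class="content__list--item--main" finds 30 identical tags, and the page shows 30 rental listings as well, so we can locate each listing by its class and then parse the listings one by one.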
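One detail worth knowing when locating elements by class: the attribute form [class='...'] compares the whole attribute string, while the class selector .name matches any element that has that class among others. A sketch on made-up HTML follows; the second div with an extra class is hypothetical, and whether the real page contains such variations is not verified here.

```python
from bs4 import BeautifulSoup

html = """
<div class="content__list--item--main">listing 1</div>
<div class="content__list--item--main extra">listing 2</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute selector: the class attribute must equal this exact string
exact = soup.select("div[class='content__list--item--main']")
# Class selector: the element only needs to carry this class
by_class = soup.select("div.content__list--item--main")

print([d.text for d in exact])
print([d.text for d in by_class])
```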

Full code:

import requests
from bs4 import BeautifulSoup
import xlwt


url = "https://cd.zu.ke.com/zufang/wuhou/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
}
resp = requests.get(url, headers=headers)
main_page = resp.text
# Parse with the lxml parser
soup = BeautifulSoup(main_page, "lxml")
# Locate every div tag with class='content__list--item--main'
sections = soup.select("div[class='content__list--item--main']")
total_info_list = []
# Walk through each listing
for section in sections:
    # Get the full title
    section_all_title = section.select_one("div > p > a").text.strip()
    # Split the title into name, layout, and orientation
    info_lst = section_all_title.split(" ")
    section_title = info_lst[0]
    house_type = info_lst[1]
    house_direction = info_lst[2]
    # The second <p> holds the location links
    t_section = section.select("div > p")[1]
    t_section_lst = t_section.select("a")
    # Join the parts with "-" to match how the page displays them
    section_location = "-".join(s.text.strip() for s in t_section_lst)
    # Rent
    section_price = section.select_one("div > span").text.strip()
    total_info_list.append([section_title, section_location, house_type, house_direction, section_price])

book = xlwt.Workbook(encoding="utf-8", style_compression=0)
sheet = book.add_sheet('贝壳租房信息', cell_overwrite_ok=True)
# Header row: name, location, layout, orientation, rent
col = ["名称","具体位置","户型","朝向","租金"]
for i, name in enumerate(col):
    sheet.write(0, i, name)
# One row per listing, starting below the header
for j, single_info in enumerate(total_info_list):
    for k, value in enumerate(single_info):
        sheet.write(j + 1, k, value)

book.save('贝壳租房信息成都武侯区.xls')
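The parsing loop indexes info_lst[1] and info_lst[2] directly, which raises IndexError if a listing title has fewer than three space-separated parts. A defensive sketch is shown below; the function name split_title and the sample titles are made up, and real titles on the page may look different.

```python
def split_title(title, parts=3):
    """Split a listing title into exactly `parts` fields, padding with ''."""
    # split() with no argument also collapses runs of whitespace
    fields = title.split()
    # Pad short titles and ignore anything beyond the expected fields
    return (fields + [""] * parts)[:parts]


name, house_type, direction = split_title("整租·示例小区 2室1厅 南")
print(name, house_type, direction)
print(split_title("独栋·示例公寓"))
```

Padding with empty strings keeps every row the same length, so the spreadsheet columns stay aligned even for incomplete titles.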

Final result: the target data was retrieved successfully.
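As an alternative to xlwt (not what the code above uses), the same rows could be written as CSV with only the standard library. In this sketch the data row is made up, and an in-memory buffer stands in for a real file so the snippet has no side effects.

```python
import csv
import io

header = ["名称", "具体位置", "户型", "朝向", "租金"]
# Made-up sample row in the same column order as total_info_list
rows = [["整租·示例小区", "武侯-示例板块-示例小区", "2室1厅", "南", "2000 元/月"]]

# For a real file: open("out.csv", "w", newline="", encoding="utf-8-sig")
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)

print(buffer.getvalue())
```

The utf-8-sig encoding mentioned in the comment is a common choice so that Excel detects the encoding of Chinese text correctly.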

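3. Summary

Today I studied Python's bs4 module, got to know CSS selectors, fell into quite a few pitfalls, and in the end managed to scrape the Beike rental listings, so I would call it a success.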
