Web Scraping (09) bs4 (Part 2): the select() method + modifying the document tree + weather information case study

Latest recommended article published on 2024-09-26 20:57:59

辉子2020

Category: Web scraping

Article link: https://blog.csdn.net/m0_46738467/article/details/112681967

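This article shows how to use BeautifulSoup's select() method to extract data from HTML with CSS selectors: by tag, by class name, by id, and with combined selectors. It then shows how to modify the document tree: renaming tags, changing attributes, and adding and removing elements. Finally, it walks through a case study that scrapes weather information, covering the approach, the implementation steps, and the problems encountered along the way with provincial capital names and with garbled text for the Hong Kong, Macau and Taiwan regions.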

Chapter 9: bs4 (Part 2) and the weather information case study

1. The select() method

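We can also extract data with CSS selectors, which requires some knowledge of CSS syntax. For details, see the CSS selector reference manual.
Sample code (all later snippets build on this one):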

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')

1.1 Selecting by tag

# Select and print all a tags
print(soup.select('a'))

The result is a list of all a tags; iterate over the list to pull out the elements you need.

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
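
Iterating over the returned list could look like the sketch below. It rebuilds a shortened version of the sample document so that it runs on its own, and it uses the built-in 'html.parser' instead of 'lxml' (an assumption made only to avoid the extra dependency). The variable names are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Each item in the list is a Tag, so its text and attributes can be read per element
links = [(a.get_text(), a["href"]) for a in soup.select("a")]
for name, href in links:
    print(name, href)
```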

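1.2 Selecting by class name

For example, to find all elements whose class is "sister" (in this document, the three a tags):

# Select and print all tags with class="sister"
print(soup.select('.sister'))

The result is returned as a list.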

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

1.3 Selecting by id

# Select by id, using the #id syntax
print(soup.select('#link1'))

Result

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
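
Since an id normally matches at most one element, select_one() can be handier than indexing into the list. The following is a minimal sketch on a one-line version of the sample document; '#link9' is a made-up id used only to show the no-match case, and 'html.parser' is assumed in place of 'lxml'.

```python
from bs4 import BeautifulSoup

html_doc = '<p><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.select_one("#link1")    # a single Tag rather than a list
missing = soup.select_one("#link9")  # hypothetical id: no match gives None
print(first["href"])
print(missing)
```

Checking the return value for None before reading attributes avoids an error when the id is absent from the page.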

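1.4 Combined selectors

Select all a tags that are direct children of a p tag:

# Child combinator: a tags directly inside a p tag
print(soup.select('p>a'))

The result is returned as a list.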

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Select the tag with id=link1 inside a p tag
print(soup.select('p #link1'))

Result

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

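Select all p elements and all a elements

# Select all p elements and all a elements
print(soup.select('p,a'))

The result contains every p tag and every a tag, returned as a list.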

[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, <p class="story">...</p>]

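Select all a elements inside p elements (at any depth)

# Descendant combinator: a elements anywhere inside a p element
print(soup.select('p a'))

The result is returned as a list.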

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
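
In the sample document 'p>a' and 'p a' return the same three tags, because every a tag sits directly inside a p tag. The difference only shows up with deeper nesting, as in this sketch. The HTML snippet and its ids are invented for illustration, and 'html.parser' is assumed in place of 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one <a> sits directly inside <p>, the other is wrapped in <b>
html = '<p><a id="direct">one</a><b><a id="nested">two</a></b></p>'
soup = BeautifulSoup(html, "html.parser")

child_ids = [a["id"] for a in soup.select("p > a")]     # direct children only
descendant_ids = [a["id"] for a in soup.select("p a")]  # any depth below <p>
print(child_ids)
print(descendant_ids)
```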

Select all elements that have an href attribute

# Select all elements that have an href attribute
print(soup.select('[href]'))

The result is returned as a list.

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
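
Attribute selectors can also test the attribute's value, not just its presence. The sketch below uses the standard CSS operators ^= (starts with), $= (ends with) and *= (contains) on a shortened copy of the sample document; the result variable names are illustrative and 'html.parser' is assumed in place of 'lxml'.

```python
from bs4 import BeautifulSoup

html_doc = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# href starts with the given prefix
starts = [a["id"] for a in soup.select('a[href^="http://example.com/"]')]
# href ends with the given suffix
ends = [a["id"] for a in soup.select('a[href$="lacie"]')]
# href contains the given substring
contains = [a["id"] for a in soup.select('a[href*="ill"]')]
print(starts, ends, contains)
```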

# Select all elements whose class attribute is "sister"
print(soup.select('[class=sister]'))

The result is returned as a list.

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
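
Here '[class=sister]' and '.sister' give the same result, but they are not equivalent when an element carries several classes: the attribute form compares the whole attribute value, while the class form matches any one of the classes. The sketch below illustrates this on two invented td cells (the class names mirror those in the table example that follows); 'html.parser' is assumed in place of 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical cells: the first has two classes, the second has one
html = '<td class="l square">A</td><td class="l">B</td>'
soup = BeautifulSoup(html, "html.parser")

by_class = [td.get_text() for td in soup.select(".l")]          # any element with class l
by_exact = [td.get_text() for td in soup.select('[class="l"]')] # attribute value is exactly "l"
by_word = [td.get_text() for td in soup.select('[class~="l"]')] # "l" is one of the space-separated words
print(by_class, by_exact, by_word)
```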

# Get the text of the tag whose class is "title"
print(soup.select('.title')[0].get_text())

Because soup.select('.title') returns a list, we first take the element out of the list and then call .get_text() on it.

The Dormouse's story
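
The point about the list can be checked directly: calling .get_text() on the result of select() fails, while calling it on an element of that result works. This is a minimal sketch on a one-line version of the sample document, assuming 'html.parser' in place of 'lxml'; the flag name is illustrative.

```python
from bs4 import BeautifulSoup

html_doc = "<p class=\"title\"><b>The Dormouse's story</b></p>"
soup = BeautifulSoup(html_doc, "html.parser")

titles = soup.select(".title")  # a list-like result, even for a single match

try:
    titles.get_text()           # the list itself has no get_text()
    list_call_failed = False
except AttributeError:
    list_call_failed = True

text = titles[0].get_text()     # index first, then read the text
print(list_call_failed, text)
```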

1.5 Example

Let's redo the example from the previous lesson using the select() method.

from bs4 import BeautifulSoup
html = """
<table class="tablelist" cellpadding="0" cellspacing="0">
    <tbody>
        <tr class="h">
            <td class="l" width="374">职位名称</td>
            <td>职位类别</td>
            <td>人数</td>
            <td>地点</td>
            <td>发布时间</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=33824&keywords=python&tid=87&lid=2218">22989-金融云区块链高级研发工程师（深圳）</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=29938&keywords=python&tid=87&lid=2218">22989-金融云高级后台开发</a></td>
            <td>技术类</td>
            <td>2</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31236&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐运营开发工程师（深圳）</a></td>
            <td>技术类</td>
            <td>2</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href=