Crawler Notes 15: The select() Method in bs4 and Modifying the Document Tree

进阶的阿牛哥

Article link: https://blog.csdn.net/weixin_49167820/article/details/116859209

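I. The select() method
select() lets us extract data with CSS selectors, so it requires a working knowledge of CSS selector syntax.

select() always returns a list of matching tags, even when only one tag matches.

1. Common ways to select: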

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, "lxml")
# 1. Find <a> tags by tag name
print(soup.select('a'))
# 2. Find by class name: class="sister"
print(soup.select('.sister'))
# 3. Find by id: id="link1"
print(soup.select('#link1'))
# 4. Child combinator: select every <title> element whose parent is <head>.
#    The whole selector is a single string: 'head > title', not 'head' > 'title'.
print(soup.select('head > title'))
# 5. Get the text content
print(soup.select('title')[0].string)
print(soup.select('title')[0].get_text())

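Result: [screenshot not preserved]
For a more detailed reference on CSS selectors, see: https://www.w3school.com.cn/cssref/css_selectors.asp
In practice, the selectors shown above cover most scraping needs.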
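
A few more selector forms can be handy. The following is a minimal sketch that goes slightly beyond the original examples: select_one() returns only the first match (or None), and class, child, and attribute selectors can be combined. It uses Python's built-in "html.parser" instead of lxml purely to avoid an extra dependency; that swap is an assumption of this sketch, not part of the original notes.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# select_one() returns the first matching tag, or None when nothing matches
first_link = soup.select_one("a.sister")
missing = soup.select_one("#no-such-id")

# Class selector plus child combinator: <a> tags directly inside <p class="story">
story_links = soup.select("p.story > a")

# Attribute selector: <a> tags whose href ends with "tillie"
tillie = soup.select('a[href$="tillie"]')

# Attribute values are read with dictionary-style access
hrefs = [a["href"] for a in story_links]

print(first_link.get_text())
print(missing)
print(hrefs)
print(tillie[0]["id"])
```

Because select() returns a list, indexing it with [0] raises IndexError when nothing matches; select_one() avoids that at the cost of a None check.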

2. Get all tr tags with class="even" (soup here is built from the table HTML shown in item 3 below)

trs = soup.select('.even')
print(trs)

Or:

trs = soup.select('tr[class="even"]')
print(trs)
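
The two forms give the same result for class="even", but they are not always interchangeable. As a minimal sketch (the two-row table is made up for illustration, and "html.parser" is used only to keep the snippet dependency-free): a class selector such as .l matches any tag whose class list contains l, while [class="l"] compares against the whole attribute value.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table, modelled loosely on the job table in these notes
html = """
<table>
  <tr class="h"><td class="l">Title</td><td>Category</td></tr>
  <tr class="even"><td class="l square">Engineer</td><td>Tech</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select("td.l")          # class list contains "l"
by_attr = soup.select('td[class="l"]')  # class attribute is exactly "l"

print(len(by_class))
print(len(by_attr))
print(by_attr[0].get_text())
```

When a tag may carry several classes, the .classname form is usually the safer choice.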

3. stripped_strings returns a generator; wrap it in list() to see its contents.

from bs4 import BeautifulSoup

html = """
<table class="tablelist" cellpadding="0" cellspacing="0">
    <tbody>
        <tr class="h">
            <td class="l" width="374">职位名称</td>
            <td>职位类别</td>
            <td>人数</td>
            <td>地点</td>
            <td>发布时间</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=33824&keywords=python&tid=87&lid=2218">22989-金融云区块链高级研发工程师（深圳）</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=29938&keywords=python&tid=87&lid=2218">22989-金融云高级后台开发</a></td>
            <td>技术类</td>
            <td>2</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31236&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐运营开发工程师（深圳）</a></td>
            <td>技术类</td>
            <td>2</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31235&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐业务运维工程师（深圳）</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=34531&keywords=python&tid=87&lid=2218">TEG03-高级研发工程师（深圳）</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=34532&keywords=python&tid=87&lid=2218">TEG03-高级图像算法研发工程师（深圳）</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31648&keywords=python&tid=87&lid=2218">TEG11-高级AI开发工程师（深圳）</a></td>
            <td>技术类</td>
            <td>4</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=32218&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=32217&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a id="test" class="test" target='_blank' href="position_detail.php?id=34511&keywords=python&tid=87&lid=2218">SNG11-高级业务运维工程师（深圳）</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
    </tbody>
</table>
"""
soup = BeautifulSoup(html, "lxml")
trs = soup.select('tr')[1:]  # skip the header row
# print(trs)
tr = trs[0]  # first data row
print(tr)
print(type(tr.stripped_strings))
print(list(tr.stripped_strings))  # stripped_strings returns a generator; list() materializes it for display

Result: [screenshot not preserved]
Combined with a for loop:
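
The loop itself is not preserved in the source. The following is a minimal sketch of what combining select() with a for loop might look like, assuming a shortened two-row version of the table above with abbreviated hrefs. The dictionary keys are illustrative names, not from the original, and "html.parser" is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the job table used in these notes
html = """
<table class="tablelist">
  <tr class="h"><td>职位名称</td><td>职位类别</td><td>人数</td><td>地点</td><td>发布时间</td></tr>
  <tr class="even"><td><a href="position_detail.php?id=33824">22989-金融云区块链高级研发工程师（深圳）</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-25</td></tr>
  <tr class="odd"><td><a href="position_detail.php?id=29938">22989-金融云高级后台开发</a></td><td>技术类</td><td>2</td><td>深圳</td><td>2017-11-25</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

jobs = []
for tr in soup.select("tr")[1:]:       # skip the header row
    infos = list(tr.stripped_strings)  # one stripped string per cell
    jobs.append({
        "title": infos[0],
        "category": infos[1],
        "count": infos[2],
        "location": infos[3],
        "date": infos[4],
    })

for job in jobs:
    print(job)
```

Indexing infos by position assumes every row has the same five cells; a row with an empty cell would yield fewer strings and need a length check.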

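II. Modifying the document tree
• Modify a tag's name and attributes
• Modify string: assigning to the .string attribute replaces the tag's existing content with the new content
• append(): adds content to a tag, much like the .append() method of a Python list
• decompose(): removes a tag such as a paragraph, which is useful for dropping parts of an article you do not need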

1. Modify a tag's name and attributes
[screenshot not preserved]
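
Since the original example for this item survives only as a missing screenshot, here is a minimal sketch of renaming a tag and editing its attributes. The new tag name and the attribute values ("div", "content", "intro") are made up for illustration, and "html.parser" is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title"><b>The Dormouse\'s story</b></p>', "html.parser")
tag_p = soup.p

tag_p.name = "div"          # rename the tag: <p> becomes <div>
tag_p["class"] = "content"  # overwrite an existing attribute
tag_p["id"] = "intro"       # add a new attribute
print(soup)

del tag_p["id"]             # remove an attribute
print(soup)
```

Attributes behave like dictionary entries on the tag, so assignment, lookup, and del all work the way they do on a dict.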
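2. Modify string: assigning to the .string attribute replaces the tag's existing content with the new content

3. append(): adds content to a tag, much like the .append() method of a Python list. The example below covers both items 2 and 3.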

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

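soup = BeautifulSoup(html_doc, "lxml")
tag_p = soup.p  # the first <p> tag in the document
print(tag_p)
# print(tag_p.string)
tag_p.string = 'you need python'  # replaces everything inside the tag, including the <b> child
print(tag_p)

tag_p.append('123')  # adds text after the existing content
print(tag_p)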

Result: [screenshot not preserved]
4. decompose(): remove a paragraph
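
No example for this item survives in the source. A minimal sketch, assuming a trimmed version of the sample document and using "html.parser" only to avoid the lxml dependency: decompose() removes the tag and everything inside it from the tree.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

title_p = soup.select_one("p.title")
title_p.decompose()  # remove the tag and its children from the tree

remaining = soup.select("p")
print(len(remaining))
print(soup.select("p.title"))
print(soup)
```

After decompose() the removed tag should not be used again; if the removed element is still needed, extract() is the alternative that takes it out of the tree and returns it.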
