A First Look at Python Web Scraping (8): An Introduction to Beautifulsoup4 (Part II)

brilliant666

Categories: python, web scraping. Tags: python, html

Article link: https://blog.csdn.net/brilliant666/article/details/107659623

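I. More on Traversal
II. Searching the Tree
III. find_all() and find()
IV. Modifying the Document Tree

The previous chapter covered bs4's traversal features and showed how to walk child nodes. This chapter continues with parent nodes and the remaining features.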

I. More on Traversal

1. Traversing parent nodes

As before, import the modules first.

from bs4 import BeautifulSoup
import re

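parent: returns the direct parent node
parents: yields every ancestor node, from the nearest to the farthest

We use the same document as in the previous chapter: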

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

First, create a bs4 object:

soup = BeautifulSoup(html_doc,'lxml')

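(1) Walk up to a parent tag:

# Find the title tag
title_tag = soup.title
print(title_tag)
# Result: <title>The Dormouse's story</title>

# Find the parent of the title tag
print(title_tag.parent)
# Result: <head><title>The Dormouse's story</title></head>

Does the html tag have a parent as well?

print(soup.html.parent)

Running this prints the whole document. The parent of the html tag is not the html tag itself but the BeautifulSoup object that represents the document, and that object prints the same markup.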
(2) Get the a tag and iterate over all of its ancestors

a_tag = soup.a
for p in a_tag.parents:
    print(p)
    print('--------------------')

Here is the result:

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
--------------------
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
--------------------

The output is long, so only the first two ancestors are shown here. Try it yourself to see the rest.
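To see the whole ancestor chain without the long output, it is enough to print only the tag names. The sketch below is an illustration, not the article's original code: it assumes a shortened copy of the document and uses the built-in "html.parser" (instead of "lxml") so that it runs without extra packages; the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Shortened copy of the article's document (assumption for this sketch)
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>"""

# "html.parser" ships with Python; the article itself uses "lxml"
soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a
# .parents walks upward: nearest ancestor first
ancestor_names = [ancestor.name for ancestor in first_link.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']

# The chain ends at the BeautifulSoup object, which is the parent of <html>
print(soup.html.parent is soup)  # True
# The BeautifulSoup object itself has no parent
print(soup.parent)  # None
```

The last entry, '[document]', is the name of the BeautifulSoup object, which explains why printing soup.html.parent shows the whole document.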

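2. Traversing sibling nodes

next_sibling: the next sibling node
previous_sibling: the previous sibling node
next_siblings: all following sibling nodes
previous_siblings: all preceding sibling nodes

This time we use a new structure: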

html = '<a><b>bbb</b><c>ccc</c></a>'
soup2 = BeautifulSoup(html,'lxml')

(1) Pretty-print the HTML. Note that the lxml parser has also filled in the missing html and body tags.

print(soup2.prettify())
# Result:
<html>
 <body>
  <a>
   <b>
    bbb
   </b>
   <c>
    ccc
   </c>
  </a>
 </body>
</html>

(2) Find the b tag and its next sibling

b_tag = soup2.b

print(b_tag)
# Result: <b>bbb</b>

print(b_tag.next_sibling)
# Result: <c>ccc</c>

(3) Find the previous sibling of the c tag

c_tag = soup2.c
print(c_tag.previous_sibling)
# Result: <b>bbb</b>

(4) Find all siblings that follow an a tag
Here we go back to the more complex HTML from the beginning.

a_tag = soup.a
for x in a_tag.next_siblings:
    print(x)

The result:

,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
;
and they lived at the bottom of a well.

(5) Find all siblings that precede an a tag (a different a tag this time)

# Locate the a tag with id="link3"
a_tag = soup.find(id="link3")

for x in a_tag.previous_siblings:
    print(x)

The result:

 and

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
,

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
Once upon a time there were three little sisters; and their names were

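Note that previous_siblings walks backward: it starts with the nearest preceding sibling and ends with the first one. The text between the tags counts as sibling nodes too.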
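Because text nodes are siblings as well, a common follow-up is to keep only the siblings that are real tags. The following is a minimal sketch under a few assumptions: the markup is a trimmed version of the story paragraph, the parser is the built-in "html.parser", and the variable names are illustrative.

```python
from bs4 import BeautifulSoup, Tag

# Trimmed version of the story paragraph (assumption for this sketch)
story_html = ('<p class="story">Once upon a time: '
              '<a id="link1">Elsie</a>, <a id="link2">Lacie</a> and '
              '<a id="link3">Tillie</a>; the end.</p>')
soup = BeautifulSoup(story_html, "html.parser")

first_link = soup.find(id="link1")
# The direct next sibling is the text node ", ", not the next <a> tag
print(repr(first_link.next_sibling))

# Keep only the siblings that are tags
following_ids = [s["id"] for s in first_link.next_siblings if isinstance(s, Tag)]
print(following_ids)  # ['link2', 'link3']

last_link = soup.find(id="link3")
# previous_siblings yields the nearest sibling first
preceding_ids = [s["id"] for s in last_link.previous_siblings if isinstance(s, Tag)]
print(preceding_ids)  # ['link2', 'link1']
```

Checking isinstance(s, Tag) is one way to skip text nodes; the find_next_siblings() family listed later in this article does the same filtering by tag name.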

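II. Searching the Tree

Searching the tree relies on the following kinds of filters:
String filter
Regular expression filter:
Compile a pattern with re.compile() and pass it to find() or find_all(); tags are then matched by testing the pattern against their names.
List filter
True filter
Function filter

Here is how each one is used:
String filter

# find() returns the first match directly
a_tag = soup.find('a')
print(a_tag)

The result:

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# find_all() returns every match as a list
a_tags = soup.find_all('a')
print(a_tags)

The result is a list: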

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" 
href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" 
href="http://example.com/tillie" id="link3">Tillie</a>]

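Regular expression filter
The re module must be imported beforehand.

print(soup.find(re.compile('title')))
# Result: <title>The Dormouse's story</title>

print(soup.find_all(re.compile('t')))
# Result: a list of the tags whose names contain "t", here html and title. Because html is the root tag, the first item prints as the whole document.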
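The effect of the pattern is easier to see when only the tag names are printed. This sketch assumes a shortened copy of the document and the built-in "html.parser"; tag_names is a hypothetical helper written for this example, not part of bs4.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the article's document (assumption for this sketch)
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")


def tag_names(pattern):
    """Return the names of all tags whose name matches the pattern."""
    return [tag.name for tag in soup.find_all(re.compile(pattern))]


# The pattern is searched anywhere in the tag name
print(tag_names("t"))   # ['html', 'title']
# Anchoring with ^ keeps only names that start with the letter
print(tag_names("^t"))  # ['title']
print(tag_names("^b"))  # ['body', 'b']
```

An unanchored pattern such as 't' is rarely what you want, since it also matches html; anchoring it usually gives a more predictable result.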

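List filter

# Find all p tags and a tags
print(soup.find_all(['p','a']))
# Find all title tags and b tags
print(soup.find_all(['title','b']))

Both results are returned as lists.

True filter

# True matches every tag, but no text nodes
print(soup.find_all(True))

Function filter

# Define a function that takes a tag and returns True when the tag should be kept
def fn(tag):
    return tag.has_attr('class')
print(soup.find_all(fn))

These are the filters most commonly used when searching the tree.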
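The list, True, and function filters can be compared side by side by counting what each one returns. This is only a sketch: it assumes a shortened copy of the document and the built-in "html.parser", and has_class_but_no_id is a helper defined for this example.

```python
from bs4 import BeautifulSoup

# Shortened copy of the article's document (assumption for this sketch)
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# List filter: a tag matches if its name equals any item in the list
list_names = [tag.name for tag in soup.find_all(["title", "b"])]
print(list_names)  # ['title', 'b']

# True filter: every tag in the document, in document order
all_tags = soup.find_all(True)
print(len(all_tags))  # 11


def has_class_but_no_id(tag):
    """Keep tags that carry a class attribute but no id attribute."""
    return tag.has_attr("class") and not tag.has_attr("id")


# Function filter: only the three <p> tags qualify, the <a> tags all have an id
filtered_names = [tag.name for tag in soup.find_all(has_class_but_no_id)]
print(filtered_names)  # ['p', 'p', 'p']
```

A function filter is the most flexible of the five, since any condition that can be written in Python can decide whether a tag is kept.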

III. find_all() and find()

1. find_all() and find()

find_all() returns every matching tag as a list.
find() returns only the first match.
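The two methods also differ in what they return when nothing matches, which matters when the result is used right away. A small sketch, assuming a two-link fragment and the built-in "html.parser" (both chosen only for this example):

```python
from bs4 import BeautifulSoup

# Tiny fragment used only for this illustration
soup = BeautifulSoup('<p><a id="link1">Elsie</a> <a id="link2">Lacie</a></p>',
                     "html.parser")

first_link = soup.find("a")
all_links = soup.find_all("a")
print(first_link["id"])                # link1
print([a["id"] for a in all_links])    # ['link1', 'link2']

# When nothing matches, find() gives None and find_all() gives an empty list
missing_one = soup.find("table")
missing_all = soup.find_all("table")
print(missing_one, missing_all)        # None []

# Guard against None before touching attributes of a find() result
table_id = missing_one["id"] if missing_one is not None else "no table found"
print(table_id)
```

Forgetting the None check is a frequent cause of "'NoneType' object is not subscriptable" errors in scraping scripts.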

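These are the parameters of find_all():

def find_all(self, name=None, attrs={}, recursive=True, text=None,
                 limit=None, **kwargs):

Where:
name: the tag name
attrs: the tag's attributes
recursive: whether to search all descendants (True) or only direct children (False)
text: the text content (newer versions of Beautiful Soup call this parameter string and still accept text)
limit: the maximum number of results to return
kwargs: keyword arguments, treated as attribute filters

Here is how they are used:
Search by tag name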

a_tags = soup.find_all('a')

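# A string passed as the second argument is matched against the class attribute
p_tags = soup.find_all('p','title')
print(p_tags)
# Result: [<p class="title"><b>The Dormouse's story</b></p>]

Search by keyword argument

print(soup.find_all(id='link1'))
# Result: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Limit the number of results: limit sets the maximum number of matches returned. If limit is larger than the number of matches, all matches are returned.

# A limit of 0 (treated as no limit here)
print(soup.find_all('a',limit = 0))

Search direct children only by turning off recursion

print(soup.find_all('a',recursive=False))
# Result: an empty list, because no a tag is a direct child of the BeautifulSoup object

Search by text content

print(soup.find_all(text = re.compile('story')))
# Result: ["The Dormouse's story", "The Dormouse's story"]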
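The parameters can be combined, and trying them on one document makes their effect easier to compare. The sketch below assumes a shortened copy of the document, the built-in "html.parser", and a Beautiful Soup version recent enough to accept the string argument (the newer name for text); the variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the article's document (assumption for this sketch)
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# attrs takes a dict of attribute names and values
by_attrs = soup.find_all("a", attrs={"id": "link2"})
print(len(by_attrs))  # 1

# class is a Python keyword, so the keyword argument is spelled class_
first_two = [a["id"] for a in soup.find_all("a", class_="sister", limit=2)]
print(first_two)  # ['link1', 'link2']

# A limit larger than the number of matches simply returns all of them
print(len(soup.find_all("a", limit=10)))  # 3

# recursive=False looks at direct children only
top_level_p = soup.body.find_all("p", recursive=False)
nested_a = soup.body.find_all("a", recursive=False)
print(len(top_level_p), len(nested_a))  # 3 0

# string matches text nodes instead of tags
story_strings = soup.find_all(string=re.compile("story"))
print(len(story_strings))  # 2
```

Note that recursive=False depends on where the search starts: the a tags are grandchildren of body, so they are found from soup.body only when recursion is on.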

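2. Going further

These methods work much like the traversal of children, parents, and siblings shown earlier, except that they accept the same filters as find_all(). Give them a try.
find_parents(): search all ancestors
find_parent(): search for a single ancestor
find_next_siblings(): search all following siblings
find_next_sibling(): search for a single following sibling
find_previous_siblings(): search all preceding siblings
find_previous_sibling(): search for a single preceding sibling
find_all_next(): search all elements that come later in the document
find_next(): search for a single element that comes later in the document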
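As a quick way to try these methods, the sketch below starts from the second link and searches in each direction. It assumes a shortened copy of the document and the built-in "html.parser"; the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Shortened copy of the article's document (assumption for this sketch)
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")
link2 = soup.find(id="link2")

# Upward: the nearest enclosing <p>
parent_p = link2.find_parent("p")
print(parent_p["class"])  # ['story']

# Sideways: text nodes between the links are skipped because we ask for <a>
next_link = link2.find_next_sibling("a")
prev_link = link2.find_previous_sibling("a")
print(prev_link["id"], next_link["id"])  # link1 link3

# Forward in document order, regardless of nesting
later_links = [a["id"] for a in soup.find(id="link1").find_all_next("a")]
print(later_links)  # ['link2', 'link3']

next_paragraph = link2.find_next("p")
print(next_paragraph.get_text())  # ...
```

Compared with the plain next_sibling attribute, find_next_sibling("a") saves the manual filtering of text nodes.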

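IV. Modifying the Document Tree

Sometimes we want to modify the document tree. The following methods help:
Change a tag's name and attributes

p_tag = soup.p
# print(p_tag)
# Result: <p class="title"><b>The Dormouse's story</b></p>

p_tag.name = 'w' # rename the tag
p_tag['class'] = 'content' # change an attribute
print(p_tag)
# Result: <w class="content"><b>The Dormouse's story</b></w>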
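One side effect of renaming is worth knowing: once the tag is no longer called p, soup.p points at the next p tag in the document. The sketch below illustrates this with a two-paragraph fragment and the built-in "html.parser", both assumed only for this example.

```python
from bs4 import BeautifulSoup

# Two-paragraph fragment used only for this illustration
soup = BeautifulSoup(
    '<body><p class="title"><b>Title</b></p><p class="story">...</p></body>',
    "html.parser",
)

first_p = soup.p
# class is a multi-valued attribute, so it is read back as a list
print(first_p["class"])  # ['title']

first_p.name = "w"             # rename the tag
first_p["class"] = "content"   # overwrite the attribute
print(first_p)  # <w class="content"><b>Title</b></w>

# The renamed tag is no longer a <p>, so soup.p now finds the story paragraph
print(soup.p)  # <p class="story">...</p>
# The renamed tag is still in the tree under its new name
print(soup.w is first_p)  # True
```

This is why later snippets that call soup.p again may see a different tag unless the document is parsed afresh.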

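Assigning to string replaces the tag's existing content with the new content

# Parse html_doc again so that the rename above does not carry over
soup = BeautifulSoup(html_doc,'lxml')
p_tag = soup.p
print(p_tag.string)
# Result: The Dormouse's story

p_tag.string = 'you need python'
print(p_tag.string)
print(p_tag)
# Result:
you need python
<p class="title">you need python</p>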

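append() adds content to a tag, much like the .append() method of a Python list

# Parse html_doc again so that p_tag still holds its original <b> child
soup = BeautifulSoup(html_doc,'lxml')
p_tag = soup.p
p_tag.append('hahaha')
print(p_tag)
# Result: <p class="title"><b>The Dormouse's story</b>hahaha</p>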

decompose() removes a tag and its contents from the tree, which is handy for dropping paragraphs we do not need

# Remove the tag whose class is title
r = soup.find(class_ = 'title')
r.decompose()
print(soup)

The result:

<html><head><title>The Dormouse's story</title></head>
<body>

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

As you can see, the <p class="title"> tag has been removed. That is how the document tree can be modified.
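The string, append(), and decompose() steps can also be chained on one small document to see how each call changes the markup. This is a minimal sketch, assuming a made-up two-paragraph fragment and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Made-up fragment used only for this illustration
soup = BeautifulSoup(
    '<div><p class="title"><b>Hello</b></p><p class="story">text</p></div>',
    "html.parser",
)

title_p = soup.find(class_="title")

# Assigning to .string replaces all existing children, including the <b> tag
title_p.string = "new text"
print(title_p)  # <p class="title">new text</p>

# append() adds a new child after the existing ones
title_p.append("!")
print(title_p)  # <p class="title">new text!</p>

# decompose() removes the tag and everything inside it from the tree
soup.find(class_="story").decompose()
print(soup)  # <div><p class="title">new text!</p></div>
print(soup.find(class_="story"))  # None
```

All three operations change the tree in place, so printing soup afterwards always reflects the accumulated edits.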

This concludes the introduction to the bs4 module. The next chapter moves on to hands-on practice with bs4.

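Part 1: Python Essentials (environment setup, installation and configuration, and a step-by-step guide to importing third-party libraries)
Part 2: A First Look at Python Web Scraping (1): Understanding Web Scrapers
Part 3: A First Look at Python Web Scraping (2): The Request Modules
Part 4: A First Look at Python Web Scraping (3): Introduction to Regular Expressions
Part 5: A First Look at Python Web Scraping (4): Regular Expressions in Practice (Scraping Images)
Part 6: A First Look at Python Web Scraping (5): Using xpath and the lxml Library
Part 7: A First Look at Python Web Scraping (6): xpath in Practice (Scraping College Entrance Exam Score Lines)
Part 8: A First Look at Python Web Scraping (7): An Introduction to Beautifulsoup4 (Part I)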
