[Web Scraping] Using BeautifulSoup4: common parsers, find() and find_all(), select()

冰冷的希望

Last modified: 2024-11-13 11:00:56

Category: Web Scraping. Tags: beautifulsoup, bs4, web scraping, beautifulsoup4, css, lxml, html5lib

First published: 2020-11-08 18:00:48

Article link: https://blog.csdn.net/qq_39147299/article/details/109408286

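This article introduces BeautifulSoup4, a powerful tool for parsing HTML/XML and extracting data from it. It covers installation, basic usage examples, and a detailed look at locating elements with find(), find_all() and select().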

1.BeautifulSoup4

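BeautifulSoup is a powerful HTML/XML parsing library. We mainly use it to parse HTML/XML documents and extract data from them.

Pros: easy to use; supports CSS selectors; works with the HTML parser in the Python standard library, with lxml's HTML and XML parsers, and with the highly tolerant html5lib parser.

Cons: it builds and traverses the whole DOM tree, so its time and memory costs are fairly high, and it is slower than using lxml directly.

Official API documentation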
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

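2.Basic usage

Object types
Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four types: Tag, NavigableString, BeautifulSoup and Comment.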
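
To make the four object types concrete, here is a minimal sketch. The markup string is a made-up snippet (not from the article) chosen so that each type shows up once, and it uses the built-in html.parser so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical snippet, picked only so that all four object types appear.
markup = "<p class='note'>Hello<!-- hidden remark --></p>"
soup = BeautifulSoup(markup, "html.parser")

p_tag = soup.p                     # Tag: an element in the document
text_node = p_tag.contents[0]      # NavigableString: text inside a tag
comment_node = p_tag.contents[1]   # Comment: a special kind of NavigableString

type_names = [type(obj).__name__ for obj in (soup, p_tag, text_node, comment_node)]
print(type_names)
```

Since Comment is a subclass of NavigableString, code that only wants visible text usually has to filter comments out explicitly.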

Installation

pip install beautifulsoup4

Usage

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')

print(soup.title)  # the title tag
print(soup.title.get_text())  # text of the title tag
print(soup.a.get("href"))  # href attribute of the first a element; a Tag can be used like a dict
print(soup.p)  # the first p element

print(soup.find(name='a'))  # find the first a element
print(soup.find_all(name="a"))  # find all a elements
print(soup.find_all(name="a", limit=10))  # find all a elements, but return at most 10

# print(soup.prettify())  # print the whole HTML document, indented

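3.Common parsers

The table below lists the parsers BeautifulSoup4 supports. Note that lxml and html5lib must be installed separately before they can be used.

Parser	Usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	1. Built into Python 2. Decent speed 3. Lenient with malformed markup	1. Slower than lxml 2. Less lenient than html5lib
lxml HTML	`BeautifulSoup(markup, "lxml")`	1. Fast 2. Lenient with malformed markup	Requires an external C library
lxml XML	`BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")`	1. Fast 2. The only supported XML parser	Requires an external C library
html5lib	`BeautifulSoup(markup, "html5lib")`	1. The most lenient 2. Parses the document the way a browser does 3. Produces valid HTML5	1. Slow 2. Must be installed separately

If you want to use the lxml or html5lib parser, install it first: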

pip install lxml
pip install html5lib
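
To see how the parsers differ in practice, the sketch below feeds the same invalid fragment to each one and records the result. The fragment and the variable names are illustrative assumptions. Parsers that are not installed are skipped by catching bs4's FeatureNotFound exception, so the snippet should run even with only the standard library parser available.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Invalid fragment: the closing </p> has no matching opening tag.
broken = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # The parser library is not installed in this environment.
        results[parser] = None

for parser, output in results.items():
    print(f"{parser:12} -> {output if output is not None else 'not installed'}")
```

With html.parser the stray closing tag is simply dropped. lxml and html5lib repair the fragment differently and wrap it in a full document, so the same selector can behave differently depending on which parser built the tree.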

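4.find() and find_all()

find(name, attrs, recursive, string, **kwargs)
find_all(name, attrs, recursive, string, limit, **kwargs)

Both functions look up elements, and their signatures show that they are used in almost the same way (older releases called the `string` argument `text`). The difference is that find() returns only the first matching element, or None if nothing matches, while find_all() returns every match, or an empty list if nothing matches.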

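import re

soup.find_all(name='a')  # find all a tags
soup.find_all(name=['a', 'b'])  # find all a tags and b tags
soup.find_all(name=re.compile("^b"))  # find tags whose names start with b, such as body and b

soup.find_all("a", class_="sister")  # find a tags whose class is sister

soup.find_all(attrs={"attribute-name": "value"})  # find by an arbitrary attribute name and value

soup.find_all(string="Elsie")  # find by text content; returns the matching strings, not tags

soup.find_all(id='link2')  # find elements by id
soup.find_all(class_="sister")  # find elements by class

soup.find_all("a", limit=2)  # return at most 2 results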
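
The different return values matter most when nothing matches. The following sketch reuses a trimmed copy of the sample document from section 2, parsed with html.parser so it has no extra dependencies. The variable names are illustrative only.

```python
import re

from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")            # a single Tag
all_links = soup.find_all("a")         # a list-like ResultSet of Tags
missing_one = soup.find("table")       # None, because no table exists
missing_all = soup.find_all("table")   # empty list, because no table exists

link_ids = [a["id"] for a in all_links]

# Keyword filters also accept compiled regular expressions.
first_two_ids = [a["id"] for a in soup.find_all("a", id=re.compile(r"^link[12]$"))]

# Check find() for None before touching the result, otherwise an
# AttributeError is raised on the missing element.
table_text = missing_one.get_text() if missing_one is not None else ""

print(first_link["id"], link_ids, first_two_ids, missing_one, missing_all)
```

Iterating over the result of find_all() is always safe, even when it is empty, whereas the result of find() needs the None check shown above.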

5.select()

The select() function finds elements using CSS selectors.

print(soup.select(".sister"))  # elements with class sister; returns a list
print(soup.select("#link1"))  # the element with id link1; returns a list
print(soup.select("a"))  # all a tags; returns a list
print(soup.select("p[class=title]"))  # p tags whose class is title
print(soup.select("p #link2"))  # the element with id link2 inside a p tag (here an a tag)
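
A few more selector forms, sketched on a trimmed copy of the sample document. select_one() is the CSS counterpart of find(): it returns the first match or None. This assumes a bs4 release recent enough to ship with the soupsieve selector engine, which is what enables the attribute-suffix selector used here.

```python
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag name combined with a class.
sisters = [a.get_text() for a in soup.select("a.sister")]

# Descendant selector: the element with id link2 somewhere inside a p tag.
link2 = soup.select_one("p #link2")

# Attribute selector: a tags whose href ends with "tillie".
tillie_ids = [a["id"] for a in soup.select('a[href$="tillie"]')]

# select_one() returns None when nothing matches.
no_match = soup.select_one("div.missing")

print(sisters, link2["href"], tillie_ids, no_match)
```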

For more on CSS selectors, see my earlier posts:
[css] Common CSS selectors
[Web Scraping] Locating elements (XPath, CSS)

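5.Tag objects

Almost every element object we get back from bs4 is a Tag object (that is, a bs4.element.Tag). Below are some of its most commonly used features.

5.1 Getting text and attributes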

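from bs4 import BeautifulSoup

# res is assumed to be an HTTP response fetched earlier, for example with requests.get()
soup = BeautifulSoup(res.text, "html5lib")
a_list = soup.select("table > tbody > tr > td > b > a")
a_tag = a_list[0]
a_tag.name  # tag name
a_tag.text  # all text inside the tag, concatenated
a_tag.string  # the tag's only string; None if the tag contains more than one child
a_tag.strings  # generator over every string inside the tag
a_tag.stripped_strings  # like .strings, but with whitespace stripped and blank strings skipped
a_tag.attrs  # all attributes of the tag (returns a dict)
a_tag.get("href")  # value of one attribute; None if it is missing
a_tag.has_attr("href")  # whether the tag has the attribute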
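
The text accessors are easy to mix up, so here is a small self-contained sketch. The HTML string is a made-up fragment standing in for the downloaded page, and html.parser is used instead of html5lib to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for res.text.
html = '<p class="story intro">  Once upon a time <a href="/elsie" id="link1">Elsie</a>\n</p>'
soup = BeautifulSoup(html, "html.parser")
p_tag = soup.p
a_tag = soup.a

a_string = a_tag.string                   # the single string inside <a>
p_string = p_tag.string                   # None: <p> has several children
p_parts = list(p_tag.stripped_strings)    # every non-blank string, whitespace stripped
p_classes = p_tag["class"]                # class is multi-valued, so this is a list
a_attrs = a_tag.attrs                     # dict of all attributes
missing_attr = a_tag.get("title")         # None, where a_tag["title"] would raise KeyError

print(a_string, p_string, p_parts, p_classes, a_attrs, missing_attr)
```

Using get() rather than square brackets is the safer choice whenever an attribute may be absent on some of the matched tags.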

5.2 Accessing related elements

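soup = BeautifulSoup(res.text, "html5lib")
tb_tag = soup.find("table")
# tb_tag = soup.find_all("table", limit=1)[0]
tb_tag.tbody.tr  # access a child element by tag name
tb_tag.parent  # the parent element
tb_tag.parents  # all ancestors, from the parent up to the root
tb_tag.children  # iterator over the direct children
tb_tag.previous_sibling  # the previous sibling; .previous_element is instead whatever was parsed just before
tb_tag.previous_siblings  # all preceding siblings; .previous_elements is instead everything parsed before
tb_tag.next_sibling  # the next sibling; .next_element is instead whatever was parsed just after
tb_tag.next_siblings  # all following siblings; .next_elements is instead everything parsed after
tb_tag.next_element  # access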