The BeautifulSoup4 Library

Latest recommended article published on 2024-08-06 23:23:32

小xiao露

Category: Web scraping

Article link: https://blog.csdn.net/weixin_36407399/article/details/83472736

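BeautifulSoup4 is an HTML/XML parsing library built on a DOM tree structure, with a user-friendly API and CSS selector support. It is slower than lxml but simple to use for parsing HTML. Its main objects are: Tag, which represents an HTML tag; NavigableString, which holds the text inside a tag; BeautifulSoup, which represents the whole document; and Comment, which handles comments in the document. The commonly used methods find, find_all, and select make it easy to search and traverse the document tree.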

Summary generated by CSDN using AI technology

1. Introduction to BeautifulSoup4:

Like lxml, BeautifulSoup4 is an HTML/XML parser. Its main job is likewise to parse HTML/XML and extract data from it.

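2. Differences between BeautifulSoup4 and lxml:

lxml can traverse a document partially, whereas BeautifulSoup4 is based on the HTML DOM (Document Object Model): it loads the whole document and parses the entire DOM tree. Its time and memory overhead is therefore considerably higher, and its performance is lower than lxml's.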
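The point about loading the whole document can be made visible with a small sketch. It compares a normal parse with one restricted by `SoupStrainer` and the `parse_only` argument, a BeautifulSoup feature the original article does not mention. The sample HTML is made up for illustration, and the built-in `html.parser` is used so that nothing beyond BeautifulSoup4 itself is required.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample document, for illustration only.
html = """
<html><body>
<p class="story">Once upon a time</p>
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
</body></html>
"""

# Default behaviour: every element in the document becomes an object in the tree.
full_soup = BeautifulSoup(html, "html.parser")

# Restricted parse: only <a> tags are kept while the tree is built.
only_links = SoupStrainer("a")
partial_soup = BeautifulSoup(html, "html.parser", parse_only=only_links)

# find_all(True) matches every tag in the tree.
full_tags = [tag.name for tag in full_soup.find_all(True)]
partial_tags = [tag.name for tag in partial_soup.find_all(True)]

print(full_tags)
print(partial_tags)
```

On large pages, restricting the parse this way may reduce the memory cost of the full tree, at the price of losing access to everything outside the selected tags.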

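3. Advantages of BeautifulSoup4:

BeautifulSoup4 makes parsing HTML fairly simple. Its API is very user-friendly, it supports CSS selectors, and it can work with either the HTML parser in the Python standard library or lxml's HTML and XML parsers.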
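As a rough illustration of the CSS selector support and the choice of parser, here is a minimal sketch. The HTML fragment is a shortened version of the sample used later in this article, and `html.parser` is chosen only so the snippet runs without lxml; swapping in `'lxml'` is assumed to give the same results for this input.

```python
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

# "html.parser" ships with Python; pass "lxml" instead if lxml is installed.
soup = BeautifulSoup(html, "html.parser")

# select() returns a list of all matches; select_one() returns the first match.
sister_names = [a.get_text() for a in soup.select("a.sister")]  # class selector
link3_href = soup.select_one("#link3")["href"]                  # id selector
bold_in_title = soup.select_one("p.title b").get_text()         # descendant selector

print(sister_names)
print(link3_href)
print(bold_in_title)
```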

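4. Installing BeautifulSoup4

conda install beautifulsoup4
or: pip install beautifulsoup4

The package is published as beautifulsoup4; bs4 is the name used when importing it. The examples below use the 'lxml' parser, which has to be installed separately (pip install lxml).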
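A quick way to confirm that the installation worked is to import the package and parse a trivial document. This is only a sanity check, using the built-in `html.parser` so that it does not depend on lxml being present.

```python
import bs4
from bs4 import BeautifulSoup

# The installed version string, useful when reporting problems.
print(bs4.__version__)

# Parse a one-tag document and read its text back.
soup = BeautifulSoup("<p>ok</p>", "html.parser")
parsed_text = soup.p.string
print(parsed_text)
```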

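5. Comparison of the main parsing tools:

Parsing tool	Parsing speed	Difficulty of use
BeautifulSoup	Slowest	Easiest
lxml	Fast	Easy
Regular expressions	Fastest	Hardest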
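To give a feel for the difficulty column, the sketch below extracts the same links once with a regular expression and once with BeautifulSoup. The two-line HTML sample is invented for illustration; it deliberately varies attribute order and quote style, which the regular expression has to anticipate by hand. No timing is measured here.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: note the different attribute order and quote style.
html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a class="sister" id="link2" href='http://example.com/lacie'>Lacie</a>
"""

# Regex: the pattern must cope with any attributes before href and both quote styles.
regex_links = re.findall(r"""<a[^>]*\bhref=["']([^"']+)["']""", html)

# BeautifulSoup: the parser handles those details; we just ask for the attribute.
soup = BeautifulSoup(html, "html.parser")
soup_links = [a["href"] for a in soup.find_all("a")]

print(regex_links)
print(soup_links)
```

Both approaches return the same two URLs for this input, but the regular expression would need further changes for cases such as unquoted attribute values, while the BeautifulSoup version would not.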

6. Four commonly used BeautifulSoup objects

6.1 Tag: put simply, a Tag is one of the individual tags in the HTML.

from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
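soup = BeautifulSoup(html, 'lxml')  # create a BeautifulSoup object; 'lxml' is the parser to use
print(soup.title)  # the first <title> tag
print(soup.head)  # the first <head> tag
print(soup.a)  # the first <a> tag
print(soup.p)  # the first <p> tag
print(type(soup.p))  # check the type: <class 'bs4.element.Tag'>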

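As the example above shows, writing soup followed by a tag name is an easy way to get at these tags, and the objects returned are of type bs4.element.Tag. Note, however, that this only finds the first matching tag in the whole document.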
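The "first match only" behaviour is easy to overlook, so here is a small sketch contrasting attribute-style access with `find_all`. It uses a shortened version of the sample HTML and the built-in `html.parser` (an assumption made so the snippet runs without lxml).

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a             # shorthand for soup.find("a"): first match only
all_links = soup.find_all("a")  # every match, in document order
all_ids = [a["id"] for a in all_links]
missing = soup.table            # no <table> in the document, so this is None

print(first_link["id"])
print(all_ids)
print(missing)
```

Because a missing tag yields `None` rather than raising an error, chained access such as `soup.table.tr` fails with an `AttributeError`; checking for `None` first avoids that.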

Two important attributes of a Tag: name and attrs

from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
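soup = BeautifulSoup(html, 'lxml')  # create a BeautifulSoup object; 'lxml' is the parser to use
print(soup.name)  # the soup object itself is special: its name is '[document]'
print(soup.head.name)  # for ordinary tags, name is the tag's own name
print(soup.p.attrs)  # all attributes of the first <p> tag, returned as a dict
print(soup.p.get('class'))  # same value as soup.p['class'], but get() returns None if the attribute is missing
soup.p['class'] = 'newclass'  # attributes and content can be modified in place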
print(s