[Python] Parsing Libraries: BeautifulSoup

Skyey_6

Article link: https://blog.csdn.net/Skyey_6/article/details/112885547

Parsers

BeautifulSoup works with the HTML parser in Python's standard library and also supports several third-party parsers.

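Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in Python versions before 2.7.3 and 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents	Requires a C library to be installed
lxml XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	Fast; the only parser that supports XML	Requires a C library to be installed
html5lib	BeautifulSoup(markup, "html5lib")	Best tolerance of malformed documents; parses the document the way a browser does; produces HTML5-format output	Slow (it is pure Python and needs no C extension, but must be installed as a separate package)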
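To make the table concrete, here is a minimal sketch of choosing a parser at runtime. The helper name make_soup is hypothetical (it is not part of bs4); it assumes you would like lxml when it is installed and the built-in html.parser otherwise.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed
        return BeautifulSoup(markup, "html.parser")


# Deliberately malformed markup: neither <p> nor <b> is closed
soup = make_soup("<p>Hello<b>world")
bold_text = soup.b.string
print(bold_text)
```

Either parser closes the open tags, so the b node can be selected normally. The surrounding structure may differ: lxml wraps the fragment in html and body nodes, while html.parser does not.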

Node selectors

Selecting elements

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

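# Create a BeautifulSoup object (the malformed HTML string is repaired automatically during parsing)
soup = BeautifulSoup(html, 'lxml')  # first argument: the HTML string (its body and html tags are left unclosed); second argument: the parser to use

print(soup.prettify())  # output with standard indentation
print(soup.title.string)  # soup.title selects the title node; .string returns its text
print(soup.title)
print(type(soup.title))   # <class 'bs4.element.Tag'>
print(soup.head)
print(soup.p)   # several p nodes exist; only the first match is returned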

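Extracting information

print(soup.title.name)  # .name returns the node name
print(soup.p.attrs)     # .attrs returns all attributes of the node as a dict
print(soup.p.attrs['name'])     # look up a value by key
print(soup.p['name'])   # shorthand: index the node directly with the attribute name
print(soup.p['class'])  # check the return type: class is multi-valued, so this is a list (['title']), while name is a plain string
print(soup.p.string)    # get the text content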
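The return types are easy to trip over, so here is a small sketch that spells them out. The markup is made up for illustration, and html.parser is used only so the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup

# Illustrative markup (not from the article's sample document)
markup = (
    '<p class="title big" name="dromouse"><b>Story</b></p>'
    '<p id="mixed">a<b>b</b></p>'
)
soup = BeautifulSoup(markup, "html.parser")

first = soup.p
classes = first["class"]      # multi-valued attribute: always a list
name = first["name"]          # ordinary attribute: a plain string
missing = first.get("id")     # .get() returns None instead of raising KeyError

single_text = first.string    # the only child chain is <b>Story</b>, so the text is returned

mixed = soup.find(id="mixed")
mixed_string = mixed.string   # more than one child: .string is None
mixed_text = mixed.get_text() # get_text() concatenates all nested text

print(classes, name, missing, single_text, mixed_string, mixed_text)
```

When a node may contain more than one child, get_text() is usually the safer choice than .string.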

Nested selection

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.head.title)  # every selection returns a Tag, so selections can be chained
print(type(soup.head.title))  # <class 'bs4.element.Tag'>
print(soup.head.title.string)  # The Dormouse's story

Selecting related nodes

html = """
<html>
<head>
<title>The Dormouse's story</title></head>
<body>
<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)  # .contents returns the direct children as a list

print(soup.p.children)  # .children selects the same direct children but returns an iterator, not a list
for i, child in enumerate(soup.p.children):
    print(i, child)

print(soup.p.descendants)   # .descendants returns a generator over all descendant nodes (children, grandchildren, ...)
for i, descendant in enumerate(soup.p.descendants):
    print(i, descendant)
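The difference between direct children and descendants is easier to see on a tiny document. This is a sketch with made-up markup, parsed with html.parser so that no whitespace-only text nodes get in the way.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p>one<a><span>two</span></a>three</p>", "html.parser")
p = soup.p

direct = p.contents                  # list: 'one', the <a> tag, 'three'
direct_count = len(list(p.children)) # same nodes, consumed from an iterator
everything = list(p.descendants)     # 'one', <a>, <span>, 'two', 'three'

# Keep only Tag objects; the remaining descendants are text nodes
descendant_tags = [node.name for node in everything if isinstance(node, Tag)]

print(len(direct), direct_count, len(everything), descendant_tags)
```

Note that text nodes count as children too, which is why the sample document in this section yields several whitespace-only entries.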

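Similarly, there are:
parent (the parent node), parents (all ancestor nodes), next_sibling (the next sibling node), previous_sibling (the previous sibling node), next_siblings (all following sibling nodes), and previous_siblings (all preceding sibling nodes).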

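If the result is a single node, you can read its text and attributes directly through properties such as string and attrs.

If the result is a generator over several nodes, convert it to a list, pick out one element, and then read string, attrs, and so on from that node.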
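Here is a short sketch of the parent and sibling properties listed above, including the "convert the generator to a list first" step. The markup is invented for the example and is written without whitespace between tags so that every sibling is a tag rather than a text node.

```python
from bs4 import BeautifulSoup

html = (
    '<div id="box">'
    '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
second = soup.find(id="link2")

parent_id = second.parent["id"]                 # single node: index it directly
ancestors = [t.name for t in second.parents]    # generator of all ancestors
next_text = second.next_sibling.string          # single node
prev_text = second.previous_sibling.string      # single node

following = list(second.next_siblings)          # generator -> list
following_attrs = following[0].attrs            # then read attrs from one element

print(parent_id, ancestors, next_text, prev_text, following_attrs)
```

In real pages the siblings of a tag are often whitespace text nodes, so it is worth checking what next_sibling actually returns before reading attributes from it.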

Method selectors

find_all()

The examples in this section parse the following markup, saved as find.html:

<html>
<head>
    <title>index</title>
</head>
<body>
    <div>
        <ul>
            <li class="item-0" id="flask"><a href="link1.html">first item</a></li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
    <div>
        <ul>
            <li><a href="hello.html"> hello world </a></li>
            <li><a href="hello2.html"> hello world2 </a></li>
        </ul>
    </div>
</body>
</html>

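Querying by node name (the name argument)

from bs4 import BeautifulSoup

# Open the file in a with block so it is closed once parsing is done
with open('find.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')

print(soup.find_all(name="li"))     # find all li nodes
print(soup.find_all(name="li")[0])
print(type(soup.find_all(name="li")[0]))  # <class 'bs4.element.Tag'>

# Each result is a Tag, so nested selection works on it
for li in soup.find_all(name='li'):
    print(li.a.attrs)
# {'href': 'link1.html'}
# {'href': 'link2.html'}
# {'href': 'link3.html'}
# {'href': 'link4.html'}
# {'href': 'link5.html'}
# {'href': 'hello.html'}
# {'href': 'hello2.html'}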

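Querying by attribute (the attrs argument)
Common attributes such as id and class can be passed directly as keyword arguments instead of through attrs.
Because class is a Python keyword, the keyword argument is written with a trailing underscore: class_.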

print(soup.find_all(attrs={'class': 'item-0'}))

print(soup.find_all(id='flask'))
print(soup.find_all(class_='item-inactive'))
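The two spellings can be compared side by side. This is a sketch on a shortened, slightly modified copy of the list markup (the extra class "active" is added here for illustration), parsed with html.parser to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="item-0" id="flask"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
  <li class="item-0 active"><a href="link3.html">third item</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.find_all(attrs={"class": "item-0"})  # dict form
by_kwarg = soup.find_all(class_="item-0")            # keyword form

# A class query matches any one of a node's classes,
# so the "item-0 active" node is included as well
hrefs = [li.a["href"] for li in by_kwarg]

flask_text = soup.find_all(id="flask")[0].a.string

print(hrefs, flask_text, len(by_attrs) == len(by_kwarg))
```

The attrs form is still needed for attribute names that are not valid Python identifiers, such as data-* attributes.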

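Matching the text inside nodes (the text argument)
The value can be a string or a compiled regular expression object. In Beautiful Soup 4.4.0 and later the same argument is also available under the name string.

import re

print(soup.find_all(text=re.compile(r'hello')))
# [' hello world ', ' hello world2 ']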
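A plain string and a regular expression behave differently here, which the following sketch tries to show. It uses the string= spelling (assumed available, i.e. Beautiful Soup 4.4.0 or later) and a made-up third link so there is something for the exact match to find.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<ul>'
    '<li><a href="hello.html"> hello world </a></li>'
    '<li><a href="hello2.html"> hello world2 </a></li>'
    '<li><a href="other.html">other</a></li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="other")        # a str must equal the whole text node
no_hit = soup.find_all(string="hello")       # a substring is not enough: no match
pattern = soup.find_all(string=re.compile(r"hello"))  # a regex matches anywhere in the text

# The results are text nodes, not tags; .parent gets back to the enclosing tag
links = [text.parent["href"] for text in pattern]

print(exact, no_hit, pattern, links)
```

Since the matches are text nodes, going through .parent is the usual way to reach the attributes of the tag that contains the text.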

find()
find() is similar to find_all(), except that find() returns a single element (the first match), while find_all() returns a list of all matching elements.
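The difference matters most when nothing matches. A minimal sketch, using made-up markup and html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>first</li><li>second</li></ul>", "html.parser")

first_item = soup.find("li")          # a single Tag: the first match
all_items = soup.find_all("li")       # a list of every match

missing_one = soup.find("table")      # no match: None
missing_all = soup.find_all("table")  # no match: empty list

print(first_item.string, [li.string for li in all_items], missing_one, missing_all)
```

Because find() can return None, check the result before reading .string or attributes from it.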
