Using BeautifulSoup in Python

Latest recommended article published on 2024-05-06 13:00:35

Ax阿轩

Category: SpiderCrawl. Tags: python, programming language

Article link: https://blog.csdn.net/weixin_42160053/article/details/125000674

BeautifulSoup(from bs4 import BeautifulSoup)

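BeautifulSoup is a Python library for parsing HTML and XML. It lets you extract information from a page.
For more detail, see the official documentation and the Beautiful Soup Chinese documentation.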

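1. Preparation

Installation
- Install Beautiful Soup: pip install beautifulsoup4 (pip install bs4 also works; that PyPI name is a thin wrapper that pulls in beautifulsoup4)
- Install lxml: pip install lxml
Usage tips
- Prefer the lxml parser; fall back to html.parser when lxml is not available
- Syntax: soup = BeautifulSoup('HTML source', 'parser name')
- Use find() to look up a single result and find_all() to look up multiple results
- If you are comfortable with CSS selectors, you can use select() instead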
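
To make the syntax and the three lookup methods concrete, here is a minimal sketch. The markup and the variable names are made up for illustration, and it uses html.parser only so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, made up for this sketch.
sample_html = """
<ul id="fruits">
  <li class="item">apple</li>
  <li class="item">banana</li>
</ul>
"""

# 'html.parser' ships with Python; swap in 'lxml' if it is installed.
soup = BeautifulSoup(sample_html, "html.parser")

first_item = soup.find("li")                 # first match, or None if nothing matches
all_items = soup.find_all("li")              # list of every match
css_items = soup.select("#fruits li.item")   # the same nodes, selected with CSS

print(first_item.get_text())
print([li.get_text() for li in all_items])
print([li.get_text() for li in css_items])
```

find() returns None when nothing matches, so check the result before reading attributes from it.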

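2. BeautifulSoup parsers

Beautiful Soup relies on a parser to do the actual parsing. Besides the HTML parser in the Python standard library, it supports several third-party parsers, such as lxml.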

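Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, 'html.parser')	Built into Python; decent speed	Poor tolerance of malformed markup before Python 3.2.2
lxml HTML parser	BeautifulSoup(markup, 'lxml')	Fast; tolerant of malformed markup	Requires an external C library
lxml XML parser	BeautifulSoup(markup, 'xml')	Fast; the only parser that supports XML	Requires an external C library
html5lib	BeautifulSoup(markup, 'html5lib')	Most tolerant; parses the page the way a browser does; produces valid HTML5	Slow; requires an external Python package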
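
Since lxml may not be installed everywhere, one option is to try parsers in order of preference. The sketch below is only an illustration: make_soup and its parsers argument are hypothetical names, not part of Beautiful Soup. It relies on bs4 raising FeatureNotFound when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html.parser")):
    """Parse markup with the first parser in `parsers` that is installed.

    Hypothetical helper written for this sketch.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("none of the requested parsers is installed")


soup = make_soup("<p>Hello, <b>parser</b></p>")
print(soup.p.get_text())
```

Different parsers can build different trees from malformed markup, so a fallback like this is safest when the input is reasonably well-formed.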

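3. Node selectors

Node selectors retrieve information from a node, such as its text, its attributes, and its tag name.
Note:
- Using a tag name as an attribute (for example soup.a) returns only the first tag with that name
For more detail, see the official documentation and the Chinese documentation.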

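3.1 Going Down (child nodes)

Navigate by tag name, for example soup.head or soup.title.
.contents: the tag's direct children, returned as a list.
.children: the tag's direct children, returned as an iterator.
.descendants: all of the tag's descendants, recursively (children, grandchildren, and so on, including text nodes), returned as a generator.
.string: the text of a tag that has exactly one child, either a string or a single tag that itself has a .string. Returns None when the tag has more than one child.
.strings: every string inside the tag, returned as a generator; iterate over it with a for loop.
.stripped_strings: like .strings, but whitespace-only strings are skipped and leading and trailing whitespace is stripped from the rest.
.text: all strings inside the tag, concatenated into one string.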

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')
print(soup.prettify())  # prettify() returns the parsed document with standard indentation

'''
    Basic node attributes
'''
print(soup.head)  # the head element: <head><title>The Dormouse's story</title></head>
print(soup.title)  # the title element: <title>The Dormouse's story</title>
print("Type of soup.title:", type(soup.title))  # bs4.element.Tag
'''
    .string returns None when the tag has no text or has more than one child; with exactly one child it returns that text
    .text returns an empty string when the tag has no text, and the concatenation of all strings otherwise
'''
print(soup.title.string)  # text via .string: The Dormouse's story
print(soup.title.text)  # text via .text: The Dormouse's story
print(soup.title.name)  # tag name: title
print(soup.p.attrs)  # attributes as a dict: {'class': ['title']}
print(soup.p['class'])  # value of one attribute: ['title']
print(soup.body.b)  # first b tag under body: <b>The Dormouse's story</b>
print('\n')

'''
    .contents and .children cover only a tag's direct children
    .descendants walks every level below the tag; strings count as nodes too, so it also yields the text
'''
print(soup.p.contents, type(soup.p.contents))  # direct children, as a list
print(soup.p.children, type(soup.p.children))  # direct children, as an iterator; consume it with a for loop
print(list(enumerate(soup.p.children)))  # [(0, <b>The Dormouse's story</b>)]
print(soup.p.descendants, type(soup.p.descendants))  # all descendants, as a generator; consume it with a for loop
print(list(enumerate(soup.p.descendants)))  # [(0, <b>The Dormouse's story</b>), (1, "The Dormouse's story")]
print('\n')

'''
    .string: when the tag has exactly one child, returns that child's text
    .strings and .stripped_strings
        .strings: generator over every string inside the tag
        .stripped_strings: same, but whitespace-only strings are skipped and surrounding whitespace is stripped
    .text: all strings inside the tag, concatenated
'''
print(soup.b.string)
print(type(soup.strings))  # <class 'generator'>
print(list(enumerate(soup.strings)))
print(type(soup.stripped_strings))  # <class 'generator'>; removes the extra whitespace that strings often carry
print(list(enumerate(soup.stripped_strings)))
print(soup.text)  # all text in the document
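
The difference between .string and .text is easy to get wrong, so here is a small side-by-side check. The markup is made up for this sketch, and html.parser is used only so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three paragraphs with different child layouts.
soup = BeautifulSoup(
    "<div><p id='one'>only text</p>"
    "<p id='two'>Hello <b>world</b></p>"
    "<p id='three'><b>nested</b></p></div>",
    "html.parser",
)

one = soup.find(id="one")
two = soup.find(id="two")
three = soup.find(id="three")

print(one.string)                   # only text
print(two.string)                   # None: two children ("Hello " and <b>)
print(two.text)                     # Hello world
print(three.string)                 # nested: the single child <b> passes its .string up
print(list(two.stripped_strings))   # ['Hello', 'world']
print(len(list(two.descendants)))   # 3: "Hello ", <b>, and "world"
```

When a tag may contain mixed content, .text or get_text() is usually the safer choice, because .string silently returns None.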

3.2 Going up (parent nodes)

.parent: the element's direct parent.
.parents: all of the element's ancestors, returned as a generator.

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.a.parent)  # direct parent of the first a tag, via parent
print(soup.a.parents)  # all ancestors of the first a tag, via parents; returns a generator
for i, parent in enumerate(soup.a.parents):
    if parent is None:
        print(i, parent)
    else:
        print(i, parent.name)

# Output
# <p class="story">Once upon a time there were three little sisters; and their names were
#         <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#         <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#         <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#         and they lived at the bottom of a well.
#     </p>
# <generator object PageElement.parents at 0x000001FDFDDB6F90>
# 0 p
# 1 body
# 2 html
# 3 [document]
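
One use for .parents is to build a readable path to a node while debugging. The sketch below assumes a made-up fragment and a hypothetical helper named ancestor_path. It uses html.parser, which does not add html and body wrappers around a fragment.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p><a href='#'>link</a></p></div>", "html.parser")


def ancestor_path(tag):
    """Return the tag names from the document root down to `tag` (hypothetical helper)."""
    names = [parent.name for parent in tag.parents]  # closest ancestor first
    return " > ".join(list(reversed(names)) + [tag.name])


print(soup.a.parent.name)     # p
print(ancestor_path(soup.a))  # [document] > div > p > a
```

The outermost ancestor is always the BeautifulSoup object itself, whose name is [document].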

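3.3 Going sideways (sibling nodes)

next_sibling: the next node at the same level
previous_sibling: the previous node at the same level
next_siblings: all nodes at the same level that come after the current node
previous_siblings: all nodes at the same level that come before the current node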

from bs4 import BeautifulSoup

html_doc = """
<a><b>text1</b><c>text2</c><d>test3</d></a>
"""

soup = BeautifulSoup(html_doc, 'lxml')

# print(soup.prettify())
print('Next sibling: ', soup.b.next_sibling)  # the next sibling, via next_sibling
print('Prev sibling: ', soup.c.previous_sibling)  # the previous sibling, via previous_sibling
print('Next siblings', list(enumerate(soup.b.next_siblings)))  # all following siblings, via next_siblings
print('Prev siblings', list(enumerate(soup.d.previous_siblings)))  # all preceding siblings, via previous_siblings

# Output
# Next sibling:  <c>text2</c>
# Prev sibling:  <b>text1</b>
# Next siblings [(0, <c>text2</c>), (1, <d>test3</d>)]
# Prev siblings [(0, <c>text2</c>), (1, <b>text1</b>)]
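
The example above has no whitespace between tags, so every sibling is a tag. In real pages the next sibling is often a text node such as a comma or a newline. The following sketch, using made-up markup and html.parser, shows that case and two ways to skip the text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<p><a id='first'>Elsie</a>,\n<a id='second'>Lacie</a></p>"
soup = BeautifulSoup(html, "html.parser")
first = soup.find(id="first")

print(repr(first.next_sibling))       # ',\n' is a text node, not a tag
print(first.find_next_sibling("a"))   # <a id="second">Lacie</a>

# Keep only the siblings that are tags.
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]
print([t["id"] for t in tag_siblings])
```

find_next_sibling() with a tag name is usually the shortest way to jump to the next element of a given kind.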

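3.4 Going back and forth [rarely used; a quick run-through is enough]

.next_element: the node that was parsed immediately after the current one
.previous_element: the node that was parsed immediately before the current one
.next_elements: iterates forward through the document from the current node
.previous_elements: iterates backward through the document from the current node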

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')
last_a_tag = soup.find("a", id="link3")
print(last_a_tag)
print(last_a_tag.next_sibling)
print(last_a_tag.previous_element)
print(last_a_tag.next_element)
print(list(enumerate(last_a_tag.previous_elements)))
print(list(enumerate(last_a_tag.next_elements)))
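
The difference between next_element and next_sibling is easier to see on a tiny document. This is a sketch with made-up markup: next_element follows parse order, so it steps into the tag's own content first, while next_sibling stays on the same level.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")
b = soup.b

print(repr(b.next_element))   # 'one': the string inside <b> was parsed right after <b>
print(b.next_sibling)         # <i>two</i>: the next node on the same level

elements = list(b.next_elements)
print(elements)               # the string 'one', then <i>two</i>, then the string 'two'
```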

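4. Method selectors

Two methods cover most needs, find_all() and find(). The others are worth knowing but are used less often.

find_all( name , attrs , recursive , string , limit , **kwargs ): returns all matching nodes
find( name , attrs , recursive , string , **kwargs ): returns the first matching node
find_parents( name , attrs , string , limit , **kwargs ): returns all matching ancestors
find_parent( name , attrs , string , **kwargs ): returns the closest matching ancestor
find_next_siblings( name , attrs , string , limit , **kwargs ): returns all matching siblings after the current node
find_next_sibling( name , attrs , string , **kwargs ): returns the first matching sibling after the current node
find_previous_siblings( name , attrs , string , limit , **kwargs ): returns all matching siblings before the current node
find_previous_sibling( name , attrs , string , **kwargs ): returns the first matching sibling before the current node
find_all_next( name , attrs , string , limit , **kwargs ): returns all matching nodes after the current node
find_next( name , attrs , string , **kwargs ): returns the first matching node after the current node
find_all_previous( name , attrs , string , limit , **kwargs ): returns all matching nodes before the current node
find_previous( name , attrs , string , **kwargs ): returns the first matching node before the current node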

from bs4 import BeautifulSoup

html_doc = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html_doc, 'lxml')

'''
    find_all() and find()
'''
print(soup.find_all('ul'))  # all ul nodes, as a list
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
    for li in ul.find_all('li'):
        print(li.string)
print(soup.find_all(attrs={'id': 'list-1'}))  # all nodes whose id is list-1, as a list
print(soup.find_all(id='list-1'))  # the same query, written as a keyword argument
print(soup.find_all(attrs={'class': 'element'}))
print(soup.find_all(class_='element'))  # class is a Python keyword, so the argument is spelled class_
print(soup.find_all('li', {'class': 'element'}))  # same result here, with the match restricted to li tags
print(soup.find('ul'))  # find() returns only the first matching node
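5. CSS selectors

.select() uses Soup Sieve to run a CSS selector against the parsed document and returns all matching elements.
If you are comfortable with CSS selectors, select() covers most matching needs.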

from bs4 import BeautifulSoup

html_doc = '''
<div class="panel">
    <div class="panel-heading">
        <h4 class="four">Hello<b> H4 test </b></h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html_doc, 'lxml')

# Common CSS selector examples
soup.select('h4')[0].get_text()  # text of the h4 node
soup.select('h4')[0].get_text('|')  # join the text fragments with a separator
soup.select('h4')[0].get_text('|', strip=True)  # also strip whitespace around each fragment
soup.select('h4')[0].get('class')  # list of class values
soup.select("#list-1 .element:first-of-type ~ .element")  # all .element siblings after the first li under id list-1
soup.select("#list-1 .element:first-of-type + .element")  # the .element sibling immediately after the first li under id list-1
soup.select("#list-1 .element:first-of-type")  # first li under id list-1 [commonly used]
soup.select("#list-1 .element:last-child")  # last li under id list-1 [commonly used]
soup.select("#list-1 .element:last-of-type")  # last li under id list-1
soup.select("#list-1 .element:nth-of-type(2)")  # second li under id list-1
soup.select("#list-1 .element:nth-child(2)")  # second li under id list-1 [commonly used]

soup.select('.list')  # all tags whose class includes list
soup.select('#list-1')   # the tag whose id is list-1
soup.select('ul[class="list"]')  # ul tags whose class attribute is exactly "list"
soup.select('ul[id^="list"]')  # ul tags whose id starts with list
soup.select('div[class$="body"]')  # div tags whose class attribute ends with body
soup.select('ul[class*="small"]')  # ul tags whose class attribute contains small
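
The calls above return their results without printing them. The sketch below, using made-up markup and html.parser, prints a few results. It also shows select_one(), which returns the first match or None, and the difference between the class selector .list and the attribute selector [class="list"].

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1">
  <li class="element">Foo</li>
  <li class="element">Bar</li>
</ul>
<ul class="list list-small" id="list-2">
  <li class="element">Baz</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

second = soup.select_one("#list-1 > li:nth-of-type(2)")
print(second.get_text())                                       # Bar

print([ul["id"] for ul in soup.select("ul.list")])             # ['list-1', 'list-2']
print([ul["id"] for ul in soup.select('ul[class="list"]')])    # ['list-1']
print(soup.select_one("table"))                                # None
```

The attribute selector compares against the whole class attribute, so "list list-small" does not match [class="list"], while .list matches any tag that has list among its classes.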