Web Scraping: A Detailed Guide to the Beautiful Soup Library

SmiledrinkCat

Category: Python Web Scraping. Tags: big data, xml, python, security, http

Article link: https://blog.csdn.net/SmiledrinkCat/article/details/105835090

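Importing the Beautiful Soup Library

Beautiful Soup is also known as beautifulsoup4 or bs4. By convention it is imported as shown below; most of the time the BeautifulSoup class is all you need.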

from bs4 import BeautifulSoup

import bs4

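Usage example

from bs4 import BeautifulSoup
soup = BeautifulSoup("<html>data</html>", "html.parser")
# A local HTML file can be parsed as well; the with block closes the file afterwards
with open("C:/demo.html", encoding="utf-8") as f:  # utf-8 is an assumption about the file
    soup2 = BeautifulSoup(f, "html.parser")

Here "html.parser" selects the parser: Python's built-in HTML parser, which bs4 can use without any extra installation.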

The Four Parsers Supported by bs4

Parser	Usage	Requirement
Python's built-in HTML parser	BeautifulSoup(mk, 'html.parser')	Install bs4 (the parser itself ships with Python)
lxml HTML parser	BeautifulSoup(mk, 'lxml')	pip install lxml
lxml XML parser	BeautifulSoup(mk, 'xml')	pip install lxml
html5lib parser	BeautifulSoup(mk, 'html5lib')	pip install html5lib
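Because three of the four parsers depend on an extra package, it can help to fall back to the built-in one when the preferred parser is missing. The sketch below is one possible way to do that; `make_soup` is a hypothetical helper, not part of bs4, and it relies on bs4 raising `FeatureNotFound` when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: try the preferred parser, else use html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the built-in one instead
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>data</p>")
print(soup.p.string)
```

Different parsers may build slightly different trees from malformed HTML, so pinning one parser explicitly is usually the safer choice when results must be reproducible.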

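Basic Elements of bs4

1. Tag

Description:

A tag is the most basic unit of information in the document. It opens with <> and closes with </>.

Any tag that exists in the HTML document can be accessed as soup.<tag>.

When the document contains several <tag> elements, soup.<tag> returns the first one.

Usage example: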

from bs4 import BeautifulSoup
# demo is assumed to hold the HTML source of the page as a string
soup = BeautifulSoup(demo, "html.parser")

print(soup.title)

# Output:
# <title>demo page</title>

tag = soup.a
print(tag)

# Output:
# <a class="py" id="link">Python</a>
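The examples in this article use a variable named demo without defining it. The following self-contained sketch fills that gap with a small, made-up page (the HTML content is an assumption chosen to match the outputs shown in this article) and shows that soup.<tag> returns a Tag object for the first match only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample page standing in for `demo`
demo = """<html><head><title>demo page</title></head>
<body>
<p>Links: <a class="py" id="link">Python</a> <a class="bs" id="link2">Beautiful Soup</a></p>
</body></html>"""

soup = BeautifulSoup(demo, "html.parser")

first_a = soup.a                      # only the first <a> is returned
print(isinstance(first_a, Tag))       # the result is a bs4.element.Tag
print(first_a)
print(soup.title)
```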

2. Name

Description:

The name of a tag. The name of <p>…</p> is 'p'. It is accessed as <tag>.name and its type is str.

Usage example:

from bs4 import BeautifulSoup
soup = BeautifulSoup(demo, "html.parser")

print(soup.a.name)

# Output:
# a
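Since every Tag has a name, .name can be chained with navigation attributes to inspect where a tag sits in the tree. A minimal sketch, using a made-up one-line document:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
soup = BeautifulSoup("<html><body><p><a>Python</a></p></body></html>", "html.parser")

a_name = soup.a.name                        # name of the first <a> tag
parent_name = soup.a.parent.name            # name of the tag that contains it
grandparent_name = soup.a.parent.parent.name
root_name = soup.name                       # the soup object has a special name

print(a_name, parent_name, grandparent_name, root_name)
```

The BeautifulSoup object itself reports the name "[document]", which is useful to know when walking up the tree.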

3. Attributes

Description:

The attributes of a tag, organised as a dictionary. They are accessed as <tag>.attrs and the type is dict.

Usage example:

from bs4 import BeautifulSoup
soup = BeautifulSoup(demo, "html.parser")
tag = soup.a

print(tag.attrs)

# Output:
# {'class': ['py'], 'id': 'link'}


print(tag.attrs['class'])

# Output:
# ['py']
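A Tag can also be indexed like a dictionary, which is often shorter than going through .attrs. The sketch below uses a made-up <a> tag and shows the list value of class alongside a safe lookup for an attribute that is absent.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
soup = BeautifulSoup('<a class="py" id="link">Python</a>', "html.parser")
tag = soup.a

attrs = tag.attrs            # plain dict of all attributes
link_id = tag["id"]          # same as tag.attrs["id"]
classes = tag["class"]       # class is multi-valued, so bs4 returns a list
href = tag.get("href")       # .get() returns None instead of raising KeyError

print(attrs, link_id, classes, href)
```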

4. NavigableString

Description:

The non-attribute string inside a tag, that is, the text between <>…</>. It is accessed as <tag>.string.

Usage example:

from bs4 import BeautifulSoup
soup = BeautifulSoup(demo, "html.parser")

print(soup.a)

# Output:
# <a class="py" id="link">Python</a>


print(soup.a.string)

# Output:
# Python
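One detail worth knowing is that .string only yields a value when the tag has a single child; with mixed content it is None. The following sketch, on a made-up paragraph, contrasts that with get_text(), which concatenates all text under the tag.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup used only for this illustration
soup = BeautifulSoup('<p>Links: <a class="py" id="link">Python</a></p>', "html.parser")

a_string = soup.a.string       # <a> has exactly one child, a string
p_string = soup.p.string       # <p> has two children (text and <a>), so this is None
p_text = soup.p.get_text()     # all text below <p>, joined together

print(repr(a_string), p_string, repr(p_text))
```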

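5. Comment

Description:

The comment part of the string inside a tag. Comment is a special subclass of NavigableString, and .string returns the comment text without the <!-- and --> markers, so check the type when the difference matters.

Usage example:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")

print(soup.b.string)

# Output:
# This is a comment
# type(soup.b.string) is <class 'bs4.element.Comment'>


print(soup.p.string)

# Output:
# This is not a comment
# type(soup.p.string) is <class 'bs4.element.NavigableString'>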
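Because a comment and ordinary text print the same way, an isinstance check is a simple way to tell them apart. A minimal sketch under that assumption; the variable names are illustrative only:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)

kinds = {}
for tag in (soup.b, soup.p):
    # Comment is a subclass of NavigableString, so test for Comment specifically
    kinds[tag.name] = "comment" if isinstance(tag.string, Comment) else "text"
    print(tag.name, kinds[tag.name], tag.string)
```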

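Traversing the Tree with bs4

Downward traversal of the tag tree

Attribute	Description
.contents	A list of child nodes; all direct children of <tag> are stored in the list
.children	An iterator over child nodes; like .contents, used to loop over direct children
.descendants	An iterator over all descendant nodes, used to loop over the whole subtree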

for child in soup.body.children:
    print(child)

for child in soup.body.descendants:
    print(child)
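The difference between the three attributes is easiest to see by counting nodes. The sketch below uses a made-up body without whitespace between tags; in real pages, newlines between tags also appear as string nodes among the children.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup used only for this illustration
soup = BeautifulSoup("<body><p>one <b>two</b></p><p>three</p></body>", "html.parser")
body = soup.body

direct = body.contents                  # list: the two <p> tags
via_iter = list(body.children)          # same nodes, obtained from an iterator
everything = list(body.descendants)     # tags and strings at every depth

# Keep only Tag nodes to skip the strings
tag_names = [node.name for node in everything if isinstance(node, Tag)]

print(len(direct), len(everything), tag_names)
```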

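Upward traversal of the tag tree

Attribute	Description
.parent	The parent tag of a node
.parents	An iterator over the ancestor tags of a node, used to loop over all ancestors

Iterating over all ancestors includes the soup object itself, so that case should be handled separately.

from bs4 import BeautifulSoup
soup = BeautifulSoup(demo, "html.parser")

for parent in soup.a.parents:
    if isinstance(parent, BeautifulSoup):
        # The last ancestor is the soup object itself; its name is "[document]"
        print("document root")
    else:
        print(parent.name)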
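To make the order of the ancestors concrete, the sketch below collects their names from a made-up document. The last entry is the soup object, which is why the loop above treats it as a special case.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
soup = BeautifulSoup("<html><body><p><a>x</a></p></body></html>", "html.parser")

# Walk from the nearest ancestor up to the document root
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)

# The soup object has no parent, which is where the walk stops
print(soup.parent)
```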

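Sibling traversal of the tag tree

Attribute	Description
.next_sibling	The next sibling node in HTML document order
.previous_sibling	The previous sibling node in HTML document order
.next_siblings	An iterator over all following sibling nodes in HTML document order
.previous_siblings	An iterator over all preceding sibling nodes in HTML document order

Sibling traversal happens only among nodes that share the same parent. A sibling can be a string as well as a tag.

# Iterate over the following siblings
for sibling in soup.a.next_siblings:
    print(sibling)

# Iterate over the preceding siblings
for sibling in soup.a.previous_siblings:
    print(sibling)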
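Siblings are frequently plain strings, so code that expects tags should filter by type. A minimal sketch on a made-up paragraph:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup used only for this illustration
soup = BeautifulSoup("<p><b>one</b> and <a>two</a>!</p>", "html.parser")

nxt = soup.b.next_sibling                 # the string " and ", not the <a> tag
following = list(soup.b.next_siblings)    # " and ", <a>two</a>, "!"
preceding = list(soup.a.previous_siblings)  # nearest first: " and ", <b>one</b>

# Keep only the siblings that are tags
following_tags = [s.name for s in following if isinstance(s, Tag)]

print(repr(nxt), len(following), following_tags)
```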
