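I. Concept
Beautiful Soup is a Python library for extracting data from HTML or XML files. It provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and gives the user the data they want to scrape.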
II. Installing the Beautiful Soup Library
On Windows: open cmd with "Run as administrator"
and run pip install beautifulsoup4
A quick test:
# code
from bs4 import BeautifulSoup
demo = '''
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can
learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic
Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2"
id="link2">Advanced Python</a>.</p>
</body></html>
'''
soup = BeautifulSoup(demo, 'html.parser')
print(soup.prettify())
The output is:
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can
learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic
Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
Using the Beautiful Soup library mostly comes down to these two lines:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>', 'html.parser')
[Note] The demo text can also be fetched with the requests library.
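A minimal sketch of fetching a page with requests and handing it to Beautiful Soup. The helper names fetch_html and make_soup, the timeout value, and the encoding guess are illustrative assumptions, not something the original notes specify.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its decoded text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()                      # raise on 4xx/5xx responses
    response.encoding = response.apparent_encoding   # guess the encoding from the body
    return response.text


def make_soup(html: str) -> BeautifulSoup:
    """Parse HTML text with Python's built-in html.parser (hypothetical helper)."""
    return BeautifulSoup(html, "html.parser")


# No network access is needed to try the parsing half:
soup = make_soup("<p>data</p>")
print(soup.p.string)  # data
```

With a real page, demo = fetch_html(url) followed by make_soup(demo) would replace the hard-coded demo string; the URL to use is left to the reader.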
III. Basic Elements of the Beautiful Soup Library
IV. Basic Elements of the BeautifulSoup Class
1) Tag:
Any tag that exists in HTML syntax can be accessed as soup.tag. When the HTML document contains several tags with the same name, soup.tag returns the first one.
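The rule above can be tried on a shortened variant of the demo page (the html string below is trimmed for illustration and is not the full demo):

```python
from bs4 import BeautifulSoup

# Shortened variant of the demo page with two <a> tags.
html = (
    '<html><head><title>This is a python demo page</title></head>'
    '<body><p class="course">'
    '<a href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.'
    '</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title   # the only <title> tag
first_link = soup.a      # two <a> tags exist; the first one is returned

print(title_tag)         # <title>This is a python demo page</title>
print(first_link["id"])  # link1
print(soup.table)        # None: this document has no <table> tag
```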
2) Tag name:
Every tag has a name, obtained via tag.name; it is a string.
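A short sketch of reading tag names, using a small hand-written snippet rather than the full demo:

```python
from bs4 import BeautifulSoup

html = '<body><p class="course"><a id="link1">Basic Python</a></p></body>'
soup = BeautifulSoup(html, "html.parser")

link = soup.a
print(link.name)                   # a
print(link.parent.name)            # p
print(link.parent.parent.name)     # body
print(isinstance(link.name, str))  # True
```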
3) Tag attrs (attributes):
A tag's attributes are organized as a dictionary and obtained via tag.attrs.
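As a small illustration, the attributes of the first link from the demo page can be read like a dictionary. Note that Beautiful Soup treats class as a multi-valued attribute, so its value is a list:

```python
from bs4 import BeautifulSoup

html = '<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.a
attrs = tag.attrs            # dictionary of all attributes

print(attrs["href"])         # http://www.icourse163.org/course/BIT-268001
print(attrs["class"])        # ['py1']
print(tag["id"])             # link1 (shortcut for tag.attrs["id"])
print(tag.get("title"))      # None: the attribute does not exist
```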
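4) Tag NavigableString:
The non-attribute string inside a tag, i.e. the text between <tag> and </tag>, obtained via tag.string.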
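A brief sketch of how tag.string behaves; the second snippet is a made-up example showing that .string is None when a tag has more than one child:

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup(
    '<p class="title"><b>The demo python introduces several python courses.</b></p>',
    "html.parser",
)

text = soup.b.string
print(text)                               # The demo python introduces several python courses.
print(isinstance(text, NavigableString))  # True
print(isinstance(text, str))              # True: NavigableString subclasses str

# <p> has a single child (<b>), so .string reaches through to the same text.
print(soup.p.string == text)              # True

# With several children there is no single string, so .string is None.
mixed = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
print(mixed.p.string)                     # None
```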
5) Tag Comment:
The comment part of a string inside a tag; Comment is a special kind of NavigableString.
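Because printing a Comment shows only its text, without the <!-- --> markers, a type check is needed to tell it apart from ordinary text. A minimal sketch with a made-up snippet:

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)

comment = soup.b.string
plain = soup.p.string

print(comment)                       # This is a comment
print(plain)                         # This is not a comment
print(isinstance(comment, Comment))  # True
print(isinstance(plain, Comment))    # False
```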
V. Traversing HTML Content with the bs4 Library
Take the HTML text demo from earlier:
demo = '''
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can
learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic
Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2"
id="link2">Advanced Python</a>.</p>
</body></html>
'''
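Its basic HTML structure is:
<html>
  <head> ... </head>
  <body> ... </body>
</html>
which forms the following tag tree:
html
  head
    title
  body
    p (class="title")
      b
    p (class="course")
      a (id="link1")
      a (id="link2")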
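A tag tree can be traversed in three ways: downward, upward, and sideways (sibling traversal).
[Note] Sibling traversal only happens among nodes under the same parent node.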
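1) Downward traversal:
Use .contents (a list of a tag's direct children), .children (an iterator over the direct children), and .descendants (an iterator over all descendants).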
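A small sketch of downward traversal with .contents, .children, and .descendants, run on a condensed, whitespace-free variant of the demo body (an assumption made so the child counts are easy to follow):

```python
from bs4 import BeautifulSoup, Tag

html = (
    '<body><p class="title"><b>Title</b></p>'
    '<p class="course">Learn <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p></body>'
)
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# .contents: list of direct children
print(len(body.contents))                       # 2

# .children: iterator over the same direct children
child_names = [child.name for child in body.children]
print(child_names)                              # ['p', 'p']

# .descendants: every node below body, strings included
descendant_tags = [node.name for node in body.descendants if isinstance(node, Tag)]
print(descendant_tags)                          # ['p', 'b', 'p', 'a', 'a']
```

Children can be strings as well as tags, which is why the last step filters on Tag.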
2) Upward traversal:
Use .parent (the node's parent tag) and .parents (an iterator over all of its ancestors).
Example code:
from bs4 import BeautifulSoup
demo = '''
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can
learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic
Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2"
id="link2">Advanced Python</a>.</p>
</body></html>
'''
soup = BeautifulSoup(demo, "html.parser")
print(soup.name)    # name of the root BeautifulSoup object
print(soup.parent)  # the root has no parent
# Walk upward from the first <a> tag to the root
for parent in soup.a.parents:
    print(parent.name)
"""
Output:
[document]
None
p
body
html
[document]
"""
soup is the root node and has no parent, so soup.parent is None.
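3) Sibling traversal:
Use .next_sibling and .previous_sibling (the next and previous node under the same parent, in document order), and .next_siblings and .previous_siblings (iterators over all following and preceding siblings). Siblings can be strings as well as tags.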
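A minimal sketch of sibling traversal on a one-paragraph snippet adapted from the demo. It also shows that the sibling of a tag is often a string rather than another tag:

```python
from bs4 import BeautifulSoup

html = '<p>Learn <a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

print(str(first.previous_sibling))          # the string "Learn " before the first link
print(str(first.next_sibling))              # the string " and " between the two links
print(first.next_sibling.next_sibling["id"])  # link2

# Iterate over everything after / before the first link under the same <p>.
following = [str(node) for node in first.next_siblings]
preceding = [str(node) for node in first.previous_siblings]
print(len(following))                       # 3: " and ", the second <a>, "."
print(len(preceding))                       # 1: "Learn "

# The first node under <p> has nothing before it.
print(first.previous_sibling.previous_sibling)  # None
```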
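VI. Formatted HTML Output with the bs4 Library
The prettify() method can be called on the BeautifulSoup object or on an individual tag and returns the HTML as an indented, multi-line string.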
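A short sketch of prettify() applied to a single tag instead of the whole document; the snippet is a trimmed piece of the demo page, and the one-space indent per level is the default of the bs4 versions I am aware of:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title"><b>Title</b></p>', "html.parser")

pretty = soup.p.prettify()   # a str with one tag or string per line
print(pretty)

# Each nesting level is indented by one space.
lines = pretty.splitlines()
print(lines[0])              # <p class="title">
print(lines[-1])             # </p>
```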
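[Note] The course material in this article comes from the Beijing Institute of Technology open online course: Python Web Crawler and Information Extraction (Python网络爬虫与信息提取).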