Getting Started with the Beautiful Soup Library

Author: 不多余的星星

Original article: https://blog.csdn.net/CJX_up/article/details/77414857


I. Concept

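Beautiful Soup is a Python library for extracting data from HTML and XML files. It provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It works as a toolbox: it parses a document and hands the user the data to be scraped.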

Official Beautiful Soup website: https://www.crummy.com/software/BeautifulSoup/

II. Installing the Beautiful Soup Library

On Windows: open cmd with "Run as administrator",
then run pip install beautifulsoup4

A quick test:

# Code
from bs4 import BeautifulSoup

demo = '''
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general‐purpose programming language. You can
learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT‐268001" class="py1" id="link1">Basic
Python</a> and <a href="http://www.icourse163.org/course/BIT‐1001870001" class="py2"
id="link2">Advanced Python</a>.</p>
</body></html>
'''

soup = BeautifulSoup(demo, 'html.parser')
print(soup.prettify())

The output is:

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general‐purpose programming language. You can
learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT‐268001" id="link1">
    Basic
Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT‐1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

Using Beautiful Soup mostly comes down to these two lines:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>data</p>', 'html.parser')

Note: the demo text can also be fetched with the requests library.
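The screenshot that showed this step is not available, so the following is only a minimal sketch of how it could look. The helper names `fetch_html` and `make_soup` and the placeholder URL are assumptions made for illustration, not taken from the original post.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on HTTP error codes
    # Use the encoding guessed from the body, which helps with non-ASCII pages.
    response.encoding = response.apparent_encoding
    return response.text


def make_soup(html):
    """Parse HTML text with the parser from Python's standard library."""
    return BeautifulSoup(html, "html.parser")


# Usage (performs a network request; the URL is a placeholder):
# demo = fetch_html("https://example.com/")
# soup = make_soup(demo)
# print(soup.title)
```

Keeping the download and the parsing in separate functions makes it easy to test the parsing step on a local string such as the demo text above.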

III. Basic Elements of the Beautiful Soup Library
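As a small illustration of the pieces involved, the sketch below builds a soup object and inspects it. It assumes only the built-in "html.parser" is available; the one-line HTML string is a made-up example.

```python
from bs4 import BeautifulSoup

# "html.parser" ships with Python's standard library; other parsers such as
# "lxml" or "html5lib" can be passed instead, but need a separate install.
soup = BeautifulSoup("<p class='title'>data</p>", "html.parser")

# The BeautifulSoup object stands for the whole parsed document.
print(type(soup).__name__)  # BeautifulSoup
print(soup.name)            # [document]

# Tags inside the document are reached as attributes of the soup object.
print(soup.p)               # <p class="title">data</p>
```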

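IV. Basic Elements of the BeautifulSoup Class

The BeautifulSoup class has five basic elements: Tag, Name, Attributes, NavigableString, and Comment.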

1) Tag:
Any tag that exists in HTML syntax can be accessed as soup.<tag> (for example soup.title). When the document contains several tags with the same name, soup.<tag> returns the first one.
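A minimal sketch of this behaviour, using a shortened, made-up HTML snippet rather than the full demo page:

```python
from bs4 import BeautifulSoup

html = '<p>See <a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
soup = BeautifulSoup(html, "html.parser")

# soup.a returns only the first <a> tag in the document.
first_link = soup.a
print(first_link)                 # <a id="link1">Basic Python</a>
print(type(first_link).__name__)  # Tag

# find_all() is needed to get every matching tag.
all_links = soup.find_all("a")
print(len(all_links))             # 2

# A tag that does not occur in the document yields None.
print(soup.table)                 # None
```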

2) Tag name:
Every tag has a name, available as tag.name; it is a string.
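For example, assuming a small made-up snippet in which a link sits inside a paragraph:

```python
from bs4 import BeautifulSoup

html = '<body><p class="course"><a id="link1">Basic Python</a></p></body>'
soup = BeautifulSoup(html, "html.parser")

link = soup.a
print(link.name)                 # a
print(link.parent.name)          # p
print(link.parent.parent.name)   # body
print(type(link.name).__name__)  # str
```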

3) Tag attrs (attributes):
A tag's attributes are available as a dictionary through tag.attrs.
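The sketch below shows the dictionary in use. The href value is a placeholder chosen for illustration.

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com/basic" class="py1" id="link1">Basic Python</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

# The full attribute dictionary.
print(link.attrs)

# Single attributes can be read with dictionary-style indexing on the tag.
print(link["href"])       # http://example.com/basic

# class can hold several values, so it is returned as a list.
print(link["class"])      # ['py1']

# get() returns None instead of raising KeyError for a missing attribute.
print(link.get("title"))  # None
```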

4) Tag NavigableString:
The text inside a tag is available through tag.string and has the type NavigableString.
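A short sketch of tag.string, reusing the first paragraph of the demo page. The second snippet, `mixed`, is a made-up example showing the case where .string has no single answer.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = '<p class="title"><b>The demo python introduces several python courses.</b></p>'
soup = BeautifulSoup(html, "html.parser")

text = soup.b.string
print(text)                               # The demo python introduces several python courses.
print(isinstance(text, NavigableString))  # True

# <p> has exactly one child (<b>), so .string reaches through to the same text.
print(soup.p.string == text)              # True

# When a tag has more than one child, .string is None.
mixed = BeautifulSoup("<p>one <b>two</b></p>", "html.parser")
print(mixed.p.string)                     # None
```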

5) Tag Comment:
An HTML comment inside a tag is returned as a Comment, a special kind of string.
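Because tag.string returns both ordinary text and comments, the type has to be checked to tell them apart. A minimal sketch with a made-up snippet:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = "<b><!--This is a comment--></b><p>This is not a comment</p>"
soup = BeautifulSoup(html, "html.parser")

comment = soup.b.string
plain = soup.p.string

# Printing hides the <!-- --> markers, so the two look alike.
print(comment)  # This is a comment
print(plain)    # This is not a comment

# The type is what distinguishes them.
print(isinstance(comment, Comment))  # True
print(isinstance(plain, Comment))    # False
```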

V. Traversing HTML Content with bs4

Take the demo HTML text from earlier:

demo = '''
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general‐purpose programming language. You can
learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT‐268001" class="py1" id="link1">Basic
Python</a> and <a href="http://www.icourse163.org/course/BIT‐1001870001" class="py2"
id="link2">Advanced Python</a>.</p>
</body></html>
'''

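Its HTML structure forms a tag tree: <html> contains <head> (which holds <title>) and <body>, and <body> contains two <p> tags, the first wrapping a <b> tag and the second wrapping two <a> tags.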
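The tree can also be printed from code. The sketch below is one way to do it, using a hypothetical helper `tag_tree_lines` and a shortened stand-in for the demo page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def tag_tree_lines(node, depth=0):
    """Return one indented line per tag below node (hypothetical helper)."""
    lines = []
    for child in node.children:
        # Skip text nodes; only tags are part of the tag tree.
        if isinstance(child, Tag):
            lines.append("  " * depth + child.name)
            lines.extend(tag_tree_lines(child, depth + 1))
    return lines


html = (
    "<html><head><title>This is a python demo page</title></head>"
    '<body><p class="title"><b>Intro</b></p>'
    '<p class="course">Courses: <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")
lines = tag_tree_lines(soup)
print("\n".join(lines))
```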

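A tag tree can be traversed in three ways: downward, upward, and sideways (between siblings).
Note: sibling traversal only happens among nodes that share the same parent.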

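1) Downward traversal:
.contents returns a tag's direct children as a list, .children iterates over them, and .descendants iterates over every node below the tag.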
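A minimal sketch of the three downward attributes, on a shortened stand-in for the demo page written without line breaks between tags. In the real demo text, the newlines between tags also show up as string children, so the counts there are larger.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<body><p class="title"><b>Intro</b></p>'
    '<p class="course">Courses: <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p></body>'
)
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# .contents: the direct children, as a list.
print(len(body.contents))  # 2

# .children: the same nodes, as an iterator.
child_names = [child.name for child in body.children]
print(child_names)         # ['p', 'p']

# .descendants: every node below body, text nodes included.
descendant_tags = [node.name for node in body.descendants if isinstance(node, Tag)]
print(descendant_tags)     # ['p', 'b', 'p', 'a', 'a']
```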

2) Upward traversal:
.parent returns a node's parent tag, and .parents iterates over all of its ancestors.

Example code:

from bs4 import BeautifulSoup

demo = '''
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general‐purpose programming language. You can
learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT‐268001" class="py1" id="link1">Basic
Python</a> and <a href="http://www.icourse163.org/course/BIT‐1001870001" class="py2"
id="link2">Advanced Python</a>.</p>
</body></html>
'''

soup = BeautifulSoup(demo, "html.parser")
print(soup.name)
print(soup.parent)
for parent in soup.a.parents:
    print(parent.name)

"""
Output:
[document]
None
p
body
html
[document]
"""

soup is the root node, so it has no parent and soup.parent is None.

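3) Sibling traversal:
.next_sibling and .previous_sibling return the adjacent node in document order; .next_siblings and .previous_siblings iterate over all following or preceding siblings.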
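One thing worth seeing in code is that a sibling is not always a tag: the text between two tags is a sibling too. A minimal sketch with a made-up snippet:

```python
from bs4 import BeautifulSoup

html = '<p>Courses: <a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first_link = soup.a

# The nodes right before and after the first <a> are plain strings.
print(repr(first_link.previous_sibling))  # 'Courses: '
print(repr(first_link.next_sibling))      # ' and '

# The second <a> is two steps away.
second_link = first_link.next_sibling.next_sibling
print(second_link["id"])                  # link2

# Iterating over all following siblings: ' and ', the second <a>, and '.'
following = list(first_link.next_siblings)
print(len(following))                     # 3
```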

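VI. Formatted HTML Output with bs4

The prettify() method returns the HTML as a string with one tag or string per line, indented by nesting depth.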
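A minimal sketch of prettify() on a one-tag document; it can be called on the whole soup object or on a single tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>data</p>", "html.parser")

# prettify() on the whole document returns a str, not a Tag.
pretty = soup.prettify()
print(pretty)

# It also works on an individual tag.
pretty_tag = soup.p.prettify()
print(pretty_tag)
```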

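Note: the course materials in this post come from the Beijing Institute of Technology online open course "Python Web Crawler and Information Extraction" (Python网络爬虫与信息提取).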
