[To crawl well, you need the basics]: Data parsing with the BeautifulSoup4 library


金鞍少年


Article link: https://blog.csdn.net/weixin_42444693/article/details/105260238


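This article introduces Python's BeautifulSoup4 library in detail: an overview of the library and a comparison of parsers, basic usage, the four kinds of objects, techniques for traversing the document tree, searching the tree with find and find_all, and using select selectors to parse HTML data efficiently.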


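1. Introduction to BeautifulSoup 4

1.1 Overview

Simply put, BeautifulSoup is a Python library whose main job is extracting data from web pages (HTML and XML documents). The official description reads:

Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and gives you the data you need to extract.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it; in that case you only need to specify the original encoding.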
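The encoding behaviour described above can be sketched as follows. This is a minimal illustration, not code from the original article: the GBK-encoded sample bytes are made up, and html.parser is used only so the snippet runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical input: GBK-encoded bytes with no charset declaration.
raw = "<p>你好</p>".encode("gbk")

# from_encoding tells Beautiful Soup which encoding to try first.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string               # text inside the tree is a Unicode str
utf8_bytes = soup.encode("utf-8")  # serialise the tree to UTF-8 bytes

print(text)
print(soup.original_encoding)  # the encoding Beautiful Soup ended up using
print(utf8_bytes)
```

If the document declares its encoding, or Beautiful Soup can detect it, the from_encoding argument can usually be left out.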

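1.2 Parser comparison

No.	Parser	Usage	Advantages	Disadvantages
1	Python standard library (built-in)	BeautifulSoup(html, 'html.parser')	Built in, no extra install; decent speed	Slower than lxml; less lenient than html5lib
2	lxml HTML parser	BeautifulSoup(html, 'lxml')	Very fast; lenient	Must be installed; depends on a C library
3	lxml XML parser	BeautifulSoup(html, 'xml')	Very fast; supports XML	Must be installed; depends on a C library
4	html5lib parser	BeautifulSoup(html, 'html5lib')	Parses pages the way a browser does; the most lenient	Very slow; must be installed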
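To see how parsers differ in practice, the sketch below feeds the same piece of invalid markup to each HTML parser and records the resulting tree. The sample markup and the results dictionary are illustrative choices. Parsers that are not installed are skipped, since only html.parser ships with Python.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Invalid markup: a stray closing </p> with no matching opening tag.
broken = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # Raised when the requested parser is not installed.
        results[parser] = None

for parser, tree in results.items():
    print(f"{parser:12} -> {tree}")
```

Because each parser repairs bad markup in its own way, it helps to name the parser explicitly in every BeautifulSoup(...) call so that results stay the same across machines.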

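2. Basic usage of BeautifulSoup 4

2.1 Installation and documentation

pip install beautifulsoup4

The examples in this article use the lxml parser, which is installed separately:

pip install lxml

Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

2.2 Importing

from bs4 import BeautifulSoup

2.3 A basic example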

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html,'lxml')

# prettify() formats the parse tree as a Unicode string, with each HTML/XML tag on its own line
print(soup.prettify())

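3. The four kinds of objects

Beautiful Soup turns a complex HTML document into a tree of Python objects. Every node in the tree is one of four kinds of object:

No.	Object	Purpose
1	Tag	Corresponds to a tag in the HTML. Every tag in BeautifulSoup is a Tag, and the BeautifulSoup object itself is essentially a Tag as well, so methods such as find and find_all actually belong to Tag rather than to BeautifulSoup.
2	NavigableString	Holds the text inside a tag. It inherits from Python's str and can be used just like a str.
3	BeautifulSoup	Represents the content of the whole document. It can also be treated as a special Tag.
4	Comment	A special kind of NavigableString that holds the content of a comment node in the document.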
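The relationships between these four classes can be checked directly. The snippet below is a small sketch with a made-up one-line document; it uses html.parser only so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup("<p>text<!-- note --></p>", "html.parser")
tag = soup.p
text, comment = tag.contents  # the <p> has two children: a string and a comment

print(type(soup).__name__, isinstance(soup, Tag))                   # BeautifulSoup is a Tag
print(type(tag).__name__)                                           # Tag
print(type(text).__name__, isinstance(text, str))                   # NavigableString is a str
print(type(comment).__name__, isinstance(comment, NavigableString)) # Comment is a NavigableString
```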

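3.1 Tag example

HTML tags such as title, head, a, and p, together with the content they contain, are Tags. Note that accessing a tag by name (for example soup.p) returns only the first matching tag in the whole document.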

<head><title>The Dormouse's story</title></head>
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

Getting a tag

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<b><!--Hey, buddy. Want to buy a used parser?--></b>
"""

soup = BeautifulSoup(html,'lxml')

# Get a tag (the first <p> in the document)
print(soup.p)  # <p class="title" name="dromouse"></p>

# Check the type of the p tag
print(type(soup.p))  # <class 'bs4.element.Tag'>
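The "first match only" behaviour is easy to confirm by comparing attribute access with find_all. This is a minimal sketch on a made-up three-paragraph document, using html.parser so it has no extra dependencies.

```python
from bs4 import BeautifulSoup

html = """
<p class="title">first</p>
<p class="story">second</p>
<p class="story">third</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.p              # attribute access returns the first <p> only
all_p = soup.find_all("p")  # find_all returns every <p> in document order

print(first)               # <p class="title">first</p>
print(len(all_p))          # 3
print(first is all_p[0])   # True
```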

A Tag has two important attributes: name and attrs.

# Get the tag name
print(soup.p.name)  # p

# Get all attributes of the tag as a dict
print(soup.p.attrs)  # {'class': ['title'], 'name': 'dromouse'}

# Get the value of the class attribute of the p tag
print(soup.p['class'])  # ['title']
print(soup.p.get('class'))  # ['title']
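Two details of attribute access are worth a short sketch. First, class is multi-valued and comes back as a list, while single-valued attributes such as id come back as plain strings. Second, tag['x'] raises KeyError for a missing attribute, while tag.get('x') returns None or a default. The sample tag below is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a class="sister big" id="link1" href="http://example.com/elsie">Elsie</a>',
    "html.parser",
)
a = soup.a

classes = a["class"]              # multi-valued attribute -> list of strings
link_id = a["id"]                 # single-valued attribute -> str
missing = a.get("title")          # missing attribute -> None
fallback = a.get("title", "n/a")  # missing attribute with a default

try:
    a["title"]                    # subscript access on a missing attribute
    raised = False
except KeyError:
    raised = True

print(classes, link_id, missing, fallback, raised)
```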

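3.2 NavigableString example

Use .string to get the text inside a tag, for example:

from bs4.element import NavigableString

# Get the content of the first a tag; here the tag contains only a comment
print(soup.a.string)  #  Elsie 

# Check the type of a.string: a Comment, which is a subclass of NavigableString
print(type(soup.a.string))  # <class 'bs4.element.Comment'>
print(isinstance(soup.a.string, NavigableString))  # True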
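Because .string can return a plain NavigableString, a Comment, or None (when the tag has more than one child), it is often worth checking which one came back. The helper describe below is a hypothetical name introduced for this sketch, and the sample markup is made up.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a><p>one <b>two</b></p>'
soup = BeautifulSoup(html, "html.parser")

def describe(node):
    """Classify the value returned by tag.string."""
    if node is None:
        return "None"      # the tag has several children, so .string is None
    if isinstance(node, Comment):
        return "comment"   # check Comment first: it subclasses NavigableString
    return "text"

kinds = [describe(tag.string) for tag in soup.find_all(["a", "p"])]
print(kinds)  # ['comment', 'text', 'None']
```

Checking for Comment before treating the value as ordinary text keeps comment content out of the extracted data.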

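3.3 BeautifulSoup example

The BeautifulSoup object is a special Tag, so we can inspect its type, name, and attributes, for example:

# Type, name, and attributes of the BeautifulSoup object

print(type(soup))
# <class 'bs4.BeautifulSoup'>

print(type(soup.name))
# <class 'str'>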
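As a small self-contained sketch of the same idea (the one-line document is made up, and html.parser is used to avoid extra dependencies), the name and attrs of a BeautifulSoup object can be inspected like those of any other Tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hi</p>", "html.parser")

doc_name = soup.name    # the special name given to the document root
doc_attrs = soup.attrs  # the document root carries no HTML attributes

print(doc_name)   # [document]
print(doc_attrs)  # {}
```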

