A Detailed Guide to Python's BeautifulSoup Library

Latest recommended article published on 2024-03-21 08:49:26

FearlessVoyager

Category: python  Tags: python, beautifulsoup, programming languages

Article link: https://blog.csdn.net/qq_33807380/article/details/129191505

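1. Introduction

BeautifulSoup is a flexible, convenient library for parsing web pages. It processes documents efficiently, automatically converts input documents to Unicode and output documents to UTF-8, and supports several parsers. Its main use is extracting data from web pages.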

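2. Parsers

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, 'html.parser')	Built into Python; moderate speed	Less tolerant of malformed markup before Python 3.2.2
lxml HTML parser	BeautifulSoup(markup, 'lxml')	Fast; tolerant of malformed markup	Requires installing a C library
lxml XML parser	BeautifulSoup(markup, 'xml')	Fast; the only parser that supports XML	Requires installing a C library
html5lib	BeautifulSoup(markup, 'html5lib')	Most tolerant; parses the document the way a browser does; produces valid HTML5	Slow; requires installing an external Python package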
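
As a rough illustration of choosing among the parsers in the table, the sketch below tries a preferred parser first and falls back to the built-in one when the preferred one is not installed. The helper name make_soup and the sample markup are made up for this example; FeatureNotFound is the exception bs4 raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first usable parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one
            continue
    raise RuntimeError("none of the requested parsers is installed")

# Deliberately unclosed tags: each parser repairs them, possibly in its own way
soup, used = make_soup("<p>Hello <b>world")
print(used)               # name of the parser that was actually used
print(soup.b.get_text())  # text inside the repaired <b> tag
```

Which parser ends up being used depends on what is installed, so the resulting tree can differ slightly between machines; pinning one parser explicitly avoids that.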

3. Basic Usage Steps

3.1 Get the page source (you can also build the page source yourself from a string)

import requests

# Set a request header that mimics the Chrome browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
# URL of the image page to scrape
addr = 'https://sc.chinaz.com/tupian/'

def getHtmlSource():
    # Request the page with the browser-like header and print the response text
    res = requests.get(addr, headers=headers).text
    print(res)

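requests.get().text returns the body decoded to str (Unicode), while requests.get().content returns the raw bytes. With .text alone, Chinese characters come out garbled whenever requests picks the wrong encoding, which typically happens when the response headers do not declare a charset. Either of the following approaches fixes the garbled text:
(1) Set the page encoding manually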

    res = requests.get(addr, headers=headers)
    res.encoding = 'UTF-8'
    html_doc = res.text

(2) Decode the .content bytes yourself

    html_doc = str(requests.get(addr, headers=headers).content, 'UTF-8')
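
To see the difference without making a network request, here is a minimal offline sketch. The bytes in raw stand in for what requests.get(...).content would return, and the sample title is invented for the example. It also shows from_encoding, a BeautifulSoup constructor argument that names the encoding when you pass bytes directly.

```python
from bs4 import BeautifulSoup

# Stand-in for requests.get(...).content: UTF-8 bytes containing Chinese text
raw = '<html><head><title>图片下载</title></head></html>'.encode('utf-8')

# Decoding with the wrong charset produces garbled text
garbled = raw.decode('iso-8859-1')

# Decoding with the right charset restores the original text
html_doc = raw.decode('utf-8')
title = BeautifulSoup(html_doc, 'html.parser').title.string

# Alternatively, pass the bytes in and name the encoding via from_encoding
title_from_bytes = BeautifulSoup(raw, 'html.parser', from_encoding='utf-8').title.string

print(title, title_from_bytes)
```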

3.2 Parse the page response with a parser

from bs4 import BeautifulSoup

# Parse the response with the built-in html.parser
soup = BeautifulSoup(html_doc, 'html.parser')
# Parse the response with the lxml HTML parser
soup = BeautifulSoup(html_doc, 'lxml')
# Parse the response with the lxml XML parser
soup = BeautifulSoup(html_doc, 'xml')
# Parse the response with html5lib
soup = BeautifulSoup(html_doc, 'html5lib')

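4. The Four Kinds of Objects

Beautiful Soup converts a complex HTML document into a tree structure. Every node in the tree is a Python object, and all of these objects fall into four kinds: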

Tag
NavigableString
BeautifulSoup
Comment

4.1 Tag

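Put simply, a Tag is one of the tags in the HTML, such as title, head, a, or p, together with the content it encloses. Note that accessing a tag by name (for example soup.title) returns only the first matching tag in the whole document.

# Create the BeautifulSoup object
soup = BeautifulSoup(html_doc, 'html.parser')
# Get the first <title> tag as a Tag object
print(soup.title)
## <title>图片、图片下载、高清图片材</title>

# Get the first <a> tag as a Tag object
print(soup.a)
## <a class="logo" href="/"><img src="../static/common/com_images/image.png"/></a>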

A Tag has two important attributes: name and attrs.
(1) name: the tag's type name:

print(soup.title.name)
# title
print(soup.p.name)
# p

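(2) attrs: the tag's attributes as a dictionary:

# Get all attributes of the tag
soup.p.attrs
# Get one attribute of the tag (indexing raises KeyError if it is missing; get returns None)
soup.p.attrs['js-do']
soup.p.get('js-do')
# Change an attribute of the tag
soup.p.attrs['js-do'] = 'newContent'
# Delete an attribute of the tag
del soup.p.attrs['js-do']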
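
The attribute operations above can be tried on a small self-contained snippet. The markup here is invented for the example. One detail worth seeing in practice: class is a multi-valued attribute, so Beautiful Soup returns it as a list.

```python
from bs4 import BeautifulSoup

html_doc = '<p class="intro lead" js-do="toggle">Hello</p>'
soup = BeautifulSoup(html_doc, 'html.parser')
p = soup.p

all_attrs = dict(p.attrs)   # snapshot of every attribute
classes = p['class']        # multi-valued attribute, returned as a list
js_do = p.get('js-do')      # value of an existing attribute
missing = p.get('data-x')   # get() returns None for an absent attribute

p['js-do'] = 'newContent'   # tag[key] is shorthand for tag.attrs[key]
del p['js-do']              # remove the attribute again
remaining = dict(p.attrs)

print(all_attrs, remaining)
```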

4.2 NavigableString

Its main purpose is to give access to the text inside a tag.

# Get the text inside the tag
print(soup.title.string)

4.3 BeautifulSoup

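The BeautifulSoup object represents the content of a whole document. Most of the time you can treat it as a Tag object; it is a special Tag, and you can read its name and attributes too.

# Get the name of the BeautifulSoup object
print(soup.name)
# Get the attributes of the BeautifulSoup object
print(soup.attrs)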
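
A quick check of what these two values look like, on a tiny document made up for the example. Note that the attribute dictionary is spelled attrs; an unknown attribute name on a soup or tag is treated as a search for a child tag with that name.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="x">hi</p>', 'html.parser')

doc_name = soup.name      # the document object has a special name
doc_attrs = soup.attrs    # the document itself carries no attributes
p_name = soup.p.name      # an ordinary tag reports its tag name

print(doc_name, doc_attrs, p_name)
```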

4.4 Comment

A Comment object is a special type of NavigableString. If the content inside a tag is a comment, the .string output does not include the comment markers.

print(soup.a)
## <a class="logo" href=""><!-- zhushi --></a>
print(soup.a.string)
## zhushi
print(type(soup.a.string))
## <class 'bs4.element.Comment'>
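
Because a comment comes back through .string just like ordinary text, it is often useful to tell the two apart. A minimal sketch, using invented markup, that checks the type with isinstance and collects comments and regular text separately:

```python
from bs4 import BeautifulSoup, Comment

html_doc = '<a class="logo" href=""><!-- zhushi --></a><p>visible text</p>'
soup = BeautifulSoup(html_doc, 'html.parser')

node = soup.a.string
is_comment = isinstance(node, Comment)

# Collect every comment in the document
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

# Collect the text nodes that are not comments
texts = [s for s in soup.find_all(string=True) if not isinstance(s, Comment)]

print(is_comment, comments, texts)
```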

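5. Searching the Document Tree

find(name, attrs, recursive, string, **kwargs): returns the first matching tag, or None if nothing matches;
find_all(name, attrs, recursive, string, limit, **kwargs): returns a list of all matching tags, or an empty list if nothing matches;

name: matches on the tag name. The value acts as a filter and can be a single tag name, a list of tag names, a regular expression, or even a function.
attrs: matches on the tag's attributes.
recursive: whether to search recursively. The default is True, which searches all descendants of the current tag; when set to False, only direct children are searched.
string: matches on the tag's text content.
limit: caps the number of results returned.
kwargs: also matches on the tag's attributes. It differs from attrs only in how it is written; the attribute key cannot be a Python reserved word and cannot be the same as one of the other parameter names.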
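
The parameters above can be compared side by side on a small document. The markup and variable names below are invented for the example; it is only a sketch of how find, find_all, recursive, and limit behave.

```python
from bs4 import BeautifulSoup

html_doc = '<div id="box"><a href="a.html">A</a><p><a href="b.html">B</a></p></div>'
soup = BeautifulSoup(html_doc, 'html.parser')
box = soup.find('div', id='box')

first_link = soup.find('a')                         # first match only
all_links = box.find_all('a')                       # searches all descendants
direct_links = box.find_all('a', recursive=False)   # direct children only
limited = soup.find_all('a', limit=1)               # stop after one result
no_table = soup.find('table')                       # nothing matches
no_tables = soup.find_all('table')                  # nothing matches

print(len(all_links), len(direct_links), len(limited), no_table, len(no_tables))
```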

5.1 Matching by name

import re

# Find all <a> tags
soup.find_all(name="a") # can be shortened to soup.find_all("a")

# Find all <title> or <link> tags
soup.find_all(name=['title', 'link']) # can be shortened to soup.find_all(['title', 'link'])

# Find all tags whose name starts with "a"
soup.find_all(name=re.compile("^a")) # can be shortened to soup.find_all(re.compile("^a"))

# Find nodes that have a class attribute but no id attribute
# (the filter function must be defined before it is used)
def hasClassNoId(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(hasClassNoId)

# Find every tag inside <body>
soup.find('body').find_all(True)

5.2 Matching by attrs

# Find all tags whose name attribute is tb_name
soup.find_all(attrs={"name": "tb_name"})

# Find all tags whose id attribute is id_attr1 or id_attr2
soup.find_all(attrs={'id': ['id_attr1', 'id_attr2']})

# Find all tags whose id attribute contains id_
soup.find_all(attrs={'id': re.compile('id_')})

# Find all tags that have an id attribute
soup.find_all(attrs={'id': True})

# Find all tags whose href attribute ends with .html
def search_html(attr):
    return attr and attr.lower().endswith('.html')
soup.find_all(attrs={'href': search_html})

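5.3 Matching by kwargs

Keyword arguments also match on tag attributes. Two attributes need special care: name and class. name is also the name of find_all's first parameter, so name='value' passed on its own is matched against the tag name, and passed together with a positional tag name it raises an error. class is a reserved word in Python, so a class match must be written as class_='value'.

# Find <div> tags whose name attribute is backpage
soup.find_all('div', name='backpage')  ## raises TypeError: name is given twice

# Find all tags whose class attribute is backpage
soup.find_all(class_='backpage')

# Find all tags whose id attribute is id_attr1 or id_attr2
soup.find_all(id=['id_attr1', 'id_attr2'])

# Find all tags whose href attribute contains .html
soup.find_all(href=re.compile(r'\.html'))

# Find all tags that have an id attribute
soup.find_all(id=True)

# Find all tags whose href attribute ends with .html
def search_html(attr):
    return attr and attr.lower().endswith('.html')
soup.find_all(href=search_html)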
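
Since the name attribute cannot be matched with a keyword argument, the attrs dictionary is the usual way around it. The sketch below, on invented markup, shows the TypeError and the attrs workaround, and also shows that class_ matches a tag when any one of its classes equals the value.

```python
from bs4 import BeautifulSoup

html_doc = ('<div name="backpage" class="nav backpage">prev</div>'
            '<div name="other" class="nav">other</div>')
soup = BeautifulSoup(html_doc, 'html.parser')

# Passing name twice (positionally and by keyword) is rejected by Python itself
try:
    soup.find_all('div', name='backpage')
    raised = False
except TypeError:
    raised = True

# Match the HTML attribute called "name" through the attrs dictionary instead
by_name_attr = soup.find_all('div', attrs={'name': 'backpage'})

# class_ matches when any one of the tag's classes equals the value
by_class = soup.find_all(class_='backpage')

print(raised, by_name_attr, by_class)
```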

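5.4 Matching by string

Note that this returns the text values (NavigableString objects), not the tags. To get the tag that holds a value, use its parent attribute; previous_element gives the same tag only when the text is the first thing inside that tag.

# Find all text values equal to '上一页' (previous page)
soup.find_all(string='上一页')

# Find all tags whose text value is '上一页'
[value.parent for value in soup.find_all(string='上一页')]

# Find all text values equal to '上一页' or '下一页' (next page)
soup.find_all(string=['上一页','下一页'])

# Find all text values that contain '页'
soup.find_all(string=re.compile('页'))

# Find every text value in the document
soup.find_all(string=True)

# Find all text values that end with '页'
def search_string(string):
    return string and string.lower().endswith('页')
soup.find_all(string=search_string)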
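
A small end-to-end sketch of going from matched text back to the enclosing tags. The pager markup is invented for the example; parent is the attribute that links a text node to the tag containing it.

```python
import re
from bs4 import BeautifulSoup

html_doc = ('<div><a href="index_1.html">上一页</a>'
            '<a href="index_3.html">下一页</a>'
            '<a href="about.html">关于</a></div>')
soup = BeautifulSoup(html_doc, 'html.parser')

# Text nodes that end with '页'
strings = soup.find_all(string=re.compile('页$'))

# Step from each text node to its enclosing <a> tag and read the link
hrefs = [s.parent['href'] for s in strings]

print(hrefs)
```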

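6. Traversing the Document Tree

contents: returns a list of all direct children.
children: returns an iterator over all direct children.
descendants: returns a generator over all descendants.

contents and children include only direct children; descendants includes children, grandchildren, and every deeper level.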
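
The difference between the three is easiest to see by counting nodes. A minimal sketch on invented markup, written on one line so that no newline text nodes get in the way:

```python
from bs4 import BeautifulSoup, Tag

html_doc = '<div class="container"><a href="#">link</a><p>text <b>bold</b></p></div>'
soup = BeautifulSoup(html_doc, 'html.parser')
container = soup.find('div', class_='container')

# Direct children only: the <a> and the <p>
child_names = [c.name for c in container.contents]

# children yields the same nodes as contents, just lazily
same_children = list(container.children) == container.contents

# descendants walks the whole subtree, including text nodes
descendant_tags = [d.name for d in container.descendants if isinstance(d, Tag)]
descendant_count = len(list(container.descendants))

print(child_names, descendant_tags, descendant_count)
```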

6.1 Get all children of the target node with contents

tag_soup = soup.find('div', class_='container').contents
print(type(tag_soup))
for t in tag_soup:
    if t != '\n':  # skip newline-only text nodes
        print(t)

6.2 Get all children of the target node with children

tag_soup = soup.find('div', class_='container').children
print(type(tag_soup))
for t in tag_soup:
    if t != '\n':  # skip newline-only text nodes
        print(t)

6.3 Get all descendants of the target node with descendants

tag_soup = soup.find('div', class_='container').descendants
print(type(tag_soup))
for t in tag_soup:
    if t != '\n':  # skip newline-only text nodes
        print(t)

6.4 Get all ancestors of the target node with parents

tag_soup = soup.find('div', class_='container').parents
print(type(tag_soup))
for t in tag_soup:
    if t != '\n':  # always true here: ancestors are tags, never newline strings
        print(t)

6.5 Get other nodes related to the target node

a_soup = soup.find('div', class_='container').a  # the first <a> tag inside the div

print(a_soup.parent)  # the parent of the <a> tag

print(a_soup.next_sibling)  # the next sibling of the <a> tag

print(a_soup.previous_sibling)  # the previous sibling of the <a> tag

print(a_soup.next_siblings)  # generator over all siblings after the <a> tag

print(a_soup.previous_siblings)  # generator over all siblings before the <a> tag
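
Printing next_siblings or previous_siblings directly only shows a generator object, so in practice you iterate over them. A small sketch on invented markup that turns each relationship into something readable:

```python
from bs4 import BeautifulSoup

html_doc = '<div class="container"><span>0</span><a href="#">1</a><p>2</p><p>3</p></div>'
soup = BeautifulSoup(html_doc, 'html.parser')
a_soup = soup.find('div', class_='container').a

parent_name = a_soup.parent.name
next_name = a_soup.next_sibling.name
prev_name = a_soup.previous_sibling.name

# The plural forms are generators, so iterate to get at the nodes
following = [t.get_text() for t in a_soup.next_siblings]
preceding = [t.get_text() for t in a_soup.previous_siblings]

# parents walks upward until it reaches the document object
ancestors = [t.name for t in a_soup.parents]

print(parent_name, next_name, prev_name, following, preceding, ancestors)
```

In real pages the nearest sibling is often a whitespace text node rather than a tag, which is why the markup here is kept on one line.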

7. CSS Selectors

7.1 Selecting by tag name

# Find all title tags
soup.select('title')

# Find all input tags anywhere inside a div
soup.select('div input')

# Find title tags inside head inside html
soup.select("html head title")

7.2 Selecting by id

# Find the tag whose id is id_text
soup.select("#id_text")

# Find the tags whose id is id_text1 or id_text2
soup.select("#id_text1, #id_text2")

# Find the input tag whose id is id_text1
soup.select('input#id_text1')

7.3 Selecting by class name

# Find tags with the class nextpage
soup.select(".nextpage")

# Find tags with the class nextpage or the class active
soup.select('.nextpage, .active')

# Find a tags with the class nextpage
soup.select('a.nextpage')

7.4 Selecting by attribute

# Select a tags that have an href attribute
soup.select('a[href]')

# Select a tags whose href is exactly index_2.html
soup.select('a[href="index_2.html"]')

# Select a tags whose href starts with index
soup.select('a[href^="index"]')

# Select a tags whose href ends with html
soup.select('a[href$="html"]')

# Select a tags whose href contains index
soup.select('a[href*="index"]')

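7.5 Other selectors

# Find a tags that are direct children of a div
soup.select("div > a")

# Find a tags that are the third a tag within their parent
soup.select("a:nth-of-type(3)")

# Find input tags that come after an a tag with the same parent
soup.select("a~input")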
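
To see a few of these selectors return actual results, here is a short sketch on invented pager markup. It also uses select_one, which returns the first match (or None) instead of a list.

```python
from bs4 import BeautifulSoup

html_doc = ('<div id="pager">'
            '<a class="prevpage" href="index_1.html">prev</a>'
            '<a class="nextpage active" href="index_3.html">next</a>'
            '<input id="id_text1" type="text">'
            '</div>')
soup = BeautifulSoup(html_doc, 'html.parser')

links = soup.select('div > a')              # direct <a> children of the div
next_link = soup.select_one('a.nextpage')   # first match, not a list
second = soup.select('a:nth-of-type(2)')    # the second <a> within its parent
inputs = soup.select('a ~ input')           # <input> siblings that follow an <a>
missing = soup.select_one('#no_such_id')    # no match

print([a['href'] for a in links], next_link['href'], missing)
```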
