Python Library Introduction: Beautiful Soup

编程刘明

Published 2024-08-08 15:42:55

Article link: https://blog.csdn.net/qq_56262770/article/details/141028208

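Beautiful Soup, often abbreviated BS4 (the 4 is the version number), is a third-party Python library that quickly extracts specified data from HTML or XML documents.

BS4 depends on a document parser to parse a page. A common choice is the third-party lxml parser.

Python also ships with its own parser, html.parser, but it parses slightly slower than lxml.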


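# Import the parsing package
from bs4 import BeautifulSoup

# Sample document to parse (any HTML string works here)
html_doc = "<html><body><p>Hello</p></body></html>"

# Create the BeautifulSoup object; html.parser is the parser, and lxml or others can be used instead
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())   # prettify() pretty-prints the HTML/XML document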
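
Since lxml is a separate install, the parser choice described above can be made defensively. The following is a minimal sketch, assuming lxml may or may not be present; the helper name make_soup and the sample markup are made up for illustration, and FeatureNotFound is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when available, otherwise fall back to html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; use the parser bundled with Python
        return BeautifulSoup(markup, "html.parser")


# Hypothetical markup used only for this illustration
soup = make_soup("<p class='greeting'>hello</p>")
print(soup.p.get_text())
```

Either parser yields the same p tag for this simple input, so the rest of the code does not need to know which one was used.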

If the document is an external file, you can also open and read it with open(), as follows:

with open('html_doc.html', encoding='utf8') as f:
    soup = BeautifulSoup(f, 'lxml')

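# .text and .get_text() both return the text content under a tag.
# find_all() returns a list of tags, so use find() (or index the list) to get a single tag first.
soup.find('div', class_='name').text
soup.find('div', class_='name').get_text()


# .get() takes an attribute name and returns that attribute's value
soup.find('div', class_='name').get('src')
soup.find('div', class_='name').get('title')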
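
To make the difference between a single tag and a list of tags concrete, here is a small self-contained sketch. The markup and the variable names are invented for this illustration; only find(), find_all(), get_text() and get() come from the library.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
html = """
<div class="name" title="first">Alice</div>
<div class="name" title="second">Bob</div>
<img class="avatar" src="/img/a.png">
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag, so get_text() and get() work directly
first = soup.find("div", class_="name")
first_text = first.get_text()
first_title = first.get("title")

# find_all() returns a list of Tags, so read text or attributes per element
divs = soup.find_all("div", class_="name")
names = [div.get_text() for div in divs]
titles = [div.get("title") for div in divs]

# get() returns None when the attribute is missing
src = soup.find("img").get("src")
missing = first.get("src")

print(first_text, first_title, names, titles, src, missing)
```

Calling .text or .get() on the list returned by find_all() fails, which is why the list is iterated instead.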

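find_all() and find()

find_all() and find() are the most commonly used methods for searching a parsed HTML document. Both look up content that satisfies given conditions, which act as filters.

(1) find_all()

find_all() searches all descendants of the current tag, checks each node against the filter conditions, and returns the matching nodes as a list.

find_all( name , attrs , recursive , string , limit )

Parameters:

name: finds all tags whose name is name; string objects (text nodes) are ignored automatically.
attrs: searches tags by attribute name and value, passed as a dictionary. Attributes can also be given as keyword arguments; because class is a reserved word in Python, use class_ for that one.
recursive: by default find_all() searches all descendants of the tag; set recursive=False to search only its direct children.
string: searches the string content of the document. It accepts a string, a regular expression, a list, or True. This parameter was named text before Beautiful Soup 4.4.0.
limit: find_all() returns every match, which can be slow on large documents; limit caps the number of results returned.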
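
The parameters above can be sketched on a small document. This is an illustrative example under stated assumptions: the markup and variable names are made up, and the string keyword assumes Beautiful Soup 4.4.0 or later.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
html = """
<div id="menu">
  <a href="/a" class="item">Alpha</a>
  <p><a href="/b" class="item">Beta</a></p>
  <a href="/c">Gamma</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
menu = soup.find("div", id="menu")

# name: every <a> descendant of the div
all_links = menu.find_all("a")

# recursive=False: only <a> tags that are direct children of the div
direct_links = menu.find_all("a", recursive=False)

# attrs: filter by attribute name and value with a dictionary
item_links = menu.find_all("a", attrs={"class": "item"})

# limit: stop after the first match
first_link = menu.find_all("a", limit=1)

# string: match text nodes instead of tags (requires bs4 >= 4.4.0)
g_strings = menu.find_all(string=re.compile("^G"))

print([a.get_text() for a in direct_links], g_strings)
```

Note that the string filter returns text nodes rather than tags, so the last result is a list of strings.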

from bs4 import BeautifulSoup
import re

html_doc = """
<html><head><title>"c语言中文网"</title></head>
<body>
<p class="title"><b>c.biancheng.net</b></p>
<p class="website">一个学习编程的网站</p>
<a href="http://c.biancheng.net/python/" id="link1">python教程</a>
<a href="http://c.biancheng.net/c/" id="link2">c语言教程</a>
<a href="http://c.biancheng.net/django/" id="link3">django教程</a>
<p class="vip">加入我们阅读所有教程</p>
<a href="http://vip.biancheng.net/?from=index" id="link4">成为vip</a>
"""
# Create the soup object
soup = BeautifulSoup(html_doc, 'html.parser')
# Find and return all a tags
print(soup.find_all("a"))
# Find and return only the first two a tags
print(soup.find_all("a",limit=2))

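The results are returned as lists. The output is:

[<a href="http://c.biancheng.net/python/" id="link1">python教程</a>, <a href="http://c.biancheng.net/c/" id="link2">c语言教程</a>, <a href="http://c.biancheng.net/django/" id="link3">django教程</a>, <a href="http://vip.biancheng.net/?from=index" id="link4">成为vip</a>]
[<a href="http://c.biancheng.net/python/" id="link1">python教程</a>, <a href="http://c.biancheng.net/c/" id="link2">c语言教程</a>]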

Tags can also be searched by attribute name and attribute value, as follows:

print(soup.find_all("p",class_="website"))
print(soup.find_all(id="link4"))

Output:

[<p class="website">一个学习编程的网站</p>]
[<a href="http://vip.biancheng.net/?from=index" id="link4">成为vip</a>]

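Regular expressions, lists, and True can also be used as filters, for example:

# Find tags using a list of names
print(soup.find_all(['b','a']))
# Match the id attribute value with a regular expression
print(soup.find_all('a',id=re.compile(r'.\d')))
print(soup.find_all(id=True))
# True matches any value; the loop below finds every tag and prints its name
for tag in soup.find_all(True):
    print(tag.name,end=" ")
print()  # end the line of space-separated names
# Print all tags whose name starts with b
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)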

Output:

# Output of the first print:
[<b>c.biancheng.net</b>, <a href="http://c.biancheng.net/python/" id="link1">python教程</a>, <a href="http://c.biancheng.net/c/" id="link2">c语言教程</a>, <a href="http://c.biancheng.net/django/" id="link3">django教程</a>, <a href="http://vip.biancheng.net/?from=index" id="link4">成为vip</a>]

# Output of the second print:
[<a href="http://c.biancheng.net/python/" id="link1">python教程</a>, <a href="http://c.biancheng.net/c/" id="link2">c语言教程</a>, <a href="http://c.biancheng.net/django/" id="link3">django教程</a>, <a href="http://vip.biancheng.net/?from=index" id="link4">成为vip</a>]
# Output of the third print:
[<a href="http://c.biancheng.net/python/" id="link1">python教程</a>, <a href="http://c.biancheng.net/c/" id="link2">c语言教程</a>, <a href="http://c.biancheng.net/django/" id="link3">django教程</a>, <a href="http://vip.biancheng.net/?from=index" id="link4">成为vip</a>]
# Output of the first loop:
html head title body p b p a a a p a
# Output of the last loop (one name per line):
body
b

To keep code short, BS4 provides a shorthand for find_all(), as follows:

# Full form
soup.find_all("a")
# Shorthand
soup("a")

Both forms return the same result.
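
As a quick check of the shorthand, the sketch below compares both call styles on a tiny document. The markup is invented for this illustration; the point is only that calling the soup object forwards its arguments to find_all().

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
html = '<p><a id="x1">one</a> <a id="x2">two</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Calling a soup (or tag) object forwards the arguments to find_all()
same = soup("a") == soup.find_all("a")

# Keyword arguments such as limit are forwarded as well
limited = soup("a", limit=1)

print(same, limited)
```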

(2) find()

find() works like find_all(). The difference is that find_all() returns every match in the document, while find() returns only the first match, so find() has no limit parameter. For example:

from bs4 import BeautifulSoup
import re
html_doc = """<html><head><title>"c语言中文网"</title></head><body>
<p class="title"><b>c.biancheng.net</b></p>
<p class="website">一个学习编程的网站</p>
<a href="http://c.biancheng.net/python/" id="link1">python教程</a>
<a href="http://c.biancheng.net/c/" id="link2">c语言教程</a><a href="http://c.biancheng.net/django/" id="link3">django教程</a>
<p class="vip">加入我们阅读所有教程</p>
<a href="http://vip.biancheng.net/?from=index" id="link4">成为vip</a>"""
# Create the soup object
soup = BeautifulSoup(html_doc, 'html.parser')
# Find the first a tag and return it directly
print(soup.find('a'))
# Find the title tag
print(soup.find('title'))
# Match the a tag with the given href attribute
print(soup.find('a',href='http://c.biancheng.net/python/'))
# Match an attribute value with a regular expression
print(soup.find(class_=re.compile('tit')))
# Use the attrs parameter
print(soup.find(attrs={'class':'vip'}))

Output, in the order of the print calls:

a tag:
<a href="http://c.biancheng.net/python/" id="link1">python教程</a>
title:
<title>"c语言中文网"</title>
Specified href attribute:
<a href="http://c.biancheng.net/python/" id="link1">python教程</a>
Regex match:
<p class="title"><b>c.biancheng.net</b></p>
attrs parameter:
<p class="vip">加入我们阅读所有教程</p>

When find() does not find a matching tag it returns None, whereas find_all() returns an empty list. For example:

print(soup.find('bdi'))
print(soup.find_all('audio'))

Output:

None
[]
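
Because find() may return None, accessing an attribute on its result can raise AttributeError. Below is a minimal sketch of one way to guard against that; the helper name text_of and the markup are assumptions made for this illustration, not part of the library.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
html = '<p class="website">site</p><a href="/python/" id="link1">python</a>'
soup = BeautifulSoup(html, "html.parser")


def text_of(node, name, default=""):
    """Return the text of the first <name> tag under node, or default if absent."""
    tag = node.find(name)
    return tag.get_text() if tag is not None else default


found = text_of(soup, "a")
absent = text_of(soup, "bdi")

# An empty list from find_all() is always safe to iterate
audio_sources = [t.get("src") for t in soup.find_all("audio")]

print(repr(found), repr(absent), audio_sources)
```

The explicit None check keeps the failure mode visible, while the find_all() path needs no guard at all.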

BS4 also provides a shorthand for find(), as follows:

# Shorthand
print(soup.head.title)
# The line above is equivalent to
print(soup.find("head").find("title"))

Both forms produce the same output:

<title>"c语言中文网"</title>
<title>"c语言中文网"</title>

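CSS selectors

BS4 supports most CSS selectors, including the common tag, class, and id selectors as well as hierarchical selectors. Beautiful Soup provides a select() method; pass it a selector and it returns the matching content from the HTML document. For example: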

#coding:utf8
html_doc = """<html><head><title>"c语言中文网"</title></head><body>
<p class="title"><b>c.biancheng.net</b></p>
<p class="website">一个学习编程的网站</p>
<a href="http://c.biancheng.net/python/" id="link1">python教程</a>
<a href="http://c.biancheng.net/c/" id="link2">c语言教程</a><a href="http://c.biancheng.net/django/" id="link3">django教程</a>
<p class="vip">加入我们阅读所有教程</p>
<a href="http://vip.biancheng.net/?from=index" id="link4">成为vip</a>
<p class="introduce">介绍:<a href="http://c.biancheng.net/view/8066.html" id="link5">关于网站</a>
<a href="http://c.biancheng.net/view/8092.html" id="link6">关于站长</a></p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
# Select by tag name
print(soup.select('title'))
# Select with an attribute selector
print(soup.select('a[href]'))
# Select by class
print(soup.select('.vip'))
# Select descendant nodes
print(soup.select('html head title'))
# Select adjacent sibling nodes
print(soup.select('p + a'))
# Select a following sibling of a p tag by id
print(soup.select('p ~ #link3'))
# nth-of-type(n) matches the n-th sibling element of the same type
print(soup.select('p ~ a:nth-of-type(1)'))
# Select child nodes
print(soup.select('p > a'))
print(soup.select('.introduce > #link5'))

Output:

First output:
[<title>"c语言中文网"</title>]

Second output:
[<a href="http://c.biancheng.net/python/" id="link1">python教程</a>, <a href="http://c.biancheng.net/c/" id="link2">c语言教程</a>, <a href="http://c.biancheng.net/django/" id="link3">django教程</a>, <a href="http://vip.biancheng.net/?from=index" id="link4">成为vip</a>, <a href="http://c.biancheng.net/view/8066.html" id="link5">关于网站</a>, <a href="http://c.biancheng.net/view/8092.html" id="link6">关于站长</a>]

Third output:
[<p class="vip">加入我们阅读所有教程</p>]

Fourth output:
[<title>"c语言中文网"</title>]

Fifth output:
[<a href="http://c.biancheng.net/python/" id="link1">python教程</a>, <a href="http://vip.biancheng.net/?from=index" id="link4">成为vip</a>]

Sixth output:
[<a href="http://c.biancheng.net/django/" id="link3">django教程</a>]

Seventh output:
[<a href="http://c.biancheng.net/python/" id="link1">python教程</a>]

Eighth output:
[<a href="http://c.biancheng.net/view/8066.html" id="link5">关于网站</a>, <a href="http://c.biancheng.net/view/8092.html" id="link6">关于站长</a>]

Output of the last print:
[<a href="http://c.biancheng.net/view/8066.html" id="link5">关于网站</a>]
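
In practice the selected tags are usually reduced to attribute values or text. The sketch below shows one way to do that, along with select_one(), which returns only the first match. The markup and variable names are made up for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration
html = """
<p class="introduce">about:
<a href="/view/8066.html" id="link5">site</a>
<a href="/view/8092.html" id="link6">owner</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

# select() returns a list of Tags, so attributes are read per element
hrefs = [a["href"] for a in soup.select(".introduce > a")]

# select_one() returns the first match, or None when nothing matches
first = soup.select_one('a[id^="link"]')
nothing = soup.select_one("a.missing")

print(hrefs, first["id"], nothing)
```

select_one() plays the same role for selectors that find() plays for find_all(), including the None result when nothing matches.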
