BeautifulSoup

Latest recommended article published on 2023-11-12 14:34:27

Lipgrant_python

Category: Web Scraping. Tags: BeautifulSoup

Article link: https://blog.csdn.net/weixin_41677555/article/details/85004804

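This article explains how to use the Python library BeautifulSoup, covering installation, parser selection, the built-in object types, navigating the tree, searching the tree, and filters, to help readers extract data from web pages.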

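Overview

What is BeautifulSoup? BeautifulSoup is a Python library for extracting data from HTML or XML files. It is commonly used in web scrapers.

At the time of writing, the latest BeautifulSoup version was given as 4.4.0. Use a 4.x release, because the 3.x series is no longer maintained.

Although the library is called BeautifulSoup, the package to install is named beautifulsoup4.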

pip install beautifulsoup4

Import it like this:

from bs4 import BeautifulSoup
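
As a quick orientation, the sketch below parses a small HTML string and reads two pieces of text from it. The sample markup is made up for illustration, and the standard library's html.parser is used only so that the snippet runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
markup = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

soup = BeautifulSoup(markup, "html.parser")
title_text = soup.title.string   # text inside <title>
para_text = soup.p.get_text()    # text inside the first <p>

print(title_text)
print(para_text)
```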

Parsers

BeautifulSoup parses HTML and XML files, but it relies on an underlying parser library, such as lxml, to do so.

Parsers differ in speed and behavior, as summarized below:

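Parser	Usage	Advantages	Disadvantages
html.parser	BeautifulSoup(content, 'html.parser')	Part of the Python standard library; decent speed; lenient	Less lenient in Python versions before 2.7.3 or 3.2.2
lxml HTML	BeautifulSoup(content, 'lxml')	Fast; stable	Depends on the third-party lxml library
lxml XML	BeautifulSoup(content, 'lxml-xml') or BeautifulSoup(content, 'xml')	Fast; the only parser that supports XML	Depends on the third-party lxml library
html5lib	BeautifulSoup(content, 'html5lib')	Most lenient; parses the document the way a browser does; produces valid HTML5	Slow; depends on the third-party html5lib library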

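Differences between parsers

HTML or XML files are not always well-formed; for example, some tags may be left unclosed.

When handling such malformed documents, different parsers can produce different results.

A simple example: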

from bs4 import BeautifulSoup

# Deliberately malformed markup: <a> is never closed and </p> was never opened.
content = "<a></p>"

soup1 = BeautifulSoup(content, 'html.parser')
soup2 = BeautifulSoup(content, 'lxml')
soup3 = BeautifulSoup(content, 'html5lib')
soup4 = BeautifulSoup(content, 'xml')

print(soup1)  # <a></a>
print(soup2)  # <html><body><a></a></body></html>
print(soup3)  # <html><head></head><body><a><p></p></a></body></html>
print(soup4)  # <?xml version="1.0" encoding="utf-8"?> followed by <a/>

Clearly, html5lib produces the most complete and standards-conforming result, but it is also the slowest.

So choose the parser according to your actual needs.
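
One way to act on that choice is to prefer a fast parser when it is installed and fall back to the standard library otherwise. The sketch below assumes a helper named make_soup, which is hypothetical and not part of bs4; it relies on bs4 raising FeatureNotFound when a requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: build a soup with the first installed parser."""
    for parser_name in preferred:
        try:
            return BeautifulSoup(markup, parser_name)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup = make_soup("<a></p>")
print(soup.a.name)
```

Keep in mind that, as shown above, the resulting tree can differ between parsers, so a fallback like this is only appropriate when the code does not depend on how malformed markup is repaired.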

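Object types

The BeautifulSoup module has four main types of objects. Parsing an HTML or XML file and extracting data from it mostly comes down to working with these four types.

Object	Description
Tag	Corresponds to a tag in the HTML or XML document
NavigableString	Represents the text contained in an HTML or XML tag
BeautifulSoup	Represents the entire HTML or XML document
Comment	Represents a comment in the HTML or XML document

The short HTML snippet below is used to demonstrate these object types.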

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
</body>
</html>
"""


from bs4 import BeautifulSoup

soup=BeautifulSoup(html,'lxml')
tag=soup.find('p')
print(type(tag))#Tag
string=tag.string
print(type(string))#NavigableString

print(type(soup))#BeautifulSoup

# Output:
#<class 'bs4.element.Tag'>
#<class 'bs4.element.NavigableString'>
#<class 'bs4.BeautifulSoup'>

Comment objects are a special case: they hold the content of a comment.

soup = BeautifulSoup('<b><!--Hey--></b>', 'lxml')
comment = soup.b.string
print(type(comment))

# Output:
# <class 'bs4.element.Comment'>

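Tag objects

A Tag object corresponds to a tag in the HTML document.

The most important properties of a tag are its name (name) and its attributes (attrs).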

soup = BeautifulSoup('<p id=123 class="red blue">Hey</p>', 'lxml')
tag = soup.p
print(tag.name)   # p
print(tag.attrs)  # {'id': '123', 'class': ['red', 'blue']}

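The name and attrs properties can be read, and they can also be modified.

attrs returns a dictionary, so it can be manipulated like any other dictionary.

If an attribute holds several values, they are stored as a list.

For example, class can hold several values at once.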

soup = BeautifulSoup('<p id=123 class="red blue">Hey</p>', 'lxml')
tag = soup.p
tag.name = 'a'                    # rename the tag
tag.attrs['id'] = '456'           # replace an attribute value
tag.attrs['class'][0] = 'white'   # change the first class value
print(soup)

# Output: <html><body><a class="white blue" id="456">Hey</a></body></html>
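
Besides going through attrs, a Tag itself supports dictionary-style access. The sketch below is a small illustration on made-up markup, parsed with html.parser so that it runs without lxml; it reads, adds, and deletes attributes directly on the tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="123" class="red blue">Hey</p>', "html.parser")
tag = soup.p

tag_id = tag["id"]           # same value as tag.attrs["id"]
classes = tag["class"]       # multi-valued attribute, returned as a list
missing = tag.get("title")   # None instead of a KeyError for absent attributes

tag["title"] = "greeting"    # add a new attribute
del tag["id"]                # remove an attribute
has_id = tag.has_attr("id")  # check for presence without raising

print(tag.attrs)
```

Using get() or has_attr() is the safer choice when an attribute may be absent, since tag["name"] raises KeyError in that case.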

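NavigableString objects

A NavigableString object represents the text contained in a Tag object.

It looks like a string, and it is in fact a subclass of str, but it also carries a reference to the parse tree it belongs to.

So if you want to use the text elsewhere, it is best to convert it explicitly with str(), even though it supports the usual string methods without conversion.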

from bs4 import BeautifulSoup

soup=BeautifulSoup('<p>Hey</p>','lxml')
tag=soup.p
string=tag.string
print(type(string))
print(string.split('e'))
string=str(string)
print(string.lower())

# Output:
#<class 'bs4.element.NavigableString'>
#['H', 'y']
#hey

The text can also be changed, either by assigning a new value to tag.string or by calling the replace_with method.

from bs4 import BeautifulSoup

soup=BeautifulSoup('<p>Hey</p>','lxml')
tag=soup.p
a='Heloo'
tag.string=a
print(soup)
tag.string.replace_with('KO')
print(soup)

# Output:
#<html><body><p>Heloo</p></body></html>
#<html><body><p>KO</p></body></html>

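BeautifulSoup objects

A BeautifulSoup object represents the whole HTML document. You can also think of it as a deeply nested Tag object at the top of the tree.

However, a BeautifulSoup object does not correspond to an actual HTML tag, so its name attribute is artificially set to [document].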

soup=BeautifulSoup('<p>Hey</p>','lxml')
print(soup.name)

# Output:
#[document]

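Comment objects

The three types above cover almost everything needed when processing HTML documents.

The Comment object is just a supplement for comment content; it can be regarded as a special kind of NavigableString.

It is rarely needed in everyday use.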
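
One case where Comment does matter is when comments must be kept out of extracted text. The sketch below, on made-up markup parsed with html.parser, finds every comment with an isinstance check and removes it from the tree.

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<p>Hey<!--hidden note--> there</p>", "html.parser")

# Match only string nodes that are Comment instances.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c) for c in comments]

# Remove each comment from the tree.
for c in comments:
    c.extract()

remaining = soup.p.get_text()
print(comment_texts)
print(remaining)
```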

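Navigating the tree

With the object types understood, the navigation tree is simply the hierarchy that relates those objects to one another.

Once an object is selected, you can reach its children, its parent, and its siblings.

Again, an example: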

html = """
<div>Total
    <p class="story"> First_p
        <a id="1">El</a>,
        <a id="2">E2</a>,
        <a id="3">E3</a>,
    </p>
    <p>Second_p</p>
</div>
"""
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
tag=soup.p

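First, when several tags share the same name, accessing by tag name always returns the first one.

Start with contents. It returns a list of all the direct children of the tag.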

print(len(tag.contents))
print(tag.contents)

# Output:
# 7
# [' First_p\n        ', <a id="1">El</a>, ',\n        ', <a id="2">E2</a>, ',\n        ', <a id="3">E3</a>, ',\n    ']

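This may look odd: you might expect tag (the first p tag) to contain three a tags, but it actually has seven elements.

That is because contents returns every child of the tag, including NavigableString objects.

Note that even text made up of newlines counts as an element.

The counterpart of contents is children, which provides the same items through a generator instead of a list.

contents and children cover only direct children, while descendants recursively iterates over every nested element.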

print(len(list(tag.descendants)))
print(list(tag.descendants))

# Output:
# 10
# [' First_p\n        ', <a id="1">El</a>, 'El', ',\n        ', <a id="2">E2</a>, 'E2', ',\n        ', <a id="3">E3</a>, 'E3', ',\n    ']
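
Because strings are mixed in with tags, code that only wants the child tags has to filter them out. The sketch below shows two ways to do that on a shortened, made-up version of the example, parsed with html.parser: an isinstance check on children, and find_all(True, recursive=False).

```python
from bs4 import BeautifulSoup, Tag

html = """<p class="story"> First_p
<a id="1">E1</a>,
<a id="2">E2</a>,
</p>"""

soup = BeautifulSoup(html, "html.parser")
p = soup.p

# Option 1: keep only the children that are Tag objects.
child_tags = [child for child in p.children if isinstance(child, Tag)]
ids = [t["id"] for t in child_tags]

# Option 2: True matches every tag; recursive=False limits it to direct children.
direct = p.find_all(True, recursive=False)

print(len(p.contents), ids)
```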

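Similarly:

parent is the tag's parent object, and parents is a generator over all of its ancestors.

next_sibling and previous_sibling are the next and previous sibling elements; next_siblings and previous_siblings are the corresponding generators.

Note that siblings follow the order of the contents list, so strings made up of newlines count as siblings too.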
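
The following sketch illustrates these attributes on a small made-up fragment parsed with html.parser. It also shows find_next_sibling, which skips over the in-between strings and returns the next matching tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p><a id="1">E1</a>,\n<a id="2">E2</a></p></div>', "html.parser")
first_a = soup.a

# Walk upwards: p, then div, then the BeautifulSoup object itself.
ancestor_names = [parent.name for parent in first_a.parents]

# The immediate sibling is the string between the two <a> tags, not the second <a>.
raw_next = first_a.next_sibling

# find_next_sibling ignores strings and returns the next matching tag.
next_tag = first_a.find_next_sibling("a")

print(ancestor_names)
print(repr(raw_next), next_tag["id"])
```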

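The element attributes are similar to the sibling attributes:

next_element, next_elements, previous_element, previous_elements

They differ from the sibling attributes in scope.

The sibling attributes only cover elements that share the same parent, while the element attributes walk through the whole document in parse order.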

print(list(tag.contents[4].next_siblings))
print(list(tag.contents[4].next_elements))

# Output:
# [<a id="3">E3</a>, ',\n    ']
# [<a id="3">E3</a>, 'E3', ',\n    ', '\n    ', <p>Second_p</p>, 'Second_p', '\n', '\n']

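Clearly, elements that lie outside the tag but are still part of the document are also covered when iterating over next_elements.

Finally, a note on the string attribute.

When the contents of a Tag or BeautifulSoup object has length 1 and that single child is a string (or a tag wrapping a single string), string returns that NavigableString object, that is, the text.

When contents has more than one element, string returns None; use strings instead.

strings is a generator, and stripped_strings is a variant that strips surrounding whitespace and skips whitespace-only strings.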

print(list(tag.strings))
print(list(tag.stripped_strings))

# Output:
# [' First_p\n        ', 'El', ',\n        ', 'E2', ',\n        ', 'E3', ',\n    ']
# ['First_p', 'El', ',', 'E2', ',', 'E3', ',']
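
To make the difference between string and strings concrete, the sketch below uses a made-up paragraph with mixed content, parsed with html.parser. The inner tag has a single string child, while the outer tag has three children.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>One <b>two</b> three</p>", "html.parser")

inner = soup.b.string    # <b> has exactly one string child
outer = soup.p.string    # <p> has three children, so this is None

# Collect all text pieces, stripped, and join them with single spaces.
joined = " ".join(soup.p.stripped_strings)

print(inner, outer, joined)
```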

Searching the tree

BeautifulSoup provides many methods for searching the tree.

find	find_all
find_parent	find_parents
find_next_sibling	find_next_siblings
find_previous_sibling	find_previous_siblings
find_next	find_all_next
find_previous	find_all_previous

Their names largely indicate what they do.

They also share the same usage and parameters.
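
As a rough illustration of how the direction in each method name maps to behavior, the sketch below starts from the middle paragraph of a made-up fragment (parsed with html.parser) and searches upwards, sideways, and along the parse order.

```python
from bs4 import BeautifulSoup

html = '<div id="box"><p id="a">A</p><p id="b">B</p><p id="c">C</p></div>'
soup = BeautifulSoup(html, "html.parser")
b = soup.find("p", id="b")

parent_id = b.find_parent("div")["id"]             # search upwards
next_id = b.find_next_sibling("p")["id"]           # search later siblings
prev_id = b.find_previous_sibling("p")["id"]       # search earlier siblings
after_id = b.find_next("p")["id"]                  # search later in parse order
before_ids = [t["id"] for t in b.find_all_previous("p")]  # search earlier in parse order

print(parent_id, next_id, prev_id, after_id, before_ids)
```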

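find_all is the most commonly used, so the examples below use find_all.

Filters

A filter gives a search method such as find_all a quick and precise expression to match against.

These expressions can match tag names, attributes, strings, or any combination of them.

In summary, there are five kinds of filter:

a tag name, a regular expression, a list, the boolean True, and a function.

An example follows.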

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<div>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</div>
<div>
<p class="st">Last<p class="st">......</p></p>
</div>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

Regular expressions

A regular expression can be used as a filter, for example to match every tag whose name contains 'a'.

import re

print(soup.find_all(re.compile('a')))

# Output:
# [<head><title>The Dormouse's story</title></head>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Lists

Every item in the list is used as a filter, for example to search for all a tags and b tags.


print(soup.find_all(['a', 'b']))

# Output:
# [<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The boolean True

The boolean True matches every Tag object.

print(soup.find_all('p')[1].find_all(True))

# Output:
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Functions

Building on the boolean True, you can define a function that takes a tag as its only argument and returns a boolean.

def is_link2(tag):
    # Use .get() so tags without an id attribute do not raise KeyError.
    return tag.get('id') == 'link2'

print(soup.find_all('p')[1].find_all(is_link2))

# Output:
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
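
A function filter is most useful for conditions that the other filters cannot express, such as combining the presence and absence of attributes. The sketch below, on made-up markup parsed with html.parser, selects tags that have a class attribute but no id; the function name is chosen here for illustration.

```python
from bs4 import BeautifulSoup

html = '<p class="title">T</p><a class="sister" id="link1">E</a><b>x</b>'
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    # has_attr() checks for presence without raising KeyError.
    return tag.has_attr("class") and not tag.has_attr("id")


matches = soup.find_all(has_class_but_no_id)
names = [t.name for t in matches]
print(names)
```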

Parameters

find_all(name, attrs, recursive, string, limit, **kwargs)

name: accepts any of the filters described above, so they are not repeated here.

attrs and keyword arguments: filter on tag attributes such as id, class, and href.

print(soup.find_all(id='link1'))
print(soup.find_all(href="http://example.com/elsie"))

# Output:
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

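Note that class is a reserved word in Python, so use class_ as the keyword argument instead.

Another, arguably better, approach is to pass a dictionary through attrs.

Attribute filters can also be combined.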

print(soup.find_all(class_='sister', id='link1'))
print(soup.find_all(attrs={'class': 'sister', 'id': 'link2'}))

# Output:
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
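
Attribute values accept the same kinds of filters as tag names. The sketch below, on made-up markup parsed with html.parser, uses True to require that an attribute exists, a regular expression to match part of a value, and an attrs dictionary for an attribute name (data-role, invented for this example) that is not a valid Python keyword argument.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<a href="http://example.com/elsie" id="link1" data-role="first">Elsie</a>'
    '<a href="http://example.com/lacie" data-role="second">Lacie</a>'
    '<a>No attributes</a>'
)
soup = BeautifulSoup(html, "html.parser")

with_id = soup.find_all(id=True)                        # any tag that has an id
lacie = soup.find_all(href=re.compile(r"lacie$"))       # href ending in "lacie"
by_data = soup.find_all(attrs={"data-role": "second"})  # name with a hyphen

print([t.get_text() for t in with_id])
print([t.get_text() for t in lacie])
print([t.get_text() for t in by_data])
```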

recursive: whether to search recursively.

The default is True, which searches the children and all of their descendants.

False searches the direct children only.

body = soup.find('body')

print(body.find_all(class_='st'))
print(body.find_all(class_='st', recursive=False))

# Output:
# [<p class="st">Last</p>, <p class="st">......</p>]
# []

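In this example the search starts from the body element. The two direct children of body are div elements, and neither matches; the matching elements are children of the second div.

So the recursive search finds them, while the non-recursive search does not.

string: matches on the text a tag contains; it is usually combined with the name parameter.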

print(soup.find_all('a', string='Elsie'))

# Output:
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
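
What string returns depends on whether a tag name is also given. The sketch below, on made-up markup parsed with html.parser, shows that string on its own returns the matching strings, while combining it with a name returns the tags whose text matches; a regular expression works here as well.

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="1">Elsie</a><a id="2">Lacie</a>', "html.parser")

strings_only = soup.find_all(string="Elsie")           # NavigableString objects
tags = soup.find_all("a", string="Elsie")              # Tag objects
partial = soup.find_all("a", string=re.compile("ie$")) # text ending in "ie"

print(strings_only)
print([t["id"] for t in tags], [t["id"] for t in partial])
```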

limit: caps the number of results. For example, find_all('a', limit=1) returns a one-element list containing the same tag that find('a') returns.

print(soup.find_all('a',limit=1))

# Output:
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

The parameters can of course be combined:

print(soup.find_all('a',attrs={'class':'sister'},string='Elsie'))

# Output:
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

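The other search methods take the same parameters as find_all.

find returns None when nothing matches, while find_all returns an empty list [].

The other methods follow the same pattern.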
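
Because find can return None, calling a method on its result without a check raises AttributeError when nothing matches. The sketch below shows one way to guard against that; first_href is a hypothetical helper written for this example, and the markup is made up.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>No links here</p>", "html.parser")


def first_href(node):
    """Hypothetical helper: href of the first <a> under node, or None."""
    link = node.find("a")
    if link is None:
        return None
    return link.get("href")


result = first_href(soup)
all_links = soup.find_all("a")   # empty list, safe to iterate over

print(result, all_links)
```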

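CSS selectors

BeautifulSoup also supports CSS selectors through the select method.

select also returns a list, and returns an empty list [] when nothing matches.

select_one is similar to calling select with limit=1, except that it returns the first match itself (or None) rather than a list.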

print(soup.select('.st'))

# Output:
# [<p class="st">Last</p>, <p class="st">......</p>]
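
A few more selector forms are sketched below on made-up markup parsed with html.parser: a child combinator, an id selector, and an attribute selector. This assumes a reasonably recent beautifulsoup4 installation, where select is backed by the soupsieve package.

```python
from bs4 import BeautifulSoup

html = (
    '<div><p class="story">'
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" id="link2">Lacie</a>'
    '</p><p class="st">Last</p></div>'
)
soup = BeautifulSoup(html, "html.parser")

child_links = soup.select("p.story > a")        # <a> directly inside p.story
by_id = soup.select_one("#link2")               # first element with id="link2"
by_suffix = soup.select('a[href$="elsie"]')     # href ending in "elsie"
nothing = soup.select_one(".missing")           # no match

print([a["id"] for a in child_links], by_id.get_text(), nothing)
```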

For the CSS selector syntax itself, see the CSS documentation: http://www.w3school.com.cn/cssref/css_selectors.asp

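Other features

Modifying the document

As mentioned earlier, the most important properties of a Tag object are name and attrs. The document can be modified by changing name, attrs, or string.

BeautifulSoup also provides methods such as insert, replace_with, append, and decompose for modifying the document.

See the official documentation for details.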
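
As a brief illustration of those methods, the sketch below edits a small made-up fragment parsed with html.parser. It also uses new_tag to create the element that gets appended.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Hey</p><span>drop me</span></div>", "html.parser")

# Create a new tag and append it as the last child of <div>.
link = soup.new_tag("a", href="http://example.com")
link.string = "link"
soup.div.append(link)

soup.span.decompose()                  # remove <span> and its contents
soup.p.string.replace_with("Hello")    # swap the text inside <p>
soup.div.insert(0, "Intro: ")          # insert a string as the first child

result = str(soup)
print(result)
```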

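Pretty-printing

BeautifulSoup provides a prettify() method that returns the parsed document as a neatly indented string, with each tag and each string on its own line. Incomplete or malformed markup has already been repaired by the parser at that point.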
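
A minimal sketch of prettify() on made-up markup, parsed with html.parser, is shown below. The exact indentation is left to the library, so the example only prints the result rather than stating it.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Hey <b>you</b></p></div>", "html.parser")

compact = str(soup)        # single-line serialization
pretty = soup.prettify()   # indented, multi-line serialization

print(compact)
print(pretty)
```

Since prettify() adds whitespace, it is best kept for inspecting a document rather than for producing output where whitespace is significant.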

Other

For encoding handling, the older BS3 version, and other topics, see the official site.
