BeautifulSoup Tutorial: A Detailed Guide to Using the BeautifulSoup Documentation (Concise and Thorough)

Latest recommended article published on 2025-03-06 23:42:47

我药打十个

Category: Web Scraping Series. Tags: python, beautifulsoup, web scraping

Article link: https://blog.csdn.net/newxiaoou/article/details/134892949

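All of the methods below are distilled from the official BeautifulSoup documentation (Beautiful Soup Chinese documentation).

The examples that follow walk through the important methods and their characteristics step by step.

BeautifulSoup can parse both HTML and XML documents.

1. Create an HTML-like string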

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<b><!--Hey, buddy. Want to buy a used parser?--></b>
"""

2. Parse the HTML with BeautifulSoup to get a BeautifulSoup object

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

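3. Rough extraction from the parsed object

From here on, the parsed object is always called soup.

Through soup you can reach tags in the parsed page by name, such as the title, a, and p tags.

Advantage: it is simple to use, and you can keep drilling into the result (parent node, name, text, attributes). For example, to extract the href of an a tag such as <a href="http://example.com/elsie">, use soup.a['href'].

Drawback: it only returns the first tag with that name in the document. Real pages usually contain many tags with the same name, so this approach is rarely enough on its own, but it is worth knowing.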

soup.title
# <title>The Dormouse's story</title>

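soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']  (class is a multi-valued attribute, so a list is returned)

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.b.string
# "The Dormouse's story"  (soup.b is the first <b>, the one inside <p class="title">)

soup.find_all('b')[-1].string
# 'Hey, buddy. Want to buy a used parser?'  (a Comment object holding the HTML comment text)

soup.title.text
# "The Dormouse's story"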
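
To make the "first tag only" limitation concrete, here is a minimal sketch that compares attribute-style access with find_all(). The two-link HTML string is a made-up stand-in, not the html_doc above.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature document with two <a> tags (for illustration only).
html = '<p><a href="/one" id="x">One</a><a href="/two" id="y">Two</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Attribute-style access stops at the first <a> in the document.
first_href = soup.a["href"]

# find_all() collects every <a>, in document order.
all_hrefs = [a["href"] for a in soup.find_all("a")]

print(first_href)  # /one
print(all_hrefs)   # ['/one', '/two']
```

soup.a behaves like soup.find("a"), which is why the second link never shows up in the first result.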

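4. Getting the content of multiple tags with the same name

If you want all of the <a> tags, or anything more than the first tag with a given name, you need the methods described in "Searching the tree", such as find_all().

find(name, attrs, recursive, string, **kwargs): returns the first matching tag, or None if nothing matches;
find_all(name, attrs, recursive, string, limit, **kwargs): returns a list of all matching tags (an empty list if nothing matches);

name: matches on the tag name. Its value acts as a filter and can be a single tag name, a list of tag names, a regular expression, or even a function.
attrs: matches on tag attributes.
recursive: whether to search recursively. The default is True, which searches all descendants of the current tag; set it to False to search only direct children.
string: matches on the text content of tags.
limit: caps the number of results returned.
kwargs: also matches on tag attributes. It differs from attrs only in how it is written: the attribute name must not be a Python reserved word (use class_ for class) and must not collide with another parameter name.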
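
The parameters above can be sketched in one short example. The HTML snippet and the helper name has_class_but_no_id are assumptions made for illustration; only find(), find_all(), has_attr() and the class_ keyword come from Beautiful Soup itself.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet, not the html_doc used elsewhere in this article.
html = """
<div id="main">
  <p class="intro">Hello</p>
  <a class="sister" id="link1" href="/elsie">Elsie</a>
  <a class="sister" id="link2" href="/lacie">Lacie</a>
  <span>plain</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def has_class_but_no_id(tag):
    """A function filter: called once per tag, keeps the tag when it returns True."""
    return tag.has_attr("class") and not tag.has_attr("id")

by_function = [t.name for t in soup.find_all(has_class_but_no_id)]  # name as a function
by_id = soup.find_all(id="link2")                   # kwargs: attribute written as a keyword
by_class = soup.find_all("a", class_="sister")      # class is reserved, so use class_
by_href = soup.find_all(href=re.compile("elsie"))   # kwargs values can be regexes too
first_a = soup.find("a")                            # find() returns a single tag, not a list

print(by_function)    # ['p']
print(len(by_class))  # 2
print(first_a["id"])  # link1
```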

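This is the most commonly used method. Below we look at how to fetch same-named tags and then filter them, starting with several ways of fetching.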

4.1 Getting all tags with a given name

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

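4.2 Getting tags with a regular expression

If you pass a regular expression, Beautiful Soup matches tag names against it. The example below finds every tag whose name starts with b, so the <body> tag and both <b> tags in html_doc are found.

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
# b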

4.3 Getting several kinds of tags with a list

If you pass a list, Beautiful Soup returns anything that matches any item in the list. The code below finds all <a> and <b> tags in the document, in document order.

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
#  <b><!--Hey, buddy. Want to buy a used parser?--></b>]

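4.4 Filtering by attributes

In practice, tags usually carry many attributes (style, class, id and so on), often there for CSS. Besides matching the tag name, we usually need to filter further on those attributes.

1. attrs

Add constraints with a dictionary: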

soup.find_all("a", attrs={"class": "sister"})
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Find all tags whose id is link1 or link2
soup.find_all(attrs={'id': ['link2', 'link1']})

# Match on several attributes at once
soup.find_all(attrs={'id':'link1','class':'sister'})
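
One case where the attrs dictionary is the only option is an attribute whose name cannot be written as a Python keyword argument, such as data-foo, or one that collides with a parameter name, such as name. A minimal sketch, using a made-up HTML string:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = ('<div data-foo="value">foo!</div>'
        '<div data-foo="other">bar</div>'
        '<input name="email"/>')
soup = BeautifulSoup(html, "html.parser")

# soup.find_all(data-foo="value") would be a SyntaxError,
# so the attribute goes into the attrs dictionary instead.
data_matches = soup.find_all(attrs={"data-foo": "value"})
data_texts = [tag.get_text() for tag in data_matches]

# "name" is already the first parameter of find_all(), so an HTML
# attribute called name also has to go through attrs.
named_inputs = soup.find_all(attrs={"name": "email"})

print(data_texts)            # ['foo!']
print(named_inputs[0].name)  # input
```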

2. string

The string parameter searches the string content of the document. Like name, it accepts a string, a regular expression, a list, or True. For example:

soup.find_all(string="Elsie")
# ['Elsie']

soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

soup.find_all(string=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]
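
Used on its own, string returns the matching text nodes. Combined with a tag name, it returns the tags whose .string matches instead. A small sketch of the difference, on a made-up snippet:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration.
html = '<p>Names: <a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

text_nodes = soup.find_all(string="Elsie")                # text nodes, not tags
tags = soup.find_all("a", string="Elsie")                 # <a> tags whose .string is "Elsie"
regex_tags = soup.find_all("a", string=re.compile("^L"))  # <a> tags whose text starts with L
every_string = [str(s) for s in soup.find_all(string=True)]  # True matches every text node

print(text_nodes)    # ['Elsie']
print(tags[0]["id"]) # link1
print(every_string)  # ['Names: ', 'Elsie', ', ', 'Lacie']
```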

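3. limit

find_all() returns every match, which can be slow on a large document tree. If you do not need all of the results, use the limit parameter to cap how many are returned. It works like the LIMIT keyword in SQL: once the number of results reaches limit, the search stops and returns.

Three tags in the document match the search below, but only two are returned because of the limit: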

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

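4. recursive

When you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. If you only want to search the tag's direct children, pass recursive=False. This is needed less often.

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]
soup.html.find_all("title", recursive=False)
# []  (<title> is inside <head>, so it is not a direct child of <html>)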

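4.5 Other search methods based on the document tree

Like any tree, the document has sibling and parent nodes, and Beautiful Soup provides search methods for them. They are used less often, so treat this as an overview and look up the details when you need them. They are used in much the same way as find_all().

These two methods iterate over the siblings that come after the current tag, using .next_siblings. find_next_siblings() returns all following siblings that match; find_next_sibling() returns only the first one.

find_next_siblings( name , attrs , string , limit , **kwargs )

find_next_sibling( name , attrs , string , **kwargs )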

first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

first_link.find_next_siblings("a")

These two methods iterate over the siblings that come before the current tag, using .previous_siblings. find_previous_siblings() returns all preceding siblings that match; find_previous_sibling() returns only the first one.

find_previous_siblings( name , attrs , string , limit , **kwargs )

find_previous_sibling( name , attrs , string , **kwargs )

last_link = soup.find("a", id="link3")
last_link
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

last_link.find_previous_siblings("a")

These two methods iterate over the tags and strings that come after the current tag, using .next_elements. find_all_next() returns all matching nodes; find_next() returns only the first one.

find_all_next( name , attrs , string , limit , **kwargs )

find_next( name , attrs , string , **kwargs )

first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

first_link.find_all_next("a")

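These two methods iterate over the tags and strings that come before the current node, using .previous_elements. find_all_previous() returns all matching nodes; find_previous() returns only the first one.

find_all_previous( name , attrs , string , limit , **kwargs )

find_previous( name , attrs , string , **kwargs )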

first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

first_link.find_all_previous("p")
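
The four families of methods above can be compared side by side. The sketch below uses a shortened, made-up version of the story markup so that each result is easy to check by eye.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened variant of the story markup.
html = ('<p class="title"><b>Title</b></p>'
        '<p class="story"><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and '
        '<a id="link3">Tillie</a>.</p>')
soup = BeautifulSoup(html, "html.parser")
first = soup.find("a", id="link1")
last = soup.find("a", id="link3")

# Siblings: only nodes that share the same parent (<p class="story">).
next_sibs = [a["id"] for a in first.find_next_siblings("a")]
prev_sibs = [a["id"] for a in last.find_previous_siblings("a")]  # nearest sibling first
next_one = first.find_next_sibling("a")["id"]

# Elements: everything parsed after or before the node, regardless of parent.
after = [a["id"] for a in first.find_all_next("a")]
before = [p["class"][0] for p in first.find_all_previous("p")]   # enclosing <p> comes first
prev_b = first.find_previous("b").string

print(next_sibs)  # ['link2', 'link3']
print(prev_sibs)  # ['link2', 'link1']
print(before)     # ['story', 'title']
```

The sibling methods never leave the parent tag, while the element methods walk the whole document in parse order, which is why find_all_previous("p") also returns the <p> that contains the link itself.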

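5. Things to watch out for

1. Encoding

Every HTML or XML document is written in some encoding, such as ASCII or UTF-8, but after Beautiful Soup parses it the document is converted to Unicode.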

markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"  # UTF-8 encoded bytes
soup = BeautifulSoup(markup, 'html.parser')
soup.h1
# <h1>Sacré bleu!</h1>
soup.h1.string
# 'Sacré bleu!'

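Beautiful Soup uses an encoding auto-detection sub-library (Unicode, Dammit) to identify the document's encoding and convert it to Unicode. The .original_encoding attribute of the BeautifulSoup object records the encoding that was detected.

You can also specify the encoding yourself by passing the from_encoding parameter: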

soup = BeautifulSoup(markup, from_encoding="iso-8859-8")
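
As a rough end-to-end check, the sketch below parses UTF-8 bytes with an explicit from_encoding and then reads back both the decoded text and .original_encoding. It assumes the markup really is UTF-8. The exact spelling of the reported encoding name may vary between versions, so it is normalised before comparing.

```python
from bs4 import BeautifulSoup

# Raw bytes, as you would get from a file opened in binary mode or an HTTP response body.
markup = b"<h1>Sacr\xc3\xa9 bleu!</h1>"

# from_encoding only applies to bytes input; here we state that the bytes are UTF-8.
soup = BeautifulSoup(markup, "html.parser", from_encoding="utf-8")

text = soup.h1.string              # already a Unicode str after parsing
encoding_used = soup.original_encoding

# Normalise the name, since spellings such as "utf-8" and "utf8" are equivalent.
is_utf8 = encoding_used.lower().replace("-", "").replace("_", "") == "utf8"

print(text)     # Sacré bleu!
print(is_utf8)  # True
```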

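2. Common errors

SyntaxError: Invalid syntax (raised on the line ROOT_TAG_NAME = u'[document]'): Python 2 code that has not been migrated is being run under Python 3.
ImportError: No module named HTMLParser: the Python 2 version of Beautiful Soup is being run under Python 3.
ImportError: No module named html.parser: the Python 3 version of Beautiful Soup is being run under Python 2.
ImportError: No module named BeautifulSoup: the code is running in a Python environment where Beautiful Soup 3 is not installed, or you forgot that Beautiful Soup 4 code has to import from the bs4 package.
ImportError: No module named bs4: Beautiful Soup 4 is not installed in the current Python environment.
UnicodeEncodeError: 'charmap' codec can't encode character u'\xfoo' in position bar (or any other kind of UnicodeEncodeError): there are two usual causes, and neither is Beautiful Soup's fault. The first is that the console you are using cannot display some Unicode characters (see the Python wiki). The second is that you are writing to a file that does not support some Unicode characters; in that case, convert the text to UTF-8 with u.encode("utf8").
KeyError: [attr]: raised by accessing tag['attr'] when the tag does not define that attribute. The most common cases are KeyError: 'href' and KeyError: 'class'. If you are not sure an attribute exists, fetch it with tag.get('attr'), just as you would with a key in a Python dictionary.
AttributeError: 'ResultSet' object has no attribute 'foo': this usually happens because the return value of find_all() was treated as a single tag or text node. The return value is actually a list (a ResultSet) of tags and strings, so ResultSet.foo fails. Loop over the result to reach each node's attributes, or use find() to get a single node.
AttributeError: 'NoneType' object has no attribute 'foo': this usually happens when .foo is accessed directly on the result of find(), but find() found nothing and returned None. You need to work out why find() returned None.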
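
The last three errors are the ones that come up most in everyday scraping code. Here is a minimal sketch of defensive patterns for each of them, using a made-up snippet in which the second link deliberately has no href.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second <a> has no href on purpose.
html = '<p><a id="link1" href="/elsie">Elsie</a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

# KeyError: tag['href'] raises when the attribute is missing; tag.get() returns None instead.
hrefs = [a.get("href") for a in links]
try:
    links[1]["href"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# 'ResultSet' object has no attribute ...: find_all() returns a list, so loop over it.
ids = [a["id"] for a in links]

# 'NoneType' object has no attribute ...: find() returns None when nothing matches.
table = soup.find("table")
table_text = table.get_text() if table is not None else ""

print(hrefs)             # ['/elsie', None]
print(raised_key_error)  # True
print(repr(table_text))  # ''
```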