Parsing HTML Documents: A Quick Start with Beautiful Soup 4

Latest recommended article published on 2024-08-06 23:23:32

花_城

Category: Web scraping
Tags: python, programming languages, backend

Article link: https://blog.csdn.net/qq_39330486/article/details/122513371

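1. Introduction to Beautiful Soup 4

1.1 What Is Beautiful Soup 4

Beautiful Soup is a Python library for extracting data from HTML and XML files. It offers a simple interface for navigating, searching, and modifying a parsed document, which saves a great deal of hand-written parsing code.

1.2 Beautiful Soup 4 Quick Start

Here is a short usage example.

Suppose the document to parse, html_doc, has the following content: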

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>

Parsing this markup with BeautifulSoup yields a BeautifulSoup object, through which you can quickly get at the content you want:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Find the links of all <a> tags in the document:

for link in soup.find_all('a'):
    print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie

Get all the text content of the document:

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
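
The snippets above assume that html_doc already exists as a Python string. As a minimal self-contained sketch, the script below stores the sample document in a string and repeats a few of the lookups; the variable names title_text, links, and names are only illustrative.

```python
from bs4 import BeautifulSoup

# The sample document from above, stored in a string.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# html.parser ships with Python, so no extra parser has to be installed.
soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string                        # text of the <title> tag
links = [a.get("href") for a in soup.find_all("a")]   # href of every <a> tag
names = [a.get_text() for a in soup.find_all("a")]    # visible text of every <a> tag

print(title_text)
print(links)
print(names)
```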

2. Installing Beautiful Soup 4 and a Parser

2.1 Installing Beautiful Soup 4

Install it with pip:

pip install beautifulsoup4

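Note: there is also a package named BeautifulSoup (without the 4). That is the old release for Python 2, so do not confuse the two!

2.2 Installing a Parser (Optional)

Beautiful Soup supports the HTML parser in the Python standard library as well as two third-party parsers, lxml and html5lib. They compare as follows: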

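Parser	Usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	Built into Python; moderate speed; tolerant of malformed markup	Poor tolerance of malformed markup in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Fast; tolerant of malformed markup	Requires a C library
lxml XML parser	`BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")`	Fast; the only supported XML parser	Requires a C library
html5lib	`BeautifulSoup(markup, "html5lib")`	Most tolerant; parses documents the way a browser does; produces valid HTML5	Slow; a separate pure-Python package (no C extension)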

Install them as follows:

pip install lxml

pip install html5lib

lxml is the recommended parser, because it is faster and also supports XML.
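
Since lxml is optional, one way to act on this recommendation is to try lxml first and fall back to the built-in html.parser when lxml is missing. The sketch below is only an illustration: the helper name make_soup is hypothetical and not part of Beautiful Soup, while FeatureNotFound is the exception bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to the standard library parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed, so use the parser bundled with Python.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello <b>world</b></p>")
print(soup.b.string)
print(soup.p.get_text())
```

Different parsers can build slightly different trees for malformed markup, so a fallback like this is best kept to well-formed input.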

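3. Going Further with Beautiful Soup 4

3.1 Creating a BeautifulSoup Object

Pass a document to the BeautifulSoup constructor to get a document object. The input can be a string or an open file handle (file object):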

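from bs4 import BeautifulSoup

# Parse from an open file handle; the with block closes the file afterwards.
with open("index.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")

# Or parse directly from a string.
soup = BeautifulSoup("<html>data</html>", "html.parser")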

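The document is converted to Unicode. If no parser is named, Beautiful Soup picks the best parser that is installed, and recent versions also emit a warning in that case. To choose a parser explicitly, pass its name as the second argument, as shown in the table in section 2.2.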

3.2 The Tag Object and Its Attributes

A Tag object corresponds to a tag in the original XML or HTML document:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

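A Tag has many methods and attributes, which the sections on navigating the tree and searching the tree cover in detail. For now, the most important ones are name and attributes.

name:

Get the name of the tag:
```
tag.name
# 'b'
```
Changing a tag's name is reflected in any HTML markup that Beautiful Soup generates from then on: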
```
tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>
```

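attributes:

A tag's attributes can be read with dictionary-style access. The tag created above still carries class="boldest":

tag['class']
# ['boldest']

The full attribute dictionary is available as .attrs:

tag.attrs
# {'class': ['boldest']}

Adding, removing, and modifying attributes also works like a dictionary: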

tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>

tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None

For attributes that can hold multiple values, such as class, Beautiful Soup represents the value as a list:

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.p['class']
# ['body', 'strikeout']
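
To make the list behaviour concrete, the sketch below puts a multi-valued attribute (class) next to a single-valued one (id). The markup is made up for illustration; the point is that only attributes defined as multi-valued in HTML are split on whitespace.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")

classes = soup.p["class"]  # class is multi-valued, so it is split into a list
tag_id = soup.p["id"]      # id is single-valued, so it stays one string

print(classes)
print(tag_id)
```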

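string:

The text inside a tag is available through the string attribute:
```
# tag is still the <blockquote> element modified above
tag.string
# 'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>

unicode_string = str(tag.string)
unicode_string
# 'Extremely bold'
type(unicode_string)
# <class 'str'>
```
Beautiful Soup wraps the text inside a tag in the NavigableString class. A NavigableString behaves like a Python Unicode string (str), and it also supports some of the features described in the sections on navigating and searching the tree.

The string inside a tag cannot be edited in place, but it can be replaced with another string using the replace_with method: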
```
tag.string.replace_with("No longer bold")
tag
# <blockquote>No longer bold</blockquote>
```

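3.3 The BeautifulSoup Object

A BeautifulSoup object represents the document as a whole. Most of the time you can treat it like a Tag object: it supports most of the methods described in the sections on navigating and searching the tree.

soup.name
# '[document]'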

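4. Navigating the Tree

A Tag may contain strings and other Tags; these are the Tag's children. Beautiful Soup provides many attributes for navigating and iterating over them. The examples from here on use the soup built from html_doc in section 1.2.

4.1 Getting a Tag Object

The simplest way to get a tag is to use its name as an attribute. For example, to get the <head> tag:
```
soup.head
# <head><title>The Dormouse's story</title></head>
```
This access can be chained:
```
soup.body.b
# <b>The Dormouse's story</b>
```
Accessing a tag by name only returns the first tag with that name.

To get all <a> tags, use the find_all() method instead: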

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The result is a list.

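4.2 Getting Child Nodes

The .contents attribute returns all of a tag's direct children as a list.

The .children generator iterates over a tag's direct children:

title_tag = soup.title
for child in title_tag.children:
    print(child)
    # The Dormouse's story

.contents and .children only cover a tag's direct children, whereas .descendants iterates over all of its descendants at every depth.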
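
The difference between direct children and descendants is easiest to see on a small tree. The sketch below uses a cut-down <head> element (an assumption made to keep the output short): the <head> tag has one direct child, the <title> tag, but two descendants, the <title> tag and the string inside it.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)
head_tag = soup.head

direct = head_tag.contents              # direct children only
all_nodes = list(head_tag.descendants)  # children, grandchildren, and so on

print(direct)
print(all_nodes)
```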

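4.3 Getting the Strings Inside a Tag

Besides a tag's own text, .string also reaches through a child tag: if a tag's only child is another tag, .string returns that child's string. This only works when there is exactly one child; otherwise .string is None.

If a tag contains more than one string, loop over .strings to get them all:

for string in soup.strings:
    print(repr(string))
    # "The Dormouse's story"
    # '\n\n'
    # "The Dormouse's story"
    # '\n\n'
    # 'Once upon a time there were three little sisters; and their names were\n'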
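
The sketch below contrasts the two cases on made-up markup: a <p> with a single child, where .string works, and a <div> with several children, where .string is None and .strings is needed. It also shows .stripped_strings, a related generator that trims whitespace and skips whitespace-only strings.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p><b>bold</b></p><div>one <i>two</i> </div>", "html.parser"
)

single = soup.p.string                   # <p> has one child, so .string reaches into <b>
multiple = soup.div.string               # <div> has several children, so this is None
raw = list(soup.div.strings)             # every string, whitespace included
clean = list(soup.div.stripped_strings)  # trimmed, whitespace-only strings skipped

print(repr(single), multiple)
print(raw)
print(clean)
```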

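4.4 Getting Parent Nodes

Use the .parent attribute to get an element's parent.
The parent of a top-level node such as <html> is the BeautifulSoup object itself.
The .parent of the BeautifulSoup object is None.
Use an element's .parents attribute to iterate over all of its ancestors.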
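
As a small sketch of these rules, the code below walks up from an <a> tag in a made-up document and checks the two special cases at the top of the tree.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><p><a>link</a></p></body></html>", "html.parser"
)

# Names of every ancestor of <a>, from the nearest one outwards.
ancestors = [parent.name for parent in soup.a.parents]

top_parent = soup.html.parent  # the BeautifulSoup object itself
root_parent = soup.parent      # None: the document has no parent

print(ancestors)
print(top_parent is soup, root_parent)
```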

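4.5 Getting Sibling Nodes

Use .next_sibling to get the next sibling node.
Use .previous_sibling to get the previous sibling node.
If a node is the first child of its parent, its .previous_sibling is None. The same holds for the .next_sibling of a last child.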
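
The sketch below checks these sibling rules on a made-up <p> tag with two children and no whitespace between them.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")
first, second = soup.b, soup.i

print(first.next_sibling)       # the <i> tag
print(second.previous_sibling)  # the <b> tag
print(first.previous_sibling)   # None: <b> is the first child
print(second.next_sibling)      # None: <i> is the last child
```

In real documents the tags are usually separated by line breaks or spaces, and those whitespace strings are siblings too, so .next_sibling often returns a string such as '\n' rather than the next tag.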

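5. Searching the Tree

Beautiful Soup defines many search methods. This section concentrates on two of them, find() and find_all(); the others take similar arguments and are used in a similar way.

5.1 Filters

Before looking at find_all(), it helps to know the kinds of filters, since they run through the whole search API. Filters can be applied to a tag's name, to its attributes, to the text of a string, or to a combination of these.

5.1.1 Strings

The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for an exact match against it. The example below finds all <b> tags in the document: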

soup.find_all('b')
# [<b>The Dormouse's story</b>]

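If you pass a byte string, Beautiful Soup assumes it is encoded as UTF-8. Pass a Unicode string instead to avoid decoding errors.

5.1.2 Regular Expressions

If you pass a regular expression object, Beautiful Soup matches against it using its search() method. The example below finds all tags whose names start with b, which means both the <body> tag and the <b> tag should be found: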

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

5.1.3 Lists

If you pass a list, Beautiful Soup returns everything that matches any item in the list. The code below finds all <a> tags and <b> tags in the document:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

5.1.4 True

True matches any value. The code below finds every tag in the document, but none of the text strings:

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

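5.1.5 Functions

If none of the other filters fit, you can define a function. It must take a single element as its argument and return True if that element matches, or False otherwise.

For example, the following function finds tags that have a class attribute but no id attribute: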

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
    

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]

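When a function is used to filter on a specific attribute, the argument it receives is the attribute's value, not the whole tag. The example below finds the <a> tags whose href attribute does not match a given regular expression:

def not_lacie(href):
    return href and not re.compile("lacie").search(href)

soup.find_all(href=not_lacie)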
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

5.2 Search Methods

5.2.1 The find_all Method

find_all(name, attrs, recursive, string, limit, **kwargs) searches all descendants of the current tag and returns those that match the filter conditions.

soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

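It takes the following arguments:

name:

The name of the tags to search for. The value can be any kind of filter: a string, a regular expression, a list, a function, or True.

Keyword arguments:

Any keyword argument that is not one of the named parameters is treated as a filter on the tag attribute of that name. For example, to find tags whose id attribute is "link2":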
```
soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```
The value of such an argument can be a string, a regular expression, a list, a function (as in section 5.1.5), or True.
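
As a brief sketch of these value types, the code below filters on attributes with a regular expression, a list, and True. The markup is a shortened, made-up variant of the sample document, and the variable names are only illustrative.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" id="link2">Lacie</a>'
    "<a>No attributes</a>"
)
soup = BeautifulSoup(html, "html.parser")

by_regex = soup.find_all(href=re.compile("elsie"))  # href matches the pattern
by_list = soup.find_all(id=["link1", "link2"])      # id equals any item in the list
with_id = soup.find_all(id=True)                    # any tag that has an id at all

print(by_regex)
print(by_list)
print(with_id)
```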

class_:

This argument searches for tags by CSS class. Because class is a reserved word in Python, the keyword is spelled class_ instead.

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The class_ argument also accepts the various kinds of filters.
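
One detail worth a sketch is how class_ treats a tag with several classes. In the made-up example below, a string filter matches any single class of the tag, or the exact value of the whole attribute, but not the same classes in a different order.

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

one_class = css_soup.find_all("p", class_="strikeout")       # matches one of the classes
exact = css_soup.find_all("p", class_="body strikeout")      # matches the exact attribute value
reordered = css_soup.find_all("p", class_="strikeout body")  # different order: no match

print(one_class)
print(exact)
print(reordered)
```

When the order of classes should not matter, a CSS selector such as css_soup.select("p.strikeout.body") is the usual alternative.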

string:

The string argument searches the text content of the document. Like the name argument, it accepts a string, a regular expression, a list, a function, or True.
```
soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
```
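limit:

If you do not need every result, use the limit argument to cap the number of results returned.

recursive:

By default find_all() examines all descendants of the current tag. To search only the tag's direct children, pass recursive=False.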
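
The sketch below exercises both arguments on a small made-up document: limit stops the search after two matches, and recursive=False fails to find <title> from <html> because <title> is a grandchild rather than a direct child.

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>t</title></head>"
    "<body><a>1</a><a>2</a><a>3</a></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

first_two = soup.find_all("a", limit=2)                    # stop after two matches
nested = soup.html.find_all("title")                       # searches all descendants
direct = soup.html.find_all("title", recursive=False)      # direct children of <html> only

print(first_two)
print(nested)
print(direct)
```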

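5.2.2 Calling find_all in Shorthand Form

find_all() is the most commonly used search method in Beautiful Soup, so the library provides a shortcut for it: calling a BeautifulSoup object or a Tag object like a function is the same as calling its find_all() method.

The following two lines are equivalent: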

soup.find_all("a")
soup("a")

So are these two:

soup.title.find_all(string=True)
soup.title(string=True)

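5.2.3 The find Method

find(name, attrs, recursive, string, **kwargs) returns the first tag that matches. It is almost equivalent to calling find_all() with limit=1, with two differences:

find() returns a single result, while find_all() returns a list.
When nothing matches, find_all() returns an empty list, while find() returns None.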
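
The None result matters in practice, because chaining another call onto it raises AttributeError. The sketch below, on made-up markup, shows both return values and one possible guard.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>no tables here</p>", "html.parser")

missing_one = soup.find("table")      # None: nothing matched
missing_all = soup.find_all("table")  # empty list: nothing matched

# Guard against None before chaining another lookup.
cell = missing_one.find("td") if missing_one is not None else None

print(missing_one, missing_all, cell)
```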

Chained attribute access on a tag works by repeatedly calling find() on the current tag, so the following two lines are equivalent:

soup.head.title
# <title>The Dormouse's story</title>

soup.find("head").find("title")
# <title>The Dormouse's story</title>

5.3 CSS Selectors

Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax:

soup.select("title")
# [<title>The Dormouse's story</title>]

Find tags nested beneath other tags, level by level:

soup.select("html head title")
# [<title>The Dormouse's story</title>]

Find tags that are direct children of another tag:

soup.select("head > title")
# [<title>The Dormouse's story</title>]

Find tags by CSS class name:

soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags that match any selector in a list of selectors:

soup.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by attribute value:

soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Return only the first matching element:

soup.select_one(".sister")
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

Many more CSS selectors are available; they are not all listed here.
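
As a sketch of a few selectors not shown above, the code below uses a positional pseudo-class and two attribute operators on a shortened copy of the sample document. It assumes a Beautiful Soup version in which .select() is backed by the soupsieve package, which pip normally installs alongside beautifulsoup4.

```python
from bs4 import BeautifulSoup

html_doc = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")

second = soup.select("p > a:nth-of-type(2)")                # the second <a> child of <p>
ends_with = soup.select('a[href$="tillie"]')                # href ends with "tillie"
starts_with = soup.select('a[href^="http://example.com/"]') # href starts with the prefix

print(second)
print(ends_with)
print(len(starts_with))
```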

For more detail on using Beautiful Soup 4, see the official documentation.
