Python 3 from Scratch (20): The Beautiful Soup Parsing Library (Part 2)

Article link: https://blog.csdn.net/weixin_45754853/article/details/103673924

Life is short, I use Python

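Introduction

The selection methods covered in the previous article all work through attribute access. That approach is very simple to use, but it becomes unwieldy when the DOM structure is complex.

So Beautiful Soup also provides search methods such as find_all() and find(). When a node is hard to reach through attribute access, we can simply search for it.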

find_all()

First, the method signature:

find_all( name , attrs , recursive , string , **kwargs )

find_all() searches all tag descendants of the current tag (not just its direct children) and returns those that match the filter conditions.
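The signature lists a recursive parameter that is not demonstrated in this article. Here is a minimal sketch of what it does, using a small made-up HTML fragment and the built-in html.parser (both are assumptions chosen so the snippet runs without lxml):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one <p> directly under <div>, one nested deeper.
html = "<div><p>outer</p><section><p>inner</p></section></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# Default: every descendant of <div> is searched.
all_p = div.find_all("p")

# recursive=False: only the direct children of <div> are considered.
direct_p = div.find_all("p", recursive=False)

print([p.string for p in all_p])
print([p.string for p in direct_p])
```

With the default setting both paragraphs are found; with recursive=False only the paragraph that is a direct child of the div is returned.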

name

The name argument finds all tags whose name is name; text strings are skipped automatically.




from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

print(soup.find_all(name="a"))
print(type(soup.find_all(name="a")[0]))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<class 'bs4.element.Tag'>

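This time the sample HTML is an inline string, mainly to make it easier to follow: no more cross-checking against a screenshot.

In this example we call find_all() with the name argument set to a, meaning we want every <a> node. The result is a list-like object of length 3 (a bs4.element.ResultSet, which subclasses list), and each element is a bs4.element.Tag.

Because each element is a bs4.element.Tag, we can read its content directly with the attributes introduced in the previous article:

for a in soup.find_all(name="a"):
    print(a.string)

Output: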

Elsie
Lacie
Tillie
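Since every result is a Tag, the loop above extends naturally to collecting several fields at once. The sketch below is one way to do it; it uses a shortened copy of the sample document and the built-in html.parser (an assumption, so that it runs without lxml), and the tuple layout is only for illustration:

```python
from bs4 import BeautifulSoup

html_doc = """
<html><body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# tag["attr"] reads an attribute; tag.string reads the single text child.
links = [(a["id"], a["href"], a.string) for a in soup.find_all(name="a")]

for link_id, href, text in links:
    print(link_id, href, text)
```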

attrs

Besides searching by name, we can also search by attribute:

print(soup.find_all(attrs={'id': 'link1'}))
print(soup.find_all(attrs={'id': 'link2'}))
print(type(soup.find_all(attrs={'id': 'link1'})))
print(type(soup.find_all(attrs={'id': 'link2'})))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
<class 'bs4.element.ResultSet'>
<class 'bs4.element.ResultSet'>

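In this example we pass the attrs argument, whose value is a dictionary. Note that find_all() itself returns a bs4.element.ResultSet, as the last two lines show.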
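The attrs dictionary can hold more than one key, and it can be combined with name; a tag has to satisfy every condition to match. A small sketch, again on a shortened copy of the sample document and with html.parser as an assumed parser:

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# One key: every tag whose class list contains "sister".
sisters = soup.find_all(attrs={"class": "sister"})

# name plus two keys: an <a> tag that has both class "sister" and id "link2".
link2 = soup.find_all(name="a", attrs={"class": "sister", "id": "link2"})

print(len(sisters))
print([a.string for a in link2])
```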

string

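This argument matches against the text of nodes. It accepts a plain string or a compiled regular expression (older Beautiful Soup versions call this argument text):

import re

print(soup.find_all(string=re.compile('sisters')))

Output: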

['Once upon a time there were three little sisters; and their names were\n']
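Note that a string search returns text nodes rather than tags. The sketch below contrasts an exact string with a regular expression and shows how to get back to the enclosing tag through .parent. The one-line HTML and the html.parser choice are assumptions for illustration, and the string keyword assumes Beautiful Soup 4.4.0 or later:

```python
import re

from bs4 import BeautifulSoup

html_doc = (
    '<p class="story">Once upon a time there were three little sisters; '
    '<a id="link1">Elsie</a> and <a id="link2">Lacie</a>.</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")

# A plain string must equal the whole text node.
exact = soup.find_all(string="Elsie")

# A compiled pattern matches if it is found anywhere in the text node.
by_regex = soup.find_all(string=re.compile("sisters"))

# Results are text nodes; .parent is the tag that contains each one.
parents = [s.parent.name for s in by_regex]

# Combined with a tag name, string filters tags by their .string instead.
tags = soup.find_all("a", string="Elsie")

print(exact, by_regex, parents, tags)
```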

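Keyword arguments

If a keyword argument's name is not one of the built-in search parameters, it is treated as a filter on the tag attribute of that name. For example, the calls below search directly for the node whose id is link1 and the node whose class is title (written class_ because class is a reserved word in Python):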

print(soup.find_all(id='link1'))
print(soup.find_all(class_='title'))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
[<p class="title"><b>The Dormouse's story</b></p>]

Of course, we can also pass several keyword arguments to filter on multiple attributes at once:

print(soup.find_all(href=re.compile("elsie"), id='link1'))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

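Some attribute names cannot be used as keyword arguments, for example HTML5 data-* attributes, because a name containing a hyphen is not a valid Python identifier. In those cases, use the attrs argument described above.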
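A minimal sketch of the data-* case, using a one-line made-up document (the data-foo attribute and its value are placeholders) and html.parser as an assumed parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div data-foo="value">foo!</div>', "html.parser")

# soup.find_all(data-foo="value") would be a SyntaxError,
# so the attribute name goes into the attrs dictionary instead.
matches = soup.find_all(attrs={"data-foo": "value"})

print(matches)
```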

find()

find() is very similar to find_all(), except that instead of returning every matching node it returns only the first match. A simple example:

print(soup.find(name = "a"))
print(type(soup.find(name = "a")))

Output:

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<class 'bs4.element.Tag'>
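One practical difference is worth keeping in mind: when nothing matches, find() returns None, whereas find_all() returns an empty result. The sketch below shows a simple guard; the fragment and the searched-for table tag are only illustrative, and html.parser is an assumed parser:

```python
from bs4 import BeautifulSoup

html_doc = '<p class="title"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html_doc, "html.parser")

first_b = soup.find("b")        # a Tag
missing = soup.find("table")    # None: there is no <table> in the fragment
empty = soup.find_all("table")  # empty ResultSet

# Guard before touching attributes, otherwise None.string raises AttributeError.
title_text = first_b.string if first_b is not None else None
table_text = missing.string if missing is not None else None

print(title_text, table_text, len(empty))
```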

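For the remaining search methods, see the official documentation. Here is a brief list:

find_parents() and find_parent(): search the ancestors of the current node; the former returns all matching ancestors, the latter the nearest one.
find_next_siblings() and find_next_sibling(): the former returns all matching siblings after the current node, the latter the first one.
find_previous_siblings() and find_previous_sibling(): the former returns all matching siblings before the current node, the latter the first one.
find_all_next() and find_next(): the former returns all matching nodes after the current node, the latter the first one.
find_all_previous() and find_previous(): the former returns all matching nodes before the current node, the latter the first one.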
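The methods listed above can be tried out from a single starting node. The sketch below starts at the middle link of a compact, made-up fragment (written without whitespace between tags so that siblings are tags, not text) and uses html.parser as an assumed parser:

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p class="story">'
    '<a id="link1">Elsie</a>'
    '<a id="link2">Lacie</a>'
    '<a id="link3">Tillie</a>'
    '</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")
middle = soup.find(id="link2")

# Upwards: the nearest enclosing <p>.
parent_name = middle.find_parent("p").name

# Sideways: siblings after and before the starting node.
next_ids = [a["id"] for a in middle.find_next_siblings("a")]
prev_ids = [a["id"] for a in middle.find_previous_siblings("a")]

# Document order: the first <a> after, and the first <a> before.
after_id = middle.find_next("a")["id"]
before_id = middle.find_previous("a")["id"]

print(parent_name, next_ids, prev_ids, after_id, before_id)
```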

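CSS selectors

Besides attribute access and the search methods above, Beautiful Soup offers one more way to get at nodes: CSS selectors.

If you are not familiar with CSS selectors, see https://www.w3school.com.cn/css/index.asp.

Using CSS selectors is very simple: call select() and pass in the selector. A few simple examples: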

print(soup.select('#link1'))
print(type(soup.select('#link1')[0]))
print(soup.select('.story .sister'))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
<class 'bs4.element.Tag'>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

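As you can see, a CSS selector query also returns a list, and its elements are likewise bs4.element.Tag objects, which means we can use their attributes to get the information we need.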
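As a follow-up, here is a short sketch that reads attributes from select() results and uses select_one() when only the first match is needed. It assumes a reasonably recent Beautiful Soup (select_one() was added in 4.4.0), a shortened copy of the sample document, and html.parser in place of lxml:

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# select() returns every match; each one is a Tag.
hrefs = [a["href"] for a in soup.select("p.story > a.sister")]

# select_one() returns the first match, or None if there is none.
first = soup.select_one("#link1")
first_text = first.get_text() if first is not None else None

print(hrefs)
print(first_text)
```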