Beautiful Soup Library -- 03 -- Searching the Document Tree

S_numb

Article link: https://blog.csdn.net/S_numb/article/details/120218087

1. Searching the Document Tree

The example document is still the "Alice" document:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

import re  # needed by the regular expression examples below

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

Beautiful Soup defines many search methods. This article focuses on two of them:
- find()
- find_all()

1.1 Filters

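There are five kinds of filters: string, list, regular expression, True, and function.
- A string filter is mainly used for exact matches, for example on a tag name or an attribute value.
- A list filter is a convenient way to match any one of several values.
- A regular expression filter handles partial matches and other special patterns.
- A True filter checks that a tag or attribute exists at all.
- A function filter is the most powerful. It takes more effort to write than the others, but it can express any condition.
See the Beautiful Soup documentation for details.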
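
To make the comparison concrete, here is a minimal sketch that applies each of the five filter kinds to the same tree. The one-line document and the names() helper are made up for this illustration and are not part of the original article.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical mini-document, used only for this illustration.
html = '<div id="box"><a href="/x" id="a1">one</a><b>two</b><i class="c">three</i></div>'
soup = BeautifulSoup(html, "html.parser")


def names(tags):
    """Return the tag names of a result list, for compact printing."""
    return [tag.name for tag in tags]


by_string = names(soup.find_all("b"))                   # exact tag name
by_list = names(soup.find_all(["a", "i"]))              # any name in the list
by_regex = names(soup.find_all(re.compile("^[ab]$")))   # name matched with search()
by_true = names(soup.find_all(id=True))                 # tags that have an id at all
by_function = names(soup.find_all(lambda tag: tag.has_attr("class")))  # custom test

print(by_string)     # ['b']
print(by_list)       # ['a', 'i']
print(by_regex)      # ['a', 'b']
print(by_true)       # ['div', 'a']
print(by_function)   # ['i']
```

Results always come back in document order, regardless of the order of the values in the filter.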

1.1.1 string

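The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for an exact match against that string.
The following example finds all <b> tags in the document: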

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('b'))

Output:

[<b>The Dormouse's story</b>]

If you pass a byte string, Beautiful Soup assumes it is encoded as UTF-8. Pass a Unicode string instead to avoid decoding errors.

1.1.2 list

If you pass a list, Beautiful Soup returns everything that matches any item in the list.
The following code finds all <a> tags and all <b> tags in the document:

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all(["a", "b"]))

Output:

[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

1.1.3 regular expression

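If you pass a regular expression object, Beautiful Soup matches against it using the regular expression's search() method.
The following example finds all tags whose names start with b, so both the <body> and <b> tags should be found: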

soup = BeautifulSoup(html_doc, 'html.parser')
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

Output:

body
b

1.1.4 True

True matches any value. The following code finds every tag in the document, but none of the text strings:

soup = BeautifulSoup(html_doc, 'html.parser')
for tag in soup.find_all(True):
    print(tag.name)

Output:

html
head
title
body
p
b
p
a
a
a
p

1.1.5 function

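If none of the other filters fits, you can define a function that takes an element as its only argument.
The function should return True if the element matches, and False otherwise.
The function below checks the current element and returns True if it has a class attribute but no id attribute: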

soup = BeautifulSoup(html_doc, 'html.parser')


def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

# Pass this function to find_all() to get all the <p> tags:
print(soup.find_all(has_class_but_no_id))

Output:

[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]

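At first I misread this output too, because it clearly contains <a> tags even though each of them has an id. After some debugging I realized that the <a> tags are not matches themselves: they show up only because they are nested inside a matching <p> tag.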
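
One way to confirm this is to print only the names of the matched tags instead of their full markup. A minimal sketch, assuming a shortened one-paragraph version of the document:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the "story" paragraph.
html = '<p class="story">Once <a class="sister" href="#" id="link1">Elsie</a></p>'
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


matches = soup.find_all(has_class_but_no_id)
matched_names = [tag.name for tag in matches]

print(matched_names)                 # ['p']
print(matches[0].a.get("id"))        # link1
```

Only the <p> tag is in the result list. The <a> tag is printed with it because it is a child of that <p>.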

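When you use a function to filter on a specific attribute, the argument passed to the function is the value of that attribute, not the whole tag.
The following example finds the <a> tags whose href attribute does not match a given regular expression: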

soup = BeautifulSoup(html_doc, 'html.parser')


def not_lacie(href):
    return href and not re.compile("lacie").search(href)


print(soup.find_all(href=not_lacie))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

1.2 find_all()

find_all( name , attrs , recursive , string , limit , **kwargs )
find_all() looks through all descendants of the current tag and returns those that match the filters.

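1.2.1 The name argument

The name argument finds all tags whose name is name. Text strings are ignored.
soup.find_all("title")
The value of name can be any kind of filter: a string, a regular expression, a list, a function, or True.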

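1.2.2 Keyword arguments

Any keyword argument that is not one of the built-in parameter names is treated as a filter on the tag attribute of that name. For example, if you pass an argument named id, Beautiful Soup filters on each tag's id attribute.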

print(soup.find_all(id='link2'))
--- Output ---
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

The value used to filter an attribute can be a string, a regular expression, a list, a function, or True.
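
The sketch below tries a few of these value types as keyword filters. The shortened document is an assumption made for this example. It also uses class_ and the attrs dictionary, which the article has not introduced: class is a reserved word in Python, so Beautiful Soup accepts class_ instead, and attrs covers attribute names such as data-kind that cannot be written as keyword arguments.

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the sample document.
html = (
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '<span data-kind="note">aside</span>'
)
soup = BeautifulSoup(html, "html.parser")

by_regex = [t["id"] for t in soup.find_all(href=re.compile("elsie"))]
by_true = [t["id"] for t in soup.find_all(id=True)]
# Several keyword filters must all match at once.
by_both = [t["id"] for t in soup.find_all(href=re.compile("example"), id="link2")]
by_class = [t["id"] for t in soup.find_all("a", class_="sister")]
by_attrs = [t.name for t in soup.find_all(attrs={"data-kind": "note"})]

print(by_regex)   # ['link1']
print(by_true)    # ['link1', 'link2']
print(by_both)    # ['link2']
print(by_class)   # ['link1', 'link2']
print(by_attrs)   # ['span']
```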

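1.2.3 The string argument

The string argument searches the text strings in the document instead of the tags.
Like name, string accepts a string, a regular expression, a list, a function, or True.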

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.find_all(string="Elsie"))
print(soup.find_all(string=["Tillie", "Elsie", "Lacie"]))
print(soup.find_all(string=re.compile("Dormouse")))

Output:

['Elsie']
['Elsie', 'Lacie', 'Tillie']
["The Dormouse's story", "The Dormouse's story"]

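Although string is meant for finding strings, you can combine it with other arguments to filter tags. Beautiful Soup then finds the tags whose .string attribute matches the value of string.
The following code finds the <a> tags whose .string is "Elsie":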

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.find_all("a", string="Elsie"))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
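
Note that a plain string must equal the tag's .string exactly. For a "contains" match, a regular expression can be used instead. A minimal sketch, assuming two made-up links:

```python
import re
from bs4 import BeautifulSoup

# Two hypothetical links: one whose text is exactly "Elsie", one that only contains it.
html = '<a id="link1">Elsie</a><a id="link2">Elsie Smith</a>'
soup = BeautifulSoup(html, "html.parser")

exact = [a["id"] for a in soup.find_all("a", string="Elsie")]
partial = [a["id"] for a in soup.find_all("a", string=re.compile("Elsie"))]

print(exact)     # ['link1']
print(partial)   # ['link1', 'link2']
```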

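1.2.4 The limit argument

find_all() returns every result, so searching a large document tree can be slow.
If you do not need all the results, use the limit argument to cap the number returned.
It works like the LIMIT keyword in SQL: once the number of results reaches the limit, the search stops and returns.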

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.find_all("a", limit=2))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Three tags in the document tree match the search, but only two are returned because we limited the number of results.

1.2.5 The recursive argument

When you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False.
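
A minimal sketch of the difference, assuming a stripped-down document in which <title> is a grandchild of <html>:

```python
from bs4 import BeautifulSoup

# Stripped-down, hypothetical document: <title> sits inside <head>, not directly in <html>.
html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

all_descendants = soup.html.find_all("title")                    # searches every level
direct_children = soup.html.find_all("title", recursive=False)   # direct children only
head_children = soup.head.find_all("title", recursive=False)     # <title> is a child of <head>

print(len(all_descendants))   # 1
print(len(direct_children))   # 0
print(len(head_children))     # 1
```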

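1.3 Calling a tag is like calling find_all()

find_all() is probably the most commonly used search method in Beautiful Soup, so the library provides a shortcut for it.
A BeautifulSoup object or a Tag object can be called like a function, and the result is the same as calling find_all() on that object: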

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.find_all("a"))
print("----------------")
print(soup("a"))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
----------------
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

1.4 find()

find( name , attrs , recursive , string , **kwargs )

find_all() returns every tag in the document that matches.
When the document has only one <body> tag, using find_all() to look for it is not a good fit.
Instead of calling find_all() with limit=1, you can call find() directly:

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.find_all('title', limit=1))
print("--------")
print(soup.find('title'))

Output:

[<title>The Dormouse's story</title>]
--------
<title>The Dormouse's story</title>

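As the output shows, find_all() returns a list containing a single element, while find() returns the element itself.

When find_all() finds nothing, it returns an empty list.
When find() finds nothing, it returns None.
find_all() and find() search only the descendants of the current node: its children, grandchildren, and so on.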
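
The different "not found" values matter when calls are chained. A minimal sketch, assuming a one-paragraph document made up for this example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello</p>", "html.parser")

missing_all = soup.find_all("table")   # an empty list, safe to loop over
missing_one = soup.find("table")       # None, so attribute access on it would fail

# Check for None before using the result of find().
text = missing_one.get_text() if missing_one is not None else ""

print(missing_all)   # []
print(missing_one)   # None
print(repr(text))    # ''
```

Calling missing_one.get_text() without the check would raise AttributeError, because None has no such method.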

1.5 find_parents() and find_parent()

find_parents() and find_parent() search the ancestors of the current node. They accept the same filters as the ordinary search methods.
find_parents( name , attrs , string , limit , **kwargs )
find_parent( name , attrs , string , **kwargs )

soup = BeautifulSoup(html_doc, 'html.parser')

string_a = soup.find(string="Lacie")

print(string_a.find_parents("a"))
print("------------------")
print(string_a.find_parent("p"))

Output:

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
------------------
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

1.6 find_next_siblings() and find_next_sibling()

find_next_siblings( name , attrs , string , limit , **kwargs )
- returns all matching siblings that come after the current node
find_next_sibling( name , attrs , string , **kwargs )
- returns only the first matching sibling that comes after the current node

soup = BeautifulSoup(html_doc, 'html.parser')

link_first = soup.a
print(link_first.find_next_siblings("a"))
print("_________________")
print(link_first.find_next_sibling("a"))

Output:

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
_________________
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

1.7 find_previous_siblings() and find_previous_sibling()

find_previous_siblings( name , attrs , string , limit , **kwargs )
- returns all matching siblings that come before the current node
find_previous_sibling( name , attrs , string , **kwargs )
- returns the first matching sibling that comes before the current node
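
This section has no example in the original, so here is a minimal sketch that mirrors the one in 1.6. It assumes a shortened version of the "story" paragraph.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the "story" paragraph.
html = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

last_link = soup.find("a", id="link3")

previous_all = [a["id"] for a in last_link.find_previous_siblings("a")]
previous_one = last_link.find_previous_sibling("a")["id"]

print(previous_all)   # ['link2', 'link1']
print(previous_one)   # link2
```

The siblings are returned nearest first, so the list runs backwards through the document.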

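1.8 find_all_next() and find_next()

find_all_next( name , attrs , string , limit , **kwargs )
- returns all matching nodes (tags and strings) that come after the current node in the document
find_next( name , attrs , string , **kwargs )
- returns the first matching node that comes after the current node in the document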
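
Unlike the sibling methods, these two are not limited to one level of the tree. A minimal sketch, assuming a shortened two-paragraph version of the document:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the two "story" paragraphs.
html = (
    '<p class="story"><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
    '<p class="story">...</p>'
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a

# Every text string that comes after the opening <a id="link1"> tag,
# including the link's own text.
strings_after = first_link.find_all_next(string=True)
links_after = [a["id"] for a in first_link.find_all_next("a")]
# The enclosing <p> starts before first_link, so the next <p> is the second one.
next_paragraph = first_link.find_next("p")

print(strings_after)               # ['Elsie', ', ', 'Lacie', '...']
print(links_after)                 # ['link2']
print(next_paragraph.get_text())   # ...
```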

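1.9 find_all_previous() and find_previous()

find_all_previous( name , attrs , string , limit , **kwargs )
- returns all matching nodes (tags and strings) that come before the current node in the document
find_previous( name , attrs , string , **kwargs )
- returns the first matching node that comes before the current node in the document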
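
A minimal sketch of the backward search, assuming a shortened version of the document. Note that "before the current node" includes the tag that contains it, because that tag's opening comes earlier in the document.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the sample document.
html = (
    "<html><head><title>The Dormouse's story</title></head><body>"
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story"><a id="link1">Elsie</a></p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a

# Walks backwards: the enclosing <p class="story"> first, then <p class="title">.
paragraphs_before = [p["class"] for p in first_link.find_all_previous("p")]
title_before = first_link.find_previous("title")

print(paragraphs_before)    # [['story'], ['title']]
print(title_before.string)  # The Dormouse's story
```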

1.10 CSS selectors

Beautiful Soup supports most CSS selectors.
Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax.

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.select("title"))
print("-----")
print(soup.select("p:nth-of-type(3)"))

Output:

[<title>The Dormouse's story</title>]
-----
[<p class="story">...</p>]

Find tags by descending through tag names level by level:

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.select("body a"))
print("-----")
print(soup.select("html head title"))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
-----
[<title>The Dormouse's story</title>]

Find tags that are direct children of another tag:

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.select("head > title"))
print("-----")
print(soup.select("p > #link1"))

Output:

[<title>The Dormouse's story</title>]
-----
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Find sibling tags:

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.select("#link1 ~ .sister"))
print("-----")
print(soup.select("#link1 + .sister"))

Output:

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
-----
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by CSS class name:

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.select(".sister"))
print("-----")
print(soup.select("[class~=sister]"))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
-----
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by id:

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.select("#link1"))
print("-----")
print(soup.select("a#link2"))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
-----
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Query elements with several CSS selectors at once:

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.select("#link1,#link2"))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by the presence of an attribute:

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.select('a[href]'))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by attribute value:

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.select('a[href="http://example.com/elsie"]'))
print(soup.select('a[href^="http://example.com/"]'))

Output:

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Return only the first matching element:

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.select_one(".sister"))

Output:

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
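
The selector methods mirror find_all() and find() in what they return. A minimal sketch, assuming a shortened version of the "story" paragraph, that pulls attribute values out of the results and shows the "not found" cases:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the "story" paragraph.
html = (
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '</p>'
)
soup = BeautifulSoup(html, "html.parser")

# select() returns a list of tags, so attributes can be read in a comprehension.
hrefs = [a["href"] for a in soup.select("p.story > a.sister")]

no_match_list = soup.select("table")       # empty list, like find_all()
no_match_one = soup.select_one("table")    # None, like find()

print(hrefs)           # ['http://example.com/elsie', 'http://example.com/lacie']
print(no_match_list)   # []
print(no_match_one)    # None
```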