BeautifulSoup: Searching the Document Tree

alexander068

Article link: https://blog.csdn.net/alexander068/article/details/113762975

The two core methods for searching the document tree are find_all() and select().

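(1) find_all(name, attrs, recursive, text, **kwargs)

find_all() looks through all descendant tags of the current tag and returns those that match the filter.

1) The name parameter

The name parameter finds every tag whose name matches name. Text strings are ignored automatically.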

A. Passing a string

The simplest filter is a string. Pass a string to a search method and Beautiful Soup looks for tags whose name matches that string exactly. The following example finds all <b> tags in the document:

soup.find_all('b')
# [<b>The Dormouse's story</b>]

print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

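B. Passing a regular expression

If you pass a compiled regular expression, Beautiful Soup filters tag names with the pattern's search() method. The following example finds all tags whose name starts with b, so both <body> and <b> are found:

import re

for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b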
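The anchor in the pattern matters. The sketch below uses a small stand-in document (an assumption, not the article's full sample) and the html.parser backend. Recent Beautiful Soup 4 releases appear to apply the pattern with search() semantics, so an unanchored pattern matches anywhere in the tag name.

```python
import re
from bs4 import BeautifulSoup

# Stand-in document, assumed for illustration only.
html = "<html><head><title>The Dormouse's story</title></head><body><p><b>bold</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Anchored: only tag names that start with "b".
anchored = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Unanchored: any tag name that contains a "t".
unanchored = [tag.name for tag in soup.find_all(re.compile("t"))]

print(anchored)    # ['body', 'b']
print(unanchored)  # ['html', 'title']
```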

C. Passing a list

If you pass a list, Beautiful Soup returns everything that matches any item in the list. The following code finds all <a> tags and all <b> tags in the document:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

D. Passing True

True matches any value. The following code finds every tag in the document but returns no text strings:

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

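E. Passing a function

If none of the other filters fits, you can define a function that takes a single element as its argument. It should return True if the element matches and False otherwise.

The following function returns True when the element has a class attribute but no id attribute:

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

Pass this function to find_all() and you get all the <p> tags:

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
# <p class="story">Once upon a time there were...</p>,
# <p class="story">...</p>]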
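As a runnable sketch of the function filter, the block below uses a shortened stand-in document (an assumption) and adds a second, hypothetical helper, ends_with_1, to show that a function can also be passed as the filter for a single attribute. In that case it receives the attribute value rather than the tag, and the value may be None when the attribute is missing, which is why the helper guards against it.

```python
from bs4 import BeautifulSoup

# Shortened stand-in document, assumed for illustration only.
html = """<p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a></p>"""
soup = BeautifulSoup(html, "html.parser")

def has_class_but_no_id(tag):
    # Called with each Tag; keep tags that have class but no id.
    return tag.has_attr("class") and not tag.has_attr("id")

def ends_with_1(value):
    # Hypothetical helper: called with the attribute value (possibly None).
    return value is not None and value.endswith("1")

tag_names = [tag.name for tag in soup.find_all(has_class_but_no_id)]
ids = [tag["id"] for tag in soup.find_all(id=ends_with_1)]

print(tag_names)  # ['p', 'p']
print(ids)        # ['link1']
```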

2) Keyword arguments

Note: any keyword argument that is not one of the built-in search parameter names is treated as a filter on the tag attribute of that name. If you pass an argument named id, Beautiful Soup filters on each tag's id attribute:

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass an href argument, Beautiful Soup filters on each tag's href attribute:

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Passing several keyword arguments filters on several attributes at once:

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

To filter by class, note that class is a reserved word in Python. Append an underscore and use class_ instead:

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
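Because class is a multi-valued attribute, class_ matches against any single class of a tag. The sketch below uses a one-tag stand-in document (an assumption) to show how this behaves, based on the behaviour the Beautiful Soup documentation describes: a single class matches, the exact attribute string matches, but the same classes in a different order do not. The CSS form at the end is order-independent.

```python
from bs4 import BeautifulSoup

# One-tag stand-in document, assumed for illustration only.
css_soup = BeautifulSoup('<p class="body strikeout">text</p>', "html.parser")

single = css_soup.find_all("p", class_="strikeout")          # matches one of the classes
exact = css_soup.find_all("p", class_="body strikeout")      # matches the exact attribute string
reordered = css_soup.find_all("p", class_="strikeout body")  # different order: no match
both = css_soup.select("p.strikeout.body")                   # CSS selector, order does not matter

print(len(single), len(exact), len(reordered), len(both))  # 1 1 0 1
```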

Some attributes cannot be used as keyword arguments, for example the HTML5 data-* attributes, because a hyphenated name is not a valid Python keyword argument name:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
data_soup.find_all(data-foo="value")
# SyntaxError

You can still search for such attributes by passing a dictionary through the attrs parameter of find_all():

data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
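A small runnable sketch of the attrs dictionary, assuming a two-element stand-in document. It also shows two variations that are not in the original text: passing True as the value to match any tag that carries the attribute, and the equivalent CSS attribute selector.

```python
from bs4 import BeautifulSoup

# Two-element stand-in document, assumed for illustration only.
data_soup = BeautifulSoup(
    '<div data-foo="value">foo!</div><div data-foo="other">bar</div>',
    "html.parser",
)

exact = data_soup.find_all(attrs={"data-foo": "value"})   # attribute equals "value"
present = data_soup.find_all(attrs={"data-foo": True})    # attribute exists, any value
via_css = data_soup.select('div[data-foo="value"]')       # same search as a CSS selector

print([d.get_text() for d in exact])    # ['foo!']
print(len(present))                     # 2
print([d.get_text() for d in via_css])  # ['foo!']
```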

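3) The text parameter (equivalent to the string parameter)

With text you can search the document's string content instead of its tags. Like name, text accepts a string, a regular expression, a list, or True. Since Beautiful Soup 4.4.0 the parameter is named string; text is the older name.

soup.find_all(text="Elsie")
# ['Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

soup.find_all(text=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]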
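The string filter returns text nodes on its own, but combined with a tag name it returns the tags whose string matches. The sketch below assumes Beautiful Soup 4.4.0 or later (where the parameter is called string) and a shortened stand-in document.

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in document, assumed for illustration only.
html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# string alone: the results are text nodes; .parent leads back to the enclosing tag.
strings = soup.find_all(string=re.compile("Dormouse"))
parents = [s.parent.name for s in strings]

# name + string: the results are tags whose .string equals the given value.
tag_ids = [t["id"] for t in soup.find_all("a", string="Elsie")]

print(parents)  # ['title', 'b']
print(tag_ids)  # ['link1']
```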

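4) The limit parameter

find_all() returns every match, which can be slow on a large document tree. If you do not need all the results, use limit to cap how many are returned. It works like the LIMIT keyword in SQL: once the number of results reaches the limit, the search stops and returns.

Three tags in the document tree match the search below, but only two are returned because of the limit:

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]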
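A short sketch of limit on a three-link stand-in document (an assumption). The last line illustrates the relationship the Beautiful Soup documentation describes between find_all(limit=1) and find(): the former returns a one-element list, the latter the element itself.

```python
from bs4 import BeautifulSoup

# Three-link stand-in document, assumed for illustration only.
html = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first_two = [a["id"] for a in soup.find_all("a", limit=2)]
first_only = soup.find_all("a", limit=1)

print(first_two)                           # ['link1', 'link2']
print(first_only[0] == soup.find("a"))     # True
```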

5) The recursive parameter

When you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. To search only its direct children, pass recursive=False.

A simple document fragment:

<html>
<head>
<title>
The Dormouse's story
</title>
</head>
...

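Search results with and without recursive=False:

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)
# []

The <title> tag sits inside <head>, so it is a descendant of <html> but not a direct child, and the second search finds nothing.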
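The effect of recursive=False can be checked with a compact stand-in document (an assumption; it contains no whitespace between tags so the direct children are easy to list):

```python
from bs4 import BeautifulSoup

# Compact stand-in document, assumed for illustration only.
html = "<html><head><title>The Dormouse's story</title></head><body><p>text</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

deep = soup.html.find_all("title")                        # searches all descendants
shallow = soup.html.find_all("title", recursive=False)    # searches direct children only
direct_children = [t.name for t in soup.html.find_all(True, recursive=False)]
from_head = soup.head.find_all("title", recursive=False)  # <title> is a direct child of <head>

print(len(deep), len(shallow))  # 1 0
print(direct_children)          # ['head', 'body']
print(len(from_head))           # 1
```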

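(2) find(name, attrs, recursive, text, **kwargs)

find() takes the same filters as find_all(). The only difference is the return value: find_all() returns a list of every match (an empty list if nothing matches), whereas find() returns the first match directly (None if nothing matches).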
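Because find() returns None when nothing matches, its result usually needs a check before use. A minimal sketch on a stand-in document (an assumption):

```python
from bs4 import BeautifulSoup

# Stand-in document, assumed for illustration only.
html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

title = soup.find("title")          # a Tag
missing = soup.find("table")        # None: no <table> in the document
all_missing = soup.find_all("table")  # empty list, never None

# Guard against None before calling methods on the result.
title_text = title.get_text() if title is not None else ""
missing_text = missing.get_text() if missing is not None else ""

print(title_text)        # The Dormouse's story
print(missing is None)   # True
print(len(all_missing))  # 0
```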
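(3) find_parents() and find_parent()

find_all() and find() search only the current node's descendants: its children, grandchildren, and so on. find_parents() and find_parent() search the current node's ancestors instead, and they accept the same filters as the ordinary tag searches.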
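A sketch of the upward search, assuming a one-link stand-in document. Ancestors are visited from the nearest outwards, so the nearest match comes first.

```python
from bs4 import BeautifulSoup

# One-link stand-in document, assumed for illustration only.
html = '<html><body><p class="story">Names: <a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find(id="link1")

nearest_p_class = link.find_parent("p")["class"]                     # closest <p> ancestor
ancestors = [t.name for t in link.find_parents(["p", "body"])]       # nearest first
no_div = link.find_parent("div")                                     # no such ancestor
from_string = soup.find(string="Elsie").find_parent("a")["id"]       # works from a text node too

print(nearest_p_class)  # ['story']
print(ancestors)        # ['p', 'body']
print(no_div)           # None
print(from_string)      # link1
```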
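(4) find_next_siblings() and find_next_sibling()

These two methods iterate over the siblings that come after the current tag, using the .next_siblings attribute. find_next_siblings() returns all following siblings that match; find_next_sibling() returns only the first one that matches.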
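(5) find_previous_siblings() and find_previous_sibling()

These two methods iterate over the siblings that come before the current tag, using the .previous_siblings attribute. find_previous_siblings() returns all preceding siblings that match; find_previous_sibling() returns the first one that matches.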
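The sibling searches in both directions can be sketched on a single paragraph with three links (a stand-in document, assumed for illustration). Note that the backward search returns the nearest sibling first.

```python
from bs4 import BeautifulSoup

# Single-paragraph stand-in document, assumed for illustration only.
html = ('<p class="story">Names: <a id="link1">Elsie</a>, '
        '<a id="link2">Lacie</a> and <a id="link3">Tillie</a>;</p>')
soup = BeautifulSoup(html, "html.parser")

first = soup.find(id="link1")
last = soup.find(id="link3")

after_first = [t["id"] for t in first.find_next_siblings("a")]
next_one = first.find_next_sibling("a")["id"]
before_last = [t["id"] for t in last.find_previous_siblings("a")]
previous_one = last.find_previous_sibling("a")["id"]

print(after_first)   # ['link2', 'link3']
print(next_one)      # link2
print(before_last)   # ['link2', 'link1']
print(previous_one)  # link2
```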
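(6) find_all_next() and find_next()

These two methods iterate over the tags and strings that come after the current tag in the document, using the .next_elements attribute. find_all_next() returns all matching nodes; find_next() returns the first matching node.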
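(7) find_all_previous() and find_previous()

These two methods iterate over the tags and strings that come before the current node in the document, using the .previous_elements attribute. find_all_previous() returns all matching nodes; find_previous() returns the first matching node.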
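Unlike the sibling searches, the next/previous searches follow document order and cross parent boundaries. A sketch on a compact stand-in document (an assumption); note that the backward search also finds the <p> that contains the starting link, because that tag was opened earlier in the document.

```python
from bs4 import BeautifulSoup

# Compact stand-in document, assumed for illustration only.
html = ('<html><head><title>T</title></head><body>'
        '<p class="title"><b>Bold</b></p>'
        '<p class="story"><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
        '</body></html>')
soup = BeautifulSoup(html, "html.parser")

first = soup.find(id="link1")

later_links = [a["id"] for a in first.find_all_next("a")]
next_string = str(first.find_next(string=True))                # the link's own text comes next
earlier_p = [p["class"] for p in first.find_all_previous("p")]  # nearest first
previous_title = first.find_previous("title").string

print(later_links)     # ['link2']
print(next_string)     # Elsie
print(earlier_p)       # [['story'], ['title']]
print(previous_title)  # T
```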

Note: methods (2) through (7) take the same filter arguments as find_all() and work on the same principle, so they are not covered in more detail here.

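8. CSS selectors

In CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. The same syntax can be used to pick out elements here through soup.select(), which returns a list.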
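A minimal sketch of the three basic selector forms, on a stand-in document (an assumption). It also uses select_one(), a companion method in Beautiful Soup 4 that returns the first match or None instead of a list.

```python
from bs4 import BeautifulSoup

# Stand-in document, assumed for illustration only.
html = ('<p class="story"><a class="sister" id="link1">Elsie</a>, '
        '<a class="sister" id="link2">Lacie</a></p>')
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("a")            # tag name, bare
by_class = soup.select(".sister")    # class name, prefixed with a dot
by_id = soup.select("#link2")        # id, prefixed with #
one = soup.select_one("#link2")      # first match as a Tag
none = soup.select_one("#missing")   # None when nothing matches

print(isinstance(by_tag, list), len(by_tag))  # True 2
print(len(by_class), len(by_id))              # 2 1
print(one.get_text(), none)                   # Lacie None
```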

(1) Lookup by tag name

print(soup.select('title'))
# [<title>The Dormouse's story</title>]

print(soup.select('a'))

print(soup.select('b'))
# [<b>The Dormouse's story</b>]

(2) Lookup by class name

print(soup.select('.sister'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(3) Lookup by id

print(soup.select('#link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

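(4) Combined lookup

Combined lookups follow the same rules as combining tag names, class names, and ids in a CSS file. For example, to find the element with id link1 inside a <p> tag, separate the two parts with a space.

For the underlying rules see https://www.runoob.com/cssref/css-selectors.html. The commonly used CSS selectors from that reference are: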

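Selector	Example	Description
.class	.intro	Selects all elements with class="intro"; equivalent to the attribute selector [class~=intro]
#id	#firstname	Selects the element with id="firstname"; equivalent to the attribute selector [id=firstname]
*	*	Selects all elements
element	p	Selects all <p> elements
element,element	div,p	Selects all <div> elements and all <p> elements
element element	div p	Selects all <p> elements inside <div> elements
element>element	div>p	Selects all <p> elements whose parent is a <div> element
element+element	div+p	Selects every <p> element that immediately follows a <div> element (the first sibling after the div)
[attribute]	[target]	Selects all elements with a target attribute
[attribute=value]	[target=_blank]	Selects all elements with target="_blank"
[attribute~=value]	[title~=flower]	Selects all elements whose title attribute contains the word "flower"
[attribute|=value]	[lang|=en]	Selects all elements whose lang attribute is exactly "en" or begins with "en-"
element1~element2	p~ul	Selects every <ul> element that follows a <p> element at the same level (a later sibling of the p)
[attribute^=value]	a[src^="https"]	Selects every <a> element whose src attribute value begins with "https"
[attribute$=value]	a[src$=".pdf"]	Selects every <a> element whose src attribute value ends with ".pdf"
[attribute*=value]	a[src*="runoob"]	Selects every <a> element whose src attribute value contains the substring "runoob"
:first-of-type	p:first-of-type	Selects every <p> element that is the first <p> element of its parent
:last-of-type	p:last-of-type	Selects every <p> element that is the last <p> element of its parent
:only-of-type	p:only-of-type	Selects every <p> element that is the only <p> element of its parent
:nth-of-type(n)	p:nth-of-type(2)	Selects every <p> element that is the second <p> element of its parent
:nth-last-of-type(n)	p:nth-last-of-type(2)	Selects every <p> element that is the second <p> element of its parent, counting from the last child
:only-child	p:only-child	Selects every <p> element that is the only child of its parent
:nth-child(n)	p:nth-child(2)	Selects every <p> element that is the second child of its parent
:nth-last-child(n)	p:nth-last-child(2)	Selects every <p> element that is the second child of its parent, counting from the last child
:last-child	p:last-child	Selects every <p> element that is the last child of its parent
:root	:root	Selects the document's root element
:empty	p:empty	Selects every <p> element that has no children, including text nodes (whitespace such as spaces or line breaks is allowed)
:not(selector)	:not(p)	Selects every element that is not a <p> element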

The element with id link1 inside a <p> tag:

print(soup.select('p #link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Direct child lookup:

print(soup.select("head > title"))
# [<title>The Dormouse's story</title>]
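Several of the combinators and pseudo-classes can be tried side by side. The sketch below assumes a Beautiful Soup 4 release that delegates select() to the soupsieve package (older releases supported fewer selectors) and uses a stand-in copy of the sample document; ids() is a hypothetical helper that only shortens the output.

```python
from bs4 import BeautifulSoup

# Stand-in copy of the sample document, assumed for illustration only.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

def ids(selector):
    # Hypothetical helper: the id of every element the selector matches.
    return [tag.get("id") for tag in soup.select(selector)]

descendants = ids("body a")           # any depth below <body>
children = ids("body > a")            # direct children of <body> only
adjacent = ids("#link1 + .sister")    # the element sibling right after #link1
following = ids("#link1 ~ .sister")   # all later siblings of #link1
negated = ids("a:not(#link1)")        # every <a> except #link1
grouped = [t.name for t in soup.select("title, b")]       # either selector
third_p = soup.select("p:nth-of-type(3)")[0].get_text()   # third <p> of its parent

print(descendants)  # ['link1', 'link2', 'link3']
print(children)     # []
print(adjacent)     # ['link2']
print(following)    # ['link2', 'link3']
print(negated)      # ['link2', 'link3']
print(grouped)      # ['title', 'b']
print(third_p)      # ...
```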

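(5) Attribute lookup

A selector can also test attributes. The attribute test goes in square brackets. Because the attribute belongs to the same node as the tag, there must be no space between the tag name and the bracket, or nothing will match.

print(soup.select('a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Attribute tests can be combined with the other lookups in the same way: parts that refer to different nodes are separated by a space, and parts that refer to the same node are written without one.

print(soup.select('p a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]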
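The effect of the space, and the prefix, suffix, and substring operators, can be checked on a stand-in paragraph with three links (an assumption); ids() is again a hypothetical helper for compact output.

```python
from bs4 import BeautifulSoup

# Stand-in document, assumed for illustration only.
html = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a></p>"""
soup = BeautifulSoup(html, "html.parser")

def ids(selector):
    # Hypothetical helper: the id of every element the selector matches.
    return [tag.get("id") for tag in soup.select(selector)]

same_node = ids("a[href]")         # <a> tags that have an href attribute
with_space = ids("a [href]")       # descendants of <a> that have href: none here
prefix = ids('a[href^="http://example.com/"]')  # href starts with the value
suffix = ids('a[href$="tillie"]')               # href ends with the value
substring = ids('a[href*=".com/el"]')           # href contains the value
nested = ids('p a[href="http://example.com/elsie"]')  # exact href, inside a <p>

print(same_node)   # ['link1', 'link2', 'link3']
print(with_space)  # []
print(prefix)      # ['link1', 'link2', 'link3']
print(suffix)      # ['link3']
print(substring)   # ['link1']
print(nested)      # ['link1']
```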

That covers select(), a lookup method that reaches the same results as find_all() by a different route.

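(6) Getting text and attributes

soup = BeautifulSoup(html_doc, 'html.parser')
s = soup.select('p.story')
s[0].get_text()  # text of the <p> node and all of its descendants
s[0].get_text("|")  # join the text fragments with the given separator
s[0].get_text("|", strip=True)  # also strip whitespace from the ends of each fragment
print(s[0].get("class"))  # list of the <p> node's class values (attributes other than class return a string)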
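To make the differences between these calls visible, the sketch below runs them on a one-paragraph stand-in document (an assumption) whose text fragments carry extra spaces.

```python
from bs4 import BeautifulSoup

# One-paragraph stand-in document, assumed for illustration only.
html = '<p class="story main">Hello <a href="http://example.com/elsie" id="link1"> Elsie </a>!</p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.select("p.story")[0]
a = p.find("a")

plain = p.get_text()                    # fragments concatenated as they are
joined = p.get_text("|")                # fragments joined with a separator
stripped = p.get_text("|", strip=True)  # each fragment stripped before joining

classes = p.get("class")     # class is multi-valued: a list
href = a.get("href")         # other attributes: a string
missing = a.get("title")     # absent attribute: None

print(repr(plain))     # 'Hello  Elsie !'
print(repr(joined))    # 'Hello | Elsie |!'
print(repr(stripped))  # 'Hello|Elsie|!'
print(classes, href, missing)
```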

(7) Other examples

html_doc = """<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    </p>
        and they lived at the bottom of a well.
    <p class="story">...</p>
</body>
"""
from bs4 import BeautifulSoup
# select() returns its results as a list
soup = BeautifulSoup(html_doc, 'html.parser')
soup.select('title')  # the <title> tag
soup.select("p:nth-of-type(3)")  # the third <p> node of its parent
soup.select('body a')  # all <a> descendants of <body>
soup.select('p > a')  # all <a> nodes that are direct children of a <p>
soup.select('p > #link1')  # the direct child of a <p> whose id is link1
soup.select('#link1 ~ .sister')  # all later siblings of #link1 with class sister
soup.select('#link1 + .sister')  # the sibling right after #link1, if it has class sister
soup.select('.sister')  # all nodes with class sister
soup.select('[class="sister"]')  # all nodes whose class attribute is exactly "sister"
soup.select("#link1")  # the node with id link1
soup.select("a#link1")  # the <a> node with id link1
soup.select('a[href]')  # all <a> nodes that have an href attribute
soup.select('a[href="http://example.com/elsie"]')  # all <a> nodes with exactly this href
soup.select('a[href^="http://example.com/"]')  # all <a> nodes whose href starts with the value
soup.select('a[href$="tillie"]')  # all <a> nodes whose href ends with the value
soup.select('a[href*=".com/el"]')  # all <a> nodes whose href contains the value (substring match, not a regular expression)

The official Beautiful Soup documentation is available at https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh