Beautiful Soup Library -- 02 -- Navigating the Document Tree

Latest recommended article published on 2023-02-16 15:15:50

S_numb


Article link: https://blog.csdn.net/S_numb/article/details/120201125

1. Navigating the Document Tree

Test document:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

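How to get from one part of the document to another

1.1 Child nodes

A Tag may contain multiple strings or other Tags; these are all children of that Tag.
Beautiful Soup provides many attributes for navigating and iterating over a Tag's children.

Note: string nodes in Beautiful Soup do not support these attributes, because a string cannot have children.

1.1.1 Navigating by tag name

The simplest way to navigate the document tree is to give the name of the tag you want.
To get the <head> tag, just use soup.head.
This can be repeated to drill further down into the tree. For example, to get the first <b> tag inside the <body> tag:
- soup.body.b
Accessing a tag name as an attribute returns only the first tag with that name.
To get all the <a> tags, or anything more than the first tag with a given name, use one of the methods described in Searching the tree, such as find_all().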

print(soup.p)
print(soup.find_all('a'))

Output:

<p class="title"><b>The Dormouse's story</b></p>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
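The attribute-style access described above can be sketched as follows. This is a minimal illustration on a shortened copy of the test document; the variable names are chosen here for the example and are not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Shortened version of the test document, used only for this sketch.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a href="http://example.com/elsie" id="link1">Elsie</a>,
<a href="http://example.com/lacie" id="link2">Lacie</a></p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

head = soup.head              # first <head> tag
first_bold = soup.body.b      # first <b> tag inside <body>
first_link = soup.a           # only the FIRST <a> tag
all_links = soup.find_all("a")  # every <a> tag, as a list
missing = soup.table          # no <table> in the document, so this is None

print(head)
print(first_bold)
print(first_link["id"], len(all_links), missing)
```

Note that asking for a tag name that does not occur in the document returns None rather than raising an error, so chained access such as soup.table.tr would fail on the second step.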

1.1.2 contents and children

A tag's contents attribute returns the tag's children as a list:

tag_head = soup.head
print(tag_head)
print(tag_head.contents)

tag_title = tag_head.contents[0]
print(tag_title)
print(tag_title.contents)

Output:

<head><title>The Dormouse's story</title></head>
[<title>The Dormouse's story</title>]
<title>The Dormouse's story</title>
["The Dormouse's story"]

The BeautifulSoup object itself also has children; in this case the <html> tag is a child of the BeautifulSoup object:

print(len(soup.contents))
print(soup.contents[0].name)
print(soup.contents[1].name)

Output:

2
None
html

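Why does this differ from the official tutorial? Normally there is only one child, the <html> tag, yet here there are two and the name of the first one is None.
Answer: html_doc is a triple-quoted string that starts with a line break right after the opening quotes, so the first child is a whitespace string, which has no tag name. Remove that leading blank line and the count becomes 1.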
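The effect of the leading line break can be checked with a small sketch. The two-line document below is made up for the illustration; stripping the text before parsing is one possible way to get rid of the extra child, not the only one.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical mini document that starts with a line break,
# like a triple-quoted string opened on its own line.
doc_with_newline = "\n<html><head><title>t</title></head></html>"

soup_raw = BeautifulSoup(doc_with_newline, "html.parser")
soup_stripped = BeautifulSoup(doc_with_newline.strip(), "html.parser")

first_child = soup_raw.contents[0]

# Two children: the whitespace string and the <html> tag.
print(len(soup_raw.contents), repr(first_child))
print(isinstance(first_child, NavigableString))

# After stripping the text, <html> is the only child.
print(len(soup_stripped.contents), soup_stripped.contents[0].name)
```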

Strings have no contents attribute, because a string cannot have children.
The children generator of a tag lets you loop over the tag's direct children:

tag_title = soup.title

for child in tag_title.children:
    print(child)

Output:

The Dormouse's story
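To make the difference between contents and children concrete, here is a small sketch on a made-up paragraph that mixes strings and tags. Both give the same direct children in the same order; contents is a list that can be indexed, while children has to be iterated.

```python
from bs4 import BeautifulSoup

# Made-up fragment with two strings and two tags as direct children of <p>.
soup = BeautifulSoup("<p>Hello <b>bold</b> and <i>italic</i></p>", "html.parser")
p = soup.p

# contents is a plain list, so len() and indexing work.
print(len(p.contents), repr(p.contents[0]))

# children yields the same nodes one at a time.
child_types = [type(child).__name__ for child in p.children]
print(child_types)
```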

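1.1.3 descendants

The contents and children attributes only cover a tag's direct children.
- For example, the <head> tag has a single direct child, the <title> tag.
- The <title> tag itself has one child: the string "The Dormouse's story".
- That string is therefore a descendant of the <head> tag, although it is not a direct child.
The descendants attribute iterates recursively over all of a tag's descendants: its children, its children's children, and so on.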

tag_head = soup.head

for child in tag_head.descendants:
    print(child)

Output:

<title>The Dormouse's story</title>
The Dormouse's story
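A quick way to see the difference is to count both. The sketch below parses only the <head> part of the test document, so the counts are easy to verify by hand.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)
head = soup.head

direct_children = list(head.children)        # only <title>
all_descendants = list(head.descendants)     # <title>, then its string

print(len(direct_children), len(all_descendants))
print(repr(all_descendants[1]))
```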

1.1.4 string

If a tag has exactly one child and that child is a NavigableString, the child is available as the tag's string attribute:

tag_title = soup.title
print(tag_title.string)

Output:

The Dormouse's story

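If a tag's only child is another tag, and that tag has a string, then the outer tag's string is the same as its only child's string.
If a tag contains more than one child, it is unclear which one string should refer to, so string is None.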
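The three cases for string can be sketched on a small made-up fragment. The call to get_text() at the end is shown only as one common way to obtain the text when string is None.

```python
from bs4 import BeautifulSoup

# Made-up fragment: <head> has one child tag, <p> has two children.
soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head><p>one <b>two</b></p>",
    "html.parser",
)

title_string = soup.title.string  # single string child
head_string = soup.head.string    # single child tag, so its string is used
p_string = soup.p.string          # two children ("one " and <b>), so None

print(title_string)
print(head_string)
print(p_string)

# get_text() concatenates all strings below the tag.
print(soup.p.get_text())
```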

1.1.5 strings and stripped_strings

If a tag contains more than one string, use the strings generator to loop over them:

soup = BeautifulSoup(html_doc, 'html.parser')
for string in soup.strings:
    print(repr(string))  # repr() shows the string with quotes and escape sequences such as \n

Output:

'\n'
"The Dormouse's story"
'\n'
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n'
'Elsie'
',\n'
'Lacie'
' and\n'
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
'...'
'\n'

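The output contains many whitespace-only strings; use stripped_strings to drop the extra whitespace:
- Strings consisting entirely of whitespace are skipped, and leading and trailing whitespace is removed from the rest.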

soup = BeautifulSoup(html_doc, 'html.parser')
for string in soup.stripped_strings:
    print(repr(string))  # repr() shows the string with quotes and escape sequences such as \n

Output:

"The Dormouse's story"
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
'Elsie'
','
'Lacie'
'and'
'Tillie'
';\nand they lived at the bottom of a well.'
'...'
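Since both attributes are generators, a common pattern is to collect them into a list or join them. The sketch below uses a made-up paragraph with extra whitespace; joining with a single space is just one possible choice of separator.

```python
from bs4 import BeautifulSoup

# Made-up paragraph with leading, trailing and inner whitespace.
snippet = "<p>  Once upon a time\n<a>Elsie</a>,\n<a>Lacie</a>  </p>"
soup = BeautifulSoup(snippet, "html.parser")

raw = list(soup.strings)             # whitespace kept as parsed
clean = list(soup.stripped_strings)  # whitespace-only strings dropped, rest stripped

print(repr(raw[0]))
print(clean)

# Join the cleaned pieces into one line of text.
sentence = " ".join(clean)
print(sentence)
```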

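1.2 Parent nodes

Every tag and every string has a parent: the tag that contains it.

1.2.1 parent

Use the parent attribute to get an element's parent.
In the "Alice" example document, the <head> tag is the parent of the <title> tag: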

tag_title = soup.title
print(tag_title)
print(tag_title.parent)

Output:

<title>The Dormouse's story</title>
<head><title>The Dormouse's story</title></head>

The parent of a top-level tag such as <html> is the BeautifulSoup object itself:

tag_html = soup.html
print(type(tag_html.parent))

Output:

<class 'bs4.BeautifulSoup'>

The parent of the BeautifulSoup object is None.
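Because the chain of parent links always ends at None, it is possible to climb to the root with a plain loop. This is a sketch on a made-up mini document; the depth counter is only there to show how many steps were taken.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>t</title></head></html>", "html.parser")

node = soup.title
depth = 0
# Follow parent links until the node with no parent is reached.
while node.parent is not None:
    node = node.parent
    depth += 1

# title -> head -> html -> BeautifulSoup object: three steps.
print(depth, node is soup, soup.parent)
```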

1.2.2 parents

The parents attribute iterates over all of an element's ancestors.
The example below uses parents to walk from the <a> tag up to the root of the document:

tag_a = soup.a
print(tag_a)
for parent in tag_a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

Output:

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
p
body
html
[document]
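The same walk can be written as a list comprehension when only the ancestor names are needed. This is a sketch on a reduced, made-up version of the test document.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<html><body><p><a href="#">Elsie</a></p></body></html>', "html.parser"
)

# Names of all ancestors of the first <a> tag, nearest first.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```

The last entry, "[document]", is the name of the BeautifulSoup object itself, which matches the final line of the output above.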

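1.3 Sibling nodes

When a document is pretty-printed, sibling nodes appear at the same indentation level.
The same relationship can be used in code.
Example: the <b> tag and the <c> tag are at the same level; both are children of the same element, so <b> and <c> are siblings: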

from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed.
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")
print(sibling_soup.prettify())

Output:

<a>
 <b>
  text1
 </b>
 <c>
  text2
 </c>
</a>

1.3.1 next_sibling and previous_sibling

Use the next_sibling and previous_sibling attributes to move between sibling nodes in the document tree:

from bs4 import BeautifulSoup

brother_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")
print(brother_soup.b.next_sibling)
print()
print(brother_soup.c.previous_sibling)

Output:

<c>text2</c>

<b>text1</b>

In real documents, the next_sibling or previous_sibling of a tag is usually a string containing text or whitespace.
For example:

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

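The next_sibling of the first <a> tag is not the second <a> tag. In the full test document it is a string: the comma and line break that separate the first <a> tag from the second.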
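This can be verified with a short sketch on a made-up two-link paragraph. The find_next_sibling() call belongs to the search API rather than to the navigation attributes covered here; it is shown only as one way to skip the string in between.

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find(id="link1")
second = soup.find(id="link2")

# The immediate sibling is the string between the two tags.
print(repr(first.next_sibling))

# The second <a> tag is one more step away.
print(first.next_sibling.next_sibling is second)

# find_next_sibling("a") skips over the string directly.
print(first.find_next_sibling("a") is second)
```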

1.3.2 next_siblings and previous_siblings

The next_siblings and previous_siblings attributes iterate over a node's siblings.
Example 1: next_siblings

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

for sibling in soup.a.next_siblings:
    print(repr(sibling))

Output:

',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'

Example 2: previous_siblings

for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))

Output:

' and\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
',\n'
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
'Once upon a time there were three little sisters; and their names were\n'
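As the outputs show, the sibling iterators return strings and tags mixed together. If only the tags are of interest, one possible approach is to filter by type, sketched here on a made-up paragraph.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<p>names: <a id="link1">Elsie</a>,\n'
    '<a id="link2">Lacie</a> and\n'
    '<a id="link3">Tillie</a>; the end.</p>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.find(id="link1")

# Keep only siblings that are tags, dropping the strings in between.
following_tags = [s for s in first.next_siblings if isinstance(s, Tag)]
following_ids = [tag["id"] for tag in following_tags]
print(following_ids)
```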

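1.4 Going back and forth

1.4.1 next_element and previous_element

The next_element attribute points to whatever was parsed immediately after the current object (a string or a tag).
It may be the same as next_sibling, but usually it is different.
Example: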

tag_a_last = soup.find("a", id="link3")
print(tag_a_last)
print("-------------")
print(tag_a_last.next_sibling)
print("-------------")
print(tag_a_last.next_element)

Output:

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
-------------
;
and they lived at the bottom of a well.
-------------
Tillie

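The next_sibling of the last <a> tag is a string: the rest of the sentence that was interrupted by the start of the <a> tag.
The next_element of that <a> tag is whatever was parsed immediately after the <a> tag itself, which is not the rest of the sentence but the string "Tillie".
This is because in the original markup the string "Tillie" appears before the semicolon. The parser encounters the <a> tag, then the string "Tillie", then the closing </a> tag, then the semicolon and the rest of the sentence. The semicolon is at the same level as the <a> tag, but the string "Tillie" was parsed first.

The previous_element attribute is the exact opposite of next_element: it points to whatever was parsed immediately before the current object.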
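The relationship between the two attributes can be sketched on a made-up one-link paragraph. The next_elements iterator used at the end is the plural counterpart of next_element and yields everything parsed after the current object.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>and <a id="link3">Tillie</a>; the end.</p>', "html.parser")
last = soup.find(id="link3")

# Parse order: <p>, "and ", <a>, "Tillie", "; the end."
print(repr(last.previous_element))  # the string parsed just before <a>
print(repr(last.next_element))      # the string inside <a>
print(repr(last.next_sibling))      # the string after </a>

# Going forward and then back returns to the same tag.
print(last.next_element.previous_element is last)

# Everything parsed after the <a> tag, in order.
rest = list(last.next_elements)
print(rest)
```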
