Learning the Python Scraping Package BeautifulSoup (Part 8): .parent and Related Attributes

SuPhoebe

Category: Python & Django Development. Tags: python, bs4, web scraping

Article link: https://blog.csdn.net/u013007900/article/details/54691666

We continue with the HTML page from the previous post:

from bs4 import BeautifulSoup

html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

# The parser choice is an assumption; html.parser ships with Python.
soup = BeautifulSoup(html_doc, "html.parser")

Continuing with the document tree: every tag and every string has a parent, namely the tag that contains it.

.parent

Use the .parent attribute to get an element's parent. In the example HTML document, the <head> tag is the parent of the <title> tag:

title_tag = soup.title 
title_tag
# <title>The Dormouse's story</title> 
title_tag.parent 
# <head><title>The Dormouse's story</title></head>

The string inside <title> also has a parent, the <title> tag itself:

title_tag.string.parent 
# <title>The Dormouse's story</title>

The parent of a top-level node such as <html> is the BeautifulSoup object:

html_tag = soup.html 
type(html_tag.parent) 
# <class 'bs4.BeautifulSoup'>

The .parent of the BeautifulSoup object itself is None.
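
The chain of parents described above can be followed by hand. The sketch below is only an illustration: the one-line document and the helper name `climb` are made up for this example, and `html.parser` is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical mini-document, used only for this illustration.
doc = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(doc, "html.parser")

def climb(node):
    """Collect the names of all ancestors by following .parent until it is None."""
    names = []
    current = node.parent
    while current is not None:
        names.append(current.name)
        current = current.parent
    return names

chain = climb(soup.title.string)
print(chain)                # ['title', 'head', 'html', '[document]']
print(soup.parent is None)  # True
```

The loop stops once it passes the BeautifulSoup object, whose name is reported as [document].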

.parents

An element's .parents attribute iterates over all of its ancestors. The example below uses .parents to walk from the <a> tag up to the root of the document:

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
for parent in link.parents:
    print(parent.name)
# p
# body
# html
# [document]
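
Because .parents runs from the nearest ancestor to the farthest, reversing it gives a root-first path to an element. The following is a minimal sketch on a made-up one-line document (the variable names are illustrative, and `html.parser` is an assumed parser choice); `find_parent()` is the Beautiful Soup method that returns the first matching ancestor.

```python
from bs4 import BeautifulSoup

# Hypothetical mini-document, used only for this illustration.
doc = '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(doc, "html.parser")
link = soup.find("a", id="link1")

# .parents yields nearest-first, so reverse it to get a root-first path.
path = " > ".join(p.name for p in reversed(list(link.parents)))
print(path)  # [document] > html > body > p

# find_parent() returns the first ancestor that matches the given name.
body = link.find_parent("body")
print(body.name)  # body
```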

Sibling nodes

For example:

<a>
    <b>text1</b>
    <c>text2</c>
</a>

Here the <b> and <c> nodes are siblings: they are direct children of the same tag.

.next_sibling and .previous_sibling

In the document tree, use the .next_sibling and .previous_sibling attributes to move between sibling nodes:

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")
sibling_soup.b.next_sibling
# <c>text2</c>
sibling_soup.c.previous_sibling
# <b>text1</b>

The <b> tag has a .next_sibling, but its .previous_sibling is None, because <b> is the first of its siblings. Likewise, the <c> tag has a .previous_sibling, but its .next_sibling is None.
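
The edge cases can be checked directly. This is a small sketch that rebuilds the same two-tag document (with `html.parser` as an assumed parser choice; the names `first` and `last` are illustrative) and prints the sibling attributes at both ends.

```python
from bs4 import BeautifulSoup

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")
first, last = sibling_soup.b, sibling_soup.c

print(first.previous_sibling)  # None: <b> is the first child of <a>
print(last.next_sibling)       # None: <c> is the last child of <a>

# The strings "text1" and "text2" are not siblings, because they
# have different parents (<b> and <c> respectively).
print(first.string.next_sibling)  # None
```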

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
link.next_sibling
# ',\n'

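Note that the .next_sibling of the first <a> tag is the string ',\n' (a comma followed by a newline), not the second <a> tag: the text between the two tags is a node in its own right.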

link.next_sibling.next_sibling 
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

The second <a> tag (Lacie) is the .next_sibling of that string, that is, the first <a> tag's .next_sibling.next_sibling.
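
When only the next tag is wanted, the text node in between can be skipped with `find_next_sibling()`, a Beautiful Soup method that searches the following siblings for the first match. The sketch below uses a shortened, made-up version of the paragraph and assumes `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical shortened paragraph, used only for this illustration.
doc = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a> and\n<a id="link3">Tillie</a></p>'
soup = BeautifulSoup(doc, "html.parser")
first = soup.a

# The immediate sibling is the text between the two tags.
print(repr(first.next_sibling))  # ',\n'

# find_next_sibling("a") skips the text and returns the next <a> tag.
next_link = first.find_next_sibling("a")
print(next_link["id"])  # link2
```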

.next_siblings and .previous_siblings

Use the .next_siblings and .previous_siblings attributes to iterate over a node's siblings:

for sibling in soup.a.next_siblings:
    print(repr(sibling))
# ',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# '; and they lived at the bottom of a well.'

for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))
# ' and\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ',\n'
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 'Once upon a time there were three little sisters; and their names were\n'
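
Since these iterators return strings as well as tags, a common follow-up is to keep only the tags. The sketch below is one way to do that with an `isinstance` check; the shortened document and the variable names are made up for this example, and `html.parser` is an assumed parser choice.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical shortened paragraph, used only for this illustration.
doc = (
    '<p>Their names were\n'
    '<a id="link1">Elsie</a>,\n'
    '<a id="link2">Lacie</a> and\n'
    '<a id="link3">Tillie</a>; the end.</p>'
)
soup = BeautifulSoup(doc, "html.parser")

# Keep only the Tag objects among the siblings that follow the first <a>.
next_ids = [s["id"] for s in soup.a.next_siblings if isinstance(s, Tag)]
print(next_ids)  # ['link2', 'link3']

# .previous_siblings walks backwards, so the nearest sibling comes first.
last_link = soup.find(id="link3")
prev_ids = [s["id"] for s in last_link.previous_siblings if isinstance(s, Tag)]
print(prev_ids)  # ['link2', 'link1']
```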

Going back and forth

Consider the following HTML:

<html><head><title>The Dormouse's story</title></head> <p class="title"><b>The Dormouse's story</b></p>

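An HTML parser turns this string into a series of events: "open a tag", "add a string", "close the tag", "open a tag", and so on.

Beautiful Soup offers tools for reconstructing the initial parse of the document.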
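
To make the idea of parse events concrete, the sketch below logs them with `HTMLParser` from Python's standard library. This is not how Beautiful Soup exposes the parse; it only illustrates the event sequence. The class name `EventLogger` and the shortened input are made up for this example.

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Hypothetical helper that records each parse event as a short string."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(f"open <{tag}>")

    def handle_data(self, data):
        self.events.append(f"string {data!r}")

    def handle_endtag(self, tag):
        self.events.append(f"close </{tag}>")

logger = EventLogger()
logger.feed("<html><head><title>The Dormouse's story</title></head></html>")
logger.close()

for event in logger.events:
    print(event)
# open <html>
# open <head>
# open <title>
# string "The Dormouse's story"
# close </title>
# close </head>
# close </html>
```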

.next_element and .previous_element

The .next_element attribute points to whatever was parsed immediately afterwards (a string or a tag). It may be the same as .next_sibling, but usually it is different.

last_a_tag = soup.find("a", id="link3") 
last_a_tag 
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a> 
last_a_tag.next_sibling 
# '; and they lived at the bottom of a well.'

The .next_element of this <a> tag, however, is whatever was parsed right after the opening <a> tag. That is not the rest of the sentence but the string "Tillie":

last_a_tag.next_element
# 'Tillie'

.previous_element is the exact opposite of .next_element: it points to whatever was parsed immediately before the current object:

last_a_tag.previous_element
# ' and\n'
last_a_tag.previous_element.next_element
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
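
The difference between the two attributes is easiest to see side by side. The sketch below uses a made-up one-line document with `html.parser` as an assumed parser choice; the variable name `tag` is illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, used only for this illustration.
doc = '<p><a id="link3">Tillie</a>; and they lived at the bottom of a well.</p>'
soup = BeautifulSoup(doc, "html.parser")
tag = soup.a

# .next_element descends into the tag: the string inside it was parsed next.
print(repr(tag.next_element))  # 'Tillie'

# .next_sibling stays on the same level: the text that follows the tag.
print(repr(tag.next_sibling))  # '; and they lived at the bottom of a well.'

# In parse order, the string inside the tag is followed by the tag's sibling.
print(tag.next_element.next_element is tag.next_sibling)  # True
```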

.next_elements and .previous_elements

Use the .next_elements and .previous_elements iterators to move forward or backward through the document in the order it was parsed:

for element in last_a_tag.next_elements:
    print(repr(element))
# 'Tillie'
# '; and they lived at the bottom of a well.'
# '\n'
# <p class="story">...</p>
# '...'
# '\n'
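
One practical use of .next_elements is collecting all the text that appears after a given tag. The sketch below filters the iterator down to strings with an `isinstance` check; the shortened document and the variable names are made up for this example, and `html.parser` is an assumed parser choice.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical shortened document, used only for this illustration.
doc = '<div><p><a id="link3">Tillie</a>; and they lived.</p>\n<p class="story">...</p></div>'
soup = BeautifulSoup(doc, "html.parser")
start = soup.find("a", id="link3")

# Keep only the strings among everything parsed after the opening <a> tag.
following = [str(e) for e in start.next_elements if isinstance(e, NavigableString)]
print(following)  # ['Tillie', '; and they lived.', '\n', '...']

# Drop the whitespace-only strings that sit between tags.
non_blank = [s for s in following if s.strip()]
print(non_blank)  # ['Tillie', '; and they lived.', '...']
```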