BeautifulSoup Usage Explained: Traversing the Document Tree

Latest recommended article published on 2024-04-27 06:27:29

小宇不内向

Category: BeautifulSoup library
Tags: BeautifulSoup, web scraping

Article link: https://blog.csdn.net/xiaoyu_wu/article/details/102295184

Beautiful Soup 4.4.0 documentation: https://beautifulsoup.readthedocs.io/zh_CN/latest/

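1. Child nodes

A Tag may contain several strings or other Tags; all of these are the Tag's children. Beautiful Soup provides many attributes for navigating and iterating over these children.

The simplest way to navigate the document tree is to ask for a tag by its name.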

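# soup is assumed to have been built from the sample document
# used in the Beautiful Soup documentation.

# Get the <head> tag
soup.head
# <head><title>The Dormouse's story</title></head>

# Get the <title> tag
soup.title
# <title>The Dormouse's story</title>

# Get the first <b> tag inside the <body> tag
soup.body.b    # <b>The Dormouse's story</b>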

Attribute-style access with a dot returns only the first tag with that name. To get every <a> tag, use find_all('a') instead.
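As a rough illustration of the difference, the sketch below parses a small document and compares soup.a with soup.find_all('a'). The html_doc string is a shortened, hypothetical stand-in for the sample document, and the choice of the built-in "html.parser" is an assumption made so the example needs no extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample document (for illustration only).
html_doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class='title'><b>The Dormouse's story</b></p>"
    "<p class='story'>"
    "<a href='http://example.com/elsie' id='link1'>Elsie</a>, "
    "<a href='http://example.com/lacie' id='link2'>Lacie</a> and "
    "<a href='http://example.com/tillie' id='link3'>Tillie</a>"
    "</p></body></html>"
)

soup = BeautifulSoup(html_doc, "html.parser")

# Dot access returns only the first matching tag.
first_link = soup.a
print(first_link["id"])          # link1

# find_all returns every matching tag.
all_links = soup.find_all("a")
link_ids = [a["id"] for a in all_links]
print(link_ids)                  # ['link1', 'link2', 'link3']

# Asking for a tag name that does not occur gives None rather than an error.
missing = soup.table
print(missing)                   # None
```

Because a missing tag yields None, chained access such as soup.table.tr would raise an AttributeError, so it is worth checking for None first when the structure is not guaranteed.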

2. The .contents attribute

A tag's .contents attribute returns the tag's direct children as a list.

head_tag = soup.head
head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag    # <title>The Dormouse's story</title>

# The BeautifulSoup object has children too; here <html> is its only child
len(soup.contents)    # 1
soup.contents[0].name    # 'html'
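A minimal sketch of .contents in a self-contained form follows. The one-line document is a hypothetical input and "html.parser" is an assumed parser choice. It also shows that the text inside <title> is itself a child node, of type NavigableString.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, used only for illustration.
soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)

head_tag = soup.head
title_tag = head_tag.contents[0]      # the <title> tag
text_node = title_tag.contents[0]     # the string inside <title>

print(len(head_tag.contents))         # 1
print(title_tag.name)                 # title
print(type(text_node).__name__)       # NavigableString
print(str(text_node))                 # The Dormouse's story

# The BeautifulSoup object itself has a single child here: the <html> tag.
root_children = [child.name for child in soup.contents]
print(root_children)                  # ['html']
```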

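3. The .children attribute

A tag's .children attribute is an iterator over the tag's direct children, so you can loop over them without building a list:

for child in title_tag.children:
    print(child)
# The Dormouse's story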
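The following sketch, built on a hypothetical <ul> fragment, suggests how .children relates to .contents: both cover the same direct children, but .children is consumed by iteration instead of being indexed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only for illustration.
soup = BeautifulSoup("<ul><li>one</li><li>two</li><li>three</li></ul>", "html.parser")
ul = soup.ul

# Iterate over the direct children of <ul>.
child_names = [child.name for child in ul.children]
item_texts = [child.get_text() for child in ul.children]

print(child_names)    # ['li', 'li', 'li']
print(item_texts)     # ['one', 'two', 'three']

# Materializing the iterator gives the same nodes as .contents.
same_nodes = list(ul.children) == ul.contents
print(same_nodes)     # True
```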

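4. The .descendants attribute

.contents and .children cover only a tag's direct children, whereas .descendants iterates recursively over all of the tag's descendants: its children, its children's children, and so on.

for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story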
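To make the contrast concrete, here is a small sketch on a hypothetical <head> fragment: the tag has one direct child but two descendants, because the string inside <title> counts as a node of its own.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only for illustration.
soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

children = list(head_tag.children)        # only the <title> tag
descendants = list(head_tag.descendants)  # the <title> tag, then its string

print(len(children))       # 1
print(len(descendants))    # 2
print(descendants[0].name) # title
print(descendants[1])      # The Dormouse's story
```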

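5. The .strings and .stripped_strings attributes

If a tag contains more than one string, use .strings to loop over them. .stripped_strings does the same but strips leading and trailing whitespace from each string and skips strings that consist only of whitespace.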

for string in soup.strings:
    print(repr(string))

for string in soup.stripped_strings:
    print(repr(string))
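The sketch below compares the two generators on a hypothetical, deliberately whitespace-heavy fragment. The input string and the "html.parser" choice are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with extra whitespace between and inside tags.
soup = BeautifulSoup("<div>\n  <p> Hello </p>\n  <p>World</p>\n</div>", "html.parser")

# .strings yields every string node, including whitespace-only ones.
raw = list(soup.strings)

# .stripped_strings trims each string and drops the whitespace-only ones.
clean = list(soup.stripped_strings)

print(len(raw))    # 5
print(clean)       # ['Hello', 'World']
```

Joining the stripped strings, for example with " ".join(soup.stripped_strings), is a common way to get compact text out of a fragment.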

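6. The .parent and .parents attributes

The .parent attribute gives an element's parent node. The .parents attribute iterates over all of the element's ancestors, from its immediate parent up to the document root.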

title_tag = soup.title
title_tag    # <title>The Dormouse's story</title>
title_tag.parent    # <head><title>The Dormouse's story</title></head>

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# Walk upwards from the <a> tag to the BeautifulSoup object
for parent in link.parents:
    print(parent.name)
# p
# body
# html
# [document]
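A self-contained sketch of walking up the tree, on a hypothetical nested fragment: the chain of ancestors ends at the BeautifulSoup object, whose name is '[document]' and whose own parent is None.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only for illustration.
soup = BeautifulSoup(
    "<html><body><p><a id='link1'>Elsie</a></p></body></html>",
    "html.parser",
)
link = soup.a

# The immediate parent of the <a> tag.
direct_parent = link.parent.name
print(direct_parent)    # p

# All ancestors, nearest first.
chain = [parent.name for parent in link.parents]
print(chain)            # ['p', 'body', 'html', '[document]']

# The BeautifulSoup object is the root, so it has no parent.
root_parent = soup.parent
print(root_parent)      # None
```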

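7. .next_siblings and .previous_siblings

The .next_siblings and .previous_siblings attributes iterate over the current node's siblings, that is, the nodes that share its parent. .next_siblings yields the ones that come after it, and .previous_siblings yields the ones before it, nearest first.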

for sibling in soup.a.next_siblings:
    print(repr(sibling))

for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))
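Note that siblings include the text between tags, not just tags. The sketch below uses a hypothetical paragraph with three links and filters the siblings down to Tag objects. The isinstance check is one possible way to do this, not the only one.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment, used only for illustration.
soup = BeautifulSoup(
    "<p><a id='link1'>Elsie</a>, <a id='link2'>Lacie</a> and <a id='link3'>Tillie</a></p>",
    "html.parser",
)

# Everything after the first <a>: strings and tags alike.
after_first = list(soup.a.next_siblings)
print(len(after_first))    # 4

# Keep only the tags among the following siblings.
next_ids = [s["id"] for s in soup.a.next_siblings if isinstance(s, Tag)]
print(next_ids)            # ['link2', 'link3']

# Walking backwards from the last link yields the nearest sibling first.
last_link = soup.find(id="link3")
prev_ids = [s["id"] for s in last_link.previous_siblings if isinstance(s, Tag)]
print(prev_ids)            # ['link2', 'link1']
```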

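8. .next_element and .previous_element

The .next_element attribute points to whatever was parsed immediately after the current object (a string or a tag). The result may be the same as .next_sibling, but usually it is not: for a tag that has content, .next_element is the tag's first child, while .next_sibling is the next node at the same level.

The .previous_element attribute is the opposite of .next_element: it points to whatever was parsed immediately before the current object.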
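The difference between parse order and sibling order is easier to see on a concrete case. The sketch below uses a hypothetical two-link fragment; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only for illustration.
soup = BeautifulSoup(
    "<p><a id='link1'>Elsie</a>, <a id='link2'>Lacie</a></p>",
    "html.parser",
)
first = soup.a

# The next sibling is the text that follows the closing </a>.
sibling_after = first.next_sibling
print(repr(sibling_after))     # ', '

# The next element is what was parsed right after the opening <a>: its own text.
element_after = first.next_element
print(repr(element_after))     # 'Elsie'

# Going backwards from the ", " string: the previous element is the string
# "Elsie", while the previous sibling is the whole first <a> tag.
comma = first.next_sibling
element_before = comma.previous_element
sibling_before = comma.previous_sibling
print(repr(element_before))    # 'Elsie'
print(sibling_before["id"])    # link1
```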
