An In-Depth Look at select() and select_one() in BeautifulSoup: Locating Descendants, Direct Children, and Siblings

Latest recommended article published on 2025-04-01 17:14:19

mtx386297

Category: Web Crawlers | Tags: beautifulsoup

Article link: https://blog.csdn.net/mtx386297/article/details/146083169

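Introduction

The previous part covered the basic usage of the select() and select_one() functions in BeautifulSoup. This article goes further and shows how to use these two functions to locate descendant nodes, direct child nodes, and sibling nodes in an HTML document, with example code for each case.

1. Locating descendant nodes

The descendants of an element are all elements nested inside it at any depth: children, grandchildren, great-grandchildren, and so on. In a CSS selector, a space between two selectors expresses the descendant relationship.

Example: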

from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <div class="container">
      <p>Paragraph 1</p>
      <div>
        <p>Paragraph 2</p>
      </div>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')
paragraphs = soup.select('div.container p')
for p in paragraphs:
    print(p.text)

Output:

Paragraph 1
Paragraph 2

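In this example, select('div.container p') selects every <p> element inside the <div> with class="container": both the direct child (Paragraph 1) and the more deeply nested descendant (Paragraph 2).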
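As a small illustration of the descendant combinator, the sketch below compares a single descendant selector with calling select() on an element that has already been located. The markup and the names sample_html, demo_soup, and container are made up for this sketch and are not part of the article's example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
sample_html = """
<div class="container">
  <p>Inside 1</p>
  <section>
    <article>
      <p>Inside 2 (three levels down)</p>
    </article>
  </section>
</div>
<p>Outside</p>
"""

demo_soup = BeautifulSoup(sample_html, 'html.parser')

# The descendant combinator (a space) matches at any depth below div.container
via_selector = [p.text for p in demo_soup.select('div.container p')]

# Calling select() on the located element limits the search to its subtree
container = demo_soup.select_one('div.container')
via_element = [p.text for p in container.select('p')]

print(via_selector)
print(via_element)
```

Both lists should contain the two paragraphs inside the container and leave out the one outside it. Scoping the search to an element can be convenient when the same container is queried several times.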

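2. Locating direct child nodes

A direct child is an element exactly one level below its parent. In a CSS selector, > expresses the direct-child relationship.

Example (reusing the soup object built in section 1):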

direct_children = soup.select('div.container > p')
for p in direct_children:
    print(p.text)

Output:

Paragraph 1

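In this example, select('div.container > p') selects only the <p> elements that are direct children of the <div> with class="container". Paragraph 2 is excluded because it sits inside a nested <div>.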
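For comparison, here is a rough sketch of three ways to get the same direct children. It rebuilds a trimmed copy of the section 1 markup under new names (child_html, child_soup, box) so that it runs on its own, and it assumes a bs4 release that uses Soup Sieve for CSS selectors (4.7 or later), which is what makes :scope available.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the section 1 markup, under new names so it runs standalone
child_html = """
<div class="container">
  <p>Paragraph 1</p>
  <div>
    <p>Paragraph 2</p>
  </div>
</div>
"""

child_soup = BeautifulSoup(child_html, 'html.parser')
box = child_soup.select_one('div.container')

# 1) CSS child combinator, starting from the whole document
by_selector = [p.text for p in child_soup.select('div.container > p')]

# 2) :scope stands for the element that select() was called on
by_scope = [p.text for p in box.select(':scope > p')]

# 3) find_all() with recursive=False only inspects direct children
by_find_all = [p.text for p in box.find_all('p', recursive=False)]

print(by_selector, by_scope, by_find_all)
```

All three should yield only Paragraph 1. Which form reads best is largely a matter of taste; the selector form keeps everything in one string, while find_all(recursive=False) avoids CSS entirely.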

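3. Locating sibling nodes

Sibling nodes are elements that share the same parent. In a CSS selector, + and ~ express sibling relationships. Both look only forward in document order.

+: A + B selects B when it is the element immediately after A.
~: A ~ B selects every B that comes after A under the same parent.

Example: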

html = """
<html>
  <body>
    <div class="container">
      <p>Paragraph 1</p>
      <p>Paragraph 2</p>
      <p>Paragraph 3</p>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Select the first <p> that immediately follows another <p>
next_sibling = soup.select_one('p + p')
print(next_sibling.text)  # Output: Paragraph 2

# Select every <p> that has an earlier <p> sibling
all_siblings = soup.select('p ~ p')
for p in all_siblings:
    print(p.text)  # Output: Paragraph 2, then Paragraph 3
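The difference between + and ~ is easier to see when another element sits between the paragraphs. The following sketch uses made-up markup (sibling_html) and a hypothetical helper named texts(); neither comes from the article.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a <span> separates paragraph A from paragraphs B and C
sibling_html = """
<div>
  <h2>Heading</h2>
  <p>A</p>
  <span>note</span>
  <p>B</p>
  <p>C</p>
</div>
"""

sib_soup = BeautifulSoup(sibling_html, 'html.parser')


def texts(selector):
    """Return the text of every element matched by selector (illustrative helper)."""
    return [el.text for el in sib_soup.select(selector)]


adjacent = texts('p + p')            # only C: B comes right after <span>, not <p>
following = texts('p ~ p')           # B and C: each has an earlier <p> sibling
after_heading = texts('h2 + p')      # A: the element right after <h2>
all_after_heading = texts('h2 ~ p')  # A, B, C: every <p> after <h2>

# Sibling combinators never look backwards, so this finds nothing
backwards = sib_soup.select_one('p + h2')

print(adjacent, following, after_heading, all_after_heading, backwards)
```

Note that select_one() returns None when nothing matches, so code that reads .text from its result should check for None first.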

4. Putting it together

Real pages often require combining these selectors to reach nodes in a more complex structure.

Example:

html = """
<html>
  <body>
    <div class="container">
      <div class="header">
        <h1>Title</h1>
        <p>Subtitle</p>
      </div>
      <div class="content">
        <p>Paragraph 1</p>
        <div>
          <p>Paragraph 2</p>
        </div>
      </div>
      <div class="footer">
        <p>Footer content</p>
      </div>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

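# Select every <p> element under .content
content_paragraphs = soup.select('div.content p')
for p in content_paragraphs:
    print(p.text)  # Output: Paragraph 1, then Paragraph 2

# Select the <p> that is a direct child of .header
header_subtitle = soup.select_one('div.header > p')
print(header_subtitle.text)  # Output: Subtitle

# Select the <p> siblings that follow the first <p> under .content.
# Paragraph 2 is nested inside a <div>, so it is not a sibling of
# Paragraph 1 and this selector matches nothing.
first_paragraph_siblings = soup.select('div.content p:first-child ~ p')
print(first_paragraph_siblings)  # Output: []

# To reach Paragraph 2, move to the sibling <div> and then descend into it
nested_paragraphs = soup.select('div.content > p:first-child ~ div p')
for p in nested_paragraphs:
    print(p.text)  # Output: Paragraph 2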
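Combined selectors fail quietly when the page structure differs from what was expected: select() returns an empty list and select_one() returns None. One possible way to guard against that is sketched below. The helper first_text() and the markup in page_html are assumptions made for illustration and are not part of BeautifulSoup or of the example above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the footer exists but contains no <p>
page_html = """
<div class="header">
  <h1>Title</h1>
  <p>Subtitle</p>
</div>
<div class="footer"></div>
"""

page_soup = BeautifulSoup(page_html, 'html.parser')


def first_text(root, selector, default=''):
    """Return the stripped text of the first match, or default if nothing matches."""
    node = root.select_one(selector)
    return node.get_text(strip=True) if node is not None else default


subtitle = first_text(page_soup, 'div.header > p')
footer = first_text(page_soup, 'div.footer > p', default='(no footer text)')

# select() signals "no match" with an empty list rather than None
no_matches = page_soup.select('div.footer > p')

print(subtitle)
print(footer)
print(len(no_matches))
```

Without the None check, accessing .text on a failed select_one() result raises AttributeError, which is a common source of crashes in scrapers.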