Learning and Using XPath for Web Scraping

a_blue_fat

Published 2024-08-04 18:17:11

Tags: web scraping

Article link: https://blog.csdn.net/2301_80120329/article/details/140909518

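XPath provides a powerful syntax for locating and selecting nodes in XML and HTML documents. The sections below walk through its selection methods in detail.

Basic Selection Methods

1. Node selection

/: selects from the root node. For example, /html selects the <html> element at the document root.
//: selects matching nodes at any depth below the starting point. At the start of an expression it searches the whole document. For example, //div selects every <div> element.
.: selects the current node. For example, ./title selects the <title> children of the current node.
..: selects the parent of the current node. For example, ../@id selects the id attribute of the parent node.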
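
The four node selectors above can be tried out with a minimal sketch like the following. It assumes lxml is installed; the sample markup and the variable names are made up for illustration.

```python
from lxml import html

# Hypothetical sample document, used only for illustration.
doc = html.fromstring("""
<html>
  <head><title>Demo</title></head>
  <body>
    <div id="box"><p>first</p></div>
    <div><p>second</p></div>
  </body>
</html>
""")

# "/" starts at the document root: /html is the root element.
root = doc.xpath('/html')[0]

# "//" searches the whole document, at any depth.
paragraphs = doc.xpath('//p')

# "." is the current node: ./title looks for <title> children of <head>.
head = doc.xpath('/html/head')[0]
title = head.xpath('./title')[0]

# ".." is the parent node: ../@id reads the id attribute of the first <p>'s parent.
parent_id = paragraphs[0].xpath('../@id')

print(root.tag, len(paragraphs), title.text, parent_id)
```

Note that . and .. only make sense relative to some element, which is why the last two expressions are evaluated on head and on a <p> element rather than on the whole document.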

2. Attribute selection

@: selects an attribute. For example, //a/@href selects the href attribute of every <a> element.
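
As a small illustration of attribute selection, the sketch below assumes lxml and uses invented links. With lxml, an expression ending in @href returns the attribute values as strings, and elements without that attribute are simply skipped.

```python
from lxml import html

# Hypothetical markup: two links with an href, one without.
doc = html.fromstring("""
<div>
  <a href="/home">Home</a>
  <a href="/about" title="About us">About</a>
  <a>No link</a>
</div>
""")

# Attribute values of every <a> that has an href.
hrefs = doc.xpath('//a/@href')

# The same values, read from the elements themselves:
# //a[@href] keeps only <a> elements that carry an href attribute.
hrefs_via_elements = [a.get('href') for a in doc.xpath('//a[@href]')]

print(hrefs)
print(hrefs_via_elements)
```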

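3. Wildcards

*: matches any element node. For example, //div/* selects all child elements of every <div>.
@*: matches any attribute node. For example, //@* selects every attribute in the document.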
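
The two wildcards can be sketched as follows, assuming lxml and a made-up fragment. The first expression returns elements, the second returns attribute values.

```python
from lxml import html

# Hypothetical fragment with one <div>, two attributes, and three children.
doc = html.fromstring(
    '<div id="box" class="card"><h2>Title</h2><p>Body</p><span>Note</span></div>'
)

# "*" matches any element: all direct children of the <div>.
child_tags = [el.tag for el in doc.xpath('//div/*')]

# "@*" matches any attribute: every attribute value in the document.
attr_values = doc.xpath('//@*')

print(child_tags)
print(attr_values)
```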

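Location Paths

1. Absolute paths

Start from the root node. For example: /html/body/div.

2. Relative paths

Start from the current node. For example, div/a selects every <a> child of every <div> child of the current node.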
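
To make the difference concrete, here is a minimal sketch (assuming lxml, with invented element ids) in which the same relative path div/a gives different results depending on the node it is evaluated from.

```python
from lxml import html

# Hypothetical document with a nested <div>.
doc = html.fromstring("""
<html>
  <body>
    <div id="nav"><a href="/a">A</a></div>
    <div id="main">
      <div id="inner"><a href="/b">B</a></div>
    </div>
  </body>
</html>
""")

# Absolute path: only <div> elements that are direct children of <body>.
top_divs = doc.xpath('/html/body/div')
top_ids = [d.get('id') for d in top_divs]

# Relative path evaluated from <body>: <a> children of <div> children of <body>.
body = doc.xpath('/html/body')[0]
links_from_body = body.xpath('div/a/@href')

# The same relative path evaluated from the "main" <div>.
main = top_divs[1]
links_from_main = main.xpath('div/a/@href')

print(top_ids, links_from_body, links_from_main)
```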

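Predicates

1. Using an index

Selects the node at a given position; XPath positions start at 1. For example, //div[1] selects every <div> that is the first <div> child of its parent. To select only the first <div> in the whole document, use (//div)[1].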
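
Index predicates are counted per parent, which is easy to overlook. The sketch below (assuming lxml, with made-up ids) contrasts //div[1] with the parenthesized form (//div)[1].

```python
from lxml import html

# Hypothetical document: <div> elements at two nesting levels.
doc = html.fromstring("""
<html><body>
  <div id="a">
    <div id="a1"></div>
    <div id="a2"></div>
  </div>
  <div id="b"></div>
</body></html>
""")

# Every <div> that is the first <div> among its siblings.
first_per_parent = doc.xpath('//div[1]/@id')

# The first <div> in the whole document, in document order.
first_overall = doc.xpath('(//div)[1]/@id')

print(first_per_parent)
print(first_overall)
```

When the result is already a Python list, ordinary indexing such as doc.xpath('//div')[0] is another way to take the first match; keep in mind that Python indexes from 0 while XPath indexes from 1.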

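2. Using a condition

Selects the nodes that satisfy a condition. For example, //div[@class='example'] selects every <div> whose class attribute is exactly example.

3. Compound conditions

Conditions can be combined with logical operators such as and and or. For example, //div[@class='example' and @id='main'] selects the <div> elements whose class attribute is example and whose id attribute is main.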
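
The sketch below, which assumes lxml and invented markup, shows a single condition, an and condition, and an or condition side by side. It also shows that @class='example' is an exact string comparison, so a <div> with class "example wide" is not matched.

```python
from lxml import html

# Hypothetical document for testing attribute conditions.
doc = html.fromstring("""
<html><body>
  <div class="example" id="main">one</div>
  <div class="example">two</div>
  <div class="example wide">three</div>
  <div id="other">four</div>
</body></html>
""")

# Exact match on the class attribute.
exact = [d.text for d in doc.xpath('//div[@class="example"]')]

# Both conditions must hold.
both = [d.text for d in doc.xpath('//div[@class="example" and @id="main"]')]

# Either condition may hold.
either = [d.text for d in doc.xpath('//div[@class="example" or @id="other"]')]

print(exact, both, either)
```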

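Functions

1. Text function

text(): selects text nodes. For example, //div/text() selects the text nodes that are direct children of each <div>; text inside nested elements is not included.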
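
A short sketch of the difference between direct and nested text, assuming lxml and a made-up fragment:

```python
from lxml import html

# Hypothetical fragment: text directly in the <div> plus text inside a <b>.
doc = html.fromstring('<div>Hello <b>bold</b> world</div>')

# Only the text nodes that are direct children of the <div>.
direct_text = doc.xpath('//div/text()')

# All text nodes below the <div>, at any depth.
all_text = doc.xpath('//div//text()')

# XPath's string() function concatenates all descendant text into one string.
joined = doc.xpath('string(//div)')

print(direct_text)
print(all_text)
print(joined)
```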

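2. String functions

contains(): tests whether a string contains a given substring. For example, //a[contains(@href, 'example')] selects every <a> whose href attribute contains example.
starts-with(): tests whether a string starts with a given substring. For example, //a[starts-with(@href, 'http')] selects every <a> whose href attribute starts with http.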
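
Both string functions can be sketched on a handful of invented links (assuming lxml). Note that contains() matches the substring anywhere in the value, including inside a mailto: address.

```python
from lxml import html

# Hypothetical links, chosen to show what each function matches.
doc = html.fromstring("""
<div>
  <a href="https://example.com/docs">Docs</a>
  <a href="http://other.test/example">Other</a>
  <a href="/local/page">Local</a>
  <a href="mailto:someone@example.com">Mail</a>
</div>
""")

# href contains "example" anywhere in the value.
with_example = doc.xpath('//a[contains(@href, "example")]/text()')

# href starts with "http" (this covers both http:// and https://).
absolute_links = doc.xpath('//a[starts-with(@href, "http")]/text()')

print(with_example)
print(absolute_links)
```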

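3. Position functions

position(): returns the position of the current node among its matching siblings. For example, //div[position() < 3] selects, under each parent, the first two <div> children.
last(): returns the position of the last node in the current node set. For example, //div[last()] selects the last <div> child of each parent.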
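
A minimal sketch of the position functions on a made-up list, assuming lxml. Because all <li> elements here share one parent, the per-parent counting does not change the result.

```python
from lxml import html

# Hypothetical list with four items under a single parent.
doc = html.fromstring('<ul><li>a</li><li>b</li><li>c</li><li>d</li></ul>')

# Items at positions 1 and 2.
first_two = doc.xpath('//li[position() < 3]/text()')

# The last item.
last_item = doc.xpath('//li[last()]/text()')

# last() is a number, so arithmetic works: the second-to-last item.
second_to_last = doc.xpath('//li[last() - 1]/text()')

print(first_two, last_item, second_to_last)
```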

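4. Math functions

sum(): computes the sum of a node set, converting each node's text to a number. For example, sum(//price) adds up the values of all <price> elements.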
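
Since <price> is not an HTML element, the sketch below parses a small made-up XML document with lxml.etree instead of lxml.html. With lxml, a numeric XPath result such as sum() comes back as a Python float.

```python
from lxml import etree

# Hypothetical XML order with three prices.
order = etree.fromstring(
    '<order><price>9.50</price><price>20</price><price>0.50</price></order>'
)

# sum() converts the text of each <price> to a number and adds them up.
total = order.xpath('sum(//price)')

print(total)
```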

Example Code

The following example shows how to use the XPath expressions above in Python with the lxml library:

from lxml import html

html_content = """
<html>
  <head><title>Example Page</title></head>
  <body>
    <div class="example" id="main">
      <h1>Welcome to XPath Tutorial</h1>
      <p>This is a simple example.</p>
      <a href="https://example.com">Click here</a>
    </div>
    <div class="example">
      <p>Another example paragraph.</p>
    </div>
    <div id="last-div">
      <p>Last div content.</p>
    </div>
  </body>
</html>
"""

tree = html.fromstring(html_content)

# 1. Select all <div> elements
divs = tree.xpath('//div')
print([html.tostring(div).decode() for div in divs])

# 2. Select the first <div> element (all <div> elements here share one parent, so //div[1] matches only one)
first_div = tree.xpath('//div[1]')
print(html.tostring(first_div[0]).decode())

# 3. Select <div> elements whose class attribute is exactly 'example'
example_divs = tree.xpath('//div[@class="example"]')
print([html.tostring(div).decode() for div in example_divs])

# 4. Select the last <div> element (again relying on all <div> elements sharing one parent)
last_div = tree.xpath('//div[last()]')
print(html.tostring(last_div[0]).decode())

# 5. Select all text nodes inside <div> elements, including nested ones
div_texts = tree.xpath('//div//text()')
print(div_texts)

# 6. Select the href of every link whose href contains 'example'
example_links = tree.xpath('//a[contains(@href, "example")]/@href')
print(example_links)