The requests Module and XPath Expressions

GaGa

Article link: https://blog.csdn.net/qq_55909023/article/details/134500085

1. The requests module

requests is a popular Python library for sending HTTP requests. Most requests are made with its get() and post() functions.

Sending a GET request: use requests.get() to send a GET request.

import requests

response = requests.get('https://www.example.com')
print(response.text)
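
Query-string parameters are usually passed through the params argument rather than written into the URL by hand. The sketch below builds the request without sending it, so the encoded URL can be inspected offline. The /search path and the parameter names are made up for illustration.

```python
import requests

# Hypothetical endpoint and query parameters, used only for illustration.
query = {"q": "xpath", "page": 2}

# Request(...).prepare() builds the request the same way requests.get() would,
# but nothing is sent over the network.
prepared = requests.Request(
    "GET", "https://www.example.com/search", params=query
).prepare()

print(prepared.method)  # GET
print(prepared.url)     # https://www.example.com/search?q=xpath&page=2
```

To actually send it, the equivalent call is requests.get('https://www.example.com/search', params=query).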

Sending a POST request: use requests.post() to send a POST request. Pass the form data through the data parameter.

import requests

payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.post('https://www.example.com/post-endpoint', data=payload)
print(response.text)
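
The data parameter sends a form-encoded body. Many APIs expect JSON instead, which requests supports through the json parameter. The following sketch prepares both variants without sending them, to show how the body and the Content-Type header differ. It reuses the placeholder URL and payload from the example above.

```python
import json
import requests

payload = {"key1": "value1", "key2": "value2"}
url = "https://www.example.com/post-endpoint"  # placeholder URL from the article

# data= sends the dict as a URL-encoded form body.
form_request = requests.Request("POST", url, data=payload).prepare()
print(form_request.headers["Content-Type"])  # application/x-www-form-urlencoded
print(form_request.body)                     # key1=value1&key2=value2

# json= serializes the dict to JSON and sets the matching Content-Type.
json_request = requests.Request("POST", url, json=payload).prepare()
print(json_request.headers["Content-Type"])  # application/json
print(json.loads(json_request.body))         # the original dict, round-tripped
```

Which of the two a server accepts depends on the endpoint, so check its documentation before choosing.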

Handling the response: the response object holds everything the server returned, including the status code, the response headers, and the response body.

import requests

response = requests.get('https://www.example.com')
print("Status Code:", response.status_code)
print("Headers:", response.headers)
print("Content:", response.text)

Handling JSON data: if the server returns JSON, parse it with response.json().

import requests

response = requests.get('https://api.example.com/data')
data = response.json()
print(data)

Handling exceptions: requests raises exceptions for failures such as connection timeouts and read timeouts. Catch them with try and except.

import requests
from requests.exceptions import Timeout

try:
    response = requests.get('https://www.example.com', timeout=1)
    response.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
except Timeout:
    print("The request timed out")
except requests.exceptions.RequestException as e:
    print(f"An error occurred: {e}")
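
In a crawler it is often convenient to wrap this pattern in a small function that returns None on failure instead of raising. The sketch below is one way to do that. fetch_text is a hypothetical helper name, and the timeout values are arbitrary choices rather than recommendations from the original article.

```python
import requests


def fetch_text(url, timeout=(3.05, 10)):
    """Return the response body as text, or None if the request fails.

    fetch_text is a hypothetical helper; timeout is (connect, read) in seconds.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into HTTPError
    except requests.exceptions.Timeout:
        print(f"Timed out: {url}")
    except requests.exceptions.HTTPError as exc:
        print(f"Bad status: {exc}")
    except requests.exceptions.RequestException as exc:
        # Base class: connection errors, invalid URLs, too many redirects, ...
        print(f"Request failed: {exc}")
    else:
        return response.text
    return None


# A URL without a scheme is rejected before any network traffic happens.
result = fetch_text("not-a-url")
print(result)  # None
```

The more specific exceptions are listed first because Timeout and HTTPError are both subclasses of RequestException.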

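2. XPath expressions

XPath (XML Path Language) is a language for navigating and querying XML documents. It is a W3C (World Wide Web Consortium) standard and is widely used in XML processing, including web page parsing, data extraction, and XSLT transformations. The key concepts and syntax of XPath expressions are as follows: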

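Nodes:

An XML document is made up of nodes of several types, including element nodes, attribute nodes, and text nodes.
/ at the start of an expression selects from the document root.
// selects matching nodes at any depth, regardless of where they are in the document.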

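Path expressions:

A path expression describes the path from one node to another.
For example, /bookstore/book/title selects the bookstore element under the root, then its book child elements, then the title element under each book.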

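Predicates:

A predicate filters nodes so that only those meeting a condition are selected.
For example, /bookstore/book[1] selects the first book element under bookstore. XPath positions start at 1, not 0.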

Wildcards:

* matches any element node.
@* matches any attribute node.

Filtering by condition:

[@attribute='value'] selects elements whose attribute has the given value.
For example, /bookstore/book[@category='fiction'] selects the book elements whose category attribute is 'fiction'.

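Logical operators:

and, or, and not() combine multiple conditions. Note that not() is written as a function.
For example, //book[price>35 and price<50] selects the book elements whose price is greater than 35 and less than 50.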

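Functions:

XPath provides built-in functions such as position(), which returns a node's position in the current context, and last(), which returns the position of the last node in that context. text() is a node test that selects text nodes.
For example, //book[position()=last()] selects the last book element among its siblings.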
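
The concepts above can be tried out in Python. The sketch below assumes the third-party lxml library, which the original article does not name, and a small made-up document. It exercises the wildcards, the logical operators, and last().

```python
from lxml import etree

# Hypothetical sample data, separate from the article's own example.
xml = """
<bookstore>
  <book category="fiction"><title lang="en">Book A</title><price>29.99</price></book>
  <book category="fiction"><title lang="en">Book B</title><price>39.95</price></book>
  <book category="science"><title lang="de">Book C</title><price>55.00</price></book>
</bookstore>
"""
root = etree.fromstring(xml)

# * matches any element: the tags of all children of bookstore.
children = [el.tag for el in root.xpath("/bookstore/*")]

# @* matches any attribute: the values of all attributes on title elements.
attrs = root.xpath("//title/@*")

# and combines two conditions inside one predicate.
mid_priced = root.xpath("//book[price>35 and price<50]/title/text()")

# not() negates a condition.
not_fiction = root.xpath("//book[not(@category='fiction')]/title/text()")

# last() picks the final book among its siblings.
last_title = root.xpath("//book[last()]/title/text()")

print(children)     # ['book', 'book', 'book']
print(attrs)        # ['en', 'en', 'de']
print(mid_priced)   # ['Book B']
print(not_fiction)  # ['Book C']
print(last_title)   # ['Book C']
```

xpath() returns a list in every case here, even when only one node matches.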

The following examples apply XPath expressions to a sample document.

<bookstore>
  <book category="fiction">
    <title lang="en">Harry Potter</title>
    <author>J.K. Rowling</author>
    <price>29.99</price>
  </book>
  <book category="non-fiction">
    <title lang="fr">Le Petit Prince</title>
    <author>Antoine de Saint-Exupéry</author>
    <price>24.95</price>
  </book>
</bookstore>

Select the titles of all books:
- XPath expression: //bookstore/book/title/text()
- Result: Harry Potter, Le Petit Prince
Select the author of the first book:
- XPath expression: //bookstore/book[1]/author/text()
- Result: J.K. Rowling
Select the titles of books priced below 30:
- XPath expression: //bookstore/book[price<30]/title/text()
- Result: Harry Potter, Le Petit Prince
Select the titles whose lang attribute is "en":
- XPath expression: //bookstore/book/title[@lang='en']/text()
- Result: Harry Potter
Select the price of the last book:
- XPath expression: //book[position()=last()]/price/text()
- Result: 24.95
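
These five expressions can be checked in Python. The sketch below assumes the third-party lxml library (the article does not say which XPath engine it used) and runs each expression against the sample document above.

```python
from lxml import etree

# The sample document from the article.
xml = """<bookstore>
  <book category="fiction">
    <title lang="en">Harry Potter</title>
    <author>J.K. Rowling</author>
    <price>29.99</price>
  </book>
  <book category="non-fiction">
    <title lang="fr">Le Petit Prince</title>
    <author>Antoine de Saint-Exupéry</author>
    <price>24.95</price>
  </book>
</bookstore>"""

root = etree.fromstring(xml)

# The labels are only for printing; the expressions are the article's own.
expressions = {
    "all titles": "//bookstore/book/title/text()",
    "first author": "//bookstore/book[1]/author/text()",
    "titles under 30": "//bookstore/book[price<30]/title/text()",
    "English titles": "//bookstore/book/title[@lang='en']/text()",
    "last price": "//book[position()=last()]/price/text()",
}

# xpath() returns a list of matching text nodes for each expression.
results = {label: root.xpath(expr) for label, expr in expressions.items()}

for label, values in results.items():
    print(f"{label}: {values}")
```

Both books match the price<30 condition because both prices (29.99 and 24.95) are below 30. The text nodes come back as strings, so the last price is '24.95' and needs float() before any arithmetic.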