Python10: Handling URLs with the urllib Module

Latest recommended article published 2024-03-21 08:45:26

shlyyy

Category: Python. Tags: python, urllib

Article link: https://blog.csdn.net/arthurhai521/article/details/133935158

1.urllib overview
2.urllib.request
3.urllib.parse

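1.urllib overview

urllib is a package in the Python standard library for working with URLs (Uniform Resource Locators). It consists of several submodules: urllib.request (opening and reading URLs), urllib.parse (parsing URLs), urllib.error (the exceptions raised by urllib.request), and urllib.robotparser (parsing robots.txt files).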

2.urllib.request

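This submodule provides the tools for opening and reading URLs. urlopen() opens a URL and lets you read its content, urlretrieve() downloads a file, and Request builds an HTTP request object that is then sent with urlopen().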

2.1urlopen

urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *, cafile=None, capath=None, cadefault=False, context=None)

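Opens the given URL and returns a file-like object whose content can be read with read().
Parameters:
- url: the URL to open, either a string or a Request object.
- data: optional; the data to send to the server. It must be a bytes object, a file-like object, or an iterable of bytes-like objects, so a str has to be encoded first. For HTTP, supplying data turns the request into a POST.
- timeout: optional; the timeout in seconds for blocking operations such as the connection attempt.
- cafile: optional; path to a file of CA certificates. Deprecated since Python 3.6 and removed in Python 3.13; use context instead.
- capath: optional; path to a directory of CA certificates. Deprecated and removed together with cafile.
- cadefault: optional; ignored. Also removed in Python 3.13.
- context: optional; an ssl.SSLContext describing the SSL options.
Return value: a response object that is file-like and also works as a context manager. For HTTP and HTTPS URLs it is an http.client.HTTPResponse.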
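As a small illustration of the parameters above, the sketch below wraps urlopen() in a helper that sets a timeout, closes the response with a with block, and decodes the body using the charset the response declares. The helper name fetch_text and the data: URL are assumptions chosen so that the snippet runs without network access; an http:// or https:// URL would be passed in the same way.

```python
import urllib.request


def fetch_text(url, timeout=10):
    """Hypothetical helper: open a URL and return its body as str."""
    # The response is a context manager, so it is closed when the block exits.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset from the Content-Type header, falling back to UTF-8.
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset)


# A data: URL carries its content inline, so no network access is needed.
text = fetch_text('data:text/plain;charset=utf-8,Hello%20urllib')
print(text)
```

Reading the charset from the headers avoids hard-coding 'utf-8' for servers that declare a different encoding.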

2.2urlretrieve

urlretrieve(url, filename=None, reporthook=None, data=None)

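Downloads the content at the given URL and saves it to a local file.
Parameters:
- url: the URL to download.
- filename: optional; the path to save to. If omitted, the content is written to a temporary file with a generated name; the name is not derived from the URL.
- reporthook: optional; a callback for reporting download progress. It is called once when the connection is established and once after each block is read, and receives the number of blocks transferred so far, the block size in bytes, and the total size of the file.
- data: optional; the data to send to the server, as bytes.
Return value: a tuple (filename, headers) holding the local file name and the response headers.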
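The following sketch shows how filename and reporthook fit together. To stay runnable offline it copies a local file through a file:// URL; the file names source.bin and copy.bin and the callback name report are assumptions made for the example only.

```python
import tempfile
import urllib.request
from pathlib import Path

progress = []


def report(block_num, block_size, total_size):
    # Called once before the first block and once after every block read.
    progress.append((block_num, block_size, total_size))


with tempfile.TemporaryDirectory() as tmp:
    # Create a local "remote" file so the example needs no network access.
    source = Path(tmp) / 'source.bin'
    source.write_bytes(b'x' * 20000)
    target = Path(tmp) / 'copy.bin'

    filename, headers = urllib.request.urlretrieve(
        source.as_uri(), str(target), reporthook=report)
    copied_size = target.stat().st_size

print(filename == str(target))  # the returned name is the path we passed in
print(copied_size)              # number of bytes written to copy.bin
print(progress[0], progress[-1])
```

A real progress display would typically compute block_num * block_size / total_size inside the callback, keeping in mind that total_size can be -1 when the server does not report a length.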

2.3Request

Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

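Builds an HTTP request object on which headers and other settings can be configured, and which is then passed to urlopen().
Parameters:
- url: the URL to request.
- data: optional; the data to send to the server. As with urlopen(), it must be bytes (or a file-like object or an iterable of bytes-like objects), not a str.
- headers: optional; a dictionary of request headers.
- origin_req_host: optional; the request-host of the origin transaction as defined by RFC 2965. Defaults to the host of url.
- unverifiable: optional; whether the request is unverifiable as defined by RFC 2965, meaning the user had no chance to approve it (for example, an image fetched automatically). Defaults to False.
- method: optional; the HTTP method, such as GET or POST. If omitted, the method is GET when data is None and POST otherwise.
Return value: a Request object that can be passed to urlopen().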
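A Request can be built and inspected without sending anything, which makes the effect of the parameters easy to see. This is a minimal sketch; the /submit endpoint on www.example.com is a hypothetical URL used only for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# The request body must be bytes, so the encoded form is turned into bytes.
body = urlencode({'key1': 'value1', 'key2': 'value2'}).encode('ascii')

post_req = Request(
    'https://www.example.com/submit',  # hypothetical endpoint
    data=body,
    headers={'User-Agent': 'Mozilla/5.0'},
)
get_req = Request('https://www.example.com/')

print(post_req.get_method())              # POST, because data was supplied
print(get_req.get_method())               # GET, because there is no data
print(post_req.data)                      # b'key1=value1&key2=value2'
# Header names are stored capitalized, so look them up as 'User-agent'.
print(post_req.get_header('User-agent'))  # Mozilla/5.0
```

Passing headers and data to the constructor is equivalent to calling add_header() and assigning req.data afterwards.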

2.4Examples

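import urllib.request

# read() returns bytes; print the first 200 bytes as they are
with urllib.request.urlopen('http://www.baidu.com') as f:
    print(f.read(200))

# decode() with no argument decodes the bytes as UTF-8
with urllib.request.urlopen('http://www.baidu.com') as f:
    print(f.read(200).decode())

# Same as above, with the encoding named explicitly
with urllib.request.urlopen('http://www.baidu.com') as f:
    print(f.read(200).decode('utf-8'))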
'''
b'<!DOCTYPE html><!--STATUS OK--><html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta content="always" name="'

<!DOCTYPE html><!--STATUS OK--><html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta content="always" name="

<!DOCTYPE html><!--STATUS OK--><html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta content="always" name="
'''

import urllib.request

# Create a Request object
url = 'http://www.baidu.com'
req = urllib.request.Request(url)

# Set a request header
req.add_header('User-Agent', 'Mozilla/5.0')

# Optional: set the request method
req.method = 'POST'

# Optional: set the request body (must be bytes)
data = b'key1=value1&key2=value2'
req.data = data

# Send the request and get the response
response = urllib.request.urlopen(req)

# Read the response body
content = response.read()

# Print the response body
print(content)

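urllib.request.Request builds an HTTP request object. With the Request class you can set the URL, data, headers, and other details of a request. Passing the Request object to urlopen() sends the HTTP request and returns a response object, response. Its read() method returns the response body, which the example prints.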
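Requests can fail, and the exceptions for that live in urllib.error. The sketch below is one possible way to handle them, assuming a hypothetical helper named try_open; it uses a missing local file and a data: URL so that both the failure and the success path can be run without a network connection.

```python
import tempfile
import urllib.error
import urllib.request
from pathlib import Path


def try_open(url, timeout=10):
    """Hypothetical helper: return the body as bytes, or None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        print('HTTP error:', exc.code, exc.reason)
    except urllib.error.URLError as exc:
        # No valid response was obtained (unknown host, missing file, ...).
        print('URL error:', exc.reason)
    return None


with tempfile.TemporaryDirectory() as tmp:
    # This file does not exist, so opening its URL raises URLError.
    missing = (Path(tmp) / 'missing.txt').as_uri()
    failed = try_open(missing)

succeeded = try_open('data:text/plain,ok')
print(failed, succeeded)
```

HTTPError is a subclass of URLError, so it has to be caught first for the two cases to be told apart.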

3.urllib.parse

urllib.parse provides functions for parsing URLs, building URLs, and handling query strings.

3.1urlparse

urlparse(urlstring, scheme='', allow_fragments=True)

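Parses a URL string and returns a named tuple whose attributes give the individual parts, such as the scheme, host, and path.
Parameters:
- urlstring: the URL string to parse.
- scheme: optional; the default scheme, used only when urlstring does not specify one.
- allow_fragments: optional; if False, fragment identifiers are not recognized, and the "#..." part is left in the path, params, or query component.
Return value: a ParseResult named tuple with the fields scheme, netloc, path, params, query, and fragment.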
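To make the six components and the two optional parameters concrete, here is a short sketch. The URLs are made-up examples on www.example.com.

```python
from urllib.parse import urlparse

result = urlparse('https://www.example.com:8443/docs/page;v=2?lang=en#intro')
print(result.scheme)    # https
print(result.netloc)    # www.example.com:8443
print(result.hostname)  # www.example.com
print(result.port)      # 8443
print(result.path)      # /docs/page
print(result.params)    # v=2
print(result.query)     # lang=en
print(result.fragment)  # intro

# scheme= only supplies a default when the URL itself has no scheme.
relative = urlparse('//www.example.com/index.html', scheme='https')
print(relative.scheme, relative.netloc)  # https www.example.com

# With allow_fragments=False the '#...' part stays in the preceding component.
no_frag = urlparse('https://www.example.com/page?x=1#top', allow_fragments=False)
print(repr(no_frag.query), repr(no_frag.fragment))  # 'x=1#top' ''
```

Note that netloc is only recognized when it is introduced by "//"; a string such as 'www.example.com/index.html' is parsed entirely as a path.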

3.2urlunparse

urlunparse(parts)

Combines a tuple of URL components back into a URL string.
Parameters: parts, an iterable of six items in the order (scheme, netloc, path, params, query, fragment).
Return value: the reassembled URL string.
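A common use of urlunparse() is to change one part of an existing URL. The sketch below, using a made-up URL, parses it, swaps the query string with the named tuple's _replace() method, and reassembles the result.

```python
from urllib.parse import urlencode, urlparse, urlunparse

original = 'https://www.example.com/path/to/page?param1=value1#fragment'
parts = urlparse(original)

# Unchanged parts reassemble into the same URL.
print(urlunparse(parts) == original)  # True

# ParseResult is a named tuple, so _replace() returns a modified copy.
updated = parts._replace(query=urlencode({'page': 2}))
new_url = urlunparse(updated)
print(new_url)  # https://www.example.com/path/to/page?page=2#fragment
```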

3.3urlencode

urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)

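Converts a dictionary or a sequence of two-element tuples into a percent-encoded query string.
Parameters:
- query: the parameters to encode, given as a mapping or a sequence of (key, value) tuples.
- doseq: optional; if True, a value that is itself a sequence produces one key=value pair per element. If False (the default), such a value is converted with str() and encoded as a single value.
- safe: optional; characters that should not be encoded. Passed on to quote_via.
- encoding: optional; the encoding used for str keys and values. Passed on to quote_via.
- errors: optional; how encoding errors are handled. Passed on to quote_via.
- quote_via: optional; the function used to quote each key and value. Defaults to quote_plus, which encodes spaces as "+".
Return value: the encoded query string.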
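The effect of doseq and quote_via is easiest to see side by side. This is a small sketch with made-up parameter names.

```python
from urllib.parse import quote, urlencode

params = {'q': 'hello world', 'tag': ['a', 'b']}

# Default: the list is converted with str() and encoded as one value.
default = urlencode(params)
print(default)  # q=hello+world&tag=%5B%27a%27%2C+%27b%27%5D

# doseq=True: one key=value pair per list element.
with_doseq = urlencode(params, doseq=True)
print(with_doseq)  # q=hello+world&tag=a&tag=b

# quote_via=quote encodes spaces as %20 instead of '+'.
with_quote = urlencode({'q': 'hello world'}, quote_via=quote)
print(with_quote)  # q=hello%20world
```

The result is a str; when it is used as the data of a request it still has to be encoded to bytes, for example with .encode('ascii').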

3.4quote

quote(string, safe='/', encoding=None, errors=None)

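Replaces special characters in a string with %xx escapes. Letters, digits, and the characters "_.-~" are never quoted.
Parameters:
- string: the string to encode.
- safe: optional; additional ASCII characters that should not be quoted. Defaults to '/'.
- encoding: optional; the encoding used for non-ASCII characters. Defaults to 'utf-8'.
- errors: optional; how encoding errors are handled. Defaults to 'strict'.
Return value: the encoded string.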
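The sketch below shows how the safe parameter changes the result, and how quote() differs from the related quote_plus(), which is meant for query string values. The input strings are arbitrary examples.

```python
from urllib.parse import quote, quote_plus

print(quote('/docs/my file.txt'))           # /docs/my%20file.txt
print(quote('/docs/my file.txt', safe=''))  # %2Fdocs%2Fmy%20file.txt
print(quote('中文'))                         # %E4%B8%AD%E6%96%87
print(quote_plus('a b&c'))                  # a+b%26c
```

The default safe='/' suits whole paths; safe='' is the usual choice when encoding a single path segment that may itself contain a slash.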

3.5unquote

unquote(string, encoding='utf-8', errors='replace')

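Replaces %xx escapes in a string with the characters they represent.
Parameters:
- string: the string to decode.
- encoding: optional; the encoding used to turn the percent-encoded bytes back into characters. Defaults to 'utf-8'.
- errors: optional; how decoding errors are handled. Defaults to 'replace', which substitutes a placeholder character for invalid sequences.
Return value: the decoded string.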
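A few cases illustrate the encoding and errors parameters, as well as the difference from unquote_plus(), which additionally turns "+" into a space. The inputs are arbitrary examples.

```python
from urllib.parse import unquote, unquote_plus

print(unquote('%E4%B8%AD%E6%96%87'))  # 中文

# unquote() leaves '+' alone; unquote_plus() treats it as a space.
print(unquote('a+b%20c'))       # a+b c
print(unquote_plus('a+b%20c'))  # a b c

# %FF is not valid UTF-8, so errors='replace' yields U+FFFD.
print(unquote('%FF') == '\ufffd')  # True

# The same kind of byte decodes cleanly when the right encoding is given.
print(unquote('caf%E9', encoding='latin-1'))  # café
```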

3.6Examples

Parsing a URL:

from urllib.parse import urlparse

url = 'https://www.example.com/path/to/page?param1=value1&param2=value2#fragment'

# Parse the URL
parsed_url = urlparse(url)

print(parsed_url.scheme)  # Output: https
print(parsed_url.netloc)  # Output: www.example.com
print(parsed_url.path)    # Output: /path/to/page

Building a URL:

from urllib.parse import urlunparse

parts = ('https', 'www.example.com', '/path/to/page', '', 'param1=value1&param2=value2', 'fragment')

# Build the URL
url = urlunparse(parts)

print(url)  # Output: https://www.example.com/path/to/page?param1=value1&param2=value2#fragment

Encoding a query string:

from urllib.parse import urlencode

params = {
    'param1': 'value1',
    'param2': 'value2'
}

# Encode the query string
encoded_params = urlencode(params)

print(encoded_params)  # Output: param1=value1&param2=value2

Decoding a query string:

from urllib.parse import parse_qs

query_string = 'param1=value1&param2=value2'

# Decode the query string
decoded_params = parse_qs(query_string)

print(decoded_params)  # Output: {'param1': ['value1'], 'param2': ['value2']}
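parse_qs() returns every value as a list because a key may appear more than once. The sketch below, with a made-up query string, shows repeated keys, blank values, and the related parse_qsl(), which returns a list of pairs instead of a dictionary.

```python
from urllib.parse import parse_qs, parse_qsl

query = 'tag=a&tag=b&q=hello+world&empty='

# Repeated keys are collected into one list; blank values are dropped.
print(parse_qs(query))
# {'tag': ['a', 'b'], 'q': ['hello world']}

# keep_blank_values=True keeps keys whose value is empty.
print(parse_qs(query, keep_blank_values=True))
# {'tag': ['a', 'b'], 'q': ['hello world'], 'empty': ['']}

# parse_qsl() preserves the original order as (key, value) pairs.
print(parse_qsl(query))
# [('tag', 'a'), ('tag', 'b'), ('q', 'hello world')]
```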

URL encoding and decoding:

from urllib.parse import quote, unquote

string = 'Hello World!'

# URL-encode the string
encoded_string = quote(string)

print(encoded_string)  # Output: Hello%20World%21

# URL-decode the string
decoded_string = unquote(encoded_string)

print(decoded_string)  # Output: Hello World!