Python HTTP library urllib: parsing URLs with parse

程序猿学习

Last modified 2023-08-03 09:44:07

Category: Python. Tags: python, web scraping

First published 2023-08-02 08:00:40

Article link: https://blog.csdn.net/Mountain_tai_li/article/details/132074581

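The urllib library also provides the parse module, which defines a standard interface for working with URLs: splitting a URL into its components, combining components back into a URL, and converting relative links into absolute ones. It supports URLs for the following schemes: file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, sip, sips, snews, svn, svn+ssh, telnet, and wais.

The sections below show how to use the common functions in urllib.parse: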

1. urlparse

urlparse recognizes a URL and splits it into its components. Usage:

# ************************
# Usage of urlparse
# ************************
from urllib.parse import urlparse
url = 'https://www.baidu.com/index.html;user?id=5#comment'
# Split the URL into its six components
result = urlparse(url)
print(type(result))
print(result)

Running it prints:

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

urlparse returns a ParseResult object with six parts: scheme, netloc, path, params, query, and fragment.

The standard URL format is:

scheme://netloc/path;params?query#fragment
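ParseResult is a named tuple, so each of the six parts can be read by attribute name, by index, or by unpacking. A minimal sketch, reusing the example URL from above:

```python
from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')

# Each part is available by attribute name or by tuple index
print(result.scheme, result[0])
print(result.netloc, result[1])

# The whole result can also be unpacked into six variables
scheme, netloc, path, params, query, fragment = result
print(path, params, query, fragment)
```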

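The urlparse signature:

urlparse(urlstring, scheme='', allow_fragments=True)

urlparse takes three parameters:

urlstring: the URL to parse.
scheme: the default scheme (for example http or https). It is used only when the URL being parsed does not carry a scheme of its own.
allow_fragments: whether fragments are recognized. When set to False, the fragment is not split out; it is parsed as part of the path, params, or query, and the fragment field is left empty.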
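The effect of the scheme and allow_fragments parameters is easiest to see side by side. The sketch below uses illustrative input URLs (variations on the example above, not taken from the original article):

```python
from urllib.parse import urlparse

# The URL has no scheme, so the default is used. Without a leading '//'
# the host text is treated as part of the path and netloc stays empty.
no_scheme = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(no_scheme)

# A scheme present in the URL takes precedence over the default
explicit = urlparse('http://www.baidu.com/index.html', scheme='https')
print(explicit.scheme)

# With allow_fragments=False the '#comment' text stays in the query ...
no_frag = urlparse('https://www.baidu.com/index.html;user?id=5#comment',
                   allow_fragments=False)
print(no_frag.query, repr(no_frag.fragment))

# ... or in the path when the URL has no params or query
no_frag_path = urlparse('https://www.baidu.com/index.html#comment',
                        allow_fragments=False)
print(no_frag_path.path, repr(no_frag_path.fragment))
```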

2. urlunparse

urlunparse is the inverse of urlparse: it builds a URL from its components. It takes an iterable whose length must be exactly 6. Usage:

# ************************
# Usage of urlunparse
# ************************
from urllib.parse import urlunparse
# Order: scheme, netloc, path, params, query, fragment
data = ['https', 'www.baidu.com', 'index.html', 'user', 'id=5', 'comment']
url = urlunparse(data)
print(url)

Running it prints:

https://www.baidu.com/index.html;user?id=5#comment
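As a rough illustration of how urlunparse pairs with urlparse, the sketch below round-trips the example URL, changes a single component through the named tuple's _replace method, and shows what happens when the iterable is too short. The replacement query value 'id=6' is made up for the example.

```python
from urllib.parse import urlparse, urlunparse

url = 'https://www.baidu.com/index.html;user?id=5#comment'
parts = urlparse(url)

# Parsing and then unparsing gives back the original URL
round_trip = urlunparse(parts)
print(round_trip == url)

# Change one component and rebuild the URL
modified = urlunparse(parts._replace(query='id=6'))
print(modified)

# Passing fewer than six items fails during unpacking
try:
    urlunparse(['https', 'www.baidu.com', 'index.html'])
    length_error = None
except ValueError as exc:
    length_error = exc
    print('ValueError:', exc)
```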

3. urlsplit

urlsplit is similar to urlparse, but it does not parse params separately: they stay part of path, so only 5 components are returned. Usage:

# ************************
# Usage of urlsplit
# ************************
from urllib.parse import urlsplit
url = 'https://www.baidu.com/index.html;user?id=5#comment'
# params (';user') stays attached to path
result = urlsplit(url)
print(result)

Running it prints:

SplitResult(scheme='https', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')
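To make the difference from urlparse concrete, the following sketch parses the same URL with both functions and compares the results:

```python
from urllib.parse import urlparse, urlsplit

url = 'https://www.baidu.com/index.html;user?id=5#comment'

parsed = urlparse(url)   # six fields, params separated from path
split = urlsplit(url)    # five fields, params left inside path

print(len(parsed), parsed.path, parsed.params)
print(len(split), split.path)
```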

4. urlunsplit

urlunsplit is similar to urlunparse and also builds a URL. It takes an iterable whose length must be exactly 5. Usage:

# ************************
# Usage of urlunsplit
# ************************
from urllib.parse import urlunsplit
# Order: scheme, netloc, path, query, fragment
data = ['https', 'www.baidu.com', 'index.html;user', 'id=5', 'comment']
url = urlunsplit(data)
print(url)

Running it prints:

https://www.baidu.com/index.html;user?id=5#comment
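A small sketch of two common patterns with urlunsplit: round-tripping a URL through urlsplit, and building a URL while leaving some components empty. The '/s' path and 'wd=python' query are illustrative values, not from the original article.

```python
from urllib.parse import urlsplit, urlunsplit

url = 'https://www.baidu.com/index.html;user?id=5#comment'

# Splitting and then unsplitting gives back the original URL
rebuilt = urlunsplit(urlsplit(url))
print(rebuilt == url)

# Empty components are simply left out of the result
search_url = urlunsplit(('https', 'www.baidu.com', '/s', 'wd=python', ''))
print(search_url)

bare_url = urlunsplit(('https', 'www.baidu.com', '/s', '', ''))
print(bare_url)
```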

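5. urljoin

urljoin is another way to build a link. Unlike urlunparse and urlunsplit, it does not need a sequence of a fixed length, so it is more flexible to use. It takes two arguments: a base URL (called base_url here) and the new link. It analyzes the scheme, netloc, and path of base_url and uses them to fill in whatever the new link is missing. If the new link is already a complete URL, base_url has no effect. Usage: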

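# ************************
# Usage of urljoin
# ************************
from urllib.parse import urljoin
# The new link is relative, so scheme and netloc come from the base URL
url1 = urljoin('http://www.baidu.com', 'index.html')
# The new link is already complete, so the base URL is ignored
url2 = urljoin('http://www.baidu.com', 'https://www.google.com/index.html')
print(url1)
print(url2)

Running it prints: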

http://www.baidu.com/index.html
https://www.google.com/index.html
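The completion rule described above also covers relative paths, root-relative paths, and query-only links. The sketch below uses a made-up base URL with a nested path to show a few of these cases:

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com/a/b/c.html'

# A relative file name replaces the last path segment of the base
sibling = urljoin(base, 'd.html')
print(sibling)

# '..' moves up one directory before resolving
parent = urljoin(base, '../d.html')
print(parent)

# A leading '/' resolves against the root of the site
rooted = urljoin(base, '/d.html')
print(rooted)

# A link with only a query keeps the base path and swaps in the new query
query_only = urljoin(base + '?x=1', '?y=2')
print(query_only)
```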

6. urlencode

urlencode is very useful when building GET requests: it serializes a dict of parameters into a query string. Usage:

# ************************
# Usage of urlencode
# ************************
from urllib.parse import urlencode
params = {
    'username': 'admin',
    'password': '123456'
}
base_url = 'https://www.baidu.com/index.html?'
# Serialize the dict as key=value pairs joined with '&'
url = base_url + urlencode(params)
print(url)

Running it prints:

https://www.baidu.com/index.html?username=admin&password=123456
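urlencode also percent-encodes values that are not URL-safe and, with doseq=True, expands list values into repeated keys. The sketch below uses made-up parameters to show this, together with parse_qs from the same module as one way to decode the query string again:

```python
from urllib.parse import parse_qs, urlencode

# Non-ASCII values are percent-encoded; non-string values are converted with str()
query = urlencode({'wd': '北京', 'page': 1})
print(query)

# Spaces become '+' by default
spaced = urlencode({'q': 'hello world'})
print(spaced)

# doseq=True turns a list value into repeated key=value pairs
multi = urlencode({'tag': ['a', 'b']}, doseq=True)
print(multi)

# parse_qs reverses the encoding; every value comes back as a list of strings
decoded = parse_qs(query)
print(decoded)
```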

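7. quote

quote converts Chinese characters (and other characters that are not URL-safe) into percent-encoded form, which avoids garbled text in the URL. Usage: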

# ************************
# Usage of quote
# ************************
from urllib.parse import quote
keyword = '北京'
# Percent-encode the UTF-8 bytes of the keyword
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)

Running it prints:

https://www.baidu.com/s?wd=%E5%8C%97%E4%BA%AC
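By default quote leaves '/' unescaped, which suits path segments but not always query values. A short sketch of the safe parameter and of the related quote_plus function, with illustrative inputs:

```python
from urllib.parse import quote, quote_plus

# '/' is in the default safe set, so it is kept as-is
path_quoted = quote('/a b/北京')
print(path_quoted)

# An empty safe set escapes '/' as well
fully_quoted = quote('/a b', safe='')
print(fully_quoted)

# quote_plus encodes spaces as '+', as used in form-style query strings
plus_quoted = quote_plus('a b&c')
print(plus_quoted)
```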

8. unquote

unquote decodes a percent-encoded URL. Usage:

# ************************
# Usage of unquote
# ************************
from urllib.parse import unquote
url = 'https://www.baidu.com/s?wd=%E5%8C%97%E4%BA%AC'
# Decode the percent-encoded bytes back into text
unquoted_url = unquote(url)
print(unquoted_url)

Running it prints: