Python 3 Modules in Depth: the urllib.parse Submodule of the urllib Toolkit


郑小源


Article link: https://blog.csdn.net/zly412934578/article/details/77776659


This module defines a standard interface to break Uniform Resource Locator (URL) strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL” to an absolute URL given a “base URL.”
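That is how the official documentation describes this module: it can break a URL string into components, combine components back into a URL string, and convert a relative URL into an absolute URL given a base URL.
Put simply, it can split URLs and it can assemble them. It supports the following URL schemes: file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, shttp, sip, sips, snews, svn, svn+ssh, telnet, wais, ws, wss.
The functions in this module fall into two categories: URL parsing and URL quoting.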
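(1) URL parsing
The URL parsing functions focus on splitting a URL string into its components, or on combining URL components into a URL string.
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
This function parses a URL into six components and returns a 6-item named tuple. The six components correspond to the general URL structure scheme://netloc/path;parameters?query#fragment. Each item in the tuple is a string, possibly empty. The components are not broken up into smaller parts (for example, netloc stays a single string), and % escapes are not expanded.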

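The returned named tuple has the following attributes:

Attribute	Index	Value	Value if not present
scheme	0	URL scheme specifier	value of the scheme parameter
netloc	1	Network location part	empty string
path	2	Hierarchical path	empty string
params	3	Parameters for the last path element	empty string
query	4	Query component	empty string
fragment	5	Fragment identifier	empty string
username		User name	None
password		Password	None
hostname		Host name (lower case)	None
port		Port number as an integer, if present	None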

Example:

import urllib.parse

print(urllib.parse.urlparse("https://www.zhihu.com/question/50056807/answer/223566912"))

Output:

ParseResult(scheme='https', netloc='www.zhihu.com', path='/question/50056807/answer/223566912', params='', query='', fragment='')
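The example above only shows the six indexed fields. As a minimal sketch of the extra attributes listed in the table (username, password, hostname, port), the snippet below parses the credential-bearing URL that appears later in this article; the variable names are only illustrative.

```python
from urllib.parse import urlparse

# A URL that exercises every component, including credentials and a port.
result = urlparse("http://user123:pwd@NetLoc:80/path;param?query=arg#frag")

# Index access and attribute access return the same value.
scheme_by_index = result[0]
scheme_by_name = result.scheme

print(result.netloc)                     # the whole network location as one string
print(result.params)                     # the part after ";" in the last path segment
print(result.username, result.password)  # split out of netloc
print(result.hostname)                   # always lower case
print(result.port)                       # an int, not a string
```

Note that netloc itself is left untouched; hostname and port are derived views on it.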

urllib.parse.parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace')

This function parses the parameters in a URL's query component and returns a dictionary that maps each key to a list of its values.

Example:

import urllib.parse
print(urllib.parse.parse_qs("FuncNo=9009001&username=1"))

Output:

{'FuncNo': ['9009001'], 'username': ['1']}
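The values are lists because a key may repeat in a query string. The sketch below uses a made-up query string (the names tag and empty are assumptions for illustration) to show repeated keys and the effect of keep_blank_values.

```python
from urllib.parse import parse_qs

# Hypothetical query string: "tag" repeats and "empty" has no value.
qs = "tag=a&tag=b&empty=&FuncNo=9009001"

# By default, parameters with blank values are dropped.
default = parse_qs(qs)

# keep_blank_values=True keeps them as empty strings.
with_blanks = parse_qs(qs, keep_blank_values=True)

print(default)
print(with_blanks)
```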

urllib.parse.parse_qsl(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace')

This function does the same job as urllib.parse.parse_qs(); the only difference is that it returns a list of (key, value) pairs.

Example:

import urllib.parse
print(urllib.parse.parse_qsl("FuncNo=9009001&username=1"))

Output:

[('FuncNo', '9009001'), ('username', '1')]
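Both parse_qs() and parse_qsl() expect only the query part, not a whole URL. A small sketch of the usual combination follows; the www.example.com URL is a placeholder, not something from the original examples.

```python
from urllib.parse import urlsplit, parse_qsl

# Placeholder URL with a repeated "username" parameter.
url = "https://www.example.com/search?FuncNo=9009001&username=1&username=2"

# Extract the query component first, then parse it.
query = urlsplit(url).query
pairs = parse_qsl(query)

# The list keeps the original order and every repeated key.
print(pairs)
```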

urllib.parse.urlunparse(parts)

This function assembles the tuple produced by urlparse() back into a URL.

Example:

import urllib.parse

# Parse a URL, then rebuild it from the resulting six components.
parsed = urllib.parse.urlparse("https://www.zhihu.com/question/50056807/answer/223566912")
print(parsed)
t = parsed[:]  # plain tuple copy of the six components
print(urllib.parse.urlunparse(t))

Output:

ParseResult(scheme='https', netloc='www.zhihu.com', path='/question/50056807/answer/223566912', params='', query='', fragment='')
https://www.zhihu.com/question/50056807/answer/223566912
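A round trip on its own is not very useful; the more common pattern is to change one component and rebuild. The sketch below relies on ParseResult being a named tuple, so _replace() returns a modified copy. The query and fragment values are invented for illustration.

```python
from urllib.parse import urlparse, urlunparse

parsed = urlparse("https://www.zhihu.com/question/50056807/answer/223566912")

# Swap in a (hypothetical) query string and fragment, leaving the rest alone.
updated = parsed._replace(query="sort=created", fragment="top")
rebuilt = urlunparse(updated)
print(rebuilt)

# Any 6-item iterable works, not just a ParseResult.
manual = urlunparse(("https", "www.zhihu.com", "/question/50056807", "", "", ""))
print(manual)
```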

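urllib.parse.urlsplit(urlstring, scheme='', allow_fragments=True)
This function is similar to urlparse(); the only difference is that it does not split params out of the URL. It returns a 5-item named tuple whose attributes match the urlparse() table above minus params, so query moves to index 3 and fragment to index 4.

Example: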

import urllib.parse
print(urllib.parse.urlsplit("https://www.zhihu.com/question/50056807/answer/223566912"))

Output:

SplitResult(scheme='https', netloc='www.zhihu.com', path='/question/50056807/answer/223566912', query='', fragment='')
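The URL in the example has no params, so the difference is not visible there. As a quick sketch, parsing a URL that does contain a ";param" segment with both functions shows where the text ends up:

```python
from urllib.parse import urlparse, urlsplit

url = "http://user123:pwd@NetLoc:80/path;param?query=arg#frag"

parsed = urlparse(url)  # 6 items: params is separated from path
split = urlsplit(url)   # 5 items: ";param" stays inside path

print(parsed.path, parsed.params)
print(split.path)
print(len(parsed), len(split))
```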

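urllib.parse.urlunsplit(parts)

Similar to urlunparse(), and the counterpart of urlsplit(): it assembles the 5-item tuple returned by urlsplit() back into a URL.

Example: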

import urllib.parse
parsed=urllib.parse.urlsplit("https://www.zhihu.com/question/50056807/answer/223566912")
t=parsed[:]
print(urllib.parse.urlunsplit(t))

Output:

https://www.zhihu.com/question/50056807/answer/223566912

urllib.parse.urljoin(base, url, allow_fragments=True)

This function combines a base URL with another URL to construct a complete (absolute) URL.

Example:

import urllib.parse
print(urllib.parse.urljoin("https://www.baidu.com/Python.html","Java.html"))

Output:

https://www.baidu.com/Java.html

Note: if url is an absolute URL (that is, it starts with "//" or "scheme://"), the hostname and/or scheme of url will be present in the result.

Example:

import urllib.parse
print(urllib.parse.urljoin("https://www.baidu.com/Python.html","//www.zhihu.com/Java.html"))

Output:

https://www.zhihu.com/Java.html
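How the relative part is resolved depends on the shape of the base path, which is easy to get wrong. The sketch below assumes a base URL with an extra /docs/ directory (not part of the original examples) to show the common cases.

```python
from urllib.parse import urljoin

# Assumed base URL with one directory level, for illustration only.
base = "https://www.baidu.com/docs/Python.html"

sibling = urljoin(base, "Java.html")      # replaces the last path segment
parent = urljoin(base, "../Java.html")    # ".." climbs one directory
rooted = urljoin(base, "/Java.html")      # a leading "/" restarts at the root

# A trailing slash on the base decides whether "docs" is a directory or a file.
no_slash = urljoin("https://www.baidu.com/docs", "Java.html")
with_slash = urljoin("https://www.baidu.com/docs/", "Java.html")

for joined in (sibling, parent, rooted, no_slash, with_slash):
    print(joined)
```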

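urllib.parse.urldefrag(url)

If url contains a fragment identifier, this function returns a version of url without the fragment, with the fragment returned as a separate string. If url contains no fragment identifier, it returns url unmodified together with an empty string.

Example: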

import urllib.parse
print(urllib.parse.urldefrag("http://user123:pwd@NetLoc:80/path;param?query=arg#frag"))
print(urllib.parse.urldefrag("http://user123:pwd@NetLoc:80/path;param?query=arg"))

Output:

DefragResult(url='http://user123:pwd@NetLoc:80/path;param?query=arg', fragment='frag')
DefragResult(url='http://user123:pwd@NetLoc:80/path;param?query=arg', fragment='')

(2) URL quoting
The URL quoting functions focus on taking program data and making it safe for use as URL components by quoting special characters and appropriately encoding non-ASCII text. They also support reversing these operations to recreate the original data from the contents of a URL component if that task isn’t already covered by the URL parsing functions above.
In short, these functions make data safe to place in a URL by percent-escaping special characters and encoding non-ASCII text, and they can reverse the process.

urllib.parse.quote(string, safe='/', encoding=None, errors=None)
The first argument is the string to quote (a URL in the example below). The second argument, safe, lists additional characters that are left unchanged during quoting; it defaults to "/".

Example:

import urllib.parse
url="https://www.zhihu.com/question/50056807/answer/223566912"
print(urllib.parse.quote(url))
print(urllib.parse.quote(url,safe=":"))

Output:

https%3A//www.zhihu.com/question/50056807/answer/223566912
https:%2F%2Fwww.zhihu.com%2Fquestion%2F50056807%2Fanswer%2F223566912
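The encoding parameter matters once the text is not plain ASCII. A minimal sketch, using a sample string containing a space and a non-ASCII letter (the string is only an illustration):

```python
from urllib.parse import quote

text = "/El Niño/"

# Default: UTF-8 encoding, and "/" is treated as safe.
default_quoted = quote(text)

# safe="" quotes the slashes too, which suits a single path segment or value.
fully_quoted = quote(text, safe="")

# A different encoding yields different percent escapes for the same character.
latin1_quoted = quote("ñ", encoding="latin-1")

print(default_quoted)
print(fully_quoted)
print(latin1_quoted)
```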

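urllib.parse.quote_plus(string, safe='', encoding=None, errors=None)
This function is similar to quote(), but it also replaces spaces with plus signs, and safe defaults to the empty string, so "/" is quoted as well.
Example: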

import urllib.parse
url="https://www.zhihu.com/ question/50056807/answer/223566912"
print(urllib.parse.quote_plus(url))

Output:

https%3A%2F%2Fwww.zhihu.com%2F+question%2F50056807%2Fanswer%2F223566912
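To make the difference from quote() concrete, here is a side-by-side sketch on a short made-up value containing a space, a literal plus sign and a slash:

```python
from urllib.parse import quote, quote_plus

value = "a b+c/d"  # illustrative value only

with_quote = quote(value)       # space -> %20, "/" kept
with_plus = quote_plus(value)   # space -> "+", "/" -> %2F

# In both cases a literal "+" is escaped as %2B, so it cannot be
# confused with an encoded space.
print(with_quote)
print(with_plus)
```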

urllib.parse.quote_from_bytes(bytes, safe='/')
Similar to quote(), but it accepts a bytes object rather than a str, and it performs no string-to-bytes encoding.
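A short sketch of quote_from_bytes() on raw bytes, including a byte (0xEF) that is not valid UTF-8 on its own; the sample values are illustrative:

```python
from urllib.parse import quote, quote_from_bytes

# Raw bytes are escaped byte by byte; no text decoding is involved.
raw_quoted = quote_from_bytes(b"a&\xef")

# For text, encoding first and then quoting the bytes matches quote().
same_as_quote = quote_from_bytes("ñ".encode("utf-8")) == quote("ñ")

print(raw_quoted)
print(same_as_quote)
```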

urllib.parse.unquote(string, encoding='utf-8', errors='replace'), urllib.parse.unquote_plus(string, encoding='utf-8', errors='replace'), urllib.parse.unquote_to_bytes(string)

These three functions are the counterparts of quote(), quote_plus() and quote_from_bytes() above, respectively; each one decodes a string produced by the corresponding quoting function.
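The original gives no example for these three, so here is a hedged sketch that reverses a few percent-encoded sample strings (the inputs are illustrative):

```python
from urllib.parse import unquote, unquote_plus, unquote_to_bytes

# unquote() decodes %xx escapes but leaves "+" alone.
path = unquote("/El%20Ni%C3%B1o/")
plus_kept = unquote("a+b")

# unquote_plus() additionally turns "+" back into a space.
form_value = unquote_plus("/El+Ni%C3%B1o/")

# unquote_to_bytes() returns bytes, so invalid UTF-8 is not a problem.
raw = unquote_to_bytes("a%26%EF")

print(path, plus_kept, form_value, raw)
```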

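urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)

This function takes a mapping object or a sequence of two-element tuples and joins the data into a query string. The result is a series of key=value pairs separated by "&", that is, key1=value1&key2=value2...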
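As a sketch of typical usage, the snippet below encodes a dictionary and a list of pairs. The parameter values (such as "zhang san" and the tag list) are made up for illustration; FuncNo and username echo the earlier parse_qs example.

```python
from urllib.parse import urlencode, quote, parse_qs

params = {"FuncNo": "9009001", "username": "zhang san"}

# Default quote_via=quote_plus: spaces become "+".
encoded = urlencode(params)

# quote_via=quote uses %20 for spaces instead.
encoded_pct = urlencode(params, quote_via=quote)

# doseq=True expands a sequence value into one pair per element;
# without it the list is converted with str() and quoted as a whole.
multi = urlencode([("tag", ["a", "b"])], doseq=True)
not_expanded = urlencode([("tag", ["a", "b"])])

print(encoded)
print(encoded_pct)
print(multi)
print(not_expanded)

# parse_qs() reverses the encoding.
round_trip = parse_qs(encoded)
```

Appending the result to a base URL after a "?" gives a complete request URL.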
