Python urllib.parse
urllib in the Python 3 standard library handles URL requests over various protocols. Its parse module is used to parse URLs and exports the following names:
__all__ = ["urlparse", "urlunparse", "urljoin", "urldefrag",
"urlsplit", "urlunsplit", "urlencode", "parse_qs",
"parse_qsl", "quote", "quote_plus", "quote_from_bytes",
"unquote", "unquote_plus", "unquote_to_bytes",
"DefragResult", "ParseResult", "SplitResult",
"DefragResultBytes", "ParseResultBytes", "SplitResultBytes"]
urlparse
Splits a complete URL string into six components. For https://www.baidu.com/topic/ the result is:
urlparse -> (scheme, netloc, path, params, query, fragment)
- scheme = "https"
- netloc = "www.baidu.com"
- path = "/topic/"
- params = ""
- query = ""
- fragment = ""
The result behaves like a tuple (it is a named tuple), so each component can also be read as an attribute:
proxy_result = urlparse(url)
query = proxy_result.query
netloc = proxy_result.netloc
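A fuller illustration of the attribute access above, as a minimal sketch. The URL here is a hypothetical one (not from the original text), chosen so that all six components are non-empty:

```python
from urllib.parse import urlparse

# Hypothetical URL chosen so that every component is non-empty.
url = "https://www.baidu.com/topic/;type=a?page=2#top"
result = urlparse(url)

print(result.scheme)    # https
print(result.netloc)    # www.baidu.com
print(result.path)      # /topic/
print(result.params)    # type=a
print(result.query)     # page=2
print(result.fragment)  # top

# The result is a named tuple, so positional unpacking also works.
scheme, netloc, path, params, query, fragment = result
```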
urlsplit
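Similar to urlparse. The difference is that it returns a 5-tuple, omitting params, which stays part of path. The components can be read as attributes in the same way, and usage is otherwise the same as urlparse.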
'''
urlsplit -> (scheme, netloc, path, query, fragment)
'''
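To make the difference from urlparse concrete, the sketch below runs both functions on the same hypothetical URL (an assumption for illustration) and shows where the `;type=a` part ends up:

```python
from urllib.parse import urlparse, urlsplit

# Hypothetical URL containing a ";params" segment.
url = "https://www.baidu.com/topic/;type=a?page=2#top"

split_result = urlsplit(url)
parse_result = urlparse(url)

print(len(split_result))    # 5
print(split_result.path)    # /topic/;type=a  (params stay inside path)
print(parse_result.path)    # /topic/
print(parse_result.params)  # type=a
```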
urlunparse
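The inverse of urlparse: it assembles URL components back into a complete URL. The argument is a 6-item tuple (or other iterable) in the same order that urlparse returns; pass an empty string for any component that is absent. It works well together with urlparse.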
'''
urlunparse(components) -> Put a parsed URL back together again.
'''
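A short sketch of building a URL from components and of the round trip with urlparse. The component values are illustrative assumptions:

```python
from urllib.parse import urlparse, urlunparse

# Six components in urlparse order; params is left empty here.
components = ("https", "www.baidu.com", "/topic/", "", "page=2", "top")
url = urlunparse(components)
print(url)  # https://www.baidu.com/topic/?page=2#top

# Parsing and unparsing this URL gives back the same string.
round_trip = urlunparse(urlparse(url))
print(round_trip == url)  # True
```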
urlunsplit
The inverse of urlsplit; it takes a 5-item tuple and can be used together with urlsplit.
'''
urlunsplit -> Combine the elements of a tuple as returned by urlsplit() into a complete URL as a string.
'''
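One common way to combine the two functions is to split a URL, change one component, and put it back together. The sketch below assumes a hypothetical URL and uses the named tuple's `_replace` method to swap the query:

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical URL used only for illustration.
url = "https://www.baidu.com/topic/?page=2#top"
parts = urlsplit(url)

# _replace returns a new named tuple with the given field changed.
next_page = urlunsplit(parts._replace(query="page=3"))
print(next_page)  # https://www.baidu.com/topic/?page=3#top

# A plain 5-tuple works as well.
rebuilt = urlunsplit(("https", "www.baidu.com", "/topic/", "page=2", "top"))
print(rebuilt == url)  # True
```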
urljoin
Combines a base URL with a path. When path is itself a complete (absolute) URL, path is returned unchanged.
'''
urljoin -> Join a base URL and a possibly relative URL to form an absolute interpretation of the latter.
'''
urljoin(base, path)
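The behaviour depends on the form of the second argument, which the following sketch illustrates. The base URL and paths are assumptions for the example:

```python
from urllib.parse import urljoin

base = "https://www.baidu.com/topic/"

# A relative path is resolved against the base's directory.
relative = urljoin(base, "detail/1")
print(relative)  # https://www.baidu.com/topic/detail/1

# A path starting with "/" replaces the whole path of the base.
rooted = urljoin(base, "/about")
print(rooted)    # https://www.baidu.com/about

# A complete URL is returned as-is.
absolute = urljoin(base, "https://example.com/x")
print(absolute)  # https://example.com/x

# Without a trailing slash on the base, the last segment is replaced.
no_slash = urljoin("https://www.baidu.com/topic", "detail")
print(no_slash)  # https://www.baidu.com/detail
```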
urldefrag
Splits a URL into two parts, the URL without its fragment and the fragment itself. The fragment is whatever follows # in the URL.
urldefrag -> Removes any existing fragment from URL. Returns a tuple of the defragmented URL and the fragment.
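A minimal sketch of urldefrag on a hypothetical URL; the result is a named tuple with `url` and `fragment` fields:

```python
from urllib.parse import urldefrag

# Hypothetical URL with a fragment.
result = urldefrag("https://www.baidu.com/topic/?page=2#top")
print(result.url)       # https://www.baidu.com/topic/?page=2
print(result.fragment)  # top

# If there is no "#", the fragment is an empty string.
no_fragment = urldefrag("https://www.baidu.com/topic/")
print(repr(no_fragment.fragment))  # ''
```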
unquote_to_bytes
Percent-decodes a URL and returns bytes. If the argument is a str, it is first encoded to bytes as UTF-8.
unquote_to_bytes -> unquote_to_bytes('abc%20def') -> b'abc def'.
unquote
Percent-decodes the given string. The decoded bytes are interpreted as UTF-8 by default; this can be changed with the encoding parameter.
unquote -> unquote('abc%20def') -> 'abc def'.
unquote(string, encoding='utf-8', errors='replace')
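The sketch below contrasts unquote with unquote_to_bytes and shows the encoding parameter. The sample strings are assumptions for illustration; the GBK value is produced with quote rather than written by hand so that the round trip is self-consistent:

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# unquote returns str, decoding the bytes as UTF-8 by default.
text = unquote("%E4%B8%AD%E6%96%87")
print(text)  # 中文

# unquote_to_bytes returns the raw bytes without decoding them.
raw = unquote_to_bytes("%E4%B8%AD%E6%96%87")
print(raw == "中文".encode("utf-8"))  # True

# For data percent-encoded in another charset, pass encoding explicitly.
gbk_encoded = quote("中文", encoding="gbk")
gbk_text = unquote(gbk_encoded, encoding="gbk")
print(gbk_text)  # 中文
```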
parse_qs
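Parses a query string into a dict of the form {query1: [value1, value2, …]}. Each value is a list because a parameter may appear more than once. By default, parameters with blank values are dropped; set the boolean parameter keep_blank_values to True to keep them.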
parse_qs -> Parse a query given as a string argument.
parse_qs(qs, keep_blank_values=False, strict_parsing=False,
encoding='utf-8', errors='replace')
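A small sketch of the default behaviour and of keep_blank_values, using a hypothetical query string with a repeated key and a blank value:

```python
from urllib.parse import parse_qs

# Hypothetical query string: "a" repeats, "b" has a blank value.
qs = "a=1&a=2&b=&c=3"

default = parse_qs(qs)
print(default)  # {'a': ['1', '2'], 'c': ['3']}

with_blanks = parse_qs(qs, keep_blank_values=True)
print(with_blanks)  # {'a': ['1', '2'], 'b': [''], 'c': ['3']}
```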
parse_qsl
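Parses a query string into a list of two-element tuples: [(query1, value1), …]. Unlike parse_qs, the order and repetition of parameters are preserved.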
parse_qsl -> Parse a query given as a string argument.
parse_qsl(qs, keep_blank_values=False, strict_parsing=False,
encoding='utf-8', errors='replace')
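For comparison with parse_qs, here is the same kind of hypothetical query string passed through parse_qsl:

```python
from urllib.parse import parse_qsl

# Hypothetical query string: "a" repeats, "b" has a blank value.
qs = "a=1&a=2&b=&c=3"

pairs = parse_qsl(qs)
print(pairs)  # [('a', '1'), ('a', '2'), ('c', '3')]

pairs_with_blanks = parse_qsl(qs, keep_blank_values=True)
print(pairs_with_blanks)  # [('a', '1'), ('a', '2'), ('b', ''), ('c', '3')]
```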
unquote_plus
Replaces + with a space before applying unquote.
unquote_plus('%7e/abc+def') -> '~/abc def'
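The difference from unquote only shows up when the input contains a literal +. A brief sketch with an assumed sample string:

```python
from urllib.parse import unquote, unquote_plus

# "+" means a space in form-encoded data; "%2B" is an encoded plus sign.
encoded = "a+b%2Bc"

plus_decoded = unquote_plus(encoded)
print(plus_decoded)   # a b+c

plain_decoded = unquote(encoded)
print(plain_decoded)  # a+b+c
```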
quote
Percent-encodes a string, using UTF-8 by default. Characters listed in safe (by default /) are left unencoded.
quote('abc def') -> 'abc%20def'
quote(string, safe='/', encoding=None, errors=None)
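A short sketch of quote and its safe parameter. The input strings are assumptions for illustration:

```python
from urllib.parse import quote

spaced = quote("abc def")
print(spaced)   # abc%20def

# Non-ASCII text is encoded as UTF-8 first; "/" is kept by default.
chinese = quote("/path/中文")
print(chinese)  # /path/%E4%B8%AD%E6%96%87

# An empty safe string makes quote encode "/" as well.
strict = quote("a/b", safe="")
print(strict)   # a%2Fb
```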
quote_plus
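Like quote, but spaces in the string are replaced with +. Its safe parameter also defaults to an empty string, so / is encoded too.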
Like quote(), but also replace ' ' with '+', as required for quoting HTML form values.
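The sketch below puts quote and quote_plus side by side on an assumed sample string:

```python
from urllib.parse import quote, quote_plus

value = "a b/c"

plus_encoded = quote_plus(value)
print(plus_encoded)   # a+b%2Fc

plain_encoded = quote(value)
print(plain_encoded)  # a%20b/c

# A literal "+" is percent-encoded so it is not confused with a space.
literal_plus = quote_plus("a+b")
print(literal_plus)   # a%2Bb
```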
quote_from_bytes
Percent-encodes a bytes or bytearray object and returns a str.
Like quote(), but accepts a bytes object rather than a str, and does
not perform string-to-bytes encoding.
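A minimal sketch of quote_from_bytes, assuming sample byte strings chosen for illustration. For UTF-8 data the result matches what quote produces from the corresponding str:

```python
from urllib.parse import quote, quote_from_bytes

encoded = quote_from_bytes(b"abc def\x3f")
print(encoded)  # abc%20def%3F

# Encoding the str to UTF-8 by hand gives the same result as quote().
data = "中文".encode("utf-8")
from_bytes = quote_from_bytes(data)
print(from_bytes == quote("中文"))  # True

# bytearray input is accepted too.
from_bytearray = quote_from_bytes(bytearray(b"a b"))
print(from_bytearray)  # a%20b
```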
urlencode
Converts a dict or a sequence of two-element tuples into a query string.
Encode a dict or sequence of two-element tuples into a URL query string.
urlencode(query, doseq=False, safe='', encoding=None, errors=None,
quote_via=quote_plus)
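The following sketch shows the main parameters in action. The keys and values are hypothetical:

```python
from urllib.parse import quote, urlencode

# Spaces become "+" because quote_via defaults to quote_plus.
basic = urlencode({"q": "python urllib", "page": 2})
print(basic)     # q=python+urllib&page=2

# doseq=True expands a list value into one key=value pair per item.
repeated = urlencode({"tag": ["a", "b"]}, doseq=True)
print(repeated)  # tag=a&tag=b

# Passing quote_via=quote encodes spaces as %20 instead.
percent = urlencode({"q": "python urllib"}, quote_via=quote)
print(percent)   # q=python%20urllib

# A sequence of pairs is accepted as well and keeps its order.
from_pairs = urlencode([("b", "2"), ("a", "1")])
print(from_pairs)  # b=2&a=1
```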
to_bytes
to_bytes(u"URL") --> 'URL'.
unwrap
unwrap('<URL:type://host/path>') --> 'type://host/path'.
splittype
splittype('type:opaquestring') --> 'type', 'opaquestring'.
splithost
splithost('//host[:port]/path') --> 'host[:port]', '/path'.
splituser
splituser('user[:passwd]@host[:port]') --> 'user[:passwd]', 'host[:port]'.
splitpasswd
splitpasswd('user:passwd') -> 'user', 'passwd'.
splitport
splitport('host:port') --> 'host', 'port'.
splitnport
Split host and port, returning numeric port.
splitquery
splitquery('/path?query') --> '/path', 'query'.
splittag
splittag('/path#tag') --> '/path', 'tag'.
splitattr
splitattr('/path;attr1=value1;attr2=value2;...') ->
'/path', ['attr1=value1', 'attr2=value2', ...].
splitvalue
splitvalue('attr=value') --> 'attr', 'value'.