A Detailed Guide to Python's urllib Library

IT之一小佬

Last modified 2023-12-24 16:21:00

First published 2022-05-05 17:00:46

Article link: https://blog.csdn.net/weixin_44799217/article/details/124591187

quote()

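quote is a function in the urllib.parse module that URL-encodes (percent-encodes) a string. URL encoding converts non-ASCII characters and certain special characters into an escaped form so that they can be transmitted and processed safely inside a URL.

The signature of quote is:

quote(string, safe='/', encoding=None, errors=None)

Parameters:

string: the string to URL-encode.
safe: additional characters that should not be encoded, given as a string of characters; the default is '/'. Letters, digits, and the characters _.-~ are never encoded.
encoding: the encoding used to convert the string to bytes; the default None means UTF-8.
errors: how encoding errors are handled; the default None means 'strict'.

The return value is the URL-encoded string.

Example code: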

from urllib.parse import quote

string = "中文"
encoded_string = quote(string)
print(encoded_string)  # Output: %E4%B8%AD%E6%96%87

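In the example above, the string "中文" is URL-encoded with quote, giving "%E4%B8%AD%E6%96%87". Non-ASCII characters and certain special characters are converted to one or more "%XX" escapes, where XX is the hexadecimal value of one byte of the character's encoding (UTF-8 by default). Each of the two characters here occupies three bytes in UTF-8, so each becomes three escapes.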
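
To make the byte-level behaviour concrete, here is a minimal sketch that rebuilds the escapes by hand and compares them with what quote returns. The variable names are only illustrative, and the GBK line is included purely to show that the escapes depend on the chosen encoding.

```python
from urllib.parse import quote, unquote

text = "中文"

# Each character is first encoded to bytes (UTF-8 by default) ...
utf8_bytes = text.encode("utf-8")
# ... and every byte is then written as %XX in uppercase hexadecimal.
manual = "".join(f"%{b:02X}" for b in utf8_bytes)

print(manual)                 # %E4%B8%AD%E6%96%87
print(quote(text) == manual)  # True

# A different encoding yields different escapes for the same text.
gbk_encoded = quote(text, encoding="gbk")
print(gbk_encoded)            # %D6%D0%CE%C4

# unquote reverses the process (it also assumes UTF-8 by default).
print(unquote(quote(text)) == text)  # True
```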

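quote also accepts a safe parameter naming the characters that should not be encoded. The default already leaves "/" untouched; other characters such as ":" or "?" can be passed in as well:

Example code: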

from urllib.parse import quote

url_string = "http://example.com/path/to/file?name=张三"
encoded_string = quote(url_string)
print(encoded_string)  # Output: http%3A//example.com/path/to/file%3Fname%3D%E5%BC%A0%E4%B8%89

encoded_string = quote(url_string, safe='/')
print(encoded_string)  # Output: http%3A//example.com/path/to/file%3Fname%3D%E5%BC%A0%E4%B8%89

encoded_string = quote(url_string, safe=':')
print(encoded_string)  # Output: http:%2F%2Fexample.com%2Fpath%2Fto%2Ffile%3Fname%3D%E5%BC%A0%E4%B8%89

encoded_string = quote(url_string, safe='/:')
print(encoded_string)  # Output: http://example.com/path/to/file%3Fname%3D%E5%BC%A0%E4%B8%89

encoded_string = quote(url_string, safe='/:?')
print(encoded_string)  # Output: http://example.com/path/to/file?name%3D%E5%BC%A0%E4%B8%89
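
Growing the safe list until a whole URL survives is one option. Another common pattern is to quote each component separately and then join the pieces. The sketch below assumes a hypothetical helper named build_url (not part of urllib) and uses safe='' so that a "/" inside a single path segment is escaped as well.

```python
from urllib.parse import quote


def build_url(base, segments, name):
    """Hypothetical helper: quote each piece on its own, then join them."""
    # safe='' means even '/' inside a segment is percent-encoded.
    path = "/".join(quote(seg, safe="") for seg in segments)
    return f"{base}/{path}?name={quote(name, safe='')}"


url = build_url("http://example.com", ["path", "a/b", "file"], "张三")
print(url)
# http://example.com/path/a%2Fb/file?name=%E5%BC%A0%E4%B8%89
```

The scheme and host are never passed through quote here, so ":" and "//" stay intact without being listed in safe.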

urlencode()

urlencode is a function in the urllib.parse module that URL-encodes the data in a dict or a list of tuples.

Example code:

from urllib.parse import urlencode

data = {'name': '张三', 'age': 25, 'city': '北京'}
encoded_data = urlencode(data)
print(encoded_data)  # name=%E5%BC%A0%E4%B8%89&age=25&city=%E5%8C%97%E4%BA%AC

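The output is: name=%E5%BC%A0%E4%B8%89&age=25&city=%E5%8C%97%E4%BA%AC

urlencode converts each key-value pair of the dict to URL-encoded form and joins the pairs with &. Non-ASCII text is percent-encoded; for example, 张三 becomes %E5%BC%A0%E4%B8%89.

The encoded result can be used as a URL query string or, once converted to bytes, as the body of a POST request.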
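
The dict form is shown above; the sketch below covers the list-of-tuples form, the doseq flag for multi-valued keys, and decoding with parse_qs. The keys, values, and the example.com URL are made up for illustration.

```python
from urllib.parse import urlencode, parse_qs

# A list of 2-tuples keeps the order and allows repeated keys.
pairs = [("tag", "python"), ("tag", "url"), ("q", "a b")]
print(urlencode(pairs))  # tag=python&tag=url&q=a+b

# With doseq=True, a sequence value expands to one pair per element.
data = {"tag": ["python", "url"], "city": "北京"}
query = urlencode(data, doseq=True)
print(query)  # tag=python&tag=url&city=%E5%8C%97%E4%BA%AC

# Appending the query string to a base URL (example.com is a placeholder).
full_url = "http://example.com/search?" + query
print(full_url)

# parse_qs decodes a query string back into a dict of lists.
print(parse_qs(query))  # {'tag': ['python', 'url'], 'city': ['北京']}
```

Note that urlencode writes a space as "+" rather than "%20", because it quotes values with quote_plus by default.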

urlparse()

The urlparse function splits a URL into 6 parts and returns a named tuple (scheme, netloc, path, params, query, fragment). The parts can be put back together with urlunparse.

def urlparse(url, scheme='', allow_fragments=True):
    """Parse a URL into 6 components:
    <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
    Return a 6-tuple: (scheme, netloc, path, params, query, fragment).
    Note that we don't break the components up in smaller bits
    (e.g. netloc is a single string) and we don't expand % escapes."""
    url, scheme, _coerce_result = _coerce_args(url, scheme)
    splitresult = urlsplit(url, scheme, allow_fragments)
    scheme, netloc, url, query, fragment = splitresult
    if scheme in uses_params and ';' in url:
        url, params = _splitparams(url)
    else:
        params = ''
    result = ParseResult(scheme, netloc, url, params, query, fragment)
    return _coerce_result(result)

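Note: the tuple returned by urlparse can be used to determine the network protocol (HTTP, FTP, and so on), the server address, the file path, and more.

Example code: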

from urllib.parse import urlparse


url = urlparse('http://www.baidu.com/index.php?username=dgw')
print(url)
print(url.netloc)
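
Beyond netloc, every component of the result can be read by attribute. The following sketch uses a made-up URL that exercises all six fields, plus the hostname and port attributes that the result derives from netloc.

```python
from urllib.parse import urlparse

# The URL below is invented for illustration only.
parts = urlparse("http://user@www.example.com:8080/a/b.html;v=1?x=1&y=2#top")

print(parts.scheme)    # http
print(parts.netloc)    # user@www.example.com:8080
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080
print(parts.path)      # /a/b.html
print(parts.params)    # v=1
print(parts.query)     # x=1&y=2
print(parts.fragment)  # top

# The result is still a tuple, so indexing and unpacking work too.
scheme, netloc, path, params, query, fragment = parts
print(parts[1] == parts.netloc)  # True
```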

urlunparse()

The urlunparse function assembles a tuple (scheme, netloc, path, params, query, fragment) into a correctly formatted URL.

def urlunparse(components):
    """Put a parsed URL back together again.  This may result in a
    slightly different, but equivalent URL, if the URL that was parsed
    originally had redundant delimiters, e.g. a ? with an empty query
    (the draft states that these are equivalent)."""
    scheme, netloc, url, params, query, fragment, _coerce_result = (
                                                  _coerce_args(*components))
    if params:
        url = "%s;%s" % (url, params)
    return _coerce_result(urlunsplit((scheme, netloc, url, query, fragment)))

Example code:

from urllib.parse import urlparse, urlunparse


url = urlparse('http://www.baidu.com/index.php?username=dgw')
print(url)
url_join1 = urlunparse(url)
print(url_join1)

url_tuple = ("http", "www.baidu.com", "index.php", "", "username=dgw", "")
url_join2 = urlunparse(url_tuple)
print(url_join2)
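
A typical reason to take a URL apart and reassemble it is to change one component. As a minimal sketch, and relying on the fact that the result of urlparse is a named tuple, _replace can produce a modified copy that urlunparse then turns back into a string. The new scheme and query value here are arbitrary choices for the example.

```python
from urllib.parse import urlparse, urlunparse, urlencode

parts = urlparse("http://www.baidu.com/index.php?username=dgw")

# _replace returns a modified copy; the original result is left unchanged.
new_parts = parts._replace(scheme="https", query=urlencode({"username": "张三"}))

new_url = urlunparse(new_parts)
print(new_url)
# https://www.baidu.com/index.php?username=%E5%BC%A0%E4%B8%89

# geturl() on the result is a shortcut for the same reassembly.
print(new_parts.geturl() == new_url)  # True
```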

urlsplit()

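The urlsplit function is mainly used to analyze a URL string. It returns a 5-element tuple (scheme, netloc, path, query, fragment). urlsplit() is much like urlparse(), except that it does not split the params out of the path.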

def urlsplit(url, scheme='', allow_fragments=True):
    """Parse a URL into 5 components:
    <scheme>://<netloc>/<path>?<query>#<fragment>
    Return a 5-tuple: (scheme, netloc, path, query, fragment).
    Note that we don't break the components up in smaller bits
    (e.g. netloc is a single string) and we don't expand % escapes."""
    url, scheme, _coerce_result = _coerce_args(url, scheme)
    allow_fragments = bool(allow_fragments)
    key = url, scheme, allow_fragments, type(url), type(scheme)
    cached = _parse_cache.get(key, None)
    ......

Example code:

from urllib.parse import urlparse, urlsplit


url = urlparse('http://www.baidu.com/index.php?username=dgw')
print(url)

url2 = urlsplit('http://www.baidu.com/index.php?username=dgw')
print(url2)
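
The two results above differ only by an empty params field, so the difference is easier to see with a URL that actually carries params. The sketch below uses an invented URL with a ";type=a" suffix on the path.

```python
from urllib.parse import urlparse, urlsplit

# Invented URL with a ';params' part after the path.
url = "http://www.example.com/index.php;type=a?username=dgw"

p = urlparse(url)
s = urlsplit(url)

print(len(p), len(s))  # 6 5
print(p.path)          # /index.php
print(p.params)        # type=a
print(s.path)          # /index.php;type=a  (params stay inside the path)
```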

urlunsplit()

def urlunsplit(components):
    """Combine the elements of a tuple as returned by urlsplit() into a
    complete URL as a string. The data argument can be any five-item iterable.
    This may result in a slightly different, but equivalent URL, if the URL that
    was parsed originally had unnecessary delimiters (for example, a ? with an
    empty query; the RFC states that these are equivalent)."""
    scheme, netloc, url, query, fragment, _coerce_result = (
                                          _coerce_args(*components))
    if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'):
        if url and url[:1] != '/': url = '/' + url

Example code:

from urllib.parse import urlparse, urlsplit, urlunsplit

url = urlparse('http://www.baidu.com/index.php?username=dgw')
print(url)

url2 = urlsplit('http://www.baidu.com/index.php?username=dgw')
print(url2)

url3 = urlunsplit(url2)
print(url3)

url_tuple = ("http", "www.baidu.com", "index.php", "username=dgw", "")
url4 = urlunsplit(url_tuple)
print(url4)
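
Two details of the library source quoted above can be checked directly: a path without a leading "/" gains one when a netloc is present, and a "?" followed by an empty query is dropped on the round trip. This is a small sketch of both cases; the variable names are only illustrative.

```python
from urllib.parse import urlsplit, urlunsplit

# A path without a leading '/' gets one once a netloc is present.
joined = urlunsplit(("http", "www.baidu.com", "index.php", "username=dgw", ""))
print(joined)  # http://www.baidu.com/index.php?username=dgw

# A '?' with an empty query is treated as redundant and not reproduced.
original = "http://www.baidu.com/index.php?"
round_trip = urlunsplit(urlsplit(original))
print(round_trip)  # http://www.baidu.com/index.php
```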

urljoin()

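urljoin() joins a base URL with a possibly relative URL to form the absolute address of the latter.

Note: if the base URL does not end with the character /, the rightmost segment of the base URL's path is replaced by the relative path.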

def urljoin(base, url, allow_fragments=True):
    """Join a base URL and a possibly relative URL to form an absolute
    interpretation of the latter."""
    if not base:
        return url
    if not url:
        return base

    base, url, _coerce_result = _coerce_args(base, url)
    ......

Example code:

from urllib.parse import urljoin


url = urljoin('http://www.baidu.com/test/', 'index.php?username=dgw')
print(url)

url2 = urljoin('http://www.baidu.com/test', 'index.php?username=dgw')
print(url2)
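
The trailing-slash rule is one of several resolution cases. The following sketch lists a few more with their results; example.com is a placeholder host, and the variable names are not from the original article.

```python
from urllib.parse import urljoin

base_with_slash = "http://www.baidu.com/test/"
base_without_slash = "http://www.baidu.com/test"

# Relative path: appended after the last '/' of the base path.
print(urljoin(base_with_slash, "index.php"))     # http://www.baidu.com/test/index.php
print(urljoin(base_without_slash, "index.php"))  # http://www.baidu.com/index.php

# A leading '/' replaces the whole base path.
print(urljoin(base_with_slash, "/index.php"))    # http://www.baidu.com/index.php

# '..' climbs one directory level.
print(urljoin(base_with_slash, "../index.php"))  # http://www.baidu.com/index.php

# An absolute URL ignores the base entirely.
print(urljoin(base_with_slash, "http://example.com/a"))  # http://example.com/a
```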