Study「Python」: File Downloads

Latest recommended article published on 2024-01-25 21:20:34

Snowbowღ

Article link: https://blog.csdn.net/qq_41297934/article/details/105281754

Downloading files such as images from the network is a common task. This post introduces three ways to download a file.

I. The urlretrieve function

(1) Introduction to urlretrieve

urllib.request.urlretrieve(url, filename=None, reporthook=None, data=None)

def urlretrieve(url, filename=None, reporthook=None, data=None):
    """
    Retrieve a URL into a temporary location on disk.

    Requires a URL argument. If a filename is passed, it is used as
    the temporary file location. The reporthook argument should be
    a callable that accepts a block number, a read size, and the
    total file size of the URL target. The data argument should be
    valid URL encoded data.

    If a filename is passed and the URL points to a local resource,
    the result is a copy from local file to new file.

    Returns a tuple containing the path to the newly created
    data file as well as the resulting HTTPMessage object.
    """

In summary:

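url: the address of the resource. It may point to a local resource; in that case, if a filename is also given, the call amounts to copying the file to a new location.
filename: the local path to save to. If it is omitted, urllib.request creates a temporary file to hold the data.
reporthook: a callback that takes three arguments: the block number, the block size, and the total file size. It can be used to display the current download progress.
data: valid URL-encoded data, passed as bytes. Supplying it turns an HTTP request into a POST.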
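
To see the return value without touching the network, here is a minimal sketch that passes a file:// URL, so urlretrieve simply copies a local file. The helper name copy_with_urlretrieve and the file names are made up for illustration.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def copy_with_urlretrieve(src: Path, dst: Path):
    """Copy a local file by handing its file:// URL to urlretrieve."""
    # urlretrieve returns (path of the saved file, headers object)
    saved_path, headers = urlretrieve(src.as_uri(), str(dst))
    return saved_path, headers


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "source.txt"
    dst = Path(tmp) / "copy.txt"
    src.write_bytes(b"hello urlretrieve")

    saved_path, headers = copy_with_urlretrieve(src, dst)
    copied = dst.read_bytes()
    # For file:// URLs the headers carry the size of the local file
    reported_length = headers["Content-Length"]
    print(saved_path, reported_length)
```

The first element of the tuple is the filename that was passed in, and the second behaves like a mapping of response headers.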

(2) Download progress

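As mentioned above, reporthook lets us call a function that displays the download progress. Let us review what its three arguments (called a, b, and c here) mean:

Block number (a): block numbers run through $\{0, 1, 2, \ldots, n\}$. The hook is called once with 0 when the connection is established and once more after each block is read, so if the current block number is 2, then 2 blocks have been downloaded. We can simply treat a as the number of blocks downloaded so far.
Block size (b): the size of each block, in bytes. The last block is usually smaller than b, so a × b can slightly exceed the file size at the end.
Total file size (c): the total size of the file, in bytes. It is -1 when the size is unknown, for example when the server sends no Content-Length header.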

A simple formula then gives the fraction of the whole file downloaded so far:

$per = \frac{a\times b}{c}$

Progress is usually shown as a percentage. The code is as follows:

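def schedule(a, b, c):
    # a: number of blocks downloaded so far
    # b: size of each block, in bytes
    # c: total size of the file, in bytes (-1 if the server does not report it)
    if c <= 0:
        # Without a known total, a percentage cannot be computed
        print('%d bytes' % (a * b))
        return
    per = min(100.0, 100.0 * a * b / c)
    print('%.2f%%' % per)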
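
As a worked example of the formula, the sketch below downloads a 20000-byte local file through a file:// URL and records every call the hook receives. percent_done and record are hypothetical helper names. The block size is whatever urlretrieve passes in (8192 bytes in CPython as far as I know, which is an implementation detail rather than a guarantee).

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

FILE_SIZE = 20000  # illustrative size, larger than one block


def percent_done(a, b, c):
    """Percentage downloaded, capped at 100, or None if the total is unknown."""
    if c <= 0:
        return None
    return min(100.0, 100.0 * a * b / c)


calls = []


def record(a, b, c):
    # Store the raw arguments together with the computed percentage
    calls.append((a, b, c, percent_done(a, b, c)))


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "payload.bin"
    src.write_bytes(b"x" * FILE_SIZE)
    urlretrieve(src.as_uri(), str(Path(tmp) / "out.bin"), record)

for a, b, c, per in calls:
    print(a, b, c, per)
```

With a block size of 8192 the hook sees a = 0, 1, 2, 3. The last call gives 3 × 8192 = 24576, which is more than 20000, and that is why the percentage has to be capped at 100.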

(3) Example

As an example, let us download the Baidu home page as an .html file:

from urllib.request import urlretrieve


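def schedule(a, b, c):
    # a: number of blocks downloaded so far
    # b: size of each block, in bytes
    # c: total size of the file, in bytes (-1 if the server does not report it)
    if c <= 0:
        # Without a known total, a percentage cannot be computed
        print('%d bytes' % (a * b))
        return
    per = min(100.0, 100.0 * a * b / c)
    print('%.2f%%' % per)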


if __name__ == '__main__':

    url = 'http://www.baidu.com'
    filepath = './baidu.html'
    urlretrieve(url, filepath, schedule)

II. The urlopen function

(1) Introduction to urlopen

urllib.request.urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *, cafile=None, capath=None, cadefault=False, context=None)

def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
            *, cafile=None, capath=None, cadefault=False, context=None):
    '''Open the URL url, which can be either a string or a Request object.

    *data* must be an object specifying additional data to be sent to
    the server, or None if no such data is needed.  See Request for
    details.

    urllib.request module uses HTTP/1.1 and includes a "Connection:close"
    header in its HTTP requests.

    The optional *timeout* parameter specifies a timeout in seconds for
    blocking operations like the connection attempt (if not specified, the
    global default timeout setting will be used). This only works for HTTP,
    HTTPS and FTP connections.

    If *context* is specified, it must be a ssl.SSLContext instance describing
    the various SSL options. See HTTPSConnection for more details.

    The optional *cafile* and *capath* parameters specify a set of trusted CA
    certificates for HTTPS requests. cafile should point to a single file
    containing a bundle of CA certificates, whereas capath should point to a
    directory of hashed certificate files. More information can be found in
    ssl.SSLContext.load_verify_locations().

    The *cadefault* parameter is ignored.

    This function always returns an object which can work as a context
    manager and has methods such as

    * geturl() - return the URL of the resource retrieved, commonly used to
      determine if a redirect was followed

    * info() - return the meta-information of the page, such as headers, in the
      form of an email.message_from_string() instance (see Quick Reference to
      HTTP Headers)

    * getcode() - return the HTTP status code of the response.  Raises URLError
      on errors.

    For HTTP and HTTPS URLs, this function returns a http.client.HTTPResponse
    object slightly modified. In addition to the three new methods above, the
    msg attribute contains the same information as the reason attribute ---
    the reason phrase returned by the server --- instead of the response
    headers as it is specified in the documentation for HTTPResponse.

    For FTP, file, and data URLs and requests explicitly handled by legacy
    URLopener and FancyURLopener classes, this function returns a
    urllib.response.addinfourl object.

    Note that None may be returned if no handler handles the request (though
    the default installed global OpenerDirector uses UnknownHandler to ensure
    this never happens).

    In addition, if proxy settings are detected (for example, when a *_proxy
    environment variable like http_proxy is set), ProxyHandler is default
    installed and makes sure the requests are handled through the proxy.

    '''
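
The three methods listed in the docstring can be tried offline with a data: URL, which urlopen handles without any network access. describe is a hypothetical helper name. For an HTTP URL getcode() would return the status code, such as 200; for a data: URL like this one it returns None, because no HTTP exchange takes place.

```python
from urllib.request import urlopen


def describe(url):
    """Open url and collect what geturl(), info() and getcode() report."""
    # The returned object is a context manager, so it is closed automatically
    with urlopen(url) as response:
        return {
            "url": response.geturl(),
            "content_type": response.info().get_content_type(),
            "code": response.getcode(),
            "body": response.read(),
        }


result = describe("data:text/plain,hello")
print(result)
```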

(2) Example

from urllib.request import urlopen


def download(url):
    # The with block closes the connection even if reading fails
    with urlopen(url) as f:
        html = f.read()
    with open("./baidu.html", "wb") as fb:
        fb.write(html)


if __name__ == '__main__':
    my_url = 'http://www.baidu.com'

    download(my_url)
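
The example above reads the whole response into memory before writing it out. For large files, one possible variant streams the response to disk in chunks with shutil.copyfileobj. The name download_streamed and the timeout and chunk_size defaults are assumptions for illustration, and the demo uses a local file:// URL so that it runs offline.

```python
import shutil
import tempfile
from pathlib import Path
from urllib.request import urlopen


def download_streamed(url, dest, timeout=10.0, chunk_size=64 * 1024):
    """Copy the response body to dest chunk by chunk and return the byte count."""
    with urlopen(url, timeout=timeout) as response, open(dest, "wb") as out:
        # copyfileobj reads at most chunk_size bytes at a time
        shutil.copyfileobj(response, out, chunk_size)
    return Path(dest).stat().st_size


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "big.bin"
    dest = Path(tmp) / "downloaded.bin"
    src.write_bytes(b"\x00" * 200_000)

    written = download_streamed(src.as_uri(), dest)
    same_content = dest.read_bytes() == src.read_bytes()
    print(written, same_content)
```

The same function should work with an http:// or https:// URL, where the timeout applies to the blocking network operations.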

III. The get function

(1) Introduction to get

requests.get(url, params=None, **kwargs)

def get(url, params=None, **kwargs):
    r"""Sends a GET request.

    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary, list of tuples or bytes to send
        in the query string for the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

(2) Example

import requests


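def download(url):
    r = requests.get(url, timeout=10)
    r.raise_for_status()
    # Write the raw bytes so the saved file does not depend on encoding guesses
    with open("./baidu.html", "wb") as f:
        f.write(r.content)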


if __name__ == '__main__':
    my_url = 'http://www.baidu.com'

    download(my_url)
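
The params argument from the docstring is not used in the example above. To make it concrete, this small sketch prepares a request without sending it and prints the URL that requests would use. build_url is a hypothetical helper, and the /s path and the wd key are only illustrative values.

```python
import requests


def build_url(url, params):
    """Return the final URL requests would send for a GET with these params."""
    # prepare() encodes params into the query string; nothing is sent
    prepared = requests.Request("GET", url, params=params).prepare()
    return prepared.url


final_url = build_url("http://www.baidu.com/s", {"wd": "python"})
print(final_url)
```

Passing the same dictionary as requests.get(url, params=...) produces the same query string on the actual request.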