Python: No More Headaches Over Downloading Files

litchi125

Category: python    Tags: python, multithreading, file download

Article link: https://blog.csdn.net/weixin_40950781/article/details/107299198

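Eleven ways to download files with Python

0x01 The requests library
0x02 The wget library
0x03 Downloading a redirected file
0x04 Downloading very large files in chunks
0x05 Downloading multiple files (parallel/batch download)
0x06 Downloading with a progress bar
0x07 Downloading a web page with urllib
0x08 Downloading through a proxy
0x09 Using urllib3
0x10 Downloading files from S3 with Boto3
0x11 Using asyncio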

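0x01 The requests library

Call the get method of the requests module and write the response body straight to disk. The whole body is held in memory, so this suits small files.

import requests
base_url = "https://www.python.org/static/img/python-logo.png"
res = requests.get(base_url)
# the ./data directory must already exist
with open("./data/demo1.png", "wb") as f:
    f.write(res.content)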
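
The snippet above writes whatever the server returns, including an error page. A minimal sketch of a safer variant follows. save_response is a hypothetical helper name, not part of requests, and it assumes only that the object passed in offers the raise_for_status() method and content attribute that a requests response provides.

```python
from pathlib import Path


def save_response(res, path):
    """Write res.content to path, but only if the HTTP status was OK.

    save_response is an illustrative helper, not a requests API.
    `res` is expected to behave like a requests.Response.
    """
    res.raise_for_status()  # on a real response this raises for 4xx/5xx
    target = Path(path)
    # create ./data (or any other missing parent directory) before writing
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(res.content)
    return target


# Usage (needs network access and the requests package):
#   import requests
#   save_response(requests.get(base_url, timeout=10), "./data/demo1.png")
```

Checking the status first means a 404 page never ends up saved under the name of the file you wanted.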

0x02 The wget library

Use the download method of the wget library: url is the source and out is the output path.

import wget
base_url = "https://www.python.org/static/img/python-logo.png"
wget.download(url=base_url, out="./data/demo2.png")

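0x03 Downloading a redirected file

Some resources redirect to another URL when they are viewed or downloaded. requests.get follows redirects by default; passing allow_redirects=True states that intent explicitly.

import requests
base_url = "https://readthedocs.org/projects/python-guide/downloads/pdf/latest"
res = requests.get(url=base_url, allow_redirects=True)
with open("./data/demo3.pdf", "wb") as f:
    f.write(res.content)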
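
To check where a download actually came from, the hops can be listed after the request. The sketch below relies on the history, status_code and url attributes of a requests response; redirect_chain is a hypothetical helper name introduced here for illustration.

```python
def redirect_chain(res):
    """Return [(status_code, url), ...] for each redirect hop plus the final response."""
    hops = [(r.status_code, r.url) for r in res.history]
    hops.append((res.status_code, res.url))
    return hops


# Usage (needs network access and the requests package):
#   import requests
#   res = requests.get(base_url, allow_redirects=True)
#   for status, url in redirect_chain(res):
#       print(status, url)
```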

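0x04 Downloading very large files in chunks

When downloading a very large file, the whole response should not be loaded into memory. Pass stream=True so the body is read as a byte stream, then write it out piece by piece. The code below shows two ways to do this. The relevant names are:

chunk_size: the maximum number of bytes read per iteration (1024 bytes here)
iter_content: iterates over the body chunk by chunk
iter_lines: iterates over the body line by line, which suits text rather than binary files

import requests

base_url = 'https://readthedocs.org/projects/python-guide/downloads/pdf/latest'

# Method 1
res = requests.get(url=base_url, stream=True, allow_redirects=True)
with open('./data/demo4_1.pdf', 'wb') as f:
    # iter_content: iterate over the body chunk by chunk
    # chunk_size: at most 1024 bytes are read and written per iteration
    for chunk in res.iter_content(chunk_size=1024):
        f.write(chunk)

# Method 2: the with statement closes the connection when the block ends
with requests.get(url=base_url, stream=True) as res:
    with open('./data/demo4_2.pdf', 'wb') as f:
        for chunk in res.iter_content(chunk_size=1024):
            if chunk:  # skip empty chunks
                f.write(chunk)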
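
The chunked loop can be wrapped in a small function that also reports how many bytes were written. This is a sketch under assumptions: stream_to_file is a hypothetical name, the 8192-byte default is an arbitrary choice rather than a value recommended by requests, and res is expected to be a response opened with stream=True.

```python
def stream_to_file(res, path, chunk_size=8192):
    """Copy a streamed response body to path chunk by chunk; return bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in res.iter_content(chunk_size=chunk_size):
            if chunk:  # ignore empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


# Usage (needs network access and the requests package):
#   import requests
#   with requests.get(base_url, stream=True) as res:
#       res.raise_for_status()
#       print(stream_to_file(res, "./data/demo4_3.pdf"))
```

Only one chunk is held in memory at a time, so memory use stays flat regardless of file size.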

0x05 Downloading multiple files (parallel/batch download)

import requests
from time import time
from multiprocessing.pool import ThreadPool
urls = [("Event1", "https://www.python.org/events/python-events/805/"),
        ("Event2", "https://www.python.org/events/python-events/801/"),
        ("Event3", "https://www.python.org/events/python-events/790/"),
        ("Event4", "https://www.python.org/events/python-events/798/"),
        ("Event5", "https://www.python.org/events/python-events/807/"),
        ("Event6", "https://www.python.org/events/python-events/807/"),
        ("Event7", "https://www.python.org/events/python-events/757/"),
        ("Event8", "https://www.python.org/events/python-user-group/816/")]
def url_response(url):
    path, url = url
    r = requests.get(url, stream=True)
    with open("./data/" + path, 'wb') as f:
        for ch in r:
            f.write(ch)
# Without threads
start = time()
for x in urls:
    url_response(x)
print(f"Time to download: {time() - start}")

# With threads
start = time()
with ThreadPool(9) as pool:
    # imap_unordered returns at once; iterating waits for every download to finish
    for _ in pool.imap_unordered(url_response, urls):
        pass
print(f"Time to download: {time() - start}")
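
The same fan-out can be written with concurrent.futures from the standard library, which is a swap for multiprocessing.pool.ThreadPool rather than what the code above uses. In this sketch download_all and its fetch parameter are hypothetical names; fetch stands for any function that downloads one (name, url) pair, such as url_response. Failures are collected per item instead of aborting the batch.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(items, fetch, max_workers=8):
    """Run fetch(item) for every item in a thread pool.

    Returns (succeeded, failed): a list of item names and a dict name -> exception.
    """
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map each future back to the name of the item it is downloading
        futures = {pool.submit(fetch, item): item[0] for item in items}
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()  # re-raises any exception raised inside fetch
                succeeded.append(name)
            except Exception as exc:
                failed[name] = exc
    return succeeded, failed


# Usage:
#   ok, errors = download_all(urls, url_response)
```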

0x06 Downloading with a progress bar

A progress bar shows how far the download has got.

In Python %-formatting, a literal percent sign is written as %%.

import sys
import requests
from clint.textui import progress
url = 'http://do1.dr-chuck.com/pythonlearn/EN_us/pythonlearn.pdf'
r = requests.get(url, stream=True)
with open("LearnPython.pdf", "wb") as f:
    # assumes the server sends a Content-Length header
    total_length = int(r.headers.get('content-length'))
    datalength = 0
    for ch in progress.bar(r.iter_content(chunk_size=1024), expected_size=total_length // 1024 + 1):
        if ch:
            datalength += len(ch)
            f.write(ch)
            # hand-drawn bar, 50 characters wide
            done = int(50 * datalength / total_length)
            sys.stdout.write("\r[%s%s] %d%%" % ('=' * done, ' ' * (50 - done), done * 2))
            sys.stdout.flush()
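
The arithmetic behind the hand-drawn bar can be isolated in a pure function, which makes it easy to check without downloading anything. render_bar is a hypothetical helper; the default width of 50 characters matches the code above.

```python
def render_bar(done_bytes, total_bytes, width=50):
    """Return a text progress bar, e.g. a row of '=' followed by a percentage."""
    if total_bytes <= 0:
        # no usable Content-Length: show only the byte count
        return f"{done_bytes} bytes"
    fraction = min(done_bytes / total_bytes, 1.0)
    filled = int(width * fraction)
    return "[%s%s] %d%%" % ("=" * filled, " " * (width - filled), int(fraction * 100))


# Usage inside the download loop:
#   sys.stdout.write("\r" + render_bar(datalength, total_length))
```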

0x07 Downloading a web page with urllib

urllib is part of the Python standard library.

import urllib.request
urllib.request.urlretrieve('https://www.python.org/', './data/demo7.html')
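
urlretrieve also accepts a third argument, a callback that is invoked as blocks arrive with the block count, the block size and the total size. A minimal sketch of such a hook follows; percent_done and report are hypothetical names.

```python
import urllib.request


def percent_done(block_num, block_size, total_size):
    """Convert urlretrieve's hook arguments to a percentage, or None if the size is unknown."""
    if total_size <= 0:  # urlretrieve passes -1 when there is no Content-Length
        return None
    return min(100, block_num * block_size * 100 // total_size)


def report(block_num, block_size, total_size):
    """Progress hook with the signature urlretrieve expects."""
    pct = percent_done(block_num, block_size, total_size)
    print("downloaded: unknown size" if pct is None else f"downloaded: {pct}%")


# Usage (needs network access):
#   urllib.request.urlretrieve('https://www.python.org/', './data/demo7.html', report)
```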

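0x08 Downloading through a proxy

If you need to download through a proxy, you can use ProxyHandler from the urllib.request module. The code below creates the proxy handler, passes it to build_opener to obtain an opener that routes requests through the proxy, installs that opener as the default, and then requests the page.

import urllib.request
# the mapping is keyed by URL scheme, so an https URL needs an 'https' entry
myProxy = urllib.request.ProxyHandler({'http': '127.0.0.2', 'https': '127.0.0.2'})

openProxy = urllib.request.build_opener(myProxy)
# urlretrieve goes through the installed default opener
urllib.request.install_opener(openProxy)

urllib.request.urlretrieve('https://www.python.org/', './data/demo8.html')

You can also use the requests module, as described in its official documentation:

import requests
myProxy = {'http': 'http://127.0.0.2:3001', 'https': 'http://127.0.0.2:3001'}
requests.get("https://www.python.org/", proxies=myProxy)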
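
An opener returned by build_opener can also be used directly through its open method, without touching the process-wide default. The sketch below assumes a placeholder proxy at 127.0.0.2:3001 (the address used above, not a real service); make_proxy_opener is an illustrative name.

```python
import urllib.request


def make_proxy_opener(proxy_url):
    """Build an opener that sends both http and https requests through proxy_url."""
    # ProxyHandler maps URL scheme -> proxy address
    proxies = {"http": proxy_url, "https": proxy_url}
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


opener = make_proxy_opener("http://127.0.0.2:3001")  # placeholder address

# Usage (needs a reachable proxy):
#   with opener.open("https://www.python.org/", timeout=10) as res:
#       html = res.read()
```

Keeping the opener local means other code in the same process that calls urlopen is unaffected.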

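0x09 Using urllib3

urllib3 is a third-party HTTP library that improves on the standard urllib module. Here it fetches a web page and stores it in a text file. The shutil module handles the file copy, and urllib3's PoolManager keeps track of the necessary connection pools.

import urllib3, shutil

base_url = "https://www.python.org/"
c = urllib3.PoolManager()

filename = "./data/demo9.txt"

# preload_content=False streams the body instead of reading it all at once
with c.request('GET', base_url, preload_content=False) as res, open(filename, 'wb') as out_file:
    shutil.copyfileobj(res, out_file)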

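0x10 Downloading files from S3 with Boto3

To download files from Amazon S3, you can use the Python boto3 module.
Before starting, install the awscli module with pip:

pip install awscli

To configure AWS, run the following command:

aws configure

Now enter your details: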

AWS Access Key ID [None]: (The access key)
AWS Secret Access Key [None]: (Secret access key)
Default region name [None]: (Region)
Default output format [None]: (Json)

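To download a file from Amazon S3, import boto3 and botocore. Boto3 is the Amazon SDK for Python, used to access Amazon Web Services such as S3. Botocore is the low-level library that both boto3 and the AWS CLI are built on.
Botocore is installed together with awscli. To install boto3, run the following command:

pip install boto3

Now import the two modules: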

import boto3, botocore

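Downloading a file from Amazon needs three parameters:

the name of the bucket
the name of the file to download
the name to give the file once it is downloaded

Initialize the variables:

bucket = "bucketName"
file_name = "filename"
downloaded_file = "downloadedfilename"

Next, initialize a variable holding the service resource by calling boto3's resource() method with the service name, s3:

service = boto3.resource('s3')

Finally, download the file with the download_file method, passing in the variables: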

service.Bucket(bucket).download_file(file_name, downloaded_file)

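0x11 Using asyncio

The asyncio module focuses on handling system events. It is built around an event loop, which waits for an event to occur and then reacts to it; the reaction may be a call to another function. This process is called event handling. asyncio uses coroutines to handle events. To use its event handling and coroutine features, import the asyncio module: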

import asyncio

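Now define an asynchronous coroutine like this:

async def coroutine():
    await my_func()

The async keyword marks this as a native asyncio coroutine. Inside the body, the await keyword suspends the coroutine until the awaited call finishes and then evaluates to its result. The return keyword can be used as well.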
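
A tiny runnable example shows what await hands back. Here my_func is a stand-in that pauses briefly and returns a number, and asyncio.run (Python 3.7+) starts the event loop; both choices are assumptions made for illustration.

```python
import asyncio


async def my_func():
    await asyncio.sleep(0.01)  # stands in for real I/O
    return 42


async def coroutine():
    value = await my_func()  # suspends here until my_func finishes
    return value + 1


result = asyncio.run(coroutine())
print(result)
```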

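Now let's use coroutines to write code that downloads files from the web:

import asyncio
import urllib.request

def fetch(url, filename):
    # urlopen blocks, so coroutine() below runs this in a worker thread
    with urllib.request.urlopen(url) as r, open(filename, 'wb') as f:
        for ch in r:
            f.write(ch)

async def coroutine(index, url):
    # one output file per URL so the downloads do not overwrite each other
    filename = f"./data/coroutine_download_{index}.txt"
    await asyncio.to_thread(fetch, url, filename)  # Python 3.9+
    print_msg = 'Successfully Downloaded'
    return print_msg

async def main_func(urls_to_download):
    # asyncio.wait() expects tasks; bare coroutines are rejected on Python 3.11+
    co = [asyncio.create_task(coroutine(i, url)) for i, url in enumerate(urls_to_download)]
    downloaded, downloading = await asyncio.wait(co)
    for i in downloaded:
        print(i.result())

urls_to_download = ["https://www.python.org/events/python-events/801/",
                    "https://www.python.org/events/python-events/790/",
                    "https://www.python.org/events/python-user-group/816/",
                    "https://www.python.org/events/python-events/757/"]
asyncio.run(main_func(urls_to_download))

In this code, coroutine is an asynchronous function that downloads one file and returns a message. Because urlopen blocks, the transfer itself runs in a worker thread through asyncio.to_thread, which lets the downloads overlap.

main_func is a second coroutine that wraps every URL in a task and queues them all; asyncio.wait then waits for the tasks to complete.

To start everything, asyncio.run() creates an event loop, runs main_func in it until it completes, and closes the loop. On older Python versions the same thing was done with get_event_loop() and run_until_complete().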
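
When the results are wanted in the same order as the input, asyncio.gather is a common alternative to asyncio.wait. The sketch below is a variant under stated assumptions, not the code above: download_many and fetch_bytes are hypothetical names, the fetch parameter accepts any blocking function that takes a URL, and asyncio.to_thread requires Python 3.9 or later.

```python
import asyncio
import urllib.request


def fetch_bytes(url):
    """Blocking download; returns the response body as bytes."""
    with urllib.request.urlopen(url, timeout=10) as r:
        return r.read()


async def download_many(urls, fetch=fetch_bytes):
    """Run fetch(url) for every URL in worker threads; results keep the input order."""
    jobs = [asyncio.to_thread(fetch, url) for url in urls]
    return await asyncio.gather(*jobs)


# Usage (needs network access):
#   bodies = asyncio.run(download_many(urls_to_download))
```

Passing fetch as a parameter keeps the concurrency logic separate from the download itself, so it can be exercised with a stub function.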

