Downloading a Large File from a URL in Chunks with Python Multiprocessing - multiprocessing requests

This article describes a way to download a large file in chunks using Python's multiprocessing and requests libraries. It sends a HEAD request to obtain the file size, computes the chunk boundaries, and then uses multiprocessing to spawn one process per chunk, speeding up the download. As it runs, each process writes its chunk into the local file at the correct offset, producing the complete file.

To download a large file from a URL in parallel chunks with Python multiprocessing, the code can be structured as follows:

  1. First, we send a HEAD request to obtain the size of the file to be downloaded and compute how many chunks the file needs to be split into.

  2. Then, we build a list describing every chunk. Each entry holds the chunk index, start offset, end offset, and the URL to download (the boundary arithmetic is sketched right after this overview).

  3. Next, we use multiprocessing.Process to create the desired number of processes and pass each chunk's information to the download_file function. Each process runs download_file to fetch the chunk it is responsible for.

  4. In download_file, we use requests.get to send a GET request carrying a Range header for the chunk, receive the response as a stream via stream=True, and write the chunk's binary data into the local file, using file.seek so the write position matches the chunk's offset in the source file.

  5. Finally, when the download completes, we print a message reporting that the file has been downloaded.

In this way, by splitting the file into chunks and downloading them across multiple processes, we can speed up the download and improve efficiency.
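As a minimal sketch of the arithmetic behind steps 1 and 2 (using the same placeholder URL as the full program below), note that HTTP Range offsets are inclusive, so a chunk starting at start should end at start + chunk_size - 1:

import requests

url = 'http://URL/abc.zip'  # placeholder, as in the article
num_chunks = 4

# Step 1: HEAD request to learn the total file size.
file_size = int(requests.head(url).headers['Content-Length'])

# Step 2: build the per-chunk (index, start, end, url) list.
# Range offsets are inclusive; the last chunk absorbs the remainder.
chunk_size = file_size // num_chunks
chunks = []
for index in range(num_chunks):
    start = index * chunk_size
    end = file_size - 1 if index == num_chunks - 1 else start + chunk_size - 1
    chunks.append((index, start, end, url))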

Full source for the Python multiprocess chunked download of a remote large file (the multiprocess logic wrapped in its own function):

import multiprocessing
import time
import requests


def download_file(start_pos, end_pos, url, file_name, file_size, shared_dict):
    # Request only this chunk's byte range; HTTP Range offsets are inclusive.
    headers = {'Range': 'bytes={}-{}'.format(start_pos, end_pos)}
    resp = requests.get(url, headers=headers, stream=True)
    # Open the pre-allocated file and write at this chunk's offset.
    with open(file_name, 'r+b') as file:
        file.seek(start_pos)
        for chunk in resp.iter_content(chunk_size=1024 * 1024):
            if chunk:
                file.write(chunk)
                # Note: += on a Manager dict is a read-modify-write and is not
                # atomic across processes, so the counter can undercount.
                shared_dict['total_bytes_downloaded'] += len(chunk)

    print("chunk file with start_pos ~ end_pos: {} ~ {}, Download successfully!".format(start_pos, end_pos))
    print("{}/{}".format(shared_dict['total_bytes_downloaded'], file_size))
    percent = float(shared_dict['total_bytes_downloaded']) * 100 / file_size
    print("Download %.2f%% of the file." % percent)

def download_multiprocessing(url, file_name, file_size, shared_dict):
    # use one worker process per CPU core
    num_processes = multiprocessing.cpu_count()
    print("number of CPU is {}, number of process is {}".format(multiprocessing.cpu_count(), num_processes))

    # pre-allocate an empty local file with the same size as the remote file
    with open(file_name, "wb") as f:
        f.truncate(file_size)

    # split the file into num_processes chunks; the last process also picks up the remainder
    chunk_size = file_size // num_processes
    print("chunk_size is {}".format(chunk_size))
    # keep handles to the worker processes so we can join them later
    processes = []
    # create a process for each chunk
    print(range(num_processes))
    for index in range(num_processes):
        print("index:",index)
        #print("process: {}".format(index))
        start_pos = index * chunk_size
        end_pos = start_pos + chunk_size
        # last process will download the remaining bytes
        if index == num_processes - 1:
            end_pos = file_size - 1
            print("end_pos",end_pos)

        args = (start_pos, end_pos, url, file_name, file_size, shared_dict)
        process = multiprocessing.Process(target=download_file, args=args)

        process.start()
        print(process)
        processes.append(process)


    # wait for all the processes to finish
    for process in processes:
        process.join()
        print(process)

def main():

    url = 'http://URL/abc.zip'
    file_name = './requests_largefile.zip'
    # a single HEAD request is enough to learn the remote file size
    response = requests.head(url)
    print(response)
    file_size = int(response.headers["Content-Length"])
    print("url file_size is {} bytes".format(file_size))

    manager = multiprocessing.Manager()
    shared_dict = manager.dict(total_bytes_downloaded=0)
    multiprocess_download_start = time.time()
    download_multiprocessing(url, file_name, file_size, shared_dict)
    multiprocess_download_end = time.time()
    multiprocess_download_cost = multiprocess_download_end - multiprocess_download_start
    print("{}/{}".format(shared_dict['total_bytes_downloaded'], file_size))
    print("Full file Download successfully! Cost time: {:.2f}s".format(multiprocess_download_cost))

if __name__ == "__main__":
    main()
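
Chunked downloading only works when the server honors byte-range requests. As an optional pre-flight check (a sketch, not part of the script above), you could inspect the standard Accept-Ranges header from the HEAD response before taking the multiprocess path:

def supports_range_requests(url):
    # servers that accept partial-content requests advertise "Accept-Ranges: bytes"
    resp = requests.head(url)
    return resp.headers.get('Accept-Ranges', '').lower() == 'bytes'

If the check fails, falling back to a plain single-stream requests.get download is the safer choice.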

Sample run output:

$ python multi_process_download_url_bigfile_def_requests.py
<Response [200]>
url file_size is 368162450 bytes
number of CPU is 4, number of process is 4
chunk_size is 92040612
[0, 1, 2, 3]
('index:', 0)
<Process(Process-2, started)>
('index:', 1)
<Process(Process-3, started)>
('index:', 2)
<Process(Process-4, started)>
('index:', 3)
('end_pos', 368162449)
<Process(Process-5, started)>
chunk file with start_pos ~ end_pos: 92040612 ~ 184081224, Download successfully!
262144000/368162450
Download 71.20% of the file.
chunk file with start_pos ~ end_pos: 184081224 ~ 276121836, Download successfully!
291270053/368162450
Download 79.40% of the file.
chunk file with start_pos ~ end_pos: 276121836 ~ 368162449, Download successfully!
319347531/368162450
Download 87.03% of the file.
chunk file with start_pos ~ end_pos: 0 ~ 92040612, Download successfully!
358959344/368162450
Download 97.50% of the file.
<Process(Process-2, stopped)>
<Process(Process-3, stopped)>
<Process(Process-4, stopped)>
<Process(Process-5, stopped)>
358959344/368162450
Full file Download successfully! Cost time: 5.03s
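
Note that the final counter (358959344/368162450) falls short of the real file size even though the file itself downloads completely: the += on the Manager dict is a read-modify-write that is not atomic across processes, so concurrent increments can be lost. A minimal fix (a sketch, not reflected in the run above) is to guard the counter with a lock created by the same Manager and passed to each worker through args, just like shared_dict:

lock = manager.Lock()  # created in main(), next to the Manager dict

# inside download_file, replace the bare increment with:
with lock:
    shared_dict['total_bytes_downloaded'] += len(chunk)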
