Downloading a Large File from a URL in Chunks with Python Multiprocessing - multiprocessing requests

This article describes a way to download a large file in chunks using Python's multiprocessing and requests libraries. It sends a HEAD request to obtain the file size, computes the chunk boundaries, and then uses multiprocessing to spawn one process per chunk, speeding up the download. As it runs, each process writes its chunk into the local file at the correct offset, producing the complete file.

To download a large file from a URL in parallel chunks with Python multiprocessing, the code can be structured as follows:

  1. First, we send a HEAD request to obtain the size of the file to be downloaded and compute how many chunks the file needs to be split into.

  2. Then, we build a list describing every chunk. Each entry holds the chunk index, start offset, end offset, and the URL to download (the boundary arithmetic is sketched right after this overview).

  3. Next, we use multiprocessing.Process to create the desired number of processes and pass each chunk's information to the download_file function. Each process runs download_file to fetch the chunk it is responsible for.

  4. In download_file, we use requests.get to send a GET request carrying a Range header for the chunk, receive the response as a stream via stream=True, and write the chunk's binary data into the local file, using file.seek so the write position matches the chunk's offset in the source file.

  5. Finally, when the download completes, we print a message reporting that the file has been downloaded.

In this way, by splitting the file into chunks and downloading them across multiple processes, we can speed up the download and improve efficiency.
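As a minimal sketch of the arithmetic behind steps 1 and 2 (using the same placeholder URL as the full program below), note that HTTP Range offsets are inclusive, so a chunk starting at start should end at start + chunk_size - 1:

import requests

url = 'http://URL/abc.zip'  # placeholder, as in the article
num_chunks = 4

# Step 1: HEAD request to learn the total file size.
file_size = int(requests.head(url).headers['Content-Length'])

# Step 2: build the per-chunk (index, start, end, url) list.
# Range offsets are inclusive; the last chunk absorbs the remainder.
chunk_size = file_size // num_chunks
chunks = []
for index in range(num_chunks):
    start = index * chunk_size
    end = file_size - 1 if index == num_chunks - 1 else start + chunk_size - 1
    chunks.append((index, start, end, url))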

Full source for the Python multiprocess chunked download of a remote large file (the multiprocess logic wrapped in its own function):

import multiprocessing
import time
import requests


def download_file(start_pos, end_pos, url, file_name, file_size, shared_dict):
    # Request only this chunk's byte range; HTTP Range offsets are inclusive.
    headers = {'Range': 'bytes={}-{}'.format(start_pos, end_pos)}
    resp = requests.get(url, headers=headers, stream=True)
    # Open the pre-allocated file and write at this chunk's offset.
    with open(file_name, 'r+b') as file:
        file.seek(start_pos)
        for chunk in resp.iter_content(chunk_size=1024 * 1024):
            if chunk:
                file.write(chunk)
                # Note: += on a Manager dict is a read-modify-write and is not
                # atomic across processes, so the counter can undercount.
                shared_dict['total_bytes_downloaded'] += len(chunk)

    print("chunk file with start_pos ~ end_pos: {} ~ {}, Download successfully!".format(start_pos, end_pos))
    print("{}/{}".format(shared_dict['total_bytes_downloaded'], file_size))
    percent = float(shared_dict['total_bytes_downloaded']) * 100 / file_size
    print("Download %.2f%% of the file." % percent)

def download_multiprocessing(url, file_name, file_size, shared_dict):
    # use one worker process per CPU core
    num_processes = multiprocessing.cpu_count()
    print("number of CPU is {}, number of process is {}".format(multiprocessing.cpu_count(), num_processes))

    # pre-allocate an empty local file with the same size as the remote file
    with open(file_name, "wb") as f:
        f.truncate(file_size)

    # split the file into num_processes chunks; the last process also picks up the remainder
    chunk_size = file_size // num_processes
    print("chunk_size is {}".format(chunk_size))
    # keep handles to the worker processes so we can join them later
    processes = []
    # create a process for each chunk
    print(range(num_processes))
    for index in range(num_processes):
        print("index:",index)
        #print("process: {}".format(index))
        start_pos = index * chunk_size
        end_pos = start_pos + chunk_size
        # last process will download the remaining bytes
        if index == num_processes - 1:
            end_pos = file_size - 1
            print("end_pos",end_pos)

        args = (start_pos, end_pos, url, file_name, file_size, shared_dict)
        process = multiprocessing.Process(target=download_file, args=args)

        process.start()
        print(process)
        processes.append(process)


    # wait for all the processes to finish
    for process in processes:
        process.join()
        print(process)

def main():

    url = 'http://URL/abc.zip'
    file_name = './requests_largefile.zip'
    # a single HEAD request is enough to learn the remote file size
    response = requests.head(url)
    print(response)
    file_size = int(response.headers["Content-Length"])
    print("url file_size is {} bytes".format(file_size))

    manager = multiprocessing.Manager()
    shared_dict = manager.dict(total_bytes_downloaded=0)
    multiprocess_download_start = time.time()
    download_multiprocessing(url, file_name, file_size, shared_dict)
    multiprocess_download_end = time.time()
    multiprocess_download_cost = multiprocess_download_end - multiprocess_download_start
    print("{}/{}".format(shared_dict['total_bytes_downloaded'], file_size))
    print("Full file Download successfully! Cost time: {:.2f}s".format(multiprocess_download_cost))

if __name__ == "__main__":
    main()
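
Chunked downloading only works when the server honors byte-range requests. As an optional pre-flight check (a sketch, not part of the script above), you could inspect the standard Accept-Ranges header from the HEAD response before taking the multiprocess path:

def supports_range_requests(url):
    # servers that accept partial-content requests advertise "Accept-Ranges: bytes"
    resp = requests.head(url)
    return resp.headers.get('Accept-Ranges', '').lower() == 'bytes'

If the check fails, falling back to a plain single-stream requests.get download is the safer choice.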

Sample run output:

$ python multi_process_download_url_bigfile_def_requests.py
<Response [200]>
url file_size is 368162450 bytes
number of CPU is 4, number of process is 4
chunk_size is 92040612
[0, 1, 2, 3]
('index:', 0)
<Process(Process-2, started)>
('index:', 1)
<Process(Process-3, started)>
('index:', 2)
<Process(Process-4, started)>
('index:', 3)
('end_pos', 368162449)
<Process(Process-5, started)>
chunk file with start_pos ~ end_pos: 92040612 ~ 184081224, Download successfully!
262144000/368162450
Download 71.20% of the file.
chunk file with start_pos ~ end_pos: 184081224 ~ 276121836, Download successfully!
291270053/368162450
Download 79.40% of the file.
chunk file with start_pos ~ end_pos: 276121836 ~ 368162449, Download successfully!
319347531/368162450
Download 87.03% of the file.
chunk file with start_pos ~ end_pos: 0 ~ 92040612, Download successfully!
358959344/368162450
Download 97.50% of the file.
<Process(Process-2, stopped)>
<Process(Process-3, stopped)>
<Process(Process-4, stopped)>
<Process(Process-5, stopped)>
358959344/368162450
Full file Download successfully! Cost time: 5.03s
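
Note that the final counter (358959344/368162450) falls short of the real file size even though the file itself downloads completely: the += on the Manager dict is a read-modify-write that is not atomic across processes, so concurrent increments can be lost. A minimal fix (a sketch, not reflected in the run above) is to guard the counter with a lock created by the same Manager and passed to each worker through args, just like shared_dict:

lock = manager.Lock()  # created in main(), next to the Manager dict

# inside download_file, replace the bare increment with:
with lock:
    shared_dict['total_bytes_downloaded'] += len(chunk)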
