Downloading a Large Remote File in Chunks with Python Multiprocessing - multiprocessing paramiko

To download a large remote file in chunks with Python multiprocessing, the code can be organized around the following steps:

  1. Import the required modules: paramiko, multiprocessing, and time.
  2. Create the download function: a function that downloads one chunk of the file. It opens an SSH connection to the remote server and transfers the specified byte range over SFTP into the local file. sftp_file.seek() and file.seek() position both files so that exactly the right block is downloaded.
  3. Main function: set the remote host, port, username, password, remote file path, and local storage path, then decide how many chunks to split the file into. multiprocessing.cpu_count() returns the number of CPUs, and the file is divided into that many chunks, one per CPU. A Process instance is created for each chunk, and process.join() waits for every process to finish before a completion message is printed.
  4. Run the main function: by computing the start and end byte offsets that each process is responsible for, the chunk-download function runs concurrently in one Process per chunk, giving a true chunked download of the server-side file; all processes are then joined so the program waits for every chunk to finish (see the short sketch right after this list).
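As a quick illustration of the byte-range computation in step 4, here is a minimal sketch using the file size from the sample run below (63376366 bytes split across 4 processes):

file_size = 63376366                  # sftp.stat(remote_path).st_size
processes = 4                         # multiprocessing.cpu_count()
chunk_size = file_size // processes   # 15844091
ranges = [(i * chunk_size,
           file_size if i == processes - 1 else (i + 1) * chunk_size)
          for i in range(processes)]
# [(0, 15844091), (15844091, 31688182),
#  (31688182, 47532273), (47532273, 63376366)]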

Full source for the multiprocess chunked download (process management written directly in main()):

import paramiko
import multiprocessing
import time

def download_chunk_file(start_pos, end_pos, remote_path, local_path, ssh_info):
    print("download_chunk_file start")
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(**ssh_info)

    sftp = client.open_sftp()
    # open both local and remote file
    local_file = open(local_path, "r+b")
    remote_file = sftp.open(remote_path, "rb")

    # seek both files to the chunk's start position
    local_file.seek(start_pos)
    remote_file.seek(start_pos)

    #print("start_pos ~ end_pos: {} ~ {}".format(start_pos, end_pos))
    # download exactly this chunk: read() may return fewer bytes than
    # requested, so track how many bytes this chunk still needs
    bytes_remaining = end_pos - start_pos
    while bytes_remaining > 0:
        # read chunk data from the remote file
        buffer = remote_file.read(bytes_remaining)
        if not buffer:
            break
        # write chunk data to the local file
        local_file.write(buffer)
        bytes_remaining -= len(buffer)
    print("chunk file with start_pos ~ end_pos: {} ~ {}, Download successfully!".format(start_pos, end_pos))
    remote_file.close()
    local_file.close()
    client.close()
    print("download_chunk_file end")

def main():
    print("main start")
    host = "host"
    port = 22
    username = "username"
    password = "password"

    remote_path = '/remote_dir/remote_file'
    local_path = '/local_dir/local_file'

    ssh_info = {
        "hostname": host,
        "port": port,
        "username": username,
        "password": password,
    }

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(**ssh_info)
    sftp = client.open_sftp()

    file_size = sftp.stat(remote_path).st_size

    sftp.close()
    client.close()

    # use one worker process per CPU core
    processes = multiprocessing.cpu_count()
    #processes = 1
    print("number of CPU is {}".format(processes))
    # chunk size: the first n-1 processes each download chunk_size bytes,
    # and the last process also picks up any remainder
    chunk_size = file_size // processes

    # pre-create the local file at the full remote size so every
    # process can seek and write its own region independently
    with open(local_path, "wb") as f:
        f.truncate(file_size)

    multiprocess_download_start = time.time()
    process_list = []
    for i in range(processes):
        #print("process: {}".format(i))
        start_pos = i * chunk_size
        end_pos = file_size if i == processes - 1 else (i + 1) * chunk_size  # last process also covers the remainder
        # multi processing to function download_chunk_file
        p = multiprocessing.Process(target=download_chunk_file, args=(start_pos, end_pos, remote_path, local_path, ssh_info))
        p.start()
        print(p)
        process_list.append(p)

    # wait for all the processes to finish
    for p in process_list:
        p.join()
        print(p)
    multiprocess_download_end = time.time()
    multiprocess_download_cost = multiprocess_download_end - multiprocess_download_start
    print("Full file Download successfully! Cost: {:.2f}s".format(multiprocess_download_cost))
    print("main end")
if __name__ == "__main__":
    main()

Full source for the multiprocess chunked download (process management factored into separate functions):

import paramiko
import multiprocessing
import time

def get_remote_file_size(ssh_info, remote_path):
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(**ssh_info)
    sftp = ssh.open_sftp()
    # get remote file size
    remote_file_size = sftp.stat(remote_path).st_size
    print ("remote_file_size:{}".format(remote_file_size))
    sftp.close()
    ssh.close()
    return remote_file_size

def download_chunk_file(ssh_info, remote_path, local_path, start_pos, end_pos):
    print("download_chunk_file start")
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(**ssh_info)

    sftp = client.open_sftp()
    # open both local and remote file
    local_file = open(local_path, "r+b")
    remote_file = sftp.open(remote_path, "rb")

    # seek both files to the chunk's start position
    local_file.seek(start_pos)
    remote_file.seek(start_pos)

    # print("start_pos ~ end_pos: {} ~ {}".format(start_pos, end_pos))
    # download exactly this chunk: read() may return fewer bytes than
    # requested, so track how many bytes this chunk still needs
    bytes_remaining = end_pos - start_pos
    while bytes_remaining > 0:
        # read chunk data from the remote file
        read_start = time.time()
        buffer = remote_file.read(bytes_remaining)
        if not buffer:
            break
        print("read  cost time {:.2f}s".format(time.time() - read_start))
        # write chunk data to the local file
        write_start = time.time()
        local_file.write(buffer)
        print("write cost time {:.2f}s".format(time.time() - write_start))
        bytes_remaining -= len(buffer)
    print("chunk file with start_pos ~ end_pos: {} ~ {}, Download successfully!".format(start_pos, end_pos))
    remote_file.close()
    local_file.close()
    client.close()
    print("download_chunk_file end")

def download_multiprocessing(ssh_info, remote_path, local_path):
    # use one worker process per CPU core
    num_processes = multiprocessing.cpu_count()
    #num_processes = 1
    print("number of CPU is {}, number of processes is {}".format(multiprocessing.cpu_count(), num_processes))
    # get remote file size
    file_size = get_remote_file_size(ssh_info, remote_path)
    # create a new empty local file with the same size as the remote file
    with open(local_path, "wb") as f:
        f.truncate(file_size)

    # chunk size: the first n-1 processes each download chunk_size bytes,
    # and the last process also picks up any remainder
    chunk_size = file_size // num_processes
    print("chunk_size is {}".format(chunk_size))
    # create number of process
    processes = []
    # create a process for each chunk
    for index in range(num_processes):
        #print("process: {}".format(index))
        start_pos = index * chunk_size
        end_pos = start_pos + chunk_size
        # the last process also downloads the remaining bytes up to file_size
        if index == num_processes - 1:
            end_pos = file_size

        args = (ssh_info, remote_path, local_path, start_pos, end_pos)
        process = multiprocessing.Process(target=download_chunk_file, args=args)

        process.start()
        print(process)
        processes.append(process)

    # wait for all the processes to finish
    for process in processes:
        process.join()
        print(process)

def main():
    
    host = "host"
    port = 22
    username = "username"
    password = "password"

    remote_path = '/remote_dir/remote_file'
    local_path = '/local_dir/local_file'

    ssh_info = {
        "hostname": host,
        "port": port,
        "username": username,
        "password": password,
    }

    multiprocess_download_start = time.time()
    download_multiprocessing(ssh_info, remote_path, local_path)
    multiprocess_download_end = time.time()
    multiprocess_download_cost = multiprocess_download_end - multiprocess_download_start
    print("Full file Download successfully! Cost time: {:.2f}s".format(multiprocess_download_cost))


if __name__ == "__main__":
    main()

Run output:

$ python multi_process_download_single_bigfile_def.py
number of CPU is 4
remote_file_size:63376366
chunk_size is 15844091
<Process(Process-1, started)>
download_chunk_file start
<Process(Process-2, started)>
download_chunk_file start
<Process(Process-3, started)>
<Process(Process-4, started)>
<Process(Process-1, started)>
download_chunk_file start
download_chunk_file start
read  cost time 6.19s
write cost time 0.01s
read  cost time 6.22s
write cost time 0.01s
read  cost time 6.20s
write cost time 0.01s
read  cost time 0.00s
write cost time 0.00s
read  cost time 0.00s
chunk file with start_pos ~ end_pos: 47532273 ~ 63376365, Download successfully!
download_chunk_file end
read  cost time 6.24s
write cost time 0.01s
read  cost time 4.25s
write cost time 0.01s
read  cost time 4.36s
write cost time 0.01s
read  cost time 4.34s
write cost time 0.01s
read  cost time 0.03s
write cost time 0.00s
read  cost time 0.00s
chunk file with start_pos ~ end_pos: 31688182 ~ 47532273, Download successfully!
download_chunk_file end
read  cost time 4.26s
write cost time 0.01s
read  cost time 0.00s
write cost time 0.00s
read  cost time 0.00s
chunk file with start_pos ~ end_pos: 15844091 ~ 31688182, Download successfully!
download_chunk_file end
read  cost time 4.39s
write cost time 0.01s
read  cost time 4.29s
write cost time 0.01s
read  cost time 0.00s
write cost time 0.00s
read  cost time 0.00s
chunk file with start_pos ~ end_pos: 0 ~ 15844091, Download successfully!
download_chunk_file end
<Process(Process-2, stopped)>
<Process(Process-3, stopped)>
<Process(Process-4, stopped)>
Full file Download successfully! Cost: 19.62s

References:

Python paramiko文件传输显示上传下载进度信息 - Entropy-Go, CSDN blog

Python paramiko实现文件的简单传输上传和下载代码 - Entropy-Go, CSDN blog

When downloading with a single process, it is better to call getfo() directly; in testing it is much faster than a manual read()/write() loop.
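For reference, here is a minimal single-process sketch built on getfo(); ssh_info, remote_path, and local_path are the same placeholders used in the scripts above:

import paramiko

def download_single_process(ssh_info, remote_path, local_path):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(**ssh_info)
    sftp = client.open_sftp()
    # getfo() streams the remote file into a local file object and
    # prefetches blocks internally, which is why it usually beats a
    # manual read()/write() loop in a single process
    with open(local_path, "wb") as f:
        sftp.getfo(remote_path, f)
    sftp.close()
    client.close()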

