Multiprocessing
When processing large amounts of data, multiprocessing can speed things up considerably. Two libraries are commonly used (multiprocessing and concurrent.futures), and each can be written in several ways.
Below are the different approaches and a comparison of their performance.
from concurrent import futures
import time
from multiprocessing import Pool
import json
from tqdm import tqdm
import math
# Simulated workload: sleep one second, then return the input
def func(n):
    time.sleep(1)
    return n
# The concurrent.futures approach (v3): a pool of num_workers processes
# (64 in the test below); each result is written to the file as soon as it completes
def ProcessPool(num_workers, args, write_file):
    with open(write_file, 'w') as writer:
        with futures.ProcessPoolExecutor(max_workers=num_workers) as executor:
            fs = [executor.submit(func, arg) for arg in args]
            # as_completed() yields futures in completion order, so the output file is unordered
            for f in tqdm(futures.as_completed(fs), total=len(fs)):
                try:
                    info = {"path": f.result()}
                    writer.write(json.dumps(info, ensure_ascii=False) + '\n')
                    writer.flush()
                except Exception as e:
                    print(e)
# The naive approach (v1). Two problems:
# 1. Concatenating all results into one string before writing is very slow because the
#    memory cost is huge; write each record as it arrives and flush() instead.
# 2. For very long iterables, map() can consume a lot of memory; prefer imap() or
#    imap_unordered() with an explicit chunksize (see the proces_new3 sketch below).
def process(file_lists, write_file):
    print('begin process, wait........')
    with Pool(64) as p:
        outs = p.map(func, file_lists)  # blocks until every task is done
    print('begin write, wait........')
    w = ''
    for out in outs:
        anno = {"path": out}
        w += json.dumps(anno, ensure_ascii=False) + '\n'  # builds one huge string in memory
    with open(write_file, 'w') as f_w:
        f_w.write(w)
def proces_new1(file_lists, write_file):
    print('begin process, wait........')
    with open(write_file, 'w') as f_w:
        with Pool(64) as p:
            # dispatch in batches of 64 so results can be written incrementally
            for i in tqdm(range(math.ceil(len(file_lists) / 64))):
                outs = p.map(func, file_lists[i * 64:(i + 1) * 64])
                for out in outs:
                    anno = {"path": out}
                    f_w.write(json.dumps(anno, ensure_ascii=False) + '\n')
                    f_w.flush()
def proces_new2(file_lists, write_file):
    print('begin process, wait........')
    with open(write_file, 'w') as f_w:
        with Pool(64) as p:
            # imap() streams results back lazily, in input order
            for res in tqdm(p.imap(func, file_lists), total=len(file_lists)):
                info = {"path": res}
                f_w.write(json.dumps(info, ensure_ascii=False) + '\n')
                f_w.flush()
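# A variant not in the original benchmark: a sketch of tip 2 above, using
# imap_unordered() with an explicit chunksize to cut dispatch overhead on long
# iterables. Results arrive in completion order (like v3); chunksize=16 and the
# name proces_new3 are illustrative choices, not measured ones.
def proces_new3(file_lists, write_file):
    with open(write_file, 'w') as f_w:
        with Pool(64) as p:
            for res in tqdm(p.imap_unordered(func, file_lists, chunksize=16),
                            total=len(file_lists)):
                info = {"path": res}
                f_w.write(json.dumps(info, ensure_ascii=False) + '\n')
                f_w.flush()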
if __name__ == "__main__":
    lists = range(2000)
    start = time.time()
    write_file = 'out.txt'
    process(lists, write_file)
    time1 = time.time()
    print('v1 speed: {}'.format(time1 - start))
    write_file = 'out2.txt'
    proces_new1(lists, write_file)
    time2 = time.time()
    print('v2 speed: {}'.format(time2 - time1))
    write_file = 'out3.txt'
    ProcessPool(64, lists, write_file)
    time3 = time.time()
    print('v3 speed: {}'.format(time3 - time2))
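Note that the if __name__ == "__main__": guard is required here: under the spawn start method (the default on Windows, and on macOS since Python 3.8) each worker process re-imports the main module, and unguarded pool creation would recurse endlessly.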
Output of running the script above:
begin process, wait........
begin write, wait........
v1 speed: 33.38949918746948
begin process, wait........
100%|██████████████████████████████████████████████| 32/32 [00:34<00:00, 1.08s/it]
v2 speed: 35.787909746170044
100%|███████████████████████████████████████████| 2000/2000 [00:32<00:00, 61.31it/s]
v3 speed: 34.20087647438049
With range(200000), the results are as follows:
begin process, wait........
begin write, wait........
v1 speed: 3321.7403123378754
begin process, wait........
100%|██████████| 3125/3125 [1:01:14<00:00, 1.18s/it]
v2 speed: 3675.6794664859774
100%|██████████| 200000/200000 [50:49<00:00, 65.59it/s]
v3 speed: 3198.1027359962463
This test used 200000 items; 2000 would have been enough and much quicker to run.
When the data set is small, v1 is fine: the pool is dispatched only once and everything is written in one go. The downsides are that there is no progress display and results are not written to the file in real time.
When the data set is large, v2 or v3 is recommended. Compared with v1, v2 pays extra dispatch overhead (one map() call per batch of 64), but it writes results to the file in real time and uses tqdm to show progress.
v3 is the faster method on large data sets, but unlike v2 its output file is unordered, because as_completed() yields futures in the order they finish.
Finally, proces_new2 is the Pool-based counterpart of v3; note that imap() yields results in input order, so unlike v3 its output file stays ordered. This is the recommended approach.
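For completeness, the same streaming-write pattern also exists on the concurrent.futures side: executor.map() yields results in input order while still streaming them back, giving v3's pool with v2-style ordered output. A minimal sketch under the same setup (the name ProcessPoolOrdered is made up here):
def ProcessPoolOrdered(num_workers, args, write_file):
    with open(write_file, 'w') as writer:
        with futures.ProcessPoolExecutor(max_workers=num_workers) as executor:
            # unlike as_completed(), map() yields results in input order
            for res in tqdm(executor.map(func, args), total=len(args)):
                writer.write(json.dumps({"path": res}, ensure_ascii=False) + '\n')
                writer.flush()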
See also: https://superfastpython.com/multiprocessing-pool-vs-processpoolexecutor/