Multiprocessing
When processing large amounts of data, multiprocessing can speed things up considerably. Two libraries are commonly used (multiprocessing and concurrent.futures), and each can be written in several ways.
Below are the different approaches and a comparison of their performance.
from concurrent import futures
import time
from multiprocessing import Pool
import json
from tqdm import tqdm
import math
# Simulated workload: sleep one second, then return the input
def func(n):
    time.sleep(1)
    return n
# The concurrent.futures approach (v3): a pool of num_workers processes
# (64 in the test below); each result is written to the file as soon as it completes
def ProcessPool(num_workers, args, write_file):
    with open(write_file, 'w') as writer:
        with futures.ProcessPoolExecutor(max_workers=num_workers) as executor:
            fs = [executor.submit(func, arg) for arg in args]
            # as_completed() yields futures in completion order, so the output file is unordered
            for f in tqdm(futures.as_completed(fs), total=len(fs)):
                try:
                    info = {"path": f.result()}
                    writer.write(json.dumps(info, ensure_ascii=False) + '\n')
                    writer.flush()
                except Exception as e:
                    print(e)
# The naive approach (v1). Two problems:
# 1. Concatenating all results into one string before writing is very slow because the
#    memory cost is huge; write each record as it arrives and flush() instead.
# 2. For very long iterables, map() can consume a lot of memory; prefer imap() or
#    imap_unordered() with an explicit chunksize (see the proces_new3 sketch below).
def process(file_lists, write_file):
    print('begin process, wait........')
    with Pool(64) as p:
        outs = p.map(func, file_lists)  # blocks until every task is done
    print('begin write, wait........')
    w = ''
    for out in outs:
        anno = {"path": out}
        w += json.dumps(anno, ensure_ascii=False) + '\n'  # builds one huge string in memory
    with open(write_file, 'w') as f_w:
        f_w.write(w)
def proces_new1(file_lists, write_file):
    print('begin process, wait........')
    with open(write_file, 'w') as f_w:
        with Pool(64) as p:
            # dispatch in batches of 64 so results can be written incrementally
            for i in tqdm(range(math.ceil(len(file_lists) / 64))):
                outs = p.map(func, file_lists[i * 64:(i + 1) * 64])
                for out in outs:
                    anno = {"path": out}
                    f_w.write(json.dumps(anno, ensure_ascii=False) + '\n')
                    f_w.flush()
def proces_new2(file_lists, write_file):
    print('begin process, wait........')
    with open(write_file, 'w') as f_w:
        with Pool(64) as p:
            # imap() streams results back lazily, in input order
            for res in tqdm(p.imap(func, file_lists), total=len(file_lists)):
                info = {"path": res}
                f_w.write(json.dumps(info, ensure_ascii=False) + '\n')
                f_w.flush()
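# A variant not in the original benchmark: a sketch of tip 2 above, using
# imap_unordered() with an explicit chunksize to cut dispatch overhead on long
# iterables. Results arrive in completion order (like v3); chunksize=16 and the
# name proces_new3 are illustrative choices, not measured ones.
def proces_new3(file_lists, write_file):
    with open(write_file, 'w') as f_w:
        with Pool(64) as p:
            for res in tqdm(p.imap_unordered(func, file_lists, chunksize=16),
                            total=len(file_lists)):
                info = {"path": res}
                f_w.write(json.dumps(info, ensure_ascii=False) + '\n')
                f_w.flush()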
if __name__ == "__main__":
    lists = range(2000)
    start = time.time()
    write_file = 'out.txt'
    process(lists, write_file)
    time1 = time.time()
    print('v1 speed: {}'.format(time1 - start))
    write_file = 'out2.txt'
    proces_new1(lists, write_file)
    time2 = time.time()
    print('v2 speed: {}'.format(time2 - time1))
    write_file = 'out3.txt'
    ProcessPool(64, lists, write_file)
    time3 = time.time()
    print('v3 speed: {}'.format(time3 - time2))
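Note that the if __name__ == "__main__": guard is required here: under the spawn start method (the default on Windows, and on macOS since Python 3.8) each worker process re-imports the main module, and unguarded pool creation would recurse endlessly.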
Output of running the script above:
begin process, wait........
begin write, wait........
v1 speed: 33.38949918746948
begin process, wait........
100%|██████████████████████████████████████████████| 32/32 [00:34<00:00, 1.08s/it]
v2 speed: 35.787909746170044
100%|███████████████████████████████████████████| 2000/2000 [00:32<00:00, 61.31it/s]
v3 speed: 34.20087647438049
With range(200000), the results are as follows:
begin process, wait........
begin write, wait........
v1 speed: 3321.7403123378754
begin process, wait........
100%|██████████| 3125/3125 [1:01:14<00:00, 1.18s/it]
v2 speed: 3675.6794664859774
100%|██████████| 200000/200000 [50:49<00:00, 65.59it/s]
v3 speed: 3198.1027359962463
This test used 200000 items; 2000 would have been enough and much quicker to run.
When the data set is small, v1 is fine: the pool is dispatched only once and everything is written in one go. The downsides are that there is no progress display and results are not written to the file in real time.
When the data set is large, v2 or v3 is recommended. Compared with v1, v2 pays extra dispatch overhead (one map() call per batch of 64), but it writes results to the file in real time and uses tqdm to show progress.
v3 is the faster method on large data sets, but unlike v2 its output file is unordered, because as_completed() yields futures in the order they finish.
Finally, proces_new2 is the Pool-based counterpart of v3; note that imap() yields results in input order, so unlike v3 its output file stays ordered. This is the recommended approach.
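For completeness, the same streaming-write pattern also exists on the concurrent.futures side: executor.map() yields results in input order while still streaming them back, giving v3's pool with v2-style ordered output. A minimal sketch under the same setup (the name ProcessPoolOrdered is made up here):
def ProcessPoolOrdered(num_workers, args, write_file):
    with open(write_file, 'w') as writer:
        with futures.ProcessPoolExecutor(max_workers=num_workers) as executor:
            # unlike as_completed(), map() yields results in input order
            for res in tqdm(executor.map(func, args), total=len(args)):
                writer.write(json.dumps({"path": res}, ensure_ascii=False) + '\n')
                writer.flush()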
See also: https://superfastpython.com/multiprocessing-pool-vs-processpoolexecutor/