Processing Data with Python Multithreading/Multiprocessing

清梦星河_zh

Last modified 2023-09-12 15:34:29

Tags: python, programming languages, pandas

First published 2023-09-12 15:33:37

Article link: https://blog.csdn.net/zhangjing_angry/article/details/132832898

Python multithreading/multiprocessing and asynchronous data processing:

I currently use two approaches:
1. Using map:

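import time
import multiprocessing
from multiprocessing import Pool

# getKeyPhrase and df (a DataFrame with a 'pure_content' column) are defined elsewhere.
# On platforms that spawn worker processes (Windows, macOS), put the code under
# "Approach 1" inside an `if __name__ == '__main__':` block.

def singProcess(args):
    # args is an (index, text) tuple; return the index with up to 20 key phrases joined by '、'
    kps = getKeyPhrase('title', args[1])
    txl = '、'.join([i[0] for i in kps[:20]])
    return (args[0], txl)

# Approach 1
t1 = time.time()
num_cores = multiprocessing.cpu_count()
t_list = [(index, df.loc[index, 'pure_content']) for index in range(18000, df.shape[0])]
with Pool(num_cores) as p:
    outputs = p.map(singProcess, t_list)
# Write the results back in the parent process
for ix in outputs:
    df.loc[ix[0], 'key_word'] = ix[1]
df.to_csv('./crops/kw_2.csv', index=False)
print('test', time.time() - t1)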

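With this approach, as far as I can tell, map iterates over the input data automatically; you cannot drive the iteration yourself in code.
On the other hand, it hands back the return values of the processing function directly (as a list, in input order), and its memory use is more reasonable, so out-of-memory errors are less common.
Caveat: my understanding may be incomplete; if this turns out to be wrong, I will revise it later.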
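
If you do want to consume the results in your own loop, `Pool.imap` is one option: it yields results one at a time, in input order, instead of returning a finished list. Below is a minimal sketch, not the code used in this post. `fake_key_phrase` is a hypothetical stand-in for singProcess/getKeyPhrase, and the `pool_cls` parameter is added only so the same function can be tried with `multiprocessing.pool.ThreadPool`, which exposes the same interface.

```python
import multiprocessing
from multiprocessing.pool import ThreadPool  # same interface as Pool, but thread-based


def fake_key_phrase(args):
    """Hypothetical stand-in for singProcess: takes (index, text), returns (index, result)."""
    index, text = args
    return (index, '、'.join(text.split()[:20]))


def collect_with_imap(tasks, workers=2, pool_cls=multiprocessing.Pool):
    """Consume results in an explicit loop instead of waiting for map() to return a list."""
    results = {}
    with pool_cls(workers) as pool:
        # imap yields results lazily, in the same order as the inputs
        for index, phrase in pool.imap(fake_key_phrase, tasks):
            # this is where a value could be written back, e.g. df.loc[index, 'key_word']
            results[index] = phrase
    return results
```

Called as `collect_with_imap([(0, 'a b c'), (1, 'd e')])` from under an `if __name__ == '__main__':` guard, it returns a dict keyed by index.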

2. Using apply_async:

import time
import multiprocessing
import pandas as pd

def singTest(df, start, end):
    # df is the slice holding rows labelled start..end-1; fill in 'key_word' for each row
    for index in range(start, end):
        # print(index, start, end)
        kps = getKeyPhrase('title', df.loc[index, 'pure_content'])
        txl = '、'.join([i[0] for i in kps[:20]])
        df.loc[index, 'key_word'] = txl
    return df

# Approach 2
t1 = time.time()
et, step = 18000, 300
res = []
pool = multiprocessing.Pool(4)
for ix in range(0, et, step):
    endt = min(ix + step, et)
    print(ix, endt)
    # print(df.iloc[ix:endt,:])
    # Submit one chunk per task; apply_async returns an AsyncResult immediately
    res.append(pool.apply_async(func=singTest, args=(df.iloc[ix:endt, :], ix, endt)))
pool.close()
pool.join()

# Collect the results. Calling .get() directly on apply_async in the loop above would
# block until that task finishes, so the tasks would no longer run in parallel.
arr = []
for tmp in res:
    tmpV = tmp.get()
    arr.append(tmpV)
    print(type(tmpV))
df = pd.concat(arr, axis=0)
df.to_csv('./crops/kw_tmp.csv', index=False)
print('test', time.time() - t1)
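
The remark about .get() in the comment above can be illustrated with a small sketch: submit every task first, then call .get() on the returned AsyncResult objects. The names `slow_double`, `submit_then_collect`, `submit_and_block` and the `pool_cls` parameter are hypothetical and exist only for this illustration.

```python
import time
import multiprocessing
from multiprocessing.pool import ThreadPool  # same interface as Pool, but thread-based


def slow_double(x):
    """Hypothetical stand-in for a slow per-chunk job."""
    time.sleep(0.1)
    return x * 2


def submit_then_collect(values, workers=4, pool_cls=multiprocessing.Pool):
    """Queue all tasks first, then gather, so tasks can overlap across workers."""
    with pool_cls(workers) as pool:
        pending = [pool.apply_async(slow_double, (v,)) for v in values]
        # each .get() blocks only until that particular task has finished
        return [p.get() for p in pending]


def submit_and_block(values, workers=4, pool_cls=multiprocessing.Pool):
    """Anti-pattern: .get() right after each submit, so only one task runs at a time."""
    with pool_cls(workers) as pool:
        return [pool.apply_async(slow_double, (v,)).get() for v in values]
```

Both functions return the same list. With four workers and four inputs, the first should take roughly one sleep interval and the second roughly four, which is the blocking behaviour described in the comment.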

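This approach is more flexible, but too many processes can exhaust memory, so it may not be possible to use every CPU core, and that leads to all sorts of problems.
For example, in practice the first approach could use 8 cores, while running the second on 8 cores caused various problems (out of memory, the model failing to load, and so on), so I could only run 4 processes in parallel.
Even so, the two approaches should be about equally efficient: their run times were close (though it may be that the dataset was not large enough for a difference to show).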
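
One thing that might ease the memory pressure in the second approach is to send each worker only the text it needs and to return only (index, key_word) pairs, instead of pickling DataFrame slices in both directions. This is an assumption, not something measured in this post. The sketch below uses plain Python lists; `extract`, `process_chunk`, `key_words_by_index` and `pool_cls` are hypothetical names, with `extract` standing in for getKeyPhrase.

```python
import multiprocessing
from multiprocessing.pool import ThreadPool  # same interface as Pool, but thread-based


def extract(text):
    """Hypothetical stand-in for the key-phrase step."""
    return '、'.join(text.split()[:20])


def process_chunk(pairs):
    """pairs is a list of (index, text); return a list of (index, key_word)."""
    return [(index, extract(text)) for index, text in pairs]


def key_words_by_index(pairs, step=300, workers=4, pool_cls=multiprocessing.Pool):
    """Split pairs into chunks of `step`, process the chunks in parallel, merge into one dict."""
    chunks = [pairs[i:i + step] for i in range(0, len(pairs), step)]
    results = {}
    with pool_cls(workers) as pool:
        # submit every chunk first, then collect, so the chunks run concurrently
        pending = [pool.apply_async(process_chunk, (chunk,)) for chunk in chunks]
        for task in pending:
            results.update(task.get())  # only small (index, key_word) pairs come back
    return results
```

With the DataFrame from this post, the input could be built with `list(df['pure_content'].items())` and the result written back with `df['key_word'] = pd.Series(result)`, which aligns on the index.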

Summary: the first approach is the simpler one to use. That's all.

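Thanks to the following for the solutions:
1. A fix for asynchronous calls turning into blocking calls
2. Code for running a for loop with multiple processes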