Python multiprocessing: why are large chunksizes slower?

I've been profiling some code using Python's multiprocessing module (the 'job' function just squares the number).

import multiprocessing
import time

def job(x):
    # 'job' just squares its input, as described above.
    return x * x

data = range(100000000)
n = 4

time1 = time.time()
processes = multiprocessing.Pool(processes=n)
results_list = processes.map(func=job, iterable=data, chunksize=10000)
processes.close()
time2 = time.time()

print(time2 - time1)
print(results_list[0:10])

One thing I found odd is that the optimal chunksize appears to be around 10k elements - this took 16 seconds on my computer. If I increase the chunksize to 100k or 200k, then it slows to 20 seconds.

Could this difference be due to the time required for pickling being longer for longer lists? A chunksize of 100 elements takes 62 seconds, which I assume is due to the extra time required to pass the chunks back and forth between the processes.
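One rough way to get a feel for the serialization cost is to time pickle.dumps on chunks of different sizes directly. The sketch below is only illustrative, assuming a few arbitrary chunk sizes rather than the exact ones from the run above:

import pickle
import time

# Time how long it takes to serialize a single chunk of each size.
# The chunk sizes here are assumptions for illustration only.
for chunk_size in (100, 10_000, 100_000):
    chunk = list(range(chunk_size))
    start = time.time()
    payload = pickle.dumps(chunk)
    elapsed = time.time() - start
    print(f"chunksize={chunk_size:>7}: {len(payload)} bytes, {elapsed:.6f} s to pickle")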

Solution

About optimal chunksize:

Having many small chunks allows the 4 workers to balance the load among themselves more evenly, so from that angle smaller chunks are desirable.

On the other hand, inter-process communication and context switching add overhead every time a new chunk has to be dispatched, so from that angle fewer (and therefore larger) chunks are desirable.

Since the two effects pull in opposite directions, the optimum lies somewhere in the middle, much like the equilibrium point of a supply-demand chart.
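One way to locate that middle point empirically is to sweep over a few chunksize values and time each run. The following is a minimal sketch of such a sweep for the same squaring workload; the specific chunksize values and the smaller data size are assumptions chosen so the test finishes quickly, not the original benchmark:

import multiprocessing
import time

def job(x):
    return x * x

def time_chunksize(data, chunksize, n=4):
    # Run one pool.map with the given chunksize and return the wall-clock time.
    start = time.time()
    with multiprocessing.Pool(processes=n) as pool:
        pool.map(job, data, chunksize=chunksize)
    return time.time() - start

if __name__ == "__main__":
    data = range(10_000_000)  # smaller than the original 1e8 so the sweep runs quickly
    for chunksize in (100, 1_000, 10_000, 100_000, 1_000_000):
        print(f"chunksize={chunksize:>9}: {time_chunksize(data, chunksize):.2f} s")

Plotting (or just eyeballing) the printed timings should show the same U-shaped curve described above: very small chunks lose time to dispatch overhead, very large chunks lose time to uneven load across the workers.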
