Inefficient multiprocessing of numpy-based calculations in Python

I'm trying to parallelize some calculations that use numpy with the help of Python's multiprocessing module. Consider this simplified example:

```python
import time

import numpy

from multiprocessing import Pool


def test_func(i):
    a = numpy.random.normal(size=1000000)
    b = numpy.random.normal(size=1000000)
    for i in range(2000):
        a = a + b
        b = a - b
        a = a - b
    return 1


t1 = time.time()
test_func(0)
single_time = time.time() - t1
print("Single time:", single_time)

n_par = 4
pool = Pool()

t1 = time.time()
results_async = [
    pool.apply_async(test_func, [i])
    for i in range(n_par)]
results = [r.get() for r in results_async]
multicore_time = time.time() - t1

print("Multicore time:", multicore_time)
print("Efficiency:", single_time / multicore_time)
```

When I execute it, the multicore_time is roughly equal to single_time * n_par, while I would expect it to be close to single_time. Indeed, if I replace numpy calculations with just time.sleep(10), this is what I get — perfect efficiency. But for some reason it does not work with numpy. Can this be solved, or is it some internal limitation of numpy?

Some additional info which may be useful:

- I'm using OSX 10.9.5, Python 3.4.2 and the CPU is a Core i7 with (as reported by the system info) 4 cores (although the above program only takes 50% of CPU time in total, so the system info may not be taking hyperthreading into account).

- When I run this I see n_par processes in top working at 100% CPU.

- If I replace the numpy array operations with a loop and per-index operations, the efficiency rises significantly (to about 75% for n_par = 4).

Solution

It looks like the test function you're using is memory bound. That means that the run time you're seeing is limited by how fast the computer can pull the arrays from memory into cache. For example, the line a = a + b actually uses 3 arrays: a, b, and a new array that will replace a. These three arrays are about 8 MB each (1e6 floats * 8 bytes per float). I believe the different i7s have something like 3-8 MB of shared L3 cache, so you cannot fit all 3 arrays in cache at once. Your CPU adds the floats faster than the arrays can be loaded into cache, so most of the time is spent waiting for the arrays to be read from memory. Because the cache is shared between the cores, you don't see any speedup by spreading the work onto multiple cores.
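One consequence of the above is that the allocation of a new array on every a = a + b can be avoided with in-place ufuncs, which cuts the per-iteration memory traffic from three 8 MB buffers to two. A minimal sketch of the same add/subtract swap loop (the helper name and the iters parameter are mine, added so the loop count is tunable; this reduces allocations but does not by itself remove the memory-bandwidth bottleneck):

```python
import numpy as np

def swap_inplace(a, b, iters=2000):
    """Same arithmetic as test_func, but entirely in place:
    no temporary arrays are allocated per iteration."""
    for _ in range(iters):
        a += b                        # a = a + b, without a new array
        np.subtract(a, b, out=b)      # b = a - b, written into b
        a -= b                        # a = a - b
    return a, b
```

Note that one pass of the loop swaps the contents of a and b, which makes the helper easy to sanity-check on small inputs.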

Memory-bound operations are an issue for numpy in general, and the only way I know to deal with them is to use something like Cython or Numba.
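Short of dropping down to Cython or Numba, a related pure-NumPy workaround is cache blocking: process the arrays in chunks small enough to stay resident in cache, so each slice is pulled from memory once per iteration instead of three times. A sketch under that assumption (the helper name and the chunk size are mine; the right chunk size depends on the CPU's cache and should be tuned):

```python
import numpy as np

def swap_chunked(a, b, iters=2000, chunk=64_000):
    # 64k float64s is ~512 KB per slice, a guess aimed at fitting
    # both working slices in a typical per-core L2 cache
    for _ in range(iters):
        for s in range(0, a.size, chunk):
            e = min(s + chunk, a.size)
            av, bv = a[s:e], b[s:e]   # views, no copies
            av += bv
            np.subtract(av, bv, out=bv)
            av -= bv
    return a, b
```

As with the in-place version, one pass swaps the contents of a and b, so correctness is easy to verify on small arrays; whether it actually helps here depends on how much of the stall is L3 versus main-memory bandwidth.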
