python array太慢_Python-Masked Numpy数组比普通numpy数组慢得多

最新推荐文章于 2023-04-04 17:54:01 发布

weixin_39589923

最新推荐文章于 2023-04-04 17:54:01 发布

阅读量299

点赞数

文章标签： python array太慢

我有一个函数来计算numpy数组中所有行对之间的成对相关性.一切正常,但是后来我想起,我经常不得不处理丢失的数据.我使用蒙版数组尝试解决此问题,但它使计算速度大大降低.关于使用屏蔽函数的任何想法.我认为真正的瓶颈将在np.ma.dot函数中.但是我添加了一些配置文件,并很快使用iPython进行了模拟.我应该说,就这些阵列将要拥有的行数而言,5000在频谱的较低端.有些可能超过30万.带掩码数组的代码看起来比没有掩码的代码慢大约20倍,但是显然由于缺少数据,没有掩码数组的代码经常会产生NaN.

首先,一种快速而肮脏的方式来生成一些样本数据

genotypes = np.empty((5000,200))

for i in xrange(5000):

genotypes[i] = np.random.binomial(3,.333, size=200)

pValues = np.random.uniform(0,1,5000)

然后测试功能

def testMask(pValsArray, genotypeArray):

nSNPs = len(pValsArray)-1

genotypeArray = np.ma.masked_array(genotypeArray, np.isnan(genotypeArray))

chisq = np.sum(-2 * np.log(pValsArray))

ms = genotypeArray.mean(axis=1)[(slice(None,None,None),None)]

datam = genotypeArray - ms

datass = np.ma.sqrt(np.ma.sum(datam**2, axis=1))

runningSum = 0

for i in xrange(nSNPs):

temp = np.ma.dot(datam[i:],datam[i].T)

d = (datass[i:]*datass[i])

rs = temp / d

rs = np.absolute(rs)[1:]

runningSum += 3.25 * np.sum(rs) + .75 * np.dot(rs, rs)

sigmaSq = 4*nSNPs+2*runningSum

E = 2*nSNPs

df = (2*(E*E))/sigmaSq

runningSum = sigmaSq/(2*E)

d = chisq/runningSum

brownsP = stats.chi2.sf(d, df)

return brownsP

def testNotMask(pValsArray, genotypeArray):

nSNPs = len(pValsArray)-1

chisq = np.sum(-2 * np.log(pValsArray))

ms = genotypeArray.mean(axis=1)[(slice(None,None,None),None)]

datam = genotypeArray - ms

datass = np.sqrt(stats.ss(datam, axis=1))

runningSum = 0

for i in xrange(nSNPs):

temp = np.dot(datam[i:],datam[i].T)

d = (datass[i:]*datass[i])

rs = temp / d

rs = np.absolute(rs)[1:]

runningSum += 3.25 * np.sum(rs) + .75 * np.dot(rs, rs)

sigmaSq = 4*nSNPs+2*runningSum

E = 2*nSNPs

df = (2*(E*E))/sigmaSq

runningSum = sigmaSq/(2*E)

d = chisq/runningSum

brownsP = stats.chi2.sf(d, df)

return brownsP

还有一些时间

%timeit testMask(pValues, genotypes)

1 loops, best of 3: 14.3 s per loop

%timeit testNotMask(pValues, genotypes)

1 loops, best of 3: 678 ms per loop

添加一些丢失的数据,然后再次运行：

randis = np.random.randint(0,5000, 10)

randjs = np.random.randint(0,200, 10)

for i,j in zip(randis, randjs):

genotypes[i,j] = None

%timeit testMask(pValues, genotypes)

1 loops, best of 3: 14.2 s per loop

%timeit testNotMask(pValues, genotypes)

1 loops, best of 3: 654 ms per loop

和一些分析：

%prun

2559677 function calls in 15.045 seconds

Ordered by: internal time

ncalls tottime percall cumtime percall filename:lineno(function)

9791 5.259 0.001 5.259 0.001 {method 'copy' of 'numpy.ndarray' objects}

4999 2.877 0.001 11.888 0.002 extras.py:949(dot)

9794 1.566 0.000 1.566 0.000 {numpy.core.multiarray.copyto}

14997 1.497 0.000 1.564 0.000 {numpy.core._dotblas.dot}

30007 0.559 0.000 0.559 0.000 {method 'reduce' of 'numpy.ufunc' objects}

94996 0.450 0.000 0.875 0.000 core.py:2751(_update_from)

864970 0.347 0.000 0.347 0.000 {getattr}

5000 0.346 0.000 0.802 0.000 core.py:1065(__call__)

1 0.240 0.240 15.045 15.045 :1(testMask)

5000 0.196 0.000 0.196 0.000 core.py:771(__call__)

5001 0.147 0.000 0.551 0.000 core.py:917(__call__)

24996 0.143 0.000 0.609 0.000 core.py:2930(__getitem__)

54998 0.140 0.000 0.775 0.000 core.py:2776(__array_finalize__)

419985 0.126 0.000 0.126 0.000 {method 'update' of 'dict' objects}

1 0.093 0.093 0.111 0.111 core.py:5990(power)

339994 0.077 0.000 0.077 0.000 {isinstance}

50015 0.072 0.000 0.072 0.000 {numpy.core.multiarray.array}

60002 0.060 0.000 0.568 0.000 {method 'view' of 'numpy.ndarray' objects}

5000 0.060 0.000 0.199 0.000 core.py:2626(__new__)

14999 0.058 0.000 7.412 0.000 core.py:3341(filled)

25005 0.055 0.000 0.092 0.000 core.py:609(getdata)

编辑：

我尝试了perimosocordiae的答案,但我仍然感到烦恼.看起来像平均值,stats.ss和np.sqrt函数都关心nan值.

def fastNotMask(pValsArray, genotypeArray):

nSNPs = len(pValsArray)-1

chisq = np.sum(-2 * np.log(pValsArray))

ms = genotypeArray.mean(axis=1)[(slice(None,None,None),None)]

print ms

datam = genotypeArray - ms

print datam

datass = np.sqrt(stats.ss(datam, axis=1))

print datass

runningSum = 0

for i in xrange(nSNPs):

temp = np.dot(datam[i:],datam[i].T)

d = (datass[i:]*datass[i])

rs = temp / d

rs = np.absolute(rs)[1:]

runningSum += 3.25 * np.nansum(rs) + .75 * np.nansum(rs * rs)

print runningSum

sigmaSq = 4*nSNPs+2*runningSum

E = 2*nSNPs

df = (2*(E*E))/sigmaSq

runningSum = sigmaSq/(2*E)

d = chisq/runningSum

brownsP = stats.chi2.sf(d, df)

return brownsP

用少量输出进行测试表明,nans没有被正确处理.

pValues = np.random.uniform(0,1,10)

genotypes = np.empty((10,10))

for i in xrange(10):

genotypes[i] = np.random.binomial(2,.5, size=10)

randis = np.random.randint(0,10, 2)

randjs = np.random.randint(0,10, 2)

for i,j in zip(randis, randjs):

genotypes[i,j] = None

print testfastMask(pValues, genotypes)

[[ 1.5]

[ 1.2]

[ 0.9]

[ 1.2]

[ 1.1]

[ 0.6]

[ nan]

[ 1.1]

[ nan]

[ 0.8]]

[[-0.5 -0.5 0.5 0.5 -0.5 -0.5 0.5 0.5 -0.5 0.5]

[-0.2 0.8 -0.2 -0.2 -0.2 -0.2 -0.2 0.8 -0.2 -0.2]

[-0.9 0.1 -0.9 1.1 0.1 -0.9 1.1 0.1 0.1 0.1]

[-0.2 -0.2 -0.2 -0.2 0.8 -0.2 -0.2 -1.2 0.8 0.8]

[-0.1 -0.1 -0.1 -0.1 -0.1 -0.1 0.9 -0.1 0.9 -1.1]

[-0.6 1.4 0.4 -0.6 -0.6 0.4 -0.6 -0.6 0.4 0.4]

[ nan nan nan nan nan nan nan nan nan nan]

[-0.1 0.9 -1.1 -1.1 0.9 -0.1 0.9 -1.1 0.9 -0.1]

[ nan nan nan nan nan nan nan nan nan nan]

[ 1.2 -0.8 -0.8 0.2 -0.8 0.2 1.2 0.2 0.2 -0.8]]

[ 1.58113883 1.26491106 2.21359436 1.8973666 1.70293864 2.0976177

nan 2.62678511 nan 2.36643191]

nan

我在这里想念什么吗？这可能是版本问题.我正在使用python 2.7和numpy 1.7.1？

为帮助加油.

解决方法:

编辑：原始答案不适用于numpy版本< = 1.8,其中np.nansum([NaN,NaN])== 0.0(note the FutureWarning).对于早期版本,您必须手动检查这种情况：

tmp = 3.25 * np.nansum(rs) + .75 * np.nansum(rs * rs)

if not np.isnan(tmp):

runningSum += tmp

或者,您可以建立一个tmp值列表/数组,并在其上调用np.nansum.

原始答案：

您需要更改的只是testNotMask中的这一行：

runningSum += 3.25 * np.sum(rs) + .75 * np.dot(rs, rs)

对此：

runningSum += 3.25 * np.nansum(rs) + .75 * np.nansum(rs * rs)

所有其他操作都可以使用NaN值正常工作,因此您可以在保持正确结果的同时获得所有非掩码数组的速度.

这是证明.函数fastNotMask只是带有上述更改的testNotMask.

In [63]: genotypes = np.random.binomial(3, .333, size=(5000, 200)).astype(float)

In [64]: pValues = np.random.uniform(0,1,5000)

In [65]: %timeit testMask(pValues, genotypes)

1 loops, best of 3: 11.3 s per loop

In [66]: %timeit testNotMask(pValues, genotypes)

1 loops, best of 3: 3.53 s per loop

In [67]: %timeit fastNotMask(pValues, genotypes)

1 loops, best of 3: 3.96 s per loop

In [68]: randjs = np.random.randint(0,200, 10)

In [69]: randis = np.random.randint(0,5000,10)

In [70]: genotypes[randis,randjs] = None

In [71]: %timeit testMask(pValues, genotypes)

1 loops, best of 3: 33 s per loop

In [72]: %timeit testNotMask(pValues, genotypes)

1 loops, best of 3: 3.6 s per loop

In [73]: %timeit fastNotMask(pValues, genotypes)

1 loops, best of 3: 3.98 s per loop

In [74]: testMask(pValues, genotypes)

Out[74]: 0.47606794747438386

In [75]: testNotMask(pValues, genotypes)

Out[75]: nan

In [76]: fastNotMask(pValues, genotypes)

Out[76]: 0.47613597091679449

请注意,testMask和fastNotMask之间的精度略有不同.我实际上不确定这是哪里来的,但我将假定它并不重要.

标签：scipy,nan,python,numpy

weixin_39589923

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
python array太慢_Python-Masked Numpy数组比普通numpy数组慢得多

我有一个函数来计算numpy数组中所有行对之间的成对相关性.一切正常,但是后来我想起,我经常不得不处理丢失的数据.我使用蒙版数组尝试解决此问题,但它使计算速度大大降低.关于使用屏蔽函数的任何想法.我认为真正的瓶颈将在np.ma.dot函数中.但是我添加了一些配置文件,并很快使用iPython进行了模拟.我应该说,就这些阵列将要拥有的行数而言,5000在频谱的较低端.有些可能超过30万.带掩码数组的...
复制链接

扫一扫