Python histograms: efficiently obtaining the values and indices of each bin

Short Question

I have a large 10000x10000-element image, which I bin into a few hundred different sectors/bins. I then need to perform some iterative calculation on the values contained within each bin.

How do I extract the indices of each bin to efficiently perform my calculation using the bins values?

What I am looking for is a solution that avoids the bottleneck of selecting ind == j from my large array over and over. Is there a way to obtain, directly and in one go, the indices of the elements belonging to every bin?

Detailed Explanation

1. Straightforward Solution

One way to achieve what I need is code like the following (see e.g. this related answer), where I digitize my values and then loop over j, selecting the digitized indices equal to j:

import numpy as np

# func() is just a placeholder for a much more complicated function.
# I am aware that my problem could easily be sped up in the specific case of
# the sum() function, but I am looking for a general solution to the problem.
def func(x):
    y = np.sum(x)
    return y

vals = np.random.random(10**8)

nbins = 100
bins = np.linspace(0, 1, nbins + 1)
ind = np.digitize(vals, bins)

result = [func(vals[ind == j]) for j in range(1, nbins + 1)]

This is not what I want, since it selects ind == j from my large array for every bin: each comparison builds and scans a full boolean mask over all 10^8 elements, so the loop makes nbins complete passes over the data. This makes the solution very inefficient and slow.

2. Using binned_statistic

The above approach turns out to be the same as the one implemented in scipy.stats.binned_statistic for the general case of a user-defined function. Using SciPy directly, an identical output can be obtained with the following:

import numpy as np
from scipy.stats import binned_statistic

# func() as defined in the first example above
vals = np.random.random(10**8)

results = binned_statistic(vals, vals, statistic=func, bins=100, range=[0, 1])[0]
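For reference, scipy.stats.binned_statistic also returns the bin edges and a per-element bin number (essentially the same information that np.digitize produces above). A minimal self-contained sketch, with np.sum standing in for func and a smaller array than in the question:

import numpy as np
from scipy.stats import binned_statistic

vals = np.random.random(10**6)

# statistic : one value of the callable per bin
# binnumber : the bin index assigned to every element of vals
statistic, bin_edges, binnumber = binned_statistic(
    vals, vals, statistic=np.sum, bins=100, range=(0, 1))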

3. Using labeled_comprehension

Another SciPy alternative is scipy.ndimage.labeled_comprehension. Using that function, the above example becomes:

import numpy as np
from scipy.ndimage import labeled_comprehension

# func() as defined in the first example above
vals = np.random.random(10**8)

nbins = 100
bins = np.linspace(0, 1, nbins + 1)
ind = np.digitize(vals, bins)

result = labeled_comprehension(vals, ind, np.arange(1, nbins + 1), func, float, 0)

Unfortunately this form is also inefficient; in particular, it has no speed advantage over my original example.

4. Comparison with IDL language

To further clarify, what I am looking for is functionality equivalent to the REVERSE_INDICES keyword of the HISTOGRAM function in the IDL language. Can this very useful functionality be efficiently replicated in Python?

Specifically, using the IDL language the above example could be written as:

vals = randomu(s, 1e8)

nbins = 100
bins = [0:1:1./nbins]
h = histogram(vals, MIN=bins[0], MAX=bins[-2], NBINS=nbins, REVERSE_INDICES=r)

result = dblarr(nbins)
for j=0, nbins-1 do begin
    jbins = r[r[j]:r[j+1]-1]   ; Selects indices of bin j
    result[j] = func(vals[jbins])
endfor

The above IDL implementation is about 10 times faster than the NumPy one, because the indices belonging to each bin do not have to be re-selected from the full array for every bin, and the speed difference in favour of IDL grows with the number of bins.
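For comparison, the same reverse-index bookkeeping can be sketched in plain NumPy with a single stable argsort of the bin labels plus cumulative bin counts. This is only a rough transliteration of the IDL code above, reusing func from the first example and assuming every value falls inside [0, 1):

import numpy as np

def func(x):
    return np.sum(x)

vals = np.random.random(10**6)   # smaller array, just for the sketch
nbins = 100
bins = np.linspace(0, 1, nbins + 1)
ind = np.digitize(vals, bins)    # labels 1..nbins for in-range values

# One stable argsort groups the element indices by bin; this plays the role
# of the flat index list that REVERSE_INDICES returns in IDL.
order = np.argsort(ind, kind='stable')

# Cumulative bin counts give the start/end offset of every bin inside `order`,
# analogous to the r[j]:r[j+1]-1 offsets in the IDL code.
offsets = np.cumsum(np.bincount(ind, minlength=nbins + 1))

result = [func(vals[order[offsets[j - 1]:offsets[j]]]) for j in range(1, nbins + 1)]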

Solution

I found that a particular sparse matrix constructor can achieve the desired result very efficiently. It's a bit obscure, but we can abuse it for this purpose. The function below can be used in nearly the same way as scipy.stats.binned_statistic but can be orders of magnitude faster:

import numpy as np
from scipy.sparse import csr_matrix

def binned_statistic(x, values, func, nbins, range):
    '''The usage is nearly the same as scipy.stats.binned_statistic'''
    # `range` mirrors the scipy keyword (it shadows the builtin inside this function).

    N = len(values)
    r0, r1 = range

    # Equal-width binning done by hand; faster than np.digitize for this case.
    digitized = (float(nbins)/(r1 - r0)*(x - r0)).astype(int)

    # Building a CSR matrix groups the values by bin: after construction,
    # S.data holds the values ordered by bin and S.indptr holds the bin offsets.
    S = csr_matrix((values, [digitized, np.arange(N)]), shape=(nbins, N))

    return [func(group) for group in np.split(S.data, S.indptr[1:-1])]

I avoided np.digitize because it doesn't use the fact that all bins are equal width and hence is slow, but the method I used instead may not handle all edge cases perfectly.
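A hypothetical usage sketch of the function above, with np.sum standing in for the real func and a smaller array than in the question. Since the function shadows the SciPy name, the SciPy version is imported under an alias for comparison:

import numpy as np
from scipy.stats import binned_statistic as scipy_binned_statistic

vals = np.random.random(10**6)

# Custom sparse-matrix version defined above.
result = binned_statistic(vals, vals, np.sum, 100, (0, 1))

# SciPy reference for the same binning; up to possible floating-point edge
# effects in how values sitting exactly on a bin edge are classified, the
# two results should agree.
reference = scipy_binned_statistic(vals, vals, statistic=np.sum,
                                   bins=100, range=(0, 1))[0]

print(np.allclose(result, reference))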
