Python contingency tables

I am generating many, many contingency tables as part of a project that I'm writing.

The workflow is:

1. Take a large data array with continuous (float) rows and convert those rows to discrete integer values by binning, so that each resulting row has values 0-9, for example (a sketch of this step follows the list).

2. Slice two rows into vectors X and Y and generate a contingency table from them, so that I have the 2-dimensional frequency distribution. For example, I'd have a 10 x 10 array counting the number of (xi, yi) pairs that occur.

3. Use the contingency table to do some information theory math.
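For concreteness, here is a minimal sketch of what step 1 might look like; the equal-width binning via np.digitize and the names bin_row / num_bins are just illustrative assumptions, not the code I'm actually using:

    import numpy as np

    def bin_row(row, num_bins=10):
        # Equal-width bins spanning the row's range; values map to integers 0..num_bins-1.
        edges = np.linspace(row.min(), row.max(), num_bins + 1)
        # np.digitize is 1-based here, so shift down and clip the maximum value
        # (which falls exactly on the last edge) into the top bin.
        return np.clip(np.digitize(row, edges) - 1, 0, num_bins - 1)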

Initially, I wrote this as:

    import numpy as np

    def make_table(x, y, num_bins):
        # Count each (x, y) pair directly into the 2-D table.
        ctable = np.zeros((num_bins, num_bins), dtype=np.dtype(int))
        for xn, yn in zip(x, y):
            ctable[xn, yn] += 1
        return ctable

This works fine, but it's so slow that it's eating up roughly 90% of the runtime of the entire project.

The fastest python-only optimization I've been able to come up with is this:

    def make_table(x, y, num_bins):
        # Encode each (x, y) pair as a single flat index, count the unique indices,
        # then scatter the counts into a flat table and reshape.
        ctable = np.zeros(num_bins ** 2, dtype=np.dtype(int))
        reindex = np.dot(np.stack((x, y)).transpose(),
                         np.array([num_bins, 1]))
        idx, count = np.unique(reindex, return_counts=True)
        for i, c in zip(idx, count):
            ctable[i] = c
        return ctable.reshape((num_bins, num_bins))

That's (somehow) a lot faster, but it's still pretty expensive for something that doesn't seem like it should be a bottleneck. Are there any efficient ways to do this that I'm just not seeing, or should I just give up and do this in cython?

Also, here's a benchmarking function.

    import time

    def timetable(func):
        # Benchmark: build 1000 contingency tables from random 5000-element,
        # 10-bin vectors and report the total elapsed time in seconds.
        size = 5000
        bins = 10
        repeat = 1000
        start = time.time()
        for i in range(repeat):
            x = np.random.randint(0, bins, size=size)
            y = np.random.randint(0, bins, size=size)
            func(x, y, bins)
        end = time.time()
        print("Func {na}: {ti} s".format(na=func.__name__, ti=(end - start)))

Solution

The clever trick for representing the elements of np.stack((x, y)) as integers can be made faster:

In [92]: %timeit np.dot(np.stack((x, y)).transpose(), np.array([bins, 1]))

109 µs ± 6.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [94]: %timeit bins*x + y

12.1 µs ± 260 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
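The two expressions produce identical flat indices; a quick sanity check (with x, y, and bins generated as in the benchmark above):

    import numpy as np

    bins = 10
    x = np.random.randint(0, bins, size=5000)
    y = np.random.randint(0, bins, size=5000)

    flat_dot = np.dot(np.stack((x, y)).transpose(), np.array([bins, 1]))
    flat_mul = bins * x + y
    assert np.array_equal(flat_dot, flat_mul)  # same pair-to-index encoding, computed more cheaply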

Moreover, the last part of your second solution can be simplified somewhat, by simply considering

np.unique(bins * x + y, return_counts=True)[1].reshape((bins, bins))

What is more, since we are dealing with equally spaced non-negative integers, np.bincount will outperform np.unique; with that, the above boils down to

np.bincount(bins * x + y).reshape((bins, bins))
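One caveat, not part of the original answer: np.bincount only counts up to the largest index that actually occurs, so if some (x, y) combination is absent the flat result is shorter than bins ** 2 and the reshape fails. The minlength argument guards against that; a small wrapped-up sketch:

    def make_table_bincount(x, y, num_bins):
        # Flatten each (x, y) pair to one index, count, and pad the result to
        # num_bins ** 2 so the reshape also works when some pairs never occur.
        counts = np.bincount(num_bins * x + y, minlength=num_bins ** 2)
        return counts.reshape((num_bins, num_bins))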

All in all, this provides quite a performance improvement over what you are currently doing:

In [78]: %timeit make_table(x, y, bins) # Your first solution

3.86 ms ± 159 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [79]: %timeit make_table2(x, y, bins) # Your second solution

443 µs ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [101]: %timeit np.unique(bins * x + y, return_counts=True)[1].reshape((bins, bins))

307 µs ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [118]: %timeit np.bincount(bins * x + y).reshape((10, 10))

30.3 µs ± 3.44 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

You may also want to be aware of np.histogramdd, which takes care of both the binning and the counting at once, although it's likely going to be slower than binning the data yourself and using np.bincount.
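For completeness, a hedged sketch of that route, assuming the still-continuous rows are available as xf and yf (hypothetical names):

    # One call bins the continuous data and counts it; H is the num_bins x num_bins table.
    H, edges = np.histogramdd(np.column_stack((xf, yf)), bins=num_bins)
    H = H.astype(int)  # histogramdd returns float counts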
