Python contingency tables

I am generating many, many contingency tables as part of a project that I'm writing.

The workflow is:

1. Take a large data array with continuous (float) rows and convert those rows to discrete integer values by binning, so that each resulting row has values 0-9, for example (a sketch of this step follows the list).

2. Slice two rows into vectors X and Y and generate a contingency table from them, so that I have the 2-dimensional frequency distribution. For example, I'd have a 10 x 10 array counting the number of (xi, yi) pairs that occur.

3. Use the contingency table to do some information theory math.
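For concreteness, here is a minimal sketch of what step 1 might look like; the equal-width binning via np.digitize and the names bin_row / num_bins are just illustrative assumptions, not the code I'm actually using:

    import numpy as np

    def bin_row(row, num_bins=10):
        # Equal-width bins spanning the row's range; values map to integers 0..num_bins-1.
        edges = np.linspace(row.min(), row.max(), num_bins + 1)
        # np.digitize is 1-based here, so shift down and clip the maximum value
        # (which falls exactly on the last edge) into the top bin.
        return np.clip(np.digitize(row, edges) - 1, 0, num_bins - 1)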

Initially, I wrote this as:

    import numpy as np

    def make_table(x, y, num_bins):
        # Count each (x, y) pair directly into the 2-D table.
        ctable = np.zeros((num_bins, num_bins), dtype=np.dtype(int))
        for xn, yn in zip(x, y):
            ctable[xn, yn] += 1
        return ctable

This works fine, but it's so slow that it's eating up roughly 90% of the runtime of the entire project.

The fastest python-only optimization I've been able to come up with is this:

    def make_table(x, y, num_bins):
        # Encode each (x, y) pair as a single flat index, count the unique indices,
        # then scatter the counts into a flat table and reshape.
        ctable = np.zeros(num_bins ** 2, dtype=np.dtype(int))
        reindex = np.dot(np.stack((x, y)).transpose(),
                         np.array([num_bins, 1]))
        idx, count = np.unique(reindex, return_counts=True)
        for i, c in zip(idx, count):
            ctable[i] = c
        return ctable.reshape((num_bins, num_bins))

That's (somehow) a lot faster, but it's still pretty expensive for something that doesn't seem like it should be a bottleneck. Are there any efficient ways to do this that I'm just not seeing, or should I just give up and do this in cython?

Also, here's a benchmarking function.

    import time

    def timetable(func):
        # Benchmark: build 1000 contingency tables from random 5000-element,
        # 10-bin vectors and report the total elapsed time in seconds.
        size = 5000
        bins = 10
        repeat = 1000
        start = time.time()
        for i in range(repeat):
            x = np.random.randint(0, bins, size=size)
            y = np.random.randint(0, bins, size=size)
            func(x, y, bins)
        end = time.time()
        print("Func {na}: {ti} s".format(na=func.__name__, ti=(end - start)))

Solution

The clever trick for representing the elements of np.stack((x, y)) as integers can be made faster:

In [92]: %timeit np.dot(np.stack((x, y)).transpose(), np.array([bins, 1]))

109 µs ± 6.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [94]: %timeit bins*x + y

12.1 µs ± 260 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
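The two expressions produce identical flat indices; a quick sanity check (with x, y, and bins generated as in the benchmark above):

    import numpy as np

    bins = 10
    x = np.random.randint(0, bins, size=5000)
    y = np.random.randint(0, bins, size=5000)

    flat_dot = np.dot(np.stack((x, y)).transpose(), np.array([bins, 1]))
    flat_mul = bins * x + y
    assert np.array_equal(flat_dot, flat_mul)  # same pair-to-index encoding, computed more cheaply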

Moreover, the last part of your second solution can be simplified somewhat, by simply considering

np.unique(bins * x + y, return_counts=True)[1].reshape((bins, bins))

What is more, since we are dealing with equally spaced non-negative integers, np.bincount will outperform np.unique; with that, the above boils down to

np.bincount(bins * x + y).reshape((bins, bins))
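One caveat, not part of the original answer: np.bincount only counts up to the largest index that actually occurs, so if some (x, y) combination is absent the flat result is shorter than bins ** 2 and the reshape fails. The minlength argument guards against that; a small wrapped-up sketch:

    def make_table_bincount(x, y, num_bins):
        # Flatten each (x, y) pair to one index, count, and pad the result to
        # num_bins ** 2 so the reshape also works when some pairs never occur.
        counts = np.bincount(num_bins * x + y, minlength=num_bins ** 2)
        return counts.reshape((num_bins, num_bins))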

All in all, this provides quite a performance improvement over what you are currently doing:

In [78]: %timeit make_table(x, y, bins) # Your first solution

3.86 ms ± 159 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [79]: %timeit make_table2(x, y, bins) # Your second solution

443 µs ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [101]: %timeit np.unique(bins * x + y, return_counts=True)[1].reshape((bins, bins))

307 µs ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [118]: %timeit np.bincount(bins * x + y).reshape((10, 10))

30.3 µs ± 3.44 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

You may also want to be aware of np.histogramdd, which takes care of both the binning and the counting at once, although it's likely going to be slower than binning the data yourself and using np.bincount.
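For completeness, a hedged sketch of that route, assuming the still-continuous rows are available as xf and yf (hypothetical names):

    # One call bins the continuous data and counts it; H is the num_bins x num_bins table.
    H, edges = np.histogramdd(np.column_stack((xf, yf)), bins=num_bins)
    H = H.astype(int)  # histogramdd returns float counts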
