I am generating many, many contingency tables as part of a project that I'm writing.
The workflow is:

1. Take a large data array with continuous (float) rows and convert them to discrete integer values by binning, so that each resulting row has values 0-9, for example (see the binning sketch after this list).
2. Slice two rows into vectors X and Y and generate a contingency table from them, so that I have the 2-dimensional frequency distribution. For example, I'd have a 10 x 10 array counting the number of (xi, yi) pairs that occur.
3. Use the contingency table to do some information theory math.
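For concreteness, here is a minimal sketch of the binning step, assuming equal-width bins per row via np.digitize; the function and variable names are illustrative, and the project's actual binning scheme may differ:

import numpy as np

def discretize_rows(data, num_bins=10):
    # Bin each continuous row of `data` into integer labels 0..num_bins-1,
    # using equal-width bins spanning that row's own range.
    binned = np.empty(data.shape, dtype=int)
    for i, row in enumerate(data):
        edges = np.linspace(row.min(), row.max(), num_bins + 1)
        # Digitizing against the interior edges yields labels 0..num_bins-1.
        binned[i] = np.digitize(row, edges[1:-1])
    return binned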
Initially, I wrote this as:
import numpy as np

def make_table(x, y, num_bins):
    # Count how often each (x, y) pair occurs.
    ctable = np.zeros((num_bins, num_bins), dtype=np.dtype(int))
    for xn, yn in zip(x, y):
        ctable[xn, yn] += 1
    return ctable
This works fine, but it is so slow that it's eating up about 90% of the runtime of the entire project.
The fastest Python-only optimization I've been able to come up with is this:
def make_table(x, y, num_bins):
    ctable = np.zeros(num_bins ** 2, dtype=np.dtype(int))
    # Encode each (x, y) pair as a single flat index: x * num_bins + y.
    reindex = np.dot(np.stack((x, y)).transpose(),
                     np.array([num_bins, 1]))
    idx, count = np.unique(reindex, return_counts=True)
    for i, c in zip(idx, count):
        ctable[i] = c
    return ctable.reshape((num_bins, num_bins))
That's (somehow) a lot faster, but it's still pretty expensive for something that doesn't seem like it should be a bottleneck. Are there any efficient ways to do this that I'm just not seeing, or should I just give up and do this in Cython?
Also, here's a benchmarking function.
import time

def timetable(func):
    size = 5000
    bins = 10
    repeat = 1000
    start = time.time()
    for i in range(repeat):
        x = np.random.randint(0, bins, size=size)
        y = np.random.randint(0, bins, size=size)
        func(x, y, bins)
    end = time.time()
    # end - start is measured in seconds.
    print("Func {na}: {ti} s".format(na=func.__name__, ti=(end - start)))
Solution
The clever trick of encoding each (x, y) pair from np.stack((x, y)) as a single integer can be made faster:
In [92]: %timeit np.dot(np.stack((x, y)).transpose(), np.array([bins, 1]))
109 µs ± 6.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [94]: %timeit bins*x + y
12.1 µs ± 260 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
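As a quick sanity check (illustrative only; bins and the random data are placeholders mirroring the benchmark setup), the two encodings produce identical flat indices:

bins = 10
x = np.random.randint(0, bins, size=5000)
y = np.random.randint(0, bins, size=5000)
assert np.array_equal(
    np.dot(np.stack((x, y)).transpose(), np.array([bins, 1])),
    bins * x + y,
)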
Moreover, the last part of your second solution can be simplified somewhat, by simply considering
np.unique(bins * x + y, return_counts=True)[1].reshape((bins, bins))
What is more, since we are dealing with small non-negative integers, np.bincount will outperform np.unique; with that, the above boils down to
np.bincount(bins * x + y).reshape((bins, bins))
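One caveat worth flagging: both the np.unique and the plain np.bincount one-liners assume that every one of the bins * bins index combinations actually occurs in the data; if some combination never appears, the reshape fails (and for the unique version the counts end up misaligned). A minimal sketch of a guarded variant, using bincount's minlength argument:

def make_table_bincount(x, y, num_bins):
    # minlength guarantees a full num_bins**2 count vector even when
    # some (x, y) combinations never occur in the data.
    flat = num_bins * x + y
    return np.bincount(flat, minlength=num_bins ** 2).reshape((num_bins, num_bins))

With size = 5000 and bins = 10, every combination is essentially always present, so this does not change the benchmark numbers below.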
All in all, this provides quite a performance improvement over what you are currently doing:
In [78]: %timeit make_table(x, y, bins) # Your first solution
3.86 ms ± 159 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [79]: %timeit make_table2(x, y, bins) # Your second solution
443 µs ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [101]: %timeit np.unique(bins * x + y, return_counts=True)[1].reshape((bins, bins))
307 µs ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [118]: %timeit np.bincount(bins * x + y).reshape((10, 10))
30.3 µs ± 3.44 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
You may also want to be aware of np.histogramdd, which takes care of both the binning and the counting in one step, although it's likely going to be slower than binning first and then using np.bincount.
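For the two-dimensional case here, np.histogram2d (which wraps np.histogramdd) could be applied directly to the original float rows; a minimal sketch, assuming 10 equal-width bins and made-up input data:

x_float = np.random.random(5000)
y_float = np.random.random(5000)
# histogram2d bins the continuous values and builds the 10 x 10
# contingency table in a single call.
table, x_edges, y_edges = np.histogram2d(x_float, y_float, bins=10)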