Fast way to get a random index from an array of weights in Python

I regularly find myself needing a random index into an array or a list where the probabilities of the indices are not uniformly distributed but follow certain positive weights. What's a fast way to obtain one? I know I can pass the weights to numpy.random.choice as the optional argument p, but the function seems quite slow, and building an arange to pass to it is not ideal either. The sum of the weights can be an arbitrary positive number and is not guaranteed to be 1, which rules out the naive approach of generating a random number in (0,1] and then subtracting weight entries until the result is 0 or less, at least without first scaling by the total.
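For reference, that linear scan can be rescued for unnormalized weights by scaling the draw by the total weight; a minimal sketch (the function name is mine), which still costs O(n) per draw and therefore only serves as a baseline:

import random

def weighted_index_linear(weights):
    # Scale a uniform draw by the total weight, then walk the weights
    # until the running remainder drops below zero.
    r = random.random() * sum(weights)
    for i, w in enumerate(weights):
        r -= w
        if r < 0:
            return i
    return len(weights) - 1  # guard against floating-point round-off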

While there are answers on how to implement similar things in a simple manner (mostly about obtaining the corresponding element rather than the array index), such as "Weighted choice short and simple", I'm looking for a fast solution, because the appropriate function is executed very often. My weights change frequently, so the overhead of building something like an alias mask (a detailed introduction can be found at http://www.keithschwarz.com/darts-dice-coins/; a sketch of the construction follows below) should be considered part of the calculation time.
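To make that trade-off concrete, here is a hedged sketch of the alias-table setup described in the linked article (Vose's method; the function names are illustrative, not from any particular library):

import random

def build_alias_table(weights):
    # Scale the weights so they average to 1; entries below 1 donate
    # probability mass to an alias, entries at or above 1 absorb it.
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]
    prob, alias = [0.0] * n, [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:
        prob[i] = 1.0  # leftovers are exactly 1 up to round-off
    return prob, alias

def alias_draw(prob, alias):
    # O(1) per draw: pick a bucket uniformly, then flip a biased coin.
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

Each draw is O(1), but the table must be rebuilt in O(n) whenever the weights change, which is exactly the overhead that frequent weight updates make prohibitive.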

Solution

Cumulative summing and bisect

In the generic case, if speed is a concern, it seems advisable to calculate the cumulative sum of the weights and use bisect from the bisect module to find a random point in the resulting sorted array:

import bisect
import numpy

def weighted_choice(weights):
    cs = numpy.cumsum(weights)
    # Scale a uniform draw by the total weight and binary-search for it.
    return bisect.bisect(cs, numpy.random.random() * cs[-1])

A more detailed analysis is given below.

Note: If the array is not flat, numpy.unravel_index can be used to transform a flat index into a shaped index, as seen in https://stackoverflow.com/a/19760118/1274613
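For instance, with a hypothetical two-dimensional weight array, the flat index returned by weighted_choice can be mapped back to a coordinate pair:

import numpy

weights = numpy.random.random((5, 8))    # hypothetical 2-D weight array
flat = weighted_choice(weights.ravel())  # index into the flattened array
row, col = numpy.unravel_index(flat, weights.shape)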

Experimental Analysis

There are four more or less obvious solutions using numpy built-in functions. Comparing all of them using timeit gives the following result:

import timeit

weighted_choice_functions = [
"""import numpy
wc = lambda weights: numpy.random.choice(
    range(len(weights)),
    p=weights/weights.sum())
""",
"""import numpy
# Adapted from https://stackoverflow.com/a/19760118/1274613
def wc(weights):
    cs = numpy.cumsum(weights)
    return cs.searchsorted(numpy.random.random() * cs[-1], 'right')
""",
"""import numpy, bisect
# Using bisect mentioned in https://stackoverflow.com/a/13052108/1274613
def wc(weights):
    cs = numpy.cumsum(weights)
    return bisect.bisect(cs, numpy.random.random() * cs[-1])
""",
"""import numpy
wc = lambda weights: numpy.random.multinomial(
    1,
    weights/weights.sum()).argmax()
"""]

for setup in weighted_choice_functions:
    for ps in ["numpy.ones(40)",
               "numpy.arange(10)",
               "numpy.arange(200)",
               "numpy.arange(199,-1,-1)",
               "numpy.arange(4000)"]:
        print(timeit.timeit("wc(%s)" % ps, setup=setup))
    print()

The resulting timings (seconds per one million calls; rows by weight array, columns by function) are:

weights                    choice    searchsorted    bisect    multinomial
numpy.ones(40)             178.46          15.20      14.24          26.62
numpy.arange(10)           161.72          19.99      17.34          20.50
numpy.arange(200)          223.53          20.80      19.43          31.27
numpy.arange(199,-1,-1)    224.81          20.92      19.52          27.20
numpy.arange(4000)        1901.63          41.65      35.61         243.10

This means that numpy.random.choice is surprisingly slow, and even the dedicated numpy searchsorted method is slower than the type-naive bisect variant. (These results were obtained using Python 3.3.5 with numpy 1.8.1, so things may be different for other versions.) The function based on numpy.random.multinomial is less efficient for large weight arrays than the methods based on cumulative summing. Presumably the fact that argmax has to iterate over the whole array and run comparisons at each step plays a significant role, as the four-second difference between an increasing and a decreasing weight list also suggests.
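As a quick empirical sanity check, the bisect-based variant can be verified against the normalized weights; sampled frequencies should approach them (the weights here are arbitrary):

import collections
import bisect
import numpy

def wc(weights):
    cs = numpy.cumsum(weights)
    return bisect.bisect(cs, numpy.random.random() * cs[-1])

counts = collections.Counter(wc(numpy.array([1.0, 2.0, 7.0])) for _ in range(100000))
print({i: counts[i] / 100000 for i in range(3)})
# expected: roughly {0: 0.1, 1: 0.2, 2: 0.7}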
