Edit: maybe a simple numeric mapping would be faster, and it has no collisions:

from numpy import array

features = array(['oklahoma', 'florida', 'idaho', 'pennsylvania', 'alabama', 'washington'], dtype=object)
# number the strings by their position in the array
numbers = range(0, len(features))
num2string = dict(zip(numbers, features))
string2num = dict(zip(features, numbers))
# read the result
for i in num2string:
    print("%i => '%s'" % (i, num2string[i]))
print("usage test:")
print(string2num['oklahoma'])
print(num2string[string2num['oklahoma']])
This gives every item in the array a simple sequential number, starting at 0 for 'oklahoma' and ending at 5 for 'washington'.
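Pros: simple and fast.
Cons: if a string moves to a different position in the array, it gets a different number, which would not happen with a hash of the string.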
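
To make that drawback concrete, here is a minimal sketch that is not part of the original answer. It assumes a hypothetical helper named `build_mapping` and shows one possible mitigation: numbering the strings in sorted order, so the result no longer depends on their order in the input. This only stays stable as long as the set of strings itself does not change.

```python
def build_mapping(strings, sort_first=False):
    """Map each string to an integer (hypothetical helper, for illustration)."""
    # With sort_first, numbers follow alphabetical order instead of input order.
    items = sorted(set(strings)) if sort_first else list(strings)
    return {s: i for i, s in enumerate(items)}

original = ['oklahoma', 'florida', 'idaho']
reordered = ['idaho', 'oklahoma', 'florida']

# Position-based numbering: the same string ends up with a different number.
by_position_a = build_mapping(original)
by_position_b = build_mapping(reordered)
print(by_position_a['oklahoma'], by_position_b['oklahoma'])

# Sorted numbering: both input orders produce the same mapping.
by_sorted_a = build_mapping(original, sort_first=True)
by_sorted_b = build_mapping(reordered, sort_first=True)
print(by_sorted_a == by_sorted_b)
```

Adding or removing a string can still shift the numbers of the strings that sort after it, so this is not a full substitute for hashing.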
Using a hash
You can hash the strings with a carefully chosen hash algorithm. You have to watch out for collisions in the hash function: if two different strings have the same hash value, the same number will appear twice in your input. In this example, the md5 hash function is used:

import hashlib
from numpy import array

def string_to_num(s):
    # md5 works on bytes, so encode the string before hashing
    return int(hashlib.md5(s.encode('utf-8')).hexdigest(), 16)

features = array(['oklahoma', 'florida', 'idaho', 'pennsylvania', 'alabama', 'washington'], dtype=object)
# hash those strings
features_string_for_number = {}
for i in features:
    hash_number = string_to_num(i)
    features_string_for_number[hash_number] = i
# read the result
for i in features_string_for_number:
    print("%i => '%s'" % (i, features_string_for_number[i]))
print("usage test:")
print(string_to_num('oklahoma'))
print(features_string_for_number[string_to_num('oklahoma')])
The hashing part was taken from here.
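
The warning about collisions can be turned into an explicit check. The following is a minimal sketch, not part of the original answer: `build_hash_table` is a hypothetical helper that builds the number-to-string table and raises an error if two different strings produce the same number, instead of silently overwriting the earlier entry.

```python
import hashlib

def string_to_num(s):
    """Return the md5 digest of s as an integer."""
    return int(hashlib.md5(s.encode('utf-8')).hexdigest(), 16)

def build_hash_table(strings, hash_func=string_to_num):
    """Map hash number -> string; raise if two different strings collide.

    Hypothetical helper for illustration only.
    """
    table = {}
    for s in strings:
        h = hash_func(s)
        if h in table and table[h] != s:
            raise ValueError("collision: %r and %r both hash to %d" % (table[h], s, h))
        table[h] = s
    return table

features = ['oklahoma', 'florida', 'idaho', 'pennsylvania', 'alabama', 'washington']
table = build_hash_table(features)
print(table[string_to_num('oklahoma')])
```

The `hash_func` parameter is there only so that a different hash function can be swapped in and checked the same way.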