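A while ago, one of my projects involved large-scale data processing and relied on hash functions. As part of that work I tested and summarized the speed and collision rates of several hash functions, and this post recommends a few string hash functions that did well.
The purpose of hashing here is to compress a long string into a 32-bit, 64-bit, or 128-bit hash code in order to save storage space. The hash function plays the key role in this process.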
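To make the storage idea concrete, here is a minimal sketch of replacing each string with a fixed-size 32-bit code. It reuses FNV_1a_Hash from the listing at the end of this post; the helper name `encode_keys` and the use of `uint32_t` are my own assumptions for illustration, not part of the original test code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a, same logic as FNV_1a_Hash in the listing at the end of the post. */
static unsigned int FNV_1a_Hash(const char *str)
{
    unsigned int hash = 2166136261u;  /* offset_basis */
    unsigned int prime = 16777619u;   /* FNV_PRIME_32 */
    while (*str != '\0') {
        hash ^= *str++;
        hash *= prime;
    }
    return hash & 0x7FFFFFFF;
}

/* Hypothetical helper: writes one 32-bit code per key into codes[],
   which must have room for n entries. */
static void encode_keys(const char *const *keys, size_t n, uint32_t *codes)
{
    assert(keys != NULL && codes != NULL);
    for (size_t i = 0; i < n; i++)
        codes[i] = (uint32_t)FNV_1a_Hash(keys[i]);
}
```

Each key then costs 4 bytes regardless of its length. Two different strings can still map to the same code, which is why the collision rates measured below matter.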
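This experiment compares a number of common hash functions on fairly large data sets of more than 10 million samples each.
In the table, every row except the last gives the collision rate of the hash function in that column; the last row gives the running time.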
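SET 1: strings of upper-case letters, lower-case letters, and digits; lengths uniformly distributed from 3 to 12; 15 million samples.
SET 2: strings of lower-case letters only; lengths uniformly distributed from 3 to 12; 15 million samples.
SET 3: strings of common characters from the ASCII range 32-127; lengths uniformly distributed from 10 to 30; 11 million samples.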
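The post does not show the test harness, so the following is only a minimal sketch of one way to measure a collision rate. It assumes the keys are distinct and defines the rate as the number of keys whose hash value repeats an earlier one, divided by the number of keys. The helper names `collision_rate` and `cmp_uint` are hypothetical; BKDRHash is copied from the listing at the end of this post.

```c
#include <assert.h>
#include <stdlib.h>

/* BKDR hash, same logic as BKDRHash in the listing at the end of the post. */
static unsigned int BKDRHash(const char *str)
{
    unsigned int seed = 131;
    unsigned int hash = 0;
    while (*str)
        hash = hash * seed + (*str++);
    return hash & 0x7FFFFFFF;
}

/* qsort comparator for unsigned int values. */
static int cmp_uint(const void *a, const void *b)
{
    unsigned int x = *(const unsigned int *)a;
    unsigned int y = *(const unsigned int *)b;
    return (x > y) - (x < y);
}

/* Hypothetical helper: hashes n keys, sorts the hash values, and counts
   adjacent duplicates. Returns collisions / n, or -1.0 if allocation fails. */
static double collision_rate(const char *const *keys, size_t n,
                             unsigned int (*hash_fn)(const char *))
{
    assert(keys != NULL && n > 0);
    unsigned int *h = malloc(n * sizeof *h);
    if (h == NULL)
        return -1.0;
    for (size_t i = 0; i < n; i++)
        h[i] = hash_fn(keys[i]);
    qsort(h, n, sizeof *h, cmp_uint);
    size_t collisions = 0;
    for (size_t i = 1; i < n; i++)
        if (h[i] == h[i - 1])
            collisions++;
    free(h);
    return (double)collisions / (double)n;
}
```

Sorting keeps memory use at one integer per key, which is practical at the 10 to 15 million scale used here. If the generated keys may contain duplicates, they would need to be removed first, otherwise identical strings are counted as collisions.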
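Finally, timing was measured in a Release build; this is the last row of the table above, recorded as the average time per hash call. Across the functions the timing differences are small, so when choosing a hash function, a low collision rate should be the main criterion.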
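The post reports the average time per hash call but not how it was timed. As a rough sketch of one way to take such a measurement, the code below uses the standard C `clock()` function and hashes the same keys for several rounds. The helper name `avg_hash_seconds` and the `volatile` sink variable are my own assumptions; SDBM_Hash is copied from the listing at the end of this post.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* SDBM hash, same logic as SDBM_Hash in the listing at the end of the post. */
static unsigned int SDBM_Hash(const char *str)
{
    unsigned int hash = 0;
    while (*str)
        hash = (*str++) + (hash << 6) + (hash << 16) - hash;
    return hash & 0x7FFFFFFF;
}

/* Hypothetical helper: average CPU seconds per call of hash_fn,
   measured over n keys hashed `rounds` times. */
static double avg_hash_seconds(const char *const *keys, size_t n, size_t rounds,
                               unsigned int (*hash_fn)(const char *))
{
    assert(keys != NULL && n > 0 && rounds > 0);
    volatile unsigned int sink = 0;  /* keeps the optimizer from dropping the calls */
    clock_t start = clock();
    for (size_t r = 0; r < rounds; r++)
        for (size_t i = 0; i < n; i++)
            sink ^= hash_fn(keys[i]);
    clock_t end = clock();
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC / (double)(n * rounds);
}
```

The resolution of `clock()` is coarse, so the total number of calls has to be large (millions) before the average is meaningful.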
Test system configuration:
CPU: AMD 945 X4 MEMORY: 4G SYSTEM: WINDOWS VISTA ULTIMATE (32 bit)
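Recommendation: the algorithms marked in red in the table gave the best results. Based on these statistics and on published commentary, BKDR, SDBM, and FNV_1 perform well for large-scale string hashing and are recommended. In addition, if the data set holds millions of items or more, a 64-bit hash function is advisable, since it effectively keeps the collision rate from becoming too high. (In my tests, the collision rate of a 64-bit hash on 10 million items was below 10e-6.)
The full code is below; the comments are brief summaries of each algorithm, and interested readers can study them in detail. This is my first blog post, so please point out anything that falls short.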
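The listing at the end of this post contains only 32-bit functions, so here is a hedged sketch of what 64-bit variants could look like. The FNV-1a version uses the published 64-bit FNV parameters (offset basis 14695981039346656037, prime 1099511628211); the BKDR version simply widens the accumulator. The names `FNV_1a_Hash64` and `BKDRHash64`, the use of `uint64_t`, and the cast of each byte to `unsigned char` are my own assumptions and may differ from the 64-bit functions actually tested.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a: XOR each byte into the hash, then multiply by the prime. */
static uint64_t FNV_1a_Hash64(const char *str)
{
    assert(str != NULL);
    uint64_t hash = UINT64_C(0xcbf29ce484222325);     /* 64-bit offset basis */
    const uint64_t prime = UINT64_C(0x100000001b3);   /* 64-bit FNV prime */
    while (*str != '\0') {
        hash ^= (unsigned char)*str++;
        hash *= prime;
    }
    return hash;
}

/* BKDR with a 64-bit accumulator; the seed 131 is the same as in the 32-bit listing. */
static uint64_t BKDRHash64(const char *str)
{
    assert(str != NULL);
    const uint64_t seed = 131;
    uint64_t hash = 0;
    while (*str != '\0')
        hash = hash * seed + (unsigned char)*str++;
    return hash;
}
```

Unlike the 32-bit listing, these sketches return all 64 bits without masking the top bit, so the full code space is available.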
Code

/*
 A simple hash function from Robert Sedgewick's "Algorithms in C" book.
*/
unsigned int RSHash(const char *str)
{
    unsigned int b = 378551;
    unsigned int a = 63689;
    unsigned int hash = 0;

    while (*str)
    {
        hash = hash * a + (*str++);
        a *= b;  /* the multiplier changes for every character */
    }

    return (hash & 0x7FFFFFFF);
}

/*
 This hash algorithm is based on work by Peter J. Weinberger of AT&T Bell Labs.
 The book Compilers (Principles, Techniques and Tools) by Aho, Sethi and Ullman
 recommends the use of hash functions that employ the hashing methodology found
 in this particular algorithm.
*/
unsigned int PJWHash(const char *str)
{
    unsigned int BitsInUnsignedInt = (unsigned int)(sizeof(unsigned int) * 8);
    unsigned int ThreeQuarters = (unsigned int)((BitsInUnsignedInt * 3) / 4);
    unsigned int OneEighth = (unsigned int)(BitsInUnsignedInt / 8);
    unsigned int HighBits = (unsigned int)(0xFFFFFFFF) << (BitsInUnsignedInt - OneEighth);
    unsigned int hash = 0;
    unsigned int test = 0;

    while (*str)
    {
        hash = (hash << OneEighth) + (*str++);
        if ((test = hash & HighBits) != 0)
        {
            /* fold the high bits back into the low part, then clear them */
            hash = ((hash ^ (test >> ThreeQuarters)) & (~HighBits));
        }
    }

    return (hash & 0x7FFFFFFF);
}

/*
 A bitwise hash function written by Justin Sobel.
*/
unsigned int JSHash(const char *str)
{
    unsigned int hash = 1315423911;

    while (*str)
    {
        hash ^= ((hash << 5) + (*str++) + (hash >> 2));
    }

    return (hash & 0x7FFFFFFF);
}

unsigned int BKDRHash(const char *str)
{
    unsigned int seed = 131; // 31 131 1313 13131 131313 etc..
    unsigned int hash = 0;

    while (*str)
    {
        hash = hash * seed + (*str++);
    }

    return (hash & 0x7FFFFFFF);
}

/*
 Famous hash algorithm on Unix systems, also used by Microsoft in their hash_map implementation for VC++ 2005.
 Details can be found at: http://www.isthe.com/chongo/tech/comp/fnv/#FNV-param
*/
unsigned int FNV_1_Hash(const char *str)
{
    unsigned int hash = 2166136261u; // offset_basis
    unsigned int prime = 16777619;   // FNV_PRIME_32

    while (*str != '\0')
    {
        // FNV-1: multiply first, then XOR in the next byte
        hash *= prime;
        hash ^= *str++;
    }

    return (hash & 0x7FFFFFFF);
}

/*
 Famous hash algorithm on Unix systems, also used by Microsoft in their hash_map implementation for VC++ 2005.
 Details can be found at: http://www.isthe.com/chongo/tech/comp/fnv/#FNV-param
*/
unsigned int FNV_1a_Hash(const char *str)
{
    unsigned int hash = 2166136261u; // offset_basis
    unsigned int prime = 16777619;   // FNV_PRIME_32

    while (*str != '\0')
    {
        // FNV-1a: XOR in the next byte first, then multiply
        hash ^= *str++;
        hash *= prime;
    }

    return (hash & 0x7FFFFFFF);
}

/*
 An algorithm produced by Professor Daniel J. Bernstein and shown first to the world on the
 usenet newsgroup comp.lang.c. It is one of the most efficient hash functions ever published.
*/
unsigned int DJBHash(const char *str)
{
    unsigned int hash = 5381;

    while (*str)
    {
        // equivalent to: hash = hash * 33 + (*str++);
        hash += (hash << 5) + (*str++);
    }

    return (hash & 0x7FFFFFFF);
}

unsigned int DJB_2_Hash(const char *s)
{
    unsigned int hashvalue = 5381;

    while (*s != '\0')
    {
        // multiplication binds tighter than XOR; the parentheses only make that explicit
        hashvalue = (hashvalue * 33) ^ (*s);
        s++;
    }

    return (hashvalue & 0x7FFFFFFF);
}

/*
 This is the algorithm of choice which is used in the open source SDBM project.
 The hash function seems to have a good overall distribution for many different data
 sets. It seems to work well in situations where there is a high variance in the MSBs of the
 elements in a data set.
*/
unsigned int SDBM_Hash(const char *str)
{
    unsigned int hash = 0;

    while (*str)
    {
        // equivalent to: hash = 65599*hash + (*str++);
        hash = (*str++) + (hash << 6) + (hash << 16) - hash;
    }

    return (hash & 0x7FFFFFFF);
}

/*
 An algorithm produced by Arash Partow.
*/
unsigned int APHash(const char *str)
{
    unsigned int hash = 0;

    for (int i = 0; *str; i++)
    {
        if ((i & 1) == 0)
        {
            // characters at even positions
            hash ^= ((hash << 7) ^ (*str++) ^ (hash >> 3));
        }
        else
        {
            // characters at odd positions
            hash ^= (~((hash << 11) ^ (*str++) ^ (hash >> 5)));
        }
    }

    return (hash & 0x7FFFFFFF);
}




